Message-ID: <05a2682a-4e66-41eb-a978-ff94a7ab282a@kernel.dk>
Date: Wed, 7 May 2025 11:51:16 -0600
X-Mailing-List: io-uring@vger.kernel.org
Subject: Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
To: Pavel Begunkov , io-uring
References: <611393de-4c50-4597-9290-82059e412a4b@kernel.dk> <654d7a07-5b5f-4f78-bef5-dda6a919c3e1@kernel.dk> <364679fa-8fc3-4bcb-8296-0877f39d6f2c@gmail.com>
From: Jens Axboe
In-Reply-To: <364679fa-8fc3-4bcb-8296-0877f39d6f2c@gmail.com>

On 5/7/25 10:18 AM, Pavel Begunkov wrote:
> On 5/7/25 16:49, Jens Axboe wrote:
>> On 5/7/25 9:46 AM, Pavel Begunkov wrote:
>>> On 5/7/25 16:26, Jens Axboe wrote:
>>>> On 5/7/25 9:11 AM, Pavel Begunkov wrote:
>>>>> On 5/7/25 15:58, Jens Axboe wrote:
>>>>>> On 5/7/25 8:36 AM, Pavel Begunkov wrote:
>>>>>>> On 5/7/25 14:56, Jens Axboe wrote:
>>>>>>>> Multishot normally uses io_req_post_cqe() to post completions, but when
>>>>>>>> stopping it, it may finish up with a deferred completion. This is fine,
>>>>>>>> except if another multishot event triggers before the deferred completions
>>>>>>>> get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
>>>>>>>> as new multishot completions get posted before the deferred ones are
>>>>>>>> flushed. This can cause confusion on the application side, if strict
>>>>>>>> ordering is required for the use case.
>>>>>>>>
>>>>>>>> When multishot posting via io_req_post_cqe(), flush any pending deferred
>>>>>>>> completions first, if any.
>>>>>>>>
>>>>>>>> Cc: stable@vger.kernel.org # 6.1+
>>>>>>>> Reported-by: Norman Maurer
>>>>>>>> Reported-by: Christian Mazakas
>>>>>>>> Signed-off-by: Jens Axboe
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>>>>>> index 769814d71153..541e65a1eebf 100644
>>>>>>>> --- a/io_uring/io_uring.c
>>>>>>>> +++ b/io_uring/io_uring.c
>>>>>>>> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>>>>>>>>  	struct io_ring_ctx *ctx = req->ctx;
>>>>>>>>  	bool posted;
>>>>>>>> +	/*
>>>>>>>> +	 * If multishot has already posted deferred completions, ensure that
>>>>>>>> +	 * those are flushed first before posting this one. If not, CQEs
>>>>>>>> +	 * could get reordered.
>>>>>>>> +	 */
>>>>>>>> +	if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>>>>>>>> +		__io_submit_flush_completions(ctx);
>>>>>>>
>>>>>>> A request is already dead if it's in compl_reqs, there should be no
>>>>>>> way io_req_post_cqe() is called with it. Is it reordering of CQEs
>>>>>>> belonging to the same request? And what do you mean by "deferred"
>>>>>>> completions?
>>>>>>
>>>>>> It's not the same req, it's different requests using the same
>>>>>> provided buffer ring where it can be problematic.
>>>>>
>>>>> Ok, and there has never been any ordering guarantees between them.
>>>>> Is there any report describing the problem? Why it's ok if
>>>>> io_req_post_cqe() produced CQEs of two multishot requests get
>>>>> reordered, but it's not when one of them is finishing a request?
>>>>> What's the problem with provided buffers?
>>>>
>>>> There better be ordering between posting of the CQEs - I'm not talking
>>>> about issue ordering in general, but if you have R1 (request 1) and R2
>>>> each consuming from the same buffer ring, then completion posting of
>>>> those absolutely need to be ordered. If not, you can have:
>>>>
>>>> R1 peek X buffers, BID Y, BGID Z. Doesn't use io_req_post_cqe() because
>>>> it's stopping, end up posting via io_req_task_complete ->
>>>> io_req_complete_defer -> add to deferred list.
>>>>
>>>> Other task_work in that run has R2, grabbing buffers from BGID Z,
>>>> doesn't terminate, posts CQE2 via io_req_post_cqe().
>>>>
>>>> tw run done, deferred completions flushed, R1 posts CQE1.
>>>>
>>>> Then we have CQE2 ahead of CQE1 in the CQ ring.
>>>
>>> Which is why provided buffers from the beginning were returning
>>> buffer ids (i.e. bid, id within the group). Is it incremental
>>> consumption or some newer feature not being able to differentiate
>>> buffer chunks apart from ordering?
>>
>> Both incrementally consumed and bundles are susceptible to this
>> reordering.
>>
>>>>> It's a pretty bad spot to do such kinds of things, it disables any
>>>>> mixed mshot with non-mshot batching, and nesting flushing under
>>>>> an opcode handler is pretty nasty, it should rather stay top
>>>>> level as it is. From what I hear it's a provided buffer specific
>>>>> problem and should be fixed there.
>>>>
>>>> I agree it's not the prettiest, but I would prefer if we do it this way
>>>> just in terms of stable backporting. This should be a relatively rare
>>>
>>> Sure. Are you looking into fixing it in a different way for
>>> upstream? Because it's not the right level of abstraction.
>>
>> Yeah I think so - basically the common use case for stopping multishot
>> doesn't really work for this, as there are two methods of posting the
>> CQE. One is using io_req_post_cqe(), the other is a final deferred
>> completion. The mix and match of those two is what's causing this,
>> obviously.
>
> That's the reason why all mshot stuff I've been adding is solely
> using io_req_post_cqe() for IO completions, and the final CQE only
> carries error / 0. Maybe it's not too late to switch read/recv to
> that model.
>
> Or use io_req_post_cqe() for the final completion and silence
> the CQE coming from flush_completions() with REQ_F_SKIP_CQE?

Yeah we can do something like that. Something like the below should
work. Might be cleaner to simply add a new helper for this in io_uring.c
and leave the overflow handling private and in there.
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 541e65a1eebf..796ecd4bea5c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -700,8 +700,8 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task)
 	}
 }
 
-static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
-				     s32 res, u32 cflags, u64 extra1, u64 extra2)
+bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, s32 res,
+			      u32 cflags, u64 extra1, u64 extra2)
 {
 	struct io_overflow_cqe *ocqe;
 	size_t ocq_size = sizeof(struct io_overflow_cqe);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index e4050b2d0821..dd0a89674095 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -82,6 +82,8 @@ void io_req_defer_failed(struct io_kiocb *req, s32 res);
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
 void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
 bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags);
+bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, s32 res,
+			      u32 cflags, u64 extra1, u64 extra2);
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
 
 struct file *io_file_get_normal(struct io_kiocb *req, int fd);
diff --git a/io_uring/poll.c b/io_uring/poll.c
index 8eb744eb9f4c..c002508015a4 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -312,6 +312,24 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 	return IOU_POLL_NO_ACTION;
 }
 
+static void io_poll_req_complete(struct io_kiocb *req, io_tw_token_t tw)
+{
+	bool filled;
+
+	filled = io_req_post_cqe(req, req->cqe.res, req->cqe.flags);
+	if (unlikely(!filled)) {
+		struct io_ring_ctx *ctx = req->ctx;
+
+		spin_lock(&ctx->completion_lock);
+		filled = io_cqring_event_overflow(ctx, req->cqe.user_data,
+				req->cqe.res, req->cqe.flags, 0, 0);
+		spin_unlock(&ctx->completion_lock);
+	}
+	if (filled)
+		req->flags |= REQ_F_CQE_SKIP;
+	io_req_task_complete(req, tw);
+}
+
 void io_poll_task_func(struct io_kiocb *req, io_tw_token_t tw)
 {
 	int ret;
@@ -349,7 +367,7 @@ void io_poll_task_func(struct io_kiocb *req, io_tw_token_t tw)
 	io_tw_lock(req->ctx, tw);
 
 	if (ret == IOU_POLL_REMOVE_POLL_USE_RES)
-		io_req_task_complete(req, tw);
+		io_poll_req_complete(req, tw);
 	else if (ret == IOU_POLL_DONE || ret == IOU_POLL_REISSUE)
 		io_req_task_submit(req, tw);
 	else

-- 
Jens Axboe
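
[Editorial aside, not part of the thread above: the reordering Jens describes is visible to an application because data from a provided buffer ring is reassembled strictly in CQE order. The sketch below is a minimal illustration of that consumption pattern, not the fix; it assumes liburing 2.4+ (for io_uring_setup_buf_ring()) and a kernel with multishot receive support, and the buffer-group id, sizes, and payloads are arbitrary.]

#include <liburing.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

#define NBUFS	4	/* must be a power of two */
#define BUFSZ	64
#define BGID	7	/* arbitrary buffer group id */

static void reap_one(struct io_uring *ring, char bufs[][BUFSZ])
{
	struct io_uring_cqe *cqe;
	unsigned bid;

	if (io_uring_wait_cqe(ring, &cqe) < 0)
		return;
	if (cqe->res > 0 && (cqe->flags & IORING_CQE_F_BUFFER)) {
		bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
		/* data is consumed strictly in CQE order, not buffer id order */
		printf("bid %u: %.*s\n", bid, cqe->res, bufs[bid]);
	}
	io_uring_cqe_seen(ring, cqe);
}

int main(void)
{
	static char bufs[NBUFS][BUFSZ];
	struct io_uring_buf_ring *br;
	struct io_uring_sqe *sqe;
	struct io_uring ring;
	int fds[2], ret, i;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) ||
	    io_uring_queue_init(8, &ring, 0))
		return 1;

	/* register a provided buffer ring and hand all buffers to the kernel */
	br = io_uring_setup_buf_ring(&ring, NBUFS, BGID, 0, &ret);
	if (!br)
		return 1;
	for (i = 0; i < NBUFS; i++)
		io_uring_buf_ring_add(br, bufs[i], BUFSZ, i,
				      io_uring_buf_ring_mask(NBUFS), i);
	io_uring_buf_ring_advance(br, NBUFS);

	/* one multishot recv picking buffers from group BGID */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_recv_multishot(sqe, fds[0], NULL, 0, 0);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = BGID;
	io_uring_submit(&ring);

	/*
	 * Two sends -> two multishot completions. The stream only
	 * reassembles as "hello world" because the CQEs are reaped in
	 * the order they were posted to the CQ ring.
	 */
	if (write(fds[1], "hello ", 6) < 0)
		return 1;
	reap_one(&ring, bufs);
	if (write(fds[1], "world", 5) < 0)
		return 1;
	reap_one(&ring, bufs);

	io_uring_queue_exit(&ring);
	return 0;
}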