From: Jens Axboe <axboe@kernel.dk>
To: Pavel Begunkov <asml.silence@gmail.com>,
	io-uring <io-uring@vger.kernel.org>
Subject: Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
Date: Wed, 7 May 2025 11:51:16 -0600
Message-ID: <05a2682a-4e66-41eb-a978-ff94a7ab282a@kernel.dk>
In-Reply-To: <364679fa-8fc3-4bcb-8296-0877f39d6f2c@gmail.com>

On 5/7/25 10:18 AM, Pavel Begunkov wrote:
> On 5/7/25 16:49, Jens Axboe wrote:
>> On 5/7/25 9:46 AM, Pavel Begunkov wrote:
>>> On 5/7/25 16:26, Jens Axboe wrote:
>>>> On 5/7/25 9:11 AM, Pavel Begunkov wrote:
>>>>> On 5/7/25 15:58, Jens Axboe wrote:
>>>>>> On 5/7/25 8:36 AM, Pavel Begunkov wrote:
>>>>>>> On 5/7/25 14:56, Jens Axboe wrote:
>>>>>>>> Multishot normally uses io_req_post_cqe() to post completions, but when
>>>>>>>> stopping it, it may finish up with a deferred completion. This is fine,
>>>>>>>> except if another multishot event triggers before the deferred completions
>>>>>>>> get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
>>>>>>>> as new multishot completions get posted before the deferred ones are
>>>>>>>> flushed. This can cause confusion on the application side, if strict
>>>>>>>> ordering is required for the use case.
>>>>>>>>
>>>>>>>> When posting a multishot completion via io_req_post_cqe(), first flush
>>>>>>>> any pending deferred completions.
>>>>>>>>
>>>>>>>> Cc: stable@vger.kernel.org # 6.1+
>>>>>>>> Reported-by: Norman Maurer <norman_maurer@apple.com>
>>>>>>>> Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>>>>>> index 769814d71153..541e65a1eebf 100644
>>>>>>>> --- a/io_uring/io_uring.c
>>>>>>>> +++ b/io_uring/io_uring.c
>>>>>>>> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>>>>>>>>  	struct io_ring_ctx *ctx = req->ctx;
>>>>>>>>  	bool posted;
>>>>>>>>  
>>>>>>>> +	/*
>>>>>>>> +	 * If multishot has already posted deferred completions, ensure that
>>>>>>>> +	 * those are flushed first before posting this one. If not, CQEs
>>>>>>>> +	 * could get reordered.
>>>>>>>> +	 */
>>>>>>>> +	if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>>>>>>>> +		__io_submit_flush_completions(ctx);
>>>>>>>
>>>>>>> A request is already dead if it's in compl_reqs; there should be no
>>>>>>> way io_req_post_cqe() is called with it. Is it a reordering of CQEs
>>>>>>> belonging to the same request? And what do you mean by "deferred"
>>>>>>> completions?
>>>>>>
>>>>>> It's not the same req; it's different requests using the same
>>>>>> provided buffer ring where this can be problematic.
>>>>>
>>>>> Ok, and there have never been any ordering guarantees between them.
>>>>> Is there any report describing the problem? Why is it ok if CQEs
>>>>> produced by io_req_post_cqe() for two multishot requests get
>>>>> reordered, but not when one of them is finishing a request?
>>>>> What's the problem with provided buffers?
>>>>
>>>> There'd better be ordering between posting of the CQEs - I'm not talking
>>>> about issue ordering in general, but if you have R1 (request 1) and R2
>>>> each consuming from the same buffer ring, then the completion posting of
>>>> those absolutely needs to be ordered. If not, you can have:
>>>>
>>>> R1 peeks X buffers, BID Y, BGID Z. It doesn't use io_req_post_cqe()
>>>> because it's stopping, and ends up posting via io_req_task_complete ->
>>>> io_req_complete_defer -> added to the deferred list.
>>>>
>>>> Other task_work in that run has R2, which grabs buffers from BGID Z,
>>>> doesn't terminate, and posts CQE2 via io_req_post_cqe().
>>>>
>>>> The tw run completes, deferred completions get flushed, and R1 posts CQE1.
>>>>
>>>> Then we have CQE2 ahead of CQE1 in the CQ ring.
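
For illustration, here is roughly what that arrangement looks like from
userspace. This is only a sketch, not taken from the report; the names
(sock1/sock2, BGID, NBUFS, BUF_SZ) are made up, and it assumes a liburing
recent enough to have the buffer ring helpers. Error handling is elided.

#include <liburing.h>

#define BGID	0
#define NBUFS	8
#define BUF_SZ	4096

static char bufs[NBUFS][BUF_SZ];

/* Arm R1 and R2 as multishot receives sharing one provided buffer ring */
static int arm_recvs(struct io_uring *ring, int sock1, int sock2)
{
	struct io_uring_buf_ring *br;
	struct io_uring_sqe *sqe;
	int i, ret;

	/* One buffer ring (the "BGID Z" above), shared by both requests */
	br = io_uring_setup_buf_ring(ring, NBUFS, BGID, 0, &ret);
	if (!br)
		return ret;
	for (i = 0; i < NBUFS; i++)
		io_uring_buf_ring_add(br, bufs[i], BUF_SZ, i,
				      io_uring_buf_ring_mask(NBUFS), i);
	io_uring_buf_ring_advance(br, NBUFS);

	/* R1 */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_recv_multishot(sqe, sock1, NULL, 0, 0);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = BGID;
	sqe->user_data = 1;

	/* R2 */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_recv_multishot(sqe, sock2, NULL, 0, 0);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = BGID;
	sqe->user_data = 2;

	return io_uring_submit(ring);
}

If R1's final CQE sits on the deferred list while R2 posts via
io_req_post_cqe(), the application sees R2's CQE first even though R1
consumed its buffers earlier.
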
>>>
>>> Which is why provided buffers have, from the beginning, been returning
>>> buffer ids (i.e. bid, the id within the group). Is it incremental
>>> consumption or some newer feature that can't tell buffer chunks apart
>>> other than by ordering?
>>
>> Both incrementally consumed buffers and bundles are susceptible to this
>> reordering.
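
A consumption-side sketch of the incrementally consumed case (assumptions:
the buffer ring was registered with IOU_PBUF_RING_INC, the uapi exposes
IORING_CQE_F_BUF_MORE, and buf_offset[] is made-up bookkeeping): each CQE
only carries the bid and a length, so which bytes of the buffer belong to
which completion is implied purely by CQ ring order.

#include <liburing.h>
#include <stdio.h>

static size_t buf_offset[8];	/* running offset per bid, driven by CQ order */

static void consume_cqes(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	unsigned head, seen = 0;

	io_uring_for_each_cqe(ring, head, cqe) {
		if (cqe->res > 0 && (cqe->flags & IORING_CQE_F_BUFFER)) {
			unsigned bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

			/*
			 * The data for this CQE starts at buf_offset[bid];
			 * that is only correct if CQEs were posted in the
			 * same order the buffer was consumed in.
			 */
			printf("req=%llu bid=%u bytes=[%zu, %zu)\n",
			       (unsigned long long) cqe->user_data, bid,
			       buf_offset[bid], buf_offset[bid] + cqe->res);
			buf_offset[bid] += cqe->res;
			if (!(cqe->flags & IORING_CQE_F_BUF_MORE))
				buf_offset[bid] = 0;	/* buffer done, recycled */
		}
		seen++;
	}
	io_uring_cq_advance(ring, seen);
}

With CQE2 posted ahead of CQE1, the offsets above get attributed to the
wrong request, even though each CQE still carries a valid bid.
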
>>
>>>>> It's a pretty bad spot to do such kinds of things: it disables any
>>>>> mixed mshot with non-mshot batching, and nesting the flushing under
>>>>> an opcode handler is pretty nasty; it should rather stay top
>>>>> level as it is. From what I hear it's a provided buffer specific
>>>>> problem and should be fixed there.
>>>>
>>>> I agree it's not the prettiest, but I would prefer to do it this way
>>>> just in terms of stable backporting. This should be a relatively rare
>>>
>>> Sure. Are you looking into fixing it in a different way for
>>> upstream? Because it's not the right level of abstraction.
>>
>> Yeah I think so - basically the common use case for stopping multishot
>> doesn't really work for this, as there are two methods of posting the
>> CQE. One is using io_req_post_cqe(), the other is a final deferred
>> completion. The mix and match of those two is what's causing this,
>> obviously.
> 
> That's the reason why all mshot stuff I've been adding is solely
> using io_req_post_cqe() for IO completions, and the final CQE only
> carries error / 0. Maybe it's not too late to switch read/recv to
> that model.
> 
> Or use io_req_post_cqe() for the final completion and silence
> the CQE coming from flush_completions() with REQ_F_CQE_SKIP?

Yeah, we can do something like that - the below should work. It might be
cleaner to simply add a new helper for this in io_uring.c and keep the
overflow handling private in there.

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 541e65a1eebf..796ecd4bea5c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -700,8 +700,8 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task)
 	}
 }
 
-static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
-				     s32 res, u32 cflags, u64 extra1, u64 extra2)
+bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, s32 res,
+			      u32 cflags, u64 extra1, u64 extra2)
 {
 	struct io_overflow_cqe *ocqe;
 	size_t ocq_size = sizeof(struct io_overflow_cqe);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index e4050b2d0821..dd0a89674095 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -82,6 +82,8 @@ void io_req_defer_failed(struct io_kiocb *req, s32 res);
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
 void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
 bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags);
+bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, s32 res,
+			      u32 cflags, u64 extra1, u64 extra2);
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
 
 struct file *io_file_get_normal(struct io_kiocb *req, int fd);
diff --git a/io_uring/poll.c b/io_uring/poll.c
index 8eb744eb9f4c..c002508015a4 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -312,6 +312,24 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 	return IOU_POLL_NO_ACTION;
 }
 
+static void io_poll_req_complete(struct io_kiocb *req, io_tw_token_t tw)
+{
+	bool filled;
+
+	filled = io_req_post_cqe(req, req->cqe.res, req->cqe.flags);
+	if (unlikely(!filled)) {
+		struct io_ring_ctx *ctx = req->ctx;
+
+		spin_lock(&ctx->completion_lock);
+		filled = io_cqring_event_overflow(ctx, req->cqe.user_data,
+					req->cqe.res, req->cqe.flags, 0, 0);
+		spin_unlock(&ctx->completion_lock);
+	}
+	if (filled)
+		req->flags |= REQ_F_CQE_SKIP;
+	io_req_task_complete(req, tw);
+}
+
 void io_poll_task_func(struct io_kiocb *req, io_tw_token_t tw)
 {
 	int ret;
@@ -349,7 +367,7 @@ void io_poll_task_func(struct io_kiocb *req, io_tw_token_t tw)
 		io_tw_lock(req->ctx, tw);
 
 		if (ret == IOU_POLL_REMOVE_POLL_USE_RES)
-			io_req_task_complete(req, tw);
+			io_poll_req_complete(req, tw);
 		else if (ret == IOU_POLL_DONE || ret == IOU_POLL_REISSUE)
 			io_req_task_submit(req, tw);
 		else

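If we go the private-helper route, a rough sketch of what it might look
like (the name io_req_post_or_overflow() is made up here; this just lifts
the logic from io_poll_req_complete() above into io_uring.c so that
io_cqring_event_overflow() can stay static):

/* io_uring.c: post a CQE for req now, falling back to the overflow list */
bool io_req_post_or_overflow(struct io_kiocb *req)
{
	struct io_ring_ctx *ctx = req->ctx;
	bool filled;

	filled = io_req_post_cqe(req, req->cqe.res, req->cqe.flags);
	if (unlikely(!filled)) {
		spin_lock(&ctx->completion_lock);
		filled = io_cqring_event_overflow(ctx, req->cqe.user_data,
						  req->cqe.res, req->cqe.flags,
						  0, 0);
		spin_unlock(&ctx->completion_lock);
	}
	return filled;
}

/* poll.c side then shrinks to: */
static void io_poll_req_complete(struct io_kiocb *req, io_tw_token_t tw)
{
	if (io_req_post_or_overflow(req))
		req->flags |= REQ_F_CQE_SKIP;
	io_req_task_complete(req, tw);
}
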
-- 
Jens Axboe

