From: Pavel Begunkov <asml.silence@gmail.com>
To: Jens Axboe <axboe@kernel.dk>, io-uring <io-uring@vger.kernel.org>
Subject: Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
Date: Wed, 7 May 2025 16:46:09 +0100 [thread overview]
Message-ID: <f62f094c-346c-49d0-a80e-bc5fc0dcdc34@gmail.com>
In-Reply-To: <611393de-4c50-4597-9290-82059e412a4b@kernel.dk>
On 5/7/25 16:26, Jens Axboe wrote:
> On 5/7/25 9:11 AM, Pavel Begunkov wrote:
>> On 5/7/25 15:58, Jens Axboe wrote:
>>> On 5/7/25 8:36 AM, Pavel Begunkov wrote:
>>>> On 5/7/25 14:56, Jens Axboe wrote:
>>>>> Multishot normally uses io_req_post_cqe() to post completions, but when
>>>>> stopping it, it may finish up with a deferred completion. This is fine,
>>>>> except if another multishot event triggers before the deferred completions
>>>>> get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
>>>>> as new multishot completions get posted before the deferred ones are
>>>>> flushed. This can cause confusion on the application side, if strict
>>>>> ordering is required for the use case.
>>>>>
>>>>> When multishot posting via io_req_post_cqe(), first flush any pending
>>>>> deferred completions.
>>>>>
>>>>> Cc: stable@vger.kernel.org # 6.1+
>>>>> Reported-by: Norman Maurer <norman_maurer@apple.com>
>>>>> Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>
>>>>> ---
>>>>>
>>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>>> index 769814d71153..541e65a1eebf 100644
>>>>> --- a/io_uring/io_uring.c
>>>>> +++ b/io_uring/io_uring.c
>>>>> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>>>>>  	struct io_ring_ctx *ctx = req->ctx;
>>>>>  	bool posted;
>>>>> +	/*
>>>>> +	 * If multishot has already posted deferred completions, ensure that
>>>>> +	 * those are flushed first before posting this one. If not, CQEs
>>>>> +	 * could get reordered.
>>>>> +	 */
>>>>> +	if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>>>>> +		__io_submit_flush_completions(ctx);
>>>>
>>>> A request is already dead if it's in compl_reqs; there should be no
>>>> way io_req_post_cqe() is called with it. Is it reordering of CQEs
>>>> belonging to the same request? And what do you mean by "deferred"
>>>> completions?
>>>
>>> It's not the same req; it's different requests using the same
>>> provided buffer ring, which is where it can be problematic.
>>
>> Ok, and there have never been any ordering guarantees between them.
>> Is there a report describing the problem? Why is it ok for CQEs
>> produced by io_req_post_cqe() for two multishot requests to get
>> reordered, but not when one of them is finishing its request?
>> What's the problem with provided buffers?
>
> There had better be ordering between the posting of the CQEs - I'm not
> talking about issue ordering in general, but if you have R1 (request 1)
> and R2 each consuming from the same buffer ring, then completion posting
> for those absolutely needs to be ordered. If not, you can have:
>
> R1 peeks X buffers, BID Y, BGID Z. It doesn't use io_req_post_cqe()
> because it's stopping, and ends up posting via io_req_task_complete ->
> io_req_complete_defer -> add to the deferred list.
>
> Other task_work in that run has R2, which grabs buffers from BGID Z,
> doesn't terminate, and posts CQE2 via io_req_post_cqe().
>
> tw run done, deferred completions flushed, R1 posts CQE1.
>
> Then we have CQE2 ahead of CQE1 in the CQ ring.
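
To make the dependence you're describing concrete, here's a rough
app-side sketch of the kind of consumer that breaks if those two CQEs
swap places. It assumes incrementally consumed ring buffers, where each
CQE for buffer 'bid' is taken to cover the next cqe->res bytes of that
buffer; 'bufs', 'consumed' and 'handle_chunk' are hypothetical app-side
names, not anything from the report:

	if (cqe->flags & IORING_CQE_F_BUFFER) {
		unsigned bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

		/* offset into the buffer is implied purely by CQE order */
		handle_chunk(bufs[bid] + consumed[bid], cqe->res);
		consumed[bid] += cqe->res;
	}
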
Needing to tell buffers apart is exactly why provided buffers have
returned buffer ids (i.e. the bid, the id within the group) from the
beginning. Is it incremental consumption, or some newer feature, that
can't tell buffer chunks apart without relying on CQE ordering?
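
For reference, a minimal sketch of the classic whole-buffer pattern,
where each CQE names its buffer and the app can locate the payload
without caring about CQE order ('bufs' indexed by bid and 'process' are
assumed app-side names):

	if (cqe->flags & IORING_CQE_F_BUFFER) {
		unsigned bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

		process(bufs[bid], cqe->res);
	}
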
>> It's a pretty bad spot to do this kind of thing: it disables any
>> batching of mixed mshot and non-mshot requests, and nesting the flush
>> under an opcode handler is pretty nasty; it should rather stay top
>> level as it is. From what I hear it's a provided-buffer specific
>> problem and should be fixed there.
>
> I agree it's not the prettiest, but I would prefer if we do it this way
> just in terms of stable backporting. This should be a relatively rare
Sure. Are you looking into fixing it in a different way for
upstream? Because it's not the right level of abstraction.
> occurrence, so I'm not too concerned about missed batching here.
Any app mixing recv and send, especially with some send zc, will
definitely be hitting it once the amount of traffic is large enough.
--
Pavel Begunkov