public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
From: Pavel Begunkov <asml.silence@gmail.com>
To: Jens Axboe <axboe@kernel.dk>, io-uring <io-uring@vger.kernel.org>
Subject: Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
Date: Wed, 7 May 2025 17:18:10 +0100	[thread overview]
Message-ID: <364679fa-8fc3-4bcb-8296-0877f39d6f2c@gmail.com> (raw)
In-Reply-To: <654d7a07-5b5f-4f78-bef5-dda6a919c3e1@kernel.dk>

On 5/7/25 16:49, Jens Axboe wrote:
> On 5/7/25 9:46 AM, Pavel Begunkov wrote:
>> On 5/7/25 16:26, Jens Axboe wrote:
>>> On 5/7/25 9:11 AM, Pavel Begunkov wrote:
>>>> On 5/7/25 15:58, Jens Axboe wrote:
>>>>> On 5/7/25 8:36 AM, Pavel Begunkov wrote:
>>>>>> On 5/7/25 14:56, Jens Axboe wrote:
>>>>>>> Multishot normally uses io_req_post_cqe() to post completions, but when
>>>>>>> stopping it, it may finish up with a deferred completion. This is fine,
>>>>>>> except if another multishot event triggers before the deferred completions
>>>>>>> get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
>>>>>>> as new multishot completions get posted before the deferred ones are
>>>>>>> flushed. This can cause confusion on the application side, if strict
>>>>>>> ordering is required for the use case.
>>>>>>>
>>>>>>> When multishot posting via io_req_post_cqe(), flush any pending deferred
>>>>>>> completions first, if any.
>>>>>>>
>>>>>>> Cc: stable@vger.kernel.org # 6.1+
>>>>>>> Reported-by: Norman Maurer <norman_maurer@apple.com>
>>>>>>> Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>>>>> index 769814d71153..541e65a1eebf 100644
>>>>>>> --- a/io_uring/io_uring.c
>>>>>>> +++ b/io_uring/io_uring.c
>>>>>>> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>>>>>>>          struct io_ring_ctx *ctx = req->ctx;
>>>>>>>          bool posted;
>>>>>>>      +    /*
>>>>>>> +     * If multishot has already posted deferred completions, ensure that
>>>>>>> +     * those are flushed first before posting this one. If not, CQEs
>>>>>>> +     * could get reordered.
>>>>>>> +     */
>>>>>>> +    if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>>>>>>> +        __io_submit_flush_completions(ctx);
>>>>>>
>>>>>> A request is already dead if it's in compl_reqs, there should be no
>>>>>> way io_req_post_cqe() is called with it. Is it reordering of CQEs
>>>>>> belonging to the same request? And what do you mean by "deferred"
>>>>>> completions?
>>>>>
>>>>> It's not the same req, it's different requests using the same
>>>>> provided buffer ring where it can be problematic.
>>>>
>>>> Ok, and there has never been any ordering guarantees between them.
>>>> Is there any report describing the problem? Why it's ok if
>>>> io_req_post_cqe() produced CQEs of two multishot requests get
>>>> reordered, but it's not when one of them is finishing a request?
>>>> What's the problem with provided buffers?
>>>
>>> There better be ordering between posting of the CQEs - I'm not talking
>>> about issue ordering in general, but if you have R1 (request 1) and R2
>>> each consuming from the same buffer ring, then completion posting of
>>> those absolutely need to be ordered. If not, you can have:
>>>
>>> R1 peek X buffers, BID Y, BGID Z. Doesn't use io_req_post_cqe() because
>>> it's stopping, end up posting via io_req_task_complete ->
>>> io_req_complete_defer -> add to deferred list.
>>>
>>> Other task_work in that run has R2, grabbing buffers from BGID Z,
>>> doesn't terminate, posts CQE2 via io_req_post_cqe().
>>>
>>> tw run done, deferred completions flushed, R1 posts CQE1.
>>>
>>> Then we have CQE2 ahead of CQE1 in the CQ ring.
>>
>> Which is why provided buffers from the beginning were returning
>> buffer ids (i.e. bid, id within the group). Is it incremental
>> consumption or some newer feature not being able to differentiate
>> buffer chunks apart from ordering?
> 
> Both incrementally consumed and bundles are susceptible to this
> reordering.
> 
>>>> It's a pretty bad spot to do such kinds of things, it disables any
>>>> mixed mshot with non-mshot batching, and nesting flushing under
>>>> an opcode handler is pretty nasty, it should rather stay top
>>>> level as it is. From what I hear it's a provided buffer specific
>>>> problem and should be fixed there.
>>>
>>> I agree it's not the prettiest, but I would prefer if we do it this way
>>> just in terms of stable backporting. This should be a relatively rare
>>
>> Sure. Are you looking into fixing it in a different way for
>> upstream? Because it's not the right level of abstraction.
> 
> Yeah I think so - basically the common use case for stopping multishot
> doesn't really work for this, as there are two methods of posting the
> CQE. One is using io_req_post_cqe(), the other is a final deferred
> completion. The mix and match of those two is what's causing this,
> obviously.

That's the reason why all mshot stuff I've been adding is solely
using io_req_post_cqe() for IO completions, and the final CQE only
carries error / 0. Maybe it's not too late to switch read/recv to
that model.

Or use io_req_post_cqe() for the final completion and silence
the CQE coming from flush_completions() with REQ_F_SKIP_CQE?

> I'd do this patch upstream, and then unify how the various users of
> io_req_post_cqe() terminate, and ensure that only one channel is used
> for posting CQEs for that case. When we're happy with that, we can
> always flag that for stable fixing this one too, at least for 6.x where
> x == newer.
> 
> Cooking that up right now on top...

-- 
Pavel Begunkov


  reply	other threads:[~2025-05-07 16:16 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-07 13:56 [PATCH] io_uring: ensure deferred completions are flushed for multishot Jens Axboe
2025-05-07 14:36 ` Pavel Begunkov
2025-05-07 14:58   ` Jens Axboe
2025-05-07 15:11     ` Pavel Begunkov
2025-05-07 15:26       ` Jens Axboe
2025-05-07 15:46         ` Pavel Begunkov
2025-05-07 15:49           ` Jens Axboe
2025-05-07 16:18             ` Pavel Begunkov [this message]
2025-05-07 17:51               ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=364679fa-8fc3-4bcb-8296-0877f39d6f2c@gmail.com \
    --to=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=io-uring@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox