public inbox for io-uring@vger.kernel.org
* [PATCH] io_uring: ensure deferred completions are flushed for multishot
@ 2025-05-07 13:56 Jens Axboe
  2025-05-07 14:36 ` Pavel Begunkov
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2025-05-07 13:56 UTC (permalink / raw)
  To: io-uring

Multishot normally uses io_req_post_cqe() to post completions, but when
stopping it, it may finish up with a deferred completion. This is fine,
except if another multishot event triggers before the deferred completions
get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
as new multishot completions get posted before the deferred ones are
flushed. This can cause confusion on the application side, if strict
ordering is required for the use case.

When posting a multishot completion via io_req_post_cqe(), flush any
pending deferred completions first.

Cc: stable@vger.kernel.org # 6.1+
Reported-by: Norman Maurer <norman_maurer@apple.com>
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 769814d71153..541e65a1eebf 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
 	struct io_ring_ctx *ctx = req->ctx;
 	bool posted;
 
+	/*
+	 * If multishot has already posted deferred completions, ensure that
+	 * those are flushed first before posting this one. If not, CQEs
+	 * could get reordered.
+	 */
+	if (!wq_list_empty(&ctx->submit_state.compl_reqs))
+		__io_submit_flush_completions(ctx);
+
 	lockdep_assert(!io_wq_current_is_worker());
 	lockdep_assert_held(&ctx->uring_lock);
 
-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
  2025-05-07 13:56 [PATCH] io_uring: ensure deferred completions are flushed for multishot Jens Axboe
@ 2025-05-07 14:36 ` Pavel Begunkov
  2025-05-07 14:58   ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Pavel Begunkov @ 2025-05-07 14:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 5/7/25 14:56, Jens Axboe wrote:
> Multishot normally uses io_req_post_cqe() to post completions, but when
> stopping it, it may finish up with a deferred completion. This is fine,
> except if another multishot event triggers before the deferred completions
> get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
> as new multishot completions get posted before the deferred ones are
> flushed. This can cause confusion on the application side, if strict
> ordering is required for the use case.
> 
> When multishot posting via io_req_post_cqe(), flush any pending deferred
> completions first, if any.
> 
> Cc: stable@vger.kernel.org # 6.1+
> Reported-by: Norman Maurer <norman_maurer@apple.com>
> Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 769814d71153..541e65a1eebf 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>   	struct io_ring_ctx *ctx = req->ctx;
>   	bool posted;
>   
> +	/*
> +	 * If multishot has already posted deferred completions, ensure that
> +	 * those are flushed first before posting this one. If not, CQEs
> +	 * could get reordered.
> +	 */
> +	if (!wq_list_empty(&ctx->submit_state.compl_reqs))
> +		__io_submit_flush_completions(ctx);

A request is already dead if it's in compl_reqs, there should be no
way io_req_post_cqe() is called with it. Is it reordering of CQEs
belonging to the same request? And what do you mean by "deferred"
completions?


> +
>   	lockdep_assert(!io_wq_current_is_worker());
>   	lockdep_assert_held(&ctx->uring_lock);
>   

-- 
Pavel Begunkov



* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
  2025-05-07 14:36 ` Pavel Begunkov
@ 2025-05-07 14:58   ` Jens Axboe
  2025-05-07 15:11     ` Pavel Begunkov
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2025-05-07 14:58 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/7/25 8:36 AM, Pavel Begunkov wrote:
> On 5/7/25 14:56, Jens Axboe wrote:
>> Multishot normally uses io_req_post_cqe() to post completions, but when
>> stopping it, it may finish up with a deferred completion. This is fine,
>> except if another multishot event triggers before the deferred completions
>> get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
>> as new multishot completions get posted before the deferred ones are
>> flushed. This can cause confusion on the application side, if strict
>> ordering is required for the use case.
>>
>> When multishot posting via io_req_post_cqe(), flush any pending deferred
>> completions first, if any.
>>
>> Cc: stable@vger.kernel.org # 6.1+
>> Reported-by: Norman Maurer <norman_maurer@apple.com>
>> Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> ---
>>
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 769814d71153..541e65a1eebf 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>>       struct io_ring_ctx *ctx = req->ctx;
>>       bool posted;
>>   +    /*
>> +     * If multishot has already posted deferred completions, ensure that
>> +     * those are flushed first before posting this one. If not, CQEs
>> +     * could get reordered.
>> +     */
>> +    if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>> +        __io_submit_flush_completions(ctx);
> 
> A request is already dead if it's in compl_reqs, there should be no
> way io_req_post_cqe() is called with it. Is it reordering of CQEs
> belonging to the same request? And what do you mean by "deferred"
> completions?

It's not the same req, it's different requests using the same
provided buffer ring where it can be problematic.

-- 
Jens Axboe



* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
  2025-05-07 14:58   ` Jens Axboe
@ 2025-05-07 15:11     ` Pavel Begunkov
  2025-05-07 15:26       ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Pavel Begunkov @ 2025-05-07 15:11 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 5/7/25 15:58, Jens Axboe wrote:
> On 5/7/25 8:36 AM, Pavel Begunkov wrote:
>> On 5/7/25 14:56, Jens Axboe wrote:
>>> Multishot normally uses io_req_post_cqe() to post completions, but when
>>> stopping it, it may finish up with a deferred completion. This is fine,
>>> except if another multishot event triggers before the deferred completions
>>> get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
>>> as new multishot completions get posted before the deferred ones are
>>> flushed. This can cause confusion on the application side, if strict
>>> ordering is required for the use case.
>>>
>>> When multishot posting via io_req_post_cqe(), flush any pending deferred
>>> completions first, if any.
>>>
>>> Cc: stable@vger.kernel.org # 6.1+
>>> Reported-by: Norman Maurer <norman_maurer@apple.com>
>>> Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>
>>> ---
>>>
>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>> index 769814d71153..541e65a1eebf 100644
>>> --- a/io_uring/io_uring.c
>>> +++ b/io_uring/io_uring.c
>>> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>>>        struct io_ring_ctx *ctx = req->ctx;
>>>        bool posted;
>>>    +    /*
>>> +     * If multishot has already posted deferred completions, ensure that
>>> +     * those are flushed first before posting this one. If not, CQEs
>>> +     * could get reordered.
>>> +     */
>>> +    if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>>> +        __io_submit_flush_completions(ctx);
>>
>> A request is already dead if it's in compl_reqs, there should be no
>> way io_req_post_cqe() is called with it. Is it reordering of CQEs
>> belonging to the same request? And what do you mean by "deferred"
>> completions?
> 
> It's not the same req, it's different requests using the same
> provided buffer ring where it can be problematic.

Ok, and there have never been any ordering guarantees between them.
Is there any report describing the problem? Why is it ok if CQEs
produced by io_req_post_cqe() for two multishot requests get
reordered, but not when one of them is finishing a request?
What's the problem with provided buffers?

It's a pretty bad spot to do such kinds of things, it disables any
mixed mshot with non-mshot batching, and nesting flushing under
an opcode handler is pretty nasty, it should rather stay top
level as it is. From what I hear it's a provided buffer specific
problem and should be fixed there.

-- 
Pavel Begunkov



* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
  2025-05-07 15:11     ` Pavel Begunkov
@ 2025-05-07 15:26       ` Jens Axboe
  2025-05-07 15:46         ` Pavel Begunkov
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2025-05-07 15:26 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/7/25 9:11 AM, Pavel Begunkov wrote:
> On 5/7/25 15:58, Jens Axboe wrote:
>> On 5/7/25 8:36 AM, Pavel Begunkov wrote:
>>> On 5/7/25 14:56, Jens Axboe wrote:
>>>> Multishot normally uses io_req_post_cqe() to post completions, but when
>>>> stopping it, it may finish up with a deferred completion. This is fine,
>>>> except if another multishot event triggers before the deferred completions
>>>> get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
>>>> as new multishot completions get posted before the deferred ones are
>>>> flushed. This can cause confusion on the application side, if strict
>>>> ordering is required for the use case.
>>>>
>>>> When multishot posting via io_req_post_cqe(), flush any pending deferred
>>>> completions first, if any.
>>>>
>>>> Cc: stable@vger.kernel.org # 6.1+
>>>> Reported-by: Norman Maurer <norman_maurer@apple.com>
>>>> Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>
>>>> ---
>>>>
>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>> index 769814d71153..541e65a1eebf 100644
>>>> --- a/io_uring/io_uring.c
>>>> +++ b/io_uring/io_uring.c
>>>> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>>>>        struct io_ring_ctx *ctx = req->ctx;
>>>>        bool posted;
>>>>    +    /*
>>>> +     * If multishot has already posted deferred completions, ensure that
>>>> +     * those are flushed first before posting this one. If not, CQEs
>>>> +     * could get reordered.
>>>> +     */
>>>> +    if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>>>> +        __io_submit_flush_completions(ctx);
>>>
>>> A request is already dead if it's in compl_reqs, there should be no
>>> way io_req_post_cqe() is called with it. Is it reordering of CQEs
>>> belonging to the same request? And what do you mean by "deferred"
>>> completions?
>>
>> It's not the same req, it's different requests using the same
>> provided buffer ring where it can be problematic.
> 
> Ok, and there has never been any ordering guarantees between them.
> Is there any report describing the problem? Why it's ok if
> io_req_post_cqe() produced CQEs of two multishot requests get
> reordered, but it's not when one of them is finishing a request?
> What's the problem with provided buffers?

There better be ordering between posting of the CQEs - I'm not talking
about issue ordering in general, but if you have R1 (request 1) and R2
each consuming from the same buffer ring, then completion posting of
those absolutely need to be ordered. If not, you can have:

R1 peek X buffers, BID Y, BGID Z. Doesn't use io_req_post_cqe() because
it's stopping, end up posting via io_req_task_complete ->
io_req_complete_defer -> add to deferred list.

Other task_work in that run has R2, grabbing buffers from BGID Z,
doesn't terminate, posts CQE2 via io_req_post_cqe().

tw run done, deferred completions flushed, R1 posts CQE1.

Then we have CQE2 ahead of CQE1 in the CQ ring.

> It's a pretty bad spot to do such kinds of things, it disables any
> mixed mshot with non-mshot batching, and nesting flushing under
> an opcode handler is pretty nasty, it should rather stay top
> level as it is. From what I hear it's a provided buffer specific
> problem and should be fixed there.

I agree it's not the prettiest, but I would prefer if we do it this way
just in terms of stable backporting. This should be a relatively rare
occurrence, so I'm not too concerned about missed batching here.

-- 
Jens Axboe


* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
  2025-05-07 15:26       ` Jens Axboe
@ 2025-05-07 15:46         ` Pavel Begunkov
  2025-05-07 15:49           ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Pavel Begunkov @ 2025-05-07 15:46 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 5/7/25 16:26, Jens Axboe wrote:
> On 5/7/25 9:11 AM, Pavel Begunkov wrote:
>> On 5/7/25 15:58, Jens Axboe wrote:
>>> On 5/7/25 8:36 AM, Pavel Begunkov wrote:
>>>> On 5/7/25 14:56, Jens Axboe wrote:
>>>>> Multishot normally uses io_req_post_cqe() to post completions, but when
>>>>> stopping it, it may finish up with a deferred completion. This is fine,
>>>>> except if another multishot event triggers before the deferred completions
>>>>> get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
>>>>> as new multishot completions get posted before the deferred ones are
>>>>> flushed. This can cause confusion on the application side, if strict
>>>>> ordering is required for the use case.
>>>>>
>>>>> When multishot posting via io_req_post_cqe(), flush any pending deferred
>>>>> completions first, if any.
>>>>>
>>>>> Cc: stable@vger.kernel.org # 6.1+
>>>>> Reported-by: Norman Maurer <norman_maurer@apple.com>
>>>>> Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>
>>>>> ---
>>>>>
>>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>>> index 769814d71153..541e65a1eebf 100644
>>>>> --- a/io_uring/io_uring.c
>>>>> +++ b/io_uring/io_uring.c
>>>>> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>>>>>         struct io_ring_ctx *ctx = req->ctx;
>>>>>         bool posted;
>>>>>     +    /*
>>>>> +     * If multishot has already posted deferred completions, ensure that
>>>>> +     * those are flushed first before posting this one. If not, CQEs
>>>>> +     * could get reordered.
>>>>> +     */
>>>>> +    if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>>>>> +        __io_submit_flush_completions(ctx);
>>>>
>>>> A request is already dead if it's in compl_reqs, there should be no
>>>> way io_req_post_cqe() is called with it. Is it reordering of CQEs
>>>> belonging to the same request? And what do you mean by "deferred"
>>>> completions?
>>>
>>> It's not the same req, it's different requests using the same
>>> provided buffer ring where it can be problematic.
>>
>> Ok, and there has never been any ordering guarantees between them.
>> Is there any report describing the problem? Why it's ok if
>> io_req_post_cqe() produced CQEs of two multishot requests get
>> reordered, but it's not when one of them is finishing a request?
>> What's the problem with provided buffers?
> 
> There better be ordering between posting of the CQEs - I'm not talking
> about issue ordering in general, but if you have R1 (request 1) and R2
> each consuming from the same buffer ring, then completion posting of
> those absolutely need to be ordered. If not, you can have:
> 
> R1 peek X buffers, BID Y, BGID Z. Doesn't use io_req_post_cqe() because
> it's stopping, end up posting via io_req_task_complete ->
> io_req_complete_defer -> add to deferred list.
> 
> Other task_work in that run has R2, grabbing buffers from BGID Z,
> doesn't terminate, posts CQE2 via io_req_post_cqe().
> 
> tw run done, deferred completions flushed, R1 posts CQE1.
> 
> Then we have CQE2 ahead of CQE1 in the CQ ring.

Which is why provided buffers from the beginning were returning
buffer ids (i.e. bid, id within the group). Is it incremental
consumption or some newer feature not being able to differentiate
buffer chunks apart from ordering?

>> It's a pretty bad spot to do such kinds of things, it disables any
>> mixed mshot with non-mshot batching, and nesting flushing under
>> an opcode handler is pretty nasty, it should rather stay top
>> level as it is. From what I hear it's a provided buffer specific
>> problem and should be fixed there.
> 
> I agree it's not the prettiest, but I would prefer if we do it this way
> just in terms of stable backporting. This should be a relatively rare

Sure. Are you looking into fixing it in a different way for
upstream? Because it's not the right level of abstraction.

> occurrence, so I'm not too concerned about missed batching here.

Any mixed recv / send app, especially one with some send zc, will
definitely be hitting it if the amount of traffic is large enough.

-- 
Pavel Begunkov



* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
  2025-05-07 15:46         ` Pavel Begunkov
@ 2025-05-07 15:49           ` Jens Axboe
  2025-05-07 16:18             ` Pavel Begunkov
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2025-05-07 15:49 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/7/25 9:46 AM, Pavel Begunkov wrote:
> On 5/7/25 16:26, Jens Axboe wrote:
>> On 5/7/25 9:11 AM, Pavel Begunkov wrote:
>>> On 5/7/25 15:58, Jens Axboe wrote:
>>>> On 5/7/25 8:36 AM, Pavel Begunkov wrote:
>>>>> On 5/7/25 14:56, Jens Axboe wrote:
>>>>>> Multishot normally uses io_req_post_cqe() to post completions, but when
>>>>>> stopping it, it may finish up with a deferred completion. This is fine,
>>>>>> except if another multishot event triggers before the deferred completions
>>>>>> get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
>>>>>> as new multishot completions get posted before the deferred ones are
>>>>>> flushed. This can cause confusion on the application side, if strict
>>>>>> ordering is required for the use case.
>>>>>>
>>>>>> When multishot posting via io_req_post_cqe(), flush any pending deferred
>>>>>> completions first, if any.
>>>>>>
>>>>>> Cc: stable@vger.kernel.org # 6.1+
>>>>>> Reported-by: Norman Maurer <norman_maurer@apple.com>
>>>>>> Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>>>> index 769814d71153..541e65a1eebf 100644
>>>>>> --- a/io_uring/io_uring.c
>>>>>> +++ b/io_uring/io_uring.c
>>>>>> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>>>>>>         struct io_ring_ctx *ctx = req->ctx;
>>>>>>         bool posted;
>>>>>>     +    /*
>>>>>> +     * If multishot has already posted deferred completions, ensure that
>>>>>> +     * those are flushed first before posting this one. If not, CQEs
>>>>>> +     * could get reordered.
>>>>>> +     */
>>>>>> +    if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>>>>>> +        __io_submit_flush_completions(ctx);
>>>>>
>>>>> A request is already dead if it's in compl_reqs, there should be no
>>>>> way io_req_post_cqe() is called with it. Is it reordering of CQEs
>>>>> belonging to the same request? And what do you mean by "deferred"
>>>>> completions?
>>>>
>>>> It's not the same req, it's different requests using the same
>>>> provided buffer ring where it can be problematic.
>>>
>>> Ok, and there has never been any ordering guarantees between them.
>>> Is there any report describing the problem? Why it's ok if
>>> io_req_post_cqe() produced CQEs of two multishot requests get
>>> reordered, but it's not when one of them is finishing a request?
>>> What's the problem with provided buffers?
>>
>> There better be ordering between posting of the CQEs - I'm not talking
>> about issue ordering in general, but if you have R1 (request 1) and R2
>> each consuming from the same buffer ring, then completion posting of
>> those absolutely need to be ordered. If not, you can have:
>>
>> R1 peek X buffers, BID Y, BGID Z. Doesn't use io_req_post_cqe() because
>> it's stopping, end up posting via io_req_task_complete ->
>> io_req_complete_defer -> add to deferred list.
>>
>> Other task_work in that run has R2, grabbing buffers from BGID Z,
>> doesn't terminate, posts CQE2 via io_req_post_cqe().
>>
>> tw run done, deferred completions flushed, R1 posts CQE1.
>>
>> Then we have CQE2 ahead of CQE1 in the CQ ring.
> 
> Which is why provided buffers from the beginning were returning
> buffer ids (i.e. bid, id within the group). Is it incremental
> consumption or some newer feature not being able to differentiate
> buffer chunks apart from ordering?

Both incrementally consumed and bundles are susceptible to this
reordering.

>>> It's a pretty bad spot to do such kinds of things, it disables any
>>> mixed mshot with non-mshot batching, and nesting flushing under
>>> an opcode handler is pretty nasty, it should rather stay top
>>> level as it is. From what I hear it's a provided buffer specific
>>> problem and should be fixed there.
>>
>> I agree it's not the prettiest, but I would prefer if we do it this way
>> just in terms of stable backporting. This should be a relatively rare
> 
> Sure. Are you looking into fixing it in a different way for
> upstream? Because it's not the right level of abstraction.

Yeah I think so - basically the common use case for stopping multishot
doesn't really work for this, as there are two methods of posting the
CQE. One is using io_req_post_cqe(), the other is a final deferred
completion. The mix and match of those two is what's causing this,
obviously.

I'd do this patch upstream, and then unify how the various users of
io_req_post_cqe() terminate, and ensure that only one channel is used
for posting CQEs for that case. When we're happy with that, we can
always flag that for stable fixing this one too, at least for 6.x where
x == newer.

Cooking that up right now on top...

-- 
Jens Axboe


* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
  2025-05-07 15:49           ` Jens Axboe
@ 2025-05-07 16:18             ` Pavel Begunkov
  2025-05-07 17:51               ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Pavel Begunkov @ 2025-05-07 16:18 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 5/7/25 16:49, Jens Axboe wrote:
> On 5/7/25 9:46 AM, Pavel Begunkov wrote:
>> On 5/7/25 16:26, Jens Axboe wrote:
>>> On 5/7/25 9:11 AM, Pavel Begunkov wrote:
>>>> On 5/7/25 15:58, Jens Axboe wrote:
>>>>> On 5/7/25 8:36 AM, Pavel Begunkov wrote:
>>>>>> On 5/7/25 14:56, Jens Axboe wrote:
>>>>>>> Multishot normally uses io_req_post_cqe() to post completions, but when
>>>>>>> stopping it, it may finish up with a deferred completion. This is fine,
>>>>>>> except if another multishot event triggers before the deferred completions
>>>>>>> get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
>>>>>>> as new multishot completions get posted before the deferred ones are
>>>>>>> flushed. This can cause confusion on the application side, if strict
>>>>>>> ordering is required for the use case.
>>>>>>>
>>>>>>> When multishot posting via io_req_post_cqe(), flush any pending deferred
>>>>>>> completions first, if any.
>>>>>>>
>>>>>>> Cc: stable@vger.kernel.org # 6.1+
>>>>>>> Reported-by: Norman Maurer <norman_maurer@apple.com>
>>>>>>> Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>>>>> index 769814d71153..541e65a1eebf 100644
>>>>>>> --- a/io_uring/io_uring.c
>>>>>>> +++ b/io_uring/io_uring.c
>>>>>>> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>>>>>>>          struct io_ring_ctx *ctx = req->ctx;
>>>>>>>          bool posted;
>>>>>>>      +    /*
>>>>>>> +     * If multishot has already posted deferred completions, ensure that
>>>>>>> +     * those are flushed first before posting this one. If not, CQEs
>>>>>>> +     * could get reordered.
>>>>>>> +     */
>>>>>>> +    if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>>>>>>> +        __io_submit_flush_completions(ctx);
>>>>>>
>>>>>> A request is already dead if it's in compl_reqs, there should be no
>>>>>> way io_req_post_cqe() is called with it. Is it reordering of CQEs
>>>>>> belonging to the same request? And what do you mean by "deferred"
>>>>>> completions?
>>>>>
>>>>> It's not the same req, it's different requests using the same
>>>>> provided buffer ring where it can be problematic.
>>>>
>>>> Ok, and there has never been any ordering guarantees between them.
>>>> Is there any report describing the problem? Why it's ok if
>>>> io_req_post_cqe() produced CQEs of two multishot requests get
>>>> reordered, but it's not when one of them is finishing a request?
>>>> What's the problem with provided buffers?
>>>
>>> There better be ordering between posting of the CQEs - I'm not talking
>>> about issue ordering in general, but if you have R1 (request 1) and R2
>>> each consuming from the same buffer ring, then completion posting of
>>> those absolutely need to be ordered. If not, you can have:
>>>
>>> R1 peek X buffers, BID Y, BGID Z. Doesn't use io_req_post_cqe() because
>>> it's stopping, end up posting via io_req_task_complete ->
>>> io_req_complete_defer -> add to deferred list.
>>>
>>> Other task_work in that run has R2, grabbing buffers from BGID Z,
>>> doesn't terminate, posts CQE2 via io_req_post_cqe().
>>>
>>> tw run done, deferred completions flushed, R1 posts CQE1.
>>>
>>> Then we have CQE2 ahead of CQE1 in the CQ ring.
>>
>> Which is why provided buffers from the beginning were returning
>> buffer ids (i.e. bid, id within the group). Is it incremental
>> consumption or some newer feature not being able to differentiate
>> buffer chunks apart from ordering?
> 
> Both incrementally consumed and bundles are susceptible to this
> reordering.
> 
>>>> It's a pretty bad spot to do such kinds of things, it disables any
>>>> mixed mshot with non-mshot batching, and nesting flushing under
>>>> an opcode handler is pretty nasty, it should rather stay top
>>>> level as it is. From what I hear it's a provided buffer specific
>>>> problem and should be fixed there.
>>>
>>> I agree it's not the prettiest, but I would prefer if we do it this way
>>> just in terms of stable backporting. This should be a relatively rare
>>
>> Sure. Are you looking into fixing it in a different way for
>> upstream? Because it's not the right level of abstraction.
> 
> Yeah I think so - basically the common use case for stopping multishot
> doesn't really work for this, as there are two methods of posting the
> CQE. One is using io_req_post_cqe(), the other is a final deferred
> completion. The mix and match of those two is what's causing this,
> obviously.

That's the reason why all mshot stuff I've been adding is solely
using io_req_post_cqe() for IO completions, and the final CQE only
carries error / 0. Maybe it's not too late to switch read/recv to
that model.

Or use io_req_post_cqe() for the final completion and silence
the CQE coming from flush_completions() with REQ_F_SKIP_CQE?

> I'd do this patch upstream, and then unify how the various users of
> io_req_post_cqe() terminate, and ensure that only one channel is used
> for posting CQEs for that case. When we're happy with that, we can
> always flag that for stable fixing this one too, at least for 6.x where
> x == newer.
> 
> Cooking that up right now on top...

-- 
Pavel Begunkov



* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
  2025-05-07 16:18             ` Pavel Begunkov
@ 2025-05-07 17:51               ` Jens Axboe
  0 siblings, 0 replies; 9+ messages in thread
From: Jens Axboe @ 2025-05-07 17:51 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 5/7/25 10:18 AM, Pavel Begunkov wrote:
> On 5/7/25 16:49, Jens Axboe wrote:
>> On 5/7/25 9:46 AM, Pavel Begunkov wrote:
>>> On 5/7/25 16:26, Jens Axboe wrote:
>>>> On 5/7/25 9:11 AM, Pavel Begunkov wrote:
>>>>> On 5/7/25 15:58, Jens Axboe wrote:
>>>>>> On 5/7/25 8:36 AM, Pavel Begunkov wrote:
>>>>>>> On 5/7/25 14:56, Jens Axboe wrote:
>>>>>>>> Multishot normally uses io_req_post_cqe() to post completions, but when
>>>>>>>> stopping it, it may finish up with a deferred completion. This is fine,
>>>>>>>> except if another multishot event triggers before the deferred completions
>>>>>>>> get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
>>>>>>>> as new multishot completions get posted before the deferred ones are
>>>>>>>> flushed. This can cause confusion on the application side, if strict
>>>>>>>> ordering is required for the use case.
>>>>>>>>
>>>>>>>> When multishot posting via io_req_post_cqe(), flush any pending deferred
>>>>>>>> completions first, if any.
>>>>>>>>
>>>>>>>> Cc: stable@vger.kernel.org # 6.1+
>>>>>>>> Reported-by: Norman Maurer <norman_maurer@apple.com>
>>>>>>>> Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>>>>>>> index 769814d71153..541e65a1eebf 100644
>>>>>>>> --- a/io_uring/io_uring.c
>>>>>>>> +++ b/io_uring/io_uring.c
>>>>>>>> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>>>>>>>>          struct io_ring_ctx *ctx = req->ctx;
>>>>>>>>          bool posted;
>>>>>>>>      +    /*
>>>>>>>> +     * If multishot has already posted deferred completions, ensure that
>>>>>>>> +     * those are flushed first before posting this one. If not, CQEs
>>>>>>>> +     * could get reordered.
>>>>>>>> +     */
>>>>>>>> +    if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>>>>>>>> +        __io_submit_flush_completions(ctx);
>>>>>>>
>>>>>>> A request is already dead if it's in compl_reqs, there should be no
>>>>>>> way io_req_post_cqe() is called with it. Is it reordering of CQEs
>>>>>>> belonging to the same request? And what do you mean by "deferred"
>>>>>>> completions?
>>>>>>
>>>>>> It's not the same req, it's different requests using the same
>>>>>> provided buffer ring where it can be problematic.
>>>>>
>>>>> Ok, and there has never been any ordering guarantees between them.
>>>>> Is there any report describing the problem? Why it's ok if
>>>>> io_req_post_cqe() produced CQEs of two multishot requests get
>>>>> reordered, but it's not when one of them is finishing a request?
>>>>> What's the problem with provided buffers?
>>>>
>>>> There had better be ordering between posting of the CQEs - I'm not
>>>> talking about issue ordering in general, but if you have R1 (request 1)
>>>> and R2 each consuming from the same buffer ring, then completion posting
>>>> of those absolutely needs to be ordered. If not, you can have:
>>>>
>>>> R1 peeks X buffers, BID Y, BGID Z. Doesn't use io_req_post_cqe() because
>>>> it's stopping, and ends up posting via io_req_task_complete ->
>>>> io_req_complete_defer -> add to deferred list.
>>>>
>>>> Other task_work in that run has R2, grabbing buffers from BGID Z,
>>>> doesn't terminate, posts CQE2 via io_req_post_cqe().
>>>>
>>>> tw run done, deferred completions flushed, R1 posts CQE1.
>>>>
>>>> Then we have CQE2 ahead of CQE1 in the CQ ring.
>>>
>>> Which is why provided buffers have returned buffer ids from the
>>> beginning (i.e. bid, the id within the group). Is it incremental
>>> consumption or some newer feature that can't tell buffer chunks
>>> apart without relying on ordering?
>>
>> Both incrementally consumed buffers and bundles are susceptible to this
>> reordering.
>>
>>>>> It's a pretty bad spot to do such kinds of things, it disables any
>>>>> mixed mshot with non-mshot batching, and nesting flushing under
>>>>> an opcode handler is pretty nasty, it should rather stay top
>>>>> level as it is. From what I hear it's a provided buffer specific
>>>>> problem and should be fixed there.
>>>>
>>>> I agree it's not the prettiest, but I would prefer if we do it this way
>>>> just in terms of stable backporting. This should be a relatively rare
>>>
>>> Sure. Are you looking into fixing it in a different way for
>>> upstream? Because it's not the right level of abstraction.
>>
>> Yeah I think so - basically the common use case for stopping multishot
>> doesn't really work for this, as there are two methods of posting the
>> CQE. One is using io_req_post_cqe(), the other is a final deferred
>> completion. The mix and match of those two is what's causing this,
>> obviously.
> 
> That's the reason why all mshot stuff I've been adding is solely
> using io_req_post_cqe() for IO completions, and the final CQE only
> carries error / 0. Maybe it's not too late to switch read/recv to
> that model.
> 
> Or use io_req_post_cqe() for the final completion and silence
> the CQE coming from flush_completions() with REQ_F_SKIP_CQE?

Yeah we can do something like that - the below should work. Might be
cleaner to simply add a new helper for this in io_uring.c and leave the
overflow handling private and in there.

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 541e65a1eebf..796ecd4bea5c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -700,8 +700,8 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task)
 	}
 }
 
-static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
-				     s32 res, u32 cflags, u64 extra1, u64 extra2)
+bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, s32 res,
+			      u32 cflags, u64 extra1, u64 extra2)
 {
 	struct io_overflow_cqe *ocqe;
 	size_t ocq_size = sizeof(struct io_overflow_cqe);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index e4050b2d0821..dd0a89674095 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -82,6 +82,8 @@ void io_req_defer_failed(struct io_kiocb *req, s32 res);
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
 void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
 bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags);
+bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, s32 res,
+			      u32 cflags, u64 extra1, u64 extra2);
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
 
 struct file *io_file_get_normal(struct io_kiocb *req, int fd);
diff --git a/io_uring/poll.c b/io_uring/poll.c
index 8eb744eb9f4c..c002508015a4 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -312,6 +312,24 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 	return IOU_POLL_NO_ACTION;
 }
 
+static void io_poll_req_complete(struct io_kiocb *req, io_tw_token_t tw)
+{
+	bool filled;
+
+	filled = io_req_post_cqe(req, req->cqe.res, req->cqe.flags);
+	if (unlikely(!filled)) {
+		struct io_ring_ctx *ctx = req->ctx;
+
+		spin_lock(&ctx->completion_lock);
+		filled = io_cqring_event_overflow(ctx, req->cqe.user_data,
+					req->cqe.res, req->cqe.flags, 0, 0);
+		spin_unlock(&ctx->completion_lock);
+	}
+	if (filled)
+		req->flags |= REQ_F_CQE_SKIP;
+	io_req_task_complete(req, tw);
+}
+
 void io_poll_task_func(struct io_kiocb *req, io_tw_token_t tw)
 {
 	int ret;
@@ -349,7 +367,7 @@ void io_poll_task_func(struct io_kiocb *req, io_tw_token_t tw)
 		io_tw_lock(req->ctx, tw);
 
 		if (ret == IOU_POLL_REMOVE_POLL_USE_RES)
-			io_req_task_complete(req, tw);
+			io_poll_req_complete(req, tw);
 		else if (ret == IOU_POLL_DONE || ret == IOU_POLL_REISSUE)
 			io_req_task_submit(req, tw);
 		else

-- 
Jens Axboe


end of thread, other threads:[~2025-05-07 17:51 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-07 13:56 [PATCH] io_uring: ensure deferred completions are flushed for multishot Jens Axboe
2025-05-07 14:36 ` Pavel Begunkov
2025-05-07 14:58   ` Jens Axboe
2025-05-07 15:11     ` Pavel Begunkov
2025-05-07 15:26       ` Jens Axboe
2025-05-07 15:46         ` Pavel Begunkov
2025-05-07 15:49           ` Jens Axboe
2025-05-07 16:18             ` Pavel Begunkov
2025-05-07 17:51               ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox