* [PATCH] io_uring: ensure deferred completions are flushed for multishot
From: Jens Axboe <axboe@kernel.dk>
Date: 2025-05-07 13:56 UTC
To: io-uring

Multishot normally uses io_req_post_cqe() to post completions, but when
stopping it, it may finish up with a deferred completion. This is fine,
except if another multishot event triggers before the deferred
completions get flushed. If this occurs, then CQEs may get reordered in
the CQ ring, as new multishot completions get posted before the deferred
ones are flushed. This can cause confusion on the application side, if
strict ordering is required for the use case.

When multishot posting via io_req_post_cqe(), flush any pending deferred
completions first, if any.

Cc: stable@vger.kernel.org # 6.1+
Reported-by: Norman Maurer <norman_maurer@apple.com>
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 769814d71153..541e65a1eebf 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
 	struct io_ring_ctx *ctx = req->ctx;
 	bool posted;
 
+	/*
+	 * If multishot has already posted deferred completions, ensure that
+	 * those are flushed first before posting this one. If not, CQEs
+	 * could get reordered.
+	 */
+	if (!wq_list_empty(&ctx->submit_state.compl_reqs))
+		__io_submit_flush_completions(ctx);
+
 	lockdep_assert(!io_wq_current_is_worker());
 	lockdep_assert_held(&ctx->uring_lock);

-- 
Jens Axboe
* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
From: Pavel Begunkov
Date: 2025-05-07 14:36 UTC
To: Jens Axboe, io-uring

On 5/7/25 14:56, Jens Axboe wrote:
> Multishot normally uses io_req_post_cqe() to post completions, but when
> stopping it, it may finish up with a deferred completion. This is fine,
> except if another multishot event triggers before the deferred completions
> get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
> as new multishot completions get posted before the deferred ones are
> flushed. This can cause confusion on the application side, if strict
> ordering is required for the use case.
>
> When multishot posting via io_req_post_cqe(), flush any pending deferred
> completions first, if any.
>
> Cc: stable@vger.kernel.org # 6.1+
> Reported-by: Norman Maurer <norman_maurer@apple.com>
> Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> ---
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 769814d71153..541e65a1eebf 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>  	struct io_ring_ctx *ctx = req->ctx;
>  	bool posted;
>  
> +	/*
> +	 * If multishot has already posted deferred completions, ensure that
> +	 * those are flushed first before posting this one. If not, CQEs
> +	 * could get reordered.
> +	 */
> +	if (!wq_list_empty(&ctx->submit_state.compl_reqs))
> +		__io_submit_flush_completions(ctx);

A request is already dead if it's in compl_reqs; there should be no way
io_req_post_cqe() is called with it. Is it reordering of CQEs belonging
to the same request? And what do you mean by "deferred" completions?

> +
>  	lockdep_assert(!io_wq_current_is_worker());
>  	lockdep_assert_held(&ctx->uring_lock);

-- 
Pavel Begunkov
* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
From: Jens Axboe
Date: 2025-05-07 14:58 UTC
To: Pavel Begunkov, io-uring

On 5/7/25 8:36 AM, Pavel Begunkov wrote:
> On 5/7/25 14:56, Jens Axboe wrote:
>> [...]
>> +	if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>> +		__io_submit_flush_completions(ctx);
>
> A request is already dead if it's in compl_reqs, there should be no
> way io_req_post_cqe() is called with it. Is it reordering of CQEs
> belonging to the same request? And what do you mean by "deferred"
> completions?

It's not the same req, it's different requests using the same provided
buffer ring where it can be problematic.

-- 
Jens Axboe
* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
From: Pavel Begunkov
Date: 2025-05-07 15:11 UTC
To: Jens Axboe, io-uring

On 5/7/25 15:58, Jens Axboe wrote:
> On 5/7/25 8:36 AM, Pavel Begunkov wrote:
>> A request is already dead if it's in compl_reqs, there should be no
>> way io_req_post_cqe() is called with it. Is it reordering of CQEs
>> belonging to the same request? And what do you mean by "deferred"
>> completions?
>
> It's not the same req, it's different requests using the same
> provided buffer ring where it can be problematic.

Ok, and there have never been any ordering guarantees between them. Is
there any report describing the problem? Why is it ok if CQEs produced
by io_req_post_cqe() for two multishot requests get reordered, but not
when one of them is finishing a request? What's the problem with
provided buffers?

It's a pretty bad spot to do such kinds of things: it disables any
mixed mshot with non-mshot batching, and nesting flushing under an
opcode handler is pretty nasty; it should rather stay top level as it
is. From what I hear it's a provided buffer specific problem and should
be fixed there.

-- 
Pavel Begunkov
* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
From: Jens Axboe
Date: 2025-05-07 15:26 UTC
To: Pavel Begunkov, io-uring

On 5/7/25 9:11 AM, Pavel Begunkov wrote:
> Ok, and there has never been any ordering guarantees between them.
> Is there any report describing the problem? Why it's ok if
> io_req_post_cqe() produced CQEs of two multishot requests get
> reordered, but it's not when one of them is finishing a request?
> What's the problem with provided buffers?

There had better be ordering between posting of the CQEs - I'm not
talking about issue ordering in general, but if you have R1 (request 1)
and R2 each consuming from the same buffer ring, then completion
posting of those absolutely needs to be ordered. If not, you can have:

R1 peeks X buffers, BID Y, BGID Z. It doesn't use io_req_post_cqe()
because it's stopping, and ends up posting via io_req_task_complete ->
io_req_complete_defer -> add to deferred list.

Other task_work in that run has R2, grabbing buffers from BGID Z. It
doesn't terminate, and posts CQE2 via io_req_post_cqe().

tw run done, deferred completions flushed, R1 posts CQE1.

Then we have CQE2 ahead of CQE1 in the CQ ring.

> It's a pretty bad spot to do such kinds of things, it disables any
> mixed mshot with non-mshot batching, and nesting flushing under
> an opcode handler is pretty nasty, it should rather stay top
> level as it is. From what I hear it's a provided buffer specific
> problem and should be fixed there.

I agree it's not the prettiest, but I would prefer if we do it this way
just in terms of stable backporting. This should be a relatively rare
occurrence, so I'm not too concerned about missed batching here.

-- 
Jens Axboe
* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
From: Pavel Begunkov
Date: 2025-05-07 15:46 UTC
To: Jens Axboe, io-uring

On 5/7/25 16:26, Jens Axboe wrote:
> There better be ordering between posting of the CQEs - I'm not talking
> about issue ordering in general, but if you have R1 (request 1) and R2
> each consuming from the same buffer ring, then completion posting of
> those absolutely need to be ordered. If not, you can have:
>
> R1 peek X buffers, BID Y, BGID Z. Doesn't use io_req_post_cqe() because
> it's stopping, end up posting via io_req_task_complete ->
> io_req_complete_defer -> add to deferred list.
>
> Other task_work in that run has R2, grabbing buffers from BGID Z,
> doesn't terminate, posts CQE2 via io_req_post_cqe().
>
> tw run done, deferred completions flushed, R1 posts CQE1.
>
> Then we have CQE2 ahead of CQE1 in the CQ ring.

Which is why provided buffers from the beginning were returning buffer
ids (i.e. bid, the id within the group). Is it incremental consumption,
or some newer feature, that can't tell buffer chunks apart other than
by ordering?

> I agree it's not the prettiest, but I would prefer if we do it this way
> just in terms of stable backporting. This should be a relatively rare

Sure. Are you looking into fixing it in a different way for upstream?
Because it's not the right level of abstraction.

> occurence, so I'm not too concerned about missed batching here.

Any mixed recv / send app, especially with some send zc, will
definitely be hitting it if the amount of traffic is large enough.

-- 
Pavel Begunkov
* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
From: Jens Axboe
Date: 2025-05-07 15:49 UTC
To: Pavel Begunkov, io-uring

On 5/7/25 9:46 AM, Pavel Begunkov wrote:
> Which is why provided buffers from the beginning were returning
> buffer ids (i.e. bid, id within the group). Is it incremental
> consumption or some newer feature not being able to differentiate
> buffer chunks apart from ordering?

Both incrementally consumed and bundles are susceptible to this
reordering.

> Sure. Are you looking into fixing it in a different way for
> upstream? Because it's not the right level of abstraction.

Yeah I think so - basically the common use case for stopping multishot
doesn't really work for this, as there are two methods of posting the
CQE. One is using io_req_post_cqe(), the other is a final deferred
completion. The mix and match of those two is what's causing this,
obviously.

I'd do this patch upstream, and then unify how the various users of
io_req_post_cqe() terminate, and ensure that only one channel is used
for posting CQEs for that case. When we're happy with that, we can
always flag that for stable fixing this one too, at least for 6.x where
x == newer.

Cooking that up right now on top...

-- 
Jens Axboe
* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot
From: Pavel Begunkov
Date: 2025-05-07 16:18 UTC
To: Jens Axboe, io-uring

On 5/7/25 16:49, Jens Axboe wrote:
> Yeah I think so - basically the common use case for stopping multishot
> doesn't really work for this, as there are two methods of posting the
> CQE. One is using io_req_post_cqe(), the other is a final deferred
> completion. The mix and match of those two is what's causing this,
> obviously.

That's the reason why all the mshot stuff I've been adding solely uses
io_req_post_cqe() for IO completions, and the final CQE only carries an
error / 0. Maybe it's not too late to switch read/recv to that model.

Or use io_req_post_cqe() for the final completion and silence the CQE
coming from flush_completions() with REQ_F_CQE_SKIP?

> I'd do this patch upstream, and then unify how the various users of
> io_req_post_cqe() terminate, and ensure that only one channel is used
> for posting CQEs for that case. When we're happy with that, we can
> always flag that for stable fixing this one too, at least for 6.x where
> x == newer.
>
> Cooking that up right now on top...

-- 
Pavel Begunkov
* Re: [PATCH] io_uring: ensure deferred completions are flushed for multishot 2025-05-07 16:18 ` Pavel Begunkov @ 2025-05-07 17:51 ` Jens Axboe 0 siblings, 0 replies; 9+ messages in thread From: Jens Axboe @ 2025-05-07 17:51 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/7/25 10:18 AM, Pavel Begunkov wrote: > On 5/7/25 16:49, Jens Axboe wrote: >> On 5/7/25 9:46 AM, Pavel Begunkov wrote: >>> On 5/7/25 16:26, Jens Axboe wrote: >>>> On 5/7/25 9:11 AM, Pavel Begunkov wrote: >>>>> On 5/7/25 15:58, Jens Axboe wrote: >>>>>> On 5/7/25 8:36 AM, Pavel Begunkov wrote: >>>>>>> On 5/7/25 14:56, Jens Axboe wrote: >>>>>>>> Multishot normally uses io_req_post_cqe() to post completions, but when >>>>>>>> stopping it, it may finish up with a deferred completion. This is fine, >>>>>>>> except if another multishot event triggers before the deferred completions >>>>>>>> get flushed. If this occurs, then CQEs may get reordered in the CQ ring, >>>>>>>> as new multishot completions get posted before the deferred ones are >>>>>>>> flushed. This can cause confusion on the application side, if strict >>>>>>>> ordering is required for the use case. >>>>>>>> >>>>>>>> When multishot posting via io_req_post_cqe(), flush any pending deferred >>>>>>>> completions first, if any. 
>>>>>>>> >>>>>>>> Cc: stable@vger.kernel.org # 6.1+ >>>>>>>> Reported-by: Norman Maurer <norman_maurer@apple.com> >>>>>>>> Reported-by: Christian Mazakas <christian.mazakas@gmail.com> >>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk> >>>>>>>> >>>>>>>> --- >>>>>>>> >>>>>>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c >>>>>>>> index 769814d71153..541e65a1eebf 100644 >>>>>>>> --- a/io_uring/io_uring.c >>>>>>>> +++ b/io_uring/io_uring.c >>>>>>>> @@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags) >>>>>>>> struct io_ring_ctx *ctx = req->ctx; >>>>>>>> bool posted; >>>>>>>> + /* >>>>>>>> + * If multishot has already posted deferred completions, ensure that >>>>>>>> + * those are flushed first before posting this one. If not, CQEs >>>>>>>> + * could get reordered. >>>>>>>> + */ >>>>>>>> + if (!wq_list_empty(&ctx->submit_state.compl_reqs)) >>>>>>>> + __io_submit_flush_completions(ctx); >>>>>>> >>>>>>> A request is already dead if it's in compl_reqs, there should be no >>>>>>> way io_req_post_cqe() is called with it. Is it reordering of CQEs >>>>>>> belonging to the same request? And what do you mean by "deferred" >>>>>>> completions? >>>>>> >>>>>> It's not the same req, it's different requests using the same >>>>>> provided buffer ring where it can be problematic. >>>>> >>>>> Ok, and there has never been any ordering guarantees between them. >>>>> Is there any report describing the problem? Why it's ok if >>>>> io_req_post_cqe() produced CQEs of two multishot requests get >>>>> reordered, but it's not when one of them is finishing a request? >>>>> What's the problem with provided buffers? >>>> >>>> There better be ordering between posting of the CQEs - I'm not talking >>>> about issue ordering in general, but if you have R1 (request 1) and R2 >>>> each consuming from the same buffer ring, then completion posting of >>>> those absolutely need to be ordered. If not, you can have: >>>> >>>> R1 peek X buffers, BID Y, BGID Z. 
Doesn't use io_req_post_cqe() because >>>> it's stopping, end up posting via io_req_task_complete -> >>>> io_req_complete_defer -> add to deferred list. >>>> >>>> Other task_work in that run has R2, grabbing buffers from BGID Z, >>>> doesn't terminate, posts CQE2 via io_req_post_cqe(). >>>> >>>> tw run done, deferred completions flushed, R1 posts CQE1. >>>> >>>> Then we have CQE2 ahead of CQE1 in the CQ ring. >>> >>> Which is why provided buffers from the beginning were returning >>> buffer ids (i.e. bid, id within the group). Is it incremental >>> consumption or some newer feature not being able to differentiate >>> buffer chunks apart from ordering? >> >> Both incrementally consumed and bundles are susceptible to this >> reordering. >> >>>>> It's a pretty bad spot to do such kinds of things, it disables any >>>>> mixed mshot with non-mshot batching, and nesting flushing under >>>>> an opcode handler is pretty nasty, it should rather stay top >>>>> level as it is. From what I hear it's a provided buffer specific >>>>> problem and should be fixed there. >>>> >>>> I agree it's not the prettiest, but I would prefer if we do it this way >>>> just in terms of stable backporting. This should be a relatively rare >>> >>> Sure. Are you looking into fixing it in a different way for >>> upstream? Because it's not the right level of abstraction. >> >> Yeah I think so - basically the common use case for stopping multishot >> doesn't really work for this, as there are two methods of posting the >> CQE. One is using io_req_post_cqe(), the other is a final deferred >> completion. The mix and match of those two is what's causing this, >> obviously. > > That's the reason why all mshot stuff I've been adding is solely > using io_req_post_cqe() for IO completions, and the final CQE only > carries error / 0. Maybe it's not too late to switch read/recv to > that model. 
> Or use io_req_post_cqe() for the final completion and silence
> the CQE coming from flush_completions() with REQ_F_SKIP_CQE?

Yeah, we can do something like that. Something like the below should
work. It might be cleaner to simply add a new helper for this in
io_uring.c and leave the overflow handling private and in there.

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 541e65a1eebf..796ecd4bea5c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -700,8 +700,8 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task)
 	}
 }
 
-static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
-				     s32 res, u32 cflags, u64 extra1, u64 extra2)
+bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, s32 res,
+			      u32 cflags, u64 extra1, u64 extra2)
 {
 	struct io_overflow_cqe *ocqe;
 	size_t ocq_size = sizeof(struct io_overflow_cqe);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index e4050b2d0821..dd0a89674095 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -82,6 +82,8 @@ void io_req_defer_failed(struct io_kiocb *req, s32 res);
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
 void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
 bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags);
+bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, s32 res,
+			      u32 cflags, u64 extra1, u64 extra2);
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
 
 struct file *io_file_get_normal(struct io_kiocb *req, int fd);
diff --git a/io_uring/poll.c b/io_uring/poll.c
index 8eb744eb9f4c..c002508015a4 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -312,6 +312,24 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 	return IOU_POLL_NO_ACTION;
 }
 
+static void io_poll_req_complete(struct io_kiocb *req, io_tw_token_t tw)
+{
+	bool filled;
+
+	filled = io_req_post_cqe(req, req->cqe.res, req->cqe.flags);
+	if (unlikely(!filled)) {
+		struct io_ring_ctx *ctx = req->ctx;
+
+		spin_lock(&ctx->completion_lock);
+		filled = io_cqring_event_overflow(ctx, req->cqe.user_data,
+						  req->cqe.res, req->cqe.flags, 0, 0);
+		spin_unlock(&ctx->completion_lock);
+	}
+	if (filled)
+		req->flags |= REQ_F_CQE_SKIP;
+	io_req_task_complete(req, tw);
+}
+
 void io_poll_task_func(struct io_kiocb *req, io_tw_token_t tw)
 {
 	int ret;
@@ -349,7 +367,7 @@ void io_poll_task_func(struct io_kiocb *req, io_tw_token_t tw)
 		io_tw_lock(req->ctx, tw);
 
 	if (ret == IOU_POLL_REMOVE_POLL_USE_RES)
-		io_req_task_complete(req, tw);
+		io_poll_req_complete(req, tw);
 	else if (ret == IOU_POLL_DONE || ret == IOU_POLL_REISSUE)
 		io_req_task_submit(req, tw);
 	else

-- 
Jens Axboe