* [PATCH 0/4] overflow completion enhancements @ 2025-05-14 8:07 Pavel Begunkov 2025-05-14 8:07 ` [PATCH 1/4] io_uring: open code io_req_cqe_overflow() Pavel Begunkov ` (3 more replies) 0 siblings, 4 replies; 12+ messages in thread From: Pavel Begunkov @ 2025-05-14 8:07 UTC (permalink / raw) To: io-uring; +Cc: asml.silence Slightly improve overflow CQE handling. The first two patches are just preps. For DEFER_TASKRUN, patch 3 minimises the critical section, moves allocations out of it, and eliminates GFP_ATOMIC. And as a follow-up to a recent review, Patch 4 adds a lockdep assertion against overflow posting abuse. Pavel Begunkov (4): io_uring: open code io_req_cqe_overflow() io_uring: move locking inside overflow posting io_uring: alloc overflow entry before locking io_uring: add lockdep warning for overflow posting io_uring/io_uring.c | 90 ++++++++++++++++++++++++++------------------- 1 file changed, 53 insertions(+), 37 deletions(-) -- 2.49.0 ^ permalink raw reply [flat|nested] 12+ messages in thread
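For orientation, the net effect of the series on the DEFER_TASKRUN overflow path can be condensed roughly as follows (a simplified sketch distilled from patches 2-3 below; the function name is made up, and CQE32 sizing plus the allocation-failure bookkeeping are omitted):

/* Rough shape of the unlocked overflow path after this series (sketch only). */
static bool overflow_post_sketch(struct io_ring_ctx *ctx, u64 user_data,
				 s32 res, u32 cflags)
{
	struct io_overflow_cqe *ocqe;
	bool queued;

	/* Allocated before taking the spinlock, so GFP_KERNEL_ACCOUNT is allowed. */
	ocqe = kmalloc(sizeof(*ocqe), GFP_KERNEL_ACCOUNT);
	if (ocqe) {
		ocqe->cqe.user_data = user_data;
		ocqe->cqe.res = res;
		ocqe->cqe.flags = cflags;
	}

	/* The critical section now only queues the entry (or records the drop). */
	spin_lock(&ctx->completion_lock);
	queued = __io_cqring_event_overflow(ctx, ocqe);
	spin_unlock(&ctx->completion_lock);
	return queued;
}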
* [PATCH 1/4] io_uring: open code io_req_cqe_overflow() 2025-05-14 8:07 [PATCH 0/4] overflow completion enhancements Pavel Begunkov @ 2025-05-14 8:07 ` Pavel Begunkov 2025-05-14 8:07 ` [PATCH 2/4] io_uring: move locking inside overflow posting Pavel Begunkov ` (2 subsequent siblings) 3 siblings, 0 replies; 12+ messages in thread From: Pavel Begunkov @ 2025-05-14 8:07 UTC (permalink / raw) To: io-uring; +Cc: asml.silence A preparation patch, just open code io_req_cqe_overflow(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> --- io_uring/io_uring.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 9a9b8d35349b..068e140b6bd8 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -760,14 +760,6 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, return true; } -static void io_req_cqe_overflow(struct io_kiocb *req) -{ - io_cqring_event_overflow(req->ctx, req->cqe.user_data, - req->cqe.res, req->cqe.flags, - req->big_cqe.extra1, req->big_cqe.extra2); - memset(&req->big_cqe, 0, sizeof(req->big_cqe)); -} - /* * writes to the cq entry need to come after reading head; the * control dependency is enough as we're using WRITE_ONCE to @@ -1442,11 +1434,19 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx) unlikely(!io_fill_cqe_req(ctx, req))) { if (ctx->lockless_cq) { spin_lock(&ctx->completion_lock); - io_req_cqe_overflow(req); + io_cqring_event_overflow(req->ctx, req->cqe.user_data, + req->cqe.res, req->cqe.flags, + req->big_cqe.extra1, + req->big_cqe.extra2); spin_unlock(&ctx->completion_lock); } else { - io_req_cqe_overflow(req); + io_cqring_event_overflow(req->ctx, req->cqe.user_data, + req->cqe.res, req->cqe.flags, + req->big_cqe.extra1, + req->big_cqe.extra2); } + + memset(&req->big_cqe, 0, sizeof(req->big_cqe)); } } __io_cq_unlock_post(ctx); -- 2.49.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/4] io_uring: move locking inside overflow posting 2025-05-14 8:07 [PATCH 0/4] overflow completion enhancements Pavel Begunkov 2025-05-14 8:07 ` [PATCH 1/4] io_uring: open code io_req_cqe_overflow() Pavel Begunkov @ 2025-05-14 8:07 ` Pavel Begunkov 2025-05-14 16:42 ` Jens Axboe 2025-05-14 8:07 ` [PATCH 3/4] io_uring: alloc overflow entry before locking Pavel Begunkov 2025-05-14 8:07 ` [PATCH 4/4] io_uring: add lockdep warning for overflow posting Pavel Begunkov 3 siblings, 1 reply; 12+ messages in thread From: Pavel Begunkov @ 2025-05-14 8:07 UTC (permalink / raw) To: io-uring; +Cc: asml.silence A preparation patch moving locking protecting the overflow list into io_cqring_event_overflow(). The locking needs to be conditional because callers might already hold the lock. It's not the prettiest option, but it's not much different from the current state. Hopefully, one day we'll get rid of this nasty locking pattern. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> --- io_uring/io_uring.c | 53 +++++++++++++++++++++++++++------------------ 1 file changed, 32 insertions(+), 21 deletions(-) diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 068e140b6bd8..5b253e2b6c49 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -718,8 +718,9 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task) } } -static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, - s32 res, u32 cflags, u64 extra1, u64 extra2) +static bool __io_cqring_event_overflow(struct io_ring_ctx *ctx, + u64 user_data, s32 res, u32 cflags, + u64 extra1, u64 extra2) { struct io_overflow_cqe *ocqe; size_t ocq_size = sizeof(struct io_overflow_cqe); @@ -760,6 +761,24 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, return true; } +static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, bool locked, + u64 user_data, s32 res, u32 cflags, + u64 extra1, u64 extra2) +{ + bool queued; + + if (locked) { + queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, + extra1, extra2); + } else { + spin_lock(&ctx->completion_lock); + queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, + extra1, extra2); + spin_unlock(&ctx->completion_lock); + } + return queued; +} + /* * writes to the cq entry need to come after reading head; the * control dependency is enough as we're using WRITE_ONCE to @@ -825,7 +844,8 @@ bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags io_cq_lock(ctx); filled = io_fill_cqe_aux(ctx, user_data, res, cflags); if (!filled) - filled = io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0); + filled = io_cqring_event_overflow(ctx, true, + user_data, res, cflags, 0, 0); io_cq_unlock_post(ctx); return filled; } @@ -839,11 +859,9 @@ void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags) lockdep_assert_held(&ctx->uring_lock); lockdep_assert(ctx->lockless_cq); - if (!io_fill_cqe_aux(ctx, user_data, res, cflags)) { - spin_lock(&ctx->completion_lock); - io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0); - spin_unlock(&ctx->completion_lock); - } + if (!io_fill_cqe_aux(ctx, user_data, res, cflags)) + io_cqring_event_overflow(ctx, false, user_data, res, cflags, 0, 0); + ctx->submit_state.cq_flush = true; } @@ -1432,20 +1450,13 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx) */ if (!(req->flags & (REQ_F_CQE_SKIP | REQ_F_REISSUE)) && unlikely(!io_fill_cqe_req(ctx, req))) { - if (ctx->lockless_cq) { - spin_lock(&ctx->completion_lock); - 
io_cqring_event_overflow(req->ctx, req->cqe.user_data, - req->cqe.res, req->cqe.flags, - req->big_cqe.extra1, - req->big_cqe.extra2); - spin_unlock(&ctx->completion_lock); - } else { - io_cqring_event_overflow(req->ctx, req->cqe.user_data, - req->cqe.res, req->cqe.flags, - req->big_cqe.extra1, - req->big_cqe.extra2); - } + bool locked = !ctx->lockless_cq; + io_cqring_event_overflow(req->ctx, locked, + req->cqe.user_data, + req->cqe.res, req->cqe.flags, + req->big_cqe.extra1, + req->big_cqe.extra2); memset(&req->big_cqe, 0, sizeof(req->big_cqe)); } } -- 2.49.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 2/4] io_uring: move locking inside overflow posting 2025-05-14 8:07 ` [PATCH 2/4] io_uring: move locking inside overflow posting Pavel Begunkov @ 2025-05-14 16:42 ` Jens Axboe 2025-05-14 17:18 ` Pavel Begunkov 0 siblings, 1 reply; 12+ messages in thread From: Jens Axboe @ 2025-05-14 16:42 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/14/25 2:07 AM, Pavel Begunkov wrote: > A preparation patch moving locking protecting the overflow list into > io_cqring_event_overflow(). The locking needs to be conditional because > callers might already hold the lock. It's not the prettiest option, but > it's not much different from the current state. Hopefully, one day we'll > get rid of this nasty locking pattern. > > Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> > --- > io_uring/io_uring.c | 53 +++++++++++++++++++++++++++------------------ > 1 file changed, 32 insertions(+), 21 deletions(-) > > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c > index 068e140b6bd8..5b253e2b6c49 100644 > --- a/io_uring/io_uring.c > +++ b/io_uring/io_uring.c > @@ -718,8 +718,9 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task) > } > } > > -static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, > - s32 res, u32 cflags, u64 extra1, u64 extra2) > +static bool __io_cqring_event_overflow(struct io_ring_ctx *ctx, > + u64 user_data, s32 res, u32 cflags, > + u64 extra1, u64 extra2) > { > struct io_overflow_cqe *ocqe; > size_t ocq_size = sizeof(struct io_overflow_cqe); > @@ -760,6 +761,24 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data, > return true; > } > > +static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, bool locked, > + u64 user_data, s32 res, u32 cflags, > + u64 extra1, u64 extra2) > +{ > + bool queued; > + > + if (locked) { > + queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, > + extra1, extra2); > + } else { > + spin_lock(&ctx->completion_lock); > + queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, > + extra1, extra2); > + spin_unlock(&ctx->completion_lock); > + } > + return queued; > +} Really not a fan of passing in locking state and having a split helper like that. It's also pretty unwieldy with 7 arguments. Overall, why do we care about atomic vs non-atomic allocations for overflows? If you're hitting overflow, you've messed up... And if it's about cost, gfp atomic alloc will be predictable, vs GFP_KERNEL which can be a lot more involved. -- Jens Axboe ^ permalink raw reply [flat|nested] 12+ messages in thread
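For readers following the allocation argument, the distinction being debated boils down to roughly the following (an illustrative sketch, not code from the tree; the helper name is invented and CQE32 sizing is ignored):

/* Sketch: the lock placement dictates which allocation flags are legal. */
static struct io_overflow_cqe *ocqe_alloc_sketch(bool under_spinlock)
{
	/*
	 * Under completion_lock we may not sleep, so only GFP_ATOMIC is
	 * legal; it draws on limited reserves and is more likely to fail
	 * under memory pressure. Outside the lock, GFP_KERNEL may reclaim
	 * (hence more jitter) but is far more reliable -- the trade-off
	 * argued over in the rest of this thread.
	 */
	gfp_t gfp = under_spinlock ? (GFP_ATOMIC | __GFP_ACCOUNT)
				   : GFP_KERNEL_ACCOUNT;

	return kmalloc(sizeof(struct io_overflow_cqe), gfp);
}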
* Re: [PATCH 2/4] io_uring: move locking inside overflow posting 2025-05-14 16:42 ` Jens Axboe @ 2025-05-14 17:18 ` Pavel Begunkov 2025-05-14 19:25 ` Jens Axboe 0 siblings, 1 reply; 12+ messages in thread From: Pavel Begunkov @ 2025-05-14 17:18 UTC (permalink / raw) To: Jens Axboe, io-uring On 5/14/25 17:42, Jens Axboe wrote: > On 5/14/25 2:07 AM, Pavel Begunkov wrote: ...>> +static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, bool locked, >> + u64 user_data, s32 res, u32 cflags, >> + u64 extra1, u64 extra2) >> +{ >> + bool queued; >> + >> + if (locked) { >> + queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, >> + extra1, extra2); >> + } else { >> + spin_lock(&ctx->completion_lock); >> + queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, >> + extra1, extra2); >> + spin_unlock(&ctx->completion_lock); >> + } >> + return queued; >> +} > > Really not a fan of passing in locking state and having a split helper > like that. I'd agree in general, but it's the same thing that's already in __io_submit_flush_completions(), just with a named argument instead of backflipping on ctx->lockless_cq. > It's also pretty unwieldy with 7 arguments. It's 6 vs 7, not much difference, and the real problem here is passing all cqe parts as separate arguments, which this series doesn't touch. > Overall, why do we care about atomic vs non-atomic allocations for > overflows? If you're hitting overflow, you've messed up... And if it's Not really, it's not far fetched to overflow with multishots unless you're overallocating CQ quite a bit, especially if traffic is not so homogeneous and has spikes. Expensive and should be minimised b/c of that, but it's still reasonably possible in a well structured app. > about cost, gfp atomic alloc will be predictable, vs GFP_KERNEL which > can be a lot more involved. Not the cost but reliability. If overflowing fails, the userspace can't reasonably do anything to recover, so either terminate or leak. And it's taking the atomic reserves from those who actually need it like irq. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/4] io_uring: move locking inside overflow posting 2025-05-14 17:18 ` Pavel Begunkov @ 2025-05-14 19:25 ` Jens Axboe 2025-05-14 20:00 ` Pavel Begunkov 0 siblings, 1 reply; 12+ messages in thread From: Jens Axboe @ 2025-05-14 19:25 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/14/25 11:18 AM, Pavel Begunkov wrote: > On 5/14/25 17:42, Jens Axboe wrote: >> On 5/14/25 2:07 AM, Pavel Begunkov wrote: > ...>> +static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, bool locked, >>> + u64 user_data, s32 res, u32 cflags, >>> + u64 extra1, u64 extra2) >>> +{ >>> + bool queued; >>> + >>> + if (locked) { >>> + queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, >>> + extra1, extra2); >>> + } else { >>> + spin_lock(&ctx->completion_lock); >>> + queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, >>> + extra1, extra2); >>> + spin_unlock(&ctx->completion_lock); >>> + } >>> + return queued; >>> +} >> >> Really not a fan of passing in locking state and having a split helper >> like that. > > I'd agree in general, but it's the same thing that's already in > __io_submit_flush_completions(), just with a named argument > instead of backflipping on ctx->lockless_cq. And I honestly _greatly_ prefer that to passing in some random bool where you have to go find the function to even see what it does... >> It's also pretty unwieldy with 7 arguments. > > It's 6 vs 7, not much difference, and the real problem here is > passing all cqe parts as separate arguments, which this series > doesn't touch. It's 6 you're passing on, it's 7 for the overflow helper. >> Overall, why do we care about atomic vs non-atomic allocations for >> overflows? If you're hitting overflow, you've messed up... And if it's > > Not really, it's not far fetched to overflow with multishots unless > you're overallocating CQ quite a bit, especially if traffic is not so > homogeneous and has spikes. Expensive and should be minimised b/c of > that, but it's still reasonably possible in a well structured app. It's still possible, but it should be a fairly rare occurrence after all. >> about cost, gfp atomic alloc will be predictable, vs GFP_KERNEL which >> can be a lot more involved. > > Not the cost but reliability. If overflowing fails, the userspace can't > reasonably do anything to recover, so either terminate or leak. And > it's taking the atomic reserves from those who actually need it > like irq. I don't really disagree on that part, GFP_ATOMIC is only really used because of the locking, as this series deals with. Just saying that it'll make overflow jitter more pronounced, potentially, though not something I'm worried about. I'd much rather handle that sanely instead of passing in locking state. At least a gfp_t argument is immediately apparent what it does. -- Jens Axboe ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/4] io_uring: move locking inside overflow posting 2025-05-14 19:25 ` Jens Axboe @ 2025-05-14 20:00 ` Pavel Begunkov 2025-05-14 20:05 ` Jens Axboe 0 siblings, 1 reply; 12+ messages in thread From: Pavel Begunkov @ 2025-05-14 20:00 UTC (permalink / raw) To: Jens Axboe, io-uring On 5/14/25 20:25, Jens Axboe wrote: > On 5/14/25 11:18 AM, Pavel Begunkov wrote: >> On 5/14/25 17:42, Jens Axboe wrote: >>> On 5/14/25 2:07 AM, Pavel Begunkov wrote: >> ...>> +static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, bool locked, >>>> + u64 user_data, s32 res, u32 cflags, >>>> + u64 extra1, u64 extra2) >>>> +{ >>>> + bool queued; >>>> + >>>> + if (locked) { >>>> + queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, >>>> + extra1, extra2); >>>> + } else { >>>> + spin_lock(&ctx->completion_lock); >>>> + queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, >>>> + extra1, extra2); >>>> + spin_unlock(&ctx->completion_lock); >>>> + } >>>> + return queued; >>>> +} >>> >>> Really not a fan of passing in locking state and having a split helper >>> like that. >> >> I'd agree in general, but it's the same thing that's already in >> __io_submit_flush_completions(), just with a named argument >> instead of backflipping on ctx->lockless_cq. > > And I honestly _greatly_ prefer that to passing in some random bool > where you have to go find the function to even see what it does... I see, it's much worse than having it somewhat abstracted, trying to be more reliable and self checking to prevent bugs, I give up on trying to do anything with it. >>> It's also pretty unwieldy with 7 arguments. >> >> It's 6 vs 7, not much difference, and the real problem here is >> passing all cqe parts as separate arguments, which this series >> doesn't touch. > > It's 6 you're passing on, it's 7 for the overflow helper. I don't get what you're saying. >>> Overall, why do we care about atomic vs non-atomic allocations for >>> overflows? If you're hitting overflow, you've messed up... And if it's >> >> Not really, it's not far fetched to overflow with multishots unless >> you're overallocating CQ quite a bit, especially if traffic is not so >> homogeneous and has spikes. Expensive and should be minimised b/c of >> that, but it's still reasonably possible in a well structured app. > > It's still possible, but it should be a fairly rare occurence after all. That's what I'm saying. And handling that "rare" well is the difference between having reliable software and terminally screwing userspace. >>> about cost, gfp atomic alloc will be predictable, vs GFP_KERNEL which >>> can be a lot more involved. >> >> Not the cost but reliability. If overflowing fails, the userspace can't >> reasonably do anything to recover, so either terminate or leak. And >> it's taking the atomic reserves from those who actually need it >> like irq. > > I don't really disagree on that part, GFP_ATOMIC is only really used > because of the locking, as this series deals with. Just saying that > it'll make overflow jitter more pronounced, potentially, though not It's like saying that the allocation makes heavier so let's just drop the CQE on the floor. It should do reclaim only when the situation with memory is already bad enough and you're risking failing atomic allocations (even if not exactly that). > something I'm worried about. I'd much rather handle that sanely instead > of passing in locking state. At least a gfp_t argument is immediately > apparent what it does. 
-- Pavel Begunkov ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/4] io_uring: move locking inside overflow posting 2025-05-14 20:00 ` Pavel Begunkov @ 2025-05-14 20:05 ` Jens Axboe 2025-05-14 21:52 ` Jens Axboe 0 siblings, 1 reply; 12+ messages in thread From: Jens Axboe @ 2025-05-14 20:05 UTC (permalink / raw) To: Pavel Begunkov, io-uring On 5/14/25 2:00 PM, Pavel Begunkov wrote: > On 5/14/25 20:25, Jens Axboe wrote: >> On 5/14/25 11:18 AM, Pavel Begunkov wrote: >>> On 5/14/25 17:42, Jens Axboe wrote: >>>> On 5/14/25 2:07 AM, Pavel Begunkov wrote: >>> ...>> +static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, bool locked, >>>>> + u64 user_data, s32 res, u32 cflags, >>>>> + u64 extra1, u64 extra2) >>>>> +{ >>>>> + bool queued; >>>>> + >>>>> + if (locked) { >>>>> + queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, >>>>> + extra1, extra2); >>>>> + } else { >>>>> + spin_lock(&ctx->completion_lock); >>>>> + queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, >>>>> + extra1, extra2); >>>>> + spin_unlock(&ctx->completion_lock); >>>>> + } >>>>> + return queued; >>>>> +} >>>> >>>> Really not a fan of passing in locking state and having a split helper >>>> like that. >>> >>> I'd agree in general, but it's the same thing that's already in >>> __io_submit_flush_completions(), just with a named argument >>> instead of backflipping on ctx->lockless_cq. >> >> And I honestly _greatly_ prefer that to passing in some random bool >> where you have to go find the function to even see what it does... > > I see, it's much worse than having it somewhat abstracted, > trying to be more reliable and self checking to prevent bugs, > I give up on trying to do anything with it. I'm just saying that I don't like 'bool locked' kind of arguments. We do have some of them, but I always find it confusing. I do think the __io_submit_flush_completions() approach is more robust. Maybe have an alloc helper for the cqe and pass that in instead? Then we could get rid of adding 'locked' AND 5 other arguments? >>>> It's also pretty unwieldy with 7 arguments. >>> >>> It's 6 vs 7, not much difference, and the real problem here is >>> passing all cqe parts as separate arguments, which this series >>> doesn't touch. >> >> It's 6 you're passing on, it's 7 for the overflow helper. > > I don't get what you're saying. The helper you add, io_cqring_event_overflow(), takes 7 arguments. And yes most of that is cqe arguments, the bool locked means we're now at 7 rather than 6 arguments. Not the end of the world, but both of them are pretty horrid in that sense. >>>> Overall, why do we care about atomic vs non-atomic allocations for >>>> overflows? If you're hitting overflow, you've messed up... And if it's >>> >>> Not really, it's not far fetched to overflow with multishots unless >>> you're overallocating CQ quite a bit, especially if traffic is not so >>> homogeneous and has spikes. Expensive and should be minimised b/c of >>> that, but it's still reasonably possible in a well structured app. >> >> It's still possible, but it should be a fairly rare occurrence after all. > > That's what I'm saying. And handling that "rare" well is the difference > between having reliable software and terminally screwing userspace. As I wrote just a bit below, I don't disagree on getting rid of GFP_ATOMIC at all, just stating that it'll be a bit more jitter. But that's still better than a potentially less reliable allocation for overflow. And the jitter part should hopefully be rare, as overflows should not be a common occurrence.
>>>> about cost, gfp atomic alloc will be predictable, vs GFP_KERNEL which >>>> can be a lot more involved. >>> >>> Not the cost but reliability. If overflowing fails, the userspace can't >>> reasonably do anything to recover, so either terminate or leak. And >>> it's taking the atomic reserves from those who actually need it >>> like irq. >> >> I don't really disagree on that part, GFP_ATOMIC is only really used >> because of the locking, as this series deals with. Just saying that >> it'll make overflow jitter more pronounced, potentially, though not > > It's like saying that the allocation makes heavier so let's just > drop the CQE on the floor. It should do reclaim only when the > situation with memory is already bad enough and you're risking > failing atomic allocations (even if not exactly that). Right, like I said I agree on that part... -- Jens Axboe ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/4] io_uring: move locking inside overflow posting 2025-05-14 20:05 ` Jens Axboe @ 2025-05-14 21:52 ` Jens Axboe 2025-05-15 11:04 ` Pavel Begunkov 0 siblings, 1 reply; 12+ messages in thread From: Jens Axboe @ 2025-05-14 21:52 UTC (permalink / raw) To: Pavel Begunkov, io-uring Since sometimes it's easier to talk in code that in English, something like the below (just applied on top, and utterly untested) I think is much cleaner. Didn't spend a lot of time on it, I'm sure it could get condensed down some more with a helper or something. But it keeps the locking in the caller, while still allowing GFP_KERNEL alloc for lockless_cq. Somewhat unrelated, but also fills in an io_cqe and passes in for the non-cqe32 parts, just for readability's sake. diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index da1075b66a87..9b6d09633a29 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -744,44 +744,12 @@ static bool __io_cqring_event_overflow(struct io_ring_ctx *ctx, return true; } -static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, bool locked, - u64 user_data, s32 res, u32 cflags, - u64 extra1, u64 extra2) +static void io_cqring_event_overflow(struct io_ring_ctx *ctx, + struct io_overflow_cqe *ocqe) { - struct io_overflow_cqe *ocqe; - size_t ocq_size = sizeof(struct io_overflow_cqe); - bool is_cqe32 = (ctx->flags & IORING_SETUP_CQE32); - gfp_t gfp = GFP_KERNEL_ACCOUNT; - bool queued; - - io_lockdep_assert_cq_locked(ctx); - - if (is_cqe32) - ocq_size += sizeof(struct io_uring_cqe); - if (locked) - gfp = GFP_ATOMIC | __GFP_ACCOUNT; - - ocqe = kmalloc(ocq_size, gfp); - trace_io_uring_cqe_overflow(ctx, user_data, res, cflags, ocqe); - - if (ocqe) { - ocqe->cqe.user_data = user_data; - ocqe->cqe.res = res; - ocqe->cqe.flags = cflags; - if (is_cqe32) { - ocqe->cqe.big_cqe[0] = extra1; - ocqe->cqe.big_cqe[1] = extra2; - } - } - - if (locked) { - queued = __io_cqring_event_overflow(ctx, ocqe); - } else { - spin_lock(&ctx->completion_lock); - queued = __io_cqring_event_overflow(ctx, ocqe); - spin_unlock(&ctx->completion_lock); - } - return queued; + spin_lock(&ctx->completion_lock); + __io_cqring_event_overflow(ctx, ocqe); + spin_unlock(&ctx->completion_lock); } /* @@ -842,15 +810,49 @@ static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, return false; } +static struct io_overflow_cqe *io_get_ocqe(struct io_ring_ctx *ctx, + struct io_cqe *cqe, u64 extra1, + u64 extra2, gfp_t gfp) +{ + size_t ocq_size = sizeof(struct io_overflow_cqe); + bool is_cqe32 = ctx->flags & IORING_SETUP_CQE32; + struct io_overflow_cqe *ocqe; + + if (is_cqe32) + ocq_size += sizeof(struct io_uring_cqe); + + ocqe = kmalloc(ocq_size, gfp); + trace_io_uring_cqe_overflow(ctx, cqe->user_data, cqe->res, cqe->flags, ocqe); + if (ocqe) { + ocqe->cqe.user_data = cqe->user_data; + ocqe->cqe.res = cqe->res; + ocqe->cqe.flags = cqe->flags; + if (is_cqe32) { + ocqe->cqe.big_cqe[0] = extra1; + ocqe->cqe.big_cqe[1] = extra2; + } + } + + return ocqe; +} + bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags) { bool filled; io_cq_lock(ctx); filled = io_fill_cqe_aux(ctx, user_data, res, cflags); - if (!filled) - filled = io_cqring_event_overflow(ctx, true, - user_data, res, cflags, 0, 0); + if (unlikely(!filled)) { + struct io_cqe cqe = { + .user_data = user_data, + .res = res, + .flags = cflags + }; + struct io_overflow_cqe *ocqe; + + ocqe = io_get_ocqe(ctx, &cqe, 0, 0, GFP_ATOMIC); + filled = __io_cqring_event_overflow(ctx, ocqe); + } io_cq_unlock_post(ctx); 
return filled; } @@ -864,8 +866,17 @@ void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags) lockdep_assert_held(&ctx->uring_lock); lockdep_assert(ctx->lockless_cq); - if (!io_fill_cqe_aux(ctx, user_data, res, cflags)) - io_cqring_event_overflow(ctx, false, user_data, res, cflags, 0, 0); + if (!io_fill_cqe_aux(ctx, user_data, res, cflags)) { + struct io_cqe cqe = { + .user_data = user_data, + .res = res, + .flags = cflags + }; + struct io_overflow_cqe *ocqe; + + ocqe = io_get_ocqe(ctx, &cqe, 0, 0, GFP_KERNEL); + io_cqring_event_overflow(ctx, ocqe); + } ctx->submit_state.cq_flush = true; } @@ -1437,6 +1448,20 @@ static void io_free_batch_list(struct io_ring_ctx *ctx, } while (node); } +static __cold void io_cqe_overflow_fill(struct io_ring_ctx *ctx, + struct io_kiocb *req) +{ + gfp_t gfp = ctx->lockless_cq ? GFP_KERNEL : GFP_ATOMIC; + struct io_overflow_cqe *ocqe; + + ocqe = io_get_ocqe(ctx, &req->cqe, req->big_cqe.extra1, req->big_cqe.extra2, gfp); + if (ctx->lockless_cq) + io_cqring_event_overflow(ctx, ocqe); + else + __io_cqring_event_overflow(ctx, ocqe); + memset(&req->big_cqe, 0, sizeof(req->big_cqe)); +} + void __io_submit_flush_completions(struct io_ring_ctx *ctx) __must_hold(&ctx->uring_lock) { @@ -1453,17 +1478,10 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx) * will go through the io-wq retry machinery and post one * later. */ - if (!(req->flags & (REQ_F_CQE_SKIP | REQ_F_REISSUE)) && - unlikely(!io_fill_cqe_req(ctx, req))) { - bool locked = !ctx->lockless_cq; - - io_cqring_event_overflow(req->ctx, locked, - req->cqe.user_data, - req->cqe.res, req->cqe.flags, - req->big_cqe.extra1, - req->big_cqe.extra2); - memset(&req->big_cqe, 0, sizeof(req->big_cqe)); - } + if (req->flags & (REQ_F_CQE_SKIP | REQ_F_REISSUE)) + continue; + if (unlikely(!io_fill_cqe_req(ctx, req))) + io_cqe_overflow_fill(ctx, req); } __io_cq_unlock_post(ctx); -- Jens Axboe ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 2/4] io_uring: move locking inside overflow posting 2025-05-14 21:52 ` Jens Axboe @ 2025-05-15 11:04 ` Pavel Begunkov 0 siblings, 0 replies; 12+ messages in thread From: Pavel Begunkov @ 2025-05-15 11:04 UTC (permalink / raw) To: Jens Axboe, io-uring On 5/14/25 22:52, Jens Axboe wrote: > Since sometimes it's easier to talk in code that in English, something > like the below (just applied on top, and utterly untested) I think is > much cleaner. Didn't spend a lot of time on it, I'm sure it could get > condensed down some more with a helper or something. But it keeps the > locking in the caller, while still allowing GFP_KERNEL alloc for > lockless_cq. > > Somewhat unrelated, but also fills in an io_cqe and passes in for the > non-cqe32 parts, just for readability's sake. > > > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c > index da1075b66a87..9b6d09633a29 100644 > --- a/io_uring/io_uring.c > +++ b/io_uring/io_uring.c > @@ -744,44 +744,12 @@ static bool __io_cqring_event_overflow(struct io_ring_ctx *ctx, > return true; > } > > -static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, bool locked, > - u64 user_data, s32 res, u32 cflags, > - u64 extra1, u64 extra2) > +static void io_cqring_event_overflow(struct io_ring_ctx *ctx, > + struct io_overflow_cqe *ocqe) > { > - struct io_overflow_cqe *ocqe; > - size_t ocq_size = sizeof(struct io_overflow_cqe); > - bool is_cqe32 = (ctx->flags & IORING_SETUP_CQE32); > - gfp_t gfp = GFP_KERNEL_ACCOUNT; > - bool queued; > - > - io_lockdep_assert_cq_locked(ctx); > - > - if (is_cqe32) > - ocq_size += sizeof(struct io_uring_cqe); > - if (locked) > - gfp = GFP_ATOMIC | __GFP_ACCOUNT; > - > - ocqe = kmalloc(ocq_size, gfp); > - trace_io_uring_cqe_overflow(ctx, user_data, res, cflags, ocqe); > - > - if (ocqe) { > - ocqe->cqe.user_data = user_data; > - ocqe->cqe.res = res; > - ocqe->cqe.flags = cflags; > - if (is_cqe32) { > - ocqe->cqe.big_cqe[0] = extra1; > - ocqe->cqe.big_cqe[1] = extra2; > - } > - } > - > - if (locked) { > - queued = __io_cqring_event_overflow(ctx, ocqe); > - } else { > - spin_lock(&ctx->completion_lock); > - queued = __io_cqring_event_overflow(ctx, ocqe); > - spin_unlock(&ctx->completion_lock); > - } > - return queued; > + spin_lock(&ctx->completion_lock); > + __io_cqring_event_overflow(ctx, ocqe); > + spin_unlock(&ctx->completion_lock); > } > > /* > @@ -842,15 +810,49 @@ static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, > return false; > } > > +static struct io_overflow_cqe *io_get_ocqe(struct io_ring_ctx *ctx, > + struct io_cqe *cqe, u64 extra1, > + u64 extra2, gfp_t gfp) > +{ > + size_t ocq_size = sizeof(struct io_overflow_cqe); > + bool is_cqe32 = ctx->flags & IORING_SETUP_CQE32; > + struct io_overflow_cqe *ocqe; > + > + if (is_cqe32) > + ocq_size += sizeof(struct io_uring_cqe); > + > + ocqe = kmalloc(ocq_size, gfp); > + trace_io_uring_cqe_overflow(ctx, cqe->user_data, cqe->res, cqe->flags, ocqe); > + if (ocqe) { > + ocqe->cqe.user_data = cqe->user_data; > + ocqe->cqe.res = cqe->res; > + ocqe->cqe.flags = cqe->flags; > + if (is_cqe32) { > + ocqe->cqe.big_cqe[0] = extra1; > + ocqe->cqe.big_cqe[1] = extra2; > + } > + } > + > + return ocqe; > +} > + > bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags) > { > bool filled; > > io_cq_lock(ctx); > filled = io_fill_cqe_aux(ctx, user_data, res, cflags); > - if (!filled) > - filled = io_cqring_event_overflow(ctx, true, > - user_data, res, cflags, 0, 0); > + if (unlikely(!filled)) { > + struct io_cqe cqe = { > + 
.user_data = user_data, > + .res = res, > + .flags = cflags > + }; > + struct io_overflow_cqe *ocqe; > + > + ocqe = io_get_ocqe(ctx, &cqe, 0, 0, GFP_ATOMIC); > + filled = __io_cqring_event_overflow(ctx, ocqe); > + } > io_cq_unlock_post(ctx); > return filled; > } > @@ -864,8 +866,17 @@ void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags) > lockdep_assert_held(&ctx->uring_lock); > lockdep_assert(ctx->lockless_cq); > > - if (!io_fill_cqe_aux(ctx, user_data, res, cflags)) > - io_cqring_event_overflow(ctx, false, user_data, res, cflags, 0, 0); > + if (!io_fill_cqe_aux(ctx, user_data, res, cflags)) { > + struct io_cqe cqe = { > + .user_data = user_data, > + .res = res, > + .flags = cflags > + }; > + struct io_overflow_cqe *ocqe; > + > + ocqe = io_get_ocqe(ctx, &cqe, 0, 0, GFP_KERNEL); > + io_cqring_event_overflow(ctx, ocqe); > + } > > ctx->submit_state.cq_flush = true; > } > @@ -1437,6 +1448,20 @@ static void io_free_batch_list(struct io_ring_ctx *ctx, > } while (node); > } > > +static __cold void io_cqe_overflow_fill(struct io_ring_ctx *ctx, > + struct io_kiocb *req) > +{ > + gfp_t gfp = ctx->lockless_cq ? GFP_KERNEL : GFP_ATOMIC; > + struct io_overflow_cqe *ocqe; > + > + ocqe = io_get_ocqe(ctx, &req->cqe, req->big_cqe.extra1, req->big_cqe.extra2, gfp); > + if (ctx->lockless_cq) > + io_cqring_event_overflow(ctx, ocqe); > + else > + __io_cqring_event_overflow(ctx, ocqe); > + memset(&req->big_cqe, 0, sizeof(req->big_cqe)); > +} > + > void __io_submit_flush_completions(struct io_ring_ctx *ctx) > __must_hold(&ctx->uring_lock) > { > @@ -1453,17 +1478,10 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx) > * will go through the io-wq retry machinery and post one > * later. > */ > - if (!(req->flags & (REQ_F_CQE_SKIP | REQ_F_REISSUE)) && > - unlikely(!io_fill_cqe_req(ctx, req))) { > - bool locked = !ctx->lockless_cq; > - > - io_cqring_event_overflow(req->ctx, locked, > - req->cqe.user_data, > - req->cqe.res, req->cqe.flags, > - req->big_cqe.extra1, > - req->big_cqe.extra2); > - memset(&req->big_cqe, 0, sizeof(req->big_cqe)); > - } > + if (req->flags & (REQ_F_CQE_SKIP | REQ_F_REISSUE)) > + continue; > + if (unlikely(!io_fill_cqe_req(ctx, req))) > + io_cqe_overflow_fill(ctx, req); > } > __io_cq_unlock_post(ctx); Implicitly passing the locking state b/w functions is much worse. And instead of being contained in a single helper, all callers need to care about some allocator function, which doesn't help in making it cleaner and also contributes to bloating. Let's just leave it how it's already in the tree. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 3/4] io_uring: alloc overflow entry before locking 2025-05-14 8:07 [PATCH 0/4] overflow completion enhancements Pavel Begunkov 2025-05-14 8:07 ` [PATCH 1/4] io_uring: open code io_req_cqe_overflow() Pavel Begunkov 2025-05-14 8:07 ` [PATCH 2/4] io_uring: move locking inside overflow posting Pavel Begunkov @ 2025-05-14 8:07 ` Pavel Begunkov 2025-05-14 8:07 ` [PATCH 4/4] io_uring: add lockdep warning for overflow posting Pavel Begunkov 3 siblings, 0 replies; 12+ messages in thread From: Pavel Begunkov @ 2025-05-14 8:07 UTC (permalink / raw) To: io-uring; +Cc: asml.silence Allocate and populate struct io_overflow_cqe before taking the lock. It simplifies __io_cqring_event_overflow(), and that allows to avoid GFP_ATOMIC for DEFER_TASKRUN rings. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> --- io_uring/io_uring.c | 49 ++++++++++++++++++++++++--------------------- 1 file changed, 26 insertions(+), 23 deletions(-) diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 5b253e2b6c49..927d8e45dbdb 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -719,20 +719,10 @@ static __cold void io_uring_drop_tctx_refs(struct task_struct *task) } static bool __io_cqring_event_overflow(struct io_ring_ctx *ctx, - u64 user_data, s32 res, u32 cflags, - u64 extra1, u64 extra2) + struct io_overflow_cqe *ocqe) { - struct io_overflow_cqe *ocqe; - size_t ocq_size = sizeof(struct io_overflow_cqe); - bool is_cqe32 = (ctx->flags & IORING_SETUP_CQE32); - lockdep_assert_held(&ctx->completion_lock); - if (is_cqe32) - ocq_size += sizeof(struct io_uring_cqe); - - ocqe = kmalloc(ocq_size, GFP_ATOMIC | __GFP_ACCOUNT); - trace_io_uring_cqe_overflow(ctx, user_data, res, cflags, ocqe); if (!ocqe) { struct io_rings *r = ctx->rings; @@ -745,17 +735,10 @@ static bool __io_cqring_event_overflow(struct io_ring_ctx *ctx, set_bit(IO_CHECK_CQ_DROPPED_BIT, &ctx->check_cq); return false; } + if (list_empty(&ctx->cq_overflow_list)) { set_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq); atomic_or(IORING_SQ_CQ_OVERFLOW, &ctx->rings->sq_flags); - - } - ocqe->cqe.user_data = user_data; - ocqe->cqe.res = res; - ocqe->cqe.flags = cflags; - if (is_cqe32) { - ocqe->cqe.big_cqe[0] = extra1; - ocqe->cqe.big_cqe[1] = extra2; } list_add_tail(&ocqe->list, &ctx->cq_overflow_list); return true; @@ -765,15 +748,35 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, bool locked, u64 user_data, s32 res, u32 cflags, u64 extra1, u64 extra2) { + struct io_overflow_cqe *ocqe; + size_t ocq_size = sizeof(struct io_overflow_cqe); + bool is_cqe32 = (ctx->flags & IORING_SETUP_CQE32); + gfp_t gfp = GFP_KERNEL_ACCOUNT; bool queued; + if (is_cqe32) + ocq_size += sizeof(struct io_uring_cqe); + if (locked) + gfp = GFP_ATOMIC | __GFP_ACCOUNT; + + ocqe = kmalloc(ocq_size, gfp); + trace_io_uring_cqe_overflow(ctx, user_data, res, cflags, ocqe); + + if (ocqe) { + ocqe->cqe.user_data = user_data; + ocqe->cqe.res = res; + ocqe->cqe.flags = cflags; + if (is_cqe32) { + ocqe->cqe.big_cqe[0] = extra1; + ocqe->cqe.big_cqe[1] = extra2; + } + } + if (locked) { - queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, - extra1, extra2); + queued = __io_cqring_event_overflow(ctx, ocqe); } else { spin_lock(&ctx->completion_lock); - queued = __io_cqring_event_overflow(ctx, user_data, res, cflags, - extra1, extra2); + queued = __io_cqring_event_overflow(ctx, ocqe); spin_unlock(&ctx->completion_lock); } return queued; -- 2.49.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 4/4] io_uring: add lockdep warning for overflow posting 2025-05-14 8:07 [PATCH 0/4] overflow completion enhancements Pavel Begunkov ` (2 preceding siblings ...) 2025-05-14 8:07 ` [PATCH 3/4] io_uring: alloc overflow entry before locking Pavel Begunkov @ 2025-05-14 8:07 ` Pavel Begunkov 3 siblings, 0 replies; 12+ messages in thread From: Pavel Begunkov @ 2025-05-14 8:07 UTC (permalink / raw) To: io-uring; +Cc: asml.silence io_cqring_event_overflow() must be called in the same CQ protection section as the preceding io_get_cqe(), otherwise overflowed and normal CQEs can get out of order. It's hard to add a debug check for exactly that, but we can verify that io_cqring_event_overflow() is called with the CQ locked, which should be good enough to catch most cases of misuse. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> --- io_uring/io_uring.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 927d8e45dbdb..da1075b66a87 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -754,6 +754,8 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, bool locked, gfp_t gfp = GFP_KERNEL_ACCOUNT; bool queued; + + io_lockdep_assert_cq_locked(ctx); + if (is_cqe32) ocq_size += sizeof(struct io_uring_cqe); if (locked) -- 2.49.0 ^ permalink raw reply related [flat|nested] 12+ messages in thread
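As a rough illustration of the misuse the new assertion is meant to catch, consider a hypothetical caller (not code from the tree; built on the post-series signature from patch 2) that drops CQ protection between the failed CQE fill and the overflow post:

/* Hypothetical buggy caller, for illustration only. */
static void buggy_post(struct io_ring_ctx *ctx, u64 user_data, s32 res)
{
	bool filled;

	io_cq_lock(ctx);
	filled = io_fill_cqe_aux(ctx, user_data, res, 0);
	io_cq_unlock_post(ctx);		/* CQ protection dropped too early */

	if (!filled) {
		/*
		 * Another completion can land in the CQ in the window above,
		 * so this overflowed CQE would later be flushed out of order.
		 * With this patch, io_lockdep_assert_cq_locked() should warn
		 * here when lockdep is enabled.
		 */
		io_cqring_event_overflow(ctx, true, user_data, res, 0, 0, 0);
	}
}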