* [PATCH] io_uring/poll: use regular CQE posting for multishot termination
From: Jens Axboe @ 2025-05-07 18:08 UTC
To: io-uring
A previous patch avoided multiple multishot requests potentially
getting their CQEs reordered when one of them terminates, as that
last termination CQE is posted as a deferred completion rather than
directly as a CQE. That fix can reduce the efficiency of the batched
posting, hence it was not ideal.
Provide a basic helper that poll can use for this kind of termination,
which does a normal CQE posting rather than a deferred one. With that,
the work-around where io_req_post_cqe() needs to flush deferred
completions can be removed.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
This removes the io_req_post_cqe() flushing, and instead puts the onus
on the poll side to provide the ordering. I've verified that this also
fixes the reported issue. The previous patch can be easily backported to
stable, so it makes sense to keep that one.
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 541e65a1eebf..505959fc2de0 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -848,14 +848,6 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
struct io_ring_ctx *ctx = req->ctx;
bool posted;
- /*
- * If multishot has already posted deferred completions, ensure that
- * those are flushed first before posting this one. If not, CQEs
- * could get reordered.
- */
- if (!wq_list_empty(&ctx->submit_state.compl_reqs))
- __io_submit_flush_completions(ctx);
-
lockdep_assert(!io_wq_current_is_worker());
lockdep_assert_held(&ctx->uring_lock);
@@ -871,6 +863,23 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
return posted;
}
+bool io_req_post_cqe_overflow(struct io_kiocb *req)
+{
+ bool filled;
+
+ filled = io_req_post_cqe(req, req->cqe.res, req->cqe.flags);
+ if (unlikely(!filled)) {
+ struct io_ring_ctx *ctx = req->ctx;
+
+ spin_lock(&ctx->completion_lock);
+ filled = io_cqring_event_overflow(ctx, req->cqe.user_data,
+ req->cqe.res, req->cqe.flags, 0, 0);
+ spin_unlock(&ctx->completion_lock);
+ }
+
+ return filled;
+}
+
static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index e4050b2d0821..d2d4bf7c3b29 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -82,6 +82,7 @@ void io_req_defer_failed(struct io_kiocb *req, s32 res);
bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags);
+bool io_req_post_cqe_overflow(struct io_kiocb *req);
void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
struct file *io_file_get_normal(struct io_kiocb *req, int fd);
diff --git a/io_uring/poll.c b/io_uring/poll.c
index 8eb744eb9f4c..af8e3d4f6f1f 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -312,6 +312,13 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
return IOU_POLL_NO_ACTION;
}
+static void io_poll_req_complete(struct io_kiocb *req, io_tw_token_t tw)
+{
+ if (io_req_post_cqe_overflow(req))
+ req->flags |= REQ_F_CQE_SKIP;
+ io_req_task_complete(req, tw);
+}
+
void io_poll_task_func(struct io_kiocb *req, io_tw_token_t tw)
{
int ret;
@@ -349,7 +356,7 @@ void io_poll_task_func(struct io_kiocb *req, io_tw_token_t tw)
io_tw_lock(req->ctx, tw);
if (ret == IOU_POLL_REMOVE_POLL_USE_RES)
- io_req_task_complete(req, tw);
+ io_poll_req_complete(req, tw);
else if (ret == IOU_POLL_DONE || ret == IOU_POLL_REISSUE)
io_req_task_submit(req, tw);
else
--
Jens Axboe
* Re: [PATCH] io_uring/poll: use regular CQE posting for multishot termination
From: Pavel Begunkov @ 2025-05-07 20:53 UTC
To: Jens Axboe, io-uring
On 5/7/25 19:08, Jens Axboe wrote:
> A previous patch avoided multiple multishot requests potentially
> getting their CQEs reordered when one of them terminates, as that
> last termination CQE is posted as a deferred completion rather than
> directly as a CQE. That fix can reduce the efficiency of the batched
> posting, hence it was not ideal.
>
> Provide a basic helper that poll can use for this kind of termination,
> which does a normal CQE posting rather than a deferred one. With that,
> the work-around where io_req_post_cqe() needs to flush deferred
> completions can be removed.
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> ---
>
> This removes the io_req_post_cqe() flushing, and instead puts the onus
> on the poll side to provide the ordering. I've verified that this also
> fixes the reported issue. The previous patch can be easily backported to
> stable, so it makes sense to keep that one.
It still gives me a bad feeling tbh; it's not a polling problem,
we're working around shortcomings of the incremental / bundled
uapi and/or design. Patching it in semi-unrelated places will
definitely bite back.
Can it be fixed in the relevant opcodes, so that it stays close
to those who actually use it? And let me ask, since I'm lost in
the new features: can the uapi be fixed so that it doesn't
depend on request ordering?
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 541e65a1eebf..505959fc2de0 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -848,14 +848,6 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
> struct io_ring_ctx *ctx = req->ctx;
> bool posted;
>
> - /*
> - * If multishot has already posted deferred completions, ensure that
> - * those are flushed first before posting this one. If not, CQEs
> - * could get reordered.
> - */
> - if (!wq_list_empty(&ctx->submit_state.compl_reqs))
> - __io_submit_flush_completions(ctx);
> -
> lockdep_assert(!io_wq_current_is_worker());
> lockdep_assert_held(&ctx->uring_lock);
>
> @@ -871,6 +863,23 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
> return posted;
> }
>
> +bool io_req_post_cqe_overflow(struct io_kiocb *req)
"overflow" here is rather confusing, it could mean lots of things.
Maybe some *_post_poll_complete for now?
> +{
> + bool filled;
> +
> + filled = io_req_post_cqe(req, req->cqe.res, req->cqe.flags);
posting and overflow must be under the same CQ critical section,
like io_cq_lock(). Just copy io_post_aux_cqe() and add
ctx->cq_extra--? Hopefully we'll remove the cq_extra ugliness
later and combine them after.
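Completely untested and from memory, just a sketch of what I mean,
with io_fill_cqe_aux() / io_cq_unlock_post() being my assumption of
what the io_post_aux_cqe() path currently uses:

	static bool io_req_post_poll_complete(struct io_kiocb *req)
	{
		struct io_ring_ctx *ctx = req->ctx;
		bool filled;

		io_cq_lock(ctx);
		/* fill the CQE the same way io_post_aux_cqe() would ... */
		filled = io_fill_cqe_aux(ctx, req->cqe.user_data, req->cqe.res,
					 req->cqe.flags);
		/* ... and fall back to overflow inside the same CQ section */
		if (unlikely(!filled))
			filled = io_cqring_event_overflow(ctx, req->cqe.user_data,
							  req->cqe.res,
							  req->cqe.flags, 0, 0);
		/* it's a CQE for a real request, undo the aux accounting */
		ctx->cq_extra--;
		io_cq_unlock_post(ctx);
		return filled;
	}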
> + if (unlikely(!filled)) {
> + struct io_ring_ctx *ctx = req->ctx;
> +
> + spin_lock(&ctx->completion_lock);
> + filled = io_cqring_event_overflow(ctx, req->cqe.user_data,
> + req->cqe.res, req->cqe.flags, 0, 0);
> + spin_unlock(&ctx->completion_lock);
> + }
> +
> + return filled;
> +}
> +
> static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
> {
> struct io_ring_ctx *ctx = req->ctx;
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index e4050b2d0821..d2d4bf7c3b29 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -82,6 +82,7 @@ void io_req_defer_failed(struct io_kiocb *req, s32 res);
> bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
> void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
> bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags);
> +bool io_req_post_cqe_overflow(struct io_kiocb *req);
> void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
>
> struct file *io_file_get_normal(struct io_kiocb *req, int fd);
> diff --git a/io_uring/poll.c b/io_uring/poll.c
> index 8eb744eb9f4c..af8e3d4f6f1f 100644
> --- a/io_uring/poll.c
> +++ b/io_uring/poll.c
> @@ -312,6 +312,13 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
> return IOU_POLL_NO_ACTION;
> }
>
> +static void io_poll_req_complete(struct io_kiocb *req, io_tw_token_t tw)
> +{
> + if (io_req_post_cqe_overflow(req))
> + req->flags |= REQ_F_CQE_SKIP;
Unconditional would be better. It'd still end up attempting
to post, likely failing and reattempting allocation just one
extra time, and it's not like it gives any reliability. And if I
had to choose between dropping a completion or potentially getting
a botched completion as in the problem you tried to fix, I'd say
the former is better.
> + io_req_task_complete(req, tw);
> +}
> +
> void io_poll_task_func(struct io_kiocb *req, io_tw_token_t tw)
> {
> int ret;
> @@ -349,7 +356,7 @@ void io_poll_task_func(struct io_kiocb *req, io_tw_token_t tw)
> io_tw_lock(req->ctx, tw);
>
> if (ret == IOU_POLL_REMOVE_POLL_USE_RES)
> - io_req_task_complete(req, tw);
> + io_poll_req_complete(req, tw);
> else if (ret == IOU_POLL_DONE || ret == IOU_POLL_REISSUE)
> io_req_task_submit(req, tw);
> else
--
Pavel Begunkov
* Re: [PATCH] io_uring/poll: use regular CQE posting for multishot termination
From: Jens Axboe @ 2025-05-07 23:11 UTC
To: Pavel Begunkov, io-uring
On 5/7/25 2:53 PM, Pavel Begunkov wrote:
> On 5/7/25 19:08, Jens Axboe wrote:
>> A previous patch avoided multiple multishot requests potentially
>> getting their CQEs reordered when one of them terminates, as that
>> last termination CQE is posted as a deferred completion rather than
>> directly as a CQE. That fix can reduce the efficiency of the batched
>> posting, hence it was not ideal.
>>
>> Provide a basic helper that poll can use for this kind of termination,
>> which does a normal CQE posting rather than a deferred one. With that,
>> the work-around where io_req_post_cqe() needs to flush deferred
>> completions can be removed.
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> ---
>>
>> This removes the io_req_post_cqe() flushing, and instead puts the onus
>> on the poll side to provide the ordering. I've verified that this also
>> fixes the reported issue. The previous patch can be easily backported to
>> stable, so it makes sense to keep that one.
>
> It still gives me a bad feeling tbh; it's not a polling problem,
> we're working around shortcomings of the incremental / bundled
> uapi and/or design. Patching it in semi-unrelated places will
> definitely bite back.
I don't think that's fair; we should always strive to have completions
as close to ordered as we can. The fact that multishot ends up using a
mix of both methods to fill CQEs is problematic.
> Can it be fixed in the relevant opcodes, so that it stays close
> to those who actually use it? And let me ask, since I'm lost in
> the new features: can the uapi be fixed so that it doesn't
> depend on request ordering?
The API absolutely relies on ordering within a buffer group ID.
It can certainly be fixed at the opcode sites, but there'd be 3 spots in
net and one in rw.c, and for each spot it'd be more involved to fix it.
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 541e65a1eebf..505959fc2de0 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -848,14 +848,6 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>> struct io_ring_ctx *ctx = req->ctx;
>> bool posted;
>> - /*
>> - * If multishot has already posted deferred completions, ensure that
>> - * those are flushed first before posting this one. If not, CQEs
>> - * could get reordered.
>> - */
>> - if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>> - __io_submit_flush_completions(ctx);
>> -
>> lockdep_assert(!io_wq_current_is_worker());
>> lockdep_assert_held(&ctx->uring_lock);
>> @@ -871,6 +863,23 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>> return posted;
>> }
>> +bool io_req_post_cqe_overflow(struct io_kiocb *req)
>
> "overflow" here is rather confusing, it could mean lots of things.
> Maybe some *_post_poll_complete for now?
Yeah, it's not a great name, I just didn't have any better ideas at
the time. I'll ponder it a bit; __complete() isn't terrible.
>> +{
>> + bool filled;
>> +
>> + filled = io_req_post_cqe(req, req->cqe.res, req->cqe.flags);
>
> posting and overflow must be under the same CQ critical section,
> like io_cq_lock(). Just copy io_post_aux_cqe() and add
> ctx->cq_extra--? Hopefully we'll remove the cq_extra ugliness
> later and combine them after.
It would be great to combine those; as it stands there's a mix of
them, and io_add_aux_cqe() for example does split locking. I'll
update the locking.
>> diff --git a/io_uring/poll.c b/io_uring/poll.c
>> index 8eb744eb9f4c..af8e3d4f6f1f 100644
>> --- a/io_uring/poll.c
>> +++ b/io_uring/poll.c
>> @@ -312,6 +312,13 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
>> return IOU_POLL_NO_ACTION;
>> }
>> +static void io_poll_req_complete(struct io_kiocb *req, io_tw_token_t tw)
>> +{
>> + if (io_req_post_cqe_overflow(req))
>> + req->flags |= REQ_F_CQE_SKIP;
>
> Unconditional would be better. It'd still end up attempting
> to post, likely failing and reattempting allocation just one
> extra time, and it's not like it gives any reliability. And if I
> had to choose between dropping a completion or potentially getting
> a botched completion as in the problem you tried to fix, I'd say
> the former is better.
Not sure I follow, unconditional what? SKIP? Yes, that probably makes
sense; if we don't overflow post, it'll get logged as such anyway.
--
Jens Axboe
* Re: [PATCH] io_uring/poll: use regular CQE posting for multishot termination
From: Pavel Begunkov @ 2025-05-08 8:11 UTC
To: Jens Axboe, io-uring
On 5/8/25 00:11, Jens Axboe wrote:
...
>>> This removes the io_req_post_cqe() flushing, and instead puts the onus
>>> on the poll side to provide the ordering. I've verified that this also
>>> fixes the reported issue. The previous patch can be easily backported to
>>> stable, so it makes sense to keep that one.
>>
>> It still gives me a bad feeling tbh; it's not a polling problem,
>> we're working around shortcomings of the incremental / bundled
>> uapi and/or design. Patching it in semi-unrelated places will
>> definitely bite back.
>
> I don't think that's fair; we should always strive to have completions
> as close to ordered as we can. The fact that multishot ends up using a
As a nice thing to have / perf / qos? Sure, but this makes it a
correctness requirement, which always takes precedence over any
future potential optimisation / cleanup that might lead to a rare
reordering.
> mix of both methods to fill CQEs is problematic.
It's a nuisance, I agree, but the problem here is introducing
implicit inter-request ordering. I assume for incremental you're
returning {bid, len} and the user has to keep state to track
the current offset based on CQE ordering? In which case the real
problem is that the offset isn't returned explicitly.
>> Can it be fixed in the relevant opcodes, so that it stays close
>> to those who actually use it? And let me ask, since I'm lost in
>> the new features: can the uapi be fixed so that it doesn't
>> depend on request ordering?
>
> The API absolutely relies on ordering within a buffer group ID.
>
> It can certainly be fixed at the opcode sites, but there'd be 3 spots in
> net and one in rw.c, and for each spot it'd be more involved to fix it.
But those are the spots that use provided buffers, while polling
shouldn't have a reason to know about provided buffers. For example,
is it safe to mix such mshot and oneshot requests? Let's say during
initial submission a oneshot request succeeds and queues a deferred
completion, i.e. io_issue_sqe() -> io_req_complete_defer(). Next it
submits an mshot request, which posts some aux CQEs, and only then
do we post a CQE for the first request. Or it can just be an mshot
request executed inline rather than from the polling loop.
>
...
>>> +{
>>> + bool filled;
>>> +
>>> + filled = io_req_post_cqe(req, req->cqe.res, req->cqe.flags);
>>
>> posting and overflow must be under the same CQ critical section,
>> like io_cq_lock(). Just copy io_post_aux_cqe() and add
>> ctx->cq_extra--? Hopefully we'll remove the cq_extra ugliness
>> later and combine them after.
>
> It would be great to combine those; as it stands there's a mix of
> them, and io_add_aux_cqe() for example does split locking. I'll
> update the locking.
It would be, but that's buggy and can cause reordering. io_add_aux_cqe()
doesn't split the _CQ critical section_; it relies on / requires
DEFER_TASKRUN, and ->completion_lock is not equivalent to that.
>>> diff --git a/io_uring/poll.c b/io_uring/poll.c
>>> index 8eb744eb9f4c..af8e3d4f6f1f 100644
>>> --- a/io_uring/poll.c
>>> +++ b/io_uring/poll.c
>>> @@ -312,6 +312,13 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
>>> return IOU_POLL_NO_ACTION;
>>> }
>>> +static void io_poll_req_complete(struct io_kiocb *req, io_tw_token_t tw)
>>> +{
>>> + if (io_req_post_cqe_overflow(req))
>>> + req->flags |= REQ_F_CQE_SKIP;
>>
>> Unconditional would be better. It'd still end up attempting
>> to post, likely failing and reattempting allocation just one
>> extra time, and it's not like it gives any reliability. And if I
>> had to choose between dropping a completion or potentially getting
>> a botched completion as in the problem you tried to fix, I'd say
>> the former is better.
>
> Not sure I follow, unconditional what? SKIP? Yes, that probably makes
Yes
io_req_post_cqe_overflow(req);
req->flags |= REQ_F_CQE_SKIP;
> sense; if we don't overflow post, it'll get logged as such anyway.
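IOW, just as a sketch of what I mean, reusing the helper names from
your patch and only making the skip unconditional:

	static void io_poll_req_complete(struct io_kiocb *req, io_tw_token_t tw)
	{
		/* post (or overflow) the final CQE ourselves ... */
		io_req_post_cqe_overflow(req);
		/*
		 * ... and always skip the deferred-completion CQE, so a
		 * failed post drops the event instead of emitting it out
		 * of order later.
		 */
		req->flags |= REQ_F_CQE_SKIP;
		io_req_task_complete(req, tw);
	}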
--
Pavel Begunkov