From: Jens Axboe <axboe@kernel.dk>
To: Pavel Begunkov <asml.silence@gmail.com>,
io-uring <io-uring@vger.kernel.org>
Subject: Re: [PATCH] io_uring/poll: use regular CQE posting for multishot termination
Date: Wed, 7 May 2025 17:11:48 -0600 [thread overview]
Message-ID: <d0c88f28-3915-4860-93d7-3a383aff8061@kernel.dk> (raw)
In-Reply-To: <1711744d-1dd1-4efc-87e2-6ddc1124a95e@gmail.com>
On 5/7/25 2:53 PM, Pavel Begunkov wrote:
> On 5/7/25 19:08, Jens Axboe wrote:
>> A previous patch avoided reordering of multiple multishot requests
>> getting their CQEs potentiall reordered when one of them terminates, as
>> that last termination CQE is posted as a deferred completion rather than
>> directly as a CQE. This can reduce the efficiency of the batched
>> posting, hence was not ideal.
>>
>> Provide a basic helper that poll can use for this kind of termination,
>> which does a normal CQE posting rather than a deferred one. With that,
>> the work-around where io_req_post_cqe() needs to flush deferred
>> completions can be removed.
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> ---
>>
>> This removes the io_req_post_cqe() flushing, and instead puts the honus
>> on the poll side to provide the ordering. I've verified that this also
>> fixes the reported issue. The previous patch can be easily backported to
>> stable, so makes sense to keep that one.
>
> It still gives a bad feeling tbh, it's not a polling problem,
> we're working around shortcomings of the incremental / bundled
> uapi and/or design. Patching it in semi unrelated places will
> defitely bite back.
I don't think that's fair, we should always strive to have as close to
ordered completions as we can. The fact that multishot ends up using a
mix of both methods to fill CQEs is problematic.
> Can it be fixed in relevant opcodes? So it stays close to
> those who actually use it. And let me ask since I'm lost in
> new features, can the uapi be fixed so that it doesn't
> depend on request ordering?
The API absolutely relies on ordering within a buffer group ID.
It can certainly be fixed at the opcode sites, but there'd be 3 spots in
net and one in rw.c, and for each spot it'd be more involved to fix it.
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 541e65a1eebf..505959fc2de0 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -848,14 +848,6 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>> struct io_ring_ctx *ctx = req->ctx;
>> bool posted;
>> - /*
>> - * If multishot has already posted deferred completions, ensure that
>> - * those are flushed first before posting this one. If not, CQEs
>> - * could get reordered.
>> - */
>> - if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>> - __io_submit_flush_completions(ctx);
>> -
>> lockdep_assert(!io_wq_current_is_worker());
>> lockdep_assert_held(&ctx->uring_lock);
>> @@ -871,6 +863,23 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>> return posted;
>> }
>> +bool io_req_post_cqe_overflow(struct io_kiocb *req)
>
> "overflow" here is rather confusing, it could mean lots of things.
> Maybe some *_post_poll_complete for now?
Yeah it's not a great name, just didn't have any better ideas at the
time. I'll ponder a bit, __complete() isn't terrible.
>> +{
>> + bool filled;
>> +
>> + filled = io_req_post_cqe(req, req->cqe.res, req->cqe.flags);
>
> posting and overflow must be under the same CQ critical section,
> like io_cq_lock(). Just copy io_post_aux_cqe() and add
> ctx->cq_extra--? Hopefully we'll remove the cq_extra ugliness
> later and combine them after.
Would be great to combine those, as it stands there's a mix of them, and
io_add_aux_cqe() for example does split locking. I'll update the
locking.
>> diff --git a/io_uring/poll.c b/io_uring/poll.c
>> index 8eb744eb9f4c..af8e3d4f6f1f 100644
>> --- a/io_uring/poll.c
>> +++ b/io_uring/poll.c
>> @@ -312,6 +312,13 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
>> return IOU_POLL_NO_ACTION;
>> }
>> +static void io_poll_req_complete(struct io_kiocb *req, io_tw_token_t tw)
>> +{
>> + if (io_req_post_cqe_overflow(req))
>> + req->flags |= REQ_F_CQE_SKIP;
>
> Unconditional would be better. It'd still end up in attempting
> to post, likely failing and reattemptng allocation just one
> extra time, not like it gives any reliability. And if I'd be
> choosing b/w dropping a completion or potentially getting a
> botched completion as per the problem you tried, I say the
> former is better.
Not sure I follow, unconditional what? SKIP? Yes that probably makes
sense, if we don't overflow post, it'll get logged as such anyway.
--
Jens Axboe
next prev parent reply other threads:[~2025-05-07 23:11 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-07 18:08 [PATCH] io_uring/poll: use regular CQE posting for multishot termination Jens Axboe
2025-05-07 20:53 ` Pavel Begunkov
2025-05-07 23:11 ` Jens Axboe [this message]
2025-05-08 8:11 ` Pavel Begunkov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d0c88f28-3915-4860-93d7-3a383aff8061@kernel.dk \
--to=axboe@kernel.dk \
--cc=asml.silence@gmail.com \
--cc=io-uring@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox