Re: [PATCH] io_uring/poll: use regular CQE posting for multishot termination

public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed

From: Jens Axboe <axboe@kernel.dk>
To: Pavel Begunkov <asml.silence@gmail.com>,
	io-uring <io-uring@vger.kernel.org>
Subject: Re: [PATCH] io_uring/poll: use regular CQE posting for multishot termination
Date: Wed, 7 May 2025 17:11:48 -0600	[thread overview]
Message-ID: <d0c88f28-3915-4860-93d7-3a383aff8061@kernel.dk> (raw)
In-Reply-To: <1711744d-1dd1-4efc-87e2-6ddc1124a95e@gmail.com>

On 5/7/25 2:53 PM, Pavel Begunkov wrote:
> On 5/7/25 19:08, Jens Axboe wrote:
>> A previous patch avoided reordering of multiple multishot requests
>> getting their CQEs potentiall reordered when one of them terminates, as
>> that last termination CQE is posted as a deferred completion rather than
>> directly as a CQE. This can reduce the efficiency of the batched
>> posting, hence was not ideal.
>>
>> Provide a basic helper that poll can use for this kind of termination,
>> which does a normal CQE posting rather than a deferred one. With that,
>> the work-around where io_req_post_cqe() needs to flush deferred
>> completions can be removed.
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> ---
>>
>> This removes the io_req_post_cqe() flushing, and instead puts the honus
>> on the poll side to provide the ordering. I've verified that this also
>> fixes the reported issue. The previous patch can be easily backported to
>> stable, so makes sense to keep that one.
> 
> It still gives a bad feeling tbh, it's not a polling problem,
> we're working around shortcomings of the incremental / bundled
> uapi and/or design. Patching it in semi unrelated places will
> defitely bite back.

I don't think that's fair, we should always strive to have as close to
ordered completions as we can. The fact that multishot ends up using a
mix of both methods to fill CQEs is problematic.

> Can it be fixed in relevant opcodes? So it stays close to
> those who actually use it. And let me ask since I'm lost in
> new features, can the uapi be fixed so that it doesn't
> depend on request ordering?

The API absolutely relies on ordering within a buffer group ID.

It can certainly be fixed at the opcode sites, but there'd be 3 spots in
net and one in rw.c, and for each spot it'd be more involved to fix it.

>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 541e65a1eebf..505959fc2de0 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -848,14 +848,6 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>>       struct io_ring_ctx *ctx = req->ctx;
>>       bool posted;
>>   -    /*
>> -     * If multishot has already posted deferred completions, ensure that
>> -     * those are flushed first before posting this one. If not, CQEs
>> -     * could get reordered.
>> -     */
>> -    if (!wq_list_empty(&ctx->submit_state.compl_reqs))
>> -        __io_submit_flush_completions(ctx);
>> -
>>       lockdep_assert(!io_wq_current_is_worker());
>>       lockdep_assert_held(&ctx->uring_lock);
>>   @@ -871,6 +863,23 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
>>       return posted;
>>   }
>>   +bool io_req_post_cqe_overflow(struct io_kiocb *req)
> 
> "overflow" here is rather confusing, it could mean lots of things.
> Maybe some *_post_poll_complete for now?

Yeah it's not a great name, just didn't have any better ideas at the
time. I'll ponder a bit, __complete() isn't terrible.

>> +{
>> +    bool filled;
>> +
>> +    filled = io_req_post_cqe(req, req->cqe.res, req->cqe.flags);
> 
> posting and overflow must be under the same CQ critical section,
> like io_cq_lock(). Just copy io_post_aux_cqe() and add
> ctx->cq_extra--? Hopefully we'll remove the cq_extra ugliness
> later and combine them after.

Would be great to combine those, as it stands there's a mix of them, and
io_add_aux_cqe() for example does split locking. I'll update the
locking.

>> diff --git a/io_uring/poll.c b/io_uring/poll.c
>> index 8eb744eb9f4c..af8e3d4f6f1f 100644
>> --- a/io_uring/poll.c
>> +++ b/io_uring/poll.c
>> @@ -312,6 +312,13 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
>>       return IOU_POLL_NO_ACTION;
>>   }
>>   +static void io_poll_req_complete(struct io_kiocb *req, io_tw_token_t tw)
>> +{
>> +    if (io_req_post_cqe_overflow(req))
>> +        req->flags |= REQ_F_CQE_SKIP;
> 
> Unconditional would be better. It'd still end up in attempting
> to post, likely failing and reattemptng allocation just one
> extra time, not like it gives any reliability. And if I'd be
> choosing b/w dropping a completion or potentially getting a
> botched completion as per the problem you tried, I say the
> former is better.

Not sure I follow, unconditional what? SKIP? Yes that probably makes
sense, if we don't overflow post, it'll get logged as such anyway.

-- 
Jens Axboe

next prev parent reply	other threads:[~2025-05-07 23:11 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-07 18:08 [PATCH] io_uring/poll: use regular CQE posting for multishot termination Jens Axboe
2025-05-07 20:53 ` Pavel Begunkov
2025-05-07 23:11   ` Jens Axboe [this message]
2025-05-08  8:11     ` Pavel Begunkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d0c88f28-3915-4860-93d7-3a383aff8061@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=asml.silence@gmail.com \
    --cc=io-uring@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox