Re: [PATCH] io_uring/poll: use regular CQE posting for multishot termination

public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed

From: Pavel Begunkov <asml.silence@gmail.com>
To: Jens Axboe <axboe@kernel.dk>, io-uring <io-uring@vger.kernel.org>
Subject: Re: [PATCH] io_uring/poll: use regular CQE posting for multishot termination
Date: Thu, 8 May 2025 09:11:27 +0100	[thread overview]
Message-ID: <c0f72ca7-76f9-41b3-9bcd-18deba603ab3@gmail.com> (raw)
In-Reply-To: <d0c88f28-3915-4860-93d7-3a383aff8061@kernel.dk>

On 5/8/25 00:11, Jens Axboe wrote:
...>>> This removes the io_req_post_cqe() flushing, and instead puts the honus
>>> on the poll side to provide the ordering. I've verified that this also
>>> fixes the reported issue. The previous patch can be easily backported to
>>> stable, so makes sense to keep that one.
>>
>> It still gives a bad feeling tbh, it's not a polling problem,
>> we're working around shortcomings of the incremental / bundled
>> uapi and/or design. Patching it in semi unrelated places will
>> defitely bite back.
> 
> I don't think that's fair, we should always strive to have as close to
> ordered completions as we can. The fact that multishot ends up using a

As a nice thing to have / perf / qos? Sure, but this makes it a
correctness requirement, which always takes precedence over any
future potential optimisation / cleanup that might lead to a rare
reordering.

> mix of both methods to fill CQEs is problematic.

It's a nuisance, I agree, but the problem here is introducing
implicit inter request ordering. I assume for incremental you're
returning {bid, len} and the user has to have a state to track
the current offset based on CQE ordering? In which case the
problem is really not returning the offset explicitly.


>> Can it be fixed in relevant opcodes? So it stays close to
>> those who actually use it. And let me ask since I'm lost in
>> new features, can the uapi be fixed so that it doesn't
>> depend on request ordering?
> 
> The API absolutely relies on ordering within a buffer group ID.
> 
> It can certainly be fixed at the opcode sites, but there'd be 3 spots in
> net and one in rw.c, and for each spot it'd be more involved to fix it.
But that's the spots that use provided buffers, while polling shouldn't
have a reason to know about provided buffers. For example, is it safe
to mix such mshot and oneshot requests? Let's say during initial
submission oneshot request succeeds and queues a deferred completion,
i.e. io_issue_sqe() -> io_req_complete_defer(). Next it submits an
mshot request, which posts some aux cqes, and only then we file a
CQE for the first request. Or it can be just an mshot executed not
from the polling loop but inline.

> 
...>>> +{
>>> +    bool filled;
>>> +
>>> +    filled = io_req_post_cqe(req, req->cqe.res, req->cqe.flags);
>>
>> posting and overflow must be under the same CQ critical section,
>> like io_cq_lock(). Just copy io_post_aux_cqe() and add
>> ctx->cq_extra--? Hopefully we'll remove the cq_extra ugliness
>> later and combine them after.
> 
> Would be great to combine those, as it stands there's a mix of them, and
> io_add_aux_cqe() for example does split locking. I'll update the
> locking.

It would be, but that's buggy and can cause reordering. io_add_aux_cqe()
doesn't split the _CQ critical section_, it relies / requires
DEFER_TASKRUN, and ->completion_lock is not equivalent to that.

>>> diff --git a/io_uring/poll.c b/io_uring/poll.c
>>> index 8eb744eb9f4c..af8e3d4f6f1f 100644
>>> --- a/io_uring/poll.c
>>> +++ b/io_uring/poll.c
>>> @@ -312,6 +312,13 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
>>>        return IOU_POLL_NO_ACTION;
>>>    }
>>>    +static void io_poll_req_complete(struct io_kiocb *req, io_tw_token_t tw)
>>> +{
>>> +    if (io_req_post_cqe_overflow(req))
>>> +        req->flags |= REQ_F_CQE_SKIP;
>>
>> Unconditional would be better. It'd still end up in attempting
>> to post, likely failing and reattemptng allocation just one
>> extra time, not like it gives any reliability. And if I'd be
>> choosing b/w dropping a completion or potentially getting a
>> botched completion as per the problem you tried, I say the
>> former is better.
> 
> Not sure I follow, unconditional what? SKIP? Yes that probably makes

Yes

io_req_post_cqe_overflow(req);
req->flags |= REQ_F_CQE_SKIP;

> sense, if we don't overflow post, it'll get logged as such anyway.

-- 
Pavel Begunkov

     prev parent reply	other threads:[~2025-05-08  8:10 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-07 18:08 [PATCH] io_uring/poll: use regular CQE posting for multishot termination Jens Axboe
2025-05-07 20:53 ` Pavel Begunkov
2025-05-07 23:11   ` Jens Axboe
2025-05-08  8:11     ` Pavel Begunkov [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c0f72ca7-76f9-41b3-9bcd-18deba603ab3@gmail.com \
    --to=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=io-uring@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox