Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race

public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed

From: Jens Axboe <axboe@kernel.dk>
To: Pavel Begunkov <asml.silence@gmail.com>,
	io-uring <io-uring@vger.kernel.org>
Subject: Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
Date: Mon, 16 Mar 2026 16:31:22 -0600	[thread overview]
Message-ID: <be88cb1c-9e08-49dd-9671-1d3887918935@kernel.dk> (raw)
In-Reply-To: <d8df1979-5534-4703-9e68-17b152d6595e@gmail.com>

On 3/16/26 4:24 PM, Pavel Begunkov wrote:
> On 3/16/26 18:40, Jens Axboe wrote:
>> On 3/16/26 9:16 AM, Jens Axboe wrote:
>>> On 3/16/26 8:44 AM, Pavel Begunkov wrote:
>>>> On 3/16/26 14:40, Pavel Begunkov wrote:
>>>>> On 3/16/26 14:28, Jens Axboe wrote:
>>>>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>>>>> forever.
>>>>>>>>
>>>>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>>>>> discovers the remaining state.
>>>>>>>
>>>>>>> How often will iterate with no effect for normal execution (i.e.
>>>>>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>>>>>
>>>>>> That is my worry too. I spent a bit of time on it this morning to figure
>>>>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>>>>> commit in particular:
>>>>>>
>>>>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>>>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>>>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>>>>
>>>>>>       af_unix: Introduce SO_INQ.
>>>>>>
>>>>>> which is then not the first time I've had to fix fallout from that
>>>>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>>>>> missing events. Yes if both poll triggers are of the same type, eg
>>>>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>>>>> would anything else where you'd need separate handling for the trigger.
>>>>>
>>>>> Thinking more, I don't think the patch is correct either. Seems you
>>>>> expect the last recv to return 0, but let's say you have 2 refs and
>>>>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>>>>> fails. The 2nd recv call returns another 4K, and now you're in the
>>>>> same situation as before.
>>>>>
>>>>> You're trying to rely on a too specific behaviour. HUP handling should
>>>>> be better.
>>>>
>>>> Some variation on, if HUP'ed, it spins until the opcode give up.
>>>
>>> Took a quick look, and we don't even get a HUP, the hangup side
>>> ends up with a 0 mask. Which is less than useful... I'll keep
>>> digging.
>>
>> How about something like this? Will only retry if hup was seen, and
>> there are multiple refs. Avoids re-iterating for eg multiple POLLIN
>> wakes, which should be the common hot path if v & IO_POLL_REF_MASK != 1.
>> Keeps it local too.
> 
> HUP handling is just a hack, it'd be best to avoid complicating

It is, that's why I wasn't a huge fan of gating on that in the first
place!

> the pool loop logic for that (and those continue do).
> 
> io_poll_loop_retry() {
>     ...
>     atomic_or(IO_POLL_RETRY_FLAG, &req->poll_refs);
> }
> if (req->cqe.res & (POLLHUP | POLLRDHUP))
>     io_poll_loop_retry();
> 
> 
> Can we isolate it like this? Nobody should care about extra
> atomics for this case.

It's not the extra atomic, I agree it doesn't matter for this non-hot
case. It's more that setting the retry flag isn't enough, you need to
have it do another loop at that point. And if you just set that, then
it'll drop all refs and happily return and wait for the next poll
trigger that won't happen past HUP.

It's not impossible we can make this work, but I don't see a great way
of doing it without messing with the loop like I did :/. Or checking
->poll_refs again, and at that point the separate variable is better
imho.

>> diff --git a/io_uring/poll.c b/io_uring/poll.c
>> index b671b84657d9..bd79a04a2c59 100644
>> --- a/io_uring/poll.c
>> +++ b/io_uring/poll.c
>> @@ -228,6 +228,16 @@ static inline void io_poll_execute(struct io_kiocb *req, int res)
>>           __io_poll_execute(req, res);
>>   }
>>   +static inline bool io_poll_loop_retry(struct io_kiocb *req, int v)
>> +{
>> +    if (req->opcode == IORING_OP_POLL_ADD)
>> +        return false;
>> +    /* multiple refs and HUP, ensure we loop once more */
>> +    if (v != 1 && req->cqe.res & (POLLHUP | POLLRDHUP))
> 
> v != 1 looks suspicious, at this stage it's hard to trace what
> io_recv_finish() is really doing, but better to drop the check.
>  
> req->cqe.res should already be in a register, makes more sense
> to gate on that first.

Yeah that's probably fine, I mainly wanted to avoid hitting any of this
unless we had more than one event queued. ->cqe.res should be in a
register, but 'v' I strongly suspect 'v' will be too as it's used
throughout too.

-- 
Jens Axboe

next prev parent reply	other threads:[~2026-03-16 22:31 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-15 16:19 [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race Jens Axboe
2026-03-16 14:17 ` Pavel Begunkov
2026-03-16 14:28   ` Jens Axboe
2026-03-16 14:40     ` Pavel Begunkov
2026-03-16 14:44       ` Pavel Begunkov
2026-03-16 15:16         ` Jens Axboe
2026-03-16 18:40           ` Jens Axboe
2026-03-16 22:24             ` Pavel Begunkov
2026-03-16 22:31               ` Jens Axboe [this message]
2026-03-16 23:08                 ` Pavel Begunkov
2026-03-17  1:14                   ` Jens Axboe
2026-03-17  1:36                     ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=be88cb1c-9e08-49dd-9671-1d3887918935@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=asml.silence@gmail.com \
    --cc=io-uring@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox