public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
@ 2026-03-15 16:19 Jens Axboe
  2026-03-16 14:17 ` Pavel Begunkov
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-03-15 16:19 UTC (permalink / raw)
  To: io-uring; +Cc: francis

When a socket send and shutdown() happen back-to-back, both fire
wake-ups before the receiver's task_work has a chance to run. The first
wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
When io_poll_check_events() runs, it calls io_poll_issue() which does a
recv that reads the data and returns IOU_RETRY. The loop then drains all
accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
the first event was consumed. Since the shutdown is a persistent state
change, no further wakeups will happen, and the multishot recv can hang
forever.

Fix this by only draining a single poll ref after io_poll_issue()
returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
raced in (poll_refs was > 1), the loop iterates again and vfs_poll()
discovers the remaining state.

Move the v &= IO_POLL_REF_MASK (drain all refs) into the non-APOLL
multishot poll path, since poll CQEs report the current mask state
rather than consuming individual events.
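The ref-drain arithmetic can be sketched with a standalone model (hypothetical C, not the kernel code; check_events() and the pending-event counter only stand in for io_poll_check_events() and the queued wakeups):

```c
/*
 * Schematic model of the poll_refs drain, not the kernel code: each
 * queued wakeup contributes one ref, and each loop iteration consumes
 * at most one event. Draining every accumulated ref after a single
 * consumed event exits the loop with the second event still pending.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int poll_refs;

/* returns loop iterations; *pending counts not-yet-consumed events */
static int check_events(int *pending, bool drain_all)
{
	int iterations = 0;
	int v;

	do {
		v = atomic_load(&poll_refs);
		if (*pending > 0)	/* io_poll_issue() consumes one event */
			(*pending)--;
		iterations++;
		if (!drain_all)
			v = 1;		/* proposed fix: drain a single ref */
	} while (atomic_fetch_sub(&poll_refs, v) - v > 0);

	return iterations;
}
```

With two queued refs, draining them all exits after a single consumed event; draining one per iteration loops once more and picks up the EOF.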

Cc: stable@vger.kernel.org
Fixes: dbc2564cfe0f ("io_uring: let fast poll support multishot")
Reported-by: francis <francis@brosseau.dev>
Link: https://github.com/axboe/liburing/issues/1549
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

diff --git a/io_uring/poll.c b/io_uring/poll.c
index b671b84657d9..0f0949d919e9 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -303,6 +303,7 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 				io_req_set_res(req, mask, 0);
 				return IOU_POLL_REMOVE_POLL_USE_RES;
 			}
+			v &= IO_POLL_REF_MASK;
 		} else {
 			int ret = io_poll_issue(req, tw);
 
@@ -312,6 +313,11 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 				return IOU_POLL_REQUEUE;
 			if (ret != IOU_RETRY && ret < 0)
 				return ret;
+			/*
+			 * One event consumed, but additional wakes may have
+			 * raced. Only drain a single ref.
+			 */
+			v = 1;
 		}
 
 		/* force the next iteration to vfs_poll() */
@@ -321,7 +327,6 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 		 * Release all references, retry if someone tried to restart
 		 * task_work while we were executing it.
 		 */
-		v &= IO_POLL_REF_MASK;
 	} while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);
 
 	io_napi_add(req);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-15 16:19 [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race Jens Axboe
@ 2026-03-16 14:17 ` Pavel Begunkov
  2026-03-16 14:28   ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2026-03-16 14:17 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: francis

On 3/15/26 16:19, Jens Axboe wrote:
> When a socket send and shutdown() happen back-to-back, both fire
> wake-ups before the receiver's task_work has a chance to run. The first
> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
> When io_poll_check_events() runs, it calls io_poll_issue() which does a
> recv that reads the data and returns IOU_RETRY. The loop then drains all
> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
> the first event was consumed. Since the shutdown is a persistent state
> change, no further wakeups will happen, and the multishot recv can hang
> forever.
> 
> Fix this by only draining a single poll ref after io_poll_issue()
> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
> discovers the remaining state.

How often will it iterate with no effect for normal execution (i.e.
no shutdown)? And how costly will it be? Why not handle HUP instead?

-- 
Pavel Begunkov



* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 14:17 ` Pavel Begunkov
@ 2026-03-16 14:28   ` Jens Axboe
  2026-03-16 14:40     ` Pavel Begunkov
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-03-16 14:28 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: francis

On 3/16/26 8:17 AM, Pavel Begunkov wrote:
> On 3/15/26 16:19, Jens Axboe wrote:
>> When a socket send and shutdown() happen back-to-back, both fire
>> wake-ups before the receiver's task_work has a chance to run. The first
>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>> the first event was consumed. Since the shutdown is a persistent state
>> change, no further wakeups will happen, and the multishot recv can hang
>> forever.
>>
>> Fix this by only draining a single poll ref after io_poll_issue()
>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>> discovers the remaining state.
> 
> How often will iterate with no effect for normal execution (i.e.
> no shutdown)? And how costly it'll be? Why not handle HUP instead?

That is my worry too. I spent a bit of time on it this morning to figure
out why this is a new issue, and traced it down to 6.16..6.17, and this
commit in particular:

commit df30285b3670bf52e1e5512e4d4482bec5e93c16
Author: Kuniyuki Iwashima <kuniyu@google.com>
Date:   Wed Jul 2 22:35:18 2025 +0000

    af_unix: Introduce SO_INQ.

which is not the first time I've had to fix fallout from that
commit. Need to dig a bit deeper. That said, I do also worry a bit about
missing events. Yes, if both poll triggers are of the same type, eg
POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
would anything else that needs separate handling for the trigger.

-- 
Jens Axboe


* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 14:28   ` Jens Axboe
@ 2026-03-16 14:40     ` Pavel Begunkov
  2026-03-16 14:44       ` Pavel Begunkov
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2026-03-16 14:40 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: francis

On 3/16/26 14:28, Jens Axboe wrote:
> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>> On 3/15/26 16:19, Jens Axboe wrote:
>>> When a socket send and shutdown() happen back-to-back, both fire
>>> wake-ups before the receiver's task_work has a chance to run. The first
>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>> the first event was consumed. Since the shutdown is a persistent state
>>> change, no further wakeups will happen, and the multishot recv can hang
>>> forever.
>>>
>>> Fix this by only draining a single poll ref after io_poll_issue()
>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>> discovers the remaining state.
>>
>> How often will iterate with no effect for normal execution (i.e.
>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
> 
> That is my worry too. I spent a bit of time on it this morning to figure
> out why this is a new issue, and traced it down to 6.16..6.17, and this
> commit in particular:
> 
> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
> Author: Kuniyuki Iwashima <kuniyu@google.com>
> Date:   Wed Jul 2 22:35:18 2025 +0000
> 
>      af_unix: Introduce SO_INQ.
> 
> which is then not the first time I've had to fix fallout from that
> commit. Need to dig a bit deeper. That said, I do also worry a bit about
> missing events. Yes if both poll triggers are of the same type, eg
> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
> would anything else where you'd need separate handling for the trigger.

Thinking more, I don't think the patch is correct either. Seems you
expect the last recv to return 0, but let's say you have 2 refs and
8K in the rx queue. The first recv call gets 4K b/c some allocation
fails. The 2nd recv call returns another 4K, and now you're in the
same situation as before.

You're trying to rely on a too specific behaviour. HUP handling should
be better. Ideally, the opcode would handle it, e.g. zcrx doesn't have
the problem IIUC, but that might be harder to do.
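Sketched as a standalone model (hypothetical C, not a kernel function), the partial-recv scenario looks like this:

```c
/*
 * Model of the partial-recv counterexample, not the kernel code: two
 * wakeups queue two refs, 8K sits in the rx queue, but an allocation
 * failure truncates each recv call to 4K. Draining one ref per recv
 * call (as the proposed patch does) still exhausts the refs before
 * the recv that would observe EOF ever runs.
 */
#include <stdbool.h>

static bool eof_missed(int poll_refs, int rx_bytes)
{
	bool saw_eof = false;

	while (poll_refs > 0) {
		if (rx_bytes > 0)
			rx_bytes -= 4096;	/* partial recv: only 4K per call */
		else
			saw_eof = true;		/* recv returns 0 once data is drained */
		poll_refs--;			/* one ref drained per recv call */
	}
	return !saw_eof;
}
```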

-- 
Pavel Begunkov



* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 14:40     ` Pavel Begunkov
@ 2026-03-16 14:44       ` Pavel Begunkov
  2026-03-16 15:16         ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2026-03-16 14:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: francis

On 3/16/26 14:40, Pavel Begunkov wrote:
> On 3/16/26 14:28, Jens Axboe wrote:
>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>> the first event was consumed. Since the shutdown is a persistent state
>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>> forever.
>>>>
>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>> discovers the remaining state.
>>>
>>> How often will iterate with no effect for normal execution (i.e.
>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>
>> That is my worry too. I spent a bit of time on it this morning to figure
>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>> commit in particular:
>>
>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>
>>      af_unix: Introduce SO_INQ.
>>
>> which is then not the first time I've had to fix fallout from that
>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>> missing events. Yes if both poll triggers are of the same type, eg
>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>> would anything else where you'd need separate handling for the trigger.
> 
> Thinking more, I don't think the patch is correct either. Seems you
> expect the last recv to return 0, but let's say you have 2 refs and
> 8K in the rx queue. The first recv call gets 4K b/c some allocation
> fails. The 2nd recv call returns another 4K, and now you're in the
> same situation as before.
> 
> You're trying to rely on a too specific behaviour. HUP handling should
> be better.

Some variation on: if HUP'ed, it spins until the opcode gives up.

diff --git a/io_uring/poll.c b/io_uring/poll.c
index b671b84657d9..3944deb55234 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -43,7 +43,8 @@ struct io_poll_table {
  
  #define IO_POLL_CANCEL_FLAG	BIT(31)
  #define IO_POLL_RETRY_FLAG	BIT(30)
-#define IO_POLL_REF_MASK	GENMASK(29, 0)
+#define IO_POLL_HUP_FLAG	BIT(29)
+#define IO_POLL_REF_MASK	GENMASK(28, 0)
  
  /*
   * We usually have 1-2 refs taken, 128 is more than enough and we want to
@@ -272,6 +273,8 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
  				atomic_andnot(IO_POLL_RETRY_FLAG, &req->poll_refs);
  				v &= ~IO_POLL_RETRY_FLAG;
  			}
+			if (v & IO_POLL_HUP_FLAG)
+				atomic_or(IO_POLL_RETRY_FLAG, &req->poll_refs);
  		}
  
  		/* the mask was stashed in __io_poll_execute */
@@ -390,6 +393,14 @@ static __cold int io_pollfree_wake(struct io_kiocb *req, struct io_poll *poll)
  	return 1;
  }
  
+static void io_handle_hup(struct io_kiocb *req, struct io_poll *poll)
+{
+	if (req->opcode == IORING_OP_POLL_ADD)
+		return;
+	if (poll->events & (POLLIN|EPOLLRDNORM))
+		atomic_or(IO_POLL_HUP_FLAG, &req->poll_refs);
+}
+
  static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
  			void *key)
  {
@@ -397,8 +408,12 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
  	struct io_poll *poll = container_of(wait, struct io_poll, wait);
  	__poll_t mask = key_to_poll(key);
  
-	if (unlikely(mask & POLLFREE))
-		return io_pollfree_wake(req, poll);
+	if (unlikely(mask & (POLLFREE|POLLHUP))) {
+		if (mask & POLLFREE)
+			return io_pollfree_wake(req, poll);
+		if (mask & POLLHUP)
+			io_handle_hup(req, poll);
+	}
  
  	/* for instances that support it check for an event match first */
  	if (mask && !(mask & (poll->events & ~IO_ASYNC_POLL_COMMON)))

-- 
Pavel Begunkov



* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 14:44       ` Pavel Begunkov
@ 2026-03-16 15:16         ` Jens Axboe
  2026-03-16 18:40           ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-03-16 15:16 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: francis

On 3/16/26 8:44 AM, Pavel Begunkov wrote:
> On 3/16/26 14:40, Pavel Begunkov wrote:
>> On 3/16/26 14:28, Jens Axboe wrote:
>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>> forever.
>>>>>
>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>> discovers the remaining state.
>>>>
>>>> How often will iterate with no effect for normal execution (i.e.
>>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>>
>>> That is my worry too. I spent a bit of time on it this morning to figure
>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>> commit in particular:
>>>
>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>
>>>      af_unix: Introduce SO_INQ.
>>>
>>> which is then not the first time I've had to fix fallout from that
>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>> missing events. Yes if both poll triggers are of the same type, eg
>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>> would anything else where you'd need separate handling for the trigger.
>>
>> Thinking more, I don't think the patch is correct either. Seems you
>> expect the last recv to return 0, but let's say you have 2 refs and
>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>> fails. The 2nd recv call returns another 4K, and now you're in the
>> same situation as before.
>>
>> You're trying to rely on a too specific behaviour. HUP handling should
>> be better.
> 
> Some variation on, if HUP'ed, it spins until the opcode give up.

Took a quick look, and we don't even get a HUP; the hangup side
ends up with a 0 mask. Which is less than useful... I'll keep
digging.

-- 
Jens Axboe



* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 15:16         ` Jens Axboe
@ 2026-03-16 18:40           ` Jens Axboe
  2026-03-16 22:24             ` Pavel Begunkov
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-03-16 18:40 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: francis

On 3/16/26 9:16 AM, Jens Axboe wrote:
> On 3/16/26 8:44 AM, Pavel Begunkov wrote:
>> On 3/16/26 14:40, Pavel Begunkov wrote:
>>> On 3/16/26 14:28, Jens Axboe wrote:
>>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>>> forever.
>>>>>>
>>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>>> discovers the remaining state.
>>>>>
>>>>> How often will iterate with no effect for normal execution (i.e.
>>>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>>>
>>>> That is my worry too. I spent a bit of time on it this morning to figure
>>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>>> commit in particular:
>>>>
>>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>>
>>>>      af_unix: Introduce SO_INQ.
>>>>
>>>> which is then not the first time I've had to fix fallout from that
>>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>>> missing events. Yes if both poll triggers are of the same type, eg
>>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>>> would anything else where you'd need separate handling for the trigger.
>>>
>>> Thinking more, I don't think the patch is correct either. Seems you
>>> expect the last recv to return 0, but let's say you have 2 refs and
>>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>>> fails. The 2nd recv call returns another 4K, and now you're in the
>>> same situation as before.
>>>
>>> You're trying to rely on a too specific behaviour. HUP handling should
>>> be better.
>>
>> Some variation on, if HUP'ed, it spins until the opcode give up.
> 
> Took a quick look, and we don't even get a HUP, the hangup side
> ends up with a 0 mask. Which is less than useful... I'll keep
> digging.

How about something like this? Will only retry if hup was seen, and
there are multiple refs. Avoids re-iterating for eg multiple POLLIN
wakes, which should be the common hot path if v & IO_POLL_REF_MASK != 1.
Keeps it local too.

diff --git a/io_uring/poll.c b/io_uring/poll.c
index b671b84657d9..bd79a04a2c59 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -228,6 +228,16 @@ static inline void io_poll_execute(struct io_kiocb *req, int res)
 		__io_poll_execute(req, res);
 }
 
+static inline bool io_poll_loop_retry(struct io_kiocb *req, int v)
+{
+	if (req->opcode == IORING_OP_POLL_ADD)
+		return false;
+	/* multiple refs and HUP, ensure we loop once more */
+	if (v != 1 && req->cqe.res & (POLLHUP | POLLRDHUP))
+		return true;
+	return false;
+}
+
 /*
  * All poll tw should go through this. Checks for poll events, manages
  * references, does rewait, etc.
@@ -246,8 +256,9 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 		return -ECANCELED;
 
 	do {
-		v = atomic_read(&req->poll_refs);
+		bool retry = false;
 
+		v = atomic_read(&req->poll_refs);
 		if (unlikely(v != 1)) {
 			/* tw should be the owner and so have some refs */
 			if (WARN_ON_ONCE(!(v & IO_POLL_REF_MASK)))
@@ -287,13 +298,15 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 			if (unlikely(!req->cqe.res)) {
 				/* Multishot armed need not reissue */
 				if (!(req->apoll_events & EPOLLONESHOT))
-					continue;
+					goto finish;
 				return IOU_POLL_REISSUE;
 			}
 		}
 		if (req->apoll_events & EPOLLONESHOT)
 			return IOU_POLL_DONE;
 
+		retry = io_poll_loop_retry(req, v);
+
 		/* multishot, just fill a CQE and proceed */
 		if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
 			__poll_t mask = mangle_poll(req->cqe.res &
@@ -317,12 +330,17 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 		/* force the next iteration to vfs_poll() */
 		req->cqe.res = 0;
 
+		if (retry)
+			continue;
 		/*
 		 * Release all references, retry if someone tried to restart
 		 * task_work while we were executing it.
 		 */
+finish:
 		v &= IO_POLL_REF_MASK;
-	} while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);
+		if (!(atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK))
+			break;
+	} while (1);
 
 	io_napi_add(req);
 	return IOU_POLL_NO_ACTION;

-- 
Jens Axboe


* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 18:40           ` Jens Axboe
@ 2026-03-16 22:24             ` Pavel Begunkov
  2026-03-16 22:31               ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2026-03-16 22:24 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 3/16/26 18:40, Jens Axboe wrote:
> On 3/16/26 9:16 AM, Jens Axboe wrote:
>> On 3/16/26 8:44 AM, Pavel Begunkov wrote:
>>> On 3/16/26 14:40, Pavel Begunkov wrote:
>>>> On 3/16/26 14:28, Jens Axboe wrote:
>>>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>>>> forever.
>>>>>>>
>>>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>>>> discovers the remaining state.
>>>>>>
>>>>>> How often will iterate with no effect for normal execution (i.e.
>>>>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>>>>
>>>>> That is my worry too. I spent a bit of time on it this morning to figure
>>>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>>>> commit in particular:
>>>>>
>>>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>>>
>>>>>       af_unix: Introduce SO_INQ.
>>>>>
>>>>> which is then not the first time I've had to fix fallout from that
>>>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>>>> missing events. Yes if both poll triggers are of the same type, eg
>>>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>>>> would anything else where you'd need separate handling for the trigger.
>>>>
>>>> Thinking more, I don't think the patch is correct either. Seems you
>>>> expect the last recv to return 0, but let's say you have 2 refs and
>>>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>>>> fails. The 2nd recv call returns another 4K, and now you're in the
>>>> same situation as before.
>>>>
>>>> You're trying to rely on a too specific behaviour. HUP handling should
>>>> be better.
>>>
>>> Some variation on, if HUP'ed, it spins until the opcode give up.
>>
>> Took a quick look, and we don't even get a HUP, the hangup side
>> ends up with a 0 mask. Which is less than useful... I'll keep
>> digging.
> 
> How about something like this? Will only retry if hup was seen, and
> there are multiple refs. Avoids re-iterating for eg multiple POLLIN
> wakes, which should be the common hot path if v & IO_POLL_REF_MASK != 1.
> Keeps it local too.

HUP handling is just a hack, it'd be best to avoid complicating
the poll loop logic for that (and those continues do).

io_poll_loop_retry() {
	...
	atomic_or(IO_POLL_RETRY_FLAG, &req->poll_refs);
}
if (req->cqe.res & (POLLHUP | POLLRDHUP))
	io_poll_loop_retry();


Can we isolate it like this? Nobody should care about extra
atomics for this case.


> diff --git a/io_uring/poll.c b/io_uring/poll.c
> index b671b84657d9..bd79a04a2c59 100644
> --- a/io_uring/poll.c
> +++ b/io_uring/poll.c
> @@ -228,6 +228,16 @@ static inline void io_poll_execute(struct io_kiocb *req, int res)
>   		__io_poll_execute(req, res);
>   }
>   
> +static inline bool io_poll_loop_retry(struct io_kiocb *req, int v)
> +{
> +	if (req->opcode == IORING_OP_POLL_ADD)
> +		return false;
> +	/* multiple refs and HUP, ensure we loop once more */
> +	if (v != 1 && req->cqe.res & (POLLHUP | POLLRDHUP))

v != 1 looks suspicious; at this stage it's hard to trace what
io_recv_finish() is really doing, so it's better to drop the check.

req->cqe.res should already be in a register, so it makes more sense
to gate on that first.

-- 
Pavel Begunkov



* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 22:24             ` Pavel Begunkov
@ 2026-03-16 22:31               ` Jens Axboe
  2026-03-16 23:08                 ` Pavel Begunkov
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-03-16 22:31 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/16/26 4:24 PM, Pavel Begunkov wrote:
> On 3/16/26 18:40, Jens Axboe wrote:
>> On 3/16/26 9:16 AM, Jens Axboe wrote:
>>> On 3/16/26 8:44 AM, Pavel Begunkov wrote:
>>>> On 3/16/26 14:40, Pavel Begunkov wrote:
>>>>> On 3/16/26 14:28, Jens Axboe wrote:
>>>>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>>>>> forever.
>>>>>>>>
>>>>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>>>>> discovers the remaining state.
>>>>>>>
>>>>>>> How often will iterate with no effect for normal execution (i.e.
>>>>>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>>>>>
>>>>>> That is my worry too. I spent a bit of time on it this morning to figure
>>>>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>>>>> commit in particular:
>>>>>>
>>>>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>>>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>>>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>>>>
>>>>>>       af_unix: Introduce SO_INQ.
>>>>>>
>>>>>> which is then not the first time I've had to fix fallout from that
>>>>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>>>>> missing events. Yes if both poll triggers are of the same type, eg
>>>>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>>>>> would anything else where you'd need separate handling for the trigger.
>>>>>
>>>>> Thinking more, I don't think the patch is correct either. Seems you
>>>>> expect the last recv to return 0, but let's say you have 2 refs and
>>>>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>>>>> fails. The 2nd recv call returns another 4K, and now you're in the
>>>>> same situation as before.
>>>>>
>>>>> You're trying to rely on a too specific behaviour. HUP handling should
>>>>> be better.
>>>>
>>>> Some variation on, if HUP'ed, it spins until the opcode give up.
>>>
>>> Took a quick look, and we don't even get a HUP, the hangup side
>>> ends up with a 0 mask. Which is less than useful... I'll keep
>>> digging.
>>
>> How about something like this? Will only retry if hup was seen, and
>> there are multiple refs. Avoids re-iterating for eg multiple POLLIN
>> wakes, which should be the common hot path if v & IO_POLL_REF_MASK != 1.
>> Keeps it local too.
> 
> HUP handling is just a hack, it'd be best to avoid complicating

It is, that's why I wasn't a huge fan of gating on that in the first
place!

> the poll loop logic for that (and those continues do).
> 
> io_poll_loop_retry() {
>     ...
>     atomic_or(IO_POLL_RETRY_FLAG, &req->poll_refs);
> }
> if (req->cqe.res & (POLLHUP | POLLRDHUP))
>     io_poll_loop_retry();
> 
> 
> Can we isolate it like this? Nobody should care about extra
> atomics for this case.

It's not the extra atomic, I agree it doesn't matter for this non-hot
case. It's more that setting the retry flag isn't enough, you need to
have it do another loop at that point. And if you just set that, then
it'll drop all refs and happily return and wait for the next poll
trigger that won't happen past HUP.

It's not impossible we can make this work, but I don't see a great way
of doing it without messing with the loop like I did :/. Or checking
->poll_refs again, and at that point the separate variable is better
imho.
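The concern is visible in the exit condition's bit arithmetic; a sketch using the flag layout from the earlier diff (hypothetical standalone C, not the kernel code):

```c
/*
 * Why setting IO_POLL_RETRY_FLAG by itself can't force another loop:
 * the exit condition masks the flag bits away, so once the refs are
 * gone the loop ends whether or not the flag is set. Flag layout
 * taken from the diff earlier in the thread; not kernel code.
 */
#include <stdbool.h>

#define IO_POLL_CANCEL_FLAG	(1u << 31)
#define IO_POLL_RETRY_FLAG	(1u << 30)
#define IO_POLL_REF_MASK	((1u << 30) - 1)

/* models: atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK */
static bool loop_continues(unsigned int refs, unsigned int drained)
{
	return ((refs - drained) & IO_POLL_REF_MASK) != 0;
}
```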

>> diff --git a/io_uring/poll.c b/io_uring/poll.c
>> index b671b84657d9..bd79a04a2c59 100644
>> --- a/io_uring/poll.c
>> +++ b/io_uring/poll.c
>> @@ -228,6 +228,16 @@ static inline void io_poll_execute(struct io_kiocb *req, int res)
>>           __io_poll_execute(req, res);
>>   }
>>   +static inline bool io_poll_loop_retry(struct io_kiocb *req, int v)
>> +{
>> +    if (req->opcode == IORING_OP_POLL_ADD)
>> +        return false;
>> +    /* multiple refs and HUP, ensure we loop once more */
>> +    if (v != 1 && req->cqe.res & (POLLHUP | POLLRDHUP))
> 
> v != 1 looks suspicious, at this stage it's hard to trace what
> io_recv_finish() is really doing, but better to drop the check.
>  
> req->cqe.res should already be in a register, makes more sense
> to gate on that first.

Yeah that's probably fine, I mainly wanted to avoid hitting any of this
unless we had more than one event queued. ->cqe.res should be in a
register, but I strongly suspect 'v' will be too, as it's used
throughout.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 22:31               ` Jens Axboe
@ 2026-03-16 23:08                 ` Pavel Begunkov
  2026-03-17  1:14                   ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2026-03-16 23:08 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 3/16/26 22:31, Jens Axboe wrote:
> On 3/16/26 4:24 PM, Pavel Begunkov wrote:
>> On 3/16/26 18:40, Jens Axboe wrote:
>>> On 3/16/26 9:16 AM, Jens Axboe wrote:
>>>> On 3/16/26 8:44 AM, Pavel Begunkov wrote:
>>>>> On 3/16/26 14:40, Pavel Begunkov wrote:
>>>>>> On 3/16/26 14:28, Jens Axboe wrote:
>>>>>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>>>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>>>>>> forever.
>>>>>>>>>
>>>>>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>>>>>> discovers the remaining state.
>>>>>>>>
>>>>>>>> How often will it iterate with no effect for normal execution (i.e.
>>>>>>>> no shutdown)? And how costly will it be? Why not handle HUP instead?
>>>>>>>
>>>>>>> That is my worry too. I spent a bit of time on it this morning to figure
>>>>>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>>>>>> commit in particular:
>>>>>>>
>>>>>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>>>>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>>>>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>>>>>
>>>>>>>        af_unix: Introduce SO_INQ.
>>>>>>>
>>>>>>> which is not the first time I've had to fix fallout from that
>>>>>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>>>>>> missing events. Yes if both poll triggers are of the same type, eg
>>>>>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>>>>>> would anything else where you'd need separate handling for the trigger.
>>>>>>
>>>>>> Thinking more, I don't think the patch is correct either. Seems you
>>>>>> expect the last recv to return 0, but let's say you have 2 refs and
>>>>>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>>>>>> fails. The 2nd recv call returns another 4K, and now you're in the
>>>>>> same situation as before.
>>>>>>
>>>>>> You're trying to rely on too specific a behaviour. HUP handling should
>>>>>> be better.
>>>>>
>>>>> Some variation on, if HUP'ed, it spins until the opcode gives up.
>>>>
>>>> Took a quick look, and we don't even get a HUP, the hangup side
>>>> ends up with a 0 mask. Which is less than useful... I'll keep
>>>> digging.
>>>
>>> How about something like this? Will only retry if hup was seen, and
>>> there are multiple refs. Avoids re-iterating for eg multiple POLLIN
>>> wakes, which should be the common hot path if v & IO_POLL_REF_MASK != 1.
>>> Keeps it local too.
>>
>> HUP handling is just a hack, it'd be best to avoid complicating
> 
> It is, that's why I wasn't a huge fan of gating on that in the first
> place!
> 
>> the poll loop logic for that (and those continue statements do).
>>
>> io_poll_loop_retry() {
>>      ...
>>      atomic_or(IO_POLL_RETRY_FLAG, &req->poll_refs);
>> }
>> if (req->cqe.res & (POLLHUP | POLLRDHUP))
>>      io_poll_loop_retry();
>>
>>
>> Can we isolate it like this? Nobody should care about extra
>> atomics for this case.
> 
> It's not the extra atomic, I agree it doesn't matter for this non-hot
> case. It's more that setting the retry flag isn't enough, you need to
> have it do another loop at that point. And if you just set that, then
> it'll drop all refs and happily return and wait for the next poll
> trigger that won't happen past HUP.

It was actually supposed to force it into another iteration, but
I see what you're saying. I'll take a look at this part tomorrow.

> It's not impossible we can make this work, but I don't see a great way
> of doing it without messing with the loop like I did :/. Or checking
> ->poll_refs again, and at that point the separate variable is better
> imho.
-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 23:08                 ` Pavel Begunkov
@ 2026-03-17  1:14                   ` Jens Axboe
  2026-03-17  1:36                     ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-03-17  1:14 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/16/26 5:08 PM, Pavel Begunkov wrote:
> On 3/16/26 22:31, Jens Axboe wrote:
>> On 3/16/26 4:24 PM, Pavel Begunkov wrote:
>>> On 3/16/26 18:40, Jens Axboe wrote:
>>>> On 3/16/26 9:16 AM, Jens Axboe wrote:
>>>>> On 3/16/26 8:44 AM, Pavel Begunkov wrote:
>>>>>> On 3/16/26 14:40, Pavel Begunkov wrote:
>>>>>>> On 3/16/26 14:28, Jens Axboe wrote:
>>>>>>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>>>>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>>>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>>>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>>>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>>>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>>>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>>>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>>>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>>>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>>>>>>> forever.
>>>>>>>>>>
>>>>>>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>>>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>>>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>>>>>>> discovers the remaining state.
>>>>>>>>>
>>>>>>>>> How often will it iterate with no effect for normal execution (i.e.
>>>>>>>>> no shutdown)? And how costly will it be? Why not handle HUP instead?
>>>>>>>>
>>>>>>>> That is my worry too. I spent a bit of time on it this morning to figure
>>>>>>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>>>>>>> commit in particular:
>>>>>>>>
>>>>>>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>>>>>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>>>>>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>>>>>>
>>>>>>>>        af_unix: Introduce SO_INQ.
>>>>>>>>
>>>>>>>> which is not the first time I've had to fix fallout from that
>>>>>>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>>>>>>> missing events. Yes if both poll triggers are of the same type, eg
>>>>>>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>>>>>>> would anything else where you'd need separate handling for the trigger.
>>>>>>>
>>>>>>> Thinking more, I don't think the patch is correct either. Seems you
>>>>>>> expect the last recv to return 0, but let's say you have 2 refs and
>>>>>>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>>>>>>> fails. The 2nd recv call returns another 4K, and now you're in the
>>>>>>> same situation as before.
>>>>>>>
>>>>>>> You're trying to rely on too specific a behaviour. HUP handling should
>>>>>>> be better.
>>>>>>
>>>>>> Some variation on, if HUP'ed, it spins until the opcode gives up.
>>>>>
>>>>> Took a quick look, and we don't even get a HUP, the hangup side
>>>>> ends up with a 0 mask. Which is less than useful... I'll keep
>>>>> digging.
>>>>
>>>> How about something like this? Will only retry if hup was seen, and
>>>> there are multiple refs. Avoids re-iterating for eg multiple POLLIN
>>>> wakes, which should be the common hot path if v & IO_POLL_REF_MASK != 1.
>>>> Keeps it local too.
>>>
>>> HUP handling is just a hack, it'd be best to avoid complicating
>>
>> It is, that's why I wasn't a huge fan of gating on that in the first
>> place!
>>
>>> the poll loop logic for that (and those continue statements do).
>>>
>>> io_poll_loop_retry() {
>>>      ...
>>>      atomic_or(IO_POLL_RETRY_FLAG, &req->poll_refs);
>>> }
>>> if (req->cqe.res & (POLLHUP | POLLRDHUP))
>>>      io_poll_loop_retry();
>>>
>>>
>>> Can we isolate it like this? Nobody should care about extra
>>> atomics for this case.
>>
>> It's not the extra atomic, I agree it doesn't matter for this non-hot
>> case. It's more that setting the retry flag isn't enough, you need to
>> have it do another loop at that point. And if you just set that, then
>> it'll drop all refs and happily return and wait for the next poll
>> trigger that won't happen past HUP.
> 
> It was actually supposed to force it into another iteration, but
> I see what you're saying. I'll take a look at this part tomorrow

I mean, we can just reuse 'v' for this, and set IO_POLL_RETRY_FLAG.
There's no reason to set it in ->poll_refs as it's all local anyway.
Then that at least gets rid of the 'retry' variable.


diff --git a/io_uring/poll.c b/io_uring/poll.c
index aac4b3b881fb..f1a45e69b9af 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -228,6 +228,18 @@ static inline void io_poll_execute(struct io_kiocb *req, int res)
 		__io_poll_execute(req, res);
 }
 
+static inline void io_poll_check_retry(struct io_kiocb *req, int *v)
+{
+	if (req->opcode == IORING_OP_POLL_ADD)
+		return;
+	if (!(req->cqe.res & (POLLHUP | POLLRDHUP)))
+		return;
+	if (*v == 1)
+		return;
+	/* multiple refs and HUP, ensure we loop once more */
+	*v |= IO_POLL_RETRY_FLAG;
+}
+
 /*
  * All poll tw should go through this. Checks for poll events, manages
  * references, does rewait, etc.
@@ -287,13 +299,15 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 			if (unlikely(!req->cqe.res)) {
 				/* Multishot armed need not reissue */
 				if (!(req->apoll_events & EPOLLONESHOT))
-					continue;
+					goto finish;
 				return IOU_POLL_REISSUE;
 			}
 		}
 		if (req->apoll_events & EPOLLONESHOT)
 			return IOU_POLL_DONE;
 
+		io_poll_check_retry(req, &v);
+
 		/* multishot, just fill a CQE and proceed */
 		if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
 			__poll_t mask = mangle_poll(req->cqe.res &
@@ -317,12 +331,17 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 		/* force the next iteration to vfs_poll() */
 		req->cqe.res = 0;
 
+		if (v & IO_POLL_RETRY_FLAG)
+			continue;
 		/*
 		 * Release all references, retry if someone tried to restart
 		 * task_work while we were executing it.
 		 */
+finish:
 		v &= IO_POLL_REF_MASK;
-	} while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);
+		if (!(atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK))
+			break;
+	} while (1);
 
 	io_napi_add(req);
 	return IOU_POLL_NO_ACTION;

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-17  1:14                   ` Jens Axboe
@ 2026-03-17  1:36                     ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2026-03-17  1:36 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/16/26 7:14 PM, Jens Axboe wrote:
> On 3/16/26 5:08 PM, Pavel Begunkov wrote:
>> On 3/16/26 22:31, Jens Axboe wrote:
>>> On 3/16/26 4:24 PM, Pavel Begunkov wrote:
>>>> On 3/16/26 18:40, Jens Axboe wrote:
>>>>> On 3/16/26 9:16 AM, Jens Axboe wrote:
>>>>>> On 3/16/26 8:44 AM, Pavel Begunkov wrote:
>>>>>>> On 3/16/26 14:40, Pavel Begunkov wrote:
>>>>>>>> On 3/16/26 14:28, Jens Axboe wrote:
>>>>>>>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>>>>>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>>>>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>>>>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>>>>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>>>>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>>>>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>>>>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>>>>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>>>>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>>>>>>>> forever.
>>>>>>>>>>>
>>>>>>>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>>>>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>>>>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>>>>>>>> discovers the remaining state.
>>>>>>>>>>
>>>>>>>>>> How often will it iterate with no effect for normal execution (i.e.
>>>>>>>>>> no shutdown)? And how costly will it be? Why not handle HUP instead?
>>>>>>>>>
>>>>>>>>> That is my worry too. I spent a bit of time on it this morning to figure
>>>>>>>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>>>>>>>> commit in particular:
>>>>>>>>>
>>>>>>>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>>>>>>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>>>>>>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>>>>>>>
>>>>>>>>>        af_unix: Introduce SO_INQ.
>>>>>>>>>
>>>>>>>>> which is not the first time I've had to fix fallout from that
>>>>>>>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>>>>>>>> missing events. Yes if both poll triggers are of the same type, eg
>>>>>>>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>>>>>>>> would anything else where you'd need separate handling for the trigger.
>>>>>>>>
>>>>>>>> Thinking more, I don't think the patch is correct either. Seems you
>>>>>>>> expect the last recv to return 0, but let's say you have 2 refs and
>>>>>>>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>>>>>>>> fails. The 2nd recv call returns another 4K, and now you're in the
>>>>>>>> same situation as before.
>>>>>>>>
>>>>>>>> You're trying to rely on too specific a behaviour. HUP handling should
>>>>>>>> be better.
>>>>>>>
>>>>>>> Some variation on, if HUP'ed, it spins until the opcode gives up.
>>>>>>
>>>>>> Took a quick look, and we don't even get a HUP, the hangup side
>>>>>> ends up with a 0 mask. Which is less than useful... I'll keep
>>>>>> digging.
>>>>>
>>>>> How about something like this? Will only retry if hup was seen, and
>>>>> there are multiple refs. Avoids re-iterating for eg multiple POLLIN
>>>>> wakes, which should be the common hot path if v & IO_POLL_REF_MASK != 1.
>>>>> Keeps it local too.
>>>>
>>>> HUP handling is just a hack, it'd be best to avoid complicating
>>>
>>> It is, that's why I wasn't a huge fan of gating on that in the first
>>> place!
>>>
>>>> the poll loop logic for that (and those continue statements do).
>>>>
>>>> io_poll_loop_retry() {
>>>>      ...
>>>>      atomic_or(IO_POLL_RETRY_FLAG, &req->poll_refs);
>>>> }
>>>> if (req->cqe.res & (POLLHUP | POLLRDHUP))
>>>>      io_poll_loop_retry();
>>>>
>>>>
>>>> Can we isolate it like this? Nobody should care about extra
>>>> atomics for this case.
>>>
>>> It's not the extra atomic, I agree it doesn't matter for this non-hot
>>> case. It's more that setting the retry flag isn't enough, you need to
>>> have it do another loop at that point. And if you just set that, then
>>> it'll drop all refs and happily return and wait for the next poll
>>> trigger that won't happen past HUP.
>>
>> It was actually supposed to force it into another iteration, but
>> I see what you're saying. I'll take a look at this part tomorrow.
> 
> I mean, we can just reuse 'v' for this, and set IO_POLL_RETRY_FLAG.
> There's no reason to set it in ->poll_refs as it's all local anyway.
> Then that at least gets rid of the 'retry' variable.

This is as clean/simple as I can get it, and it's less text too.

diff --git a/io_uring/poll.c b/io_uring/poll.c
index aac4b3b881fb..06183cf60201 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -228,6 +228,22 @@ static inline void io_poll_execute(struct io_kiocb *req, int res)
 		__io_poll_execute(req, res);
 }
 
+static inline void io_poll_check_retry(struct io_kiocb *req, int *v)
+{
+	if (req->opcode == IORING_OP_POLL_ADD)
+		return;
+	if (!(req->cqe.res & (POLLHUP | POLLRDHUP)))
+		return;
+	/*
+	 * Release all references, retry if someone tried to restart
+	 * task_work while we were executing it.
+	 */
+	*v &= IO_POLL_REF_MASK;
+
+	/* multiple refs and HUP, ensure we loop once more */
+	(*v)--;
+}
+
 /*
  * All poll tw should go through this. Checks for poll events, manages
  * references, does rewait, etc.
@@ -294,6 +310,8 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 		if (req->apoll_events & EPOLLONESHOT)
 			return IOU_POLL_DONE;
 
+		io_poll_check_retry(req, &v);
+
 		/* multishot, just fill a CQE and proceed */
 		if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
 			__poll_t mask = mangle_poll(req->cqe.res &
@@ -316,12 +334,6 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 
 		/* force the next iteration to vfs_poll() */
 		req->cqe.res = 0;
-
-		/*
-		 * Release all references, retry if someone tried to restart
-		 * task_work while we were executing it.
-		 */
-		v &= IO_POLL_REF_MASK;
 	} while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);
 
 	io_napi_add(req);

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-03-17  1:36 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2026-03-15 16:19 [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race Jens Axboe
2026-03-16 14:17 ` Pavel Begunkov
2026-03-16 14:28   ` Jens Axboe
2026-03-16 14:40     ` Pavel Begunkov
2026-03-16 14:44       ` Pavel Begunkov
2026-03-16 15:16         ` Jens Axboe
2026-03-16 18:40           ` Jens Axboe
2026-03-16 22:24             ` Pavel Begunkov
2026-03-16 22:31               ` Jens Axboe
2026-03-16 23:08                 ` Pavel Begunkov
2026-03-17  1:14                   ` Jens Axboe
2026-03-17  1:36                     ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox