* [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
@ 2026-03-15 16:19 Jens Axboe
2026-03-16 14:17 ` Pavel Begunkov
0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-03-15 16:19 UTC (permalink / raw)
To: io-uring; +Cc: francis
When a socket send and shutdown() happen back-to-back, both fire
wake-ups before the receiver's task_work has a chance to run. The first
wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
When io_poll_check_events() runs, it calls io_poll_issue() which does a
recv that reads the data and returns IOU_RETRY. The loop then drains all
accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
the first event was consumed. Since the shutdown is a persistent state
change, no further wakeups will happen, and the multishot recv can hang
forever.
Fix this by only draining a single poll ref after io_poll_issue()
returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
discovers the remaining state.
Move the v &= IO_POLL_REF_MASK (drain all refs) into the non-APOLL
multishot poll path, since poll CQEs report the current mask state
rather than consuming individual events.
Cc: stable@vger.kernel.org
Fixes: dbc2564cfe0f ("io_uring: let fast poll support multishot")
Reported-by: francis <francis@brosseau.dev>
Link: https://github.com/axboe/liburing/issues/1549
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
diff --git a/io_uring/poll.c b/io_uring/poll.c
index b671b84657d9..0f0949d919e9 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -303,6 +303,7 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 				io_req_set_res(req, mask, 0);
 				return IOU_POLL_REMOVE_POLL_USE_RES;
 			}
+			v &= IO_POLL_REF_MASK;
 		} else {
 			int ret = io_poll_issue(req, tw);
@@ -312,6 +313,11 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 				return IOU_POLL_REQUEUE;
 			if (ret != IOU_RETRY && ret < 0)
 				return ret;
+			/*
+			 * One event consumed, but additional wakes may have
+			 * raced. Only drain a single ref.
+			 */
+			v = 1;
 		}
 
 		/* force the next iteration to vfs_poll() */
@@ -321,7 +327,6 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 		 * Release all references, retry if someone tried to restart
 		 * task_work while we were executing it.
 		 */
-		v &= IO_POLL_REF_MASK;
 	} while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);
 
 	io_napi_add(req);
--
Jens Axboe
^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-15 16:19 [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race Jens Axboe
@ 2026-03-16 14:17 ` Pavel Begunkov
  2026-03-16 14:28   ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2026-03-16 14:17 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: francis

On 3/15/26 16:19, Jens Axboe wrote:
> When a socket send and shutdown() happen back-to-back, both fire
> wake-ups before the receiver's task_work has a chance to run. The first
> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
> When io_poll_check_events() runs, it calls io_poll_issue() which does a
> recv that reads the data and returns IOU_RETRY. The loop then drains all
> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
> the first event was consumed. Since the shutdown is a persistent state
> change, no further wakeups will happen, and the multishot recv can hang
> forever.
>
> Fix this by only draining a single poll ref after io_poll_issue()
> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
> discovers the remaining state.

How often will iterate with no effect for normal execution (i.e.
no shutdown)? And how costly it'll be? Why not handle HUP instead?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 14:17 ` Pavel Begunkov
@ 2026-03-16 14:28   ` Jens Axboe
  2026-03-16 14:40     ` Pavel Begunkov
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-03-16 14:28 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: francis

On 3/16/26 8:17 AM, Pavel Begunkov wrote:
> On 3/15/26 16:19, Jens Axboe wrote:
>> When a socket send and shutdown() happen back-to-back, both fire
>> wake-ups before the receiver's task_work has a chance to run. The first
>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>> the first event was consumed. Since the shutdown is a persistent state
>> change, no further wakeups will happen, and the multishot recv can hang
>> forever.
>>
>> Fix this by only draining a single poll ref after io_poll_issue()
>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>> discovers the remaining state.
>
> How often will iterate with no effect for normal execution (i.e.
> no shutdown)? And how costly it'll be? Why not handle HUP instead?

That is my worry too. I spent a bit of time on it this morning to figure
out why this is a new issue, and traced it down to 6.16..6.17, and this
commit in particular:

commit df30285b3670bf52e1e5512e4d4482bec5e93c16
Author: Kuniyuki Iwashima <kuniyu@google.com>
Date:   Wed Jul 2 22:35:18 2025 +0000

    af_unix: Introduce SO_INQ.

which is then not the first time I've had to fix fallout from that
commit. Need to dig a bit deeper. That said, I do also worry a bit about
missing events. Yes if both poll triggers are of the same type, eg
POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
would anything else where you'd need separate handling for the trigger.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 14:28 ` Jens Axboe
@ 2026-03-16 14:40   ` Pavel Begunkov
  2026-03-16 14:44     ` Pavel Begunkov
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2026-03-16 14:40 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: francis

On 3/16/26 14:28, Jens Axboe wrote:
> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>> On 3/15/26 16:19, Jens Axboe wrote:
>>> When a socket send and shutdown() happen back-to-back, both fire
>>> wake-ups before the receiver's task_work has a chance to run. The first
>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>> the first event was consumed. Since the shutdown is a persistent state
>>> change, no further wakeups will happen, and the multishot recv can hang
>>> forever.
>>>
>>> Fix this by only draining a single poll ref after io_poll_issue()
>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>> discovers the remaining state.
>>
>> How often will iterate with no effect for normal execution (i.e.
>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>
> That is my worry too. I spent a bit of time on it this morning to figure
> out why this is a new issue, and traced it down to 6.16..6.17, and this
> commit in particular:
>
> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
> Author: Kuniyuki Iwashima <kuniyu@google.com>
> Date:   Wed Jul 2 22:35:18 2025 +0000
>
>     af_unix: Introduce SO_INQ.
>
> which is then not the first time I've had to fix fallout from that
> commit. Need to dig a bit deeper. That said, I do also worry a bit about
> missing events. Yes if both poll triggers are of the same type, eg
> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
> would anything else where you'd need separate handling for the trigger.

Thinking more, I don't think the patch is correct either. Seems you
expect the last recv to return 0, but let's say you have 2 refs and
8K in the rx queue. The first recv call gets 4K b/c some allocation
fails. The 2nd recv call returns another 4K, and now you're in the
same situation as before.

You're trying to rely on a too specific behaviour. HUP handling should
be better. Ideally, the opcode would handle it, e.g. zcrx doesn't have
the problem IIUC, but that might be harder to do.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 14:40 ` Pavel Begunkov
@ 2026-03-16 14:44   ` Pavel Begunkov
  2026-03-16 15:16     ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2026-03-16 14:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: francis

On 3/16/26 14:40, Pavel Begunkov wrote:
> On 3/16/26 14:28, Jens Axboe wrote:
>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>> the first event was consumed. Since the shutdown is a persistent state
>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>> forever.
>>>>
>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>> discovers the remaining state.
>>>
>>> How often will iterate with no effect for normal execution (i.e.
>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>
>> That is my worry too. I spent a bit of time on it this morning to figure
>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>> commit in particular:
>>
>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>
>>     af_unix: Introduce SO_INQ.
>>
>> which is then not the first time I've had to fix fallout from that
>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>> missing events. Yes if both poll triggers are of the same type, eg
>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>> would anything else where you'd need separate handling for the trigger.
>
> Thinking more, I don't think the patch is correct either. Seems you
> expect the last recv to return 0, but let's say you have 2 refs and
> 8K in the rx queue. The first recv call gets 4K b/c some allocation
> fails. The 2nd recv call returns another 4K, and now you're in the
> same situation as before.
>
> You're trying to rely on a too specific behaviour. HUP handling should
> be better.

Some variation on, if HUP'ed, it spins until the opcode give up.

diff --git a/io_uring/poll.c b/io_uring/poll.c
index b671b84657d9..3944deb55234 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -43,7 +43,8 @@ struct io_poll_table {
 
 #define IO_POLL_CANCEL_FLAG	BIT(31)
 #define IO_POLL_RETRY_FLAG	BIT(30)
-#define IO_POLL_REF_MASK	GENMASK(29, 0)
+#define IO_POLL_HUP_FLAG	BIT(29)
+#define IO_POLL_REF_MASK	GENMASK(28, 0)
 
 /*
  * We usually have 1-2 refs taken, 128 is more than enough and we want to
@@ -272,6 +273,8 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 			atomic_andnot(IO_POLL_RETRY_FLAG, &req->poll_refs);
 			v &= ~IO_POLL_RETRY_FLAG;
 		}
+		if (v & IO_POLL_HUP_FLAG)
+			atomic_or(IO_POLL_RETRY_FLAG, &req->poll_refs);
 	}
 
 	/* the mask was stashed in __io_poll_execute */
@@ -390,6 +393,14 @@ static __cold int io_pollfree_wake(struct io_kiocb *req, struct io_poll *poll)
 	return 1;
 }
 
+static void io_handle_hup(struct io_kiocb *req, struct io_poll *poll)
+{
+	if (req->opcode == IORING_OP_POLL_ADD)
+		return;
+	if (poll->events & (POLLIN|EPOLLRDNORM))
+		atomic_or(IO_POLL_HUP_FLAG, &req->poll_refs);
+}
+
 static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 			void *key)
 {
@@ -397,8 +408,12 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
 	struct io_poll *poll = container_of(wait, struct io_poll, wait);
 	__poll_t mask = key_to_poll(key);
 
-	if (unlikely(mask & POLLFREE))
-		return io_pollfree_wake(req, poll);
+	if (unlikely(mask & (POLLFREE|POLLHUP))) {
+		if (mask & POLLFREE)
+			return io_pollfree_wake(req, poll);
+		if (mask & POLLHUP)
+			io_handle_hup(req, poll);
+	}
 
 	/* for instances that support it check for an event match first */
 	if (mask && !(mask & (poll->events & ~IO_ASYNC_POLL_COMMON)))

-- 
Pavel Begunkov

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 14:44 ` Pavel Begunkov
@ 2026-03-16 15:16   ` Jens Axboe
  2026-03-16 18:40     ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-03-16 15:16 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: francis

On 3/16/26 8:44 AM, Pavel Begunkov wrote:
> On 3/16/26 14:40, Pavel Begunkov wrote:
>> On 3/16/26 14:28, Jens Axboe wrote:
>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>> forever.
>>>>>
>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>> discovers the remaining state.
>>>>
>>>> How often will iterate with no effect for normal execution (i.e.
>>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>>
>>> That is my worry too. I spent a bit of time on it this morning to figure
>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>> commit in particular:
>>>
>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>
>>>     af_unix: Introduce SO_INQ.
>>>
>>> which is then not the first time I've had to fix fallout from that
>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>> missing events. Yes if both poll triggers are of the same type, eg
>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>> would anything else where you'd need separate handling for the trigger.
>>
>> Thinking more, I don't think the patch is correct either. Seems you
>> expect the last recv to return 0, but let's say you have 2 refs and
>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>> fails. The 2nd recv call returns another 4K, and now you're in the
>> same situation as before.
>>
>> You're trying to rely on a too specific behaviour. HUP handling should
>> be better.
>
> Some variation on, if HUP'ed, it spins until the opcode give up.

Took a quick look, and we don't even get a HUP, the hangup side
ends up with a 0 mask. Which is less than useful... I'll keep
digging.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 15:16 ` Jens Axboe
@ 2026-03-16 18:40   ` Jens Axboe
  2026-03-16 22:24     ` Pavel Begunkov
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-03-16 18:40 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: francis

On 3/16/26 9:16 AM, Jens Axboe wrote:
> On 3/16/26 8:44 AM, Pavel Begunkov wrote:
>> On 3/16/26 14:40, Pavel Begunkov wrote:
>>> On 3/16/26 14:28, Jens Axboe wrote:
>>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>>> forever.
>>>>>>
>>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>>> discovers the remaining state.
>>>>>
>>>>> How often will iterate with no effect for normal execution (i.e.
>>>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>>>
>>>> That is my worry too. I spent a bit of time on it this morning to figure
>>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>>> commit in particular:
>>>>
>>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>>
>>>>     af_unix: Introduce SO_INQ.
>>>>
>>>> which is then not the first time I've had to fix fallout from that
>>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>>> missing events. Yes if both poll triggers are of the same type, eg
>>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>>> would anything else where you'd need separate handling for the trigger.
>>>
>>> Thinking more, I don't think the patch is correct either. Seems you
>>> expect the last recv to return 0, but let's say you have 2 refs and
>>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>>> fails. The 2nd recv call returns another 4K, and now you're in the
>>> same situation as before.
>>>
>>> You're trying to rely on a too specific behaviour. HUP handling should
>>> be better.
>>
>> Some variation on, if HUP'ed, it spins until the opcode give up.
>
> Took a quick look, and we don't even get a HUP, the hangup side
> ends up with a 0 mask. Which is less than useful... I'll keep
> digging.

How about something like this? Will only retry if hup was seen, and
there are multiple refs. Avoids re-iterating for eg multiple POLLIN
wakes, which should be the common hot path if v & IO_POLL_REF_MASK != 1.
Keeps it local too.

diff --git a/io_uring/poll.c b/io_uring/poll.c
index b671b84657d9..bd79a04a2c59 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -228,6 +228,16 @@ static inline void io_poll_execute(struct io_kiocb *req, int res)
 		__io_poll_execute(req, res);
 }
 
+static inline bool io_poll_loop_retry(struct io_kiocb *req, int v)
+{
+	if (req->opcode == IORING_OP_POLL_ADD)
+		return false;
+	/* multiple refs and HUP, ensure we loop once more */
+	if (v != 1 && req->cqe.res & (POLLHUP | POLLRDHUP))
+		return true;
+	return false;
+}
+
 /*
  * All poll tw should go through this. Checks for poll events, manages
  * references, does rewait, etc.
@@ -246,8 +256,9 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 		return -ECANCELED;
 
 	do {
-		v = atomic_read(&req->poll_refs);
+		bool retry = false;
 
+		v = atomic_read(&req->poll_refs);
 		if (unlikely(v != 1)) {
 			/* tw should be the owner and so have some refs */
 			if (WARN_ON_ONCE(!(v & IO_POLL_REF_MASK)))
@@ -287,13 +298,15 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 			if (unlikely(!req->cqe.res)) {
 				/* Multishot armed need not reissue */
 				if (!(req->apoll_events & EPOLLONESHOT))
-					continue;
+					goto finish;
 				return IOU_POLL_REISSUE;
 			}
 		}
 		if (req->apoll_events & EPOLLONESHOT)
 			return IOU_POLL_DONE;
 
+		retry = io_poll_loop_retry(req, v);
+
 		/* multishot, just fill a CQE and proceed */
 		if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
 			__poll_t mask = mangle_poll(req->cqe.res &
@@ -317,12 +330,17 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 
 		/* force the next iteration to vfs_poll() */
 		req->cqe.res = 0;
+		if (retry)
+			continue;
 
 		/*
 		 * Release all references, retry if someone tried to restart
 		 * task_work while we were executing it.
 		 */
+finish:
 		v &= IO_POLL_REF_MASK;
-	} while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);
+		if (!(atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK))
+			break;
+	} while (1);
 
 	io_napi_add(req);
 	return IOU_POLL_NO_ACTION;

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 18:40 ` Jens Axboe
@ 2026-03-16 22:24   ` Pavel Begunkov
  2026-03-16 22:31     ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2026-03-16 22:24 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 3/16/26 18:40, Jens Axboe wrote:
> On 3/16/26 9:16 AM, Jens Axboe wrote:
>> On 3/16/26 8:44 AM, Pavel Begunkov wrote:
>>> On 3/16/26 14:40, Pavel Begunkov wrote:
>>>> On 3/16/26 14:28, Jens Axboe wrote:
>>>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>>>> forever.
>>>>>>>
>>>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>>>> discovers the remaining state.
>>>>>>
>>>>>> How often will iterate with no effect for normal execution (i.e.
>>>>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>>>>
>>>>> That is my worry too. I spent a bit of time on it this morning to figure
>>>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>>>> commit in particular:
>>>>>
>>>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>>>
>>>>>     af_unix: Introduce SO_INQ.
>>>>>
>>>>> which is then not the first time I've had to fix fallout from that
>>>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>>>> missing events. Yes if both poll triggers are of the same type, eg
>>>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>>>> would anything else where you'd need separate handling for the trigger.
>>>>
>>>> Thinking more, I don't think the patch is correct either. Seems you
>>>> expect the last recv to return 0, but let's say you have 2 refs and
>>>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>>>> fails. The 2nd recv call returns another 4K, and now you're in the
>>>> same situation as before.
>>>>
>>>> You're trying to rely on a too specific behaviour. HUP handling should
>>>> be better.
>>>
>>> Some variation on, if HUP'ed, it spins until the opcode give up.
>>
>> Took a quick look, and we don't even get a HUP, the hangup side
>> ends up with a 0 mask. Which is less than useful... I'll keep
>> digging.
>
> How about something like this? Will only retry if hup was seen, and
> there are multiple refs. Avoids re-iterating for eg multiple POLLIN
> wakes, which should be the common hot path if v & IO_POLL_REF_MASK != 1.
> Keeps it local too.

HUP handling is just a hack, it'd be best to avoid complicating
the poll loop logic for that (and those continue do).

io_poll_loop_retry() {
	...
	atomic_or(IO_POLL_RETRY_FLAG, &req->poll_refs);
}

if (req->cqe.res & (POLLHUP | POLLRDHUP))
	io_poll_loop_retry();

Can we isolate it like this? Nobody should care about extra
atomics for this case.

> diff --git a/io_uring/poll.c b/io_uring/poll.c
> index b671b84657d9..bd79a04a2c59 100644
> --- a/io_uring/poll.c
> +++ b/io_uring/poll.c
> @@ -228,6 +228,16 @@ static inline void io_poll_execute(struct io_kiocb *req, int res)
>  		__io_poll_execute(req, res);
>  }
>  
> +static inline bool io_poll_loop_retry(struct io_kiocb *req, int v)
> +{
> +	if (req->opcode == IORING_OP_POLL_ADD)
> +		return false;
> +	/* multiple refs and HUP, ensure we loop once more */
> +	if (v != 1 && req->cqe.res & (POLLHUP | POLLRDHUP))

v != 1 looks suspicious, at this stage it's hard to trace what
io_recv_finish() is really doing, but better to drop the check.

req->cqe.res should already be in a register, makes more sense
to gate on that first.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 22:24 ` Pavel Begunkov
@ 2026-03-16 22:31   ` Jens Axboe
  2026-03-16 23:08     ` Pavel Begunkov
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-03-16 22:31 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/16/26 4:24 PM, Pavel Begunkov wrote:
> On 3/16/26 18:40, Jens Axboe wrote:
>> On 3/16/26 9:16 AM, Jens Axboe wrote:
>>> On 3/16/26 8:44 AM, Pavel Begunkov wrote:
>>>> On 3/16/26 14:40, Pavel Begunkov wrote:
>>>>> On 3/16/26 14:28, Jens Axboe wrote:
>>>>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>>>>> forever.
>>>>>>>>
>>>>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>>>>> discovers the remaining state.
>>>>>>>
>>>>>>> How often will iterate with no effect for normal execution (i.e.
>>>>>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>>>>>
>>>>>> That is my worry too. I spent a bit of time on it this morning to figure
>>>>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>>>>> commit in particular:
>>>>>>
>>>>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>>>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>>>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>>>>
>>>>>>     af_unix: Introduce SO_INQ.
>>>>>>
>>>>>> which is then not the first time I've had to fix fallout from that
>>>>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>>>>> missing events. Yes if both poll triggers are of the same type, eg
>>>>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>>>>> would anything else where you'd need separate handling for the trigger.
>>>>>
>>>>> Thinking more, I don't think the patch is correct either. Seems you
>>>>> expect the last recv to return 0, but let's say you have 2 refs and
>>>>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>>>>> fails. The 2nd recv call returns another 4K, and now you're in the
>>>>> same situation as before.
>>>>>
>>>>> You're trying to rely on a too specific behaviour. HUP handling should
>>>>> be better.
>>>>
>>>> Some variation on, if HUP'ed, it spins until the opcode give up.
>>>
>>> Took a quick look, and we don't even get a HUP, the hangup side
>>> ends up with a 0 mask. Which is less than useful... I'll keep
>>> digging.
>>
>> How about something like this? Will only retry if hup was seen, and
>> there are multiple refs. Avoids re-iterating for eg multiple POLLIN
>> wakes, which should be the common hot path if v & IO_POLL_REF_MASK != 1.
>> Keeps it local too.
>
> HUP handling is just a hack, it'd be best to avoid complicating

It is, that's why I wasn't a huge fan of gating on that in the first
place!

> the poll loop logic for that (and those continue do).
>
> io_poll_loop_retry() {
> 	...
> 	atomic_or(IO_POLL_RETRY_FLAG, &req->poll_refs);
> }
>
> if (req->cqe.res & (POLLHUP | POLLRDHUP))
> 	io_poll_loop_retry();
>
> Can we isolate it like this? Nobody should care about extra
> atomics for this case.

It's not the extra atomic, I agree it doesn't matter for this non-hot
case. It's more that setting the retry flag isn't enough, you need to
have it do another loop at that point. And if you just set that, then
it'll drop all refs and happily return and wait for the next poll
trigger that won't happen past HUP.

It's not impossible we can make this work, but I don't see a great way
of doing it without messing with the loop like I did :/. Or checking
->poll_refs again, and at that point the separate variable is better
imho.

>> diff --git a/io_uring/poll.c b/io_uring/poll.c
>> index b671b84657d9..bd79a04a2c59 100644
>> --- a/io_uring/poll.c
>> +++ b/io_uring/poll.c
>> @@ -228,6 +228,16 @@ static inline void io_poll_execute(struct io_kiocb *req, int res)
>>  		__io_poll_execute(req, res);
>>  }
>>  
>> +static inline bool io_poll_loop_retry(struct io_kiocb *req, int v)
>> +{
>> +	if (req->opcode == IORING_OP_POLL_ADD)
>> +		return false;
>> +	/* multiple refs and HUP, ensure we loop once more */
>> +	if (v != 1 && req->cqe.res & (POLLHUP | POLLRDHUP))
>
> v != 1 looks suspicious, at this stage it's hard to trace what
> io_recv_finish() is really doing, but better to drop the check.
>
> req->cqe.res should already be in a register, makes more sense
> to gate on that first.

Yeah that's probably fine, I mainly wanted to avoid hitting any of this
unless we had more than one event queued. ->cqe.res should be in a
register, but I strongly suspect 'v' will be too, as it's used
throughout.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 22:31             ` Jens Axboe
@ 2026-03-16 23:08               ` Pavel Begunkov
  2026-03-17  1:14                 ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Pavel Begunkov @ 2026-03-16 23:08 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 3/16/26 22:31, Jens Axboe wrote:
> On 3/16/26 4:24 PM, Pavel Begunkov wrote:
>> On 3/16/26 18:40, Jens Axboe wrote:
>>> On 3/16/26 9:16 AM, Jens Axboe wrote:
>>>> On 3/16/26 8:44 AM, Pavel Begunkov wrote:
>>>>> On 3/16/26 14:40, Pavel Begunkov wrote:
>>>>>> On 3/16/26 14:28, Jens Axboe wrote:
>>>>>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>>>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>>>>>> forever.
>>>>>>>>>
>>>>>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>>>>>> discovers the remaining state.
>>>>>>>>
>>>>>>>> How often will iterate with no effect for normal execution (i.e.
>>>>>>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>>>>>>
>>>>>>> That is my worry too. I spent a bit of time on it this morning to figure
>>>>>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>>>>>> commit in particular:
>>>>>>>
>>>>>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>>>>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>>>>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>>>>>
>>>>>>>     af_unix: Introduce SO_INQ.
>>>>>>>
>>>>>>> which is then not the first time I've had to fix fallout from that
>>>>>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>>>>>> missing events. Yes if both poll triggers are of the same type, eg
>>>>>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>>>>>> would anything else where you'd need separate handling for the trigger.
>>>>>>
>>>>>> Thinking more, I don't think the patch is correct either. Seems you
>>>>>> expect the last recv to return 0, but let's say you have 2 refs and
>>>>>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>>>>>> fails. The 2nd recv call returns another 4K, and now you're in the
>>>>>> same situation as before.
>>>>>>
>>>>>> You're trying to rely on a too specific behaviour. HUP handling should
>>>>>> be better.
>>>>>
>>>>> Some variation on, if HUP'ed, it spins until the opcode gives up.
>>>>
>>>> Took a quick look, and we don't even get a HUP, the hangup side
>>>> ends up with a 0 mask. Which is less than useful... I'll keep
>>>> digging.
>>>
>>> How about something like this? Will only retry if hup was seen, and
>>> there are multiple refs. Avoids re-iterating for eg multiple POLLIN
>>> wakes, which should be the common hot path if v & IO_POLL_REF_MASK != 1.
>>> Keeps it local too.
>>
>> HUP handling is just a hack, it'd be best to avoid complicating
>
> It is, that's why I wasn't a huge fan of gating on that in the first
> place!
>
>> the poll loop logic for that (and those continues do).
>>
>> io_poll_loop_retry() {
>>     ...
>>     atomic_or(IO_POLL_RETRY_FLAG, &req->poll_refs);
>> }
>> if (req->cqe.res & (POLLHUP | POLLRDHUP))
>>     io_poll_loop_retry();
>>
>>
>> Can we isolate it like this? Nobody should care about extra
>> atomics for this case.
>
> It's not the extra atomic, I agree it doesn't matter for this non-hot
> case. It's more that setting the retry flag isn't enough, you need to
> have it do another loop at that point. And if you just set that, then
> it'll drop all refs and happily return and wait for the next poll
> trigger that won't happen past HUP.

It was actually supposed to force it into another iteration, but
I see what you're saying. I'll take a look at this part tomorrow.

> It's not impossible we can make this work, but I don't see a great way
> of doing it without messing with the loop like I did :/. Or checking
> ->poll_refs again, and at that point the separate variable is better
> imho.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 12+ messages in thread
* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-16 23:08               ` Pavel Begunkov
@ 2026-03-17  1:14                 ` Jens Axboe
  2026-03-17  1:36                   ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2026-03-17 1:14 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/16/26 5:08 PM, Pavel Begunkov wrote:
> On 3/16/26 22:31, Jens Axboe wrote:
>> On 3/16/26 4:24 PM, Pavel Begunkov wrote:
>>> On 3/16/26 18:40, Jens Axboe wrote:
>>>> On 3/16/26 9:16 AM, Jens Axboe wrote:
>>>>> On 3/16/26 8:44 AM, Pavel Begunkov wrote:
>>>>>> On 3/16/26 14:40, Pavel Begunkov wrote:
>>>>>>> On 3/16/26 14:28, Jens Axboe wrote:
>>>>>>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>>>>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>>>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>>>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>>>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>>>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>>>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>>>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>>>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>>>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>>>>>>> forever.
>>>>>>>>>>
>>>>>>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>>>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>>>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>>>>>>> discovers the remaining state.
>>>>>>>>>
>>>>>>>>> How often will iterate with no effect for normal execution (i.e.
>>>>>>>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>>>>>>>
>>>>>>>> That is my worry too. I spent a bit of time on it this morning to figure
>>>>>>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>>>>>>> commit in particular:
>>>>>>>>
>>>>>>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>>>>>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>>>>>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>>>>>>
>>>>>>>>     af_unix: Introduce SO_INQ.
>>>>>>>>
>>>>>>>> which is then not the first time I've had to fix fallout from that
>>>>>>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>>>>>>> missing events. Yes if both poll triggers are of the same type, eg
>>>>>>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>>>>>>> would anything else where you'd need separate handling for the trigger.
>>>>>>>
>>>>>>> Thinking more, I don't think the patch is correct either. Seems you
>>>>>>> expect the last recv to return 0, but let's say you have 2 refs and
>>>>>>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>>>>>>> fails. The 2nd recv call returns another 4K, and now you're in the
>>>>>>> same situation as before.
>>>>>>>
>>>>>>> You're trying to rely on a too specific behaviour. HUP handling should
>>>>>>> be better.
>>>>>>
>>>>>> Some variation on, if HUP'ed, it spins until the opcode gives up.
>>>>>
>>>>> Took a quick look, and we don't even get a HUP, the hangup side
>>>>> ends up with a 0 mask. Which is less than useful... I'll keep
>>>>> digging.
>>>>
>>>> How about something like this? Will only retry if hup was seen, and
>>>> there are multiple refs. Avoids re-iterating for eg multiple POLLIN
>>>> wakes, which should be the common hot path if v & IO_POLL_REF_MASK != 1.
>>>> Keeps it local too.
>>>
>>> HUP handling is just a hack, it'd be best to avoid complicating
>>
>> It is, that's why I wasn't a huge fan of gating on that in the first
>> place!
>>
>>> the poll loop logic for that (and those continues do).
>>>
>>> io_poll_loop_retry() {
>>>     ...
>>>     atomic_or(IO_POLL_RETRY_FLAG, &req->poll_refs);
>>> }
>>> if (req->cqe.res & (POLLHUP | POLLRDHUP))
>>>     io_poll_loop_retry();
>>>
>>>
>>> Can we isolate it like this? Nobody should care about extra
>>> atomics for this case.
>>
>> It's not the extra atomic, I agree it doesn't matter for this non-hot
>> case. It's more that setting the retry flag isn't enough, you need to
>> have it do another loop at that point. And if you just set that, then
>> it'll drop all refs and happily return and wait for the next poll
>> trigger that won't happen past HUP.
>
> It was actually supposed to force it into another iteration, but
> I see what you're saying. I'll take a look at this part tomorrow.

I mean, we can just reuse 'v' for this, and set IO_POLL_RETRY_FLAG.
There's no reason to set it in ->poll_refs as it's all local anyway.
Then that at least gets rid of the 'retry' variable.

diff --git a/io_uring/poll.c b/io_uring/poll.c
index aac4b3b881fb..f1a45e69b9af 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -228,6 +228,18 @@ static inline void io_poll_execute(struct io_kiocb *req, int res)
 		__io_poll_execute(req, res);
 }
 
+static inline void io_poll_check_retry(struct io_kiocb *req, int *v)
+{
+	if (req->opcode == IORING_OP_POLL_ADD)
+		return;
+	if (!(req->cqe.res & (POLLHUP | POLLRDHUP)))
+		return;
+	if (*v == 1)
+		return;
+	/* multiple refs and HUP, ensure we loop once more */
+	*v |= IO_POLL_RETRY_FLAG;
+}
+
 /*
  * All poll tw should go through this. Checks for poll events, manages
  * references, does rewait, etc.
@@ -287,13 +299,15 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 			if (unlikely(!req->cqe.res)) {
 				/* Multishot armed need not reissue */
 				if (!(req->apoll_events & EPOLLONESHOT))
-					continue;
+					goto finish;
 				return IOU_POLL_REISSUE;
 			}
 		}
 		if (req->apoll_events & EPOLLONESHOT)
 			return IOU_POLL_DONE;
 
+		io_poll_check_retry(req, &v);
+
 		/* multishot, just fill a CQE and proceed */
 		if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
 			__poll_t mask = mangle_poll(req->cqe.res &
@@ -317,12 +331,17 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 
 		/* force the next iteration to vfs_poll() */
 		req->cqe.res = 0;
+		if (v & IO_POLL_RETRY_FLAG)
+			continue;
 		/*
 		 * Release all references, retry if someone tried to restart
 		 * task_work while we were executing it.
 		 */
+finish:
 		v &= IO_POLL_REF_MASK;
-	} while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);
+		if (!(atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK))
+			break;
+	} while (1);
 
 	io_napi_add(req);
 	return IOU_POLL_NO_ACTION;

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 12+ messages in thread
* Re: [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race
  2026-03-17  1:14                 ` Jens Axboe
@ 2026-03-17  1:36                   ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2026-03-17 1:36 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 3/16/26 7:14 PM, Jens Axboe wrote:
> On 3/16/26 5:08 PM, Pavel Begunkov wrote:
>> On 3/16/26 22:31, Jens Axboe wrote:
>>> On 3/16/26 4:24 PM, Pavel Begunkov wrote:
>>>> On 3/16/26 18:40, Jens Axboe wrote:
>>>>> On 3/16/26 9:16 AM, Jens Axboe wrote:
>>>>>> On 3/16/26 8:44 AM, Pavel Begunkov wrote:
>>>>>>> On 3/16/26 14:40, Pavel Begunkov wrote:
>>>>>>>> On 3/16/26 14:28, Jens Axboe wrote:
>>>>>>>>> On 3/16/26 8:17 AM, Pavel Begunkov wrote:
>>>>>>>>>> On 3/15/26 16:19, Jens Axboe wrote:
>>>>>>>>>>> When a socket send and shutdown() happen back-to-back, both fire
>>>>>>>>>>> wake-ups before the receiver's task_work has a chance to run. The first
>>>>>>>>>>> wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
>>>>>>>>>>> When io_poll_check_events() runs, it calls io_poll_issue() which does a
>>>>>>>>>>> recv that reads the data and returns IOU_RETRY. The loop then drains all
>>>>>>>>>>> accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
>>>>>>>>>>> the first event was consumed. Since the shutdown is a persistent state
>>>>>>>>>>> change, no further wakeups will happen, and the multishot recv can hang
>>>>>>>>>>> forever.
>>>>>>>>>>>
>>>>>>>>>>> Fix this by only draining a single poll ref after io_poll_issue()
>>>>>>>>>>> returns IOU_RETRY for the APOLL_MULTISHOT path. If additional wakes
>>>>>>>>>>> raced in (poll_refs was > 1), the loop iterates again, vfs_poll()
>>>>>>>>>>> discovers the remaining state.
>>>>>>>>>>
>>>>>>>>>> How often will iterate with no effect for normal execution (i.e.
>>>>>>>>>> no shutdown)? And how costly it'll be? Why not handle HUP instead?
>>>>>>>>>
>>>>>>>>> That is my worry too. I spent a bit of time on it this morning to figure
>>>>>>>>> out why this is a new issue, and traced it down to 6.16..6.17, and this
>>>>>>>>> commit in particular:
>>>>>>>>>
>>>>>>>>> commit df30285b3670bf52e1e5512e4d4482bec5e93c16
>>>>>>>>> Author: Kuniyuki Iwashima <kuniyu@google.com>
>>>>>>>>> Date:   Wed Jul 2 22:35:18 2025 +0000
>>>>>>>>>
>>>>>>>>>     af_unix: Introduce SO_INQ.
>>>>>>>>>
>>>>>>>>> which is then not the first time I've had to fix fallout from that
>>>>>>>>> commit. Need to dig a bit deeper. That said, I do also worry a bit about
>>>>>>>>> missing events. Yes if both poll triggers are of the same type, eg
>>>>>>>>> POLLIN, then we don't need to iterate again. IN + HUP is problematic, as
>>>>>>>>> would anything else where you'd need separate handling for the trigger.
>>>>>>>>
>>>>>>>> Thinking more, I don't think the patch is correct either. Seems you
>>>>>>>> expect the last recv to return 0, but let's say you have 2 refs and
>>>>>>>> 8K in the rx queue. The first recv call gets 4K b/c some allocation
>>>>>>>> fails. The 2nd recv call returns another 4K, and now you're in the
>>>>>>>> same situation as before.
>>>>>>>>
>>>>>>>> You're trying to rely on a too specific behaviour. HUP handling should
>>>>>>>> be better.
>>>>>>>
>>>>>>> Some variation on, if HUP'ed, it spins until the opcode gives up.
>>>>>>
>>>>>> Took a quick look, and we don't even get a HUP, the hangup side
>>>>>> ends up with a 0 mask. Which is less than useful... I'll keep
>>>>>> digging.
>>>>>
>>>>> How about something like this? Will only retry if hup was seen, and
>>>>> there are multiple refs. Avoids re-iterating for eg multiple POLLIN
>>>>> wakes, which should be the common hot path if v & IO_POLL_REF_MASK != 1.
>>>>> Keeps it local too.
>>>>
>>>> HUP handling is just a hack, it'd be best to avoid complicating
>>>
>>> It is, that's why I wasn't a huge fan of gating on that in the first
>>> place!
>>>
>>>> the poll loop logic for that (and those continues do).
>>>>
>>>> io_poll_loop_retry() {
>>>>     ...
>>>>     atomic_or(IO_POLL_RETRY_FLAG, &req->poll_refs);
>>>> }
>>>> if (req->cqe.res & (POLLHUP | POLLRDHUP))
>>>>     io_poll_loop_retry();
>>>>
>>>>
>>>> Can we isolate it like this? Nobody should care about extra
>>>> atomics for this case.
>>>
>>> It's not the extra atomic, I agree it doesn't matter for this non-hot
>>> case. It's more that setting the retry flag isn't enough, you need to
>>> have it do another loop at that point. And if you just set that, then
>>> it'll drop all refs and happily return and wait for the next poll
>>> trigger that won't happen past HUP.
>>
>> It was actually supposed to force it into another iteration, but
>> I see what you're saying. I'll take a look at this part tomorrow.
>
> I mean, we can just reuse 'v' for this, and set IO_POLL_RETRY_FLAG.
> There's no reason to set it in ->poll_refs as it's all local anyway.
> Then that at least gets rid of the 'retry' variable.

This is as clean/simple as I can get it, less text too for this one.

diff --git a/io_uring/poll.c b/io_uring/poll.c
index aac4b3b881fb..06183cf60201 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -228,6 +228,22 @@ static inline void io_poll_execute(struct io_kiocb *req, int res)
 		__io_poll_execute(req, res);
 }
 
+static inline void io_poll_check_retry(struct io_kiocb *req, int *v)
+{
+	if (req->opcode == IORING_OP_POLL_ADD)
+		return;
+	if (!(req->cqe.res & (POLLHUP | POLLRDHUP)))
+		return;
+	/*
+	 * Release all references, retry if someone tried to restart
+	 * task_work while we were executing it.
+	 */
+	*v &= IO_POLL_REF_MASK;
+
+	/* multiple refs and HUP, ensure we loop once more */
+	(*v)--;
+}
+
 /*
  * All poll tw should go through this. Checks for poll events, manages
  * references, does rewait, etc.
@@ -294,6 +310,8 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 	if (req->apoll_events & EPOLLONESHOT)
 		return IOU_POLL_DONE;
 
+	io_poll_check_retry(req, &v);
+
 	/* multishot, just fill a CQE and proceed */
 	if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
 		__poll_t mask = mangle_poll(req->cqe.res &
@@ -316,12 +334,6 @@ static int io_poll_check_events(struct io_kiocb *req, io_tw_token_t tw)
 
 		/* force the next iteration to vfs_poll() */
 		req->cqe.res = 0;
-
-		/*
-		 * Release all references, retry if someone tried to restart
-		 * task_work while we were executing it.
-		 */
-		v &= IO_POLL_REF_MASK;
 	} while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);
 
 	io_napi_add(req);

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 12+ messages in thread
end of thread, other threads:[~2026-03-17  1:36 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-15 16:19 [PATCH] io_uring/poll: fix multishot recv missing EOF on wakeup race Jens Axboe
2026-03-16 14:17 ` Pavel Begunkov
2026-03-16 14:28   ` Jens Axboe
2026-03-16 14:40     ` Pavel Begunkov
2026-03-16 14:44       ` Pavel Begunkov
2026-03-16 15:16         ` Jens Axboe
2026-03-16 18:40           ` Jens Axboe
2026-03-16 22:24             ` Pavel Begunkov
2026-03-16 22:31               ` Jens Axboe
2026-03-16 23:08                 ` Pavel Begunkov
2026-03-17  1:14                   ` Jens Axboe
2026-03-17  1:36                     ` Jens Axboe