public inbox for [email protected]
* [PATCHSET 0/9] io_uring epoll wait support
@ 2025-02-03 16:23 Jens Axboe
  2025-02-03 16:23 ` [PATCH 1/9] eventpoll: abstract out main epoll reaper into a function Jens Axboe
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Jens Axboe @ 2025-02-03 16:23 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner

Hi,

One issue people consistently run into when converting legacy epoll
event loops to io_uring is that parts of the event loop still need to
use epoll. And since event loops generally need to wait in one spot,
they add the io_uring fd to the epoll set and continue to use
epoll_wait(2) to wait on events. This is suboptimal on the io_uring
front, as there's now an active poller on the ring, and suboptimal for
the application, as it forgoes the batched waiting (with fine-grained
timeouts) that io_uring provides.

This patchset adds support for IORING_OP_EPOLL_WAIT, which performs an
async epoll_wait() operation. No sleeping or thread offload is involved;
it relies on the wait_queue_entry callback for retries. With that, the
above event loops can continue to use epoll for certain parts, but
bundle all the waiting under the ring itself rather than adding the ring
fd to the epoll set.
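
As a rough sketch of how that might look from userspace (no liburing
prep helper is assumed to exist for this yet, so the SQE fields are
filled in by hand to match what the prep side reads: fd, addr, len;
'epfd' is an existing epoll instance, and updated uapi headers with the
new opcode are assumed):

#include <liburing.h>
#include <string.h>
#include <sys/epoll.h>

static int wait_epoll_via_ring(struct io_uring *ring, int epfd,
			       struct epoll_event *events, int maxevents)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int ret;

	if (!sqe)
		return -EBUSY;
	/* fill in the raw SQE fields directly */
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_EPOLL_WAIT;
	sqe->fd = epfd;
	sqe->addr = (unsigned long) events;
	sqe->len = maxevents;
	io_uring_submit(ring);

	/* waiting (and any timeout) now happens on the ring, not in epoll */
	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret < 0)
		return ret;
	ret = cqe->res;		/* < 0 on error, else number of events reaped */
	io_uring_cqe_seen(ring, cqe);
	return ret;
}

In a real event loop this would be combined with io_uring_submit_and_wait()
or a timeout-based CQE wait, so epoll events and other ring completions
get reaped in one spot.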

Patches 1..4 are just prep patches, and patch 5 adds the epoll change
that allows io_uring to queue a callback if no events are available.
Patches 6..7 are prep patches on the io_uring side, and patch 8
finally adds IORING_OP_EPOLL_WAIT support. Patch 9 adds multishot
support, which further gets rid of the repeated write_lock and list
manipulations on the struct eventpoll waitqueue head. This last bit
should be a nice win, as a persistent waitqueue entry avoids the
lock/add/unlock dance for each epoll_wait()-equivalent operation.

Patches can also be found here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-epoll-wait

and are against 6.14-rc1 + already pending io_uring patches.

 fs/eventpoll.c                 | 155 +++++++++++++++++++--------
 include/linux/eventpoll.h      |   8 ++
 include/linux/io_uring_types.h |   4 +
 include/uapi/linux/io_uring.h  |   7 ++
 io_uring/Makefile              |   9 +-
 io_uring/cancel.c              |   5 +
 io_uring/epoll.c               | 190 ++++++++++++++++++++++++++++++++-
 io_uring/epoll.h               |  22 ++++
 io_uring/io_uring.c            |   5 +
 io_uring/opdef.c               |  14 +++
 io_uring/poll.c                |  30 +-----
 io_uring/poll.h                |  31 ++++++
 12 files changed, 400 insertions(+), 80 deletions(-)

-- 
Jens Axboe


* [PATCH 1/9] eventpoll: abstract out main epoll reaper into a function
  2025-02-03 16:23 [PATCHSET 0/9] io_uring epoll wait support Jens Axboe
@ 2025-02-03 16:23 ` Jens Axboe
  2025-02-03 16:23 ` [PATCH 2/9] eventpoll: add helper to remove wait entry from wait queue head Jens Axboe
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-02-03 16:23 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

Add epoll_wait(), which takes a struct file, the user event buffer, the
maximum number of events to reap, and a timeout. do_epoll_wait() can
then call it, and io_uring will be able to use it as well.

No intended functional changes in this patch.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/eventpoll.c            | 31 ++++++++++++++++++-------------
 include/linux/eventpoll.h |  4 ++++
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 7c0980db77b3..73b639caed3d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2445,12 +2445,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	return do_epoll_ctl(epfd, op, fd, &epds, false);
 }
 
-/*
- * Implement the event wait interface for the eventpoll file. It is the kernel
- * part of the user space epoll_wait(2).
- */
-static int do_epoll_wait(int epfd, struct epoll_event __user *events,
-			 int maxevents, struct timespec64 *to)
+int epoll_wait(struct file *file, struct epoll_event __user *events,
+	       int maxevents, struct timespec64 *to)
 {
 	struct eventpoll *ep;
 
@@ -2462,28 +2458,37 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events,
 	if (!access_ok(events, maxevents * sizeof(struct epoll_event)))
 		return -EFAULT;
 
-	/* Get the "struct file *" for the eventpoll file */
-	CLASS(fd, f)(epfd);
-	if (fd_empty(f))
-		return -EBADF;
-
 	/*
 	 * We have to check that the file structure underneath the fd
 	 * the user passed to us _is_ an eventpoll file.
 	 */
-	if (!is_file_epoll(fd_file(f)))
+	if (!is_file_epoll(file))
 		return -EINVAL;
 
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
 	 * our own data structure.
 	 */
-	ep = fd_file(f)->private_data;
+	ep = file->private_data;
 
 	/* Time to fish for events ... */
 	return ep_poll(ep, events, maxevents, to);
 }
 
+/*
+ * Implement the event wait interface for the eventpoll file. It is the kernel
+ * part of the user space epoll_wait(2).
+ */
+static int do_epoll_wait(int epfd, struct epoll_event __user *events,
+			 int maxevents, struct timespec64 *to)
+{
+	/* Get the "struct file *" for the eventpoll file */
+	CLASS(fd, f)(epfd);
+	if (!fd_empty(f))
+		return epoll_wait(fd_file(f), events, maxevents, to);
+	return -EBADF;
+}
+
 SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
 		int, maxevents, int, timeout)
 {
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 0c0d00fcd131..f37fea931c44 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -25,6 +25,10 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd, unsigned long t
 /* Used to release the epoll bits inside the "struct file" */
 void eventpoll_release_file(struct file *file);
 
+/* Use to reap events */
+int epoll_wait(struct file *file, struct epoll_event __user *events,
+	       int maxevents, struct timespec64 *to);
+
 /*
  * This is called from inside fs/file_table.c:__fput() to unlink files
  * from the eventpoll interface. We need to have this facility to cleanup
-- 
2.47.2


* [PATCH 2/9] eventpoll: add helper to remove wait entry from wait queue head
  2025-02-03 16:23 [PATCHSET 0/9] io_uring epoll wait support Jens Axboe
  2025-02-03 16:23 ` [PATCH 1/9] eventpoll: abstract out main epoll reaper into a function Jens Axboe
@ 2025-02-03 16:23 ` Jens Axboe
  2025-02-03 16:23 ` [PATCH 3/9] eventpoll: abstract out ep_try_send_events() helper Jens Axboe
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-02-03 16:23 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

__epoll_wait_remove() is the core helper; it removes a given
wait_queue_entry from the eventpoll wait_queue_head. Use it internally,
and provide an overall helper, epoll_wait_remove(), which takes a struct
file and provides the same functionality.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/eventpoll.c            | 58 +++++++++++++++++++++++++--------------
 include/linux/eventpoll.h |  3 ++
 2 files changed, 40 insertions(+), 21 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 73b639caed3d..01edbee5c766 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1980,6 +1980,42 @@ static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
 	return ret;
 }
 
+static int __epoll_wait_remove(struct eventpoll *ep,
+			       struct wait_queue_entry *wait, int timed_out)
+{
+	int eavail;
+
+	/*
+	 * We were woken up, thus go and try to harvest some events. If timed
+	 * out and still on the wait queue, recheck eavail carefully under
+	 * lock, below.
+	 */
+	eavail = 1;
+
+	if (!list_empty_careful(&wait->entry)) {
+		write_lock_irq(&ep->lock);
+		/*
+		 * If the thread timed out and is not on the wait queue, it
+		 * means that the thread was woken up after its timeout expired
+		 * before it could reacquire the lock. Thus, when wait.entry is
+		 * empty, it needs to harvest events.
+		 */
+		if (timed_out)
+			eavail = list_empty(&wait->entry);
+		__remove_wait_queue(&ep->wq, wait);
+		write_unlock_irq(&ep->lock);
+	}
+
+	return eavail;
+}
+
+int epoll_wait_remove(struct file *file, struct wait_queue_entry *wait)
+{
+	if (is_file_epoll(file))
+		return __epoll_wait_remove(file->private_data, wait, false);
+	return -EINVAL;
+}
+
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
  *           event buffer.
@@ -2100,27 +2136,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 							      HRTIMER_MODE_ABS);
 		__set_current_state(TASK_RUNNING);
 
-		/*
-		 * We were woken up, thus go and try to harvest some events.
-		 * If timed out and still on the wait queue, recheck eavail
-		 * carefully under lock, below.
-		 */
-		eavail = 1;
-
-		if (!list_empty_careful(&wait.entry)) {
-			write_lock_irq(&ep->lock);
-			/*
-			 * If the thread timed out and is not on the wait queue,
-			 * it means that the thread was woken up after its
-			 * timeout expired before it could reacquire the lock.
-			 * Thus, when wait.entry is empty, it needs to harvest
-			 * events.
-			 */
-			if (timed_out)
-				eavail = list_empty(&wait.entry);
-			__remove_wait_queue(&ep->wq, &wait);
-			write_unlock_irq(&ep->lock);
-		}
+		eavail = __epoll_wait_remove(ep, &wait, timed_out);
 	}
 }
 
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index f37fea931c44..1301fc74aca0 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -29,6 +29,9 @@ void eventpoll_release_file(struct file *file);
 int epoll_wait(struct file *file, struct epoll_event __user *events,
 	       int maxevents, struct timespec64 *to);
 
+/* Remove wait entry */
+int epoll_wait_remove(struct file *file, struct wait_queue_entry *wait);
+
 /*
  * This is called from inside fs/file_table.c:__fput() to unlink files
  * from the eventpoll interface. We need to have this facility to cleanup
-- 
2.47.2


* [PATCH 3/9] eventpoll: abstract out ep_try_send_events() helper
  2025-02-03 16:23 [PATCHSET 0/9] io_uring epoll wait support Jens Axboe
  2025-02-03 16:23 ` [PATCH 1/9] eventpoll: abstract out main epoll reaper into a function Jens Axboe
  2025-02-03 16:23 ` [PATCH 2/9] eventpoll: add helper to remove wait entry from wait queue head Jens Axboe
@ 2025-02-03 16:23 ` Jens Axboe
  2025-02-03 16:23 ` [PATCH 4/9] eventpoll: add struct wait_queue_entry argument to epoll_wait() Jens Axboe
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-02-03 16:23 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

In preparation for reusing this logic from another epoll helper,
abstract it out into an ep_try_send_events() helper.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/eventpoll.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 01edbee5c766..3cbd290503c7 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2016,6 +2016,22 @@ int epoll_wait_remove(struct file *file, struct wait_queue_entry *wait)
 	return -EINVAL;
 }
 
+static int ep_try_send_events(struct eventpoll *ep,
+			      struct epoll_event __user *events, int maxevents)
+{
+	int res;
+
+	/*
+	 * Try to transfer events to user space. In case we get 0 events and
+	 * there's still timeout left over, we go trying again in search of
+	 * more luck.
+	 */
+	res = ep_send_events(ep, events, maxevents);
+	if (res > 0)
+		ep_suspend_napi_irqs(ep);
+	return res;
+}
+
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
  *           event buffer.
@@ -2067,17 +2083,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 
 	while (1) {
 		if (eavail) {
-			/*
-			 * Try to transfer events to user space. In case we get
-			 * 0 events and there's still timeout left over, we go
-			 * trying again in search of more luck.
-			 */
-			res = ep_send_events(ep, events, maxevents);
-			if (res) {
-				if (res > 0)
-					ep_suspend_napi_irqs(ep);
+			res = ep_try_send_events(ep, events, maxevents);
+			if (res)
 				return res;
-			}
 		}
 
 		if (timed_out)
-- 
2.47.2


* [PATCH 4/9] eventpoll: add struct wait_queue_entry argument to epoll_wait()
  2025-02-03 16:23 [PATCHSET 0/9] io_uring epoll wait support Jens Axboe
                   ` (2 preceding siblings ...)
  2025-02-03 16:23 ` [PATCH 3/9] eventpoll: abstract out ep_try_send_events() helper Jens Axboe
@ 2025-02-03 16:23 ` Jens Axboe
  2025-02-03 16:23 ` [PATCH 5/9] eventpoll: add ep_poll_queue() loop Jens Axboe
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-02-03 16:23 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

In preparation for allowing an outside caller to add itself to the epoll
waitqueue, pass in a struct wait_queue_entry. Unused in its current
form, but will be utilized shortly.

No intended functional changes in this patch.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/eventpoll.c            | 5 +++--
 include/linux/eventpoll.h | 3 ++-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 3cbd290503c7..ecaa5591f4be 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2470,7 +2470,8 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 }
 
 int epoll_wait(struct file *file, struct epoll_event __user *events,
-	       int maxevents, struct timespec64 *to)
+	       int maxevents, struct timespec64 *to,
+	       struct wait_queue_entry *wait)
 {
 	struct eventpoll *ep;
 
@@ -2509,7 +2510,7 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events,
 	/* Get the "struct file *" for the eventpoll file */
 	CLASS(fd, f)(epfd);
 	if (!fd_empty(f))
-		return epoll_wait(fd_file(f), events, maxevents, to);
+		return epoll_wait(fd_file(f), events, maxevents, to, NULL);
 	return -EBADF;
 }
 
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 1301fc74aca0..24f9344df5a3 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -27,7 +27,8 @@ void eventpoll_release_file(struct file *file);
 
 /* Use to reap events */
 int epoll_wait(struct file *file, struct epoll_event __user *events,
-	       int maxevents, struct timespec64 *to);
+	       int maxevents, struct timespec64 *to,
+	       struct wait_queue_entry *wait);
 
 /* Remove wait entry */
 int epoll_wait_remove(struct file *file, struct wait_queue_entry *wait);
-- 
2.47.2


* [PATCH 5/9] eventpoll: add ep_poll_queue() loop
  2025-02-03 16:23 [PATCHSET 0/9] io_uring epoll wait support Jens Axboe
                   ` (3 preceding siblings ...)
  2025-02-03 16:23 ` [PATCH 4/9] eventpoll: add struct wait_queue_entry argument to epoll_wait() Jens Axboe
@ 2025-02-03 16:23 ` Jens Axboe
  2025-02-03 16:23 ` [PATCH 6/9] io_uring/epoll: remove CONFIG_EPOLL guards Jens Axboe
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-02-03 16:23 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

If a wait_queue_entry is passed in to epoll_wait(), then utilize this
new helper for reaping events and/or adding to the epoll waitqueue,
rather than calling the potentially sleeping ep_poll(). It works like
ep_poll(), except it doesn't block - it either returns the events that
are already available, or it adds the specified entry to the struct
eventpoll waitqueue to get a callback when events are triggered,
returning -EIOCBQUEUED in that case.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/eventpoll.c | 37 ++++++++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index ecaa5591f4be..a8be0c7110e4 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2032,6 +2032,39 @@ static int ep_try_send_events(struct eventpoll *ep,
 	return res;
 }
 
+static int ep_poll_queue(struct eventpoll *ep,
+			 struct epoll_event __user *events, int maxevents,
+			 struct wait_queue_entry *wait)
+{
+	int res, eavail;
+
+	/* See ep_poll() for commentary */
+	eavail = ep_events_available(ep);
+	while (1) {
+		if (eavail) {
+			res = ep_try_send_events(ep, events, maxevents);
+			if (res)
+				return res;
+		}
+
+		eavail = ep_busy_loop(ep, true);
+		if (eavail)
+			continue;
+
+		if (!list_empty_careful(&wait->entry))
+			return -EIOCBQUEUED;
+
+		write_lock_irq(&ep->lock);
+		eavail = ep_events_available(ep);
+		if (!eavail)
+			__add_wait_queue_exclusive(&ep->wq, wait);
+		write_unlock_irq(&ep->lock);
+
+		if (!eavail)
+			return -EIOCBQUEUED;
+	}
+}
+
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
  *           event buffer.
@@ -2497,7 +2530,9 @@ int epoll_wait(struct file *file, struct epoll_event __user *events,
 	ep = file->private_data;
 
 	/* Time to fish for events ... */
-	return ep_poll(ep, events, maxevents, to);
+	if (!wait)
+		return ep_poll(ep, events, maxevents, to);
+	return ep_poll_queue(ep, events, maxevents, wait);
 }
 
 /*
-- 
2.47.2


* [PATCH 6/9] io_uring/epoll: remove CONFIG_EPOLL guards
  2025-02-03 16:23 [PATCHSET 0/9] io_uring epoll wait support Jens Axboe
                   ` (4 preceding siblings ...)
  2025-02-03 16:23 ` [PATCH 5/9] eventpoll: add ep_poll_queue() loop Jens Axboe
@ 2025-02-03 16:23 ` Jens Axboe
  2025-02-03 16:23 ` [PATCH 7/9] io_uring/poll: pull ownership handling into poll.h Jens Axboe
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-02-03 16:23 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

Just have the Makefile add the object if epoll is enabled; then it's
not necessary to guard the entire epoll.c file inside a CONFIG_EPOLL
ifdef.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/Makefile | 9 +++++----
 io_uring/epoll.c  | 2 --
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/io_uring/Makefile b/io_uring/Makefile
index d695b60dba4f..7114a6dbd439 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -11,9 +11,10 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o opdef.o kbuf.o rsrc.o notif.o \
 					eventfd.o uring_cmd.o openclose.o \
 					sqpoll.o xattr.o nop.o fs.o splice.o \
 					sync.o msg_ring.o advise.o openclose.o \
-					epoll.o statx.o timeout.o fdinfo.o \
-					cancel.o waitid.o register.o \
-					truncate.o memmap.o alloc_cache.o
+					statx.o timeout.o fdinfo.o cancel.o \
+					waitid.o register.o truncate.o \
+					memmap.o alloc_cache.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
 obj-$(CONFIG_FUTEX)		+= futex.o
-obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
+obj-$(CONFIG_EPOLL)		+= epoll.o
+obj-$(CONFIG_NET_RX_BUSY_POLL)	+= napi.o
diff --git a/io_uring/epoll.c b/io_uring/epoll.c
index 89bff2068a19..7848d9cc073d 100644
--- a/io_uring/epoll.c
+++ b/io_uring/epoll.c
@@ -12,7 +12,6 @@
 #include "io_uring.h"
 #include "epoll.h"
 
-#if defined(CONFIG_EPOLL)
 struct io_epoll {
 	struct file			*file;
 	int				epfd;
@@ -58,4 +57,3 @@ int io_epoll_ctl(struct io_kiocb *req, unsigned int issue_flags)
 	io_req_set_res(req, ret, 0);
 	return IOU_OK;
 }
-#endif
-- 
2.47.2


* [PATCH 7/9] io_uring/poll: pull ownership handling into poll.h
  2025-02-03 16:23 [PATCHSET 0/9] io_uring epoll wait support Jens Axboe
                   ` (5 preceding siblings ...)
  2025-02-03 16:23 ` [PATCH 6/9] io_uring/epoll: remove CONFIG_EPOLL guards Jens Axboe
@ 2025-02-03 16:23 ` Jens Axboe
  2025-02-03 16:23 ` [PATCH 8/9] io_uring/epoll: add support for IORING_OP_EPOLL_WAIT Jens Axboe
  2025-02-03 16:23 ` [PATCH 9/9] io_uring/epoll: add multishot " Jens Axboe
  8 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-02-03 16:23 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

In preparation for using the poll ownership handling from somewhere
else, rather than trying to duplicate the functionality, just make it
generically available to io_uring opcodes.

Note: this must be used carefully, and cannot be used by opcodes that
can themselves trigger the poll logic.
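
As a minimal sketch of the intended pattern (my_retry_handler is a
hypothetical task_work callback; the epoll wait side added later in
this series follows this same shape):

	/* only the ownership winner may modify the request or queue task_work */
	if (io_poll_get_ownership(req)) {
		req->io_task_work.func = my_retry_handler;
		io_req_task_work_add(req);
	}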

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/poll.c | 30 +-----------------------------
 io_uring/poll.h | 31 +++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+), 29 deletions(-)

diff --git a/io_uring/poll.c b/io_uring/poll.c
index bb1c0cd4f809..5e44ac562491 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -41,16 +41,6 @@ struct io_poll_table {
 	__poll_t result_mask;
 };
 
-#define IO_POLL_CANCEL_FLAG	BIT(31)
-#define IO_POLL_RETRY_FLAG	BIT(30)
-#define IO_POLL_REF_MASK	GENMASK(29, 0)
-
-/*
- * We usually have 1-2 refs taken, 128 is more than enough and we want to
- * maximise the margin between this amount and the moment when it overflows.
- */
-#define IO_POLL_REF_BIAS	128
-
 #define IO_WQE_F_DOUBLE		1
 
 static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
@@ -70,7 +60,7 @@ static inline bool wqe_is_double(struct wait_queue_entry *wqe)
 	return priv & IO_WQE_F_DOUBLE;
 }
 
-static bool io_poll_get_ownership_slowpath(struct io_kiocb *req)
+bool io_poll_get_ownership_slowpath(struct io_kiocb *req)
 {
 	int v;
 
@@ -85,24 +75,6 @@ static bool io_poll_get_ownership_slowpath(struct io_kiocb *req)
 	return !(atomic_fetch_inc(&req->poll_refs) & IO_POLL_REF_MASK);
 }
 
-/*
- * If refs part of ->poll_refs (see IO_POLL_REF_MASK) is 0, it's free. We can
- * bump it and acquire ownership. It's disallowed to modify requests while not
- * owning it, that prevents from races for enqueueing task_work's and b/w
- * arming poll and wakeups.
- */
-static inline bool io_poll_get_ownership(struct io_kiocb *req)
-{
-	if (unlikely(atomic_read(&req->poll_refs) >= IO_POLL_REF_BIAS))
-		return io_poll_get_ownership_slowpath(req);
-	return !(atomic_fetch_inc(&req->poll_refs) & IO_POLL_REF_MASK);
-}
-
-static void io_poll_mark_cancelled(struct io_kiocb *req)
-{
-	atomic_or(IO_POLL_CANCEL_FLAG, &req->poll_refs);
-}
-
 static struct io_poll *io_poll_get_double(struct io_kiocb *req)
 {
 	/* pure poll stashes this in ->async_data, poll driven retry elsewhere */
diff --git a/io_uring/poll.h b/io_uring/poll.h
index 04ede93113dc..2f416cd3be13 100644
--- a/io_uring/poll.h
+++ b/io_uring/poll.h
@@ -21,6 +21,18 @@ struct async_poll {
 	struct io_poll		*double_poll;
 };
 
+#define IO_POLL_CANCEL_FLAG	BIT(31)
+#define IO_POLL_RETRY_FLAG	BIT(30)
+#define IO_POLL_REF_MASK	GENMASK(29, 0)
+
+bool io_poll_get_ownership_slowpath(struct io_kiocb *req);
+
+/*
+ * We usually have 1-2 refs taken, 128 is more than enough and we want to
+ * maximise the margin between this amount and the moment when it overflows.
+ */
+#define IO_POLL_REF_BIAS	128
+
 /*
  * Must only be called inside issue_flags & IO_URING_F_MULTISHOT, or
  * potentially other cases where we already "own" this poll request.
@@ -30,6 +42,25 @@ static inline void io_poll_multishot_retry(struct io_kiocb *req)
 	atomic_inc(&req->poll_refs);
 }
 
+/*
+ * If refs part of ->poll_refs (see IO_POLL_REF_MASK) is 0, it's free. We can
+ * bump it and acquire ownership. It's disallowed to modify requests while not
+ * owning it, that prevents from races for enqueueing task_work's and b/w
+ * arming poll and wakeups.
+ */
+static inline bool io_poll_get_ownership(struct io_kiocb *req)
+{
+	if (unlikely(atomic_read(&req->poll_refs) >= IO_POLL_REF_BIAS))
+		return io_poll_get_ownership_slowpath(req);
+	return !(atomic_fetch_inc(&req->poll_refs) & IO_POLL_REF_MASK);
+}
+
+static inline void io_poll_mark_cancelled(struct io_kiocb *req)
+{
+	atomic_or(IO_POLL_CANCEL_FLAG, &req->poll_refs);
+}
+
+
 int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_poll_add(struct io_kiocb *req, unsigned int issue_flags);
 
-- 
2.47.2


* [PATCH 8/9] io_uring/epoll: add support for IORING_OP_EPOLL_WAIT
  2025-02-03 16:23 [PATCHSET 0/9] io_uring epoll wait support Jens Axboe
                   ` (6 preceding siblings ...)
  2025-02-03 16:23 ` [PATCH 7/9] io_uring/poll: pull ownership handling into poll.h Jens Axboe
@ 2025-02-03 16:23 ` Jens Axboe
  2025-02-03 16:23 ` [PATCH 9/9] io_uring/epoll: add multishot " Jens Axboe
  8 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-02-03 16:23 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

For existing epoll event loops that can't fully convert to io_uring,
the usual approach is to add the io_uring fd to the epoll instance and
use epoll_wait() to wait on both "legacy" and io_uring events. While
this works, it isn't optimal:

1) epoll_wait() is pretty limited in what it can do. It does not support
   partial reaping of events, or waiting on a batch of events.

2) When an io_uring ring is added to an epoll instance, it activates the
   io_uring "I'm being polled" logic, which slows things down.

Rather than use this approach, with EPOLL_WAIT support added to io_uring,
event loops can use the normal io_uring wait logic for everything, as
long as an epoll wait request has been armed with io_uring.

Note that IORING_OP_EPOLL_WAIT does NOT take a timeout value, as this
is an async request. Waiting on io_uring events in general has various
timeout parameters, and those are the ones that should be used when
waiting on any kind of request. If events are immediately available for
reaping, then This opcode will return those immediately. If none are
available, then it will post an async completion when they become
available.

cqe->res will contain either an error code (a negative value) for a
malformed request, an invalid epoll instance, etc., or a positive result
indicating how many events were reaped.

IORING_OP_EPOLL_WAIT requests may be canceled using the normal io_uring
cancelation infrastructure. The poll logic for managing ownership is
adopted to guard the epoll side too.
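
As a rough userspace sketch (EPOLL_WAIT_TAG is a made-up user_data tag
for this example, and io_uring_prep_cancel64() is assumed to be
available from a recent liburing):

#define EPOLL_WAIT_TAG	0x1234ULL

static int cancel_epoll_wait(struct io_uring *ring)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -EBUSY;
	/* cancel the epoll wait request submitted with this user_data */
	io_uring_prep_cancel64(sqe, EPOLL_WAIT_TAG, 0);
	sqe->user_data = 0;
	return io_uring_submit(ring);
}

The canceled epoll wait request then completes with -ECANCELED in
cqe->res.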

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/io_uring_types.h |   4 +
 include/uapi/linux/io_uring.h  |   1 +
 io_uring/cancel.c              |   5 +
 io_uring/epoll.c               | 168 +++++++++++++++++++++++++++++++++
 io_uring/epoll.h               |  22 +++++
 io_uring/io_uring.c            |   5 +
 io_uring/opdef.c               |  14 +++
 7 files changed, 219 insertions(+)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 3def525a1da3..ee56992d31d5 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -370,6 +370,10 @@ struct io_ring_ctx {
 	struct io_alloc_cache	futex_cache;
 #endif
 
+#ifdef CONFIG_EPOLL
+	struct hlist_head	epoll_list;
+#endif
+
 	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e11c82638527..a559e1e1544a 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -278,6 +278,7 @@ enum io_uring_op {
 	IORING_OP_FTRUNCATE,
 	IORING_OP_BIND,
 	IORING_OP_LISTEN,
+	IORING_OP_EPOLL_WAIT,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index 484193567839..9cebd0145cb4 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -17,6 +17,7 @@
 #include "timeout.h"
 #include "waitid.h"
 #include "futex.h"
+#include "epoll.h"
 #include "cancel.h"
 
 struct io_cancel {
@@ -128,6 +129,10 @@ int io_try_cancel(struct io_uring_task *tctx, struct io_cancel_data *cd,
 	if (ret != -ENOENT)
 		return ret;
 
+	ret = io_epoll_wait_cancel(ctx, cd, issue_flags);
+	if (ret != -ENOENT)
+		return ret;
+
 	spin_lock(&ctx->completion_lock);
 	if (!(cd->flags & IORING_ASYNC_CANCEL_FD))
 		ret = io_timeout_cancel(ctx, cd);
diff --git a/io_uring/epoll.c b/io_uring/epoll.c
index 7848d9cc073d..2a9c679516c8 100644
--- a/io_uring/epoll.c
+++ b/io_uring/epoll.c
@@ -11,6 +11,7 @@
 
 #include "io_uring.h"
 #include "epoll.h"
+#include "poll.h"
 
 struct io_epoll {
 	struct file			*file;
@@ -20,6 +21,13 @@ struct io_epoll {
 	struct epoll_event		event;
 };
 
+struct io_epoll_wait {
+	struct file			*file;
+	int				maxevents;
+	struct epoll_event __user	*events;
+	struct wait_queue_entry		wait;
+};
+
 int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_epoll *epoll = io_kiocb_to_cmd(req, struct io_epoll);
@@ -57,3 +65,163 @@ int io_epoll_ctl(struct io_kiocb *req, unsigned int issue_flags)
 	io_req_set_res(req, ret, 0);
 	return IOU_OK;
 }
+
+static void __io_epoll_cancel(struct io_kiocb *req)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	epoll_wait_remove(req->file, &iew->wait);
+	hlist_del_init(&req->hash_node);
+	io_req_set_res(req, -ECANCELED, 0);
+	req->io_task_work.func = io_req_task_complete;
+	io_req_task_work_add(req);
+}
+
+static void __io_epoll_wait_cancel(struct io_kiocb *req)
+{
+	io_poll_mark_cancelled(req);
+	if (io_poll_get_ownership(req)) {
+		__io_epoll_cancel(req);
+		io_req_set_res(req, -ECANCELED, 0);
+		req->io_task_work.func = io_req_task_complete;
+		io_req_task_work_add(req);
+	}
+}
+
+bool io_epoll_wait_remove_all(struct io_ring_ctx *ctx, struct io_uring_task *tctx,
+			      bool cancel_all)
+{
+	struct hlist_node *tmp;
+	struct io_kiocb *req;
+	bool found = false;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	hlist_for_each_entry_safe(req, tmp, &ctx->epoll_list, hash_node) {
+		if (!io_match_task_safe(req, tctx, cancel_all))
+			continue;
+		__io_epoll_wait_cancel(req);
+		found = true;
+	}
+
+	return found;
+}
+
+int io_epoll_wait_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
+			 unsigned int issue_flags)
+{
+	struct hlist_node *tmp;
+	struct io_kiocb *req;
+	int nr = 0;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	hlist_for_each_entry_safe(req, tmp, &ctx->epoll_list, hash_node) {
+		if (!io_cancel_req_match(req, cd))
+			continue;
+		__io_epoll_wait_cancel(req);
+		nr++;
+	}
+	io_ring_submit_unlock(ctx, issue_flags);
+	return nr ?: -ENOENT;
+}
+
+static void io_epoll_retry(struct io_kiocb *req, struct io_tw_state *ts)
+{
+	int v;
+
+	do {
+		v = atomic_read(&req->poll_refs);
+		if (unlikely(v != 1)) {
+			if (WARN_ON_ONCE(!(v & IO_POLL_REF_MASK)))
+				return;
+			if (v & IO_POLL_CANCEL_FLAG) {
+				__io_epoll_cancel(req);
+				return;
+			}
+		}
+		v &= IO_POLL_REF_MASK;
+	} while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);
+
+	io_req_task_submit(req, ts);
+}
+
+static int io_epoll_execute(struct io_kiocb *req)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	if (io_poll_get_ownership(req)) {
+		list_del_init_careful(&iew->wait.entry);
+		req->io_task_work.func = io_epoll_retry;
+		io_req_task_work_add(req);
+		return 1;
+	}
+
+	return 0;
+}
+
+static __cold int io_epoll_pollfree_wake(struct io_kiocb *req)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	io_poll_mark_cancelled(req);
+	list_del_init_careful(&iew->wait.entry);
+	io_epoll_execute(req);
+	return 1;
+}
+
+static int io_epoll_wait_fn(struct wait_queue_entry *wait, unsigned mode,
+			    int sync, void *key)
+{
+	struct io_kiocb *req = wait->private;
+	__poll_t mask = key_to_poll(key);
+
+	if (unlikely(mask & POLLFREE))
+		return io_epoll_pollfree_wake(req);
+
+	return io_epoll_execute(req);
+}
+
+int io_epoll_wait_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	if (sqe->off || sqe->rw_flags || sqe->buf_index || sqe->splice_fd_in)
+		return -EINVAL;
+
+	iew->maxevents = READ_ONCE(sqe->len);
+	iew->events = u64_to_user_ptr(READ_ONCE(sqe->addr));
+
+	iew->wait.flags = 0;
+	iew->wait.private = req;
+	iew->wait.func = io_epoll_wait_fn;
+	INIT_LIST_HEAD(&iew->wait.entry);
+	atomic_set(&req->poll_refs, 0);
+	return 0;
+}
+
+int io_epoll_wait(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+	struct io_ring_ctx *ctx = req->ctx;
+	int ret;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	hlist_add_head(&req->hash_node, &ctx->epoll_list);
+	io_ring_submit_unlock(ctx, issue_flags);
+
+	/*
+	 * Timeout is fake here, it doesn't indicate any kind of sleep time.
+	 * It's just set to something that is non-zero, so that wait queue
+	 * wakeup is armed if no events are available.
+	 */
+	ret = epoll_wait(req->file, iew->events, iew->maxevents, NULL, &iew->wait);
+	if (ret == -EIOCBQUEUED)
+		return IOU_ISSUE_SKIP_COMPLETE;
+	else if (ret < 0)
+		req_set_fail(req);
+	io_ring_submit_lock(ctx, issue_flags);
+	hlist_del_init(&req->hash_node);
+	io_ring_submit_unlock(ctx, issue_flags);
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
diff --git a/io_uring/epoll.h b/io_uring/epoll.h
index 870cce11ba98..296940d89063 100644
--- a/io_uring/epoll.h
+++ b/io_uring/epoll.h
@@ -1,6 +1,28 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include "cancel.h"
+
 #if defined(CONFIG_EPOLL)
+int io_epoll_wait_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
+			 unsigned int issue_flags);
+bool io_epoll_wait_remove_all(struct io_ring_ctx *ctx, struct io_uring_task *tctx,
+			      bool cancel_all);
+
 int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_epoll_ctl(struct io_kiocb *req, unsigned int issue_flags);
+int io_epoll_wait_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_epoll_wait(struct io_kiocb *req, unsigned int issue_flags);
+#else
+static inline bool io_epoll_wait_remove_all(struct io_ring_ctx *ctx,
+					    struct io_uring_task *tctx,
+					    bool cancel_all)
+{
+	return false;
+}
+static inline int io_epoll_wait_cancel(struct io_ring_ctx *ctx,
+				       struct io_cancel_data *cd,
+				       unsigned int issue_flags)
+{
+	return 0;
+}
 #endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 2f311aeb536f..a17abdbae7ee 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -93,6 +93,7 @@
 #include "notif.h"
 #include "waitid.h"
 #include "futex.h"
+#include "epoll.h"
 #include "napi.h"
 #include "uring_cmd.h"
 #include "msg_ring.h"
@@ -358,6 +359,9 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_HLIST_HEAD(&ctx->waitid_list);
 #ifdef CONFIG_FUTEX
 	INIT_HLIST_HEAD(&ctx->futex_list);
+#endif
+#ifdef CONFIG_EPOLL
+	INIT_HLIST_HEAD(&ctx->epoll_list);
 #endif
 	INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
@@ -3095,6 +3099,7 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 	ret |= io_poll_remove_all(ctx, tctx, cancel_all);
 	ret |= io_waitid_remove_all(ctx, tctx, cancel_all);
 	ret |= io_futex_remove_all(ctx, tctx, cancel_all);
+	ret |= io_epoll_wait_remove_all(ctx, tctx, cancel_all);
 	ret |= io_uring_try_cancel_uring_cmd(ctx, tctx, cancel_all);
 	mutex_unlock(&ctx->uring_lock);
 	ret |= io_kill_timeouts(ctx, tctx, cancel_all);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index e8baef4e5146..44553a657476 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -514,6 +514,17 @@ const struct io_issue_def io_issue_defs[] = {
 		.async_size		= sizeof(struct io_async_msghdr),
 #else
 		.prep			= io_eopnotsupp_prep,
+#endif
+	},
+	[IORING_OP_EPOLL_WAIT] = {
+		.needs_file		= 1,
+		.unbound_nonreg_file	= 1,
+		.audit_skip		= 1,
+#if defined(CONFIG_EPOLL)
+		.prep			= io_epoll_wait_prep,
+		.issue			= io_epoll_wait,
+#else
+		.prep			= io_eopnotsupp_prep,
 #endif
 	},
 };
@@ -745,6 +756,9 @@ const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_LISTEN] = {
 		.name			= "LISTEN",
 	},
+	[IORING_OP_EPOLL_WAIT] = {
+		.name			= "EPOLL_WAIT",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
-- 
2.47.2


* [PATCH 9/9] io_uring/epoll: add multishot support for IORING_OP_EPOLL_WAIT
  2025-02-03 16:23 [PATCHSET 0/9] io_uring epoll wait support Jens Axboe
                   ` (7 preceding siblings ...)
  2025-02-03 16:23 ` [PATCH 8/9] io_uring/epoll: add support for IORING_OP_EPOLL_WAIT Jens Axboe
@ 2025-02-03 16:23 ` Jens Axboe
  8 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-02-03 16:23 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

As with other multishot requests, submitting a multishot epoll wait
request will keep it re-armed after the initial trigger. This allows
multiple epoll wait completions per submitted request, every time
events become available. If more completions are expected for this
epoll wait request, then IORING_CQE_F_MORE will be set in the posted
cqe->flags.

For multishot, the request remains on the epoll callback waitqueue
head. This means that epoll doesn't need to juggle the ep->lock
writelock (and disable/enable IRQs) for each invocation of the
reaping loop. That should translate into nice efficiency gains.

Enable it by setting IORING_EPOLL_WAIT_MULTISHOT in the sqe->epoll_flags
member.
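
A rough userspace sketch (raw SQE setup as in the cover letter sketch;
epoll_flags is the new sqe union member added in this patch):

static int multishot_epoll_loop(struct io_uring *ring, int epfd,
				struct epoll_event *events, int maxevents)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int ret;

	if (!sqe)
		return -EBUSY;
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_EPOLL_WAIT;
	sqe->fd = epfd;
	sqe->addr = (unsigned long) events;
	sqe->len = maxevents;
	sqe->epoll_flags = IORING_EPOLL_WAIT_MULTISHOT;
	io_uring_submit(ring);

	for (;;) {
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0)
			return ret;
		if (cqe->res > 0) {
			/* consume cqe->res events from 'events' here */
		}
		ret = cqe->res;
		/* no IORING_CQE_F_MORE means this was the final completion */
		if (!(cqe->flags & IORING_CQE_F_MORE)) {
			io_uring_cqe_seen(ring, cqe);
			return ret;
		}
		io_uring_cqe_seen(ring, cqe);
	}
}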

Signed-off-by: Jens Axboe <[email protected]>
---
 include/uapi/linux/io_uring.h |  6 ++++++
 io_uring/epoll.c              | 40 ++++++++++++++++++++++++++---------
 2 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index a559e1e1544a..93f504b6d4ec 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -73,6 +73,7 @@ struct io_uring_sqe {
 		__u32		futex_flags;
 		__u32		install_fd_flags;
 		__u32		nop_flags;
+		__u32		epoll_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -405,6 +406,11 @@ enum io_uring_op {
 #define IORING_ACCEPT_DONTWAIT	(1U << 1)
 #define IORING_ACCEPT_POLL_FIRST	(1U << 2)
 
+/*
+ * epoll_wait flags, stored in sqe->epoll_flags
+ */
+#define IORING_EPOLL_WAIT_MULTISHOT	(1U << 0)
+
 /*
  * IORING_OP_MSG_RING command types, stored in sqe->addr
  */
diff --git a/io_uring/epoll.c b/io_uring/epoll.c
index 2a9c679516c8..730f4b729f5b 100644
--- a/io_uring/epoll.c
+++ b/io_uring/epoll.c
@@ -24,6 +24,7 @@ struct io_epoll {
 struct io_epoll_wait {
 	struct file			*file;
 	int				maxevents;
+	int				flags;
 	struct epoll_event __user	*events;
 	struct wait_queue_entry		wait;
 };
@@ -145,12 +146,15 @@ static void io_epoll_retry(struct io_kiocb *req, struct io_tw_state *ts)
 	io_req_task_submit(req, ts);
 }
 
-static int io_epoll_execute(struct io_kiocb *req)
+static int io_epoll_execute(struct io_kiocb *req, __poll_t mask)
 {
 	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
 
 	if (io_poll_get_ownership(req)) {
-		list_del_init_careful(&iew->wait.entry);
+		if (mask & EPOLL_URING_WAKE)
+			req->flags &= ~REQ_F_APOLL_MULTISHOT;
+		if (!(req->flags & REQ_F_APOLL_MULTISHOT))
+			list_del_init_careful(&iew->wait.entry);
 		req->io_task_work.func = io_epoll_retry;
 		io_req_task_work_add(req);
 		return 1;
@@ -159,13 +163,13 @@ static int io_epoll_execute(struct io_kiocb *req)
 	return 0;
 }
 
-static __cold int io_epoll_pollfree_wake(struct io_kiocb *req)
+static __cold int io_epoll_pollfree_wake(struct io_kiocb *req, __poll_t mask)
 {
 	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
 
 	io_poll_mark_cancelled(req);
 	list_del_init_careful(&iew->wait.entry);
-	io_epoll_execute(req);
+	io_epoll_execute(req, mask);
 	return 1;
 }
 
@@ -176,18 +180,23 @@ static int io_epoll_wait_fn(struct wait_queue_entry *wait, unsigned mode,
 	__poll_t mask = key_to_poll(key);
 
 	if (unlikely(mask & POLLFREE))
-		return io_epoll_pollfree_wake(req);
+		return io_epoll_pollfree_wake(req, mask);
 
-	return io_epoll_execute(req);
+	return io_epoll_execute(req, mask);
 }
 
 int io_epoll_wait_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
 
-	if (sqe->off || sqe->rw_flags || sqe->buf_index || sqe->splice_fd_in)
+	if (sqe->off || sqe->buf_index || sqe->splice_fd_in)
 		return -EINVAL;
 
+	iew->flags = READ_ONCE(sqe->epoll_flags);
+	if (iew->flags & ~IORING_EPOLL_WAIT_MULTISHOT)
+		return -EINVAL;
+	else if (iew->flags & IORING_EPOLL_WAIT_MULTISHOT)
+		req->flags |= REQ_F_APOLL_MULTISHOT;
 	iew->maxevents = READ_ONCE(sqe->len);
 	iew->events = u64_to_user_ptr(READ_ONCE(sqe->addr));
 
@@ -195,6 +204,7 @@ int io_epoll_wait_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	iew->wait.private = req;
 	iew->wait.func = io_epoll_wait_fn;
 	INIT_LIST_HEAD(&iew->wait.entry);
+	INIT_HLIST_NODE(&req->hash_node);
 	atomic_set(&req->poll_refs, 0);
 	return 0;
 }
@@ -205,9 +215,11 @@ int io_epoll_wait(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_ring_ctx *ctx = req->ctx;
 	int ret;
 
-	io_ring_submit_lock(ctx, issue_flags);
-	hlist_add_head(&req->hash_node, &ctx->epoll_list);
-	io_ring_submit_unlock(ctx, issue_flags);
+	if (hlist_unhashed(&req->hash_node)) {
+		io_ring_submit_lock(ctx, issue_flags);
+		hlist_add_head(&req->hash_node, &ctx->epoll_list);
+		io_ring_submit_unlock(ctx, issue_flags);
+	}
 
 	/*
 	 * Timeout is fake here, it doesn't indicate any kind of sleep time.
@@ -219,9 +231,17 @@ int io_epoll_wait(struct io_kiocb *req, unsigned int issue_flags)
 		return IOU_ISSUE_SKIP_COMPLETE;
 	else if (ret < 0)
 		req_set_fail(req);
+
+	if (ret >= 0 && req->flags & REQ_F_APOLL_MULTISHOT &&
+	    io_req_post_cqe(req, ret, IORING_CQE_F_MORE))
+		return IOU_ISSUE_SKIP_COMPLETE;
+
 	io_ring_submit_lock(ctx, issue_flags);
 	hlist_del_init(&req->hash_node);
 	io_ring_submit_unlock(ctx, issue_flags);
 	io_req_set_res(req, ret, 0);
+
+	if (issue_flags & IO_URING_F_MULTISHOT)
+		return IOU_STOP_MULTISHOT;
 	return IOU_OK;
 }
-- 
2.47.2

