[PATCHSET v3 0/7] io_uring epoll wait support

* [PATCHSET v3 0/7] io_uring epoll wait support
@ 2025-02-07 17:32 Jens Axboe
  2025-02-07 17:32 ` [PATCH 1/7] eventpoll: abstract out ep_try_send_events() helper Jens Axboe
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: Jens Axboe @ 2025-02-07 17:32 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner

Hi,

One issue people consistently run into when converting legacy epoll
event loops to io_uring is that parts of the event loop still need to
use epoll. And since event loops generally need to wait in one spot,
they add the io_uring fd to the epoll set and continue to use
epoll_wait(2) to wait on events. This is suboptimal on the io_uring
front as there's now an active poller on the ring, and it's suboptimal
as it doesn't give the application the batch waiting (with fine grained
timeouts) that io_uring provides.

This patchset adds support for IORING_OP_EPOLL_WAIT, which does an async
epoll_wait() operation. No sleeping or thread offload is involved, it
relies on the wait_queue_entry callback for retries. With that, the
above event loops can continue to use epoll for certain parts, but
bundle it all under waiting on the ring itself rather than add the ring
fd to the epoll set.
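
As a rough illustration of how an application might arm such a request,
the sketch below fills in a submission entry the way the prep handler in
patch 7 reads it: the epoll fd in fd, the user event buffer in addr, and
the maximum event count in len, with the remaining fields left at zero.
struct demo_sqe, DEMO_OP_EPOLL_WAIT and demo_prep_epoll_wait() are
hypothetical stand-ins for illustration only; they are not the real
struct io_uring_sqe, the real opcode value, or any liburing helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>

/* Hypothetical opcode value; the real one is assigned by enum io_uring_op. */
#define DEMO_OP_EPOLL_WAIT	0xffu

/* Hypothetical, trimmed-down mirror of the SQE fields this opcode reads. */
struct demo_sqe {
	uint8_t		opcode;
	int32_t		fd;		/* epoll instance to wait on */
	uint64_t	addr;		/* user pointer to the epoll_event array */
	uint32_t	len;		/* maxevents */
	uint64_t	user_data;	/* echoed back in the completion */
};

/* Fill in one epoll wait request; every unused field stays zero. */
static void demo_prep_epoll_wait(struct demo_sqe *sqe, int epfd,
				 struct epoll_event *events,
				 unsigned int maxevents, uint64_t user_data)
{
	assert(sqe != NULL && events != NULL);

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = DEMO_OP_EPOLL_WAIT;
	sqe->fd = epfd;
	sqe->addr = (uint64_t)(uintptr_t)events;
	sqe->len = maxevents;
	sqe->user_data = user_data;
}
```

Note that no timeout is passed here; the application would rely on the
timeout arguments of the ring wait itself.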

Patches 1..2 are just prep patches, and patch 3 adds the epoll change
to allow io_uring to queue a callback if no events are available.
Patch 4 adds a helper to remove a wait entry from the epoll wait queue.
Patches 5..6 are just prep patches on the io_uring side, and patch 7
finally adds IORING_OP_EPOLL_WAIT support.

Patches can also be found here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-epoll-wait

and are against 6.14-rc1 + already pending io_uring patches.

Since v2:
- Drop multishot support, to keep the initial version much simpler
- Drop provided buffers support, not required without multishot
- Cleanup epoll bits, notably adding a separate helper for queueing and
  checking for events
- Various other fixes and cleanups

-- 
Jens Axboe


* [PATCH 1/7] eventpoll: abstract out ep_try_send_events() helper
  2025-02-07 17:32 [PATCHSET v3 0/7] io_uring epoll wait support Jens Axboe
@ 2025-02-07 17:32 ` Jens Axboe
  2025-02-07 17:32 ` [PATCH 2/7] eventpoll: abstract out parameter sanity checking Jens Axboe
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2025-02-07 17:32 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

In preparation for reusing this helper in another epoll setup helper,
abstract it out.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/eventpoll.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 7c0980db77b3..67d1808fda0e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1980,6 +1980,22 @@ static int ep_autoremove_wake_function(struct wait_queue_entry *wq_entry,
 	return ret;
 }
 
+static int ep_try_send_events(struct eventpoll *ep,
+			      struct epoll_event __user *events, int maxevents)
+{
+	int res;
+
+	/*
+	 * Try to transfer events to user space. In case we get 0 events and
+	 * there's still timeout left over, we go trying again in search of
+	 * more luck.
+	 */
+	res = ep_send_events(ep, events, maxevents);
+	if (res > 0)
+		ep_suspend_napi_irqs(ep);
+	return res;
+}
+
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
  *           event buffer.
@@ -2031,17 +2047,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 
 	while (1) {
 		if (eavail) {
-			/*
-			 * Try to transfer events to user space. In case we get
-			 * 0 events and there's still timeout left over, we go
-			 * trying again in search of more luck.
-			 */
-			res = ep_send_events(ep, events, maxevents);
-			if (res) {
-				if (res > 0)
-					ep_suspend_napi_irqs(ep);
+			res = ep_try_send_events(ep, events, maxevents);
+			if (res)
 				return res;
-			}
 		}
 
 		if (timed_out)
-- 
2.47.2



* [PATCH 2/7] eventpoll: abstract out parameter sanity checking
  2025-02-07 17:32 [PATCHSET v3 0/7] io_uring epoll wait support Jens Axboe
  2025-02-07 17:32 ` [PATCH 1/7] eventpoll: abstract out ep_try_send_events() helper Jens Axboe
@ 2025-02-07 17:32 ` Jens Axboe
  2025-02-07 17:32 ` [PATCH 3/7] eventpoll: add epoll_queue() interface Jens Axboe
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2025-02-07 17:32 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

Add a helper that checks the validity of the file descriptor and
other parameters passed in to epoll_wait().

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/eventpoll.c | 39 +++++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 14 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 67d1808fda0e..14466765b85d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2453,6 +2453,27 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	return do_epoll_ctl(epfd, op, fd, &epds, false);
 }
 
+static int ep_check_params(struct file *file, struct epoll_event __user *evs,
+			   int maxevents)
+{
+	/* The maximum number of event must be greater than zero */
+	if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
+		return -EINVAL;
+
+	/* Verify that the area passed by the user is writeable */
+	if (!access_ok(evs, maxevents * sizeof(struct epoll_event)))
+		return -EFAULT;
+
+	/*
+	 * We have to check that the file structure underneath the fd
+	 * the user passed to us _is_ an eventpoll file.
+	 */
+	if (!is_file_epoll(file))
+		return -EINVAL;
+
+	return 0;
+}
+
 /*
  * Implement the event wait interface for the eventpoll file. It is the kernel
  * part of the user space epoll_wait(2).
@@ -2461,26 +2482,16 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events,
 			 int maxevents, struct timespec64 *to)
 {
 	struct eventpoll *ep;
-
-	/* The maximum number of event must be greater than zero */
-	if (maxevents <= 0 || maxevents > EP_MAX_EVENTS)
-		return -EINVAL;
-
-	/* Verify that the area passed by the user is writeable */
-	if (!access_ok(events, maxevents * sizeof(struct epoll_event)))
-		return -EFAULT;
+	int ret;
 
 	/* Get the "struct file *" for the eventpoll file */
 	CLASS(fd, f)(epfd);
 	if (fd_empty(f))
 		return -EBADF;
 
-	/*
-	 * We have to check that the file structure underneath the fd
-	 * the user passed to us _is_ an eventpoll file.
-	 */
-	if (!is_file_epoll(fd_file(f)))
-		return -EINVAL;
+	ret = ep_check_params(fd_file(f), events, maxevents);
+	if (unlikely(ret))
+		return ret;
 
 	/*
 	 * At this point it is safe to assume that the "private_data" contains
-- 
2.47.2



* [PATCH 3/7] eventpoll: add epoll_queue() interface
  2025-02-07 17:32 [PATCHSET v3 0/7] io_uring epoll wait support Jens Axboe
  2025-02-07 17:32 ` [PATCH 1/7] eventpoll: abstract out ep_try_send_events() helper Jens Axboe
  2025-02-07 17:32 ` [PATCH 2/7] eventpoll: abstract out parameter sanity checking Jens Axboe
@ 2025-02-07 17:32 ` Jens Axboe
  2025-02-07 17:32 ` [PATCH 4/7] eventpoll: add helper to remove wait entry from wait queue head Jens Axboe
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2025-02-07 17:32 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

Basic interface that takes a wait_queue_entry rather than posting one on
the stack. The entry can serve as a persistent callback for when new
events arrive.

Works like regular epoll_wait(), except it doesn't block. If events are
available, they are returned. If none are available, the passed in
wait_queue_entry is added to the callback list. The wait_queue_entry
must be previously initialized, and the callback provided will be called
when events are added to the epoll context.
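
The "return events if there are any, otherwise queue the entry"
behaviour described above can be modelled in plain userspace C. This is
only a sketch under simplifying assumptions: a single-slot waiter
instead of a real wait queue, no locking, and a counter standing in for
the ready list. All demo_* names are hypothetical, and DEMO_EIOCBQUEUED
merely mirrors the kernel-internal EIOCBQUEUED code, which userspace
headers do not expose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the kernel-internal EIOCBQUEUED value; not in userspace errno.h. */
#define DEMO_EIOCBQUEUED	529

struct demo_waiter {
	void (*func)(struct demo_waiter *w);	/* run when events arrive */
	bool queued;
	int wakeups;
};

struct demo_epoll {
	int ready;			/* number of pending events */
	struct demo_waiter *waiter;	/* single-slot stand-in for ep->wq */
};

/* Reap up to maxevents pending events, or queue the waiter and say so. */
static int demo_epoll_queue(struct demo_epoll *ep, int maxevents,
			    struct demo_waiter *w)
{
	assert(w->func != NULL);	/* the entry must be initialized */

	if (ep->ready > 0) {
		int n = ep->ready < maxevents ? ep->ready : maxevents;

		ep->ready -= n;
		return n;
	}
	if (!w->queued) {
		w->queued = true;
		ep->waiter = w;
	}
	return -DEMO_EIOCBQUEUED;
}

/* New events arrive: record them and invoke the queued callback once. */
static void demo_epoll_post(struct demo_epoll *ep, int nr)
{
	struct demo_waiter *w = ep->waiter;

	ep->ready += nr;
	if (w) {
		ep->waiter = NULL;
		w->queued = false;
		w->func(w);
	}
}

/* Example callback: just count how often we were woken. */
static void demo_count_wakeup(struct demo_waiter *w)
{
	w->wakeups++;
}
```

A caller would retry demo_epoll_queue() from the callback path, which
is roughly the role the io_uring retry handler plays later in the
series.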

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/eventpoll.c            | 39 +++++++++++++++++++++++++++++++++++++++
 include/linux/eventpoll.h |  4 ++++
 2 files changed, 43 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 14466765b85d..d3ac466ad415 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1996,6 +1996,33 @@ static int ep_try_send_events(struct eventpoll *ep,
 	return res;
 }
 
+static int ep_poll_queue(struct eventpoll *ep,
+			 struct epoll_event __user *events, int maxevents,
+			 struct wait_queue_entry *wait)
+{
+	int res = 0, eavail;
+
+	/* See ep_poll() for commentary */
+	eavail = ep_events_available(ep);
+	while (1) {
+		if (eavail) {
+			res = ep_try_send_events(ep, events, maxevents);
+			if (res)
+				return res;
+		}
+		if (!list_empty_careful(&wait->entry))
+			break;
+		write_lock_irq(&ep->lock);
+		eavail = ep_events_available(ep);
+		if (!eavail)
+			__add_wait_queue_exclusive(&ep->wq, wait);
+		write_unlock_irq(&ep->lock);
+		if (!eavail)
+			break;
+	}
+	return -EIOCBQUEUED;
+}
+
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
  *           event buffer.
@@ -2474,6 +2501,18 @@ static int ep_check_params(struct file *file, struct epoll_event __user *evs,
 	return 0;
 }
 
+int epoll_queue(struct file *file, struct epoll_event __user *events,
+		int maxevents, struct wait_queue_entry *wait)
+{
+	int ret;
+
+	ret = ep_check_params(file, events, maxevents);
+	if (unlikely(ret))
+		return ret;
+
+	return ep_poll_queue(file->private_data, events, maxevents, wait);
+}
+
 /*
  * Implement the event wait interface for the eventpoll file. It is the kernel
  * part of the user space epoll_wait(2).
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 0c0d00fcd131..8de16374b8fe 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -25,6 +25,10 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd, unsigned long t
 /* Used to release the epoll bits inside the "struct file" */
 void eventpoll_release_file(struct file *file);
 
+/* Use to reap events, and/or queue for a callback on new events */
+int epoll_queue(struct file *file, struct epoll_event __user *events,
+		int maxevents, struct wait_queue_entry *wait);
+
 /*
  * This is called from inside fs/file_table.c:__fput() to unlink files
  * from the eventpoll interface. We need to have this facility to cleanup
-- 
2.47.2



* [PATCH 4/7] eventpoll: add helper to remove wait entry from wait queue head
  2025-02-07 17:32 [PATCHSET v3 0/7] io_uring epoll wait support Jens Axboe
                   ` (2 preceding siblings ...)
  2025-02-07 17:32 ` [PATCH 3/7] eventpoll: add epoll_queue() interface Jens Axboe
@ 2025-02-07 17:32 ` Jens Axboe
  2025-02-07 17:32 ` [PATCH 5/7] io_uring/epoll: remove CONFIG_EPOLL guards Jens Axboe
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2025-02-07 17:32 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

__epoll_wait_remove() is the core helper, it kills a given
wait_queue_entry from the eventpoll wait_queue_head. Use it internally,
and provide an overall helper, epoll_wait_remove(), which takes a struct
file and provides the same functionality.

Signed-off-by: Jens Axboe <[email protected]>
---
 fs/eventpoll.c            | 58 +++++++++++++++++++++++++--------------
 include/linux/eventpoll.h |  3 ++
 2 files changed, 40 insertions(+), 21 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index d3ac466ad415..b96cc9193517 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2023,6 +2023,42 @@ static int ep_poll_queue(struct eventpoll *ep,
 	return -EIOCBQUEUED;
 }
 
+static int __epoll_wait_remove(struct eventpoll *ep,
+			       struct wait_queue_entry *wait, int timed_out)
+{
+	int eavail;
+
+	/*
+	 * We were woken up, thus go and try to harvest some events. If timed
+	 * out and still on the wait queue, recheck eavail carefully under
+	 * lock, below.
+	 */
+	eavail = 1;
+
+	if (!list_empty_careful(&wait->entry)) {
+		write_lock_irq(&ep->lock);
+		/*
+		 * If the thread timed out and is not on the wait queue, it
+		 * means that the thread was woken up after its timeout expired
+		 * before it could reacquire the lock. Thus, when wait.entry is
+		 * empty, it needs to harvest events.
+		 */
+		if (timed_out)
+			eavail = list_empty(&wait->entry);
+		__remove_wait_queue(&ep->wq, wait);
+		write_unlock_irq(&ep->lock);
+	}
+
+	return eavail;
+}
+
+int epoll_wait_remove(struct file *file, struct wait_queue_entry *wait)
+{
+	if (is_file_epoll(file))
+		return __epoll_wait_remove(file->private_data, wait, false);
+	return -EINVAL;
+}
+
 /**
  * ep_poll - Retrieves ready events, and delivers them to the caller-supplied
  *           event buffer.
@@ -2135,27 +2171,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 							      HRTIMER_MODE_ABS);
 		__set_current_state(TASK_RUNNING);
 
-		/*
-		 * We were woken up, thus go and try to harvest some events.
-		 * If timed out and still on the wait queue, recheck eavail
-		 * carefully under lock, below.
-		 */
-		eavail = 1;
-
-		if (!list_empty_careful(&wait.entry)) {
-			write_lock_irq(&ep->lock);
-			/*
-			 * If the thread timed out and is not on the wait queue,
-			 * it means that the thread was woken up after its
-			 * timeout expired before it could reacquire the lock.
-			 * Thus, when wait.entry is empty, it needs to harvest
-			 * events.
-			 */
-			if (timed_out)
-				eavail = list_empty(&wait.entry);
-			__remove_wait_queue(&ep->wq, &wait);
-			write_unlock_irq(&ep->lock);
-		}
+		eavail = __epoll_wait_remove(ep, &wait, timed_out);
 	}
 }
 
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 8de16374b8fe..6c088d5e945b 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -29,6 +29,9 @@ void eventpoll_release_file(struct file *file);
 int epoll_queue(struct file *file, struct epoll_event __user *events,
 		int maxevents, struct wait_queue_entry *wait);
 
+/* Remove wait entry */
+int epoll_wait_remove(struct file *file, struct wait_queue_entry *wait);
+
 /*
  * This is called from inside fs/file_table.c:__fput() to unlink files
  * from the eventpoll interface. We need to have this facility to cleanup
-- 
2.47.2



* [PATCH 5/7] io_uring/epoll: remove CONFIG_EPOLL guards
  2025-02-07 17:32 [PATCHSET v3 0/7] io_uring epoll wait support Jens Axboe
                   ` (3 preceding siblings ...)
  2025-02-07 17:32 ` [PATCH 4/7] eventpoll: add helper to remove wait entry from wait queue head Jens Axboe
@ 2025-02-07 17:32 ` Jens Axboe
  2025-02-07 17:32 ` [PATCH 6/7] io_uring/poll: pull ownership handling into poll.h Jens Axboe
  2025-02-07 17:32 ` [PATCH 7/7] io_uring/epoll: add support for IORING_OP_EPOLL_WAIT Jens Axboe
  6 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2025-02-07 17:32 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

Just have the Makefile add the object if epoll is enabled, then it's
not necessary to guard the entire epoll.c file inside a CONFIG_EPOLL
ifdef.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/Makefile | 9 +++++----
 io_uring/epoll.c  | 2 --
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/io_uring/Makefile b/io_uring/Makefile
index d695b60dba4f..7114a6dbd439 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -11,9 +11,10 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o opdef.o kbuf.o rsrc.o notif.o \
 					eventfd.o uring_cmd.o openclose.o \
 					sqpoll.o xattr.o nop.o fs.o splice.o \
 					sync.o msg_ring.o advise.o openclose.o \
-					epoll.o statx.o timeout.o fdinfo.o \
-					cancel.o waitid.o register.o \
-					truncate.o memmap.o alloc_cache.o
+					statx.o timeout.o fdinfo.o cancel.o \
+					waitid.o register.o truncate.o \
+					memmap.o alloc_cache.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
 obj-$(CONFIG_FUTEX)		+= futex.o
-obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
+obj-$(CONFIG_EPOLL)		+= epoll.o
+obj-$(CONFIG_NET_RX_BUSY_POLL)	+= napi.o
diff --git a/io_uring/epoll.c b/io_uring/epoll.c
index 89bff2068a19..7848d9cc073d 100644
--- a/io_uring/epoll.c
+++ b/io_uring/epoll.c
@@ -12,7 +12,6 @@
 #include "io_uring.h"
 #include "epoll.h"
 
-#if defined(CONFIG_EPOLL)
 struct io_epoll {
 	struct file			*file;
 	int				epfd;
@@ -58,4 +57,3 @@ int io_epoll_ctl(struct io_kiocb *req, unsigned int issue_flags)
 	io_req_set_res(req, ret, 0);
 	return IOU_OK;
 }
-#endif
-- 
2.47.2



* [PATCH 6/7] io_uring/poll: pull ownership handling into poll.h
  2025-02-07 17:32 [PATCHSET v3 0/7] io_uring epoll wait support Jens Axboe
                   ` (4 preceding siblings ...)
  2025-02-07 17:32 ` [PATCH 5/7] io_uring/epoll: remove CONFIG_EPOLL guards Jens Axboe
@ 2025-02-07 17:32 ` Jens Axboe
  2025-02-07 17:32 ` [PATCH 7/7] io_uring/epoll: add support for IORING_OP_EPOLL_WAIT Jens Axboe
  6 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2025-02-07 17:32 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

In preparation for using it from somewhere else. Rather than try and
duplicate the functionality, just make it generically available to
io_uring opcodes.

Note: would have to be used carefully, cannot be used by opcodes that
can trigger poll logic.
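
For readers unfamiliar with the ownership scheme being moved, here is a
minimal userspace sketch of the idea using C11 atomics: the low 30 bits
of a counter are references, whoever bumps them from zero becomes the
owner, and bit 31 marks cancelation. The demo_* names are hypothetical,
and the slowpath taken once the count reaches the bias value is left out
on purpose.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_POLL_CANCEL_FLAG	(1u << 31)
#define DEMO_POLL_REF_MASK	((1u << 30) - 1)	/* bits 29..0 */

/* Take a reference; the caller owns the request if the refs were zero. */
static bool demo_get_ownership(atomic_uint *refs)
{
	return !(atomic_fetch_add(refs, 1) & DEMO_POLL_REF_MASK);
}

/* Flag the request as canceled without touching the reference bits. */
static void demo_mark_cancelled(atomic_uint *refs)
{
	atomic_fetch_or(refs, DEMO_POLL_CANCEL_FLAG);
}

/*
 * Owner side: drop the references it observed. Returns true if more
 * references were taken in the meantime, in which case the owner has
 * to loop and process again before letting go.
 */
static bool demo_put_refs(atomic_uint *refs, unsigned int seen)
{
	seen &= DEMO_POLL_REF_MASK;
	assert(seen != 0);	/* an owner always holds at least one ref */

	return (atomic_fetch_sub(refs, seen) - seen) & DEMO_POLL_REF_MASK;
}
```

The point of the scheme is that a wakeup or cancel racing with the
owner only bumps the counter; the owner notices on release and handles
it, so two contexts never modify the request at once.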

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/poll.c | 30 +-----------------------------
 io_uring/poll.h | 31 +++++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+), 29 deletions(-)

diff --git a/io_uring/poll.c b/io_uring/poll.c
index bb1c0cd4f809..5e44ac562491 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -41,16 +41,6 @@ struct io_poll_table {
 	__poll_t result_mask;
 };
 
-#define IO_POLL_CANCEL_FLAG	BIT(31)
-#define IO_POLL_RETRY_FLAG	BIT(30)
-#define IO_POLL_REF_MASK	GENMASK(29, 0)
-
-/*
- * We usually have 1-2 refs taken, 128 is more than enough and we want to
- * maximise the margin between this amount and the moment when it overflows.
- */
-#define IO_POLL_REF_BIAS	128
-
 #define IO_WQE_F_DOUBLE		1
 
 static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
@@ -70,7 +60,7 @@ static inline bool wqe_is_double(struct wait_queue_entry *wqe)
 	return priv & IO_WQE_F_DOUBLE;
 }
 
-static bool io_poll_get_ownership_slowpath(struct io_kiocb *req)
+bool io_poll_get_ownership_slowpath(struct io_kiocb *req)
 {
 	int v;
 
@@ -85,24 +75,6 @@ static bool io_poll_get_ownership_slowpath(struct io_kiocb *req)
 	return !(atomic_fetch_inc(&req->poll_refs) & IO_POLL_REF_MASK);
 }
 
-/*
- * If refs part of ->poll_refs (see IO_POLL_REF_MASK) is 0, it's free. We can
- * bump it and acquire ownership. It's disallowed to modify requests while not
- * owning it, that prevents from races for enqueueing task_work's and b/w
- * arming poll and wakeups.
- */
-static inline bool io_poll_get_ownership(struct io_kiocb *req)
-{
-	if (unlikely(atomic_read(&req->poll_refs) >= IO_POLL_REF_BIAS))
-		return io_poll_get_ownership_slowpath(req);
-	return !(atomic_fetch_inc(&req->poll_refs) & IO_POLL_REF_MASK);
-}
-
-static void io_poll_mark_cancelled(struct io_kiocb *req)
-{
-	atomic_or(IO_POLL_CANCEL_FLAG, &req->poll_refs);
-}
-
 static struct io_poll *io_poll_get_double(struct io_kiocb *req)
 {
 	/* pure poll stashes this in ->async_data, poll driven retry elsewhere */
diff --git a/io_uring/poll.h b/io_uring/poll.h
index 04ede93113dc..2f416cd3be13 100644
--- a/io_uring/poll.h
+++ b/io_uring/poll.h
@@ -21,6 +21,18 @@ struct async_poll {
 	struct io_poll		*double_poll;
 };
 
+#define IO_POLL_CANCEL_FLAG	BIT(31)
+#define IO_POLL_RETRY_FLAG	BIT(30)
+#define IO_POLL_REF_MASK	GENMASK(29, 0)
+
+bool io_poll_get_ownership_slowpath(struct io_kiocb *req);
+
+/*
+ * We usually have 1-2 refs taken, 128 is more than enough and we want to
+ * maximise the margin between this amount and the moment when it overflows.
+ */
+#define IO_POLL_REF_BIAS	128
+
 /*
  * Must only be called inside issue_flags & IO_URING_F_MULTISHOT, or
  * potentially other cases where we already "own" this poll request.
@@ -30,6 +42,25 @@ static inline void io_poll_multishot_retry(struct io_kiocb *req)
 	atomic_inc(&req->poll_refs);
 }
 
+/*
+ * If refs part of ->poll_refs (see IO_POLL_REF_MASK) is 0, it's free. We can
+ * bump it and acquire ownership. It's disallowed to modify requests while not
+ * owning it, that prevents from races for enqueueing task_work's and b/w
+ * arming poll and wakeups.
+ */
+static inline bool io_poll_get_ownership(struct io_kiocb *req)
+{
+	if (unlikely(atomic_read(&req->poll_refs) >= IO_POLL_REF_BIAS))
+		return io_poll_get_ownership_slowpath(req);
+	return !(atomic_fetch_inc(&req->poll_refs) & IO_POLL_REF_MASK);
+}
+
+static inline void io_poll_mark_cancelled(struct io_kiocb *req)
+{
+	atomic_or(IO_POLL_CANCEL_FLAG, &req->poll_refs);
+}
+
+
 int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_poll_add(struct io_kiocb *req, unsigned int issue_flags);
 
-- 
2.47.2



* [PATCH 7/7] io_uring/epoll: add support for IORING_OP_EPOLL_WAIT
  2025-02-07 17:32 [PATCHSET v3 0/7] io_uring epoll wait support Jens Axboe
                   ` (5 preceding siblings ...)
  2025-02-07 17:32 ` [PATCH 6/7] io_uring/poll: pull ownership handling into poll.h Jens Axboe
@ 2025-02-07 17:32 ` Jens Axboe
  2025-02-08 23:27   ` Pavel Begunkov
  6 siblings, 1 reply; 11+ messages in thread
From: Jens Axboe @ 2025-02-07 17:32 UTC (permalink / raw)
  To: io-uring; +Cc: linux-fsdevel, brauner, Jens Axboe

For existing epoll event loops that can't fully convert to io_uring,
the usual approach is to add the io_uring fd to the epoll
instance and use epoll_wait() to wait on both "legacy" and io_uring
events. While this works, it isn't optimal as:

1) epoll_wait() is pretty limited in what it can do. It does not support
   partial reaping of events, or waiting on a batch of events.

2) When an io_uring ring is added to an epoll instance, it activates the
   io_uring "I'm being polled" logic which slows things down.

Rather than use this approach, with EPOLL_WAIT support added to io_uring,
event loops can use the normal io_uring wait logic for everything, as
long as an epoll wait request has been armed with io_uring.

Note that IORING_OP_EPOLL_WAIT does NOT take a timeout value, as this
is an async request. Waiting on io_uring events in general has various
timeout parameters, and those are the ones that should be used when
waiting on any kind of request. If events are immediately available for
reaping, then this opcode will return those immediately. If none are
available, then it will post an async completion when they become
available.

cqe->res will contain either an error code (< 0 value) for a malformed
request, invalid epoll instance, etc., or a positive result indicating
how many events were reaped.
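
On the application side, handling such a completion might look roughly
like the sketch below: a negative result is passed through as an error,
and a positive one is the number of entries filled in the event buffer
that was supplied with the request. demo_handle_epoll_cqe() and the
callback type are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/epoll.h>

typedef void (*demo_event_fn)(const struct epoll_event *ev);

/*
 * Interpret the result of an epoll wait completion. Returns the number
 * of events dispatched, or the negative error code unchanged.
 */
static int demo_handle_epoll_cqe(int res, const struct epoll_event *events,
				 int maxevents, demo_event_fn fn)
{
	if (res < 0)
		return res;		/* e.g. -EINVAL or -ECANCELED */

	assert(res <= maxevents);	/* never more than was asked for */
	for (int i = 0; i < res; i++)
		fn(&events[i]);
	return res;
}

/* Example callback: count the events that were delivered. */
static int demo_seen;

static void demo_count_event(const struct epoll_event *ev)
{
	(void)ev;
	demo_seen++;
}
```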

IORING_OP_EPOLL_WAIT requests may be canceled using the normal io_uring
cancelation infrastructure. The poll logic for managing ownership is
adopted to guard the epoll side too.

Signed-off-by: Jens Axboe <[email protected]>
---
 include/linux/io_uring_types.h |   4 +
 include/uapi/linux/io_uring.h  |   1 +
 io_uring/cancel.c              |   5 ++
 io_uring/epoll.c               | 143 +++++++++++++++++++++++++++++++++
 io_uring/epoll.h               |  22 +++++
 io_uring/io_uring.c            |   5 ++
 io_uring/opdef.c               |  14 ++++
 7 files changed, 194 insertions(+)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index e2fef264ff8b..031ba708a81d 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -369,6 +369,10 @@ struct io_ring_ctx {
 	struct io_alloc_cache	futex_cache;
 #endif
 
+#ifdef CONFIG_EPOLL
+	struct hlist_head	epoll_list;
+#endif
+
 	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e11c82638527..a559e1e1544a 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -278,6 +278,7 @@ enum io_uring_op {
 	IORING_OP_FTRUNCATE,
 	IORING_OP_BIND,
 	IORING_OP_LISTEN,
+	IORING_OP_EPOLL_WAIT,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index 0870060bac7c..d1af9496d9b3 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -17,6 +17,7 @@
 #include "timeout.h"
 #include "waitid.h"
 #include "futex.h"
+#include "epoll.h"
 #include "cancel.h"
 
 struct io_cancel {
@@ -128,6 +129,10 @@ int io_try_cancel(struct io_uring_task *tctx, struct io_cancel_data *cd,
 	if (ret != -ENOENT)
 		return ret;
 
+	ret = io_epoll_wait_cancel(ctx, cd, issue_flags);
+	if (ret != -ENOENT)
+		return ret;
+
 	spin_lock(&ctx->completion_lock);
 	if (!(cd->flags & IORING_ASYNC_CANCEL_FD))
 		ret = io_timeout_cancel(ctx, cd);
diff --git a/io_uring/epoll.c b/io_uring/epoll.c
index 7848d9cc073d..8f54bb1c39de 100644
--- a/io_uring/epoll.c
+++ b/io_uring/epoll.c
@@ -11,6 +11,7 @@
 
 #include "io_uring.h"
 #include "epoll.h"
+#include "poll.h"
 
 struct io_epoll {
 	struct file			*file;
@@ -20,6 +21,13 @@ struct io_epoll {
 	struct epoll_event		event;
 };
 
+struct io_epoll_wait {
+	struct file			*file;
+	int				maxevents;
+	struct epoll_event __user	*events;
+	struct wait_queue_entry		wait;
+};
+
 int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_epoll *epoll = io_kiocb_to_cmd(req, struct io_epoll);
@@ -57,3 +65,138 @@ int io_epoll_ctl(struct io_kiocb *req, unsigned int issue_flags)
 	io_req_set_res(req, ret, 0);
 	return IOU_OK;
 }
+
+static void __io_epoll_finish(struct io_kiocb *req, int res)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	lockdep_assert_held(&req->ctx->uring_lock);
+
+	epoll_wait_remove(req->file, &iew->wait);
+	hlist_del_init(&req->hash_node);
+	io_req_set_res(req, res, 0);
+	req->io_task_work.func = io_req_task_complete;
+	io_req_task_work_add(req);
+}
+
+static void __io_epoll_cancel(struct io_kiocb *req)
+{
+	__io_epoll_finish(req, -ECANCELED);
+}
+
+static bool __io_epoll_wait_cancel(struct io_kiocb *req)
+{
+	io_poll_mark_cancelled(req);
+	if (io_poll_get_ownership(req))
+		__io_epoll_cancel(req);
+	return true;
+}
+
+bool io_epoll_wait_remove_all(struct io_ring_ctx *ctx, struct io_uring_task *tctx,
+			      bool cancel_all)
+{
+	return io_cancel_remove_all(ctx, tctx, &ctx->epoll_list, cancel_all, __io_epoll_wait_cancel);
+}
+
+int io_epoll_wait_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
+			 unsigned int issue_flags)
+{
+	return io_cancel_remove(ctx, cd, issue_flags, &ctx->epoll_list, __io_epoll_wait_cancel);
+}
+
+static void io_epoll_retry(struct io_kiocb *req, struct io_tw_state *ts)
+{
+	int v;
+
+	do {
+		v = atomic_read(&req->poll_refs);
+		if (unlikely(v != 1)) {
+			if (WARN_ON_ONCE(!(v & IO_POLL_REF_MASK)))
+				return;
+			if (v & IO_POLL_CANCEL_FLAG) {
+				__io_epoll_cancel(req);
+				return;
+			}
+		}
+		v &= IO_POLL_REF_MASK;
+	} while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);
+
+	io_req_task_submit(req, ts);
+}
+
+static int io_epoll_execute(struct io_kiocb *req)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	list_del_init_careful(&iew->wait.entry);
+	if (io_poll_get_ownership(req)) {
+		req->io_task_work.func = io_epoll_retry;
+		io_req_task_work_add(req);
+	}
+
+	return 1;
+}
+
+static __cold int io_epoll_pollfree_wake(struct io_kiocb *req)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	io_poll_mark_cancelled(req);
+	list_del_init_careful(&iew->wait.entry);
+	io_epoll_execute(req);
+	return 1;
+}
+
+static int io_epoll_wait_fn(struct wait_queue_entry *wait, unsigned mode,
+			    int sync, void *key)
+{
+	struct io_kiocb *req = wait->private;
+	__poll_t mask = key_to_poll(key);
+
+	if (unlikely(mask & POLLFREE))
+		return io_epoll_pollfree_wake(req);
+
+	return io_epoll_execute(req);
+}
+
+int io_epoll_wait_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	if (sqe->off || sqe->rw_flags || sqe->buf_index || sqe->splice_fd_in)
+		return -EINVAL;
+
+	iew->maxevents = READ_ONCE(sqe->len);
+	iew->events = u64_to_user_ptr(READ_ONCE(sqe->addr));
+
+	iew->wait.flags = 0;
+	iew->wait.private = req;
+	iew->wait.func = io_epoll_wait_fn;
+	INIT_LIST_HEAD(&iew->wait.entry);
+	INIT_HLIST_NODE(&req->hash_node);
+	atomic_set(&req->poll_refs, 0);
+	return 0;
+}
+
+int io_epoll_wait(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+	struct io_ring_ctx *ctx = req->ctx;
+	int ret;
+
+	io_ring_submit_lock(ctx, issue_flags);
+
+	ret = epoll_queue(req->file, iew->events, iew->maxevents, &iew->wait);
+	if (ret == -EIOCBQUEUED) {
+		if (hlist_unhashed(&req->hash_node))
+			hlist_add_head(&req->hash_node, &ctx->epoll_list);
+		io_ring_submit_unlock(ctx, issue_flags);
+		return IOU_ISSUE_SKIP_COMPLETE;
+	} else if (ret < 0) {
+		req_set_fail(req);
+	}
+	hlist_del_init(&req->hash_node);
+	io_ring_submit_unlock(ctx, issue_flags);
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
diff --git a/io_uring/epoll.h b/io_uring/epoll.h
index 870cce11ba98..296940d89063 100644
--- a/io_uring/epoll.h
+++ b/io_uring/epoll.h
@@ -1,6 +1,28 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include "cancel.h"
+
 #if defined(CONFIG_EPOLL)
+int io_epoll_wait_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
+			 unsigned int issue_flags);
+bool io_epoll_wait_remove_all(struct io_ring_ctx *ctx, struct io_uring_task *tctx,
+			      bool cancel_all);
+
 int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_epoll_ctl(struct io_kiocb *req, unsigned int issue_flags);
+int io_epoll_wait_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_epoll_wait(struct io_kiocb *req, unsigned int issue_flags);
+#else
+static inline bool io_epoll_wait_remove_all(struct io_ring_ctx *ctx,
+					    struct io_uring_task *tctx,
+					    bool cancel_all)
+{
+	return false;
+}
+static inline int io_epoll_wait_cancel(struct io_ring_ctx *ctx,
+				       struct io_cancel_data *cd,
+				       unsigned int issue_flags)
+{
+	return 0;
+}
 #endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ec98a0ec6f34..73b9246eaa50 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -93,6 +93,7 @@
 #include "notif.h"
 #include "waitid.h"
 #include "futex.h"
+#include "epoll.h"
 #include "napi.h"
 #include "uring_cmd.h"
 #include "msg_ring.h"
@@ -356,6 +357,9 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_HLIST_HEAD(&ctx->waitid_list);
 #ifdef CONFIG_FUTEX
 	INIT_HLIST_HEAD(&ctx->futex_list);
+#endif
+#ifdef CONFIG_EPOLL
+	INIT_HLIST_HEAD(&ctx->epoll_list);
 #endif
 	INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
@@ -3079,6 +3083,7 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 	ret |= io_poll_remove_all(ctx, tctx, cancel_all);
 	ret |= io_waitid_remove_all(ctx, tctx, cancel_all);
 	ret |= io_futex_remove_all(ctx, tctx, cancel_all);
+	ret |= io_epoll_wait_remove_all(ctx, tctx, cancel_all);
 	ret |= io_uring_try_cancel_uring_cmd(ctx, tctx, cancel_all);
 	mutex_unlock(&ctx->uring_lock);
 	ret |= io_kill_timeouts(ctx, tctx, cancel_all);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index e8baef4e5146..44553a657476 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -514,6 +514,17 @@ const struct io_issue_def io_issue_defs[] = {
 		.async_size		= sizeof(struct io_async_msghdr),
 #else
 		.prep			= io_eopnotsupp_prep,
+#endif
+	},
+	[IORING_OP_EPOLL_WAIT] = {
+		.needs_file		= 1,
+		.unbound_nonreg_file	= 1,
+		.audit_skip		= 1,
+#if defined(CONFIG_EPOLL)
+		.prep			= io_epoll_wait_prep,
+		.issue			= io_epoll_wait,
+#else
+		.prep			= io_eopnotsupp_prep,
 #endif
 	},
 };
@@ -745,6 +756,9 @@ const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_LISTEN] = {
 		.name			= "LISTEN",
 	},
+	[IORING_OP_EPOLL_WAIT] = {
+		.name			= "EPOLL_WAIT",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
-- 
2.47.2


* Re: [PATCH 7/7] io_uring/epoll: add support for IORING_OP_EPOLL_WAIT
  2025-02-07 17:32 ` [PATCH 7/7] io_uring/epoll: add support for IORING_OP_EPOLL_WAIT Jens Axboe
@ 2025-02-08 23:27   ` Pavel Begunkov
  2025-02-09  0:24     ` Pavel Begunkov
  0 siblings, 1 reply; 11+ messages in thread
From: Pavel Begunkov @ 2025-02-08 23:27 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-fsdevel, brauner

On 2/7/25 17:32, Jens Axboe wrote:
> For existing epoll event loops that can't fully convert to io_uring,
> the usual approach is to add the io_uring fd to the epoll
> instance and use epoll_wait() to wait on both "legacy" and io_uring
> events. While this works, it isn't optimal as:
> 
> 1) epoll_wait() is pretty limited in what it can do. It does not support
>     partial reaping of events, or waiting on a batch of events.
> 
> 2) When an io_uring ring is added to an epoll instance, it activates the
>     io_uring "I'm being polled" logic which slows things down.
> 
> Rather than use this approach, with EPOLL_WAIT support added to io_uring,
> event loops can use the normal io_uring wait logic for everything, as
> long as an epoll wait request has been armed with io_uring.
> 
> Note that IORING_OP_EPOLL_WAIT does NOT take a timeout value, as this
> is an async request. Waiting on io_uring events in general has various
> timeout parameters, and those are the ones that should be used when
> waiting on any kind of request. If events are immediately available for
> reaping, then this opcode will return those immediately. If none are
> available, then it will post an async completion when they become
> available.
> 
> cqe->res will contain either an error code (< 0 value) for a malformed
> request, invalid epoll instance, etc., or a positive result indicating
> how many events were reaped.
> 
> IORING_OP_EPOLL_WAIT requests may be canceled using the normal io_uring
> cancelation infrastructure. The poll logic for managing ownership is
> adopted to guard the epoll side too.
> 
> Signed-off-by: Jens Axboe <[email protected]>
> ---
>   include/linux/io_uring_types.h |   4 +
>   include/uapi/linux/io_uring.h  |   1 +
>   io_uring/cancel.c              |   5 ++
>   io_uring/epoll.c               | 143 +++++++++++++++++++++++++++++++++
>   io_uring/epoll.h               |  22 +++++
>   io_uring/io_uring.c            |   5 ++
>   io_uring/opdef.c               |  14 ++++
>   7 files changed, 194 insertions(+)
> 
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index e2fef264ff8b..031ba708a81d 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -369,6 +369,10 @@ struct io_ring_ctx {
...
> +bool io_epoll_wait_remove_all(struct io_ring_ctx *ctx, struct io_uring_task *tctx,
> +			      bool cancel_all)
> +{
> +	return io_cancel_remove_all(ctx, tctx, &ctx->epoll_list, cancel_all, __io_epoll_wait_cancel);
> +}
> +
> +int io_epoll_wait_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
> +			 unsigned int issue_flags)
> +{
> +	return io_cancel_remove(ctx, cd, issue_flags, &ctx->epoll_list, __io_epoll_wait_cancel);
> +}
> +
> +static void io_epoll_retry(struct io_kiocb *req, struct io_tw_state *ts)
> +{
> +	int v;
> +
> +	do {
> +		v = atomic_read(&req->poll_refs);
> +		if (unlikely(v != 1)) {
> +			if (WARN_ON_ONCE(!(v & IO_POLL_REF_MASK)))
> +				return;
> +			if (v & IO_POLL_CANCEL_FLAG) {
> +				__io_epoll_cancel(req);
> +				return;
> +			}
> +		}
> +		v &= IO_POLL_REF_MASK;
> +	} while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);

I actually looked up the epoll code this time. If we disregard
cancellations, you have only 1 wait entry, which should've been removed
from the queue by io_epoll_wait_fn(), in which case the entire loop is
doing nothing as there is no one to race with. ->hash_node is the only
shared part, but it's sync'ed by the mutex.

As for cancellation, epoll_wait_remove() also removes the entry, and
you can rely on it to tell if the entry was removed inside, from
which you derive if you're the current owner.
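
To make the point about the quoted loop concrete, here is a rough
userspace model of it. It is only a sketch: C11 atomics stand in for
the kernel atomics, REF_MASK and CANCEL_FLAG are stand-in values for
IO_POLL_REF_MASK and IO_POLL_CANCEL_FLAG, and model_retry() plus the
enum are hypothetical names that exist only for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in bit layout for this model only, not taken from the patch. */
#define REF_MASK	0x3fffffffu
#define CANCEL_FLAG	0x80000000u

enum retry_result { RETRY_OWNED, RETRY_CANCELED, RETRY_BUG };

/*
 * Userspace model of the ownership loop in io_epoll_retry().
 * *passes reports how many times the loop body ran.
 */
static enum retry_result model_retry(atomic_uint *refs, int *passes)
{
	unsigned int v;

	*passes = 0;
	do {
		(*passes)++;
		v = atomic_load(refs);
		if (v != 1) {
			/* No reference held at all: the WARN_ON_ONCE() case. */
			if (!(v & REF_MASK))
				return RETRY_BUG;
			/* Cancelation was flagged while we held the ref. */
			if (v & CANCEL_FLAG)
				return RETRY_CANCELED;
		}
		v &= REF_MASK;
		/*
		 * atomic_fetch_sub() returns the old value, so subtract v
		 * again to get the new value, as atomic_sub_return() would.
		 */
	} while ((atomic_fetch_sub(refs, v) - v) & REF_MASK);

	return RETRY_OWNED;
}
```

With a single wait entry and nobody else taking references, the
counter is 1 on entry, the subtraction brings it to 0, and the loop
body runs exactly once. A second pass only happens if someone bumps
the counter between the load and the subtraction.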

Maybe this handling might be useful for the multishot mode, perhaps
along the lines of:

io_epoll_retry()
{
	do {
		res = epoll_get_events();
		if (one_shot || cancel) {
			wq_remove();
			unhash();
			complete_req(res);
			return;
		}

		post_cqe(res);

		// now recheck if new events came while we were processing
		// the previous batch.
	} while (refs_drop(req->poll_refs));
}

epoll_issue(issue_flags) {
	queue_poll();
	return;
}

But it might be better to just poll the epoll fd, reuse all the
io_uring polling machinery, and implement IO_URING_F_MULTISHOT for
the epoll opcode.

epoll_issue(issue_flags) {
	if (!(issue_flags & IO_URING_F_MULTISHOT))
		return -EAGAIN;

	res = epoll_check_events();
	post_cqe(res);
	etc.
}

I think that would make this patch quite trivial, including
the multishot mode.
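
For reference, the property this relies on is visible from userspace
too: an epoll fd is itself pollable and reports POLLIN once it has
ready events, and a zero-timeout epoll_wait() reaps without sleeping.
A minimal sketch, assuming a Linux system; epfd_has_events(),
reap_nowait() and epoll_poll_demo() are hypothetical helper names for
this illustration and are not part of the patch.

```c
#include <assert.h>
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Nonzero if the epoll fd itself is readable, i.e. has ready events. */
static int epfd_has_events(int epfd)
{
	struct pollfd pfd = { .fd = epfd, .events = POLLIN };

	return poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN);
}

/* Reap without sleeping; 0 means "nothing yet, retry on readiness". */
static int reap_nowait(int epfd, struct epoll_event *evs, int maxevents)
{
	return epoll_wait(epfd, evs, maxevents, 0);
}

/* Returns the number of events reaped, or a negative value on failure. */
static int epoll_poll_demo(void)
{
	struct epoll_event ev = { .events = EPOLLIN }, out[4];
	int pipefd[2], epfd, n = -1;

	if (pipe(pipefd) < 0)
		return -1;
	epfd = epoll_create1(0);
	if (epfd < 0)
		goto out_pipe;
	ev.data.fd = pipefd[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
		goto out_ep;

	/* Nothing written yet: epoll fd not readable, nothing to reap. */
	if (epfd_has_events(epfd) || reap_nowait(epfd, out, 4) != 0) {
		n = -2;
		goto out_ep;
	}

	if (write(pipefd[1], "x", 1) != 1)
		goto out_ep;

	/* Now the epoll fd polls readable and the reap returns events. */
	if (!epfd_has_events(epfd)) {
		n = -3;
		goto out_ep;
	}
	n = reap_nowait(epfd, out, 4);
out_ep:
	close(epfd);
out_pipe:
	close(pipefd[0]);
	close(pipefd[1]);
	return n;
}
```

The "reap returned 0" case corresponds to the -EAGAIN return in the
sketch above, where the io_uring poll machinery would arm a poll on
the epoll fd and retry the issue once it becomes readable.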

-- 
Pavel Begunkov


* Re: [PATCH 7/7] io_uring/epoll: add support for IORING_OP_EPOLL_WAIT
  2025-02-08 23:27   ` Pavel Begunkov
@ 2025-02-09  0:24     ` Pavel Begunkov
  2025-02-09 16:19       ` Jens Axboe
  0 siblings, 1 reply; 11+ messages in thread
From: Pavel Begunkov @ 2025-02-09  0:24 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-fsdevel, brauner

On 2/8/25 23:27, Pavel Begunkov wrote:
...
> But it might be better to just poll the epoll fd, reuse all the
> io_uring polling machinery, and implement IO_URING_F_MULTISHOT for
> the epoll opcode.
> 
> epoll_issue(issue_flags) {
>      if (!(issue_flags & IO_URING_F_MULTISHOT))
>          return -EAGAIN;
> 
>      res = epoll_check_events();
>      post_cqe(res);
>      etc.
> }
> 
> I think that would make this patch quite trivial, including
> the multishot mode.

Something like this instead of the last patch. Completely untested;
the eventpoll.c hunk is dirty and might be incorrect, the right mask
still needs to be passed for polling, and all that. At least it looks
simpler, and it probably doesn't need half of the prep patches.


diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index b96cc9193517..99dd8c1a2f2c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1996,33 +1996,6 @@ static int ep_try_send_events(struct eventpoll *ep,
  	return res;
  }
  
-static int ep_poll_queue(struct eventpoll *ep,
-			 struct epoll_event __user *events, int maxevents,
-			 struct wait_queue_entry *wait)
-{
-	int res = 0, eavail;
-
-	/* See ep_poll() for commentary */
-	eavail = ep_events_available(ep);
-	while (1) {
-		if (eavail) {
-			res = ep_try_send_events(ep, events, maxevents);
-			if (res)
-				return res;
-		}
-		if (!list_empty_careful(&wait->entry))
-			break;
-		write_lock_irq(&ep->lock);
-		eavail = ep_events_available(ep);
-		if (!eavail)
-			__add_wait_queue_exclusive(&ep->wq, wait);
-		write_unlock_irq(&ep->lock);
-		if (!eavail)
-			break;
-	}
-	return -EIOCBQUEUED;
-}
-
  static int __epoll_wait_remove(struct eventpoll *ep,
  			       struct wait_queue_entry *wait, int timed_out)
  {
@@ -2517,16 +2490,22 @@ static int ep_check_params(struct file *file, struct epoll_event __user *evs,
  	return 0;
  }
  
-int epoll_queue(struct file *file, struct epoll_event __user *events,
-		int maxevents, struct wait_queue_entry *wait)
+int epoll_sendevents(struct file *file, struct epoll_event __user *events,
+		     int maxevents)
  {
-	int ret;
+	int ret, res = 0, eavail;
  
  	ret = ep_check_params(file, events, maxevents);
  	if (unlikely(ret))
  		return ret;
  
-	return ep_poll_queue(file->private_data, events, maxevents, wait);
+	eavail = ep_events_available(file->private_data);
+	if (eavail) {
+		res = ep_try_send_events(file->private_data, events, maxevents);
+		if (res)
+			return res;
+	}
+	return 0;
  }
  
  /*
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 6c088d5e945b..751e3f325927 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -25,9 +25,8 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd, unsigned long t
  /* Used to release the epoll bits inside the "struct file" */
  void eventpoll_release_file(struct file *file);
  
-/* Use to reap events, and/or queue for a callback on new events */
-int epoll_queue(struct file *file, struct epoll_event __user *events,
-		int maxevents, struct wait_queue_entry *wait);
+int epoll_sendevents(struct file *file, struct epoll_event __user *events,
+		int maxevents);
  
  /* Remove wait entry */
  int epoll_wait_remove(struct file *file, struct wait_queue_entry *wait);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e11c82638527..a559e1e1544a 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -278,6 +278,7 @@ enum io_uring_op {
  	IORING_OP_FTRUNCATE,
  	IORING_OP_BIND,
  	IORING_OP_LISTEN,
+	IORING_OP_EPOLL_WAIT,
  
  	/* this goes last, obviously */
  	IORING_OP_LAST,
diff --git a/io_uring/epoll.c b/io_uring/epoll.c
index 7848d9cc073d..6d2c48ba1923 100644
--- a/io_uring/epoll.c
+++ b/io_uring/epoll.c
@@ -20,6 +20,12 @@ struct io_epoll {
  	struct epoll_event		event;
  };
  
+struct io_epoll_wait {
+	struct file			*file;
+	int				maxevents;
+	struct epoll_event __user	*events;
+};
+
  int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  {
  	struct io_epoll *epoll = io_kiocb_to_cmd(req, struct io_epoll);
@@ -57,3 +63,30 @@ int io_epoll_ctl(struct io_kiocb *req, unsigned int issue_flags)
  	io_req_set_res(req, ret, 0);
  	return IOU_OK;
  }
+
+int io_epoll_wait_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	if (sqe->off || sqe->rw_flags || sqe->buf_index || sqe->splice_fd_in)
+		return -EINVAL;
+
+	iew->maxevents = READ_ONCE(sqe->len);
+	iew->events = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	return 0;
+}
+
+int io_epoll_wait(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+	int ret;
+
+	ret = epoll_sendevents(req->file, iew->events, iew->maxevents);
+	if (ret == 0)
+		return -EAGAIN;
+	if (ret < 0)
+		req_set_fail(req);
+
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
diff --git a/io_uring/epoll.h b/io_uring/epoll.h
index 870cce11ba98..4111997c360b 100644
--- a/io_uring/epoll.h
+++ b/io_uring/epoll.h
@@ -3,4 +3,6 @@
  #if defined(CONFIG_EPOLL)
  int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
  int io_epoll_ctl(struct io_kiocb *req, unsigned int issue_flags);
+int io_epoll_wait_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_epoll_wait(struct io_kiocb *req, unsigned int issue_flags);
  #endif
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index e8baef4e5146..bd62d6068b61 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -514,6 +514,18 @@ const struct io_issue_def io_issue_defs[] = {
  		.async_size		= sizeof(struct io_async_msghdr),
  #else
  		.prep			= io_eopnotsupp_prep,
+#endif
+	},
+	[IORING_OP_EPOLL_WAIT] = {
+		.needs_file		= 1,
+		.audit_skip		= 1,
+		.pollout		= 1,
+		.pollin			= 1,
+#if defined(CONFIG_EPOLL)
+		.prep			= io_epoll_wait_prep,
+		.issue			= io_epoll_wait,
+#else
+		.prep			= io_eopnotsupp_prep,
  #endif
  	},
  };
@@ -745,6 +757,9 @@ const struct io_cold_def io_cold_defs[] = {
  	[IORING_OP_LISTEN] = {
  		.name			= "LISTEN",
  	},
+	[IORING_OP_EPOLL_WAIT] = {
+		.name			= "EPOLL_WAIT",
+	},
  };
  
  const char *io_uring_get_opcode(u8 opcode)


* Re: [PATCH 7/7] io_uring/epoll: add support for IORING_OP_EPOLL_WAIT
  2025-02-09  0:24     ` Pavel Begunkov
@ 2025-02-09 16:19       ` Jens Axboe
  0 siblings, 0 replies; 11+ messages in thread
From: Jens Axboe @ 2025-02-09 16:19 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-fsdevel, brauner

On 2/8/25 5:24 PM, Pavel Begunkov wrote:
> On 2/8/25 23:27, Pavel Begunkov wrote:
> ...
>> But it might be better to just poll the epoll fd, reuse all the
>> io_uring polling machinery, and implement IO_URING_F_MULTISHOT for
>> the epoll opcode.
>>
>> epoll_issue(issue_flags) {
>>      if (!(issue_flags & IO_URING_F_MULTISHOT))
>>          return -EAGAIN;
>>
>>      res = epoll_check_events();
>>      post_cqe(res);
>>      etc.
>> }
>>
>> I think that would make this patch quite trivial, including
>> the multishot mode.
> 
> Something like this instead of the last patch. Completely untested;
> the eventpoll.c hunk is dirty and might be incorrect, the right mask
> still needs to be passed for polling, and all that. At least it looks
> simpler, and it probably doesn't need half of the prep patches.

I like that idea! I'll roll with it and get it finalized and then do
some testing.

-- 
Jens Axboe

Thread overview: 11+ messages
2025-02-07 17:32 [PATCHSET v3 0/7] io_uring epoll wait support Jens Axboe
2025-02-07 17:32 ` [PATCH 1/7] eventpoll: abstract out ep_try_send_events() helper Jens Axboe
2025-02-07 17:32 ` [PATCH 2/7] eventpoll: abstract out parameter sanity checking Jens Axboe
2025-02-07 17:32 ` [PATCH 3/7] eventpoll: add epoll_queue() interface Jens Axboe
2025-02-07 17:32 ` [PATCH 4/7] eventpoll: add helper to remove wait entry from wait queue head Jens Axboe
2025-02-07 17:32 ` [PATCH 5/7] io_uring/epoll: remove CONFIG_EPOLL guards Jens Axboe
2025-02-07 17:32 ` [PATCH 6/7] io_uring/poll: pull ownership handling into poll.h Jens Axboe
2025-02-07 17:32 ` [PATCH 7/7] io_uring/epoll: add support for IORING_OP_EPOLL_WAIT Jens Axboe
2025-02-08 23:27   ` Pavel Begunkov
2025-02-09  0:24     ` Pavel Begunkov
2025-02-09 16:19       ` Jens Axboe
