public inbox for [email protected]
* [PATCH 0/6] regions, param pre-mapping and reg waits extension
@ 2024-11-14  4:14 Pavel Begunkov
  2024-11-14  4:14 ` [PATCH 1/6] io_uring: fortify io_pin_pages with a warning Pavel Begunkov
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-14  4:14 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

A bit late, but first we need a better and more generic API for
ring/memory/region registration (see Patch 4), and it changes the API
by extending registered waits into a generic parameter passing
mechanism. That will be useful not only for waiting but also for
request arguments (msghdr, iovec, etc.), the upcoming read/write with
meta attributes (PI), and for the BPF proposal as well.

I covered region registration with tests, but for registered waits it
only enables the basic test. The rest of them need to be enabled and
run before this is merged. Dirty branch:

https://github.com/isilence/liburing/tree/io-uring-region-test

Pavel Begunkov (6):
  io_uring: fortify io_pin_pages with a warning
  io_uring: disable ENTER_EXT_ARG_REG for IOPOLL
  io_uring: temporarily disable registered waits
  io_uring: introduce memory regions
  io_uring: add parameter region registration
  io_uring: enable IORING_ENTER_EXT_ARG_REG back

 include/linux/io_uring_types.h | 20 ++++----
 include/uapi/linux/io_uring.h  | 27 ++++++++++-
 io_uring/io_uring.c            | 26 +++++-----
 io_uring/memmap.c              | 67 +++++++++++++++++++++++++
 io_uring/memmap.h              | 14 ++++++
 io_uring/register.c            | 89 +++++++++++++---------------------
 io_uring/register.h            |  1 -
 7 files changed, 161 insertions(+), 83 deletions(-)

-- 
2.46.0


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/6] io_uring: fortify io_pin_pages with a warning
  2024-11-14  4:14 [PATCH 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
@ 2024-11-14  4:14 ` Pavel Begunkov
  2024-11-14  4:14 ` [PATCH 2/6] io_uring: disable ENTER_EXT_ARG_REG for IOPOLL Pavel Begunkov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-14  4:14 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

We're a bit too frivolous with the types of nr_pages arguments:
converting it to long and back to int, passing an unsigned int pointer
as an int pointer, and so on. It shouldn't cause any problems, but it
should be carefully reviewed; until then, let's add a WARN_ON_ONCE
check to be more confident that callers don't pass poorly checked
arguments.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/memmap.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 85c66fa54956..6ab59c60dfd0 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -140,6 +140,8 @@ struct page **io_pin_pages(unsigned long uaddr, unsigned long len, int *npages)
 	nr_pages = end - start;
 	if (WARN_ON_ONCE(!nr_pages))
 		return ERR_PTR(-EINVAL);
+	if (WARN_ON_ONCE(nr_pages > INT_MAX))
+		return ERR_PTR(-EOVERFLOW);
 
 	pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
 	if (!pages)
-- 
2.46.0
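
To illustrate the type concern above, a minimal userspace sketch (not
code from this series, just the arithmetic): a page count computed in
an unsigned long can silently lose bits once squeezed through an int,
which is exactly what the new WARN_ON_ONCE guards against.

#include <limits.h>
#include <stdio.h>

int main(void)
{
	/* a hypothetical 8 TiB range on a system with 4 KiB pages */
	unsigned long len = 1UL << 43;
	unsigned long nr_pages = len >> 12;	/* 2^31, fine as unsigned long */
	int n = (int)nr_pages;			/* 2^31 doesn't fit in int */

	printf("nr_pages=%lu, as int=%d, INT_MAX=%d\n", nr_pages, n, INT_MAX);
	return 0;
}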


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/6] io_uring: disable ENTER_EXT_ARG_REG for IOPOLL
  2024-11-14  4:14 [PATCH 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
  2024-11-14  4:14 ` [PATCH 1/6] io_uring: fortify io_pin_pages with a warning Pavel Begunkov
@ 2024-11-14  4:14 ` Pavel Begunkov
  2024-11-14  4:14 ` [PATCH 3/6] io_uring: temporarily disable registered waits Pavel Begunkov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-14  4:14 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

IOPOLL doesn't use the extended arguments, so there is no need for it
to support IORING_ENTER_EXT_ARG_REG. Let's disable it for IOPOLL; if
anything, that leaves more space for future extensions.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index bd71782057de..464a70bde7e6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3214,12 +3214,8 @@ static int io_validate_ext_arg(struct io_ring_ctx *ctx, unsigned flags,
 
 	if (!(flags & IORING_ENTER_EXT_ARG))
 		return 0;
-
-	if (flags & IORING_ENTER_EXT_ARG_REG) {
-		if (argsz != sizeof(struct io_uring_reg_wait))
-			return -EINVAL;
-		return PTR_ERR(io_get_ext_arg_reg(ctx, argp));
-	}
+	if (flags & IORING_ENTER_EXT_ARG_REG)
+		return -EINVAL;
 	if (argsz != sizeof(arg))
 		return -EINVAL;
 	if (copy_from_user(&arg, argp, sizeof(arg)))
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/6] io_uring: temporarily disable registered waits
  2024-11-14  4:14 [PATCH 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
  2024-11-14  4:14 ` [PATCH 1/6] io_uring: fortify io_pin_pages with a warning Pavel Begunkov
  2024-11-14  4:14 ` [PATCH 2/6] io_uring: disable ENTER_EXT_ARG_REG for IOPOLL Pavel Begunkov
@ 2024-11-14  4:14 ` Pavel Begunkov
  2024-11-14  4:14 ` [PATCH 4/6] io_uring: introduce memory regions Pavel Begunkov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-14  4:14 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

Disable wait argument registration as it'll be replaced with a more
generic feature. We'll still need IORING_ENTER_EXT_ARG_REG parsing for
a few more commits, so leave that in place.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 10 -----
 include/uapi/linux/io_uring.h  |  3 --
 io_uring/io_uring.c            | 10 -----
 io_uring/register.c            | 82 ----------------------------------
 io_uring/register.h            |  1 -
 5 files changed, 106 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 072e65e93105..52a5da99a205 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -330,14 +330,6 @@ struct io_ring_ctx {
 		atomic_t		cq_wait_nr;
 		atomic_t		cq_timeouts;
 		struct wait_queue_head	cq_wait;
-
-		/*
-		 * If registered with IORING_REGISTER_CQWAIT_REG, a single
-		 * page holds N entries, mapped in cq_wait_arg. cq_wait_index
-		 * is the maximum allowable index.
-		 */
-		struct io_uring_reg_wait	*cq_wait_arg;
-		unsigned char			cq_wait_index;
 	} ____cacheline_aligned_in_smp;
 
 	/* timeouts */
@@ -431,8 +423,6 @@ struct io_ring_ctx {
 	unsigned short			n_sqe_pages;
 	struct page			**ring_pages;
 	struct page			**sqe_pages;
-
-	struct page			**cq_wait_page;
 };
 
 struct io_tw_state {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 5d08435b95a8..132f5db3d4e8 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -627,9 +627,6 @@ enum io_uring_register_op {
 	/* resize CQ ring */
 	IORING_REGISTER_RESIZE_RINGS		= 33,
 
-	/* register fixed io_uring_reg_wait arguments */
-	IORING_REGISTER_CQWAIT_REG		= 34,
-
 	/* this goes last */
 	IORING_REGISTER_LAST,
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 464a70bde7e6..286b7bb73978 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2709,7 +2709,6 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	io_alloc_cache_free(&ctx->msg_cache, io_msg_cache_free);
 	io_futex_cache_free(ctx);
 	io_destroy_buffers(ctx);
-	io_unregister_cqwait_reg(ctx);
 	mutex_unlock(&ctx->uring_lock);
 	if (ctx->sq_creds)
 		put_cred(ctx->sq_creds);
@@ -3195,15 +3194,6 @@ void __io_uring_cancel(bool cancel_all)
 static struct io_uring_reg_wait *io_get_ext_arg_reg(struct io_ring_ctx *ctx,
 			const struct io_uring_getevents_arg __user *uarg)
 {
-	struct io_uring_reg_wait *arg = READ_ONCE(ctx->cq_wait_arg);
-
-	if (arg) {
-		unsigned int index = (unsigned int) (uintptr_t) uarg;
-
-		if (index <= ctx->cq_wait_index)
-			return arg + index;
-	}
-
 	return ERR_PTR(-EFAULT);
 }
 
diff --git a/io_uring/register.c b/io_uring/register.c
index 45edfc57963a..3c5a3cfb186b 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -570,82 +570,6 @@ static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
 	return ret;
 }
 
-void io_unregister_cqwait_reg(struct io_ring_ctx *ctx)
-{
-	unsigned short npages = 1;
-
-	if (!ctx->cq_wait_page)
-		return;
-
-	io_pages_unmap(ctx->cq_wait_arg, &ctx->cq_wait_page, &npages, true);
-	ctx->cq_wait_arg = NULL;
-	if (ctx->user)
-		__io_unaccount_mem(ctx->user, 1);
-}
-
-/*
- * Register a page holding N entries of struct io_uring_reg_wait, which can
- * be used via io_uring_enter(2) if IORING_GETEVENTS_EXT_ARG_REG is set.
- * If that is set with IORING_GETEVENTS_EXT_ARG, then instead of passing
- * in a pointer for a struct io_uring_getevents_arg, an index into this
- * registered array is passed, avoiding two (arg + timeout) copies per
- * invocation.
- */
-static int io_register_cqwait_reg(struct io_ring_ctx *ctx, void __user *uarg)
-{
-	struct io_uring_cqwait_reg_arg arg;
-	struct io_uring_reg_wait *reg;
-	struct page **pages;
-	unsigned long len;
-	int nr_pages, poff;
-	int ret;
-
-	if (ctx->cq_wait_page || ctx->cq_wait_arg)
-		return -EBUSY;
-	if (copy_from_user(&arg, uarg, sizeof(arg)))
-		return -EFAULT;
-	if (!arg.nr_entries || arg.flags)
-		return -EINVAL;
-	if (arg.struct_size != sizeof(*reg))
-		return -EINVAL;
-	if (check_mul_overflow(arg.struct_size, arg.nr_entries, &len))
-		return -EOVERFLOW;
-	if (len > PAGE_SIZE)
-		return -EINVAL;
-	/* offset + len must fit within a page, and must be reg_wait aligned */
-	poff = arg.user_addr & ~PAGE_MASK;
-	if (len + poff > PAGE_SIZE)
-		return -EINVAL;
-	if (poff % arg.struct_size)
-		return -EINVAL;
-
-	pages = io_pin_pages(arg.user_addr, len, &nr_pages);
-	if (IS_ERR(pages))
-		return PTR_ERR(pages);
-	ret = -EINVAL;
-	if (nr_pages != 1)
-		goto out_free;
-	if (ctx->user) {
-		ret = __io_account_mem(ctx->user, 1);
-		if (ret)
-			goto out_free;
-	}
-
-	reg = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
-	if (reg) {
-		ctx->cq_wait_index = arg.nr_entries - 1;
-		WRITE_ONCE(ctx->cq_wait_page, pages);
-		WRITE_ONCE(ctx->cq_wait_arg, (void *) reg + poff);
-		return 0;
-	}
-	ret = -ENOMEM;
-	if (ctx->user)
-		__io_unaccount_mem(ctx->user, 1);
-out_free:
-	io_pages_free(&pages, nr_pages);
-	return ret;
-}
-
 static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			       void __user *arg, unsigned nr_args)
 	__releases(ctx->uring_lock)
@@ -840,12 +764,6 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_register_resize_rings(ctx, arg);
 		break;
-	case IORING_REGISTER_CQWAIT_REG:
-		ret = -EINVAL;
-		if (!arg || nr_args != 1)
-			break;
-		ret = io_register_cqwait_reg(ctx, arg);
-		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/io_uring/register.h b/io_uring/register.h
index 3e935e8fa4b2..a5f39d5ef9e0 100644
--- a/io_uring/register.h
+++ b/io_uring/register.h
@@ -5,6 +5,5 @@
 int io_eventfd_unregister(struct io_ring_ctx *ctx);
 int io_unregister_personality(struct io_ring_ctx *ctx, unsigned id);
 struct file *io_uring_register_get_file(unsigned int fd, bool registered);
-void io_unregister_cqwait_reg(struct io_ring_ctx *ctx);
 
 #endif
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 4/6] io_uring: introduce memory regions
  2024-11-14  4:14 [PATCH 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
                   ` (2 preceding siblings ...)
  2024-11-14  4:14 ` [PATCH 3/6] io_uring: temporarily disable registered waits Pavel Begunkov
@ 2024-11-14  4:14 ` Pavel Begunkov
  2024-11-15 14:44   ` Jens Axboe
  2024-11-14  4:14 ` [PATCH 5/6] io_uring: add parameter region registration Pavel Begunkov
  2024-11-14  4:14 ` [PATCH 6/6] io_uring: enable IORING_ENTER_EXT_ARG_REG back Pavel Begunkov
  5 siblings, 1 reply; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-14  4:14 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

We've got a good number of mappings we share with userspace, including
the main rings, provided buffer rings and at least a couple more types.
All of them duplicate some of the code for page pinning, mmap'ing and
attempts to optimise it with huge pages.

Introduce the notion of regions. For userspace it's just a new
structure called struct io_uring_region_desc, which is supposed to
parameterise all such mapping / queue creations. It either represents
user-provided memory, in which case the user_addr field should point to
it, or a request for the kernel to create the memory, in which case the
user is supposed to mmap it afterwards using the offset returned in the
mmap_offset field. With a uniform userspace API we can avoid additional
boilerplate code, and when we add an optimisation it will apply to all
mapping types.

Internally, there is a new structure, struct io_mapped_region, holding
all relevant runtime information, plus some helpers to work with it.
This patch limits it to user-provided regions; that will be extended in
follow-up work.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h |  6 ++++
 include/uapi/linux/io_uring.h  | 13 +++++++
 io_uring/memmap.c              | 65 ++++++++++++++++++++++++++++++++++
 io_uring/memmap.h              | 14 ++++++++
 4 files changed, 98 insertions(+)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 52a5da99a205..1d3a37234ace 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -75,6 +75,12 @@ struct io_hash_table {
 	unsigned		hash_bits;
 };
 
+struct io_mapped_region {
+	struct page		**pages;
+	void			*vmap_ptr;
+	size_t			nr_pages;
+};
+
 /*
  * Arbitrary limit, can be raised if need be
  */
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 132f5db3d4e8..7ceeccbbf4cb 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -647,6 +647,19 @@ struct io_uring_files_update {
 	__aligned_u64 /* __s32 * */ fds;
 };
 
+enum {
+	/* initialise with user memory pointed by user_addr */
+	IORING_REGION_USER_MEM			= 1,
+};
+
+struct io_uring_region_desc {
+	__u64 user_addr;
+	__u64 size;
+	__u64 flags;
+	__u64 mmap_offset;
+	__u64 __resv[4];
+};
+
 /*
  * Register a fully sparse file space, rather than pass in an array of all
  * -1 file descriptors.
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 6ab59c60dfd0..6b03f5641ef3 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -12,6 +12,7 @@
 
 #include "memmap.h"
 #include "kbuf.h"
+#include "rsrc.h"
 
 static void *io_mem_alloc_compound(struct page **pages, int nr_pages,
 				   size_t size, gfp_t gfp)
@@ -194,6 +195,70 @@ void *__io_uaddr_map(struct page ***pages, unsigned short *npages,
 	return ERR_PTR(-ENOMEM);
 }
 
+void io_free_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr)
+{
+	if (mr->pages)
+		unpin_user_pages(mr->pages, mr->nr_pages);
+	if (mr->vmap_ptr)
+		vunmap(mr->vmap_ptr);
+	if (mr->nr_pages && ctx->user)
+		__io_unaccount_mem(ctx->user, mr->nr_pages);
+
+	memset(mr, 0, sizeof(*mr));
+}
+
+int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
+		     struct io_uring_region_desc *reg)
+{
+	int pages_accounted = 0;
+	struct page **pages;
+	int nr_pages, ret;
+	void *vptr;
+	u64 end;
+
+	if (WARN_ON_ONCE(mr->pages || mr->vmap_ptr || mr->nr_pages))
+		return -EFAULT;
+	if (memchr_inv(&reg->__resv, 0, sizeof(reg->__resv)))
+		return -EINVAL;
+	if (reg->flags != IORING_REGION_USER_MEM)
+		return -EINVAL;
+	if (!reg->user_addr)
+		return -EFAULT;
+	if (!reg->size || reg->mmap_offset)
+		return -EINVAL;
+	if ((reg->size >> PAGE_SHIFT) > INT_MAX)
+		return -E2BIG;
+	if ((reg->user_addr | reg->size) & ~PAGE_MASK)
+		return -EINVAL;
+	if (check_add_overflow(reg->user_addr, reg->size, &end))
+		return -EOVERFLOW;
+
+	pages = io_pin_pages(reg->user_addr, reg->size, &nr_pages);
+	if (IS_ERR(pages))
+		return PTR_ERR(pages);
+
+	if (ctx->user) {
+		ret = __io_account_mem(ctx->user, nr_pages);
+		if (ret)
+			goto out_free;
+		pages_accounted = nr_pages;
+	}
+
+	vptr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
+	if (!vptr) {
+		ret = -ENOMEM;
+		goto out_free;
+	}
+
+	mr->pages = pages;
+	mr->vmap_ptr = vptr;
+	mr->nr_pages = nr_pages;
+	return 0;
+out_free:
+	if (pages_accounted)
+		__io_unaccount_mem(ctx->user, pages_accounted);
+	io_pages_free(&pages, nr_pages);
+	return ret;
+}
+
 static void *io_uring_validate_mmap_request(struct file *file, loff_t pgoff,
 					    size_t sz)
 {
diff --git a/io_uring/memmap.h b/io_uring/memmap.h
index 5cec5b7ac49a..f361a635b6c7 100644
--- a/io_uring/memmap.h
+++ b/io_uring/memmap.h
@@ -22,4 +22,18 @@ unsigned long io_uring_get_unmapped_area(struct file *file, unsigned long addr,
 					 unsigned long flags);
 int io_uring_mmap(struct file *file, struct vm_area_struct *vma);
 
+void io_free_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr);
+int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
+		     struct io_uring_region_desc *reg);
+
+static inline void *io_region_get_ptr(struct io_mapped_region *mr)
+{
+	return mr->vmap_ptr;
+}
+
+static inline bool io_region_is_set(struct io_mapped_region *mr)
+{
+	return !!mr->nr_pages;
+}
+
 #endif
-- 
2.46.0
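
From the userspace side, preparing a descriptor for a user-provided
region might look like the sketch below. It only fills in struct
io_uring_region_desc as defined by this patch; handing it to the kernel
is the subject of the next patch, and the helper name is made up for
illustration.

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <linux/io_uring.h>	/* headers with this series applied */

static int setup_user_region_desc(struct io_uring_region_desc *rd, size_t size)
{
	/*
	 * User memory must be page aligned and size must be a multiple
	 * of the page size; an anonymous mmap() of a page-sized length
	 * satisfies both.
	 */
	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

	if (mem == MAP_FAILED)
		return -1;

	memset(rd, 0, sizeof(*rd));	/* mmap_offset and __resv must be zero */
	rd->user_addr = (uintptr_t)mem;
	rd->size = size;
	rd->flags = IORING_REGION_USER_MEM;
	return 0;
}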


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 5/6] io_uring: add parameter region registration
  2024-11-14  4:14 [PATCH 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
                   ` (3 preceding siblings ...)
  2024-11-14  4:14 ` [PATCH 4/6] io_uring: introduce memory regions Pavel Begunkov
@ 2024-11-14  4:14 ` Pavel Begunkov
  2024-11-14  4:14 ` [PATCH 6/6] io_uring: enable IORING_ENTER_EXT_ARG_REG back Pavel Begunkov
  5 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-14  4:14 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

Allow the user to pre-register a region for passing various parameters.

To use it for passing wait loop arguments, which is wired up in the
following commit, the region has to be registered with the
IORING_PARAM_REGION_WAIT_ARG flag set. That flag also requires the
context to be currently disabled, i.e. IORING_SETUP_R_DISABLED, to
avoid races with potentially running waiters.

This will also be useful in the future for various request / SQE
arguments like iovec, the meta read/write API, and for BPF.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h |  6 ++++
 include/uapi/linux/io_uring.h  | 13 ++++++++
 io_uring/io_uring.c            |  1 +
 io_uring/register.c            | 59 ++++++++++++++++++++++++++++++++++
 4 files changed, 79 insertions(+)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 1d3a37234ace..aa5f5ea98076 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -324,6 +324,9 @@ struct io_ring_ctx {
 		unsigned		cq_entries;
 		struct io_ev_fd	__rcu	*io_ev_fd;
 		unsigned		cq_extra;
+
+		void			*cq_wait_arg;
+		size_t			cq_wait_size;
 	} ____cacheline_aligned_in_smp;
 
 	/*
@@ -429,6 +432,9 @@ struct io_ring_ctx {
 	unsigned short			n_sqe_pages;
 	struct page			**ring_pages;
 	struct page			**sqe_pages;
+
+	/* used for optimised request parameter and wait argument passing  */
+	struct io_mapped_region		param_region;
 };
 
 struct io_tw_state {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 7ceeccbbf4cb..49b94029c137 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -627,6 +627,8 @@ enum io_uring_register_op {
 	/* resize CQ ring */
 	IORING_REGISTER_RESIZE_RINGS		= 33,
 
+	IORING_REGISTER_PARAM_REGION		= 34,
+
 	/* this goes last */
 	IORING_REGISTER_LAST,
 
@@ -660,6 +662,17 @@ struct io_uring_region_desc {
 	__u64 __resv[4];
 };
 
+enum {
+	/* expose the region as registered wait arguments */
+	IORING_PARAM_REGION_WAIT_ARG			= 1,
+};
+
+struct io_uring_param_region_reg {
+	__u64 region_uptr; /* struct io_uring_region_desc * */
+	__u64 flags;
+	__u64 __resv[2];
+};
+
 /*
  * Register a fully sparse file space, rather than pass in an array of all
  * -1 file descriptors.
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 286b7bb73978..c640b8a4ceee 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2709,6 +2709,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	io_alloc_cache_free(&ctx->msg_cache, io_msg_cache_free);
 	io_futex_cache_free(ctx);
 	io_destroy_buffers(ctx);
+	io_free_region(ctx, &ctx->param_region);
 	mutex_unlock(&ctx->uring_lock);
 	if (ctx->sq_creds)
 		put_cred(ctx->sq_creds);
diff --git a/io_uring/register.c b/io_uring/register.c
index 3c5a3cfb186b..d1ba14da37ea 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -570,6 +570,59 @@ static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
 	return ret;
 }
 
+/*
+ * Register a user-provided memory region as a generic parameter region.
+ * When registered with IORING_PARAM_REGION_WAIT_ARG, it holds entries of
+ * struct io_uring_reg_wait that can be used via io_uring_enter(2) when
+ * IORING_ENTER_EXT_ARG_REG is set: instead of passing in a pointer to a
+ * struct io_uring_getevents_arg, an offset into this region is passed,
+ * avoiding two (arg + timeout) copies per invocation.
+ */
+static int io_register_mapped_heap(struct io_ring_ctx *ctx, void __user *uarg)
+{
+	struct io_uring_param_region_reg __user *reg_uptr = uarg;
+	struct io_uring_param_region_reg reg;
+	struct io_uring_region_desc __user *rd_uptr;
+	struct io_uring_region_desc rd;
+	int ret;
+
+	if (io_region_is_set(&ctx->param_region))
+		return -EBUSY;
+	if (copy_from_user(&reg, reg_uptr, sizeof(reg)))
+		return -EFAULT;
+	rd_uptr = u64_to_user_ptr(reg.region_uptr);
+	if (copy_from_user(&rd, rd_uptr, sizeof(rd)))
+		return -EFAULT;
+
+	if (memchr_inv(&reg.__resv, 0, sizeof(reg.__resv)))
+		return -EINVAL;
+	if (reg.flags != IORING_PARAM_REGION_WAIT_ARG)
+		return -EINVAL;
+
+	/*
+	 * This ensures there are no waiters. Waiters are unlocked and it's
+	 * hard to synchronise with them, especially if we need to initialise
+	 * the region.
+	 */
+	if ((reg.flags & IORING_PARAM_REGION_WAIT_ARG) &&
+	    !(ctx->flags & IORING_SETUP_R_DISABLED))
+		return -EINVAL;
+
+	ret = io_create_region(ctx, &ctx->param_region, &rd);
+	if (ret)
+		return ret;
+	if (copy_to_user(rd_uptr, &rd, sizeof(rd))) {
+		io_free_region(ctx, &ctx->param_region);
+		return -EFAULT;
+	}
+
+	if (reg.flags & IORING_PARAM_REGION_WAIT_ARG) {
+		ctx->cq_wait_arg = io_region_get_ptr(&ctx->param_region);
+		ctx->cq_wait_size = rd.size;
+	}
+	return 0;
+}
+
 static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			       void __user *arg, unsigned nr_args)
 	__releases(ctx->uring_lock)
@@ -764,6 +817,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_register_resize_rings(ctx, arg);
 		break;
+	case IORING_REGISTER_PARAM_REGION:
+		ret = -EINVAL;
+		if (!arg || nr_args != 1)
+			break;
+		ret = io_register_mapped_heap(ctx, arg);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
-- 
2.46.0
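
For completeness, a userspace sketch of the registration call, assuming
a ring created with IORING_SETUP_R_DISABLED and a descriptor prepared
as in the note under patch 4 (there is no liburing wrapper yet, so this
goes through the raw register syscall; the helper name is illustrative):

#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>	/* headers with this series applied */

static int register_wait_arg_region(int ring_fd, struct io_uring_region_desc *rd)
{
	struct io_uring_param_region_reg pr;

	memset(&pr, 0, sizeof(pr));		/* __resv must be zero */
	pr.region_uptr = (uintptr_t)rd;
	pr.flags = IORING_PARAM_REGION_WAIT_ARG;

	/* nr_args must be 1 for IORING_REGISTER_PARAM_REGION */
	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_PARAM_REGION, &pr, 1);
}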


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 6/6] io_uring: enable IORING_ENTER_EXT_ARG_REG back
  2024-11-14  4:14 [PATCH 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
                   ` (4 preceding siblings ...)
  2024-11-14  4:14 ` [PATCH 5/6] io_uring: add parameter region registration Pavel Begunkov
@ 2024-11-14  4:14 ` Pavel Begunkov
  5 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-14  4:14 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

We can now re-enable IORING_ENTER_EXT_ARG_REG, as it is backed by
parameter regions.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index c640b8a4ceee..8b4b866b7761 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3195,7 +3195,16 @@ void __io_uring_cancel(bool cancel_all)
 static struct io_uring_reg_wait *io_get_ext_arg_reg(struct io_ring_ctx *ctx,
 			const struct io_uring_getevents_arg __user *uarg)
 {
-	return ERR_PTR(-EFAULT);
+	unsigned long size = sizeof(struct io_uring_reg_wait);
+	unsigned long offset = (uintptr_t)uarg;
+	unsigned long end;
+
+	/* also protects from NULL ->cq_wait_arg as the size would be 0 */
+	if (unlikely(check_add_overflow(offset, size, &end) ||
+		     end > ctx->cq_wait_size))
+		return ERR_PTR(-EFAULT);
+
+	return ctx->cq_wait_arg + offset;
 }
 
 static int io_validate_ext_arg(struct io_ring_ctx *ctx, unsigned flags,
-- 
2.46.0
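
A matching wait-side sketch, assuming the region registered above backs
the wait arguments and that struct io_uring_reg_wait has the layout from
the earlier registered-wait series (again using the raw syscall, since
liburing support only exists in the dirty branch from the cover letter):

#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>	/* headers with this series applied */

/*
 * The application has filled in a struct io_uring_reg_wait (timeout,
 * sigmask, etc.) at byte offset 0 of the registered region; the offset
 * is passed where the getevents arg pointer normally goes.
 */
static int wait_with_reg_arg(int ring_fd, unsigned int min_complete)
{
	unsigned long offset = 0;

	return syscall(__NR_io_uring_enter, ring_fd, 0, min_complete,
		       IORING_ENTER_GETEVENTS | IORING_ENTER_EXT_ARG |
		       IORING_ENTER_EXT_ARG_REG,
		       (void *)offset, sizeof(struct io_uring_reg_wait));
}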


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 4/6] io_uring: introduce memory regions
  2024-11-14  4:14 ` [PATCH 4/6] io_uring: introduce memory regions Pavel Begunkov
@ 2024-11-15 14:44   ` Jens Axboe
  2024-11-15 15:54     ` Pavel Begunkov
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2024-11-15 14:44 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 11/13/24 9:14 PM, Pavel Begunkov wrote:
> +void io_free_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr)
> +{
> +	if (mr->pages)
> +		unpin_user_pages(mr->pages, mr->nr_pages);
> +	if (mr->vmap_ptr)
> +		vunmap(mr->vmap_ptr);
> +	if (mr->nr_pages && ctx->user)
> +		__io_unaccount_mem(ctx->user, mr->nr_pages);
> +
> +	memset(mr, 0, sizeof(*mr));
> +}

This is missing a kvfree(mr->pages);

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 4/6] io_uring: introduce memory regions
  2024-11-15 14:44   ` Jens Axboe
@ 2024-11-15 15:54     ` Pavel Begunkov
  0 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-15 15:54 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 11/15/24 14:44, Jens Axboe wrote:
> On 11/13/24 9:14 PM, Pavel Begunkov wrote:
>> +void io_free_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr)
>> +{
>> +	if (mr->pages)
>> +		unpin_user_pages(mr->pages, mr->nr_pages);
>> +	if (mr->vmap_ptr)
>> +		vunmap(mr->vmap_ptr);
>> +	if (mr->nr_pages && ctx->user)
>> +		__io_unaccount_mem(ctx->user, mr->nr_pages);
>> +
>> +	memset(mr, 0, sizeof(*mr));
>> +}
> 
> This is missing a kvfree(mr->pages);

Indeed, I'll fix it.

FWIW, this is v1 instead of v2 (which also has the same problem).

-- 
Pavel Begunkov
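
For reference, a sketch of how the freeing path might look with the
missing kvfree() folded in (not the final committed version):

void io_free_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr)
{
	if (mr->pages) {
		unpin_user_pages(mr->pages, mr->nr_pages);
		/* the pages array itself came from kvmalloc_array() */
		kvfree(mr->pages);
	}
	if (mr->vmap_ptr)
		vunmap(mr->vmap_ptr);
	if (mr->nr_pages && ctx->user)
		__io_unaccount_mem(ctx->user, mr->nr_pages);

	memset(mr, 0, sizeof(*mr));
}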

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-11-15 15:53 UTC | newest]

Thread overview: 9+ messages
-- links below jump to the message on this page --
2024-11-14  4:14 [PATCH 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
2024-11-14  4:14 ` [PATCH 1/6] io_uring: fortify io_pin_pages with a warning Pavel Begunkov
2024-11-14  4:14 ` [PATCH 2/6] io_uring: disable ENTER_EXT_ARG_REG for IOPOLL Pavel Begunkov
2024-11-14  4:14 ` [PATCH 3/6] io_uring: temporarily disable registered waits Pavel Begunkov
2024-11-14  4:14 ` [PATCH 4/6] io_uring: introduce memory regions Pavel Begunkov
2024-11-15 14:44   ` Jens Axboe
2024-11-15 15:54     ` Pavel Begunkov
2024-11-14  4:14 ` [PATCH 5/6] io_uring: add parameter region registration Pavel Begunkov
2024-11-14  4:14 ` [PATCH 6/6] io_uring: enable IORING_ENTER_EXT_ARG_REG back Pavel Begunkov
