* [PATCH v3 0/6] regions, param pre-mapping and reg waits extension
@ 2024-11-15 16:54 Pavel Begunkov
2024-11-15 16:54 ` [PATCH v3 1/6] io_uring: fortify io_pin_pages with a warning Pavel Begunkov
` (6 more replies)
0 siblings, 7 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-15 16:54 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
A bit late, but first we need a better and more generic API for
ring/memory/region registration (see Patch 4), and it changes the API
by extending registered waits into a generic parameter passing mechanism.
That will be useful in the future for implementing more flexible ring
creation, especially when we want to share the same huge page / mapping.
Patch 6 uses it for registered wait arguments, and it can also be
used to optimise parameter passing for normal io_uring requests.
A dirty liburing branch with tests:
https://github.com/isilence/liburing/tree/io-uring-region-test
v3: fix page array memleak (Patch 4)
v2: cleaned up naming and commit messages
moved all EXT_ARG_REG related bits Patch 5 -> 6
added alignment checks (Patch 6)
Pavel Begunkov (6):
io_uring: fortify io_pin_pages with a warning
io_uring: disable ENTER_EXT_ARG_REG for IOPOLL
io_uring: temporarily disable registered waits
io_uring: introduce concept of memory regions
io_uring: add memory region registration
io_uring: restore back registered wait arguments
include/linux/io_uring_types.h | 20 +++----
include/uapi/linux/io_uring.h | 28 +++++++++-
io_uring/io_uring.c | 27 +++++-----
io_uring/memmap.c | 69 ++++++++++++++++++++++++
io_uring/memmap.h | 14 +++++
io_uring/register.c | 97 ++++++++++++----------------------
io_uring/register.h | 1 -
7 files changed, 166 insertions(+), 90 deletions(-)
--
2.46.0
* [PATCH v3 1/6] io_uring: fortify io_pin_pages with a warning
2024-11-15 16:54 [PATCH v3 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
@ 2024-11-15 16:54 ` Pavel Begunkov
2024-11-15 16:54 ` [PATCH v3 2/6] io_uring: disable ENTER_EXT_ARG_REG for IOPOLL Pavel Begunkov
` (5 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-15 16:54 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
We're a bit too frivolous with the types of nr_pages arguments,
converting them to long and back to int, passing an unsigned int pointer
as an int pointer, and so on. It shouldn't cause any problems, but it
should be carefully reviewed; until then, let's add a WARN_ON_ONCE check
to be more confident that callers don't pass poorly checked arguments.
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/memmap.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 85c66fa54956..6ab59c60dfd0 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -140,6 +140,8 @@ struct page **io_pin_pages(unsigned long uaddr, unsigned long len, int *npages)
nr_pages = end - start;
if (WARN_ON_ONCE(!nr_pages))
return ERR_PTR(-EINVAL);
+ if (WARN_ON_ONCE(nr_pages > INT_MAX))
+ return ERR_PTR(-EOVERFLOW);
pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
if (!pages)
--
2.46.0
* [PATCH v3 2/6] io_uring: disable ENTER_EXT_ARG_REG for IOPOLL
2024-11-15 16:54 [PATCH v3 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
2024-11-15 16:54 ` [PATCH v3 1/6] io_uring: fortify io_pin_pages with a warning Pavel Begunkov
@ 2024-11-15 16:54 ` Pavel Begunkov
2024-11-15 16:54 ` [PATCH v3 3/6] io_uring: temporarily disable registered waits Pavel Begunkov
` (4 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-15 16:54 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
IOPOLL doesn't use the extended arguments, so there is no need for it to
support IORING_ENTER_EXT_ARG_REG. Let's disable it for IOPOLL; if
anything, it leaves more space for future extensions.
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index bd71782057de..464a70bde7e6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3214,12 +3214,8 @@ static int io_validate_ext_arg(struct io_ring_ctx *ctx, unsigned flags,
if (!(flags & IORING_ENTER_EXT_ARG))
return 0;
-
- if (flags & IORING_ENTER_EXT_ARG_REG) {
- if (argsz != sizeof(struct io_uring_reg_wait))
- return -EINVAL;
- return PTR_ERR(io_get_ext_arg_reg(ctx, argp));
- }
+ if (flags & IORING_ENTER_EXT_ARG_REG)
+ return -EINVAL;
if (argsz != sizeof(arg))
return -EINVAL;
if (copy_from_user(&arg, argp, sizeof(arg)))
--
2.46.0
* [PATCH v3 3/6] io_uring: temporarily disable registered waits
2024-11-15 16:54 [PATCH v3 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
2024-11-15 16:54 ` [PATCH v3 1/6] io_uring: fortify io_pin_pages with a warning Pavel Begunkov
2024-11-15 16:54 ` [PATCH v3 2/6] io_uring: disable ENTER_EXT_ARG_REG for IOPOLL Pavel Begunkov
@ 2024-11-15 16:54 ` Pavel Begunkov
2024-11-15 16:54 ` [PATCH v3 4/6] io_uring: introduce concept of memory regions Pavel Begunkov
` (3 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-15 16:54 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
Disable wait argument registration as it'll be replaced with a more
generic feature. We'll still need IORING_ENTER_EXT_ARG_REG parsing
in a few commits, so leave that in place.
Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 10 -----
include/uapi/linux/io_uring.h | 3 --
io_uring/io_uring.c | 10 -----
io_uring/register.c | 82 ----------------------------------
io_uring/register.h | 1 -
5 files changed, 106 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 072e65e93105..52a5da99a205 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -330,14 +330,6 @@ struct io_ring_ctx {
atomic_t cq_wait_nr;
atomic_t cq_timeouts;
struct wait_queue_head cq_wait;
-
- /*
- * If registered with IORING_REGISTER_CQWAIT_REG, a single
- * page holds N entries, mapped in cq_wait_arg. cq_wait_index
- * is the maximum allowable index.
- */
- struct io_uring_reg_wait *cq_wait_arg;
- unsigned char cq_wait_index;
} ____cacheline_aligned_in_smp;
/* timeouts */
@@ -431,8 +423,6 @@ struct io_ring_ctx {
unsigned short n_sqe_pages;
struct page **ring_pages;
struct page **sqe_pages;
-
- struct page **cq_wait_page;
};
struct io_tw_state {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 5d08435b95a8..132f5db3d4e8 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -627,9 +627,6 @@ enum io_uring_register_op {
/* resize CQ ring */
IORING_REGISTER_RESIZE_RINGS = 33,
- /* register fixed io_uring_reg_wait arguments */
- IORING_REGISTER_CQWAIT_REG = 34,
-
/* this goes last */
IORING_REGISTER_LAST,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 464a70bde7e6..286b7bb73978 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2709,7 +2709,6 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
io_alloc_cache_free(&ctx->msg_cache, io_msg_cache_free);
io_futex_cache_free(ctx);
io_destroy_buffers(ctx);
- io_unregister_cqwait_reg(ctx);
mutex_unlock(&ctx->uring_lock);
if (ctx->sq_creds)
put_cred(ctx->sq_creds);
@@ -3195,15 +3194,6 @@ void __io_uring_cancel(bool cancel_all)
static struct io_uring_reg_wait *io_get_ext_arg_reg(struct io_ring_ctx *ctx,
const struct io_uring_getevents_arg __user *uarg)
{
- struct io_uring_reg_wait *arg = READ_ONCE(ctx->cq_wait_arg);
-
- if (arg) {
- unsigned int index = (unsigned int) (uintptr_t) uarg;
-
- if (index <= ctx->cq_wait_index)
- return arg + index;
- }
-
return ERR_PTR(-EFAULT);
}
diff --git a/io_uring/register.c b/io_uring/register.c
index 45edfc57963a..3c5a3cfb186b 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -570,82 +570,6 @@ static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
return ret;
}
-void io_unregister_cqwait_reg(struct io_ring_ctx *ctx)
-{
- unsigned short npages = 1;
-
- if (!ctx->cq_wait_page)
- return;
-
- io_pages_unmap(ctx->cq_wait_arg, &ctx->cq_wait_page, &npages, true);
- ctx->cq_wait_arg = NULL;
- if (ctx->user)
- __io_unaccount_mem(ctx->user, 1);
-}
-
-/*
- * Register a page holding N entries of struct io_uring_reg_wait, which can
- * be used via io_uring_enter(2) if IORING_GETEVENTS_EXT_ARG_REG is set.
- * If that is set with IORING_GETEVENTS_EXT_ARG, then instead of passing
- * in a pointer for a struct io_uring_getevents_arg, an index into this
- * registered array is passed, avoiding two (arg + timeout) copies per
- * invocation.
- */
-static int io_register_cqwait_reg(struct io_ring_ctx *ctx, void __user *uarg)
-{
- struct io_uring_cqwait_reg_arg arg;
- struct io_uring_reg_wait *reg;
- struct page **pages;
- unsigned long len;
- int nr_pages, poff;
- int ret;
-
- if (ctx->cq_wait_page || ctx->cq_wait_arg)
- return -EBUSY;
- if (copy_from_user(&arg, uarg, sizeof(arg)))
- return -EFAULT;
- if (!arg.nr_entries || arg.flags)
- return -EINVAL;
- if (arg.struct_size != sizeof(*reg))
- return -EINVAL;
- if (check_mul_overflow(arg.struct_size, arg.nr_entries, &len))
- return -EOVERFLOW;
- if (len > PAGE_SIZE)
- return -EINVAL;
- /* offset + len must fit within a page, and must be reg_wait aligned */
- poff = arg.user_addr & ~PAGE_MASK;
- if (len + poff > PAGE_SIZE)
- return -EINVAL;
- if (poff % arg.struct_size)
- return -EINVAL;
-
- pages = io_pin_pages(arg.user_addr, len, &nr_pages);
- if (IS_ERR(pages))
- return PTR_ERR(pages);
- ret = -EINVAL;
- if (nr_pages != 1)
- goto out_free;
- if (ctx->user) {
- ret = __io_account_mem(ctx->user, 1);
- if (ret)
- goto out_free;
- }
-
- reg = vmap(pages, 1, VM_MAP, PAGE_KERNEL);
- if (reg) {
- ctx->cq_wait_index = arg.nr_entries - 1;
- WRITE_ONCE(ctx->cq_wait_page, pages);
- WRITE_ONCE(ctx->cq_wait_arg, (void *) reg + poff);
- return 0;
- }
- ret = -ENOMEM;
- if (ctx->user)
- __io_unaccount_mem(ctx->user, 1);
-out_free:
- io_pages_free(&pages, nr_pages);
- return ret;
-}
-
static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
void __user *arg, unsigned nr_args)
__releases(ctx->uring_lock)
@@ -840,12 +764,6 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_register_resize_rings(ctx, arg);
break;
- case IORING_REGISTER_CQWAIT_REG:
- ret = -EINVAL;
- if (!arg || nr_args != 1)
- break;
- ret = io_register_cqwait_reg(ctx, arg);
- break;
default:
ret = -EINVAL;
break;
diff --git a/io_uring/register.h b/io_uring/register.h
index 3e935e8fa4b2..a5f39d5ef9e0 100644
--- a/io_uring/register.h
+++ b/io_uring/register.h
@@ -5,6 +5,5 @@
int io_eventfd_unregister(struct io_ring_ctx *ctx);
int io_unregister_personality(struct io_ring_ctx *ctx, unsigned id);
struct file *io_uring_register_get_file(unsigned int fd, bool registered);
-void io_unregister_cqwait_reg(struct io_ring_ctx *ctx);
#endif
--
2.46.0
* [PATCH v3 4/6] io_uring: introduce concept of memory regions
2024-11-15 16:54 [PATCH v3 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
` (2 preceding siblings ...)
2024-11-15 16:54 ` [PATCH v3 3/6] io_uring: temporarily disable registered waits Pavel Begunkov
@ 2024-11-15 16:54 ` Pavel Begunkov
2024-11-15 16:54 ` [PATCH v3 5/6] io_uring: add memory region registration Pavel Begunkov
` (2 subsequent siblings)
6 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-15 16:54 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
We've got a good number of mappings we share with userspace, including
the main rings, provided buffer rings, upcoming rings for zerocopy rx,
and more. All of them duplicate user argument parsing and some internal
details as well (page pinning, huge page optimisations, mmap'ing, etc.)
Introduce a notion of regions. For userspace, for now, it's just a new
structure called struct io_uring_region_desc which is supposed to
parameterise all such mapping / queue creations. A region either
represents a user provided chunk of memory, in which case the user_addr
field should point to it, or a request for the kernel to allocate the
memory, in which case the user would need to mmap it afterwards using
the offset returned in the mmap_offset field. With a uniform userspace
API we can avoid additional boilerplate code and apply future
optimisations to all of them at once.
Internally, there is a new structure struct io_mapped_region holding all
relevant runtime information and some helpers to work with it. This
patch limits it to user provided regions.
Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 6 +++
include/uapi/linux/io_uring.h | 14 +++++++
io_uring/memmap.c | 67 ++++++++++++++++++++++++++++++++++
io_uring/memmap.h | 14 +++++++
4 files changed, 101 insertions(+)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 52a5da99a205..1d3a37234ace 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -75,6 +75,12 @@ struct io_hash_table {
unsigned hash_bits;
};
+struct io_mapped_region {
+ struct page **pages;
+ void *vmap_ptr;
+ size_t nr_pages;
+};
+
/*
* Arbitrary limit, can be raised if need be
*/
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 132f5db3d4e8..5cbfd330c688 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -647,6 +647,20 @@ struct io_uring_files_update {
__aligned_u64 /* __s32 * */ fds;
};
+enum {
+ /* initialise with user provided memory pointed by user_addr */
+ IORING_MEM_REGION_TYPE_USER = 1,
+};
+
+struct io_uring_region_desc {
+ __u64 user_addr;
+ __u64 size;
+ __u32 flags;
+ __u32 id;
+ __u64 mmap_offset;
+ __u64 __resv[4];
+};
+
/*
* Register a fully sparse file space, rather than pass in an array of all
* -1 file descriptors.
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 6ab59c60dfd0..bbd9569a0120 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -12,6 +12,7 @@
#include "memmap.h"
#include "kbuf.h"
+#include "rsrc.h"
static void *io_mem_alloc_compound(struct page **pages, int nr_pages,
size_t size, gfp_t gfp)
@@ -194,6 +195,72 @@ void *__io_uaddr_map(struct page ***pages, unsigned short *npages,
return ERR_PTR(-ENOMEM);
}
+void io_free_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr)
+{
+ if (mr->pages) {
+ unpin_user_pages(mr->pages, mr->nr_pages);
+ kvfree(mr->pages);
+ }
+ if (mr->vmap_ptr)
+ vunmap(mr->vmap_ptr);
+ if (mr->nr_pages && ctx->user)
+ __io_unaccount_mem(ctx->user, mr->nr_pages);
+
+ memset(mr, 0, sizeof(*mr));
+}
+
+int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
+ struct io_uring_region_desc *reg)
+{
+ int pages_accounted = 0;
+ struct page **pages;
+ int nr_pages, ret;
+ void *vptr;
+ u64 end;
+
+ if (WARN_ON_ONCE(mr->pages || mr->vmap_ptr || mr->nr_pages))
+ return -EFAULT;
+ if (memchr_inv(&reg->__resv, 0, sizeof(reg->__resv)))
+ return -EINVAL;
+ if (reg->flags != IORING_MEM_REGION_TYPE_USER)
+ return -EINVAL;
+ if (!reg->user_addr)
+ return -EFAULT;
+ if (!reg->size || reg->mmap_offset || reg->id)
+ return -EINVAL;
+ if ((reg->size >> PAGE_SHIFT) > INT_MAX)
+ return -E2BIG;
+ if ((reg->user_addr | reg->size) & ~PAGE_MASK)
+ return -EINVAL;
+ if (check_add_overflow(reg->user_addr, reg->size, &end))
+ return -EOVERFLOW;
+
+ pages = io_pin_pages(reg->user_addr, reg->size, &nr_pages);
+ if (IS_ERR(pages))
+ return PTR_ERR(pages);
+
+ if (ctx->user) {
+ ret = __io_account_mem(ctx->user, nr_pages);
+ if (ret)
+ goto out_free;
+ pages_accounted = nr_pages;
+ }
+
+ vptr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
+ if (!vptr) {
+ ret = -ENOMEM;
+ goto out_free;
+ }
+
+ mr->pages = pages;
+ mr->vmap_ptr = vptr;
+ mr->nr_pages = nr_pages;
+ return 0;
+out_free:
+ if (pages_accounted)
+ __io_unaccount_mem(ctx->user, pages_accounted);
+ io_pages_free(&pages, nr_pages);
+ return ret;
+}
+
static void *io_uring_validate_mmap_request(struct file *file, loff_t pgoff,
size_t sz)
{
diff --git a/io_uring/memmap.h b/io_uring/memmap.h
index 5cec5b7ac49a..f361a635b6c7 100644
--- a/io_uring/memmap.h
+++ b/io_uring/memmap.h
@@ -22,4 +22,18 @@ unsigned long io_uring_get_unmapped_area(struct file *file, unsigned long addr,
unsigned long flags);
int io_uring_mmap(struct file *file, struct vm_area_struct *vma);
+void io_free_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr);
+int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
+ struct io_uring_region_desc *reg);
+
+static inline void *io_region_get_ptr(struct io_mapped_region *mr)
+{
+ return mr->vmap_ptr;
+}
+
+static inline bool io_region_is_set(struct io_mapped_region *mr)
+{
+ return !!mr->nr_pages;
+}
+
#endif
--
2.46.0
* [PATCH v3 5/6] io_uring: add memory region registration
2024-11-15 16:54 [PATCH v3 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
` (3 preceding siblings ...)
2024-11-15 16:54 ` [PATCH v3 4/6] io_uring: introduce concept of memory regions Pavel Begunkov
@ 2024-11-15 16:54 ` Pavel Begunkov
2024-11-15 16:54 ` [PATCH v3 6/6] io_uring: restore back registered wait arguments Pavel Begunkov
2024-11-15 17:30 ` [PATCH v3 0/6] regions, param pre-mapping and reg waits extension Jens Axboe
6 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-15 16:54 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
Regions will serve multiple purposes. First, with them we can decouple
ring/etc. object creation from the registration / mapping of the memory
they will be placed in. We already have hacks that allow putting both SQ
and CQ into the same huge page; in the future we should be able to:
region = create_region(io_uring);
create_pbuf_ring(io_uring, region, offset=0);
create_pbuf_ring(io_uring, region, offset=N);
The second use case is efficiently passing parameters. The following
patch re-enables IORING_ENTER_EXT_ARG_REG on top of regions, which
optimises wait arguments. It'll also be useful for request arguments
replacing iovecs, msghdr, etc. pointers. Eventually it would be handy
for BPF as well, if it comes to fruition.
Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 3 +++
include/uapi/linux/io_uring.h | 8 ++++++++
io_uring/io_uring.c | 1 +
io_uring/register.c | 37 ++++++++++++++++++++++++++++++++++
4 files changed, 49 insertions(+)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 1d3a37234ace..e1d69123e164 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -429,6 +429,9 @@ struct io_ring_ctx {
unsigned short n_sqe_pages;
struct page **ring_pages;
struct page **sqe_pages;
+
+ /* used for optimised request parameter and wait argument passing */
+ struct io_mapped_region param_region;
};
struct io_tw_state {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 5cbfd330c688..1ee35890125b 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -627,6 +627,8 @@ enum io_uring_register_op {
/* resize CQ ring */
IORING_REGISTER_RESIZE_RINGS = 33,
+ IORING_REGISTER_MEM_REGION = 34,
+
/* this goes last */
IORING_REGISTER_LAST,
@@ -661,6 +663,12 @@ struct io_uring_region_desc {
__u64 __resv[4];
};
+struct io_uring_mem_region_reg {
+ __u64 region_uptr; /* struct io_uring_region_desc * */
+ __u64 flags;
+ __u64 __resv[2];
+};
+
/*
* Register a fully sparse file space, rather than pass in an array of all
* -1 file descriptors.
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 286b7bb73978..c640b8a4ceee 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2709,6 +2709,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
io_alloc_cache_free(&ctx->msg_cache, io_msg_cache_free);
io_futex_cache_free(ctx);
io_destroy_buffers(ctx);
+ io_free_region(ctx, &ctx->param_region);
mutex_unlock(&ctx->uring_lock);
if (ctx->sq_creds)
put_cred(ctx->sq_creds);
diff --git a/io_uring/register.c b/io_uring/register.c
index 3c5a3cfb186b..2cbac3d9b288 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -570,6 +570,37 @@ static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
return ret;
}
+static int io_register_mem_region(struct io_ring_ctx *ctx, void __user *uarg)
+{
+ struct io_uring_mem_region_reg __user *reg_uptr = uarg;
+ struct io_uring_mem_region_reg reg;
+ struct io_uring_region_desc __user *rd_uptr;
+ struct io_uring_region_desc rd;
+ int ret;
+
+ if (io_region_is_set(&ctx->param_region))
+ return -EBUSY;
+ if (copy_from_user(&reg, reg_uptr, sizeof(reg)))
+ return -EFAULT;
+ rd_uptr = u64_to_user_ptr(reg.region_uptr);
+ if (copy_from_user(&rd, rd_uptr, sizeof(rd)))
+ return -EFAULT;
+
+ if (memchr_inv(&reg.__resv, 0, sizeof(reg.__resv)))
+ return -EINVAL;
+ if (reg.flags)
+ return -EINVAL;
+
+ ret = io_create_region(ctx, &ctx->param_region, &rd);
+ if (ret)
+ return ret;
+ if (copy_to_user(rd_uptr, &rd, sizeof(rd))) {
+ io_free_region(ctx, &ctx->param_region);
+ return -EFAULT;
+ }
+ return 0;
+}
+
static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
void __user *arg, unsigned nr_args)
__releases(ctx->uring_lock)
@@ -764,6 +795,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_register_resize_rings(ctx, arg);
break;
+ case IORING_REGISTER_MEM_REGION:
+ ret = -EINVAL;
+ if (!arg || nr_args != 1)
+ break;
+ ret = io_register_mem_region(ctx, arg);
+ break;
default:
ret = -EINVAL;
break;
--
2.46.0
* [PATCH v3 6/6] io_uring: restore back registered wait arguments
2024-11-15 16:54 [PATCH v3 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
` (4 preceding siblings ...)
2024-11-15 16:54 ` [PATCH v3 5/6] io_uring: add memory region registration Pavel Begunkov
@ 2024-11-15 16:54 ` Pavel Begunkov
2024-11-15 17:30 ` [PATCH v3 0/6] regions, param pre-mapping and reg waits extension Jens Axboe
6 siblings, 0 replies; 9+ messages in thread
From: Pavel Begunkov @ 2024-11-15 16:54 UTC (permalink / raw)
To: io-uring; +Cc: asml.silence
Now that we've got a more generic region registration API, rebase
IORING_ENTER_EXT_ARG_REG on top of it and re-enable it.
First, the user has to register a region with the
IORING_MEM_REGION_REG_WAIT_ARG flag set. It can only be done for a
ring in a disabled state, aka IORING_SETUP_R_DISABLED, to avoid races
with already running waiters. With that we should have stable constant
values for ctx->cq_wait_{size,arg} in io_get_ext_arg_reg() and hence no
READ_ONCE required.
The other API difference is that we're now passing byte offsets instead
of indexes. The user _must_ align all offsets / pointers to the native
word size; failing to do so may, but does not have to, lead to a
failure, usually returned as -EFAULT. liburing will hide these details
from users.
Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 3 +++
include/uapi/linux/io_uring.h | 5 +++++
io_uring/io_uring.c | 14 +++++++++++++-
io_uring/register.c | 16 +++++++++++++++-
4 files changed, 36 insertions(+), 2 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index e1d69123e164..aa5f5ea98076 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -324,6 +324,9 @@ struct io_ring_ctx {
unsigned cq_entries;
struct io_ev_fd __rcu *io_ev_fd;
unsigned cq_extra;
+
+ void *cq_wait_arg;
+ size_t cq_wait_size;
} ____cacheline_aligned_in_smp;
/*
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1ee35890125b..4418d0192959 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -663,6 +663,11 @@ struct io_uring_region_desc {
__u64 __resv[4];
};
+enum {
+ /* expose the region as registered wait arguments */
+ IORING_MEM_REGION_REG_WAIT_ARG = 1,
+};
+
struct io_uring_mem_region_reg {
__u64 region_uptr; /* struct io_uring_region_desc * */
__u64 flags;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index c640b8a4ceee..c93a6a9cd47e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3195,7 +3195,19 @@ void __io_uring_cancel(bool cancel_all)
static struct io_uring_reg_wait *io_get_ext_arg_reg(struct io_ring_ctx *ctx,
const struct io_uring_getevents_arg __user *uarg)
{
- return ERR_PTR(-EFAULT);
+ unsigned long size = sizeof(struct io_uring_reg_wait);
+ unsigned long offset = (uintptr_t)uarg;
+ unsigned long end;
+
+ if (unlikely(offset % sizeof(long)))
+ return ERR_PTR(-EFAULT);
+
+ /* also protects from NULL ->cq_wait_arg as the size would be 0 */
+ if (unlikely(check_add_overflow(offset, size, &end) ||
end > ctx->cq_wait_size))
+ return ERR_PTR(-EFAULT);
+
+ return ctx->cq_wait_arg + offset;
}
static int io_validate_ext_arg(struct io_ring_ctx *ctx, unsigned flags,
diff --git a/io_uring/register.c b/io_uring/register.c
index 2cbac3d9b288..1a60f4916649 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -588,7 +588,16 @@ static int io_register_mem_region(struct io_ring_ctx *ctx, void __user *uarg)
if (memchr_inv(&reg.__resv, 0, sizeof(reg.__resv)))
return -EINVAL;
- if (reg.flags)
+ if (reg.flags & ~IORING_MEM_REGION_REG_WAIT_ARG)
+ return -EINVAL;
+
+ /*
+ * This ensures there are no waiters. Waiters are unlocked and it's
+ * hard to synchronise with them, especially if we need to initialise
+ * the region.
+ */
+ if ((reg.flags & IORING_MEM_REGION_REG_WAIT_ARG) &&
+ !(ctx->flags & IORING_SETUP_R_DISABLED))
return -EINVAL;
ret = io_create_region(ctx, &ctx->param_region, &rd);
@@ -598,6 +607,11 @@ static int io_register_mem_region(struct io_ring_ctx *ctx, void __user *uarg)
io_free_region(ctx, &ctx->param_region);
return -EFAULT;
}
+
+ if (reg.flags & IORING_MEM_REGION_REG_WAIT_ARG) {
+ ctx->cq_wait_arg = io_region_get_ptr(&ctx->param_region);
+ ctx->cq_wait_size = rd.size;
+ }
return 0;
}
--
2.46.0
* Re: [PATCH v3 0/6] regions, param pre-mapping and reg waits extension
2024-11-15 16:54 [PATCH v3 0/6] regions, param pre-mapping and reg waits extension Pavel Begunkov
` (5 preceding siblings ...)
2024-11-15 16:54 ` [PATCH v3 6/6] io_uring: restore back registered wait arguments Pavel Begunkov
@ 2024-11-15 17:30 ` Jens Axboe
2024-11-15 17:31 ` Jens Axboe
6 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2024-11-15 17:30 UTC (permalink / raw)
To: io-uring, Pavel Begunkov
On Fri, 15 Nov 2024 16:54:37 +0000, Pavel Begunkov wrote:
> A bit late but first we need a better and more generic API for
> ring/memory/region registration (see Patch 4), and it changes the API
> extending registered waits to be a generic parameter passing mechanism.
> That will be useful in the future to implement a more flexible rings
> creation, especially when we want to share same huge page / mapping.
> Patch 6 uses it for registered wait arguments, and it can also be
> used to optimise parameter passing for normal io_uring requests.
>
> [...]
Applied, thanks!
[1/6] io_uring: fortify io_pin_pages with a warning
(no commit info)
[2/6] io_uring: disable ENTER_EXT_ARG_REG for IOPOLL
(no commit info)
[3/6] io_uring: temporarily disable registered waits
(no commit info)
[4/6] io_uring: introduce concept of memory regions
(no commit info)
[5/6] io_uring: add memory region registration
(no commit info)
[6/6] io_uring: restore back registered wait arguments
(no commit info)
Best regards,
--
Jens Axboe
* Re: [PATCH v3 0/6] regions, param pre-mapping and reg waits extension
2024-11-15 17:30 ` [PATCH v3 0/6] regions, param pre-mapping and reg waits extension Jens Axboe
@ 2024-11-15 17:31 ` Jens Axboe
0 siblings, 0 replies; 9+ messages in thread
From: Jens Axboe @ 2024-11-15 17:31 UTC (permalink / raw)
To: io-uring, Pavel Begunkov
On 11/15/24 10:30 AM, Jens Axboe wrote:
>
> On Fri, 15 Nov 2024 16:54:37 +0000, Pavel Begunkov wrote:
>> A bit late but first we need a better and more generic API for
>> ring/memory/region registration (see Patch 4), and it changes the API
>> extending registered waits to be a generic parameter passing mechanism.
>> That will be useful in the future to implement a more flexible rings
>> creation, especially when we want to share same huge page / mapping.
>> Patch 6 uses it for registered wait arguments, and it can also be
>> used to optimise parameter passing for normal io_uring requests.
>>
>> [...]
>
> Applied, thanks!
>
> [1/6] io_uring: fortify io_pin_pages with a warning
> (no commit info)
> [2/6] io_uring: disable ENTER_EXT_ARG_REG for IOPOLL
> (no commit info)
> [3/6] io_uring: temporarily disable registered waits
> (no commit info)
> [4/6] io_uring: introduce concept of memory regions
> (no commit info)
> [5/6] io_uring: add memory region registration
> (no commit info)
> [6/6] io_uring: restore back registered wait arguments
> (no commit info)
Manual followup - normally I would've let this simmer until the next
version, but it is kind of silly to introduce fixed waits and then be
stuck with that implementation for eternity when we could be using the
generic infrastructure. Hence it's added at this point for 6.13.
Caveat - this will break the existing registered cqwait in liburing,
but there's time to get that sorted before the next liburing release.
--
Jens Axboe