* [PATCH 01/11] io_uring: rename ->resize_lock
From: Pavel Begunkov @ 2024-11-20 23:33 UTC
To: io-uring; +Cc: asml.silence
->resize_lock is used for resizing rings, but it's a good idea to reuse
it in other cases as well. Rename it to mmap_lock, as it protects
against races with mmap.
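For reference, the mmap handlers in the hunks below take this lock via
the kernel's scope-based lock guard from <linux/cleanup.h>, so there is
no explicit unlock; a minimal sketch of the pattern:

	/* lock taken here, dropped automatically when the scope exits */
	guard(mutex)(&ctx->mmap_lock);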
Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 2 +-
io_uring/io_uring.c | 2 +-
io_uring/memmap.c | 6 +++---
io_uring/register.c | 8 ++++----
4 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index aa5f5ea98076..ac7b2b6484a9 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -422,7 +422,7 @@ struct io_ring_ctx {
* side will need to grab this lock, to prevent either side from
* being run concurrently with the other.
*/
- struct mutex resize_lock;
+ struct mutex mmap_lock;
/*
* If IORING_SETUP_NO_MMAP is used, then the below holds
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index da8fd460977b..d565b1589951 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -350,7 +350,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
INIT_HLIST_HEAD(&ctx->cancelable_uring_cmd);
io_napi_init(ctx);
- mutex_init(&ctx->resize_lock);
+ mutex_init(&ctx->mmap_lock);
return ctx;
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 3d71756bc598..771a57a4a16b 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -322,7 +322,7 @@ __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
unsigned int npages;
void *ptr;
- guard(mutex)(&ctx->resize_lock);
+ guard(mutex)(&ctx->mmap_lock);
ptr = io_uring_validate_mmap_request(file, vma->vm_pgoff, sz);
if (IS_ERR(ptr))
@@ -358,7 +358,7 @@ unsigned long io_uring_get_unmapped_area(struct file *filp, unsigned long addr,
if (addr)
return -EINVAL;
- guard(mutex)(&ctx->resize_lock);
+ guard(mutex)(&ctx->mmap_lock);
ptr = io_uring_validate_mmap_request(filp, pgoff, len);
if (IS_ERR(ptr))
@@ -408,7 +408,7 @@ unsigned long io_uring_get_unmapped_area(struct file *file, unsigned long addr,
struct io_ring_ctx *ctx = file->private_data;
void *ptr;
- guard(mutex)(&ctx->resize_lock);
+ guard(mutex)(&ctx->mmap_lock);
ptr = io_uring_validate_mmap_request(file, pgoff, len);
if (IS_ERR(ptr))
diff --git a/io_uring/register.c b/io_uring/register.c
index 1e99c783abdf..ba61697d7a53 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -486,15 +486,15 @@ static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
}
/*
- * We'll do the swap. Grab the ctx->resize_lock, which will exclude
+ * We'll do the swap. Grab the ctx->mmap_lock, which will exclude
* any new mmap's on the ring fd. Clear out existing mappings to prevent
* mmap from seeing them, as we'll unmap them. Any attempt to mmap
* existing rings beyond this point will fail. Not that it could proceed
at this point anyway, as the io_uring mmap side needs to grab the
- * ctx->resize_lock as well. Likewise, hold the completion lock over the
+ * ctx->mmap_lock as well. Likewise, hold the completion lock over the
* duration of the actual swap.
*/
- mutex_lock(&ctx->resize_lock);
+ mutex_lock(&ctx->mmap_lock);
spin_lock(&ctx->completion_lock);
o.rings = ctx->rings;
ctx->rings = NULL;
@@ -561,7 +561,7 @@ static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
ret = 0;
out:
spin_unlock(&ctx->completion_lock);
- mutex_unlock(&ctx->resize_lock);
+ mutex_unlock(&ctx->mmap_lock);
io_register_free_rings(&p, to_free);
if (ctx->sq_data)
--
2.46.0
* [PATCH 02/11] io_uring/rsrc: export io_check_coalesce_buffer
From: Pavel Begunkov @ 2024-11-20 23:33 UTC
To: io-uring; +Cc: asml.silence
io_try_coalesce_buffer() is a helper that collects useful info about a
set of pages, and I want to reuse it for analysing ring/etc. mappings. I
don't need the entire thing and am only interested in whether the pages
can be coalesced into a single page, but that's better than duplicating
the parsing.
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/rsrc.c | 22 ++++++++++++----------
io_uring/rsrc.h | 4 ++++
2 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index adaae8630932..e51e5ddae728 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -626,11 +626,12 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
return ret;
}
-static bool io_do_coalesce_buffer(struct page ***pages, int *nr_pages,
- struct io_imu_folio_data *data, int nr_folios)
+static bool io_coalesce_buffer(struct page ***pages, int *nr_pages,
+ struct io_imu_folio_data *data)
{
struct page **page_array = *pages, **new_array = NULL;
int nr_pages_left = *nr_pages, i, j;
+ int nr_folios = data->nr_folios;
/* Store head pages only*/
new_array = kvmalloc_array(nr_folios, sizeof(struct page *),
@@ -667,15 +668,14 @@ static bool io_do_coalesce_buffer(struct page ***pages, int *nr_pages,
return true;
}
-static bool io_try_coalesce_buffer(struct page ***pages, int *nr_pages,
- struct io_imu_folio_data *data)
+bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
+ struct io_imu_folio_data *data)
{
- struct page **page_array = *pages;
struct folio *folio = page_folio(page_array[0]);
unsigned int count = 1, nr_folios = 1;
int i;
- if (*nr_pages <= 1)
+ if (nr_pages <= 1)
return false;
data->nr_pages_mid = folio_nr_pages(folio);
@@ -687,7 +687,7 @@ static bool io_try_coalesce_buffer(struct page ***pages, int *nr_pages,
* Check if pages are contiguous inside a folio, and all folios have
* the same page count except for the head and tail.
*/
- for (i = 1; i < *nr_pages; i++) {
+ for (i = 1; i < nr_pages; i++) {
if (page_folio(page_array[i]) == folio &&
page_array[i] == page_array[i-1] + 1) {
count++;
@@ -715,7 +715,8 @@ static bool io_try_coalesce_buffer(struct page ***pages, int *nr_pages,
if (nr_folios == 1)
data->nr_pages_head = count;
- return io_do_coalesce_buffer(pages, nr_pages, data, nr_folios);
+ data->nr_folios = nr_folios;
+ return true;
}
static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
@@ -729,7 +730,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
size_t size;
int ret, nr_pages, i;
struct io_imu_folio_data data;
- bool coalesced;
+ bool coalesced = false;
if (!iov->iov_base)
return NULL;
@@ -749,7 +750,8 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
}
/* If it's huge page(s), try to coalesce them into fewer bvec entries */
- coalesced = io_try_coalesce_buffer(&pages, &nr_pages, &data);
+ if (io_check_coalesce_buffer(pages, nr_pages, &data))
+ coalesced = io_coalesce_buffer(&pages, &nr_pages, &data);
imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
if (!imu)
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 7a4668deaa1a..c8b093584461 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -40,6 +40,7 @@ struct io_imu_folio_data {
/* For non-head/tail folios, has to be fully included */
unsigned int nr_pages_mid;
unsigned int folio_shift;
+ unsigned int nr_folios;
};
struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type);
@@ -66,6 +67,9 @@ int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg,
int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
unsigned int size, unsigned int type);
+bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
+ struct io_imu_folio_data *data);
+
static inline struct io_rsrc_node *io_rsrc_node_lookup(struct io_rsrc_data *data,
int index)
{
--
2.46.0
* [PATCH 03/11] io_uring/memmap: add internal region flags
From: Pavel Begunkov @ 2024-11-20 23:33 UTC
To: io-uring; +Cc: asml.silence
Add internal flags to struct io_mapped_region; they will help to add
more functionality without bloating the structure. Use the first flag to
mark whether the pointer needs to be vunmap'ed.
Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 5 +++--
io_uring/memmap.c | 13 +++++++++----
io_uring/memmap.h | 2 +-
3 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index ac7b2b6484a9..31b420b8ecd9 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -77,8 +77,9 @@ struct io_hash_table {
struct io_mapped_region {
struct page **pages;
- void *vmap_ptr;
- size_t nr_pages;
+ void *ptr;
+ unsigned nr_pages;
+ unsigned flags;
};
/*
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 771a57a4a16b..21353ea09b39 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -195,14 +195,18 @@ void *__io_uaddr_map(struct page ***pages, unsigned short *npages,
return ERR_PTR(-ENOMEM);
}
+enum {
+ IO_REGION_F_VMAP = 1,
+};
+
void io_free_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr)
{
if (mr->pages) {
unpin_user_pages(mr->pages, mr->nr_pages);
kvfree(mr->pages);
}
- if (mr->vmap_ptr)
- vunmap(mr->vmap_ptr);
+ if ((mr->flags & IO_REGION_F_VMAP) && mr->ptr)
+ vunmap(mr->ptr);
if (mr->nr_pages && ctx->user)
__io_unaccount_mem(ctx->user, mr->nr_pages);
@@ -218,7 +222,7 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
void *vptr;
u64 end;
- if (WARN_ON_ONCE(mr->pages || mr->vmap_ptr || mr->nr_pages))
+ if (WARN_ON_ONCE(mr->pages || mr->ptr || mr->nr_pages))
return -EFAULT;
if (memchr_inv(&reg->__resv, 0, sizeof(reg->__resv)))
return -EINVAL;
@@ -253,8 +257,9 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
}
mr->pages = pages;
- mr->vmap_ptr = vptr;
+ mr->ptr = vptr;
mr->nr_pages = nr_pages;
+ mr->flags |= IO_REGION_F_VMAP;
return 0;
out_free:
if (pages_accounted)
diff --git a/io_uring/memmap.h b/io_uring/memmap.h
index f361a635b6c7..2096a8427277 100644
--- a/io_uring/memmap.h
+++ b/io_uring/memmap.h
@@ -28,7 +28,7 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
static inline void *io_region_get_ptr(struct io_mapped_region *mr)
{
- return mr->vmap_ptr;
+ return mr->ptr;
}
static inline bool io_region_is_set(struct io_mapped_region *mr)
--
2.46.0
* [PATCH 04/11] io_uring/memmap: flag regions with user pages
From: Pavel Begunkov @ 2024-11-20 23:33 UTC
To: io-uring; +Cc: asml.silence
In preparation for kernel allocated regions, add a flag telling whether
the region contains user pinned pages, as those must be released with
unpin_user_pages() rather than release_pages().
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/memmap.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 21353ea09b39..f76bee5a861a 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -197,12 +197,16 @@ void *__io_uaddr_map(struct page ***pages, unsigned short *npages,
enum {
IO_REGION_F_VMAP = 1,
+ IO_REGION_F_USER_PINNED = 2,
};
void io_free_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr)
{
if (mr->pages) {
- unpin_user_pages(mr->pages, mr->nr_pages);
+ if (mr->flags & IO_REGION_F_USER_PINNED)
+ unpin_user_pages(mr->pages, mr->nr_pages);
+ else
+ release_pages(mr->pages, mr->nr_pages);
kvfree(mr->pages);
}
if ((mr->flags & IO_REGION_F_VMAP) && mr->ptr)
@@ -259,7 +263,7 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
mr->pages = pages;
mr->ptr = vptr;
mr->nr_pages = nr_pages;
- mr->flags |= IO_REGION_F_VMAP;
+ mr->flags |= IO_REGION_F_VMAP | IO_REGION_F_USER_PINNED;
return 0;
out_free:
if (pages_accounted)
--
2.46.0
* [PATCH 05/11] io_uring/memmap: account memory before pinning
From: Pavel Begunkov @ 2024-11-20 23:33 UTC
To: io-uring; +Cc: asml.silence
Move memory accounting before page pinning. We shouldn't even try to pin
pages if it's not allowed, and accounting is also relatively
inexpensive. It also gives a better code structure, as we do the generic
accounting first and can then branch for different mapping types.
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/memmap.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index f76bee5a861a..cc5f6f69ee6c 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -243,17 +243,21 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
if (check_add_overflow(reg->user_addr, reg->size, &end))
return -EOVERFLOW;
- pages = io_pin_pages(reg->user_addr, reg->size, &nr_pages);
- if (IS_ERR(pages))
- return PTR_ERR(pages);
-
+ nr_pages = reg->size >> PAGE_SHIFT;
if (ctx->user) {
ret = __io_account_mem(ctx->user, nr_pages);
if (ret)
- goto out_free;
+ return ret;
pages_accounted = nr_pages;
}
+ pages = io_pin_pages(reg->user_addr, reg->size, &nr_pages);
+ if (IS_ERR(pages)) {
+ ret = PTR_ERR(pages);
+ pages = NULL;
+ goto out_free;
+ }
+
vptr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
if (!vptr) {
ret = -ENOMEM;
@@ -268,7 +272,8 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
out_free:
if (pages_accounted)
__io_unaccount_mem(ctx->user, pages_accounted);
- io_pages_free(&pages, nr_pages);
+ if (pages)
+ io_pages_free(&pages, nr_pages);
return ret;
}
--
2.46.0
* [PATCH 06/11] io_uring/memmap: reuse io_free_region for failure path
From: Pavel Begunkov @ 2024-11-20 23:33 UTC
To: io-uring; +Cc: asml.silence
Regions are going to become more complex with allocation options and
optimisations, so I want to split initialisation into steps, and for
that it needs a sane failure path. Reuse io_free_region(): it's smart
enough to undo only what's needed and leaves the structure in a
consistent state.
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/memmap.c | 16 +++++-----------
1 file changed, 5 insertions(+), 11 deletions(-)
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index cc5f6f69ee6c..2b3cb3fd3fdf 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -220,7 +220,6 @@ void io_free_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr)
int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
struct io_uring_region_desc *reg)
{
- int pages_accounted = 0;
struct page **pages;
int nr_pages, ret;
void *vptr;
@@ -248,32 +247,27 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
ret = __io_account_mem(ctx->user, nr_pages);
if (ret)
return ret;
- pages_accounted = nr_pages;
}
+ mr->nr_pages = nr_pages;
pages = io_pin_pages(reg->user_addr, reg->size, &nr_pages);
if (IS_ERR(pages)) {
ret = PTR_ERR(pages);
- pages = NULL;
goto out_free;
}
+ mr->pages = pages;
+ mr->flags |= IO_REGION_F_USER_PINNED;
vptr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
if (!vptr) {
ret = -ENOMEM;
goto out_free;
}
-
- mr->pages = pages;
mr->ptr = vptr;
- mr->nr_pages = nr_pages;
- mr->flags |= IO_REGION_F_VMAP | IO_REGION_F_USER_PINNED;
+ mr->flags |= IO_REGION_F_VMAP;
return 0;
out_free:
- if (pages_accounted)
- __io_unaccount_mem(ctx->user, pages_accounted);
- if (pages)
- io_pages_free(&pages, nr_pages);
+ io_free_region(ctx, mr);
return ret;
}
--
2.46.0
* [PATCH 07/11] io_uring/memmap: optimise single folio regions
From: Pavel Begunkov @ 2024-11-20 23:33 UTC
To: io-uring; +Cc: asml.silence
We don't need to vmap if the memory is already physically contiguous.
That covers two important cases: PAGE_SIZE regions and huge pages.
Use io_check_coalesce_buffer() to get the number of contiguous folios.
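For concreteness, a region backed by a single 2MB huge page spans 512
PAGE_SIZE pages on x86-64 yet is physically contiguous, so the kernel's
linear mapping already covers it; a condensed sketch of the logic the
hunk below adds:

	struct io_imu_folio_data ifd;

	if (io_check_coalesce_buffer(mr->pages, mr->nr_pages, &ifd) &&
	    ifd.nr_folios == 1)
		mr->ptr = page_address(mr->pages[0]);	/* contiguous, no vmap */
	else
		mr->ptr = vmap(mr->pages, mr->nr_pages, VM_MAP, PAGE_KERNEL);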
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/memmap.c | 29 ++++++++++++++++++++++-------
1 file changed, 22 insertions(+), 7 deletions(-)
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 2b3cb3fd3fdf..32d2a39aff02 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -217,12 +217,31 @@ void io_free_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr)
memset(mr, 0, sizeof(*mr));
}
+static int io_region_init_ptr(struct io_mapped_region *mr)
+{
+ struct io_imu_folio_data ifd;
+ void *ptr;
+
+ if (io_check_coalesce_buffer(mr->pages, mr->nr_pages, &ifd)) {
+ if (ifd.nr_folios == 1) {
+ mr->ptr = page_address(mr->pages[0]);
+ return 0;
+ }
+ }
+ ptr = vmap(mr->pages, mr->nr_pages, VM_MAP, PAGE_KERNEL);
+ if (!ptr)
+ return -ENOMEM;
+
+ mr->ptr = ptr;
+ mr->flags |= IO_REGION_F_VMAP;
+ return 0;
+}
+
int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
struct io_uring_region_desc *reg)
{
struct page **pages;
int nr_pages, ret;
- void *vptr;
u64 end;
if (WARN_ON_ONCE(mr->pages || mr->ptr || mr->nr_pages))
@@ -258,13 +277,9 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
mr->pages = pages;
mr->flags |= IO_REGION_F_USER_PINNED;
- vptr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
- if (!vptr) {
- ret = -ENOMEM;
+ ret = io_region_init_ptr(mr);
+ if (ret)
goto out_free;
- }
- mr->ptr = vptr;
- mr->flags |= IO_REGION_F_VMAP;
return 0;
out_free:
io_free_region(ctx, mr);
--
2.46.0
* [PATCH 08/11] io_uring/memmap: helper for pinning region pages
From: Pavel Begunkov @ 2024-11-20 23:33 UTC
To: io-uring; +Cc: asml.silence
In preparation for adding kernel allocated regions, extract a new helper
that pins user pages.
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/memmap.c | 29 +++++++++++++++++++++--------
1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 32d2a39aff02..15fefbed77ec 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -237,10 +237,28 @@ static int io_region_init_ptr(struct io_mapped_region *mr)
return 0;
}
+static int io_region_pin_pages(struct io_ring_ctx *ctx,
+ struct io_mapped_region *mr,
+ struct io_uring_region_desc *reg)
+{
+ unsigned long size = mr->nr_pages << PAGE_SHIFT;
+ struct page **pages;
+ int nr_pages;
+
+ pages = io_pin_pages(reg->user_addr, size, &nr_pages);
+ if (IS_ERR(pages))
+ return PTR_ERR(pages);
+ if (WARN_ON_ONCE(nr_pages != mr->nr_pages))
+ return -EFAULT;
+
+ mr->pages = pages;
+ mr->flags |= IO_REGION_F_USER_PINNED;
+ return 0;
+}
+
int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
struct io_uring_region_desc *reg)
{
- struct page **pages;
int nr_pages, ret;
u64 end;
@@ -269,14 +287,9 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
}
mr->nr_pages = nr_pages;
- pages = io_pin_pages(reg->user_addr, reg->size, &nr_pages);
- if (IS_ERR(pages)) {
- ret = PTR_ERR(pages);
+ ret = io_region_pin_pages(ctx, mr, reg);
+ if (ret)
goto out_free;
- }
- mr->pages = pages;
- mr->flags |= IO_REGION_F_USER_PINNED;
-
ret = io_region_init_ptr(mr);
if (ret)
goto out_free;
--
2.46.0
* [PATCH 09/11] io_uring/memmap: add IO_REGION_F_SINGLE_REF
From: Pavel Begunkov @ 2024-11-20 23:33 UTC
To: io-uring; +Cc: asml.silence
Kernel allocated compound pages will hold just one reference for the
entire page array; add a flag telling io_free_region() about that, so it
puts a single page rather than one per array entry.
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/memmap.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 15fefbed77ec..cdd620bdd3ee 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -198,15 +198,22 @@ void *__io_uaddr_map(struct page ***pages, unsigned short *npages,
enum {
IO_REGION_F_VMAP = 1,
IO_REGION_F_USER_PINNED = 2,
+ IO_REGION_F_SINGLE_REF = 4,
};
void io_free_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr)
{
if (mr->pages) {
+ long nr_pages = mr->nr_pages;
+
+ if (mr->flags & IO_REGION_F_SINGLE_REF)
+ nr_pages = 1;
+
if (mr->flags & IO_REGION_F_USER_PINNED)
- unpin_user_pages(mr->pages, mr->nr_pages);
+ unpin_user_pages(mr->pages, nr_pages);
else
- release_pages(mr->pages, mr->nr_pages);
+ release_pages(mr->pages, nr_pages);
+
kvfree(mr->pages);
}
if ((mr->flags & IO_REGION_F_VMAP) && mr->ptr)
--
2.46.0
* [PATCH 10/11] io_uring/memmap: implement kernel allocated regions
From: Pavel Begunkov @ 2024-11-20 23:33 UTC
To: io-uring; +Cc: asml.silence
Allow the kernel to allocate memory for a region. That's the classical
way SQ/CQ are allocated. It's not yet useful to user space as there
is no way to mmap it, which is why it's explicitly disabled in
io_register_mem_region().
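The reworked validation below relies on IORING_MEM_REGION_TYPE_USER
being bit 0, so the masked value is already 0 or 1 and can be compared
against a boolean directly; in short:

	/* user_addr must be set if and only if the region is user backed */
	if ((reg->flags & IORING_MEM_REGION_TYPE_USER) != !!reg->user_addr)
		return -EFAULT;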
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/memmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
io_uring/register.c | 2 ++
2 files changed, 43 insertions(+), 3 deletions(-)
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index cdd620bdd3ee..8598770bc385 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -263,6 +263,39 @@ static int io_region_pin_pages(struct io_ring_ctx *ctx,
return 0;
}
+static int io_region_allocate_pages(struct io_ring_ctx *ctx,
+ struct io_mapped_region *mr,
+ struct io_uring_region_desc *reg)
+{
+ gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN;
+ unsigned long size = mr->nr_pages << PAGE_SHIFT;
+ unsigned long nr_allocated;
+ struct page **pages;
+ void *p;
+
+ pages = kvmalloc_array(mr->nr_pages, sizeof(*pages), gfp);
+ if (!pages)
+ return -ENOMEM;
+
+ p = io_mem_alloc_compound(pages, mr->nr_pages, size, gfp);
+ if (!IS_ERR(p)) {
+ mr->flags |= IO_REGION_F_SINGLE_REF;
+ mr->pages = pages;
+ return 0;
+ }
+
+ nr_allocated = alloc_pages_bulk_noprof(gfp, numa_node_id(), NULL,
+ mr->nr_pages, NULL, pages);
+ if (nr_allocated != mr->nr_pages) {
+ if (nr_allocated)
+ release_pages(pages, nr_allocated);
+ kvfree(pages);
+ return -ENOMEM;
+ }
+ mr->pages = pages;
+ return 0;
+}
+
int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
struct io_uring_region_desc *reg)
{
@@ -273,9 +306,10 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
return -EFAULT;
if (memchr_inv(&reg->__resv, 0, sizeof(reg->__resv)))
return -EINVAL;
- if (reg->flags != IORING_MEM_REGION_TYPE_USER)
+ if (reg->flags & ~IORING_MEM_REGION_TYPE_USER)
return -EINVAL;
- if (!reg->user_addr)
+ /* user_addr should be set IFF it's a user memory backed region */
+ if ((reg->flags & IORING_MEM_REGION_TYPE_USER) != !!reg->user_addr)
return -EFAULT;
if (!reg->size || reg->mmap_offset || reg->id)
return -EINVAL;
@@ -294,9 +328,13 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
}
mr->nr_pages = nr_pages;
- ret = io_region_pin_pages(ctx, mr, reg);
+ if (reg->flags & IORING_MEM_REGION_TYPE_USER)
+ ret = io_region_pin_pages(ctx, mr, reg);
+ else
+ ret = io_region_allocate_pages(ctx, mr, reg);
if (ret)
goto out_free;
+
ret = io_region_init_ptr(mr);
if (ret)
goto out_free;
diff --git a/io_uring/register.c b/io_uring/register.c
index ba61697d7a53..f043d3f6b026 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -586,6 +586,8 @@ static int io_register_mem_region(struct io_ring_ctx *ctx, void __user *uarg)
if (copy_from_user(&rd, rd_uptr, sizeof(rd)))
return -EFAULT;
+ if (!(rd.flags & IORING_MEM_REGION_TYPE_USER))
+ return -EINVAL;
if (memchr_inv(&reg.__resv, 0, sizeof(reg.__resv)))
return -EINVAL;
if (reg.flags & ~IORING_MEM_REGION_REG_WAIT_ARG)
--
2.46.0
* [PATCH 11/11] io_uring/memmap: implement mmap for regions
From: Pavel Begunkov @ 2024-11-20 23:33 UTC
To: io-uring; +Cc: asml.silence
The patch implements mmap for the param region and enables the kernel
allocation mode. Internally it uses a fixed mmap offset; however, the
user has to use the offset returned in
struct io_uring_region_desc::mmap_offset.
Note, mmap doesn't and can't take ->uring_lock; the region / ring lookup
is protected by ->mmap_lock, and it directly peeks at ctx->param_region.
We can't protect io_create_region() with the mmap_lock as it'd deadlock,
which is why io_create_region_mmap_safe() initialises the region in a
temporary variable and then publishes it with the lock taken. It's
intentionally decoupled from the main region helpers, and in the future
we might want to have a list of active regions, which could then be
protected by the ->mmap_lock.
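A minimal userspace sketch of the resulting flow (an illustration, not
liburing API; it assumes the region uapi introduced earlier in this
work: IORING_REGISTER_MEM_REGION, struct io_uring_mem_region_reg and
struct io_uring_region_desc). Leaving user_addr and the type flag at
zero asks the kernel to allocate the region, and the kernel writes back
where to mmap it:

	#include <linux/io_uring.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static void *map_param_region(int ring_fd, size_t size)
	{
		struct io_uring_region_desc rd = {
			.size = size,	/* no user_addr: kernel allocated */
		};
		struct io_uring_mem_region_reg mr = {
			.region_uptr = (unsigned long)&rd,
			/* wait-arg regions require an IORING_SETUP_R_DISABLED ring */
			.flags = IORING_MEM_REGION_REG_WAIT_ARG,
		};

		if (syscall(__NR_io_uring_register, ring_fd,
			    IORING_REGISTER_MEM_REGION, &mr, 1))
			return MAP_FAILED;
		/* must use the offset the kernel returned in rd.mmap_offset */
		return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
			    ring_fd, rd.mmap_offset);
	}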
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/memmap.c | 61 +++++++++++++++++++++++++++++++++++++++++----
io_uring/memmap.h | 10 +++++++-
io_uring/register.c | 6 ++---
3 files changed, 67 insertions(+), 10 deletions(-)
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 8598770bc385..5d971ba33d5a 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -265,7 +265,8 @@ static int io_region_pin_pages(struct io_ring_ctx *ctx,
static int io_region_allocate_pages(struct io_ring_ctx *ctx,
struct io_mapped_region *mr,
- struct io_uring_region_desc *reg)
+ struct io_uring_region_desc *reg,
+ unsigned long mmap_offset)
{
gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN;
unsigned long size = mr->nr_pages << PAGE_SHIFT;
@@ -280,8 +281,7 @@ static int io_region_allocate_pages(struct io_ring_ctx *ctx,
p = io_mem_alloc_compound(pages, mr->nr_pages, size, gfp);
if (!IS_ERR(p)) {
mr->flags |= IO_REGION_F_SINGLE_REF;
- mr->pages = pages;
- return 0;
+ goto done;
}
nr_allocated = alloc_pages_bulk_noprof(gfp, numa_node_id(), NULL,
@@ -292,12 +292,15 @@ static int io_region_allocate_pages(struct io_ring_ctx *ctx,
kvfree(pages);
return -ENOMEM;
}
+done:
+ reg->mmap_offset = mmap_offset;
mr->pages = pages;
return 0;
}
int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
- struct io_uring_region_desc *reg)
+ struct io_uring_region_desc *reg,
+ unsigned long mmap_offset)
{
int nr_pages, ret;
u64 end;
@@ -331,7 +334,7 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
if (reg->flags & IORING_MEM_REGION_TYPE_USER)
ret = io_region_pin_pages(ctx, mr, reg);
else
- ret = io_region_allocate_pages(ctx, mr, reg);
+ ret = io_region_allocate_pages(ctx, mr, reg, mmap_offset);
if (ret)
goto out_free;
@@ -344,6 +347,50 @@ int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
return ret;
}
+int io_create_region_mmap_safe(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
+ struct io_uring_region_desc *reg,
+ unsigned long mmap_offset)
+{
+ struct io_mapped_region tmp_mr;
+ int ret;
+
+ memcpy(&tmp_mr, mr, sizeof(tmp_mr));
+ ret = io_create_region(ctx, &tmp_mr, reg, mmap_offset);
+ if (ret)
+ return ret;
+
+ /*
+ * Once published, mmap can find it while holding just the ->mmap_lock,
+ * without taking ->uring_lock.
+ */
+ guard(mutex)(&ctx->mmap_lock);
+ memcpy(mr, &tmp_mr, sizeof(tmp_mr));
+ return 0;
+}
+
+static void *io_region_validate_mmap(struct io_ring_ctx *ctx,
+ struct io_mapped_region *mr)
+{
+ lockdep_assert_held(&ctx->mmap_lock);
+
+ if (!io_region_is_set(mr))
+ return ERR_PTR(-EINVAL);
+ if (mr->flags & IO_REGION_F_USER_PINNED)
+ return ERR_PTR(-EINVAL);
+
+ return io_region_get_ptr(mr);
+}
+
+static int io_region_mmap(struct io_ring_ctx *ctx,
+ struct io_mapped_region *mr,
+ struct vm_area_struct *vma)
+{
+ unsigned long nr_pages = mr->nr_pages;
+
+ vm_flags_set(vma, VM_DONTEXPAND);
+ return vm_insert_pages(vma, vma->vm_start, mr->pages, &nr_pages);
+}
+
static void *io_uring_validate_mmap_request(struct file *file, loff_t pgoff,
size_t sz)
{
@@ -379,6 +426,8 @@ static void *io_uring_validate_mmap_request(struct file *file, loff_t pgoff,
io_put_bl(ctx, bl);
return ptr;
}
+ case IORING_MAP_OFF_PARAM_REGION:
+ return io_region_validate_mmap(ctx, &ctx->param_region);
}
return ERR_PTR(-EINVAL);
@@ -419,6 +468,8 @@ __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
ctx->n_sqe_pages);
case IORING_OFF_PBUF_RING:
return io_pbuf_mmap(file, vma);
+ case IORING_MAP_OFF_PARAM_REGION:
+ return io_region_mmap(ctx, &ctx->param_region, vma);
}
return -EINVAL;
diff --git a/io_uring/memmap.h b/io_uring/memmap.h
index 2096a8427277..2402bca3d700 100644
--- a/io_uring/memmap.h
+++ b/io_uring/memmap.h
@@ -1,6 +1,8 @@
#ifndef IO_URING_MEMMAP_H
#define IO_URING_MEMMAP_H
+#define IORING_MAP_OFF_PARAM_REGION 0x20000000ULL
+
struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages);
void io_pages_free(struct page ***pages, int npages);
int io_uring_mmap_pages(struct io_ring_ctx *ctx, struct vm_area_struct *vma,
@@ -24,7 +26,13 @@ int io_uring_mmap(struct file *file, struct vm_area_struct *vma);
void io_free_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr);
int io_create_region(struct io_ring_ctx *ctx, struct io_mapped_region *mr,
- struct io_uring_region_desc *reg);
+ struct io_uring_region_desc *reg,
+ unsigned long mmap_offset);
+
+int io_create_region_mmap_safe(struct io_ring_ctx *ctx,
+ struct io_mapped_region *mr,
+ struct io_uring_region_desc *reg,
+ unsigned long mmap_offset);
static inline void *io_region_get_ptr(struct io_mapped_region *mr)
{
diff --git a/io_uring/register.c b/io_uring/register.c
index f043d3f6b026..5b099ec36d00 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -585,9 +585,6 @@ static int io_register_mem_region(struct io_ring_ctx *ctx, void __user *uarg)
rd_uptr = u64_to_user_ptr(reg.region_uptr);
if (copy_from_user(&rd, rd_uptr, sizeof(rd)))
return -EFAULT;
-
- if (!(rd.flags & IORING_MEM_REGION_TYPE_USER))
- return -EINVAL;
if (memchr_inv(&reg.__resv, 0, sizeof(reg.__resv)))
return -EINVAL;
if (reg.flags & ~IORING_MEM_REGION_REG_WAIT_ARG)
@@ -602,7 +599,8 @@ static int io_register_mem_region(struct io_ring_ctx *ctx, void __user *uarg)
!(ctx->flags & IORING_SETUP_R_DISABLED))
return -EINVAL;
- ret = io_create_region(ctx, &ctx->param_region, &rd);
+ ret = io_create_region_mmap_safe(ctx, &ctx->param_region, &rd,
+ IORING_MAP_OFF_PARAM_REGION);
if (ret)
return ret;
if (copy_to_user(rd_uptr, &rd, sizeof(rd))) {
--
2.46.0
* Re: [PATCH 00/11] support kernel allocated regions
From: Jens Axboe @ 2024-11-21 1:28 UTC
To: Pavel Begunkov, io-uring
On 11/20/24 4:33 PM, Pavel Begunkov wrote:
> The classical way SQ/CQ work is the kernel doing the allocation
> and the user mmap'ing it into userspace. Regions need to
> support it as well.
>
> The patchset should be straightforward with simple preparations
> patches and cleanups. The main part is Patch 10, which internally
> implements kernel allocations, and Patch 11 that implementing the
> mmap part and exposes it to reg-wait / parameter region users.
>
> I'll be sending liburing tests in a separate set. Additionally
> tested converting CQ/SQ to internal region api, but this change
> is left for later.
Took a quick look and I like it; agree that regions should be
broadly usable rather than tied to pinning. I'll give this a
more thorough look in the coming days.
--
Jens Axboe