public inbox for io-uring@vger.kernel.org
* [PATCH 0/4] preparation for zcrx with huge pages
@ 2025-04-22 14:44 Pavel Begunkov
  2025-04-22 14:44 ` [PATCH 1/4] io_uring/zcrx: add helper for importing user memory Pavel Begunkov
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Pavel Begunkov @ 2025-04-22 14:44 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, David Wei

Add bare-bones support for huge pages to zcrx. The only real effect
for now is shrinking the page array, but it's a prerequisite for
other huge page optimisations, such as improved DMA mappings and
larger page pool allocation sizes.

There is no new uapi, but there is a basic example:

https://github.com/isilence/liburing/tree/zcrx-huge-page
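
For illustration, below is a minimal userspace sketch of backing an
area with huge pages (a hypothetical helper, not the liburing example
above; struct io_uring_zcrx_area_reg and its addr/len fields are the
existing uapi, and the rest of the zcrx ifq setup is unchanged):

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <linux/io_uring.h>

	/* Map a 2MB-huge-page-backed buffer and describe it as a zcrx
	 * area; registration then proceeds through the usual zcrx setup.
	 */
	static void *map_huge_area(size_t len, struct io_uring_zcrx_area_reg *reg)
	{
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
				 -1, 0);

		if (buf == MAP_FAILED)
			return NULL;
		memset(reg, 0, sizeof(*reg));
		reg->addr = (unsigned long long)(uintptr_t)buf;
		reg->len = len;		/* must stay page aligned */
		return buf;
	}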

Pavel Begunkov (4):
  io_uring/zcrx: add helper for importing user memory
  io_uring/zcrx: add initial infra for large pages
  io_uring: export io_coalesce_buffer()
  io_uring/zcrx: coalesce areas with huge pages

 io_uring/rsrc.c |  2 +-
 io_uring/rsrc.h |  2 ++
 io_uring/zcrx.c | 88 +++++++++++++++++++++++++++++++++++++------------
 io_uring/zcrx.h |  3 ++
 4 files changed, 73 insertions(+), 22 deletions(-)

-- 
2.48.1



* [PATCH 1/4] io_uring/zcrx: add helper for importing user memory
  2025-04-22 14:44 [PATCH 0/4] preparation for zcrx with huge pages Pavel Begunkov
@ 2025-04-22 14:44 ` Pavel Begunkov
  2025-04-22 14:44 ` [PATCH 2/4] io_uring/zcrx: add initial infra for large pages Pavel Begunkov
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2025-04-22 14:44 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, David Wei

There are two distinct steps in creating an area: first we import the
user memory, then we populate the net_iovs. In preparation for changes
to the first step, extract a helper function for it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/zcrx.c | 47 +++++++++++++++++++++++++++++++----------------
 1 file changed, 31 insertions(+), 16 deletions(-)

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index a8a9b79d3c23..0f9375e889c3 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -203,13 +203,38 @@ static void io_zcrx_free_area(struct io_zcrx_area *area)
 	kfree(area);
 }
 
+static int io_import_area_memory(struct io_zcrx_ifq *ifq,
+				 struct io_zcrx_area *area,
+				 struct io_uring_zcrx_area_reg *area_reg)
+{
+	struct iovec iov;
+	int nr_pages;
+	int ret;
+
+	iov.iov_base = u64_to_user_ptr(area_reg->addr);
+	iov.iov_len = area_reg->len;
+	ret = io_buffer_validate(&iov);
+	if (ret)
+		return ret;
+
+	area->pages = io_pin_pages((unsigned long)area_reg->addr, area_reg->len,
+				   &nr_pages);
+	if (IS_ERR(area->pages)) {
+		ret = PTR_ERR(area->pages);
+		area->pages = NULL;
+		return ret;
+	}
+	area->nr_folios = nr_pages;
+	return 0;
+}
+
 static int io_zcrx_create_area(struct io_zcrx_ifq *ifq,
 			       struct io_zcrx_area **res,
 			       struct io_uring_zcrx_area_reg *area_reg)
 {
+	unsigned nr_iovs = area_reg->len >> PAGE_SHIFT;
 	struct io_zcrx_area *area;
-	int i, ret, nr_pages, nr_iovs;
-	struct iovec iov;
+	int i, ret;
 
 	if (area_reg->flags || area_reg->rq_area_token)
 		return -EINVAL;
@@ -218,27 +243,17 @@ static int io_zcrx_create_area(struct io_zcrx_ifq *ifq,
 	if (area_reg->addr & ~PAGE_MASK || area_reg->len & ~PAGE_MASK)
 		return -EINVAL;
 
-	iov.iov_base = u64_to_user_ptr(area_reg->addr);
-	iov.iov_len = area_reg->len;
-	ret = io_buffer_validate(&iov);
-	if (ret)
-		return ret;
-
 	ret = -ENOMEM;
 	area = kzalloc(sizeof(*area), GFP_KERNEL);
 	if (!area)
 		goto err;
+	area->nia.num_niovs = nr_iovs;
 
-	area->pages = io_pin_pages((unsigned long)area_reg->addr, area_reg->len,
-				   &nr_pages);
-	if (IS_ERR(area->pages)) {
-		ret = PTR_ERR(area->pages);
-		area->pages = NULL;
+	ret = io_import_area_memory(ifq, area, area_reg);
+	if (ret)
 		goto err;
-	}
-	area->nr_folios = nr_iovs = nr_pages;
-	area->nia.num_niovs = nr_iovs;
 
+	ret = -ENOMEM;
 	area->nia.niovs = kvmalloc_array(nr_iovs, sizeof(area->nia.niovs[0]),
 					 GFP_KERNEL | __GFP_ZERO);
 	if (!area->nia.niovs)
-- 
2.48.1



* [PATCH 2/4] io_uring/zcrx: add initial infra for large pages
  2025-04-22 14:44 [PATCH 0/4] preparation for zcrx with huge pages Pavel Begunkov
  2025-04-22 14:44 ` [PATCH 1/4] io_uring/zcrx: add helper for importing user memory Pavel Begunkov
@ 2025-04-22 14:44 ` Pavel Begunkov
  2025-04-22 14:44 ` [PATCH 3/4] io_uring: export io_coalesce_buffer() Pavel Begunkov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2025-04-22 14:44 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, David Wei

Currently, the page array and the net_iovs are both 4K-sized and have
the same number of elements. Allow the page array to take a different
shape, which will be needed to support huge pages. The total size must
always match, but the array can now store fewer, larger pages / folios.
The only restriction is that the folio size must always be equal to or
larger than the niov size.

Note that there is no way yet to actually shrink the page array; that
is added in the following patches.
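
To illustrate the translation this sets up, here is a standalone
sketch of the index math with assumed values (4K niovs in 2MB folios;
not part of the patch):

	#include <stdio.h>

	int main(void)
	{
		unsigned page_shift = 12, folio_shift = 21;	/* 4K niovs, 2MB folios */
		unsigned shift = folio_shift - page_shift;	/* 9: 512 niovs per folio */
		unsigned chunk_id_offset = 0;			/* folio-aligned area */
		unsigned chunk_gid = 1000 + chunk_id_offset;	/* niov index 1000 */

		/* niov 1000 lands in folio 1, at 4K sub-page 488 within it */
		printf("folio %u, sub-page %u\n",
		       chunk_gid >> shift, chunk_gid & ((1u << shift) - 1));
		return 0;
	}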

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/zcrx.c | 24 +++++++++++++++++++-----
 io_uring/zcrx.h |  3 +++
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 0f9375e889c3..784c4ed6c780 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -38,11 +38,21 @@ static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *nio
 	return container_of(owner, struct io_zcrx_area, nia);
 }
 
-static inline struct page *io_zcrx_iov_page(const struct net_iov *niov)
+/* shift from chunk / niov to folio size */
+static inline unsigned io_chunk_folio_shift(struct io_zcrx_area *area)
 {
-	struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
+	return area->folio_shift - PAGE_SHIFT;
+}
+
+static struct page *io_zcrx_iov_page(struct io_zcrx_area *area,
+				     const struct net_iov *niov)
+{
+	unsigned chunk_gid = net_iov_idx(niov) + area->chunk_id_offset;
+	unsigned folio_idx, base_chunk_gid;
 
-	return area->pages[net_iov_idx(niov)];
+	folio_idx = chunk_gid >> io_chunk_folio_shift(area);
+	base_chunk_gid = folio_idx << io_chunk_folio_shift(area);
+	return area->pages[folio_idx] + (chunk_gid - base_chunk_gid);
 }
 
 #define IO_DMA_ATTR (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
@@ -82,9 +92,11 @@ static int io_zcrx_map_area(struct io_zcrx_ifq *ifq, struct io_zcrx_area *area)
 
 	for (i = 0; i < area->nia.num_niovs; i++) {
 		struct net_iov *niov = &area->nia.niovs[i];
+		struct page *page;
 		dma_addr_t dma;
 
-		dma = dma_map_page_attrs(ifq->dev, area->pages[i], 0, PAGE_SIZE,
+		page = io_zcrx_iov_page(area, niov);
+		dma = dma_map_page_attrs(ifq->dev, page, 0, PAGE_SIZE,
 					 DMA_FROM_DEVICE, IO_DMA_ATTR);
 		if (dma_mapping_error(ifq->dev, dma))
 			break;
@@ -225,6 +237,8 @@ static int io_import_area_memory(struct io_zcrx_ifq *ifq,
 		return ret;
 	}
 	area->nr_folios = nr_pages;
+	area->folio_shift = PAGE_SHIFT;
+	area->chunk_id_offset = 0;
 	return 0;
 }
 
@@ -807,7 +821,7 @@ static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 			break;
 		}
 
-		dst_page = io_zcrx_iov_page(niov);
+		dst_page = io_zcrx_iov_page(area, niov);
 		dst_addr = kmap_local_page(dst_page);
 		if (src_page)
 			src_base = kmap_local_page(src_page);
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index e3c7c4e647f1..dd29cfef637f 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -15,7 +15,10 @@ struct io_zcrx_area {
 	bool			is_mapped;
 	u16			area_id;
 	struct page		**pages;
+	/* offset into the first folio, in allocation chunks */
+	unsigned long		chunk_id_offset;
 	unsigned long		nr_folios;
+	unsigned		folio_shift;
 
 	/* freelist */
 	spinlock_t		freelist_lock ____cacheline_aligned_in_smp;
-- 
2.48.1



* [PATCH 3/4] io_uring: export io_coalesce_buffer()
  2025-04-22 14:44 [PATCH 0/4] preparation for zcrx with huge pages Pavel Begunkov
  2025-04-22 14:44 ` [PATCH 1/4] io_uring/zcrx: add helper for importing user memory Pavel Begunkov
  2025-04-22 14:44 ` [PATCH 2/4] io_uring/zcrx: add initial infra for large pages Pavel Begunkov
@ 2025-04-22 14:44 ` Pavel Begunkov
  2025-04-22 14:44 ` [PATCH 4/4] io_uring/zcrx: coalesce areas with huge pages Pavel Begunkov
  2025-04-25 14:01 ` [PATCH 0/4] preparation for zcrx " Pavel Begunkov
  4 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2025-04-22 14:44 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, David Wei

The next patch needs io_coalesce_buffer() for zcrx, so export it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/rsrc.c | 2 +-
 io_uring/rsrc.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index b4c5f3ee8855..572edf843f40 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -681,7 +681,7 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
 	return ret;
 }
 
-static bool io_coalesce_buffer(struct page ***pages, int *nr_pages,
+bool io_coalesce_buffer(struct page ***pages, int *nr_pages,
 				struct io_imu_folio_data *data)
 {
 	struct page **page_array = *pages, **new_array = NULL;
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 6008ad2e6d9e..2621be73e7e2 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -87,6 +87,8 @@ int io_buffer_validate(struct iovec *iov);
 
 bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
 			      struct io_imu_folio_data *data);
+bool io_coalesce_buffer(struct page ***pages, int *nr_pages,
+			struct io_imu_folio_data *data);
 
 static inline struct io_rsrc_node *io_rsrc_node_lookup(struct io_rsrc_data *data,
 						       int index)
-- 
2.48.1



* [PATCH 4/4] io_uring/zcrx: coalesce areas with huge pages
  2025-04-22 14:44 [PATCH 0/4] preparation for zcrx with huge pages Pavel Begunkov
                   ` (2 preceding siblings ...)
  2025-04-22 14:44 ` [PATCH 3/4] io_uring: export io_coalesce_buffer() Pavel Begunkov
@ 2025-04-22 14:44 ` Pavel Begunkov
  2025-04-25 14:01 ` [PATCH 0/4] preparation for zcrx " Pavel Begunkov
  4 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2025-04-22 14:44 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, David Wei

Try to shrink the page array into fewer, larger folios where possible.
This reduces the array's footprint and prepares us for future huge page
optimisations.
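
To see what the offset bookkeeping works out to, here is a standalone
sketch with assumed values (2MB folios give 512 4K chunks per folio;
nr_pages_head == 100 is illustrative; not part of the patch):

	#include <stdio.h>

	int main(void)
	{
		unsigned chunks_per_folio = 512;	/* 2MB folio / 4K chunk */
		unsigned nr_pages_head = 100;		/* pages in partial head folio */
		unsigned chunk_id_offset = chunks_per_folio - nr_pages_head;

		/* niov 0 -> chunk 412 -> folio 0, sub-page 412;
		 * niov 100 -> chunk 512 -> folio 1, sub-page 0
		 */
		printf("chunk_id_offset = %u\n", chunk_id_offset);	/* 412 */
		return 0;
	}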

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/zcrx.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 784c4ed6c780..fd0d97830854 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -219,6 +219,8 @@ static int io_import_area_memory(struct io_zcrx_ifq *ifq,
 				 struct io_zcrx_area *area,
 				 struct io_uring_zcrx_area_reg *area_reg)
 {
+	struct io_imu_folio_data data;
+	bool coalesced = false;
 	struct iovec iov;
 	int nr_pages;
 	int ret;
@@ -239,6 +241,21 @@ static int io_import_area_memory(struct io_zcrx_ifq *ifq,
 	area->nr_folios = nr_pages;
 	area->folio_shift = PAGE_SHIFT;
 	area->chunk_id_offset = 0;
+
+	if (nr_pages > 1 && io_check_coalesce_buffer(area->pages, nr_pages, &data)) {
+		if (data.nr_pages_mid != 1)
+			coalesced = io_coalesce_buffer(&area->pages, &nr_pages, &data);
+	}
+
+	if (coalesced) {
+		size_t folio_size = 1UL << data.folio_shift;
+		size_t offset = folio_size - (data.nr_pages_head << PAGE_SHIFT);
+
+		area->nr_folios = nr_pages;
+		area->folio_shift = data.folio_shift;
+		area->chunk_id_offset = offset >> PAGE_SHIFT;
+	}
+
 	return 0;
 }
 
-- 
2.48.1



* Re: [PATCH 0/4] preparation for zcrx with huge pages
  2025-04-22 14:44 [PATCH 0/4] preparation for zcrx with huge pages Pavel Begunkov
                   ` (3 preceding siblings ...)
  2025-04-22 14:44 ` [PATCH 4/4] io_uring/zcrx: coalesce areas with huge pages Pavel Begunkov
@ 2025-04-25 14:01 ` Pavel Begunkov
  2025-04-25 15:42   ` David Wei
  4 siblings, 1 reply; 8+ messages in thread
From: Pavel Begunkov @ 2025-04-25 14:01 UTC (permalink / raw)
  To: io-uring; +Cc: David Wei

On 4/22/25 15:44, Pavel Begunkov wrote:
> Add bare-bones support for huge pages to zcrx. The only real effect
> for now is shrinking the page array, but it's a prerequisite for
> other huge page optimisations, such as improved DMA mappings and
> larger page pool allocation sizes.

I'm going to resend the series later with more changes.

-- 
Pavel Begunkov



* Re: [PATCH 0/4] preparation for zcrx with huge pages
  2025-04-25 14:01 ` [PATCH 0/4] preparation for zcrx " Pavel Begunkov
@ 2025-04-25 15:42   ` David Wei
  2025-04-26  0:01     ` Pavel Begunkov
  0 siblings, 1 reply; 8+ messages in thread
From: David Wei @ 2025-04-25 15:42 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 2025-04-25 07:01, Pavel Begunkov wrote:
> On 4/22/25 15:44, Pavel Begunkov wrote:
>> Add bare-bones support for huge pages to zcrx. The only real effect
>> for now is shrinking the page array, but it's a prerequisite for
>> other huge page optimisations, such as improved DMA mappings and
>> larger page pool allocation sizes.
> 
> I'm going to resend the series later with more changes.
> 

Thanks Pavel, sorry for the lack of responses this week. I will look at
the v2.


* Re: [PATCH 0/4] preparation for zcrx with huge pages
  2025-04-25 15:42   ` David Wei
@ 2025-04-26  0:01     ` Pavel Begunkov
  0 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2025-04-26  0:01 UTC (permalink / raw)
  To: David Wei, io-uring

On 4/25/25 16:42, David Wei wrote:
> On 2025-04-25 07:01, Pavel Begunkov wrote:
>> On 4/22/25 15:44, Pavel Begunkov wrote:
>>> Add bare-bones support for huge pages to zcrx. The only real effect
>>> for now is shrinking the page array, but it's a prerequisite for
>>> other huge page optimisations, such as improved DMA mappings and
>>> larger page pool allocation sizes.
>>
>> I'm going to resend the series later with more changes.
>>
> 
> Thanks Pavel, sorry for the lack of responses this week. I will look at
> the v2.

No worries at all! Thanks

-- 
Pavel Begunkov

