public inbox for [email protected]
* [PATCH v4 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers
       [not found] <CGME20240514075453epcas5p17974fb62d65a88b1a1b55b97942ee2be@epcas5p1.samsung.com>
@ 2024-05-14  7:54 ` Chenliang Li
       [not found]   ` <CGME20240514075457epcas5p10f02f1746f957df91353724ec859664f@epcas5p1.samsung.com>
                     ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Chenliang Li @ 2024-05-14  7:54 UTC (permalink / raw)
  To: axboe, asml.silence
  Cc: io-uring, peiwei.li, joshi.k, kundan.kumar, anuj20.g, gost.dev,
	Chenliang Li

Registered buffers are stored and processed in the form of a bvec array;
each bvec element typically points to a PAGE_SIZE page, but can also
describe a hugepage. Specifically, a buffer consisting of a single
hugepage is coalesced into one hugepage bvec entry during registration.
This coalescing saves both memory and DMA-mapping time.

However, the coalescing currently doesn't work for multi-hugepage
buffers. A buffer made up of several 2M hugepages is still split into
thousands of 4K page bvec entries, when in fact a handful of hugepage
bvecs would do.

This patch series enables coalescing registered buffers that span more
than one hugepage. It reduces DMA-mapping time and saves memory for
such buffers.
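To make the savings concrete, here is a userspace sketch (plain Python,
not kernel code; the 4K/2M sizes are assumptions matching the test
setup below) of the bvec bookkeeping for an 8M buffer:

```python
# Model of registered-buffer bvec counts before/after coalescing.
PAGE_SIZE = 4096                 # 4K base page
HUGEPAGE_SIZE = 2 * 1024 * 1024  # 2M hugepage

def nr_bvecs(buf_len, coalesced):
    """Number of bvec entries needed to describe buf_len bytes."""
    unit = HUGEPAGE_SIZE if coalesced else PAGE_SIZE
    return (buf_len + unit - 1) // unit

buf_len = 8 * 1024 * 1024        # 4 x 2M hugepages
print(nr_bvecs(buf_len, coalesced=False))  # 2048 4K entries
print(nr_bvecs(buf_len, coalesced=True))   # 4 hugepage entries
```

Fewer entries means fewer segments for the block layer and DMA mapping
code to walk, which is what the perf diffs below reflect.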

Testing:

Hugepage fixed-buffer I/O can be tested with unmodified fio; the
fio command used in the following tests is given in [1]. There is
also a liburing test case in [2]. Note that the system must have
enough hugepages available before testing.

Perf diff of 8M (4 * 2M hugepages) fio randread test:

Before          After           Symbol
.....................................................
4.68%				[k] __blk_rq_map_sg
3.31%				[k] dma_direct_map_sg
2.64%				[k] dma_pool_alloc
1.09%				[k] sg_next
                +0.49%		[k] dma_map_page_attrs

Perf diff of 8M fio randwrite test:

Before		After		Symbol
......................................................
2.82%				[k] __blk_rq_map_sg
2.05%				[k] dma_direct_map_sg
1.75%				[k] dma_pool_alloc
0.68%				[k] sg_next
		+0.08%		[k] dma_map_page_attrs

The first three patches prepare for adding multi-hugepage coalescing
to buffer registration; the 4th patch enables the feature.

-----------------
Changes since v3:

- Delete unnecessary commit message
- Update test command and test results

v3 : https://lore.kernel.org/io-uring/[email protected]/T/#t

Changes since v2:

- Modify the loop iterator increment to make the code cleaner
- Minor fix to the return path in coalesced buffer accounting
- Correct commit messages
- Add test cases in liburing

v2 : https://lore.kernel.org/io-uring/[email protected]/T/#t

Changes since v1:

- Split into 4 patches
- Fix code style issues
- Rearrange the code changes for a cleaner look
- Add a specialized pinned-page accounting procedure for coalesced
  buffers
- Reorder the newly added fields in the imu struct for better packing

v1 : https://lore.kernel.org/io-uring/[email protected]/T/#u

[1]
fio -iodepth=64 -rw=randread(-rw=randwrite) -direct=1 -ioengine=io_uring \
-bs=8M -numjobs=1 -group_reporting -mem=shmhuge -fixedbufs -hugepage-size=2M \
-filename=/dev/nvme0n1 -runtime=10s -name=test1

[2]
https://lore.kernel.org/io-uring/[email protected]/T/#u

Chenliang Li (4):
  io_uring/rsrc: add hugepage buffer coalesce helpers
  io_uring/rsrc: store folio shift and mask into imu
  io_uring/rsrc: add init and account functions for coalesced imus
  io_uring/rsrc: enable multi-hugepage buffer coalescing

 io_uring/rsrc.c | 217 +++++++++++++++++++++++++++++++++++++++---------
 io_uring/rsrc.h |  12 +++
 2 files changed, 191 insertions(+), 38 deletions(-)


base-commit: 59b28a6e37e650c0d601ed87875b6217140cda5d
-- 
2.34.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v4 1/4] io_uring/rsrc: add hugepage buffer coalesce helpers
       [not found]   ` <CGME20240514075457epcas5p10f02f1746f957df91353724ec859664f@epcas5p1.samsung.com>
@ 2024-05-14  7:54     ` Chenliang Li
  2024-05-16 14:07       ` Anuj gupta
  2024-06-16 18:04       ` Pavel Begunkov
  0 siblings, 2 replies; 22+ messages in thread
From: Chenliang Li @ 2024-05-14  7:54 UTC (permalink / raw)
  To: axboe, asml.silence
  Cc: io-uring, peiwei.li, joshi.k, kundan.kumar, anuj20.g, gost.dev,
	Chenliang Li

Introduce helper functions to check whether a buffer can be coalesced,
and to gather folio data for later use.

Coalescing reduces the time and space consumed by mapping and storing
multi-hugepage fixed buffers.

A coalescable multi-hugepage buffer should fully cover its folios
(except potentially the first and last one), and these folios should
have the same size. These requirements are for easier later process,
also we need same size'd chunks in io_import_fixed for fast iov_iter
adjust.
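The coverage rule above can be sketched in userspace (a Python model,
not the kernel helper; the folio layouts are hypothetical), summarizing
each folio as (folio_nr_pages, pages covered by the buffer):

```python
def can_coalesce(runs):
    """runs: per-folio (folio_nr_pages, nr_pages_covered), in buffer
    order. Coalescable iff all folios have the same size (> 1 page)
    and every folio except possibly the first and last is fully
    covered by the buffer."""
    if not runs or runs[0][0] == 1:
        return False
    size = runs[0][0]
    for i, (nr, covered) in enumerate(runs):
        if nr != size:
            return False
        if covered != nr and i not in (0, len(runs) - 1):
            return False
    return True

# Three 2M folios (512 pages each), head and tail partially covered:
print(can_coalesce([(512, 100), (512, 512), (512, 300)]))  # True
# A partially covered middle folio breaks the same-sized-chunk rule:
print(can_coalesce([(512, 512), (512, 8), (512, 512)]))    # False
```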

Signed-off-by: Chenliang Li <[email protected]>
---
 io_uring/rsrc.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++
 io_uring/rsrc.h | 10 +++++++
 2 files changed, 88 insertions(+)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 65417c9553b1..d08224c0c5b0 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -871,6 +871,84 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
 	return ret;
 }
 
+static bool __io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
+					 struct io_imu_folio_data *data)
+{
+	struct folio *folio = page_folio(pages[0]);
+	unsigned int count = 1;
+	int i;
+
+	data->nr_pages_mid = folio_nr_pages(folio);
+	if (data->nr_pages_mid == 1)
+		return false;
+
+	data->folio_shift = folio_shift(folio);
+	data->folio_size = folio_size(folio);
+	data->nr_folios = 1;
+	/*
+	 * Check if pages are contiguous inside a folio, and all folios have
+	 * the same page count except for the head and tail.
+	 */
+	for (i = 1; i < nr_pages; i++) {
+		if (page_folio(pages[i]) == folio &&
+			pages[i] == pages[i-1] + 1) {
+			count++;
+			continue;
+		}
+
+		if (data->nr_folios == 1)
+			data->nr_pages_head = count;
+		else if (count != data->nr_pages_mid)
+			return false;
+
+		folio = page_folio(pages[i]);
+		if (folio_size(folio) != data->folio_size)
+			return false;
+
+		count = 1;
+		data->nr_folios++;
+	}
+	if (data->nr_folios == 1)
+		data->nr_pages_head = count;
+
+	return true;
+}
+
+static bool io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
+				       struct io_imu_folio_data *data)
+{
+	int i, j;
+
+	if (nr_pages <= 1 ||
+		!__io_sqe_buffer_try_coalesce(pages, nr_pages, data))
+		return false;
+
+	/*
+	 * The pages are bound to the folio, it doesn't
+	 * actually unpin them but drops all but one reference,
+	 * which is usually put down by io_buffer_unmap().
+	 * Note, needs a better helper.
+	 */
+	if (data->nr_pages_head > 1)
+		unpin_user_pages(&pages[1], data->nr_pages_head - 1);
+
+	j = data->nr_pages_head;
+	nr_pages -= data->nr_pages_head;
+	for (i = 1; i < data->nr_folios; i++) {
+		unsigned int nr_unpin;
+
+		nr_unpin = min_t(unsigned int, nr_pages - 1,
+					data->nr_pages_mid - 1);
+		if (nr_unpin == 0)
+			break;
+		unpin_user_pages(&pages[j+1], nr_unpin);
+		j += data->nr_pages_mid;
+		nr_pages -= data->nr_pages_mid;
+	}
+
+	return true;
+}
+
 static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 				  struct io_mapped_ubuf **pimu,
 				  struct page **last_hpage)
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index c032ca3436ca..b2a9d66b76dd 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -50,6 +50,16 @@ struct io_mapped_ubuf {
 	struct bio_vec	bvec[] __counted_by(nr_bvecs);
 };
 
+struct io_imu_folio_data {
+	/* Head folio can be partially included in the fixed buf */
+	unsigned int	nr_pages_head;
+	/* For non-head/tail folios, has to be fully included */
+	unsigned int	nr_pages_mid;
+	unsigned int	nr_folios;
+	unsigned int	folio_shift;
+	size_t		folio_size;
+};
+
 void io_rsrc_node_ref_zero(struct io_rsrc_node *node);
 void io_rsrc_node_destroy(struct io_ring_ctx *ctx, struct io_rsrc_node *ref_node);
 struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx);
-- 
2.34.1



* [PATCH v4 2/4] io_uring/rsrc: store folio shift and mask into imu
       [not found]   ` <CGME20240514075459epcas5p2275b4c26f16bcfcea200e97fc75c2a14@epcas5p2.samsung.com>
@ 2024-05-14  7:54     ` Chenliang Li
  2024-05-16 14:08       ` Anuj gupta
  0 siblings, 1 reply; 22+ messages in thread
From: Chenliang Li @ 2024-05-14  7:54 UTC (permalink / raw)
  To: axboe, asml.silence
  Cc: io-uring, peiwei.li, joshi.k, kundan.kumar, anuj20.g, gost.dev,
	Chenliang Li

Store the folio shift and folio mask in the imu struct and use them in
iov_iter adjustment, as the chunks will not be PAGE_SIZE'd when a
multi-hugepage buffer gets coalesced.
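The effect on the seek in io_import_fixed() can be sketched in
userspace (Python model; the shift values are illustrative):

```python
def seek(offset, first_bv_len, folio_shift):
    """Mirror of the adjusted branch in io_import_fixed(): skip the
    first bvec, then whole folio-sized bvecs, keeping the remainder
    as the intra-bvec offset."""
    folio_mask = ~((1 << folio_shift) - 1)
    offset -= first_bv_len           # skip first vec
    seg_skip = 1 + (offset >> folio_shift)
    iov_offset = offset & ~folio_mask
    return seg_skip, iov_offset

# Coalesced imu with 2M folios (shift 21): a 5M offset skips two
# whole bvecs and lands 1M into the third.
print(seek(5 * 1024 * 1024, 2 * 1024 * 1024, 21))  # (2, 1048576)
# A non-coalesced imu keeps today's behaviour via shift = PAGE_SHIFT.
print(seek(3 * 4096 + 100, 4096, 12))              # (3, 100)
```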

Signed-off-by: Chenliang Li <[email protected]>
---
 io_uring/rsrc.c | 6 ++++--
 io_uring/rsrc.h | 2 ++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index d08224c0c5b0..578d382ca9bc 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1015,6 +1015,8 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 	imu->ubuf = (unsigned long) iov->iov_base;
 	imu->ubuf_end = imu->ubuf + iov->iov_len;
 	imu->nr_bvecs = nr_pages;
+	imu->folio_shift = PAGE_SHIFT;
+	imu->folio_mask = PAGE_MASK;
 	*pimu = imu;
 	ret = 0;
 
@@ -1153,12 +1155,12 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
 
 			/* skip first vec */
 			offset -= bvec->bv_len;
-			seg_skip = 1 + (offset >> PAGE_SHIFT);
+			seg_skip = 1 + (offset >> imu->folio_shift);
 
 			iter->bvec = bvec + seg_skip;
 			iter->nr_segs -= seg_skip;
 			iter->count -= bvec->bv_len + offset;
-			iter->iov_offset = offset & ~PAGE_MASK;
+			iter->iov_offset = offset & ~imu->folio_mask;
 		}
 	}
 
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index b2a9d66b76dd..93da02e652bc 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -46,7 +46,9 @@ struct io_mapped_ubuf {
 	u64		ubuf;
 	u64		ubuf_end;
 	unsigned int	nr_bvecs;
+	unsigned int	folio_shift;
 	unsigned long	acct_pages;
+	unsigned long	folio_mask;
 	struct bio_vec	bvec[] __counted_by(nr_bvecs);
 };
 
-- 
2.34.1



* [PATCH v4 3/4] io_uring/rsrc: add init and account functions for coalesced imus
       [not found]   ` <CGME20240514075500epcas5p1e638b1ae84727b3669ff6b780cd1cb23@epcas5p1.samsung.com>
@ 2024-05-14  7:54     ` Chenliang Li
  2024-06-16 17:43       ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Chenliang Li @ 2024-05-14  7:54 UTC (permalink / raw)
  To: axboe, asml.silence
  Cc: io-uring, peiwei.li, joshi.k, kundan.kumar, anuj20.g, gost.dev,
	Chenliang Li

Introduce two functions to separate the coalesced imu allocation and
accounting paths from the original ones. This helps keep the original
code path clean.
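The bvec layout that the coalesced allocation produces can be sketched
as follows (a userspace Python model, assuming the head folio is mapped
from the buffer offset through to its end):

```python
def build_bvecs(iov_base, iov_len, folio_size):
    """Return (head offset, per-bvec lengths): a possibly short head
    entry, then full-folio entries, the tail capped by what remains."""
    off = iov_base & (folio_size - 1)
    lens, size = [], iov_len
    vec_len = min(size, folio_size - off)  # head entry
    lens.append(vec_len)
    size -= vec_len
    while size:
        vec_len = min(size, folio_size)    # middle/tail entries
        lens.append(vec_len)
        size -= vec_len
    return off, lens

# A buffer starting 4K into a 2M folio: one short head entry plus
# three full 2M entries; the lengths always sum to the buffer length.
off, lens = build_bvecs(0x201000, 8 * 1024 * 1024 - 4096,
                        2 * 1024 * 1024)
print(off, lens[0], len(lens))
```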

Signed-off-by: Chenliang Li <[email protected]>
---
 io_uring/rsrc.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 89 insertions(+)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 578d382ca9bc..53fac5f27bbf 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -871,6 +871,45 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
 	return ret;
 }
 
+static int io_coalesced_buffer_account_pin(struct io_ring_ctx *ctx,
+					   struct page **pages,
+					   struct io_mapped_ubuf *imu,
+					   struct page **last_hpage,
+					   struct io_imu_folio_data *data)
+{
+	int i, j, ret;
+
+	imu->acct_pages = 0;
+	j = 0;
+	for (i = 0; i < data->nr_folios; i++) {
+		struct page *hpage = pages[j];
+
+		if (hpage == *last_hpage)
+			continue;
+		*last_hpage = hpage;
+		/*
+		 * Already checked the page array in try coalesce,
+		 * so pass in nr_pages=0 here to waive that.
+		 */
+		if (headpage_already_acct(ctx, pages, 0, hpage))
+			continue;
+		imu->acct_pages += data->nr_pages_mid;
+		if (i)
+			j += data->nr_pages_mid;
+		else
+			j = data->nr_pages_head;
+	}
+
+	if (!imu->acct_pages)
+		return 0;
+
+	ret = io_account_mem(ctx, imu->acct_pages);
+	if (!ret)
+		return 0;
+	imu->acct_pages = 0;
+	return ret;
+}
+
 static bool __io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
 					 struct io_imu_folio_data *data)
 {
@@ -949,6 +988,56 @@ static bool io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
 	return true;
 }
 
+static int io_coalesced_imu_alloc(struct io_ring_ctx *ctx, struct iovec *iov,
+				  struct io_mapped_ubuf **pimu,
+				  struct page **last_hpage, struct page **pages,
+				  struct io_imu_folio_data *data)
+{
+	struct io_mapped_ubuf *imu = NULL;
+	unsigned long off;
+	size_t size, vec_len;
+	int ret, i, j;
+
+	ret = -ENOMEM;
+	imu = kvmalloc(struct_size(imu, bvec, data->nr_folios), GFP_KERNEL);
+	if (!imu)
+		return ret;
+
+	ret = io_coalesced_buffer_account_pin(ctx, pages, imu, last_hpage,
+						data);
+	if (ret) {
+		unpin_user_page(pages[0]);
+		j = data->nr_pages_head;
+		for (i = 1; i < data->nr_folios; i++) {
+			unpin_user_page(pages[j]);
+			j += data->nr_pages_mid;
+		}
+		return ret;
+	}
+	off = (unsigned long) iov->iov_base & ~PAGE_MASK;
+	size = iov->iov_len;
+	/* store original address for later verification */
+	imu->ubuf = (unsigned long) iov->iov_base;
+	imu->ubuf_end = imu->ubuf + iov->iov_len;
+	imu->nr_bvecs = data->nr_folios;
+	imu->folio_shift = data->folio_shift;
+	imu->folio_mask = ~((1UL << data->folio_shift) - 1);
+	*pimu = imu;
+	ret = 0;
+
+	vec_len = min_t(size_t, size, PAGE_SIZE * data->nr_pages_head - off);
+	bvec_set_page(&imu->bvec[0], pages[0], vec_len, off);
+	size -= vec_len;
+	j = data->nr_pages_head;
+	for (i = 1; i < data->nr_folios; i++) {
+		vec_len = min_t(size_t, size, data->folio_size);
+		bvec_set_page(&imu->bvec[i], pages[j], vec_len, 0);
+		size -= vec_len;
+		j += data->nr_pages_mid;
+	}
+	return ret;
+}
+
 static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 				  struct io_mapped_ubuf **pimu,
 				  struct page **last_hpage)
-- 
2.34.1



* [PATCH v4 4/4] io_uring/rsrc: enable multi-hugepage buffer coalescing
       [not found]   ` <CGME20240514075502epcas5p10be6bef71d284a110277575d6008563d@epcas5p1.samsung.com>
@ 2024-05-14  7:54     ` Chenliang Li
  2024-05-16 14:09       ` Anuj gupta
  0 siblings, 1 reply; 22+ messages in thread
From: Chenliang Li @ 2024-05-14  7:54 UTC (permalink / raw)
  To: axboe, asml.silence
  Cc: io-uring, peiwei.li, joshi.k, kundan.kumar, anuj20.g, gost.dev,
	Chenliang Li

Modify the original buffer registration path to enable coalescing for
buffers consisting of more than one hugepage.

Signed-off-by: Chenliang Li <[email protected]>
---
 io_uring/rsrc.c | 44 ++++++++------------------------------------
 1 file changed, 8 insertions(+), 36 deletions(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 53fac5f27bbf..5e5c1d6f3501 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1047,7 +1047,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 	unsigned long off;
 	size_t size;
 	int ret, nr_pages, i;
-	struct folio *folio = NULL;
+	struct io_imu_folio_data data;
 
 	*pimu = (struct io_mapped_ubuf *)&dummy_ubuf;
 	if (!iov->iov_base)
@@ -1062,30 +1062,11 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 		goto done;
 	}
 
-	/* If it's a huge page, try to coalesce them into a single bvec entry */
-	if (nr_pages > 1) {
-		folio = page_folio(pages[0]);
-		for (i = 1; i < nr_pages; i++) {
-			/*
-			 * Pages must be consecutive and on the same folio for
-			 * this to work
-			 */
-			if (page_folio(pages[i]) != folio ||
-			    pages[i] != pages[i - 1] + 1) {
-				folio = NULL;
-				break;
-			}
-		}
-		if (folio) {
-			/*
-			 * The pages are bound to the folio, it doesn't
-			 * actually unpin them but drops all but one reference,
-			 * which is usually put down by io_buffer_unmap().
-			 * Note, needs a better helper.
-			 */
-			unpin_user_pages(&pages[1], nr_pages - 1);
-			nr_pages = 1;
-		}
+	/* If it's huge page(s), try to coalesce them into fewer bvec entries */
+	if (io_sqe_buffer_try_coalesce(pages, nr_pages, &data)) {
+		ret = io_coalesced_imu_alloc(ctx, iov, pimu, last_hpage,
+						pages, &data);
+		goto done;
 	}
 
 	imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
@@ -1109,10 +1090,6 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 	*pimu = imu;
 	ret = 0;
 
-	if (folio) {
-		bvec_set_page(&imu->bvec[0], pages[0], size, off);
-		goto done;
-	}
 	for (i = 0; i < nr_pages; i++) {
 		size_t vec_len;
 
@@ -1218,23 +1195,18 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
 		 * we know that:
 		 *
 		 * 1) it's a BVEC iter, we set it up
-		 * 2) all bvecs are PAGE_SIZE in size, except potentially the
+		 * 2) all bvecs are the same in size, except potentially the
 		 *    first and last bvec
 		 *
 		 * So just find our index, and adjust the iterator afterwards.
 		 * If the offset is within the first bvec (or the whole first
 		 * bvec, just use iov_iter_advance(). This makes it easier
 		 * since we can just skip the first segment, which may not
-		 * be PAGE_SIZE aligned.
+		 * be folio_size aligned.
 		 */
 		const struct bio_vec *bvec = imu->bvec;
 
 		if (offset < bvec->bv_len) {
-			/*
-			 * Note, huge pages buffers consists of one large
-			 * bvec entry and should always go this way. The other
-			 * branch doesn't expect non PAGE_SIZE'd chunks.
-			 */
 			iter->bvec = bvec;
 			iter->nr_segs = bvec->bv_len;
 			iter->count -= offset;
-- 
2.34.1



* Re: [PATCH v4 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers
  2024-05-14  7:54 ` [PATCH v4 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers Chenliang Li
                     ` (3 preceding siblings ...)
       [not found]   ` <CGME20240514075502epcas5p10be6bef71d284a110277575d6008563d@epcas5p1.samsung.com>
@ 2024-05-16 14:01   ` Anuj gupta
  2024-05-16 14:58     ` Jens Axboe
  4 siblings, 1 reply; 22+ messages in thread
From: Anuj gupta @ 2024-05-16 14:01 UTC (permalink / raw)
  To: Chenliang Li
  Cc: axboe, asml.silence, io-uring, peiwei.li, joshi.k, kundan.kumar,
	anuj20.g, gost.dev

On Tue, May 14, 2024 at 1:25 PM Chenliang Li <[email protected]> wrote:
>
> [...]
>

I tested this series by registering multi-hugepage buffers. The
coalescing helps save DMA-mapping time. Below is the gain observed on
my setup while running the fio workload shared here.

RandomRead:
Baseline        DeltaAbs        Symbol
.....................................................
3.89%            -3.62%            [k] blk_rq_map_sg
3.58%            -3.23%            [k] dma_direct_map_sg
2.25%            -2.23%            [k] sg_next

RandomWrite:
Baseline        DeltaAbs        Symbol
.....................................................
2.46%            -2.31%            [k] dma_direct_map_sg
2.06%            -2.05%            [k] sg_next
2.08%            -1.80%            [k] blk_rq_map_sg

The liburing test case shared works fine too on my setup.

Feel free to add:
Tested-by: Anuj Gupta <[email protected]>
--
Anuj Gupta


* Re: [PATCH v4 1/4] io_uring/rsrc: add hugepage buffer coalesce helpers
  2024-05-14  7:54     ` [PATCH v4 1/4] io_uring/rsrc: add hugepage buffer coalesce helpers Chenliang Li
@ 2024-05-16 14:07       ` Anuj gupta
  2024-06-16 18:04       ` Pavel Begunkov
  1 sibling, 0 replies; 22+ messages in thread
From: Anuj gupta @ 2024-05-16 14:07 UTC (permalink / raw)
  To: Chenliang Li
  Cc: axboe, asml.silence, io-uring, peiwei.li, joshi.k, kundan.kumar,
	anuj20.g, gost.dev

On Tue, May 14, 2024 at 1:28 PM Chenliang Li <[email protected]> wrote:
>
> Introduce helper functions to check whether a buffer can
> be coalesced or not, and gather folio data for later use.
>
> The coalescing optimizes time and space consumption caused
> by mapping and storing multi-hugepage fixed buffers.
>
> A coalescable multi-hugepage buffer should fully cover its folios
> (except potentially the first and last one), and these folios should
> have the same size. These requirements are for easier later process,

Nit: for easier processing later

> also we need same size'd chunks in io_import_fixed for fast iov_iter
> adjust.
>
> Signed-off-by: Chenliang Li <[email protected]>
> [...]
Looks good:
Reviewed-by: Anuj Gupta <[email protected]>
--
Anuj Gupta


* Re: [PATCH v4 2/4] io_uring/rsrc: store folio shift and mask into imu
  2024-05-14  7:54     ` [PATCH v4 2/4] io_uring/rsrc: store folio shift and mask into imu Chenliang Li
@ 2024-05-16 14:08       ` Anuj gupta
  0 siblings, 0 replies; 22+ messages in thread
From: Anuj gupta @ 2024-05-16 14:08 UTC (permalink / raw)
  To: Chenliang Li
  Cc: axboe, asml.silence, io-uring, peiwei.li, joshi.k, kundan.kumar,
	anuj20.g, gost.dev

On Tue, May 14, 2024 at 1:25 PM Chenliang Li <[email protected]> wrote:
>
> Store the folio shift and folio mask into imu struct and use it in
> iov_iter adjust, as we will have non PAGE_SIZE'd chunks if a
> multi-hugepage buffer get coalesced.
>
> Signed-off-by: Chenliang Li <[email protected]>
> [...]
Looks good.
Reviewed-by: Anuj Gupta <[email protected]>
--
Anuj Gupta


* Re: [PATCH v4 4/4] io_uring/rsrc: enable multi-hugepage buffer coalescing
  2024-05-14  7:54     ` [PATCH v4 4/4] io_uring/rsrc: enable multi-hugepage buffer coalescing Chenliang Li
@ 2024-05-16 14:09       ` Anuj gupta
  0 siblings, 0 replies; 22+ messages in thread
From: Anuj gupta @ 2024-05-16 14:09 UTC (permalink / raw)
  To: Chenliang Li
  Cc: axboe, asml.silence, io-uring, peiwei.li, joshi.k, kundan.kumar,
	anuj20.g, gost.dev

On Tue, May 14, 2024 at 1:25 PM Chenliang Li <[email protected]> wrote:
>
> Modify the original buffer registration and enable the coalescing for
> buffers with more than one hugepages.
>
> Signed-off-by: Chenliang Li <[email protected]>
> ---
>  io_uring/rsrc.c | 44 ++++++++------------------------------------
>  1 file changed, 8 insertions(+), 36 deletions(-)
>
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 53fac5f27bbf..5e5c1d6f3501 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -1047,7 +1047,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
>         unsigned long off;
>         size_t size;
>         int ret, nr_pages, i;
> -       struct folio *folio = NULL;
> +       struct io_imu_folio_data data;
>
>         *pimu = (struct io_mapped_ubuf *)&dummy_ubuf;
>         if (!iov->iov_base)
> @@ -1062,30 +1062,11 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
>                 goto done;
>         }
>
> -       /* If it's a huge page, try to coalesce them into a single bvec entry */
> -       if (nr_pages > 1) {
> -               folio = page_folio(pages[0]);
> -               for (i = 1; i < nr_pages; i++) {
> -                       /*
> -                        * Pages must be consecutive and on the same folio for
> -                        * this to work
> -                        */
> -                       if (page_folio(pages[i]) != folio ||
> -                           pages[i] != pages[i - 1] + 1) {
> -                               folio = NULL;
> -                               break;
> -                       }
> -               }
> -               if (folio) {
> -                       /*
> -                        * The pages are bound to the folio, it doesn't
> -                        * actually unpin them but drops all but one reference,
> -                        * which is usually put down by io_buffer_unmap().
> -                        * Note, needs a better helper.
> -                        */
> -                       unpin_user_pages(&pages[1], nr_pages - 1);
> -                       nr_pages = 1;
> -               }
> +       /* If it's huge page(s), try to coalesce them into fewer bvec entries */
> +       if (io_sqe_buffer_try_coalesce(pages, nr_pages, &data)) {
> +               ret = io_coalesced_imu_alloc(ctx, iov, pimu, last_hpage,
> +                                               pages, &data);
> +               goto done;
>         }
>
>         imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
> @@ -1109,10 +1090,6 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
>         *pimu = imu;
>         ret = 0;
>
> -       if (folio) {
> -               bvec_set_page(&imu->bvec[0], pages[0], size, off);
> -               goto done;
> -       }
>         for (i = 0; i < nr_pages; i++) {
>                 size_t vec_len;
>
> @@ -1218,23 +1195,18 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
>                  * we know that:
>                  *
>                  * 1) it's a BVEC iter, we set it up
> -                * 2) all bvecs are PAGE_SIZE in size, except potentially the
> +                * 2) all bvecs are the same in size, except potentially the
>                  *    first and last bvec
>                  *
>                  * So just find our index, and adjust the iterator afterwards.
>                  * If the offset is within the first bvec (or the whole first
>                  * bvec, just use iov_iter_advance(). This makes it easier
>                  * since we can just skip the first segment, which may not
> -                * be PAGE_SIZE aligned.
> +                * be folio_size aligned.
>                  */
>                 const struct bio_vec *bvec = imu->bvec;
>
>                 if (offset < bvec->bv_len) {
> -                       /*
> -                        * Note, huge pages buffers consists of one large
> -                        * bvec entry and should always go this way. The other
> -                        * branch doesn't expect non PAGE_SIZE'd chunks.
> -                        */
>                         iter->bvec = bvec;
>                         iter->nr_segs = bvec->bv_len;
>                         iter->count -= offset;
> --
> 2.34.1
>
>
Looks good.
Reviewed-by: Anuj Gupta <[email protected]>
--
Anuj Gupta


* Re: [PATCH v4 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers
  2024-05-16 14:01   ` [PATCH v4 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers Anuj gupta
@ 2024-05-16 14:58     ` Jens Axboe
       [not found]       ` <CGME20240530051050epcas5p122f30aebcf99e27a8d02cc1318dbafc8@epcas5p1.samsung.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2024-05-16 14:58 UTC (permalink / raw)
  To: Anuj gupta, Chenliang Li
  Cc: asml.silence, io-uring, peiwei.li, joshi.k, kundan.kumar,
	anuj20.g, gost.dev

On 5/16/24 8:01 AM, Anuj gupta wrote:
> On Tue, May 14, 2024 at 1:25?PM Chenliang Li <[email protected]> wrote:
>>
>> Registered buffers are stored and processed in the form of bvec array,
>> each bvec element typically points to a PAGE_SIZE page but can also work
>> with hugepages. Specifically, a buffer consisting of a hugepage is
>> coalesced to use only one hugepage bvec entry during registration.
>> This coalescing feature helps to save both the space and DMA-mapping time.
>>
>> However, currently the coalescing feature doesn't work for multi-hugepage
>> buffers. For a buffer with several 2M hugepages, we still split it into
>> thousands of 4K page bvec entries while in fact, we can just use a
>> handful of hugepage bvecs.
>>
>> This patch series enables coalescing registered buffers with more than
>> one hugepages. It optimizes the DMA-mapping time and saves memory for
>> these kind of buffers.
>>
>> Testing:
>>
>> The hugepage fixed buffer I/O can be tested using fio without
>> modification. The fio command used in the following test is given
>> in [1]. There's also a liburing testcase in [2]. Also, the system
>> should have enough hugepages available before testing.
>>
>> Perf diff of 8M(4 * 2M hugepages) fio randread test:
>>
>> Before          After           Symbol
>> .....................................................
>> 4.68%                           [k] __blk_rq_map_sg
>> 3.31%                           [k] dma_direct_map_sg
>> 2.64%                           [k] dma_pool_alloc
>> 1.09%                           [k] sg_next
>>                 +0.49%          [k] dma_map_page_attrs
>>
>> Perf diff of 8M fio randwrite test:
>>
>> Before          After           Symbol
>> ......................................................
>> 2.82%                           [k] __blk_rq_map_sg
>> 2.05%                           [k] dma_direct_map_sg
>> 1.75%                           [k] dma_pool_alloc
>> 0.68%                           [k] sg_next
>>                 +0.08%          [k] dma_map_page_attrs
>>
>> First three patches prepare for adding the multi-hugepage coalescing
>> into buffer registration, the 4th patch enables the feature.
>>
>> -----------------
>> Changes since v3:
>>
>> - Delete unnecessary commit message
>> - Update test command and test results
>>
>> v3 : https://lore.kernel.org/io-uring/[email protected]/T/#t
>>
>> Changes since v2:
>>
>> - Modify the loop iterator increment to make code cleaner
>> - Minor fix to the return procedure in coalesced buffer account
>> - Correct commit messages
>> - Add test cases in liburing
>>
>> v2 : https://lore.kernel.org/io-uring/[email protected]/T/#t
>>
>> Changes since v1:
>>
>> - Split into 4 patches
>> - Fix code style issues
>> - Rearrange the change of code for cleaner look
>> - Add speciallized pinned page accounting procedure for coalesced
>>   buffers
>> - Reordered the newly add fields in imu struct for better compaction
>>
>> v1 : https://lore.kernel.org/io-uring/[email protected]/T/#u
>>
>> [1]
>> fio -iodepth=64 -rw=randread(-rw=randwrite) -direct=1 -ioengine=io_uring \
>> -bs=8M -numjobs=1 -group_reporting -mem=shmhuge -fixedbufs -hugepage-size=2M \
>> -filename=/dev/nvme0n1 -runtime=10s -name=test1
>>
>> [2]
>> https://lore.kernel.org/io-uring/[email protected]/T/#u
>>
>> Chenliang Li (4):
>>   io_uring/rsrc: add hugepage buffer coalesce helpers
>>   io_uring/rsrc: store folio shift and mask into imu
>>   io_uring/rsrc: add init and account functions for coalesced imus
>>   io_uring/rsrc: enable multi-hugepage buffer coalescing
>>
>>  io_uring/rsrc.c | 217 +++++++++++++++++++++++++++++++++++++++---------
>>  io_uring/rsrc.h |  12 +++
>>  2 files changed, 191 insertions(+), 38 deletions(-)
>>
>>
>> base-commit: 59b28a6e37e650c0d601ed87875b6217140cda5d
>> --
>> 2.34.1
>>
>>
> 
> I tested this series by registering multi-hugepage buffers. The coalescing helps
> saving dma-mapping time. This is the gain observed on my setup, while running
> the fio workload shared here.
> 
> RandomRead:
> Baseline        DeltaAbs        Symbol
> .....................................................
> 3.89%            -3.62%            [k] blk_rq_map_sg
> 3.58%            -3.23%            [k] dma_direct_map_sg
> 2.25%            -2.23%            [k] sg_next
> 
> RandomWrite:
> Baseline        DeltaAbs        Symbol
> .....................................................
> 2.46%            -2.31%            [k] dma_direct_map_sg
> 2.06%            -2.05%            [k] sg_next
> 2.08%            -1.80%            [k] blk_rq_map_sg
> 
> The liburing test case shared works fine too on my setup.
> 
> Feel free to add:
> Tested-by: Anuj Gupta <[email protected]>

It's even more dramatic here, excerpt from profiles:

    32.16%    -25.46%  [kernel.kallsyms]  [k] bio_split_rw
     8.92%     -8.38%  [kernel.kallsyms]  [k] iov_iter_is_aligned
     6.85%     -4.31%  [nvme]             [k] nvme_prep_rq.part.0
    14.71%             [kernel.kallsyms]  [k] __blk_rq_map_sg
     9.49%             [kernel.kallsyms]  [k] dma_direct_map_sg
     8.50%             [kernel.kallsyms]  [k] sg_next

Some of it just shifted, but it's definitely a huge win. This is just
using a single drive, doing about 7GB/sec.

The change looks pretty reasonable to me. I'd love for the test cases to
try and hit corner cases, as it's really more of a functionality test
right now. We should include things like one-off huge pages, ensure we
don't coalesce where we should not, etc.
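
[Editor's note] To make those corner cases concrete, here is a hypothetical user-space model of the eligibility rules such tests would want to probe (made-up names, not the kernel or liburing code): a partial head folio is fine as long as it runs to the folio's end, and every subsequent folio must start at its first page.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the coalescing eligibility rules, for sketching
 * test corner cases; not the kernel code.  Page i of a registration is
 * described by the folio it belongs to and its index within that folio;
 * all folios are assumed folio_pages pages long.
 */
struct mpage {
	int folio;	/* which folio this page belongs to */
	int idx;	/* page index within that folio */
};

static bool can_coalesce(const struct mpage *p, int nr, int folio_pages)
{
	if (nr <= 1 || folio_pages <= 1)
		return false;
	for (int i = 1; i < nr; i++) {
		if (p[i].folio == p[i - 1].folio) {
			/* pages inside a folio must be consecutive */
			if (p[i].idx != p[i - 1].idx + 1)
				return false;
		} else {
			/* a new folio must start at its first page ... */
			if (p[i].idx != 0)
				return false;
			/* ... and the previous one must end fully covered */
			if (p[i - 1].idx != folio_pages - 1)
				return false;
		}
	}
	return true;
}
```

A buffer starting mid-folio and running through to a short tail coalesces; a buffer whose second folio starts at a nonzero page index must not.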

This is obviously too late for the 6.10 merge window, so there's plenty
of time to get this 100% sorted before the next kernel release.

-- 
Jens Axboe



* Re: [PATCH v4 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers
       [not found]       ` <CGME20240530051050epcas5p122f30aebcf99e27a8d02cc1318dbafc8@epcas5p1.samsung.com>
@ 2024-05-30  5:10         ` Chenliang Li
  2024-06-04 13:33           ` Anuj gupta
       [not found]           ` <CGME20240613024932epcas5p2f053609efe7e9fb3d87318a66c2ccf53@epcas5p2.samsung.com>
  0 siblings, 2 replies; 22+ messages in thread
From: Chenliang Li @ 2024-05-30  5:10 UTC (permalink / raw)
  To: axboe
  Cc: anuj1072538, anuj20.g, asml.silence, cliang01.li, gost.dev,
	io-uring, joshi.k, kundan.kumar, peiwei.li

On Thu, 16 May 2024 08:58:03 -0600, Jens Axboe wrote:
> The change looks pretty reasonable to me. I'd love for the test cases to
> try and hit corner cases, as it's really more of a functionality test
> right now. We should include things like one-off huge pages, ensure we
> don't coalesce where we should not, etc.

Hi Jens, the testcases are updated here:
https://lore.kernel.org/io-uring/[email protected]/T/#u
Added several corner cases this time; everything works fine. Please take a look.


* Re: [PATCH v4 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers
  2024-05-30  5:10         ` Chenliang Li
@ 2024-06-04 13:33           ` Anuj gupta
       [not found]           ` <CGME20240613024932epcas5p2f053609efe7e9fb3d87318a66c2ccf53@epcas5p2.samsung.com>
  1 sibling, 0 replies; 22+ messages in thread
From: Anuj gupta @ 2024-06-04 13:33 UTC (permalink / raw)
  To: Chenliang Li
  Cc: axboe, anuj20.g, asml.silence, gost.dev, io-uring, joshi.k,
	kundan.kumar, peiwei.li

On Thu, May 30, 2024 at 10:41 AM Chenliang Li <[email protected]> wrote:
>
> On Thu, 16 May 2024 08:58:03 -0600, Jens Axboe wrote:
> > The change looks pretty reasonable to me. I'd love for the test cases to
> > try and hit corner cases, as it's really more of a functionality test
> > right now. We should include things like one-off huge pages, ensure we
> > don't coalesce where we should not, etc.
>
> Hi Jens, the testcases are updated here:
> https://lore.kernel.org/io-uring/[email protected]/T/#u
> Add several corner cases this time, works fine. Please take a look.

The additional test cases shared here [1] work fine on my setup too.
Tested-by: Anuj Gupta <[email protected]>

[1] https://lore.kernel.org/io-uring/[email protected]/

Thanks,
Anuj Gupta


* Re: [PATCH v4 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers
       [not found]           ` <CGME20240613024932epcas5p2f053609efe7e9fb3d87318a66c2ccf53@epcas5p2.samsung.com>
@ 2024-06-13  2:49             ` Chenliang Li
  2024-06-16  2:54               ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Chenliang Li @ 2024-06-13  2:49 UTC (permalink / raw)
  To: cliang01.li
  Cc: anuj1072538, anuj20.g, asml.silence, axboe, gost.dev, io-uring,
	joshi.k, kundan.kumar, peiwei.li

On Thu, 30 May 2024 13:10:44 +0800, Chenliang Li wrote:
> On Thu, 16 May 2024 08:58:03 -0600, Jens Axboe wrote:
>> The change looks pretty reasonable to me. I'd love for the test cases to
>> try and hit corner cases, as it's really more of a functionality test
>> right now. We should include things like one-off huge pages, ensure we
>> don't coalesce where we should not, etc.
>
> Hi Jens, the testcases are updated here:
> https://lore.kernel.org/io-uring/[email protected]/T/#u
> Add several corner cases this time, works fine. Please take a look.

Hi, a gentle ping here.
The latest liburing testcase: https://lore.kernel.org/io-uring/[email protected]/


* Re: [PATCH v4 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers
  2024-06-13  2:49             ` Chenliang Li
@ 2024-06-16  2:54               ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2024-06-16  2:54 UTC (permalink / raw)
  To: Chenliang Li
  Cc: anuj1072538, anuj20.g, asml.silence, gost.dev, io-uring, joshi.k,
	kundan.kumar, peiwei.li

On 6/12/24 8:49 PM, Chenliang Li wrote:
> On Thu, 30 May 2024 13:10:44 +0800, Chenliang Li wrote:
>> On Thu, 16 May 2024 08:58:03 -0600, Jens Axboe wrote:
>>> The change looks pretty reasonable to me. I'd love for the test cases to
>>> try and hit corner cases, as it's really more of a functionality test
>>> right now. We should include things like one-off huge pages, ensure we
>>> don't coalesce where we should not, etc.
>>
>> Hi Jens, the testcases are updated here:
>> https://lore.kernel.org/io-uring/[email protected]/T/#u
>> Add several corner cases this time, works fine. Please take a look.
> 
> Hi, a gentle ping here.
> The latest liburing testcase: https://lore.kernel.org/io-uring/[email protected]/

I'll take a look on Monday, thanks.

-- 
Jens Axboe




* Re: [PATCH v4 3/4] io_uring/rsrc: add init and account functions for coalesced imus
  2024-05-14  7:54     ` [PATCH v4 3/4] io_uring/rsrc: add init and account functions for coalesced imus Chenliang Li
@ 2024-06-16 17:43       ` Pavel Begunkov
       [not found]         ` <CGME20240617031611epcas5p26e5c5f65a182af069427b1609f01d1d0@epcas5p2.samsung.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2024-06-16 17:43 UTC (permalink / raw)
  To: Chenliang Li, axboe
  Cc: io-uring, peiwei.li, joshi.k, kundan.kumar, anuj20.g, gost.dev

On 5/14/24 08:54, Chenliang Li wrote:
> Introduce two functions to separate the coalesced imu alloc and
> accounting path from the original one. This helps to keep the original
> code path clean.
> 
> Signed-off-by: Chenliang Li <[email protected]>
> ---
>   io_uring/rsrc.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 89 insertions(+)
> 
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 578d382ca9bc..53fac5f27bbf 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -871,6 +871,45 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
>   	return ret;
>   }
>   
> +static int io_coalesced_buffer_account_pin(struct io_ring_ctx *ctx,
> +					   struct page **pages,
> +					   struct io_mapped_ubuf *imu,
> +					   struct page **last_hpage,
> +					   struct io_imu_folio_data *data)
> +{
> +	int i, j, ret;
> +
> +	imu->acct_pages = 0;
> +	j = 0;
> +	for (i = 0; i < data->nr_folios; i++) {
> +		struct page *hpage = pages[j];
> +
> +		if (hpage == *last_hpage)
> +			continue;
> +		*last_hpage = hpage;
> +		/*
> +		 * Already checked the page array in try coalesce,
> +		 * so pass in nr_pages=0 here to waive that.
> +		 */
> +		if (headpage_already_acct(ctx, pages, 0, hpage))
> +			continue;
> +		imu->acct_pages += data->nr_pages_mid;
> +		if (i)
> +			j += data->nr_pages_mid;
> +		else
> +			j = data->nr_pages_head;

You should account an entire folio here, i.e. ->nr_pages_mid
in either case. Let's say the first page in the registration
is the last page of a huge page, you'd account 4K while it
actually pins the entire huge page size.
It seems like you can just call io_buffer_account_pin()
instead.

On that note, you shouldn't duplicate code in either case,
just treat the normal discontig pages case as folios of
shift=PAGE_SHIFT.

Either just plain reuse or adjust io_buffer_account_pin()
instead of io_coalesced_buffer_account_pin().
io_coalesced_imu_alloc() should also go away.

io_sqe_buffer_register() {
	struct io_imu_folio_data data;

	if (!io_sqe_buffer_try_coalesce(pages, folio_data)) {
		folio_data.shift = PAGE_SHIFT;
		...
	}
	
	io_buffer_account_pin(pages, &data);
	imu->data = uaddr;
	...
}
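
[Editor's note] A hypothetical user-space sketch of that unification (made-up names): with shift = PAGE_SHIFT and nr_pages_mid = 1, the same accounting loop serves the plain discontiguous-pages case. The real path additionally dedups folios already accounted via last_hpage / headpage_already_acct(), which this sketch omits.

```c
#include <assert.h>

/*
 * Hypothetical model of the unified accounting: the plain-page case is
 * just "folios" of shift == PAGE_SHIFT, so one loop serves both cases.
 * The real code also skips folios already accounted; omitted here.
 */
struct folio_data {
	unsigned shift;		/* log2 of the folio size */
	int nr_folios;		/* bvec entries after coalescing */
	int nr_pages_mid;	/* PAGE_SIZE pages per folio */
};

static unsigned long account_pages(const struct folio_data *d)
{
	unsigned long acct = 0;

	/* each entry pins a whole folio, even if only partially used */
	for (int i = 0; i < d->nr_folios; i++)
		acct += d->nr_pages_mid;
	return acct;
}
```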

> +	}
> +
> +	if (!imu->acct_pages)
> +		return 0;
> +
> +	ret = io_account_mem(ctx, imu->acct_pages);
> +	if (!ret)
> +		return 0;
> +	imu->acct_pages = 0;
> +	return ret;
> +}
> +
>   static bool __io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
>   					 struct io_imu_folio_data *data)
>   {
> @@ -949,6 +988,56 @@ static bool io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
>   	return true;
>   }
>   
> +static int io_coalesced_imu_alloc(struct io_ring_ctx *ctx, struct iovec *iov,
> +				  struct io_mapped_ubuf **pimu,
> +				  struct page **last_hpage, struct page **pages,
> +				  struct io_imu_folio_data *data)
> +{
> +	struct io_mapped_ubuf *imu = NULL;
> +	unsigned long off;
> +	size_t size, vec_len;
> +	int ret, i, j;
> +
> +	ret = -ENOMEM;
> +	imu = kvmalloc(struct_size(imu, bvec, data->nr_folios), GFP_KERNEL);
> +	if (!imu)
> +		return ret;
> +
> +	ret = io_coalesced_buffer_account_pin(ctx, pages, imu, last_hpage,
> +						data);
> +	if (ret) {
> +		unpin_user_page(pages[0]);
> +		j = data->nr_pages_head;
> +		for (i = 1; i < data->nr_folios; i++) {
> +			unpin_user_page(pages[j]);
> +			j += data->nr_pages_mid;
> +		}
> +		return ret;
> +	}
> +	off = (unsigned long) iov->iov_base & ~PAGE_MASK;
> +	size = iov->iov_len;
> +	/* store original address for later verification */
> +	imu->ubuf = (unsigned long) iov->iov_base;
> +	imu->ubuf_end = imu->ubuf + iov->iov_len;
> +	imu->nr_bvecs = data->nr_folios;
> +	imu->folio_shift = data->folio_shift;
> +	imu->folio_mask = ~((1UL << data->folio_shift) - 1);
> +	*pimu = imu;
> +	ret = 0;
> +
> +	vec_len = min_t(size_t, size, PAGE_SIZE * data->nr_pages_head - off);
> +	bvec_set_page(&imu->bvec[0], pages[0], vec_len, off);
> +	size -= vec_len;
> +	j = data->nr_pages_head;
> +	for (i = 1; i < data->nr_folios; i++) {
> +		vec_len = min_t(size_t, size, data->folio_size);
> +		bvec_set_page(&imu->bvec[i], pages[j], vec_len, 0);
> +		size -= vec_len;
> +		j += data->nr_pages_mid;
> +	}
> +	return ret;
> +}
> +
>   static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
>   				  struct io_mapped_ubuf **pimu,
>   				  struct page **last_hpage)

-- 
Pavel Begunkov


* Re: [PATCH v4 1/4] io_uring/rsrc: add hugepage buffer coalesce helpers
  2024-05-14  7:54     ` [PATCH v4 1/4] io_uring/rsrc: add hugepage buffer coalesce helpers Chenliang Li
  2024-05-16 14:07       ` Anuj gupta
@ 2024-06-16 18:04       ` Pavel Begunkov
       [not found]         ` <CGME20240617031218epcas5p4f706f53094ed8650a2b59b2006120956@epcas5p4.samsung.com>
  1 sibling, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2024-06-16 18:04 UTC (permalink / raw)
  To: Chenliang Li, axboe
  Cc: io-uring, peiwei.li, joshi.k, kundan.kumar, anuj20.g, gost.dev

On 5/14/24 08:54, Chenliang Li wrote:
> Introduce helper functions to check whether a buffer can
> be coalesced or not, and gather folio data for later use.
> 
> The coalescing optimizes time and space consumption caused
> by mapping and storing multi-hugepage fixed buffers.
> 
> A coalescable multi-hugepage buffer should fully cover its folios
> (except potentially the first and last one), and these folios should
> have the same size. These requirements are for easier later process,
> also we need same size'd chunks in io_import_fixed for fast iov_iter
> adjust.
> 
> Signed-off-by: Chenliang Li <[email protected]>
> ---
>   io_uring/rsrc.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++
>   io_uring/rsrc.h | 10 +++++++
>   2 files changed, 88 insertions(+)
> 
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 65417c9553b1..d08224c0c5b0 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -871,6 +871,84 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
>   	return ret;
>   }
>   
> +static bool __io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
> +					 struct io_imu_folio_data *data)

io_can_coalesce_buffer(), you're not actually trying to
do it here.

> +{
> +	struct folio *folio = page_folio(pages[0]);
> +	unsigned int count = 1;
> +	int i;
> +
> +	data->nr_pages_mid = folio_nr_pages(folio);
> +	if (data->nr_pages_mid == 1)
> +		return false;
> +
> +	data->folio_shift = folio_shift(folio);
> +	data->folio_size = folio_size(folio);
> +	data->nr_folios = 1;
> +	/*
> +	 * Check if pages are contiguous inside a folio, and all folios have
> +	 * the same page count except for the head and tail.
> +	 */
> +	for (i = 1; i < nr_pages; i++) {
> +		if (page_folio(pages[i]) == folio &&
> +			pages[i] == pages[i-1] + 1) {
> +			count++;
> +			continue;
> +		}
> +
> +		if (data->nr_folios == 1)
> +			data->nr_pages_head = count;
> +		else if (count != data->nr_pages_mid)
> +			return false;
> +
> +		folio = page_folio(pages[i]);
> +		if (folio_size(folio) != data->folio_size)
> +			return false;
> +
> +		count = 1;
> +		data->nr_folios++;
> +	}
> +	if (data->nr_folios == 1)
> +		data->nr_pages_head = count;
> +
> +	return true;
> +}
> +
> +static bool io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
> +				       struct io_imu_folio_data *data)
> +{
> +	int i, j;
> +
> +	if (nr_pages <= 1 ||
> +		!__io_sqe_buffer_try_coalesce(pages, nr_pages, data))
> +		return false;
> +
> +	/*
> +	 * The pages are bound to the folio, it doesn't
> +	 * actually unpin them but drops all but one reference,
> +	 * which is usually put down by io_buffer_unmap().
> +	 * Note, needs a better helper.
> +	 */
> +	if (data->nr_pages_head > 1)
> +		unpin_user_pages(&pages[1], data->nr_pages_head - 1);

Should be pages[0]. page[1] can be in another folio, and even
though data->nr_pages_head > 1 protects against touching it,
it's still flimsy.

> +
> +	j = data->nr_pages_head;
> +	nr_pages -= data->nr_pages_head;
> +	for (i = 1; i < data->nr_folios; i++) {
> +		unsigned int nr_unpin;
> +
> +		nr_unpin = min_t(unsigned int, nr_pages - 1,
> +					data->nr_pages_mid - 1);
> +		if (nr_unpin == 0)
> +			break;
> +		unpin_user_pages(&pages[j+1], nr_unpin);

same

> +		j += data->nr_pages_mid;

And instead of duplicating this voodoo iteration later,
please just assemble a new compacted ->nr_folios sized
page array.
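
[Editor's note] A hypothetical user-space sketch of that compaction (made-up names; pages modeled as plain ints): build the nr_folios-sized array once, so later steps index it directly instead of re-deriving the head/mid stride.

```c
#include <assert.h>

/*
 * Hypothetical sketch of the suggested compaction: keep only the first
 * pinned page of each folio in an nr_folios-sized array.  The head
 * folio may be partially covered, so its stride can differ from the
 * middle folios' stride.
 */
static void compact_pages(const int *pages, int *out, int nr_folios,
			  int nr_pages_head, int nr_pages_mid)
{
	int j = 0;

	for (int i = 0; i < nr_folios; i++) {
		out[i] = pages[j];
		/* head folio stride differs from the middle folios' */
		j += i ? nr_pages_mid : nr_pages_head;
	}
}
```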

> +		nr_pages -= data->nr_pages_mid;
> +	}
> +
> +	return true;
> +}
> +
>   static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
>   				  struct io_mapped_ubuf **pimu,
>   				  struct page **last_hpage)
> diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
> index c032ca3436ca..b2a9d66b76dd 100644
> --- a/io_uring/rsrc.h
> +++ b/io_uring/rsrc.h
> @@ -50,6 +50,16 @@ struct io_mapped_ubuf {
>   	struct bio_vec	bvec[] __counted_by(nr_bvecs);
>   };
>   
> +struct io_imu_folio_data {
> +	/* Head folio can be partially included in the fixed buf */
> +	unsigned int	nr_pages_head;
> +	/* For non-head/tail folios, has to be fully included */
> +	unsigned int	nr_pages_mid;
> +	unsigned int	nr_folios;
> +	unsigned int	folio_shift;
> +	size_t		folio_size;
> +};
> +
>   void io_rsrc_node_ref_zero(struct io_rsrc_node *node);
>   void io_rsrc_node_destroy(struct io_ring_ctx *ctx, struct io_rsrc_node *ref_node);
>   struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx);

-- 
Pavel Begunkov


* Re: [PATCH v2 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers
       [not found]         ` <CGME20240617031218epcas5p4f706f53094ed8650a2b59b2006120956@epcas5p4.samsung.com>
@ 2024-06-17  3:12           ` Chenliang Li
  2024-06-17 12:38             ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Chenliang Li @ 2024-06-17  3:12 UTC (permalink / raw)
  To: asml.silence
  Cc: anuj20.g, axboe, cliang01.li, gost.dev, io-uring, joshi.k,
	kundan.kumar, peiwei.li

On Sun, 16 Jun 2024 19:04:38 +0100, Pavel Begunkov wrote:
> On 5/14/24 08:54, Chenliang Li wrote:
>> Introduce helper functions to check whether a buffer can
>> be coalesced or not, and gather folio data for later use.
>> 
>> The coalescing optimizes time and space consumption caused
>> by mapping and storing multi-hugepage fixed buffers.
>> 
>> A coalescable multi-hugepage buffer should fully cover its folios
>> (except potentially the first and last one), and these folios should
>> have the same size. These requirements are for easier later process,
>> also we need same size'd chunks in io_import_fixed for fast iov_iter
>> adjust.
>> 
>> Signed-off-by: Chenliang Li <[email protected]>
>> ---
>>   io_uring/rsrc.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   io_uring/rsrc.h | 10 +++++++
>>   2 files changed, 88 insertions(+)
>> 
>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
>> index 65417c9553b1..d08224c0c5b0 100644
>> --- a/io_uring/rsrc.c
>> +++ b/io_uring/rsrc.c
>> @@ -871,6 +871,84 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
>>   	return ret;
>>   }
>>   
>> +static bool __io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
>> +					 struct io_imu_folio_data *data)
> io_can_coalesce_buffer(), you're not actually trying to
> do it here.

Will change it.

>> +static bool io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
>> +				       struct io_imu_folio_data *data)
>> +{
>> +	int i, j;
>> +
>> +	if (nr_pages <= 1 ||
>> +		!__io_sqe_buffer_try_coalesce(pages, nr_pages, data))
>> +		return false;
>> +
>> +	/*
>> +	 * The pages are bound to the folio, it doesn't
>> +	 * actually unpin them but drops all but one reference,
>> +	 * which is usually put down by io_buffer_unmap().
>> +	 * Note, needs a better helper.
>> +	 */
>> +	if (data->nr_pages_head > 1)
>> +		unpin_user_pages(&pages[1], data->nr_pages_head - 1);
> Should be pages[0]. page[1] can be in another folio, and even
> though data->nr_pages_head > 1 protects against touching it,
> it's still flimsy.

But here it is unpinning the tail pages inside those coalesceable folios;
I think we only unpin pages[0] on failure, am I right? And in
__io_sqe_buffer_try_coalesce we have already ensured that pages[1:nr_head_pages]
are in the same folio and contiguous.

>> +
>> +	j = data->nr_pages_head;
>> +	nr_pages -= data->nr_pages_head;
>> +	for (i = 1; i < data->nr_folios; i++) {
>> +		unsigned int nr_unpin;
>> +
>> +		nr_unpin = min_t(unsigned int, nr_pages - 1,
>> +					data->nr_pages_mid - 1);
>> +		if (nr_unpin == 0)
>> +			break;
>> +		unpin_user_pages(&pages[j+1], nr_unpin);
> same
>> +		j += data->nr_pages_mid;
> And instead of duplicating this voodoo iteration later,
> please just assemble a new compacted ->nr_folios sized
> page array.

Indeed, a new page array would make things a lot easier.
If the allocation overhead is not a concern here, then yeah, I'll change it.

Thanks,
Chenliang Li


* Re: [PATCH v2 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers
       [not found]         ` <CGME20240617031611epcas5p26e5c5f65a182af069427b1609f01d1d0@epcas5p2.samsung.com>
@ 2024-06-17  3:16           ` Chenliang Li
  2024-06-17 12:22             ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Chenliang Li @ 2024-06-17  3:16 UTC (permalink / raw)
  To: asml.silence
  Cc: anuj20.g, axboe, cliang01.li, gost.dev, io-uring, joshi.k,
	kundan.kumar, peiwei.li

Actually, it does account an entire folio here; j is just the
array index.

> It seems like you can just call io_buffer_account_pin()
> instead.
>
> On that note, you shouldn't duplicate code in either case,
> just treat the normal discontig pages case as folios of
> shift=PAGE_SHIFT.
>
> Either just plain reuse or adjust io_buffer_account_pin()
> instead of io_coalesced_buffer_account_pin().
> io_coalesced_imu_alloc() should also go away.
>
> io_sqe_buffer_register() {
> 	struct io_imu_folio_data data;
>
>	if (!io_sqe_buffer_try_coalesce(pages, folio_data)) {
>		folio_data.shift = PAGE_SHIFT;
>		...
>	}
>	
>	io_buffer_account_pin(pages, &data);
>	imu->data = uaddr;
>	...
> }

Will remove them.

Thanks,
Chenliang Li


* Re: [PATCH v2 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers
  2024-06-17  3:16           ` [PATCH v2 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers Chenliang Li
@ 2024-06-17 12:22             ` Pavel Begunkov
       [not found]               ` <CGME20240618032433epcas5p258e5fe6863a91a1f6243f3408b3378f9@epcas5p2.samsung.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2024-06-17 12:22 UTC (permalink / raw)
  To: Chenliang Li
  Cc: anuj20.g, axboe, gost.dev, io-uring, joshi.k, kundan.kumar,
	peiwei.li

On 6/17/24 04:16, Chenliang Li wrote:
> Actually here it does account an entire folio. The j is just
> array index.

Hmm, indeed. Is iteration the only difference here?

>> It seems like you can just call io_buffer_account_pin()
>> instead.
>>
>> On that note, you shouldn't duplicate code in either case,
>> just treat the normal discontig pages case as folios of
>> shift=PAGE_SHIFT.
>>
>> Either just plain reuse or adjust io_buffer_account_pin()
>> instead of io_coalesced_buffer_account_pin().
>> io_coalesced_imu_alloc() should also go away.
>>
>> io_sqe_buffer_register() {
>> 	struct io_imu_folio_data data;
>>
>> 	if (!io_sqe_buffer_try_coalesce(pages, folio_data)) {
>> 		folio_data.shift = PAGE_SHIFT;
>> 		...
>> 	}
>> 	
>> 	io_buffer_account_pin(pages, &data);
>> 	imu->data = uaddr;
>> 	...
>> }
> 
> Will remove them.

-- 
Pavel Begunkov


* Re: [PATCH v2 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers
  2024-06-17  3:12           ` [PATCH v2 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers Chenliang Li
@ 2024-06-17 12:38             ` Pavel Begunkov
       [not found]               ` <CGME20240618031115epcas5p25e2275b5e73f974f13aa5ba060979973@epcas5p2.samsung.com>
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2024-06-17 12:38 UTC (permalink / raw)
  To: Chenliang Li
  Cc: anuj20.g, axboe, gost.dev, io-uring, joshi.k, kundan.kumar,
	peiwei.li

On 6/17/24 04:12, Chenliang Li wrote:
> On Sun, 16 Jun 2024 19:04:38 +0100, Pavel Begunkov wrote:
>> On 5/14/24 08:54, Chenliang Li wrote:
>>> Introduce helper functions to check whether a buffer can
>>> be coalesced or not, and gather folio data for later use.
>>>
>>> The coalescing optimizes time and space consumption caused
>>> by mapping and storing multi-hugepage fixed buffers.
>>>
>>> A coalescable multi-hugepage buffer should fully cover its folios
>>> (except potentially the first and last one), and these folios should
>>> have the same size. These requirements make later processing easier,
>>> and we need same-sized chunks in io_import_fixed for fast iov_iter
>>> adjustment.
>>>
>>> Signed-off-by: Chenliang Li <[email protected]>
>>> ---
>>>    io_uring/rsrc.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>    io_uring/rsrc.h | 10 +++++++
>>>    2 files changed, 88 insertions(+)
>>>
>>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
>>> index 65417c9553b1..d08224c0c5b0 100644
>>> --- a/io_uring/rsrc.c
>>> +++ b/io_uring/rsrc.c
>>> @@ -871,6 +871,84 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
>>>    	return ret;
>>>    }
>>>    
>>> +static bool __io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
>>> +					 struct io_imu_folio_data *data)
>> io_can_coalesce_buffer(), you're not actually trying to
>> do it here.
> 
> Will change it.
> 
>>> +static bool io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
>>> +				       struct io_imu_folio_data *data)
>>> +{
>>> +	int i, j;
>>> +
>>> +	if (nr_pages <= 1 ||
>>> +		!__io_sqe_buffer_try_coalesce(pages, nr_pages, data))
>>> +		return false;
>>> +
>>> +	/*
>>> +	 * The pages are bound to the folio, it doesn't
>>> +	 * actually unpin them but drops all but one reference,
>>> +	 * which is usually put down by io_buffer_unmap().
>>> +	 * Note, needs a better helper.
>>> +	 */
>>> +	if (data->nr_pages_head > 1)
>>> +		unpin_user_pages(&pages[1], data->nr_pages_head - 1);
>> Should be pages[0]. page[1] can be in another folio, and even
>> though data->nr_pages_head > 1 protects against touching it,
>> it's still flimsy.
> 
> But here it is unpinning the tail pages inside those coalesceable folios;
> I think we only unpin pages[0] on failure, am I right? And in
> __io_sqe_buffer_try_coalesce we have ensured that pages[1:nr_head_pages] are
> in the same folio and contiguous.

We want the entire folio to still be pinned, but don't want to
leave just one reference and not care down the line how many
refcounts / etc. you have to put down.

void unpin_user_page(struct page *page)
{
	sanity_check_pinned_pages(&page, 1);
	gup_put_folio(page_folio(page), 1, FOLL_PIN);
}

And all that goes to the folio as a single object, so it doesn't
really matter which page you pass. Anyway, let's leave it
as is then; I wish there were an unpin_folio_nr(), but there
is unpin_user_page_range_dirty_lock(), which resembles it.

>>> +
>>> +	j = data->nr_pages_head;
>>> +	nr_pages -= data->nr_pages_head;
>>> +	for (i = 1; i < data->nr_folios; i++) {
>>> +		unsigned int nr_unpin;
>>> +
>>> +		nr_unpin = min_t(unsigned int, nr_pages - 1,
>>> +					data->nr_pages_mid - 1);
>>> +		if (nr_unpin == 0)
>>> +			break;
>>> +		unpin_user_pages(&pages[j+1], nr_unpin);
>> same
>>> +		j += data->nr_pages_mid;
>> And instead of duplicating this voodoo iteration later,
>> please just assemble a new compacted ->nr_folios sized
>> page array.
> 
> Indeed, a new page array would make things a lot easier.
> If alloc overhead is not a concern here, then yeah I'll change it.

It's not, and the upside is reducing memory footprint,
which would be noticeable with huge pages. It's also
kvmalloc'ed, so compacting also improves the TLB situation.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 3/4] io_uring/rsrc: add init and account functions for coalesced imus
       [not found]               ` <CGME20240618031115epcas5p25e2275b5e73f974f13aa5ba060979973@epcas5p2.samsung.com>
@ 2024-06-18  3:11                 ` Chenliang Li
  0 siblings, 0 replies; 22+ messages in thread
From: Chenliang Li @ 2024-06-18  3:11 UTC (permalink / raw)
  To: asml.silence
  Cc: anuj20.g, axboe, cliang01.li, gost.dev, io-uring, joshi.k,
	kundan.kumar, peiwei.li

On Sun, 16 Jun 2024 18:43:13 +0100, Pavel Begunkov wrote:
> On 6/17/24 04:12, Chenliang Li wrote:
>> On Sun, 16 Jun 2024 19:04:38 +0100, Pavel Begunkov wrote:
>>> On 5/14/24 08:54, Chenliang Li wrote:
>>>> Introduce helper functions to check whether a buffer can
>>>> be coalesced or not, and gather folio data for later use.
>>>>
>>>> The coalescing optimizes time and space consumption caused
>>>> by mapping and storing multi-hugepage fixed buffers.
>>>>
>>>> A coalescable multi-hugepage buffer should fully cover its folios
>>>> (except potentially the first and last one), and these folios should
>>>> have the same size. These requirements make later processing easier,
>>>> and we need same-sized chunks in io_import_fixed for fast iov_iter
>>>> adjustment.
>>>>
>>>> Signed-off-by: Chenliang Li <[email protected]>
>>>> ---
>>>>    io_uring/rsrc.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>    io_uring/rsrc.h | 10 +++++++
>>>>    2 files changed, 88 insertions(+)
>>>>
>>>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
>>>> index 65417c9553b1..d08224c0c5b0 100644
>>>> --- a/io_uring/rsrc.c
>>>> +++ b/io_uring/rsrc.c
>>>> @@ -871,6 +871,84 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
>>>>    	return ret;
>>>>    }
>>>>    
>>>> +static bool __io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
>>>> +					 struct io_imu_folio_data *data)
>>> io_can_coalesce_buffer(), you're not actually trying to
>>> do it here.
>> 
>> Will change it.
>> 
>>>> +static bool io_sqe_buffer_try_coalesce(struct page **pages, int nr_pages,
>>>> +				       struct io_imu_folio_data *data)
>>>> +{
>>>> +	int i, j;
>>>> +
>>>> +	if (nr_pages <= 1 ||
>>>> +		!__io_sqe_buffer_try_coalesce(pages, nr_pages, data))
>>>> +		return false;
>>>> +
>>>> +	/*
>>>> +	 * The pages are bound to the folio, it doesn't
>>>> +	 * actually unpin them but drops all but one reference,
>>>> +	 * which is usually put down by io_buffer_unmap().
>>>> +	 * Note, needs a better helper.
>>>> +	 */
>>>> +	if (data->nr_pages_head > 1)
>>>> +		unpin_user_pages(&pages[1], data->nr_pages_head - 1);
>>> Should be pages[0]. page[1] can be in another folio, and even
>>> though data->nr_pages_head > 1 protects against touching it,
>>> it's still flimsy.
>> 
>> But here it is unpinning the tail pages inside those coalesceable folios;
>> I think we only unpin pages[0] on failure, am I right? And in
>> __io_sqe_buffer_try_coalesce we have ensured that pages[1:nr_head_pages] are
>> in the same folio and contiguous.
>
> We want the entire folio to still be pinned, but don't want to
> leave just one reference and not care down the line how many
> refcounts / etc. you have to put down.
> 
> void unpin_user_page(struct page *page)
> {
> 	sanity_check_pinned_pages(&page, 1);
> 	gup_put_folio(page_folio(page), 1, FOLL_PIN);
> }
>
> And all that goes to the folio as a single object, so it doesn't
> really matter which page you pass. Anyway, let's leave it
> as is then; I wish there were an unpin_folio_nr(), but there
> is unpin_user_page_range_dirty_lock(), which resembles it.

I see. Thanks for the explanation.

>>>> +
>>>> +	j = data->nr_pages_head;
>>>> +	nr_pages -= data->nr_pages_head;
>>>> +	for (i = 1; i < data->nr_folios; i++) {
>>>> +		unsigned int nr_unpin;
>>>> +
>>>> +		nr_unpin = min_t(unsigned int, nr_pages - 1,
>>>> +					data->nr_pages_mid - 1);
>>>> +		if (nr_unpin == 0)
>>>> +			break;
>>>> +		unpin_user_pages(&pages[j+1], nr_unpin);
>>> same
>>>> +		j += data->nr_pages_mid;
>>> And instead of duplicating this voodoo iteration later,
>>> please just assemble a new compacted ->nr_folios sized
>>> page array.
>> 
>> Indeed, a new page array would make things a lot easier.
>> If alloc overhead is not a concern here, then yeah I'll change it.
>
> It's not, and the upside is reducing memory footprint,
> which would be noticeable with huge pages. It's also
> kvmalloc'ed, so compacting also improves the TLB situation.

OK, will use a new page array.

Thanks,
Chenliang Li

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 3/4] io_uring/rsrc: add init and account functions for coalesced imus
       [not found]               ` <CGME20240618032433epcas5p258e5fe6863a91a1f6243f3408b3378f9@epcas5p2.samsung.com>
@ 2024-06-18  3:24                 ` Chenliang Li
  0 siblings, 0 replies; 22+ messages in thread
From: Chenliang Li @ 2024-06-18  3:24 UTC (permalink / raw)
  To: asml.silence
  Cc: anuj20.g, axboe, cliang01.li, gost.dev, io-uring, joshi.k,
	kundan.kumar, peiwei.li

On Mon, 17 Jun 2024 13:22:43 +0100, Pavel Begunkov wrote:
> On 6/17/24 04:16, Chenliang Li wrote:
>> Actually here it does account an entire folio. The j is just
>> an array index.
>
> Hmm, indeed. Is iteration the only difference here?

Yes, but I'll try to make this less confusing.

Thanks,
Chenliang Li

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2024-06-18  3:39 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20240514075453epcas5p17974fb62d65a88b1a1b55b97942ee2be@epcas5p1.samsung.com>
2024-05-14  7:54 ` [PATCH v4 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers Chenliang Li
     [not found]   ` <CGME20240514075457epcas5p10f02f1746f957df91353724ec859664f@epcas5p1.samsung.com>
2024-05-14  7:54     ` [PATCH v4 1/4] io_uring/rsrc: add hugepage buffer coalesce helpers Chenliang Li
2024-05-16 14:07       ` Anuj gupta
2024-06-16 18:04       ` Pavel Begunkov
     [not found]         ` <CGME20240617031218epcas5p4f706f53094ed8650a2b59b2006120956@epcas5p4.samsung.com>
2024-06-17  3:12           ` [PATCH v2 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers Chenliang Li
2024-06-17 12:38             ` Pavel Begunkov
     [not found]               ` <CGME20240618031115epcas5p25e2275b5e73f974f13aa5ba060979973@epcas5p2.samsung.com>
2024-06-18  3:11                 ` [PATCH v4 3/4] io_uring/rsrc: add init and account functions for coalesced imus Chenliang Li
     [not found]   ` <CGME20240514075459epcas5p2275b4c26f16bcfcea200e97fc75c2a14@epcas5p2.samsung.com>
2024-05-14  7:54     ` [PATCH v4 2/4] io_uring/rsrc: store folio shift and mask into imu Chenliang Li
2024-05-16 14:08       ` Anuj gupta
     [not found]   ` <CGME20240514075500epcas5p1e638b1ae84727b3669ff6b780cd1cb23@epcas5p1.samsung.com>
2024-05-14  7:54     ` [PATCH v4 3/4] io_uring/rsrc: add init and account functions for coalesced imus Chenliang Li
2024-06-16 17:43       ` Pavel Begunkov
     [not found]         ` <CGME20240617031611epcas5p26e5c5f65a182af069427b1609f01d1d0@epcas5p2.samsung.com>
2024-06-17  3:16           ` [PATCH v2 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers Chenliang Li
2024-06-17 12:22             ` Pavel Begunkov
     [not found]               ` <CGME20240618032433epcas5p258e5fe6863a91a1f6243f3408b3378f9@epcas5p2.samsung.com>
2024-06-18  3:24                 ` [PATCH v4 3/4] io_uring/rsrc: add init and account functions for coalesced imus Chenliang Li
     [not found]   ` <CGME20240514075502epcas5p10be6bef71d284a110277575d6008563d@epcas5p1.samsung.com>
2024-05-14  7:54     ` [PATCH v4 4/4] io_uring/rsrc: enable multi-hugepage buffer coalescing Chenliang Li
2024-05-16 14:09       ` Anuj gupta
2024-05-16 14:01   ` [PATCH v4 0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers Anuj gupta
2024-05-16 14:58     ` Jens Axboe
     [not found]       ` <CGME20240530051050epcas5p122f30aebcf99e27a8d02cc1318dbafc8@epcas5p1.samsung.com>
2024-05-30  5:10         ` Chenliang Li
2024-06-04 13:33           ` Anuj gupta
     [not found]           ` <CGME20240613024932epcas5p2f053609efe7e9fb3d87318a66c2ccf53@epcas5p2.samsung.com>
2024-06-13  2:49             ` Chenliang Li
2024-06-16  2:54               ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox