* [PATCH 0/4] io_import_fixed cleanups and optimisation @ 2025-04-17 9:32 Pavel Begunkov 2025-04-17 9:32 ` [PATCH 1/4] io_uring/rsrc: don't skip offset calculation Pavel Begunkov ` (5 more replies) 0 siblings, 6 replies; 15+ messages in thread From: Pavel Begunkov @ 2025-04-17 9:32 UTC (permalink / raw) To: io-uring; +Cc: asml.silence, Nitesh Shetty io_import_fixed() cleanups topped with the nr segs optimisation patch from Nitesh. Doesn't have the kbuf part, assuming Jens will add it on top. Based on io_uring-6.15 Nitesh Shetty (1): io_uring/rsrc: send exact nr_segs for fixed buffer Pavel Begunkov (3): io_uring/rsrc: don't skip offset calculation io_uring/rsrc: separate kbuf offset adjustments io_uring/rsrc: refactor io_import_fixed io_uring/rsrc.c | 74 ++++++++++++++++++++----------------------------- 1 file changed, 30 insertions(+), 44 deletions(-) -- 2.48.1 ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 1/4] io_uring/rsrc: don't skip offset calculation 2025-04-17 9:32 [PATCH 0/4] io_import_fixed cleanups and optimisation Pavel Begunkov @ 2025-04-17 9:32 ` Pavel Begunkov 2025-04-17 9:32 ` [PATCH 2/4] io_uring/rsrc: separate kbuf offset adjustments Pavel Begunkov ` (4 subsequent siblings) 5 siblings, 0 replies; 15+ messages in thread From: Pavel Begunkov @ 2025-04-17 9:32 UTC (permalink / raw) To: io-uring; +Cc: asml.silence, Nitesh Shetty Don't optimise for requests with offset=0. Large registered buffers are the preference and hence the user is likely to pass an offset, and the adjustments are not expensive and will be made even cheaper in following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> --- io_uring/rsrc.c | 75 ++++++++++++++++++++++++------------------------- 1 file changed, 37 insertions(+), 38 deletions(-) diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c index b36c8825550e..4d62897d1c89 100644 --- a/io_uring/rsrc.c +++ b/io_uring/rsrc.c @@ -1036,6 +1036,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, struct io_mapped_ubuf *imu, u64 buf_addr, size_t len) { + const struct bio_vec *bvec; size_t offset; int ret; @@ -1054,47 +1055,45 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, offset = buf_addr - imu->ubuf; iov_iter_bvec(iter, ddir, imu->bvec, imu->nr_bvecs, offset + len); - if (offset) { - /* - * Don't use iov_iter_advance() here, as it's really slow for - * using the latter parts of a big fixed buffer - it iterates - * over each segment manually. We can cheat a bit here for user - * registered nodes, because we know that: - * - * 1) it's a BVEC iter, we set it up - * 2) all bvecs are the same in size, except potentially the - * first and last bvec - * - * So just find our index, and adjust the iterator afterwards. - * If the offset is within the first bvec (or the whole first - * bvec, just use iov_iter_advance(). 
This makes it easier - * since we can just skip the first segment, which may not - * be folio_size aligned. - */ - const struct bio_vec *bvec = imu->bvec; + /* + * Don't use iov_iter_advance() here, as it's really slow for + * using the latter parts of a big fixed buffer - it iterates + * over each segment manually. We can cheat a bit here for user + * registered nodes, because we know that: + * + * 1) it's a BVEC iter, we set it up + * 2) all bvecs are the same in size, except potentially the + * first and last bvec + * + * So just find our index, and adjust the iterator afterwards. + * If the offset is within the first bvec (or the whole first + * bvec, just use iov_iter_advance(). This makes it easier + * since we can just skip the first segment, which may not + * be folio_size aligned. + */ + bvec = imu->bvec; - /* - * Kernel buffer bvecs, on the other hand, don't necessarily - * have the size property of user registered ones, so we have - * to use the slow iter advance. - */ - if (offset < bvec->bv_len) { - iter->count -= offset; - iter->iov_offset = offset; - } else if (imu->is_kbuf) { - iov_iter_advance(iter, offset); - } else { - unsigned long seg_skip; + /* + * Kernel buffer bvecs, on the other hand, don't necessarily + * have the size property of user registered ones, so we have + * to use the slow iter advance. 
+ */ + if (offset < bvec->bv_len) { + iter->count -= offset; + iter->iov_offset = offset; + } else if (imu->is_kbuf) { + iov_iter_advance(iter, offset); + } else { + unsigned long seg_skip; - /* skip first vec */ - offset -= bvec->bv_len; - seg_skip = 1 + (offset >> imu->folio_shift); + /* skip first vec */ + offset -= bvec->bv_len; + seg_skip = 1 + (offset >> imu->folio_shift); - iter->bvec += seg_skip; - iter->nr_segs -= seg_skip; - iter->count -= bvec->bv_len + offset; - iter->iov_offset = offset & ((1UL << imu->folio_shift) - 1); - } + iter->bvec += seg_skip; + iter->nr_segs -= seg_skip; + iter->count -= bvec->bv_len + offset; + iter->iov_offset = offset & ((1UL << imu->folio_shift) - 1); } return 0; -- 2.48.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 2/4] io_uring/rsrc: separate kbuf offset adjustments 2025-04-17 9:32 [PATCH 0/4] io_import_fixed cleanups and optimisation Pavel Begunkov 2025-04-17 9:32 ` [PATCH 1/4] io_uring/rsrc: don't skip offset calculation Pavel Begunkov @ 2025-04-17 9:32 ` Pavel Begunkov 2025-04-17 9:32 ` [PATCH 3/4] io_uring/rsrc: refactor io_import_fixed Pavel Begunkov ` (3 subsequent siblings) 5 siblings, 0 replies; 15+ messages in thread From: Pavel Begunkov @ 2025-04-17 9:32 UTC (permalink / raw) To: io-uring; +Cc: asml.silence, Nitesh Shetty Kernel registered buffers are special because segments are not uniform in size, and we have a bunch of optimisations based on that uniformity for normal buffers. Handle kbuf separately, it'll be cleaner this way. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> --- io_uring/rsrc.c | 19 +++++++------------ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c index 4d62897d1c89..fddde8ffe81e 100644 --- a/io_uring/rsrc.c +++ b/io_uring/rsrc.c @@ -1048,11 +1048,14 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, if (!(imu->dir & (1 << ddir))) return -EFAULT; - /* - * Might not be a start of buffer, set size appropriately - * and advance us to the beginning. - */ offset = buf_addr - imu->ubuf; + + if (imu->is_kbuf) { + iov_iter_bvec(iter, ddir, imu->bvec, imu->nr_bvecs, offset + len); + iov_iter_advance(iter, offset); + return 0; + } + iov_iter_bvec(iter, ddir, imu->bvec, imu->nr_bvecs, offset + len); /* @@ -1072,17 +1075,9 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, * be folio_size aligned. */ bvec = imu->bvec; - - /* - * Kernel buffer bvecs, on the other hand, don't necessarily - * have the size property of user registered ones, so we have - * to use the slow iter advance. 
- */ if (offset < bvec->bv_len) { iter->count -= offset; iter->iov_offset = offset; - } else if (imu->is_kbuf) { - iov_iter_advance(iter, offset); } else { unsigned long seg_skip; -- 2.48.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 3/4] io_uring/rsrc: refactor io_import_fixed 2025-04-17 9:32 [PATCH 0/4] io_import_fixed cleanups and optimisation Pavel Begunkov 2025-04-17 9:32 ` [PATCH 1/4] io_uring/rsrc: don't skip offset calculation Pavel Begunkov 2025-04-17 9:32 ` [PATCH 2/4] io_uring/rsrc: separate kbuf offset adjustments Pavel Begunkov @ 2025-04-17 9:32 ` Pavel Begunkov 2025-04-17 9:32 ` [PATCH 4/4] io_uring/rsrc: send exact nr_segs for fixed buffer Pavel Begunkov ` (2 subsequent siblings) 5 siblings, 0 replies; 15+ messages in thread From: Pavel Begunkov @ 2025-04-17 9:32 UTC (permalink / raw) To: io-uring; +Cc: asml.silence, Nitesh Shetty io_import_fixed is a mess. Even though we know the final len of the iterator, we still assign offset + len and do some magic after to correct for that. Do offset calculation first and finalise it with iov_iter_bvec at the end. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> --- io_uring/rsrc.c | 24 +++++++----------------- 1 file changed, 7 insertions(+), 17 deletions(-) diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c index fddde8ffe81e..5cf854318b1d 100644 --- a/io_uring/rsrc.c +++ b/io_uring/rsrc.c @@ -1037,6 +1037,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, u64 buf_addr, size_t len) { const struct bio_vec *bvec; + unsigned nr_segs; size_t offset; int ret; @@ -1056,8 +1057,6 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, return 0; } - iov_iter_bvec(iter, ddir, imu->bvec, imu->nr_bvecs, offset + len); - /* * Don't use iov_iter_advance() here, as it's really slow for * using the latter parts of a big fixed buffer - it iterates @@ -1067,30 +1066,21 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, * 1) it's a BVEC iter, we set it up * 2) all bvecs are the same in size, except potentially the * first and last bvec - * - * So just find our index, and adjust the iterator afterwards. - * If the offset is within the first bvec (or the whole first - * bvec, just use iov_iter_advance(). 
This makes it easier - * since we can just skip the first segment, which may not - * be folio_size aligned. */ bvec = imu->bvec; - if (offset < bvec->bv_len) { - iter->count -= offset; - iter->iov_offset = offset; - } else { + if (offset >= bvec->bv_len) { unsigned long seg_skip; /* skip first vec */ offset -= bvec->bv_len; seg_skip = 1 + (offset >> imu->folio_shift); - - iter->bvec += seg_skip; - iter->nr_segs -= seg_skip; - iter->count -= bvec->bv_len + offset; - iter->iov_offset = offset & ((1UL << imu->folio_shift) - 1); + bvec += seg_skip; + offset &= (1UL << imu->folio_shift) - 1; } + nr_segs = imu->nr_bvecs - (bvec - imu->bvec); + iov_iter_bvec(iter, ddir, bvec, nr_segs, len); + iter->iov_offset = offset; return 0; } -- 2.48.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 4/4] io_uring/rsrc: send exact nr_segs for fixed buffer 2025-04-17 9:32 [PATCH 0/4] io_import_fixed cleanups and optimisation Pavel Begunkov ` (2 preceding siblings ...) 2025-04-17 9:32 ` [PATCH 3/4] io_uring/rsrc: refactor io_import_fixed Pavel Begunkov @ 2025-04-17 9:32 ` Pavel Begunkov 2025-04-17 9:34 ` Pavel Begunkov 2025-04-17 12:31 ` [PATCH 0/4] io_import_fixed cleanups and optimisation Jens Axboe [not found] ` <CGME20250417142334epcas5p36df55874d21c896115d92e505f9793fd@epcas5p3.samsung.com> 5 siblings, 1 reply; 15+ messages in thread From: Pavel Begunkov @ 2025-04-17 9:32 UTC (permalink / raw) To: io-uring; +Cc: asml.silence, Nitesh Shetty From: Nitesh Shetty <nj.shetty@samsung.com> Sending exact nr_segs, avoids bio split check and processing in block layer, which takes around 5%[1] of overall CPU utilization. In our setup, we see overall improvement of IOPS from 7.15M to 7.65M [2] and 5% less CPU utilization. [1] 3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at 1.42% io_uring [kernel.kallsyms] [k] bio_split_rw 0.62% io_uring [kernel.kallsyms] [k] bio_submit_split [2] sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1 Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> [Pavel: fixed for kbuf, rebased and reworked on top of cleanups] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> --- io_uring/rsrc.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c index 5cf854318b1d..4099b8225670 100644 --- a/io_uring/rsrc.c +++ b/io_uring/rsrc.c @@ -1037,6 +1037,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, u64 buf_addr, size_t len) { const struct bio_vec *bvec; + size_t folio_mask; unsigned nr_segs; size_t offset; int ret; @@ -1067,6 +1068,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, * 2) all bvecs are the same in size, except potentially the * first and last bvec */ + folio_mask = (1UL << imu->folio_shift) - 
1; bvec = imu->bvec; if (offset >= bvec->bv_len) { unsigned long seg_skip; @@ -1075,10 +1077,10 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, offset -= bvec->bv_len; seg_skip = 1 + (offset >> imu->folio_shift); bvec += seg_skip; - offset &= (1UL << imu->folio_shift) - 1; + offset &= folio_mask; } - nr_segs = imu->nr_bvecs - (bvec - imu->bvec); + nr_segs = (offset + len + folio_mask) >> imu->folio_shift; iov_iter_bvec(iter, ddir, bvec, nr_segs, len); iter->iov_offset = offset; return 0; -- 2.48.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 4/4] io_uring/rsrc: send exact nr_segs for fixed buffer 2025-04-17 9:32 ` [PATCH 4/4] io_uring/rsrc: send exact nr_segs for fixed buffer Pavel Begunkov @ 2025-04-17 9:34 ` Pavel Begunkov [not found] ` <CGME20250417103133epcas5p32c1e004e7f8a5135c4c7e3662b087470@epcas5p3.samsung.com> 0 siblings, 1 reply; 15+ messages in thread From: Pavel Begunkov @ 2025-04-17 9:34 UTC (permalink / raw) To: io-uring, Nitesh Shetty On 4/17/25 10:32, Pavel Begunkov wrote: > From: Nitesh Shetty <nj.shetty@samsung.com> ... > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c > index 5cf854318b1d..4099b8225670 100644 > --- a/io_uring/rsrc.c > +++ b/io_uring/rsrc.c > @@ -1037,6 +1037,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, > u64 buf_addr, size_t len) > { > const struct bio_vec *bvec; > + size_t folio_mask; > unsigned nr_segs; > size_t offset; > int ret; > @@ -1067,6 +1068,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, > * 2) all bvecs are the same in size, except potentially the > * first and last bvec > */ > + folio_mask = (1UL << imu->folio_shift) - 1; > bvec = imu->bvec; > if (offset >= bvec->bv_len) { > unsigned long seg_skip; > @@ -1075,10 +1077,10 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, > offset -= bvec->bv_len; > seg_skip = 1 + (offset >> imu->folio_shift); > bvec += seg_skip; > - offset &= (1UL << imu->folio_shift) - 1; > + offset &= folio_mask; > } > > - nr_segs = imu->nr_bvecs - (bvec - imu->bvec); > + nr_segs = (offset + len + folio_mask) >> imu->folio_shift; Nitesh, let me know if you're happy with this version. > iov_iter_bvec(iter, ddir, bvec, nr_segs, len); > iter->iov_offset = offset; > return 0; -- Pavel Begunkov ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 4/4] io_uring/rsrc: send exact nr_segs for fixed buffer [not found] ` <CGME20250417103133epcas5p32c1e004e7f8a5135c4c7e3662b087470@epcas5p3.samsung.com> @ 2025-04-17 10:23 ` Nitesh Shetty 2025-04-17 11:50 ` Nitesh Shetty 0 siblings, 1 reply; 15+ messages in thread From: Nitesh Shetty @ 2025-04-17 10:23 UTC (permalink / raw) To: Pavel Begunkov; +Cc: io-uring [-- Attachment #1: Type: text/plain, Size: 1408 bytes --] On 17/04/25 10:34AM, Pavel Begunkov wrote: >On 4/17/25 10:32, Pavel Begunkov wrote: >>From: Nitesh Shetty <nj.shetty@samsung.com> >... >>diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c >>index 5cf854318b1d..4099b8225670 100644 >>--- a/io_uring/rsrc.c >>+++ b/io_uring/rsrc.c >>@@ -1037,6 +1037,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, >> u64 buf_addr, size_t len) >> { >> const struct bio_vec *bvec; >>+ size_t folio_mask; >> unsigned nr_segs; >> size_t offset; >> int ret; >>@@ -1067,6 +1068,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, >> * 2) all bvecs are the same in size, except potentially the >> * first and last bvec >> */ >>+ folio_mask = (1UL << imu->folio_shift) - 1; >> bvec = imu->bvec; >> if (offset >= bvec->bv_len) { >> unsigned long seg_skip; >>@@ -1075,10 +1077,10 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, >> offset -= bvec->bv_len; >> seg_skip = 1 + (offset >> imu->folio_shift); >> bvec += seg_skip; >>- offset &= (1UL << imu->folio_shift) - 1; >>+ offset &= folio_mask; >> } >>- nr_segs = imu->nr_bvecs - (bvec - imu->bvec); >>+ nr_segs = (offset + len + folio_mask) >> imu->folio_shift; > >Nitesh, let me know if you're happy with this version. > This looks great to me, I tested this series and see the improvement in IOPS from 7.15 to 7.65M here. Thanks, Nitesh Shetty [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 4/4] io_uring/rsrc: send exact nr_segs for fixed buffer 2025-04-17 10:23 ` Nitesh Shetty @ 2025-04-17 11:50 ` Nitesh Shetty 2025-04-17 12:56 ` Pavel Begunkov 0 siblings, 1 reply; 15+ messages in thread From: Nitesh Shetty @ 2025-04-17 11:50 UTC (permalink / raw) To: Pavel Begunkov; +Cc: io-uring [-- Attachment #1: Type: text/plain, Size: 2224 bytes --] On 17/04/25 03:53PM, Nitesh Shetty wrote: >On 17/04/25 10:34AM, Pavel Begunkov wrote: >>On 4/17/25 10:32, Pavel Begunkov wrote: >>>From: Nitesh Shetty <nj.shetty@samsung.com> >>... >>>diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c >>>index 5cf854318b1d..4099b8225670 100644 >>>--- a/io_uring/rsrc.c >>>+++ b/io_uring/rsrc.c >>>@@ -1037,6 +1037,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, >>> u64 buf_addr, size_t len) >>> { >>> const struct bio_vec *bvec; >>>+ size_t folio_mask; >>> unsigned nr_segs; >>> size_t offset; >>> int ret; >>>@@ -1067,6 +1068,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, >>> * 2) all bvecs are the same in size, except potentially the >>> * first and last bvec >>> */ >>>+ folio_mask = (1UL << imu->folio_shift) - 1; >>> bvec = imu->bvec; >>> if (offset >= bvec->bv_len) { >>> unsigned long seg_skip; >>>@@ -1075,10 +1077,10 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, >>> offset -= bvec->bv_len; >>> seg_skip = 1 + (offset >> imu->folio_shift); >>> bvec += seg_skip; >>>- offset &= (1UL << imu->folio_shift) - 1; >>>+ offset &= folio_mask; >>> } >>>- nr_segs = imu->nr_bvecs - (bvec - imu->bvec); >>>+ nr_segs = (offset + len + folio_mask) >> imu->folio_shift; >> >>Nitesh, let me know if you're happy with this version. >> >This looks great to me, I tested this series and see the >improvement in IOPS from 7.15 to 7.65M here. > There is corner case where this might not work, This happens when there is a first bvec has non zero offset. 
Let's say bv_offset = 256, len = 512, iov_offset = 3584 (512*7, i.e. the 8th IO); here we expect the IO to have 2 segments with the present codebase, but this patch set produces 1 segment. So having a fix like this solves the issue, + nr_segs = (offset + len + bvec->bv_offset + folio_mask) >> imu->folio_shift; Note: I am investigating whether this is a valid case or not, because having a 512 byte IO with 256 byte alignment feels odd. So I have sent one patch for that as well[1]. If that patch[1] is upstreamed then the above case is taken care of, so we can use this series as it is. Thanks, Nitesh [1] https://lore.kernel.org/all/20250415181419.16305-1-nj.shetty@samsung.com/ ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 4/4] io_uring/rsrc: send exact nr_segs for fixed buffer 2025-04-17 11:50 ` Nitesh Shetty @ 2025-04-17 12:56 ` Pavel Begunkov 2025-04-17 13:41 ` Jens Axboe 0 siblings, 1 reply; 15+ messages in thread From: Pavel Begunkov @ 2025-04-17 12:56 UTC (permalink / raw) To: Nitesh Shetty; +Cc: io-uring On 4/17/25 12:50, Nitesh Shetty wrote: > On 17/04/25 03:53PM, Nitesh Shetty wrote: >> On 17/04/25 10:34AM, Pavel Begunkov wrote: >>> On 4/17/25 10:32, Pavel Begunkov wrote: >>>> From: Nitesh Shetty <nj.shetty@samsung.com> >>> ... >>>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c >>>> index 5cf854318b1d..4099b8225670 100644 >>>> --- a/io_uring/rsrc.c >>>> +++ b/io_uring/rsrc.c >>>> @@ -1037,6 +1037,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, >>>> u64 buf_addr, size_t len) >>>> { >>>> const struct bio_vec *bvec; >>>> + size_t folio_mask; >>>> unsigned nr_segs; >>>> size_t offset; >>>> int ret; >>>> @@ -1067,6 +1068,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, >>>> * 2) all bvecs are the same in size, except potentially the >>>> * first and last bvec >>>> */ >>>> + folio_mask = (1UL << imu->folio_shift) - 1; >>>> bvec = imu->bvec; >>>> if (offset >= bvec->bv_len) { >>>> unsigned long seg_skip; >>>> @@ -1075,10 +1077,10 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, >>>> offset -= bvec->bv_len; >>>> seg_skip = 1 + (offset >> imu->folio_shift); >>>> bvec += seg_skip; >>>> - offset &= (1UL << imu->folio_shift) - 1; >>>> + offset &= folio_mask; >>>> } >>>> - nr_segs = imu->nr_bvecs - (bvec - imu->bvec); >>>> + nr_segs = (offset + len + folio_mask) >> imu->folio_shift; >>> >>> Nitesh, let me know if you're happy with this version. >>> >> This looks great to me, I tested this series and see the >> improvement in IOPS from 7.15 to 7.65M here. >> > > There is corner case where this might not work, > This happens when there is a first bvec has non zero offset. 
> Let's say bv_offset = 256, len = 512, iov_offset = 3584 (512*7, 8th IO), > here we expect IO to have 2 segments with present codebase, but this > patch set produces 1 segment. > > So having a fix like this solves the issue, > + nr_segs = (offset + len + bvec->bv_offset + folio_mask) >> imu->folio_shift; Ah yes, looks like the right fix up. We can make it nicer, but that's for later. It'd also be great to have a test for it. > Note: > I am investigating whether this is a valid case or not, because having a > 512 byte IO with 256 byte alignment feel odd. So have sent one patch for Block might filter it out, but for example net/ doesn't care, fs as well. IIUC what you mean, either way we definitely should correct that. > that as well[1]. If that patch[1] is upstreamed then above case is taken > care of, so we can use this series as it is. > > Thanks, > Nitesh > > [1] > https://lore.kernel.org/all/20250415181419.16305-1-nj.shetty@samsung.com/ > > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 4/4] io_uring/rsrc: send exact nr_segs for fixed buffer 2025-04-17 12:56 ` Pavel Begunkov @ 2025-04-17 13:41 ` Jens Axboe 2025-04-17 13:57 ` Jens Axboe 0 siblings, 1 reply; 15+ messages in thread From: Jens Axboe @ 2025-04-17 13:41 UTC (permalink / raw) To: Pavel Begunkov, Nitesh Shetty; +Cc: io-uring On 4/17/25 6:56 AM, Pavel Begunkov wrote: > On 4/17/25 12:50, Nitesh Shetty wrote: >> On 17/04/25 03:53PM, Nitesh Shetty wrote: >>> On 17/04/25 10:34AM, Pavel Begunkov wrote: >>>> On 4/17/25 10:32, Pavel Begunkov wrote: >>>>> From: Nitesh Shetty <nj.shetty@samsung.com> >>>> ... >>>>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c >>>>> index 5cf854318b1d..4099b8225670 100644 >>>>> --- a/io_uring/rsrc.c >>>>> +++ b/io_uring/rsrc.c >>>>> @@ -1037,6 +1037,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, >>>>> u64 buf_addr, size_t len) >>>>> { >>>>> const struct bio_vec *bvec; >>>>> + size_t folio_mask; >>>>> unsigned nr_segs; >>>>> size_t offset; >>>>> int ret; >>>>> @@ -1067,6 +1068,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, >>>>> * 2) all bvecs are the same in size, except potentially the >>>>> * first and last bvec >>>>> */ >>>>> + folio_mask = (1UL << imu->folio_shift) - 1; >>>>> bvec = imu->bvec; >>>>> if (offset >= bvec->bv_len) { >>>>> unsigned long seg_skip; >>>>> @@ -1075,10 +1077,10 @@ static int io_import_fixed(int ddir, struct iov_iter *iter, >>>>> offset -= bvec->bv_len; >>>>> seg_skip = 1 + (offset >> imu->folio_shift); >>>>> bvec += seg_skip; >>>>> - offset &= (1UL << imu->folio_shift) - 1; >>>>> + offset &= folio_mask; >>>>> } >>>>> - nr_segs = imu->nr_bvecs - (bvec - imu->bvec); >>>>> + nr_segs = (offset + len + folio_mask) >> imu->folio_shift; >>>> >>>> Nitesh, let me know if you're happy with this version. >>>> >>> This looks great to me, I tested this series and see the >>> improvement in IOPS from 7.15 to 7.65M here. 
>>> >> >> There is corner case where this might not work, >> This happens when there is a first bvec has non zero offset. >> Let's say bv_offset = 256, len = 512, iov_offset = 3584 (512*7, 8th IO), >> here we expect IO to have 2 segments with present codebase, but this >> patch set produces 1 segment. >> >> So having a fix like this solves the issue, >> + nr_segs = (offset + len + bvec->bv_offset + folio_mask) >> imu->folio_shift; > > Ah yes, looks like the right fix up. We can make it nicer, but > that's for later. It'd also be great to have a test for it. > > >> Note: >> I am investigating whether this is a valid case or not, because having a >> 512 byte IO with 256 byte alignment feel odd. So have sent one patch for > > Block might filter it out, but for example net/ doesn't care, > fs as well. IIUC what you mean, either way we definitely should > correct that. I just tested it, and yes it certainly blows up... Can also confirm that the corrected nr_segs calculation does the right thing, doesn't end up underestimating the segment count by 1 in that case. I'll turn the test case into something we can add to liburing, and fold in that change. -- Jens Axboe ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 4/4] io_uring/rsrc: send exact nr_segs for fixed buffer 2025-04-17 13:41 ` Jens Axboe @ 2025-04-17 13:57 ` Jens Axboe 2025-04-17 14:02 ` Pavel Begunkov 0 siblings, 1 reply; 15+ messages in thread From: Jens Axboe @ 2025-04-17 13:57 UTC (permalink / raw) To: Pavel Begunkov, Nitesh Shetty; +Cc: io-uring On 4/17/25 7:41 AM, Jens Axboe wrote: > I'll turn the test case into something we can add to liburing, and fold > in that change. Here's what I tested, fwiw, and it reliably blows up pre the fixup. I'll turn it into a normal test case, and then folks can add more invariants to this one if they wish. #include <stdio.h> #include <unistd.h> #include <fcntl.h> #include <stdlib.h> #include <string.h> #include <sys/time.h> #include <liburing.h> static struct iovec vec; static int read_it(struct io_uring *ring, int fd, int len, int off) { struct io_uring_sqe *sqe; struct io_uring_cqe *cqe; int ret; sqe = io_uring_get_sqe(ring); io_uring_prep_read_fixed(sqe, fd, vec.iov_base + off, len, 0, 0); sqe->user_data = 1; io_uring_submit(ring); ret = io_uring_wait_cqe(ring, &cqe); if (ret) { fprintf(stderr, "wait %d\n", ret); return 1; } if (cqe->res < 0) { fprintf(stderr, "cqe res %s\n", strerror(-cqe->res)); return 1; } if (cqe->res != len) { fprintf(stderr, "Bad read amount: %d\n", cqe->res); return 1; } io_uring_cqe_seen(ring, cqe); return 0; } static int test(struct io_uring *ring, int fd, int vec_off) { struct iovec v = vec; int ret; v.iov_base += vec_off; v.iov_len -= vec_off; ret = io_uring_register_buffers(ring, &v, 1); if (ret) { fprintf(stderr, "Vec register: %d\n", ret); return 1; } ret = read_it(ring, fd, 4096, vec_off); if (ret) { fprintf(stderr, "4096 0 failed\n"); return 1; } ret = read_it(ring, fd, 8192, 4096); if (ret) { fprintf(stderr, "8192 4096 failed\n"); return 1; } ret = read_it(ring, fd, 4096, 4096); if (ret) { fprintf(stderr, "4096 4096 failed\n"); return 1; } io_uring_unregister_buffers(ring); return 0; } int main(int argc, char *argv[]) { 
struct io_uring ring; int fd, ret; if (argc == 1) { printf("%s: <file>\n", argv[0]); return 1; } fd = open(argv[1], O_RDONLY | O_DIRECT); if (fd < 0) { perror("open"); return 1; } posix_memalign(&vec.iov_base, 4096, 512*1024); vec.iov_len = 512*1024; io_uring_queue_init(32, &ring, 0); ret = test(&ring, fd, 0); if (ret) { fprintf(stderr, "test 0 failed\n"); return 1; } ret = test(&ring, fd, 512); if (ret) { fprintf(stderr, "test 512 failed\n"); return 1; } close(fd); io_uring_queue_exit(&ring); return 0; } -- Jens Axboe ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 4/4] io_uring/rsrc: send exact nr_segs for fixed buffer 2025-04-17 13:57 ` Jens Axboe @ 2025-04-17 14:02 ` Pavel Begunkov 2025-04-17 14:05 ` Jens Axboe 0 siblings, 1 reply; 15+ messages in thread From: Pavel Begunkov @ 2025-04-17 14:02 UTC (permalink / raw) To: Jens Axboe, Nitesh Shetty; +Cc: io-uring On 4/17/25 14:57, Jens Axboe wrote: > On 4/17/25 7:41 AM, Jens Axboe wrote: >> I'll turn the test case into something we can add to liburing, and fold >> in that change. > > Here's what I tested, fwiw, and it reliably blows up pre the fixup. I'll > turn it into a normal test case, and then folks can add more invariants > to this one if they wish. Awesome, thanks! -- Pavel Begunkov ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 4/4] io_uring/rsrc: send exact nr_segs for fixed buffer 2025-04-17 14:02 ` Pavel Begunkov @ 2025-04-17 14:05 ` Jens Axboe 0 siblings, 0 replies; 15+ messages in thread From: Jens Axboe @ 2025-04-17 14:05 UTC (permalink / raw) To: Pavel Begunkov, Nitesh Shetty; +Cc: io-uring On 4/17/25 8:02 AM, Pavel Begunkov wrote: > On 4/17/25 14:57, Jens Axboe wrote: >> On 4/17/25 7:41 AM, Jens Axboe wrote: >>> I'll turn the test case into something we can add to liburing, and fold >>> in that change. >> >> Here's what I tested, fwiw, and it reliably blows up pre the fixup. I'll >> turn it into a normal test case, and then folks can add more invariants >> to this one if they wish. > > Awesome, thanks! Done: https://git.kernel.dk/cgit/liburing/commit/?id=d700f83631a780182c3a3f1315e926d5547acedb -- Jens Axboe ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/4] io_import_fixed cleanups and optimisation 2025-04-17 9:32 [PATCH 0/4] io_import_fixed cleanups and optimisation Pavel Begunkov ` (3 preceding siblings ...) 2025-04-17 9:32 ` [PATCH 4/4] io_uring/rsrc: send exact nr_segs for fixed buffer Pavel Begunkov @ 2025-04-17 12:31 ` Jens Axboe [not found] ` <CGME20250417142334epcas5p36df55874d21c896115d92e505f9793fd@epcas5p3.samsung.com> 5 siblings, 0 replies; 15+ messages in thread From: Jens Axboe @ 2025-04-17 12:31 UTC (permalink / raw) To: io-uring, Pavel Begunkov; +Cc: Nitesh Shetty On Thu, 17 Apr 2025 10:32:30 +0100, Pavel Begunkov wrote: > io_import_fixed() cleanups topped with the nr segs optimisation > patch from Nitesh. Doesn't have the kbuf part, assuming Jens will > add it on top. > > Based on io_uring-6.15 > > Nitesh Shetty (1): > io_uring/rsrc: send exact nr_segs for fixed buffer > > [...] Applied, thanks! [1/4] io_uring/rsrc: don't skip offset calculation commit: 1ac571288822253db32196c49f240739148417e3 [2/4] io_uring/rsrc: separate kbuf offset adjustments commit: 50169d07548441e3033b9bbaa06e573e7224f140 [3/4] io_uring/rsrc: refactor io_import_fixed commit: 59852ebad954c8a3ac8b746930c2ea60febe797a [4/4] io_uring/rsrc: send exact nr_segs for fixed buffer commit: c5a173d9864a597288c6acd091315cfc3b32ca1c Best regards, -- Jens Axboe ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/4] io_import_fixed cleanups and optimisation [not found] ` <CGME20250417142334epcas5p36df55874d21c896115d92e505f9793fd@epcas5p3.samsung.com> @ 2025-04-17 14:15 ` Nitesh Shetty 0 siblings, 0 replies; 15+ messages in thread From: Nitesh Shetty @ 2025-04-17 14:15 UTC (permalink / raw) To: Pavel Begunkov; +Cc: io-uring [-- Attachment #1: Type: text/plain, Size: 664 bytes --] On 17/04/25 10:32AM, Pavel Begunkov wrote: >io_import_fixed() cleanups topped with the nr segs optimisation >patch from Nitesh. Doesn't have the kbuf part, assuming Jens will >add it on top. > >Based on io_uring-6.15 > >Nitesh Shetty (1): > io_uring/rsrc: send exact nr_segs for fixed buffer > >Pavel Begunkov (3): > io_uring/rsrc: don't skip offset calculation > io_uring/rsrc: separate kbuf offset adjustments > io_uring/rsrc: refactor io_import_fixed > > io_uring/rsrc.c | 74 ++++++++++++++++++++----------------------------- > 1 file changed, 30 insertions(+), 44 deletions(-) > >-- >2.48.1 > Reviewed and Tested by: Nitesh Shetty <nj.shetty@samsung.com> [-- Attachment #2: Type: text/plain, Size: 0 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread