* [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
@ 2025-04-16 5:44 ` Nitesh Shetty
2025-04-16 14:19 ` Jens Axboe
2025-04-16 15:03 ` Pavel Begunkov
From: Nitesh Shetty @ 2025-04-16 5:44 UTC (permalink / raw)
To: Jens Axboe, Pavel Begunkov
Cc: gost.dev, nitheshshetty, Nitesh Shetty, io-uring, linux-kernel
Sending the exact nr_segs avoids the bio split check and processing in
the block layer, which takes around 5% [1] of overall CPU utilization.
In our setup, we see an overall IOPS improvement from 7.15M to 7.65M [2]
and 5% lower CPU utilization.
[1]
3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at
1.42% io_uring [kernel.kallsyms] [k] bio_split_rw
0.62% io_uring [kernel.kallsyms] [k] bio_submit_split
[2]
sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2
-r4 /dev/nvme0n1 /dev/nvme1n1
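(For reference, these are fio t/io_uring switches: roughly, -b512 is a
512-byte block size, -d128 a queue depth of 128, -s32/-c32 submit and
completion batches of 32, -p1 polled I/O, -F1 registered files, -B1
registered (fixed) buffers, -n2 two submitter threads, and -r4 a
4-second run; exact semantics depend on the t/io_uring version.)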
Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
---
io_uring/rsrc.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index b36c8825550e..6fd3a4a85a9c 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1096,6 +1096,9 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
iter->iov_offset = offset & ((1UL << imu->folio_shift) - 1);
}
}
+ iter->nr_segs = (iter->bvec->bv_offset + iter->iov_offset +
+ iter->count + ((1UL << imu->folio_shift) - 1)) /
+ (1UL << imu->folio_shift);
return 0;
}
base-commit: 834a4a689699090a406d1662b03affa8b155d025
--
2.43.0
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 5:44 ` [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer Nitesh Shetty
@ 2025-04-16 14:19 ` Jens Axboe
2025-04-16 14:43 ` Jens Axboe
2025-04-16 15:03 ` Pavel Begunkov
From: Jens Axboe @ 2025-04-16 14:19 UTC (permalink / raw)
To: Nitesh Shetty, Pavel Begunkov
Cc: gost.dev, nitheshshetty, io-uring, linux-kernel
On 4/15/25 11:44 PM, Nitesh Shetty wrote:
> Sending the exact nr_segs avoids the bio split check and processing in
> the block layer, which takes around 5% [1] of overall CPU utilization.
>
> In our setup, we see an overall IOPS improvement from 7.15M to 7.65M [2]
> and 5% lower CPU utilization.
>
> [1]
> 3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at
> 1.42% io_uring [kernel.kallsyms] [k] bio_split_rw
> 0.62% io_uring [kernel.kallsyms] [k] bio_submit_split
>
> [2]
> sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2
> -r4 /dev/nvme0n1 /dev/nvme1n1
This must be a regression, do you know which block/io_uring side commit
caused the splits to be done for fixed buffers?
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
> io_uring/rsrc.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index b36c8825550e..6fd3a4a85a9c 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -1096,6 +1096,9 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
> iter->iov_offset = offset & ((1UL << imu->folio_shift) - 1);
> }
> }
> + iter->nr_segs = (iter->bvec->bv_offset + iter->iov_offset +
> + iter->count + ((1UL << imu->folio_shift) - 1)) /
> + (1UL << imu->folio_shift);
iter->nr_segs = (iter->bvec->bv_offset + iter->iov_offset +
iter->count + ((1UL << imu->folio_shift) - 1)) >> imu->folio_shift;
to avoid a division, seems worthwhile?
--
Jens Axboe
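For illustration, a standalone check of the equivalence Jens is pointing
at (userspace C, not kernel code; the values are made up):

#include <assert.h>
#include <stddef.h>

/* Both forms round the byte span up to whole folios; for power-of-two
 * folio sizes, dividing and shifting are interchangeable. */
static size_t nr_segs_div(size_t bv_offset, size_t iov_offset,
			  size_t count, unsigned int folio_shift)
{
	return (bv_offset + iov_offset + count +
		((1UL << folio_shift) - 1)) / (1UL << folio_shift);
}

static size_t nr_segs_shift(size_t bv_offset, size_t iov_offset,
			    size_t count, unsigned int folio_shift)
{
	return (bv_offset + iov_offset + count +
		((1UL << folio_shift) - 1)) >> folio_shift;
}

int main(void)
{
	/* e.g. 4KiB folios (shift 12), a 1MiB read starting 512b in */
	assert(nr_segs_div(0, 512, 1UL << 20, 12) ==
	       nr_segs_shift(0, 512, 1UL << 20, 12));
	return 0;
}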
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 14:19 ` Jens Axboe
@ 2025-04-16 14:43 ` Jens Axboe
2025-04-16 14:49 ` Jens Axboe
From: Jens Axboe @ 2025-04-16 14:43 UTC (permalink / raw)
To: Nitesh Shetty, Pavel Begunkov
Cc: gost.dev, nitheshshetty, io-uring, linux-kernel
On 4/16/25 8:19 AM, Jens Axboe wrote:
> On 4/15/25 11:44 PM, Nitesh Shetty wrote:
>> Sending the exact nr_segs avoids the bio split check and processing in
>> the block layer, which takes around 5% [1] of overall CPU utilization.
>>
>> In our setup, we see an overall IOPS improvement from 7.15M to 7.65M [2]
>> and 5% lower CPU utilization.
>>
>> [1]
>> 3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at
>> 1.42% io_uring [kernel.kallsyms] [k] bio_split_rw
>> 0.62% io_uring [kernel.kallsyms] [k] bio_submit_split
>>
>> [2]
>> sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2
>> -r4 /dev/nvme0n1 /dev/nvme1n1
>
> This must be a regression, do you know which block/io_uring side commit
> caused the splits to be done for fixed buffers?
>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> ---
>> io_uring/rsrc.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
>> index b36c8825550e..6fd3a4a85a9c 100644
>> --- a/io_uring/rsrc.c
>> +++ b/io_uring/rsrc.c
>> @@ -1096,6 +1096,9 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>> iter->iov_offset = offset & ((1UL << imu->folio_shift) - 1);
>> }
>> }
>> + iter->nr_segs = (iter->bvec->bv_offset + iter->iov_offset +
>> + iter->count + ((1UL << imu->folio_shift) - 1)) /
>> + (1UL << imu->folio_shift);
>
> iter->nr_segs = (iter->bvec->bv_offset + iter->iov_offset +
> iter->count + ((1UL << imu->folio_shift) - 1)) >> imu->folio_shift;
>
> to avoid a division, seems worthwhile?
And we should be able to drop the ->nr_segs assignment in the above
section as well with this change.
Tested on a box here, previously:
IOPS=99.19M, BW=48.43GiB/s, IOS/call=32/31
IOPS=99.48M, BW=48.57GiB/s, IOS/call=32/32
IOPS=99.43M, BW=48.55GiB/s, IOS/call=32/32
IOPS=99.48M, BW=48.57GiB/s, IOS/call=31/31
IOPS=99.49M, BW=48.58GiB/s, IOS/call=32/32
and with the fix:
IOPS=103.28M, BW=50.43GiB/s, IOS/call=32/31
IOPS=103.18M, BW=50.38GiB/s, IOS/call=32/32
IOPS=103.22M, BW=50.40GiB/s, IOS/call=32/31
IOPS=103.18M, BW=50.38GiB/s, IOS/call=31/32
IOPS=103.19M, BW=50.38GiB/s, IOS/call=31/32
IOPS=103.12M, BW=50.35GiB/s, IOS/call=32/31
and I do indeed see the same ~4% time wasted on splits.
--
Jens Axboe
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 14:43 ` Jens Axboe
@ 2025-04-16 14:49 ` Jens Axboe
From: Jens Axboe @ 2025-04-16 14:49 UTC (permalink / raw)
To: Nitesh Shetty, Pavel Begunkov
Cc: gost.dev, nitheshshetty, io-uring, linux-kernel
On 4/16/25 8:43 AM, Jens Axboe wrote:
> On 4/16/25 8:19 AM, Jens Axboe wrote:
>> On 4/15/25 11:44 PM, Nitesh Shetty wrote:
>>> Sending the exact nr_segs avoids the bio split check and processing in
>>> the block layer, which takes around 5% [1] of overall CPU utilization.
>>>
>>> In our setup, we see an overall IOPS improvement from 7.15M to 7.65M [2]
>>> and 5% lower CPU utilization.
>>>
>>> [1]
>>> 3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at
>>> 1.42% io_uring [kernel.kallsyms] [k] bio_split_rw
>>> 0.62% io_uring [kernel.kallsyms] [k] bio_submit_split
>>>
>>> [2]
>>> sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2
>>> -r4 /dev/nvme0n1 /dev/nvme1n1
>>
>> This must be a regression, do you know which block/io_uring side commit
>> caused the splits to be done for fixed buffers?
>>
>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>> ---
>>> io_uring/rsrc.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
>>> index b36c8825550e..6fd3a4a85a9c 100644
>>> --- a/io_uring/rsrc.c
>>> +++ b/io_uring/rsrc.c
>>> @@ -1096,6 +1096,9 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>>> iter->iov_offset = offset & ((1UL << imu->folio_shift) - 1);
>>> }
>>> }
>>> + iter->nr_segs = (iter->bvec->bv_offset + iter->iov_offset +
>>> + iter->count + ((1UL << imu->folio_shift) - 1)) /
>>> + (1UL << imu->folio_shift);
>>
>> iter->nr_segs = (iter->bvec->bv_offset + iter->iov_offset +
>> iter->count + ((1UL << imu->folio_shift) - 1)) >> imu->folio_shift;
>>
>> to avoid a division, seems worthwhile?
>
> And we should be able to drop the ->nr_segs assignment in the above
> section as well with this change.
>
> Tested on a box here, previously:
>
> IOPS=99.19M, BW=48.43GiB/s, IOS/call=32/31
> IOPS=99.48M, BW=48.57GiB/s, IOS/call=32/32
> IOPS=99.43M, BW=48.55GiB/s, IOS/call=32/32
> IOPS=99.48M, BW=48.57GiB/s, IOS/call=31/31
> IOPS=99.49M, BW=48.58GiB/s, IOS/call=32/32
>
> and with the fix:
>
> IOPS=103.28M, BW=50.43GiB/s, IOS/call=32/31
> IOPS=103.18M, BW=50.38GiB/s, IOS/call=32/32
> IOPS=103.22M, BW=50.40GiB/s, IOS/call=32/31
> IOPS=103.18M, BW=50.38GiB/s, IOS/call=31/32
> IOPS=103.19M, BW=50.38GiB/s, IOS/call=31/32
> IOPS=103.12M, BW=50.35GiB/s, IOS/call=32/31
>
> and I do indeed see the same ~4% time wasted on splits.
Applied this with a pre-patch to avoid overly long lines, and
with the redundant nr_segs removed and the division eliminated.
See here:
https://git.kernel.dk/cgit/linux/log/?h=io_uring-6.15
--
Jens Axboe
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 5:44 ` [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer Nitesh Shetty
2025-04-16 14:19 ` Jens Axboe
@ 2025-04-16 15:03 ` Pavel Begunkov
2025-04-16 15:07 ` Jens Axboe
From: Pavel Begunkov @ 2025-04-16 15:03 UTC (permalink / raw)
To: Nitesh Shetty, Jens Axboe; +Cc: gost.dev, nitheshshetty, io-uring, linux-kernel
On 4/16/25 06:44, Nitesh Shetty wrote:
> Sending the exact nr_segs avoids the bio split check and processing in
> the block layer, which takes around 5% [1] of overall CPU utilization.
>
> In our setup, we see an overall IOPS improvement from 7.15M to 7.65M [2]
> and 5% lower CPU utilization.
>
> [1]
> 3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at
> 1.42% io_uring [kernel.kallsyms] [k] bio_split_rw
> 0.62% io_uring [kernel.kallsyms] [k] bio_submit_split
>
> [2]
> sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2
> -r4 /dev/nvme0n1 /dev/nvme1n1
>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> ---
> io_uring/rsrc.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index b36c8825550e..6fd3a4a85a9c 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -1096,6 +1096,9 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
> iter->iov_offset = offset & ((1UL << imu->folio_shift) - 1);
> }
> }
> + iter->nr_segs = (iter->bvec->bv_offset + iter->iov_offset +
> + iter->count + ((1UL << imu->folio_shift) - 1)) /
> + (1UL << imu->folio_shift);
That's not going to work with ->is_kbuf as the segments are not uniform in
size.
And can we make it saner? Split it into several statements, add variables
for folio size and so on, or maybe just use ALIGN. If moved above, you
probably don't even need to recalc
iter->bvec->bv_offset + iter->iov_offset
--
Pavel Begunkov
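To see the problem concretely, a toy userspace sketch (not kernel code;
the segment sizes are invented) of how the uniform-folio math miscounts
a kernel-registered buffer:

#include <stdio.h>

struct toy_bvec { unsigned int len; };

int main(void)
{
	/* hypothetical is_kbuf layout: 512b + 4KiB + 1KiB segments */
	struct toy_bvec bv[] = { { 512 }, { 4096 }, { 1024 } };
	unsigned int folio_shift = 12;
	unsigned int len = 512 + 4096 + 1024, left = len;
	unsigned int i, walked = 0;

	/* segments actually spanned: walk the bvec array */
	for (i = 0; i < 3 && left; i++, walked++)
		left -= bv[i].len < left ? bv[i].len : left;

	/* prints "walked=3 folio_math=2"; the uniform-size formula
	 * undercounts for this layout */
	printf("walked=%u folio_math=%u\n", walked,
	       (len + (1U << folio_shift) - 1) >> folio_shift);
	return 0;
}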
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 15:03 ` Pavel Begunkov
@ 2025-04-16 15:07 ` Jens Axboe
2025-04-16 18:25 ` Jens Axboe
From: Jens Axboe @ 2025-04-16 15:07 UTC (permalink / raw)
To: Pavel Begunkov, Nitesh Shetty
Cc: gost.dev, nitheshshetty, io-uring, linux-kernel
On 4/16/25 9:03 AM, Pavel Begunkov wrote:
> On 4/16/25 06:44, Nitesh Shetty wrote:
>> Sending the exact nr_segs avoids the bio split check and processing in
>> the block layer, which takes around 5% [1] of overall CPU utilization.
>>
>> In our setup, we see an overall IOPS improvement from 7.15M to 7.65M [2]
>> and 5% lower CPU utilization.
>>
>> [1]
>> 3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at
>> 1.42% io_uring [kernel.kallsyms] [k] bio_split_rw
>> 0.62% io_uring [kernel.kallsyms] [k] bio_submit_split
>>
>> [2]
>> sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2
>> -r4 /dev/nvme0n1 /dev/nvme1n1
>>
>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>> ---
>> io_uring/rsrc.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
>> index b36c8825550e..6fd3a4a85a9c 100644
>> --- a/io_uring/rsrc.c
>> +++ b/io_uring/rsrc.c
>> @@ -1096,6 +1096,9 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>> iter->iov_offset = offset & ((1UL << imu->folio_shift) - 1);
>> }
>> }
>> + iter->nr_segs = (iter->bvec->bv_offset + iter->iov_offset +
>> + iter->count + ((1UL << imu->folio_shift) - 1)) /
>> + (1UL << imu->folio_shift);
>
> That's not going to work with ->is_kbuf as the segments are not uniform in
> size.
Oops yes good point.
> And can we make it saner? Split it into several statements, add variables
> for folio size and so on, or maybe just use ALIGN. If moved above, you
> probably don't even need to recalc
>
> iter->bvec->bv_offset + iter->iov_offset
Agree, I was trying to do that with the shift at least, but it seems
like there is indeed room for further improvement here in terms of
readability.
--
Jens Axboe
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 15:07 ` Jens Axboe
@ 2025-04-16 18:25 ` Jens Axboe
2025-04-16 19:57 ` Nitesh Shetty
From: Jens Axboe @ 2025-04-16 18:25 UTC (permalink / raw)
To: Pavel Begunkov, Nitesh Shetty
Cc: gost.dev, nitheshshetty, io-uring, linux-kernel
On 4/16/25 9:07 AM, Jens Axboe wrote:
> On 4/16/25 9:03 AM, Pavel Begunkov wrote:
>> On 4/16/25 06:44, Nitesh Shetty wrote:
>>> Sending the exact nr_segs avoids the bio split check and processing in
>>> the block layer, which takes around 5% [1] of overall CPU utilization.
>>>
>>> In our setup, we see an overall IOPS improvement from 7.15M to 7.65M [2]
>>> and 5% lower CPU utilization.
>>>
>>> [1]
>>> 3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at
>>> 1.42% io_uring [kernel.kallsyms] [k] bio_split_rw
>>> 0.62% io_uring [kernel.kallsyms] [k] bio_submit_split
>>>
>>> [2]
>>> sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2
>>> -r4 /dev/nvme0n1 /dev/nvme1n1
>>>
>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>> ---
>>> io_uring/rsrc.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
>>> index b36c8825550e..6fd3a4a85a9c 100644
>>> --- a/io_uring/rsrc.c
>>> +++ b/io_uring/rsrc.c
>>> @@ -1096,6 +1096,9 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>>> iter->iov_offset = offset & ((1UL << imu->folio_shift) - 1);
>>> }
>>> }
>>> + iter->nr_segs = (iter->bvec->bv_offset + iter->iov_offset +
>>> + iter->count + ((1UL << imu->folio_shift) - 1)) /
>>> + (1UL << imu->folio_shift);
>>
>> That's not going to work with ->is_kbuf as the segments are not uniform in
>> size.
>
> Oops yes good point.
How about something like this? Trims superfluous end segments, if they
exist. The 'offset' section already trimmed the front parts. For
!is_kbuf that should be simple math, like in Nitesh's patch. For
is_kbuf, iterate them.
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index bef66e733a77..e482ea1e22a9 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1036,6 +1036,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
struct io_mapped_ubuf *imu,
u64 buf_addr, size_t len)
{
+ const struct bio_vec *bvec;
unsigned int folio_shift;
size_t offset;
int ret;
@@ -1052,9 +1053,10 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
* Might not be a start of buffer, set size appropriately
* and advance us to the beginning.
*/
+ bvec = imu->bvec;
offset = buf_addr - imu->ubuf;
folio_shift = imu->folio_shift;
- iov_iter_bvec(iter, ddir, imu->bvec, imu->nr_bvecs, offset + len);
+ iov_iter_bvec(iter, ddir, bvec, imu->nr_bvecs, offset + len);
if (offset) {
/*
@@ -1073,7 +1075,6 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
* since we can just skip the first segment, which may not
* be folio_size aligned.
*/
- const struct bio_vec *bvec = imu->bvec;
/*
* Kernel buffer bvecs, on the other hand, don't necessarily
@@ -1099,6 +1100,27 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
}
}
+ /*
+ * Offset trimmed front segments too, if any, now trim the tail.
+ * For is_kbuf we'll iterate them as they may be different sizes,
+ * otherwise we can just do straight up math.
+ */
+ if (len + offset < imu->len) {
+ bvec = iter->bvec;
+ if (imu->is_kbuf) {
+ while (len > bvec->bv_len) {
+ len -= bvec->bv_len;
+ bvec++;
+ }
+ iter->nr_segs = bvec - iter->bvec;
+ } else {
+ size_t vec_len;
+
+ vec_len = bvec->bv_offset + iter->iov_offset +
+ iter->count + ((1UL << folio_shift) - 1);
+ iter->nr_segs = vec_len >> folio_shift;
+ }
+ }
return 0;
}
--
Jens Axboe
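A quick numeric check of the tail condition in this sketch, with made-up
values: for a 64KiB registered buffer (imu->len = 65536), an import of
len = 8KiB at offset = 4KiB gives len + offset = 12KiB < 64KiB, so the
trim runs; a full-buffer import (offset = 0, len = 64KiB) skips it and
nr_segs stays as iov_iter_bvec() set it.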
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 18:25 ` Jens Axboe
@ 2025-04-16 19:57 ` Nitesh Shetty
2025-04-16 20:01 ` Jens Axboe
2025-04-16 20:03 ` Keith Busch
From: Nitesh Shetty @ 2025-04-16 19:57 UTC (permalink / raw)
To: Jens Axboe
Cc: Pavel Begunkov, Nitesh Shetty, gost.dev, io-uring, linux-kernel
On Wed, Apr 16, 2025 at 11:55 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/16/25 9:07 AM, Jens Axboe wrote:
> > On 4/16/25 9:03 AM, Pavel Begunkov wrote:
> >> On 4/16/25 06:44, Nitesh Shetty wrote:
> >>> Sending the exact nr_segs avoids the bio split check and processing in
> >>> the block layer, which takes around 5% [1] of overall CPU utilization.
> >>>
> >>> In our setup, we see an overall IOPS improvement from 7.15M to 7.65M [2]
> >>> and 5% lower CPU utilization.
> >>>
> >>> [1]
> >>> 3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at
> >>> 1.42% io_uring [kernel.kallsyms] [k] bio_split_rw
> >>> 0.62% io_uring [kernel.kallsyms] [k] bio_submit_split
> >>>
> >>> [2]
> >>> sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2
> >>> -r4 /dev/nvme0n1 /dev/nvme1n1
> >>>
> >>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> >>> ---
> >>> io_uring/rsrc.c | 3 +++
> >>> 1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> >>> index b36c8825550e..6fd3a4a85a9c 100644
> >>> --- a/io_uring/rsrc.c
> >>> +++ b/io_uring/rsrc.c
> >>> @@ -1096,6 +1096,9 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
> >>> iter->iov_offset = offset & ((1UL << imu->folio_shift) - 1);
> >>> }
> >>> }
> >>> + iter->nr_segs = (iter->bvec->bv_offset + iter->iov_offset +
> >>> + iter->count + ((1UL << imu->folio_shift) - 1)) /
> >>> + (1UL << imu->folio_shift);
> >>
> >> That's not going to work with ->is_kbuf as the segments are not uniform in
> >> size.
> >
> > Oops yes good point.
>
> How about something like this? Trims superfluous end segments, if they
> exist. The 'offset' section already trimmed the front parts. For
> !is_kbuf that should be simple math, like in Nitesh's patch. For
> is_kbuf, iterate them.
>
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index bef66e733a77..e482ea1e22a9 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -1036,6 +1036,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
> struct io_mapped_ubuf *imu,
> u64 buf_addr, size_t len)
> {
> + const struct bio_vec *bvec;
> unsigned int folio_shift;
> size_t offset;
> int ret;
> @@ -1052,9 +1053,10 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
> * Might not be a start of buffer, set size appropriately
> * and advance us to the beginning.
> */
> + bvec = imu->bvec;
> offset = buf_addr - imu->ubuf;
> folio_shift = imu->folio_shift;
> - iov_iter_bvec(iter, ddir, imu->bvec, imu->nr_bvecs, offset + len);
> + iov_iter_bvec(iter, ddir, bvec, imu->nr_bvecs, offset + len);
>
> if (offset) {
> /*
> @@ -1073,7 +1075,6 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
> * since we can just skip the first segment, which may not
> * be folio_size aligned.
> */
> - const struct bio_vec *bvec = imu->bvec;
>
> /*
> * Kernel buffer bvecs, on the other hand, don't necessarily
> @@ -1099,6 +1100,27 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
> }
> }
>
> + /*
> + * Offset trimmed front segments too, if any, now trim the tail.
> + * For is_kbuf we'll iterate them as they may be different sizes,
> + * otherwise we can just do straight up math.
> + */
> + if (len + offset < imu->len) {
> + bvec = iter->bvec;
> + if (imu->is_kbuf) {
> + while (len > bvec->bv_len) {
> + len -= bvec->bv_len;
> + bvec++;
> + }
> + iter->nr_segs = bvec - iter->bvec;
> + } else {
> + size_t vec_len;
> +
> + vec_len = bvec->bv_offset + iter->iov_offset +
> + iter->count + ((1UL << folio_shift) - 1);
> + iter->nr_segs = vec_len >> folio_shift;
> + }
> + }
> return 0;
> }
This might not be needed for is_kbuf, as iov_iter_advance already
updates nr_segs in that path.
How about changing it to something like this?
- if (offset < bvec->bv_len) {
- iter->count -= offset;
- iter->iov_offset = offset;
- } else if (imu->is_kbuf) {
+ if (!imu->is_kbuf) {
+ size_t vec_len;
+
+ if (offset < bvec->bv_len) {
+ iter->count -= offset;
+ iter->iov_offset = offset;
+ } else {
+ unsigned long seg_skip;
+
+ /* skip first vec */
+ offset -= bvec->bv_len;
+ seg_skip = 1 + (offset >> folio_shift);
+
+ iter->bvec += seg_skip;
+ iter->count -= bvec->bv_len + offset;
+ iter->iov_offset = offset & ((1UL << folio_shift) - 1);
+ }
+ vec_len = ALIGN(iter->bvec->bv_offset + iter->iov_offset +
+ iter->count, 1UL << folio_shift);
+ iter->nr_segs = vec_len >> folio_shift;
+ } else
iov_iter_advance(iter, offset);
- } else {
- unsigned long seg_skip;
-
- /* skip first vec */
- offset -= bvec->bv_len;
- seg_skip = 1 + (offset >> folio_shift);
-
- iter->bvec += seg_skip;
- iter->count -= bvec->bv_len + offset;
- iter->iov_offset = offset & ((1UL << folio_shift) - 1);
- }
}
Regards,
Nitesh
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 19:57 ` Nitesh Shetty
@ 2025-04-16 20:01 ` Jens Axboe
2025-04-16 20:29 ` Pavel Begunkov
2025-04-16 20:03 ` Keith Busch
From: Jens Axboe @ 2025-04-16 20:01 UTC (permalink / raw)
To: Nitesh Shetty
Cc: Pavel Begunkov, Nitesh Shetty, gost.dev, io-uring, linux-kernel
On 4/16/25 1:57 PM, Nitesh Shetty wrote:
> On Wed, Apr 16, 2025 at 11:55 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 4/16/25 9:07 AM, Jens Axboe wrote:
>>> On 4/16/25 9:03 AM, Pavel Begunkov wrote:
>>>> On 4/16/25 06:44, Nitesh Shetty wrote:
>>>>> Sending the exact nr_segs avoids the bio split check and processing in
>>>>> the block layer, which takes around 5% [1] of overall CPU utilization.
>>>>>
>>>>> In our setup, we see an overall IOPS improvement from 7.15M to 7.65M [2]
>>>>> and 5% lower CPU utilization.
>>>>>
>>>>> [1]
>>>>> 3.52% io_uring [kernel.kallsyms] [k] bio_split_rw_at
>>>>> 1.42% io_uring [kernel.kallsyms] [k] bio_split_rw
>>>>> 0.62% io_uring [kernel.kallsyms] [k] bio_submit_split
>>>>>
>>>>> [2]
>>>>> sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2
>>>>> -r4 /dev/nvme0n1 /dev/nvme1n1
>>>>>
>>>>> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
>>>>> ---
>>>>> io_uring/rsrc.c | 3 +++
>>>>> 1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
>>>>> index b36c8825550e..6fd3a4a85a9c 100644
>>>>> --- a/io_uring/rsrc.c
>>>>> +++ b/io_uring/rsrc.c
>>>>> @@ -1096,6 +1096,9 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>>>>> iter->iov_offset = offset & ((1UL << imu->folio_shift) - 1);
>>>>> }
>>>>> }
>>>>> + iter->nr_segs = (iter->bvec->bv_offset + iter->iov_offset +
>>>>> + iter->count + ((1UL << imu->folio_shift) - 1)) /
>>>>> + (1UL << imu->folio_shift);
>>>>
>>>> That's not going to work with ->is_kbuf as the segments are not uniform in
>>>> size.
>>>
>>> Oops yes good point.
>>
>> How about something like this? Trims superfluous end segments, if they
>> exist. The 'offset' section already trimmed the front parts. For
>> !is_kbuf that should be simple math, like in Nitesh's patch. For
>> is_kbuf, iterate them.
>>
>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
>> index bef66e733a77..e482ea1e22a9 100644
>> --- a/io_uring/rsrc.c
>> +++ b/io_uring/rsrc.c
>> @@ -1036,6 +1036,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>> struct io_mapped_ubuf *imu,
>> u64 buf_addr, size_t len)
>> {
>> + const struct bio_vec *bvec;
>> unsigned int folio_shift;
>> size_t offset;
>> int ret;
>> @@ -1052,9 +1053,10 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>> * Might not be a start of buffer, set size appropriately
>> * and advance us to the beginning.
>> */
>> + bvec = imu->bvec;
>> offset = buf_addr - imu->ubuf;
>> folio_shift = imu->folio_shift;
>> - iov_iter_bvec(iter, ddir, imu->bvec, imu->nr_bvecs, offset + len);
>> + iov_iter_bvec(iter, ddir, bvec, imu->nr_bvecs, offset + len);
>>
>> if (offset) {
>> /*
>> @@ -1073,7 +1075,6 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>> * since we can just skip the first segment, which may not
>> * be folio_size aligned.
>> */
>> - const struct bio_vec *bvec = imu->bvec;
>>
>> /*
>> * Kernel buffer bvecs, on the other hand, don't necessarily
>> @@ -1099,6 +1100,27 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>> }
>> }
>>
>> + /*
>> + * Offset trimmed front segments too, if any, now trim the tail.
>> + * For is_kbuf we'll iterate them as they may be different sizes,
>> + * otherwise we can just do straight up math.
>> + */
>> + if (len + offset < imu->len) {
>> + bvec = iter->bvec;
>> + if (imu->is_kbuf) {
>> + while (len > bvec->bv_len) {
>> + len -= bvec->bv_len;
>> + bvec++;
>> + }
>> + iter->nr_segs = bvec - iter->bvec;
>> + } else {
>> + size_t vec_len;
>> +
>> + vec_len = bvec->bv_offset + iter->iov_offset +
>> + iter->count + ((1UL << folio_shift) - 1);
>> + iter->nr_segs = vec_len >> folio_shift;
>> + }
>> + }
>> return 0;
>> }
> This might not be needed for is_kbuf, as iov_iter_advance already
> updates nr_segs in that path.
How so? If 'offset' is true, then yes it'd skip the front, but it
doesn't skip the end part. And if 'offset' is 0, then no advancing is
done in the first place - which does make sense, as it's just advancing
from the front.
> How about changing it to something like this?
You can't hide this in the if (offset) section...
--
Jens Axboe
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 19:57 ` Nitesh Shetty
2025-04-16 20:01 ` Jens Axboe
@ 2025-04-16 20:03 ` Keith Busch
From: Keith Busch @ 2025-04-16 20:03 UTC (permalink / raw)
To: Nitesh Shetty
Cc: Jens Axboe, Pavel Begunkov, Nitesh Shetty, gost.dev, io-uring,
linux-kernel
On Thu, Apr 17, 2025 at 01:27:55AM +0530, Nitesh Shetty wrote:
> > + /*
> > + * Offset trimmed front segments too, if any, now trim the tail.
> > + * For is_kbuf we'll iterate them as they may be different sizes,
> > + * otherwise we can just do straight up math.
> > + */
> > + if (len + offset < imu->len) {
> > + bvec = iter->bvec;
> > + if (imu->is_kbuf) {
> > + while (len > bvec->bv_len) {
> > + len -= bvec->bv_len;
> > + bvec++;
> > + }
> > + iter->nr_segs = bvec - iter->bvec;
> > + } else {
> > + size_t vec_len;
> > +
> > + vec_len = bvec->bv_offset + iter->iov_offset +
> > + iter->count + ((1UL << folio_shift) - 1);
> > + iter->nr_segs = vec_len >> folio_shift;
> > + }
> > + }
> > return 0;
> > }
> This might not be needed for is_kbuf, as iov_iter_advance already
> updates nr_segs in that path.
iov_iter_advance only handles the front segs. We still need something
for any trailing segments that are not part of this mapping request.
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 20:01 ` Jens Axboe
@ 2025-04-16 20:29 ` Pavel Begunkov
2025-04-16 20:30 ` Jens Axboe
From: Pavel Begunkov @ 2025-04-16 20:29 UTC (permalink / raw)
To: Jens Axboe, Nitesh Shetty; +Cc: Nitesh Shetty, gost.dev, io-uring, linux-kernel
On 4/16/25 21:01, Jens Axboe wrote:
> On 4/16/25 1:57 PM, Nitesh Shetty wrote:
...
>>> /*
>>> @@ -1073,7 +1075,6 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>>> * since we can just skip the first segment, which may not
>>> * be folio_size aligned.
>>> */
>>> - const struct bio_vec *bvec = imu->bvec;
>>>
>>> /*
>>> * Kernel buffer bvecs, on the other hand, don't necessarily
>>> @@ -1099,6 +1100,27 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>>> }
>>> }
>>>
>>> + /*
>>> + * Offset trimmed front segments too, if any, now trim the tail.
>>> + * For is_kbuf we'll iterate them as they may be different sizes,
>>> + * otherwise we can just do straight up math.
>>> + */
>>> + if (len + offset < imu->len) {
>>> + bvec = iter->bvec;
>>> + if (imu->is_kbuf) {
>>> + while (len > bvec->bv_len) {
>>> + len -= bvec->bv_len;
>>> + bvec++;
>>> + }
>>> + iter->nr_segs = bvec - iter->bvec;
>>> + } else {
>>> + size_t vec_len;
>>> +
>>> + vec_len = bvec->bv_offset + iter->iov_offset +
>>> + iter->count + ((1UL << folio_shift) - 1);
>>> + iter->nr_segs = vec_len >> folio_shift;
>>> + }
>>> + }
>>> return 0;
>>> }
>> This might not be needed for is_kbuf, as iov_iter_advance already
>> updates nr_segs in that path.
>
> How so? If 'offset' is true, then yes it'd skip the front, but it
> doesn't skip the end part. And if 'offset' is 0, then no advancing is
> done in the first place - which does make sense, as it's just advancing
> from the front.
>
>> How about changing it to something like this?
>
> You can't hide this in the if (offset) section...
Should we just make it saner first? Sth like these 3 completely
untested commits
https://github.com/isilence/linux/commits/rsrc-import-cleanup/
And then it'll become
nr_segs = ALIGN(offset + len, 1UL << folio_shift) >> folio_shift;
--
Pavel Begunkov
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 20:29 ` Pavel Begunkov
@ 2025-04-16 20:30 ` Jens Axboe
2025-04-16 21:03 ` Pavel Begunkov
From: Jens Axboe @ 2025-04-16 20:30 UTC (permalink / raw)
To: Pavel Begunkov, Nitesh Shetty
Cc: Nitesh Shetty, gost.dev, io-uring, linux-kernel
On 4/16/25 2:29 PM, Pavel Begunkov wrote:
> On 4/16/25 21:01, Jens Axboe wrote:
>> On 4/16/25 1:57 PM, Nitesh Shetty wrote:
> ...
>>>> /*
>>>> @@ -1073,7 +1075,6 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>>>> * since we can just skip the first segment, which may not
>>>> * be folio_size aligned.
>>>> */
>>>> - const struct bio_vec *bvec = imu->bvec;
>>>>
>>>> /*
>>>> * Kernel buffer bvecs, on the other hand, don't necessarily
>>>> @@ -1099,6 +1100,27 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>>>> }
>>>> }
>>>>
>>>> + /*
>>>> + * Offset trimmed front segments too, if any, now trim the tail.
>>>> + * For is_kbuf we'll iterate them as they may be different sizes,
>>>> + * otherwise we can just do straight up math.
>>>> + */
>>>> + if (len + offset < imu->len) {
>>>> + bvec = iter->bvec;
>>>> + if (imu->is_kbuf) {
>>>> + while (len > bvec->bv_len) {
>>>> + len -= bvec->bv_len;
>>>> + bvec++;
>>>> + }
>>>> + iter->nr_segs = bvec - iter->bvec;
>>>> + } else {
>>>> + size_t vec_len;
>>>> +
>>>> + vec_len = bvec->bv_offset + iter->iov_offset +
>>>> + iter->count + ((1UL << folio_shift) - 1);
>>>> + iter->nr_segs = vec_len >> folio_shift;
>>>> + }
>>>> + }
>>>> return 0;
>>>> }
>>> This might not be needed for is_kbuf, as iov_iter_advance already
>>> updates nr_segs in that path.
>>
>> How so? If 'offset' is true, then yes it'd skip the front, but it
>> doesn't skip the end part. And if 'offset' is 0, then no advancing is
>> done in the first place - which does make sense, as it's just advancing
>> from the front.
>>
>>> How about changing it to something like this?
>>
>> You can't hide this in the if (offset) section...
>
> Should we just make it saner first? Sth like these 3 completely
> untested commits
>
> https://github.com/isilence/linux/commits/rsrc-import-cleanup/
>
> And then it'll become
>
> nr_segs = ALIGN(offset + len, 1UL << folio_shift) >> folio_shift;
Let's please do that, certainly an improvement. Care to send this out? I
can toss them at the testing. And we'd still need that last patch to
ensure the segment count is correct. Honestly somewhat surprised that
the only odd fallout of that is (needlessly) hitting the bio split path.
--
Jens Axboe
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 20:30 ` Jens Axboe
@ 2025-04-16 21:03 ` Pavel Begunkov
2025-04-16 22:23 ` Jens Axboe
From: Pavel Begunkov @ 2025-04-16 21:03 UTC (permalink / raw)
To: Jens Axboe, Nitesh Shetty; +Cc: Nitesh Shetty, gost.dev, io-uring, linux-kernel
On 4/16/25 21:30, Jens Axboe wrote:
> On 4/16/25 2:29 PM, Pavel Begunkov wrote:
>> On 4/16/25 21:01, Jens Axboe wrote:
>>> On 4/16/25 1:57 PM, Nitesh Shetty wrote:
>> ...
>>>>> /*
>>>>> @@ -1073,7 +1075,6 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>>>>> * since we can just skip the first segment, which may not
>>>>> * be folio_size aligned.
>>>>> */
>>>>> - const struct bio_vec *bvec = imu->bvec;
>>>>>
>>>>> /*
>>>>> * Kernel buffer bvecs, on the other hand, don't necessarily
>>>>> @@ -1099,6 +1100,27 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
>>>>> }
>>>>> }
>>>>>
>>>>> + /*
>>>>> + * Offset trimmed front segments too, if any, now trim the tail.
>>>>> + * For is_kbuf we'll iterate them as they may be different sizes,
>>>>> + * otherwise we can just do straight up math.
>>>>> + */
>>>>> + if (len + offset < imu->len) {
>>>>> + bvec = iter->bvec;
>>>>> + if (imu->is_kbuf) {
>>>>> + while (len > bvec->bv_len) {
>>>>> + len -= bvec->bv_len;
>>>>> + bvec++;
>>>>> + }
>>>>> + iter->nr_segs = bvec - iter->bvec;
>>>>> + } else {
>>>>> + size_t vec_len;
>>>>> +
>>>>> + vec_len = bvec->bv_offset + iter->iov_offset +
>>>>> + iter->count + ((1UL << folio_shift) - 1);
>>>>> + iter->nr_segs = vec_len >> folio_shift;
>>>>> + }
>>>>> + }
>>>>> return 0;
>>>>> }
>>>> This might not be needed for is_kbuf, as iov_iter_advance already
>>>> updates nr_segs in that path.
>>>
>>> How so? If 'offset' is true, then yes it'd skip the front, but it
>>> doesn't skip the end part. And if 'offset' is 0, then no advancing is
>>> done in the first place - which does make sense, as it's just advancing
>>> from the front.
>>>
>>>> How about changing it to something like this?
>>>
>>> You can't hide this in the if (offset) section...
>>
>> Should we just make it saner first? Sth like these 3 completely
>> untested commits
>>
>> https://github.com/isilence/linux/commits/rsrc-import-cleanup/
>>
>> And then it'll become
>>
>> nr_segs = ALIGN(offset + len, 1UL << folio_shift) >> folio_shift;
>
> Let's please do that, certainly an improvement. Care to send this out? I
> can toss them at the testing. And we'd still need that last patch to
I need to test it first, perhaps tomorrow
> ensure the segment count is correct. Honestly somewhat surprised that
Right, I can pick up Nitesh's patch for that.
> the only odd fallout of that is (needlessly) hitting the bio split path.
It's perfectly correct from the iter standpoint, AFAIK, length
and nr of segments don't have to match. Though I am surprised
it causes perf issues in the split path.
Btw, where exactly does it stumble in there? I'd assume we don't
need to do the segment correction for kbuf as the bio splitting
can do it (and probably does) in exactly the same way?
--
Pavel Begunkov
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 21:03 ` Pavel Begunkov
@ 2025-04-16 22:23 ` Jens Axboe
2025-04-16 22:42 ` Jens Axboe
2025-04-17 9:12 ` Pavel Begunkov
From: Jens Axboe @ 2025-04-16 22:23 UTC (permalink / raw)
To: Pavel Begunkov, Nitesh Shetty
Cc: Nitesh Shetty, gost.dev, io-uring, linux-kernel
>>> Should we just make it saner first? Sth like these 3 completely
>>> untested commits
>>>
>>> https://github.com/isilence/linux/commits/rsrc-import-cleanup/
>>>
>>> And then it'll become
>>>
>>> nr_segs = ALIGN(offset + len, 1UL << folio_shift) >> folio_shift;
>>
>> Let's please do that, certainly an improvement. Care to send this out? I
>> can toss them at the testing. And we'd still need that last patch to
>
> I need to test it first, perhaps tomorrow
Sounds good, I'll run it through testing here too. Would be nice to
stuff in for -rc3, it's pretty minimal and honestly makes the code much
easier to read and reason about.
>> ensure the segment count is correct. Honestly somewhat surprised that
>
> Right, I can pick up Nitesh's patch for that.
Sounds good.
>> the only odd fallout of that is (needlessly) hitting the bio split path.
>
> It's perfectly correct from the iter standpoint, AFAIK, length
> and nr of segments don't have to match. Though I am surprised
> it causes perf issues in the split path.
Theoretically it is, but it always makes me a bit nervous as there are
some _really_ odd iov_iter use cases out there. And passing down known
wrong segment counts is pretty wonky.
> Btw, where exactly does it stumble in there? I'd assume we don't
Because segments != 1, and then that hits the slower path.
> need to do the segment correction for kbuf as the bio splitting
> can do it (and probably does) in exactly the same way?
It doesn't strictly need to, but we should handle that case too. That'd
basically just be the loop addition I already did, something ala the
below on top for both of them:
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index d8fa7158e598..767ac89c8426 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1032,6 +1032,25 @@ static int validate_fixed_range(u64 buf_addr, size_t len,
return 0;
}
+static int io_import_kbuf(int ddir, struct iov_iter *iter,
+ struct io_mapped_ubuf *imu, size_t len, size_t offset)
+{
+ iov_iter_bvec(iter, ddir, iter->bvec, imu->nr_bvecs, len + offset);
+ iov_iter_advance(iter, offset);
+
+ if (len + offset < imu->len) {
+ const struct bio_vec *bvec = iter->bvec;
+
+ while (len > bvec->bv_len) {
+ len -= bvec->bv_len;
+ bvec++;
+ }
+ iter->nr_segs = bvec - iter->bvec;
+ }
+
+ return 0;
+}
+
static int io_import_fixed(int ddir, struct iov_iter *iter,
struct io_mapped_ubuf *imu,
u64 buf_addr, size_t len)
@@ -1054,13 +1073,9 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
* and advance us to the beginning.
*/
offset = buf_addr - imu->ubuf;
- bvec = imu->bvec;
- if (imu->is_kbuf) {
- iov_iter_bvec(iter, ddir, bvec, imu->nr_bvecs, offset + len);
- iov_iter_advance(iter, offset);
- return 0;
- }
+ if (imu->is_kbuf)
+ return io_import_kbuf(ddir, iter, imu, len, offset);
/*
* Don't use iov_iter_advance() here, as it's really slow for
@@ -1083,7 +1098,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
* have the size property of user registered ones, so we have
* to use the slow iter advance.
*/
-
+ bvec = imu->bvec;
if (offset >= bvec->bv_len) {
unsigned long seg_skip;
@@ -1094,7 +1109,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
offset &= (1UL << imu->folio_shift) - 1;
}
- nr_segs = imu->nr_bvecs - (bvec - imu->bvec);
+ nr_segs = ALIGN(offset + len, 1UL << imu->folio_shift) >> imu->folio_shift;
iov_iter_bvec(iter, ddir, bvec, nr_segs, len);
iter->iov_offset = offset;
return 0;
--
Jens Axboe
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 22:23 ` Jens Axboe
@ 2025-04-16 22:42 ` Jens Axboe
2025-04-17 9:12 ` Pavel Begunkov
1 sibling, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2025-04-16 22:42 UTC (permalink / raw)
To: Pavel Begunkov, Nitesh Shetty
Cc: Nitesh Shetty, gost.dev, io-uring, linux-kernel
On 4/16/25 4:23 PM, Jens Axboe wrote:
>>>> Should we just make it saner first? Sth like these 3 completely
>>>> untested commits
>>>>
>>>> https://github.com/isilence/linux/commits/rsrc-import-cleanup/
>>>>
>>>> And then it'll become
>>>>
>>>> nr_segs = ALIGN(offset + len, 1UL << folio_shift) >> folio_shift;
>>>
>>> Let's please do that, certainly an improvement. Care to send this out? I
>>> can toss them at the testing. And we'd still need that last patch to
>>
>> I need to test it first, perhaps tomorrow
>
> Sounds good, I'll run it through testing here too. Would be nice to
> stuff in for -rc3, it's pretty minimal and honestly makes the code much
> easier to read and reason about.
>
>>> ensure the segment count is correct. Honestly somewhat surprised that
>>
>> Right, I can pick up Nitesh's patch for that.
>
> Sounds good.
>
>>> the only odd fallout of that is (needlessly) hitting the bio split path.
>>
>> It's perfectly correct from the iter standpoint, AFAIK, length
>> and nr of segments don't have to match. Though I am surprised
>> it causes perf issues in the split path.
>
> Theoretically it is, but it always makes me a bit nervous as there are
> some _really_ odd iov_iter use cases out there. And passing down known
> wrong segment counts is pretty wonky.
>
>> Btw, where exactly does it stumble in there? I'd assume we don't
>
> Because segments != 1, and then that hits the slower path.
>
>> need to do the segment correction for kbuf as the bio splitting
>> can do it (and probably does) in exactly the same way?
>
> It doesn't strictly need to, but we should handle that case too. That'd
> basically just be the loop addition I already did, something ala the
> below on top for both of them:
Made a silly typo in the last patch (updated below), but with that
fixed, tested your 3 patches and that one on top and it passes both
liburing tests and kselftests for ublk (which does test kbuf imports)
too. Tested segment counting with a separate test case too, and it looks
good as well.
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index d8fa7158e598..7abc96b9260d 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1032,6 +1032,26 @@ static int validate_fixed_range(u64 buf_addr, size_t len,
return 0;
}
+static int io_import_kbuf(int ddir, struct iov_iter *iter,
+ struct io_mapped_ubuf *imu, size_t len, size_t offset)
+{
+ size_t count = len + offset;
+
+ iov_iter_bvec(iter, ddir, imu->bvec, imu->nr_bvecs, count);
+ iov_iter_advance(iter, offset);
+
+ if (count < imu->len) {
+ const struct bio_vec *bvec = iter->bvec;
+
+ while (len > bvec->bv_len) {
+ len -= bvec->bv_len;
+ bvec++;
+ }
+ iter->nr_segs = 1 + bvec - iter->bvec;
+ }
+ return 0;
+}
+
static int io_import_fixed(int ddir, struct iov_iter *iter,
struct io_mapped_ubuf *imu,
u64 buf_addr, size_t len)
@@ -1054,13 +1074,8 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
* and advance us to the beginning.
*/
offset = buf_addr - imu->ubuf;
- bvec = imu->bvec;
-
- if (imu->is_kbuf) {
- iov_iter_bvec(iter, ddir, bvec, imu->nr_bvecs, offset + len);
- iov_iter_advance(iter, offset);
- return 0;
- }
+ if (imu->is_kbuf)
+ return io_import_kbuf(ddir, iter, imu, len, offset);
/*
* Don't use iov_iter_advance() here, as it's really slow for
@@ -1083,7 +1098,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
* have the size property of user registered ones, so we have
* to use the slow iter advance.
*/
-
+ bvec = imu->bvec;
if (offset >= bvec->bv_len) {
unsigned long seg_skip;
@@ -1094,7 +1109,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
offset &= (1UL << imu->folio_shift) - 1;
}
- nr_segs = imu->nr_bvecs - (bvec - imu->bvec);
+ nr_segs = ALIGN(offset + len, 1UL << imu->folio_shift) >> imu->folio_shift;
iov_iter_bvec(iter, ddir, bvec, nr_segs, len);
iter->iov_offset = offset;
return 0;
--
Jens Axboe
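As a worked example of the final nr_segs math above, a small userspace
sketch with made-up values (a 64KiB buffer registered as 16 4KiB
folios, importing len = 8KiB at a 4KiB offset into the buffer):

#include <stdio.h>

#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned int folio_shift = 12;
	unsigned long bv_len = 4096;	/* first segment: one full folio */
	unsigned long offset = 4096, len = 8192, seg_skip = 0;

	/* front trim, mirroring the offset >= bvec->bv_len branch */
	if (offset >= bv_len) {
		offset -= bv_len;
		seg_skip = 1 + (offset >> folio_shift);
		offset &= (1UL << folio_shift) - 1;
	}

	/* tail trim: count only the segments this import spans */
	printf("seg_skip=%lu nr_segs=%lu\n", seg_skip,
	       ALIGN(offset + len, 1UL << folio_shift) >> folio_shift);
	/* prints seg_skip=1 nr_segs=2; without the fix nr_segs would
	 * have stayed at the 16 - 1 = 15 remaining folios */
	return 0;
}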
* Re: [PATCH] io_uring/rsrc: send exact nr_segs for fixed buffer
2025-04-16 22:23 ` Jens Axboe
2025-04-16 22:42 ` Jens Axboe
@ 2025-04-17 9:12 ` Pavel Begunkov
From: Pavel Begunkov @ 2025-04-17 9:12 UTC (permalink / raw)
To: Jens Axboe, Nitesh Shetty; +Cc: Nitesh Shetty, gost.dev, io-uring, linux-kernel
On 4/16/25 23:23, Jens Axboe wrote:
...
> Theoretically it is, but it always makes me a bit nervous as there are
> some _really_ odd iov_iter use cases out there. And passing down known
> wrong segment counts is pretty wonky.
>
>> Btw, where exactly does it stumble in there? I'd assume we don't
>
> Because segments != 1, and then that hits the slower path.
Right, but walking over the entire bvec is not fast either. Sounds
like you're saying it's even more expensive, and that's slightly
concerning.
>> need to do the segment correction for kbuf as the bio splitting
>> can do it (and probably does) in exactly the same way?
>
> It doesn't strictly need to, but we should handle that case too. That'd
> basically just be the loop addition I already did, something ala the
> below on top for both of them:
>
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index d8fa7158e598..767ac89c8426 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -1032,6 +1032,25 @@ static int validate_fixed_range(u64 buf_addr, size_t len,
> return 0;
> }
>
> +static int io_import_kbuf(int ddir, struct iov_iter *iter,
> + struct io_mapped_ubuf *imu, size_t len, size_t offset)
> +{
> + iov_iter_bvec(iter, ddir, iter->bvec, imu->nr_bvecs, len + offset);
> + iov_iter_advance(iter, offset);
> +
> + if (len + offset < imu->len) {
It should always be less than or equal; are you trying to handle
the latter?
> + const struct bio_vec *bvec = iter->bvec;
> +
> + while (len > bvec->bv_len) {
> + len -= bvec->bv_len;
> + bvec++;
> + }
> + iter->nr_segs = bvec - iter->bvec;
> + }
> +
> + return 0;
> +}
> +
--
Pavel Begunkov