* [PATCH] io_uring/sqpoll: Increase task_work submission batch size
From: Gabriel Krisman Bertazi @ 2025-04-03 19:56 UTC (permalink / raw)
To: axboe; +Cc: io-uring, Gabriel Krisman Bertazi
Our QA team reported a 10%-23% throughput reduction on an io_uring
sqpoll testcase that I traced back to a reduction in device submission
queue depth utilization when doing IO through sqpoll. After commit
af5d68f8892f ("io_uring/sqpoll: manage task_work privately"), we capped
the number of task_work (tw) entries that can be executed from a single
spin of sqpoll to only 8 entries, before the sqpoll thread loops around
and tries to sleep. My understanding is that this starves the device,
as seen in device utilization, mostly because it reduces the opportunity
for plugging in the block layer.
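For context, here is roughly the loop in question: a paraphrased sketch
of io_sq_thread() after af5d68f8892f, simplified from my reading of
io_uring/sqpoll.c (should_stop and try_to_sleep() are illustrative
placeholders for the real exit/sleep handling, so treat this as an
illustration rather than the exact upstream code):

while (!should_stop) {				/* placeholder for the real exit checks */
	bool sqt_spin = false;

	/* Submit whatever is sitting in each ring's SQ. */
	list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
		if (__io_sq_thread(ctx, cap_entries) > 0)
			sqt_spin = true;
	}

	/*
	 * Run privately managed task_work, but only up to
	 * IORING_TW_CAP_ENTRIES_VALUE (8) completions per spin; the
	 * remainder is parked on retry_list for the next iteration.
	 */
	if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE))
		sqt_spin = true;

	/* Nothing left within the caps: consider going to sleep. */
	if (!sqt_spin && time_after(jiffies, timeout))
		try_to_sleep();			/* placeholder */
}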
A simple use case that showcases the issue is using sqpoll against a
null_blk device:
fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
--invalidate=1 --time_based --ramp_time=10 --group_reporting=1 \
--filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
--rw=randread --numjobs=1 --sqthread_poll
One QA test machine yielded, with the above command:
SLE Kernel predating af5d68f8892f:
READ: bw=9839MiB/s (10.3GB/s), 9839MiB/s-9839MiB/s (10.3GB/s-10.3GB/s), io=2883GiB (3095GB), run=300001-300001msec
SLE kernel after af5d68f8892f:
READ: bw=8288MiB/s (8691MB/s), 8288MiB/s-8288MiB/s (8691MB/s-8691MB/s), io=2428GiB (2607GB), run=300001-300001msec
Ideally, the tw cap would at least be deep enough to fill the device
queue (assuming all uring commands target only one device), but we
can't predict that behavior and thus can't guess the right batch size.
We also don't want to let the tw run unbounded, though I'm not sure it
is really a problem. Instead, let's just give it a more sensible value that
will allow for more efficient batching.
With this patch, my test machine (not the same as above) yielded a
consistent 10% throughput increase when doing random reads against
null_blk. Our QE team also reported that it solved the regression on all
machines they tested.
Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
io_uring/sqpoll.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index d037cc68e9d3..e58e4d2b3bde 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -20,7 +20,7 @@
 #include "sqpoll.h"
 
 #define IORING_SQPOLL_CAP_ENTRIES_VALUE 8
-#define IORING_TW_CAP_ENTRIES_VALUE 8
+#define IORING_TW_CAP_ENTRIES_VALUE 1024
 
 enum {
 	IO_SQ_THREAD_SHOULD_STOP = 0,
--
2.49.0
* Re: [PATCH] io_uring/sqpoll: Increase task_work submission batch size
From: Jens Axboe @ 2025-04-03 20:26 UTC (permalink / raw)
To: Gabriel Krisman Bertazi; +Cc: io-uring
On 4/3/25 1:56 PM, Gabriel Krisman Bertazi wrote:
> diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
> index d037cc68e9d3..e58e4d2b3bde 100644
> --- a/io_uring/sqpoll.c
> +++ b/io_uring/sqpoll.c
> @@ -20,7 +20,7 @@
> #include "sqpoll.h"
>
> #define IORING_SQPOLL_CAP_ENTRIES_VALUE 8
> -#define IORING_TW_CAP_ENTRIES_VALUE 8
> +#define IORING_TW_CAP_ENTRIES_VALUE 1024
That's a huge bump! This should not be a submission side thing, it's
purely running the task work. For this test case, I'm assuming you don't
see any io-wq activity, and hence everything is done purely inline from
the SQPOLL thread? This confuses me a bit, as this should not be driving
the queue depth at all, as submissions would be done by
__io_sq_thread(). And that part only caps when there is more than a
single ctx in there, which your case would not have. IOW, it should
submit everything that's there and hence this change should not change
the submission/queueing side of things. It only really deals with
running the task_work that will post the completion.
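For reference, the cap I'm referring to looks roughly like this (a
paraphrased sketch of __io_sq_thread(), not verbatim; wakeup and error
handling omitted):

static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
{
	unsigned int to_submit = io_sqring_entries(ctx);
	int ret = 0;

	if (to_submit) {
		/*
		 * Fairness cap: only applied when the sqpoll thread is
		 * serving more than one ring (cap_entries == true). A
		 * single-ctx workload like this fio job submits
		 * everything sitting in the SQ.
		 */
		if (cap_entries && to_submit > IORING_SQPOLL_CAP_ENTRIES_VALUE)
			to_submit = IORING_SQPOLL_CAP_ENTRIES_VALUE;

		mutex_lock(&ctx->uring_lock);
		ret = io_submit_sqes(ctx, to_submit);
		mutex_unlock(&ctx->uring_lock);
	}
	return ret;
}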
Maybe we should just not submit more until we've depleted the tw list?
In any case, we can _probably_ make this 32 or something without
worrying too much about it, though I would like to fully understand why
it's slower. Maybe it's the getrusage() that we do for every loop? You
could try and disable that just to see if it makes a difference?
--
Jens Axboe
* Re: [PATCH] io_uring/sqpoll: Increase task_work submission batch size
From: Gabriel Krisman Bertazi @ 2025-04-04 1:18 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring
Jens Axboe <axboe@kernel.dk> writes:
> On 4/3/25 1:56 PM, Gabriel Krisman Bertazi wrote:
>> diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
>> #define IORING_SQPOLL_CAP_ENTRIES_VALUE 8
>> -#define IORING_TW_CAP_ENTRIES_VALUE 8
>> +#define IORING_TW_CAP_ENTRIES_VALUE 1024
>
> That's a huge bump! This should not be a submission side thing, it's
> purely running the task work. For this test case, I'm assuming you don't
> see any io-wq activity, and hence everything is done purely inline from
> the SQPOLL thread?
> This confuses me a bit, as this should not be driving
> the queue depth at all, as submissions would be done by
> __io_sq_thread().
Indeed, the submission happens fully inside __io_sq_thread, and I can
confirm that from the profile. What is interesting is that, once I lift
the cap, we end up spending more time inside io_submit_sqes, which means
it is able to drive more requests.
Let me share the profile in case it rings a bell:
This is perf record on the slow kernel:
   - 49.30% io_sq_thread
      - 41.86% io_submit_sqes
         - 20.57% io_issue_sqe
            - 19.89% io_read
               - __io_read
                  - 18.19% blkdev_read_iter
                     - 17.84% blkdev_direct_IO.part.21
                        + 7.25% submit_bio_noacct_nocheck
                        + 6.49% bio_iov_iter_get_pages
                        + 1.80% bio_alloc_bioset
                          1.27% bio_set_pages_dirty
                        + 0.78% security_file_permission
         - 10.88% blk_finish_plug
            - __blk_flush_plug
               - 10.80% blk_mq_flush_plug_list.part.88
                  + 10.69% null_queue_rqs
           0.83% io_prep_rw
      - 4.11% io_sq_tw
         - 3.62% io_handle_tw_list
            - 2.76% ctx_flush_and_put.isra.72
                 2.67% __io_submit_flush_completions
              0.58% io_req_rw_complete
      + 1.15% io_sq_update_worktime.isra.9
        1.05% mutex_unlock
      + 1.05% getrusage
After my patch:
   - 50.07% io_sq_thread
      - 47.22% io_submit_sqes
         - 38.04% io_issue_sqe
            - 37.19% io_read
               - 37.10% __io_read
                  - 34.79% blkdev_read_iter
                     - 34.34% blkdev_direct_IO.part.21
                        + 21.01% submit_bio_noacct_nocheck
                        + 8.30% bio_iov_iter_get_pages
                        + 2.21% bio_alloc_bioset
                        + 1.52% bio_set_pages_dirty
                        + 1.19% security_file_permission
         - 3.29% blk_finish_plug
            - __blk_flush_plug
               - 3.27% blk_mq_flush_plug_list.part.88
                  - 3.25% null_queue_rqs
                     + null_queue_rq
           1.16% io_prep_rw
      - 2.25% io_sq_tw
         - tctx_task_work_run
            - 2.00% io_handle_tw_list
               - 1.08% ctx_flush_and_put.isra.72
                    1.07% __io_submit_flush_completions
                 0.68% io_req_rw_complete
> And that part only caps when there is more than a
> single ctx in there, which your case would not have. IOW, it should
> submit everything that's there and hence this change should not change
> the submission/queueing side of things. It only really deals with
> running the task_work that will post the completion.
>
> Maybe we should just not submit more until we've depleted the tw list?
>
> In any case, we can _probably_ make this 32 or something without
> worrying too much about it, though I would like to fully understand why
> it's slower. Maybe it's the getrusage() that we do for every loop? You
> could try and disable that just to see if it makes a difference?
While the overhead of the usage accounting is very visible in the
profile, my first test when I got this bug was to drop that code, and it
had very little impact on throughput (around 1%). The main difference
really seems to be the number of IOs we queue per iteration. In
fact, looking at iostat, I can see a very noticeable difference in
aqu-sz between the two kernels.
A lower limit should work, yes, but I'm also quite curious how the tw
cap affects the submission. But also, what is the reason to cap it in
the first place? io_handle_tw_list does a cond_resched() on each
iteration, so it won't hog the CPU and, if we drop the cap, we get the
behavior of not submitting more until the tw list is empty, as you
suggested.
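For reference, this is roughly the loop I mean (a paraphrased sketch of
io_handle_tw_list(), not verbatim; run_one_task_work() is an
illustrative stand-in for the per-request completion callback):

struct llist_node *io_handle_tw_list(struct llist_node *node,
				     unsigned int *count,
				     unsigned int max_entries)
{
	do {
		struct llist_node *next = node->next;

		run_one_task_work(node);	/* illustrative: posts the CQE for one request */
		node = next;
		(*count)++;

		if (need_resched())
			cond_resched();		/* the loop already yields the CPU */
	} while (node && *count < max_entries);

	return node;	/* whatever is left goes back on the retry list */
}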
--
Gabriel Krisman Bertazi
* Re: [PATCH] io_uring/sqpoll: Increase task_work submission batch size
From: Gabriel Krisman Bertazi @ 2025-04-07 15:47 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring
Gabriel Krisman Bertazi <krisman@suse.de> writes:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> On 4/3/25 1:56 PM, Gabriel Krisman Bertazi wrote:
>>> diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
>>> #define IORING_SQPOLL_CAP_ENTRIES_VALUE 8
>>> -#define IORING_TW_CAP_ENTRIES_VALUE 8
>>> +#define IORING_TW_CAP_ENTRIES_VALUE 1024
>>
>> That's a huge bump! This should not be a submission side thing, it's
>> purely running the task work. For this test case, I'm assuming you don't
>> see any io-wq activity, and hence everything is done purely inline from
>> the SQPOLL thread?
>> This confuses me a bit, as this should not be driving
>> the queue depth at all, as submissions would be done by
>> __io_sq_thread().
>
> Indeed, the submission happens fully inside __io_sq_thread, and I can
> confirm that from the profile. What is interesting is that, once I lift
> the cap, we end up spending more time inside io_submit_sqes, which means
> it is able to drive more requests.
I think I have more input on what's happening.
Regarding the tw batch not driving the submission: this is a typical
submission with IORING_TW_CAP_ENTRIES_VALUE = 8:
254,0 1 49927 0.016024812 5977 Q R 2061024 + 8 [iou-sqp-5976]
254,0 1 49928 0.016025044 5977 G R 2061024 + 8 [iou-sqp-5976]
254,0 1 49929 0.016025116 5977 P N [iou-sqp-5976]
254,0 1 49930 0.016025594 5977 Q R 1132240 + 8 [iou-sqp-5976]
254,0 1 49931 0.016025713 5977 G R 1132240 + 8 [iou-sqp-5976]
254,0 1 49932 0.016026205 5977 Q R 1187696 + 8 [iou-sqp-5976]
254,0 1 49933 0.016026317 5977 G R 1187696 + 8 [iou-sqp-5976]
254,0 1 49934 0.016026811 5977 Q R 1716272 + 8 [iou-sqp-5976]
254,0 1 49935 0.016026927 5977 G R 1716272 + 8 [iou-sqp-5976]
254,0 1 49936 0.016027447 5977 Q R 276336 + 8 [iou-sqp-5976]
254,0 1 49937 0.016027565 5977 G R 276336 + 8 [iou-sqp-5976]
254,0 1 49938 0.016028005 5977 Q R 1672040 + 8 [iou-sqp-5976]
254,0 1 49939 0.016028116 5977 G R 1672040 + 8 [iou-sqp-5976]
254,0 1 49940 0.016028551 5977 Q R 1770880 + 8 [iou-sqp-5976]
254,0 1 49941 0.016028685 5977 G R 1770880 + 8 [iou-sqp-5976]
254,0 1 49942 0.016028795 5977 U N [iou-sqp-5976] 7
We plug 7 requests and flush them all together (in the blktrace output,
Q/G mark a bio being queued and a request being allocated, P marks a
plug, and the trailing number on the U unplug line is how many requests
were flushed). With IORING_TW_CAP_ENTRIES_VALUE=1024, submissions
generally look like this:
254,0 1 4931 0.001414021 3145 P N [iou-sqp-3144]
254,0 1 4932 0.001414415 3145 Q R 1268736 + 8 [iou-sqp-3144]
254,0 1 4933 0.001414584 3145 G R 1268736 + 8 [iou-sqp-3144]
254,0 1 4934 0.001414990 3145 Q R 1210304 + 8 [iou-sqp-3144]
254,0 1 4935 0.001415145 3145 G R 1210304 + 8 [iou-sqp-3144]
254,0 1 4936 0.001415553 3145 Q R 1476352 + 8 [iou-sqp-3144]
254,0 1 4937 0.001415722 3145 G R 1476352 + 8 [iou-sqp-3144]
254,0 1 4938 0.001416130 3145 Q R 1291752 + 8 [iou-sqp-3144]
254,0 1 4939 0.001416302 3145 G R 1291752 + 8 [iou-sqp-3144]
254,0 1 4940 0.001416763 3145 Q R 1171664 + 8 [iou-sqp-3144]
254,0 1 4941 0.001416928 3145 G R 1171664 + 8 [iou-sqp-3144]
254,0 1 4942 0.001417444 3145 Q R 197424 + 8 [iou-sqp-3144]
254,0 1 4943 0.001417602 3145 G R 197424 + 8 [iou-sqp-3144]
[...]
[...]
254,0 1 4993 0.001432191 3145 G R 371656 + 8 [iou-sqp-3144]
254,0 1 4994 0.001432601 3145 Q R 1864408 + 8 [iou-sqp-3144]
254,0 1 4995 0.001432771 3145 G R 1864408 + 8 [iou-sqp-3144]
254,0 1 4996 0.001432872 3145 U N [iou-sqp-3144] 32
So I'm able to drive far more IO per plug with my patch.
If I plot a histogram of the to_submit argument of io_submit_sqes,
which is exactly io_sqring_entries(ctx) since I have only one ctx, I
see that there is much less IO available in the ring to submit in the
first place. So, because sqpoll is looping around more (and going to
sleep more often), it completes fewer IOs per cycle, which causes fio to
submit less, as suggested by the smaller io_sqring_entries values. Does
that make sense?
To retest, I fully dropped the accounting code and I can still reproduce
the same submission pattern. It really seems to depend on whether we go
to sleep after completing a small tw batch.
This is what I got from existing logs. I'm a bit limited with testing at
the moment, as I lost the machine where I could reproduce the regression
(my other machine yields the same IO pattern, but no numerical
regression). But I thought it might be worth sharing in case I'm being
silly and you can call me out immediately. I'll reproduce it in the next
few days, once I get more time on the shared machine.
--
Gabriel Krisman Bertazi
* [PATCH] io_uring/sqpoll: Increase task_work submission batch size
From: Gabriel Krisman Bertazi @ 2025-05-08 18:12 UTC (permalink / raw)
To: axboe, asml.silence; +Cc: io-uring, Gabriel Krisman Bertazi
Our QA team reported a 10%-23% throughput reduction on an io_uring
sqpoll testcase doing IO to a null_blk, that I traced back to a
reduction in device submission queue depth utilization. It turns out
that, after commit af5d68f8892f ("io_uring/sqpoll: manage task_work
privately"), we capped the number of task_work entries that can be
completed from a single spin of sqpoll to only 8 entries, before the
sqpoll thread goes around to (potentially) sleep. While this cap doesn't
drive the submission side directly, it impacts the completion behavior,
which affects the number of IOs queued by fio per sqpoll cycle on the
submission side, so io_uring ends up seeing fewer IOs per sqpoll cycle.
As a result, block layer plugging is less effective, and we see more
time spent inside the block layer in profiling charts, and increased
submission latency measured by fio.
There are other places that have increased overhead once sqpoll sleeps
more often, such as the sqpoll utilization calculation. But, in this
microbenchmark, those were not representative enough in perf charts, and
their removal didn't yield measurable changes in throughput. The major
overhead comes from the fact that we plug less, and less often, when
submitting to the block layer.
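To illustrate the mechanism (a sketch of the pattern only, not code from
this series; blk_start_plug()/blk_finish_plug() are the existing
block-layer API, and the loop body is illustrative): submission runs
under a block-layer plug, visible as blk_finish_plug() under
io_submit_sqes() in the profiles, so the number of requests dispatched
per flush is bounded by how many SQEs are ready when sqpoll gets around
to submitting:

	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);
	for (i = 0; i < nr_ready_sqes; i++)	/* nr_ready_sqes: illustrative */
		submit_one_direct_read();	/* illustrative; ends up in blk-mq via blkdev_direct_IO */
	blk_finish_plug(&plug);			/* one flush dispatches the whole batch to the driver */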
My benchmark is:
fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
--invalidate=1 --time_based --ramp_time=10 --group_reporting=1 \
--filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
--rw=randread --numjobs=1 --sqthread_poll
On one machine, tested on top of Linux 6.15-rc1, we have the following
baseline:
READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec
With this patch:
READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec
which is a 15% improvement in measured bandwidth. The average
submission latency, as measured by fio, is noticeably lower too:
Baseline:
lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
lat (usec): min=26, max=226, avg=86.48, stdev=4.82
If we look at blktrace, we can also see that the plugging behavior is
improved. In the baseline, we end up limited to plugging 8 requests in
the block layer regardless of the device queue depth, while after
patching we can drive more IO and manage to utilize the full device
queue.
In the baseline, after a stabilization phase, an ordinary submission
looks like:
254,0 1 49942 0.016028795 5977 U N [iou-sqp-5976] 7
After patching, I see consistently more requests per unplug:
254,0 1 4996 0.001432872 3145 U N [iou-sqp-3144] 32
Ideally, the cap would at least be deep enough to fill the device
queue, but we can't predict that behavior, or assume all IO goes
to a single device, and thus can't guess the ideal batch size. We also
don't want to let the tw run unbounded, though I'm not sure it would
really be a problem. Instead, let's just give it a more sensible value
that will allow for more efficient batching. I've tested with different
cap values, and initially proposed to increase the cap to 1024. Jens
argued that it is too big of a bump; with 32, I can no longer observe
this bottleneck on any of my machines.
Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
io_uring/sqpoll.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index d037cc68e9d3..03c699493b5a 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -20,7 +20,7 @@
 #include "sqpoll.h"
 
 #define IORING_SQPOLL_CAP_ENTRIES_VALUE 8
-#define IORING_TW_CAP_ENTRIES_VALUE 8
+#define IORING_TW_CAP_ENTRIES_VALUE 32
 
 enum {
 	IO_SQ_THREAD_SHOULD_STOP = 0,
--
2.49.0
* Re: [PATCH] io_uring/sqpoll: Increase task_work submission batch size
From: Gabriel Krisman Bertazi @ 2025-05-08 18:14 UTC (permalink / raw)
To: axboe; +Cc: asml.silence, io-uring
Gabriel Krisman Bertazi <krisman@suse.de> writes:
> Our QA team reported a 10%-23% throughput reduction on an io_uring
> sqpoll testcase doing IO to a null_blk, that I traced back to a
> reduction in device submission queue depth utilization. It turns out
> that, after commit af5d68f8892f ("io_uring/sqpoll: manage task_work
> privately"), we capped the number of task_work entries that can be
> completed from a single spin of sqpoll to only 8 entries, before the
> sqpoll thread goes around to (potentially) sleep. While this cap doesn't
> drive the submission side directly, it impacts the completion behavior,
> which affects the number of IOs queued by fio per sqpoll cycle on the
> submission side, so io_uring ends up seeing fewer IOs per sqpoll cycle.
> As a result, block layer plugging is less effective, and we see more
> time spent inside the block layer in profiling charts, and increased
> submission latency measured by fio.
Uh, this should have been tagged as v2. The original submission and
some discussion are available at:
https://lore.kernel.org/all/94da2142-d7c1-46bb-bc35-05d0d1c28182@kernel.dk/
--
Gabriel Krisman Bertazi
* Re: [PATCH] io_uring/sqpoll: Increase task_work submission batch size
From: Jens Axboe @ 2025-05-09 13:57 UTC (permalink / raw)
To: asml.silence, Gabriel Krisman Bertazi; +Cc: io-uring
On Thu, 08 May 2025 14:12:03 -0400, Gabriel Krisman Bertazi wrote:
> Our QA team reported a 10%-23% throughput reduction on an io_uring
> sqpoll testcase doing IO to a null_blk, that I traced back to a
> reduction in device submission queue depth utilization. It turns out
> that, after commit af5d68f8892f ("io_uring/sqpoll: manage task_work
> privately"), we capped the number of task_work entries that can be
> completed from a single spin of sqpoll to only 8 entries, before the
> sqpoll thread goes around to (potentially) sleep. While this cap doesn't
> drive the submission side directly, it impacts the completion behavior,
> which affects the number of IOs queued by fio per sqpoll cycle on the
> submission side, so io_uring ends up seeing fewer IOs per sqpoll cycle.
> As a result, block layer plugging is less effective, and we see more
> time spent inside the block layer in profiling charts, and increased
> submission latency measured by fio.
>
> [...]
Applied, thanks!
[1/1] io_uring/sqpoll: Increase task_work submission batch size
(no commit info)
Best regards,
--
Jens Axboe