public inbox for io-uring@vger.kernel.org
* [PATCH] io_uring/sqpoll: Increase task_work submission batch size
@ 2025-05-08 18:12 Gabriel Krisman Bertazi
  2025-05-08 18:14 ` Gabriel Krisman Bertazi
  2025-05-09 13:57 ` Jens Axboe
  0 siblings, 2 replies; 7+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-05-08 18:12 UTC (permalink / raw)
  To: axboe, asml.silence; +Cc: io-uring, Gabriel Krisman Bertazi

Our QA team reported a 10%-23% throughput reduction on an io_uring
sqpoll testcase doing IO to a null_blk, which I traced back to reduced
utilization of the device submission queue depth. It turns out that,
after commit af5d68f8892f ("io_uring/sqpoll: manage task_work
privately"), we capped the number of task_work entries that can be
completed from a single spin of sqpoll to only 8 entries, before the
sqpoll goes around to (potentially) sleep.  While this cap doesn't
drive the submission side directly, it impacts completion behavior,
which in turn affects the number of IOs queued by fio per sqpoll cycle
on the submission side, so io_uring ends up seeing fewer IOs per sqpoll
cycle.  As a result, block layer plugging is less effective, we see
more time spent inside the block layer in profiling charts, and fio
measures increased submission latency.

There are other places whose overhead increases once sqpoll sleeps more
often, such as the sqpoll utilization calculation.  But, in this
microbenchmark, those were not prominent in perf charts, and removing
them didn't yield measurable changes in throughput.  The major overhead
comes from the fact that we plug less, and less often, when submitting
to the block layer.

My benchmark is:

fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
    --invalidate=1 --time_based  --ramp_time=10 --group_reporting=1 \
    --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
    --rw=randread --numjobs=1 --sqthread_poll

On one machine, tested on top of Linux 6.15-rc1, we have the following
baseline:
  READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec

With this patch:
  READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec

which is a 15% improvement in measured bandwidth.  The average
submission latency, as measured by fio, is noticeably lower too:

Baseline:
   lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
   lat (usec): min=26, max=226, avg=86.48, stdev=4.82

If we look at blktrace, we can also see that the plugging behavior is
improved. In the baseline, we end up limited to plugging 8 requests in
the block layer regardless of the device queue depth, while after
patching we can drive more IO and manage to utilize the full device
queue.

In the baseline, after a stabilization phase, an ordinary submission
looks like:
  254,0    1    49942     0.016028795  5977  U   N [iou-sqp-5976] 7

After patching, I consistently see more requests per unplug:
  254,0    1     4996     0.001432872  3145  U   N [iou-sqp-3144] 32
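
For reference, the lines above are blkparse output.  Something along
these lines captures the unplug ('U') events, where the trailing number
on each line is how many requests were released by that unplug:

  blktrace -d /dev/nullb0 -o - | blkparse -i - | grep ' U '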

Ideally, the cap would be at least deep enough to fill the device
queue, but we can't predict that behavior, or assume all IO goes to a
single device, and thus can't guess the ideal batch size.  We also
don't want to let the task_work run unbounded, though I'm not sure it
would really be a problem.  Instead, let's just give it a more sensible
value that allows for more efficient batching.  I've tested with
different cap values and initially proposed increasing the cap to 1024.
Jens argued that is too big of a bump, and with 32 I can no longer
observe this bottleneck on any of my machines.

Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
 io_uring/sqpoll.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index d037cc68e9d3..03c699493b5a 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -20,7 +20,7 @@
 #include "sqpoll.h"
 
 #define IORING_SQPOLL_CAP_ENTRIES_VALUE 8
-#define IORING_TW_CAP_ENTRIES_VALUE	8
+#define IORING_TW_CAP_ENTRIES_VALUE	32
 
 enum {
 	IO_SQ_THREAD_SHOULD_STOP = 0,
-- 
2.49.0


* [PATCH] io_uring/sqpoll: Increase task_work submission batch size
@ 2025-04-03 19:56 Gabriel Krisman Bertazi
  2025-04-03 20:26 ` Jens Axboe
  0 siblings, 1 reply; 7+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-04-03 19:56 UTC (permalink / raw)
  To: axboe; +Cc: io-uring, Gabriel Krisman Bertazi

Our QA team reported a 10%-23% throughput reduction on an io_uring
sqpoll testcase, which I traced back to a reduction of the device
submission queue depth utilization when doing IO over sqpoll. After
commit af5d68f8892f ("io_uring/sqpoll: manage task_work privately"), we
capped the number of task_work entries that can be executed from a
single spin of sqpoll to only 8 entries, before the sqpoll goes around
to try to sleep.  My understanding is that this starves the device, as
seen in device utilization, mostly because it reduces the opportunity
for plugging in the block layer.

A simple use case that showcases the issue is using sqpoll against a
null_blk device:

fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
    --invalidate=1 --time_based  --ramp_time=10 --group_reporting=1 \
    --filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
    --rw=randread --numjobs=1 --sqthread_poll

One QA test machine yielded, with the above command:

SLE Kernel predating af5d68f8892f:
 READ: bw=9839MiB/s (10.3GB/s), 9839MiB/s-9839MiB/s (10.3GB/s-10.3GB/s), io=2883GiB (3095GB), run=300001-300001msec

SLE kernel after af5d68f8892f:
 READ: bw=8288MiB/s (8691MB/s), 8288MiB/s-8288MiB/s (8691MB/s-8691MB/s), io=2428GiB (2607GB), run=300001-300001msec

Ideally, the task_work cap would be at least deep enough to fill the
device queue (assuming all uring commands target only one device), but
we can't predict that behavior and thus can't guess the batch size.  We
also don't want to let the task_work run unbounded, though I'm not sure
it is really a problem.  Instead, let's just give it a more sensible
value that allows for more efficient batching.

With this patch, my test machine (not the same as above) yielded a
consistent 10% throughput increase when doing randreads on nullb.  Our QE
team also reported it solved the regression on all machines they tested.

Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
 io_uring/sqpoll.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index d037cc68e9d3..e58e4d2b3bde 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -20,7 +20,7 @@
 #include "sqpoll.h"
 
 #define IORING_SQPOLL_CAP_ENTRIES_VALUE 8
-#define IORING_TW_CAP_ENTRIES_VALUE	8
+#define IORING_TW_CAP_ENTRIES_VALUE	1024
 
 enum {
 	IO_SQ_THREAD_SHOULD_STOP = 0,
-- 
2.49.0


