From: Andres Freund <[email protected]>
To: [email protected], [email protected]
Subject: Buffered IO async context overhead
Date: Fri, 14 Feb 2020 11:50:30 -0800
Message-ID: <[email protected]>
Hi,
For workloads where a lot of buffered IO is done, I see considerably
higher overhead when going through io_uring than when just doing the IO
synchronously.

This example is on a Samsung 970 Pro NVMe SSD (i.e. decent, but not top
class). I've limited the amount of dirty buffers allowed to 2/4 GB to
keep the test times in check. This is with a relatively recent kernel,
33b40134e5cfbbccad7f3040d1919889537a3df7.
fio --name=t --time_based=1 --runtime=100 --ioengine=io_uring --rw=write --bs=8k --filesize=300G --iodepth=1
Jobs: 1 (f=1): [W(1)][100.0%][w=1049MiB/s][w=134k IOPS][eta 00m:00s]
t: (groupid=0, jobs=1): err= 0: pid=46625: Fri Feb 14 11:22:05 2020
write: IOPS=73.4k, BW=573MiB/s (601MB/s)(55.0GiB/100001msec); 0 zone resets
slat (nsec): min=571, max=607779, avg=2927.18, stdev=1213.47
clat (nsec): min=40, max=1400.7k, avg=9947.92, stdev=2606.65
lat (usec): min=4, max=1403, avg=12.94, stdev= 3.28
clat percentiles (nsec):
| 1.00th=[ 4960], 5.00th=[ 6368], 10.00th=[ 8032], 20.00th=[ 8640],
| 30.00th=[ 9024], 40.00th=[ 9280], 50.00th=[ 9664], 60.00th=[10048],
| 70.00th=[10560], 80.00th=[11072], 90.00th=[11968], 95.00th=[13120],
| 99.00th=[21632], 99.50th=[23424], 99.90th=[26752], 99.95th=[28288],
| 99.99th=[35584]
bw ( KiB/s): min=192408, max=1031483, per=79.92%, avg=468998.16, stdev=108691.70, samples=199
iops : min=24051, max=128935, avg=58624.37, stdev=13586.54, samples=199
lat (nsec) : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
lat (nsec) : 1000=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 10=56.98%, 20=41.67%, 50=1.32%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
lat (msec) : 2=0.01%
cpu : usr=15.63%, sys=34.79%, ctx=7333387, majf=0, minf=41
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,7335827,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=573MiB/s (601MB/s), 573MiB/s-573MiB/s (601MB/s-601MB/s), io=55.0GiB (60.1GB), run=100001-100001msec
Disk stats (read/write):
nvme0n1: ios=0/49033, merge=0/12, ticks=0/1741787, in_queue=1717299, util=20.89%
During this run, most of the time there are a number of tasks with decent CPU usage:
46627 root 20 0 0 0 0 R 99.7 0.0 1:07.74 io_wqe_worker-0
46625 root 20 0 756920 6708 1984 S 51.0 0.0 0:34.56 fio
42818 root 20 0 0 0 0 I 3.3 0.0 0:03.23 kworker/u82:6-flush-259:0
43338 root 20 0 0 0 0 I 1.3 0.0 0:02.06 kworker/u82:18-events_unbound
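
For reference, at the application level what fio's io_uring engine is
doing per write here boils down to roughly the following (a minimal
liburing sketch; fd, buf, len and nblocks are placeholders, setup and
error handling omitted):

#include <liburing.h>
#include <sys/uio.h>

/* depth-1 buffered writes: one submit + one wait per IO */
static void write_loop_qd1(int fd, char *buf, size_t len, long long nblocks)
{
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        struct iovec iov = { .iov_base = buf, .iov_len = len };

        io_uring_queue_init(8, &ring, 0);

        for (long long i = 0; i < nblocks; i++) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

                io_uring_prep_writev(sqe, fd, &iov, 1, i * len);
                io_uring_submit(&ring);         /* one submission syscall per IO */
                io_uring_wait_cqe(&ring, &cqe); /* wait for that single completion */
                io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
}

Note that for buffered writes the actual IO ends up punted to the io-wq
workers (the io_rw_async path visible in the profiles further down),
which is where the io_wqe_worker-0 CPU time above comes from.
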
Comparing that to using sync:
fio --name=t --time_based=1 --runtime=100 --ioengine=sync --rw=write --bs=8k --filesize=300G --iodepth=1
Jobs: 1 (f=1): [W(1)][100.0%][w=1607MiB/s][w=206k IOPS][eta 00m:00s]
t: (groupid=0, jobs=1): err= 0: pid=46690: Fri Feb 14 11:24:17 2020
write: IOPS=234k, BW=1831MiB/s (1920MB/s)(179GiB/100001msec); 0 zone resets
clat (nsec): min=1612, max=6649.7k, avg=3913.39, stdev=3170.65
lat (nsec): min=1653, max=6649.7k, avg=3955.30, stdev=3172.59
clat percentiles (nsec):
| 1.00th=[ 2512], 5.00th=[ 2544], 10.00th=[ 2576], 20.00th=[ 2640],
| 30.00th=[ 2672], 40.00th=[ 2768], 50.00th=[ 3376], 60.00th=[ 3792],
| 70.00th=[ 4192], 80.00th=[ 4832], 90.00th=[ 5792], 95.00th=[ 7008],
| 99.00th=[11968], 99.50th=[15552], 99.90th=[22656], 99.95th=[26496],
| 99.99th=[40192]
bw ( MiB/s): min= 989, max= 2118, per=82.26%, avg=1506.28, stdev=169.56, samples=193
iops : min=126684, max=271118, avg=192803.37, stdev=21703.94, samples=193
lat (usec) : 2=0.01%, 4=66.33%, 10=32.14%, 20=1.34%, 50=0.19%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 10=0.01%
cpu : usr=8.35%, sys=91.55%, ctx=6183, majf=0, minf=74
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,23437892,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1831MiB/s (1920MB/s), 1831MiB/s-1831MiB/s (1920MB/s-1920MB/s), io=179GiB (192GB), run=100001-100001msec
Disk stats (read/write):
nvme0n1: ios=0/152371, merge=0/587772, ticks=0/5657236, in_queue=5580764, util=67.57%
Tasks:
46690 root 20 0 756912 6524 1800 R 100.0 0.0 1:24.64 fio
2913 root 20 0 0 0 0 S 32.2 0.0 0:09.57 kswapd1
45683 root 20 0 0 0 0 I 13.2 0.0 0:08.57 kworker/u82:16-events_unbound
42816 root 20 0 0 0 0 I 11.2 0.0 0:09.71 kworker/u82:1-flush-259:0
36087 root 20 0 0 0 0 I 1.6 0.0 0:01.51 kworker/27:2-xfs-conv/nvme0n1
I.e. sync did about three times the throughput of io_uring.
It's possible to make io_uring perform better by batching
submissions/completions:
fio --name=t --time_based=1 --runtime=100 --ioengine=io_uring --rw=write --bs=8k --filesize=300G --iodepth=64 --iodepth_batch_submit=32 --iodepth_batch_complete_min=16 --iodepth_low=16
t: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=io_uring, iodepth=64
fio-3.17
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=1382MiB/s][w=177k IOPS][eta 00m:00s]
t: (groupid=0, jobs=1): err= 0: pid=46849: Fri Feb 14 11:28:13 2020
write: IOPS=194k, BW=1513MiB/s (1587MB/s)(148GiB/100001msec); 0 zone resets
slat (usec): min=4, max=455, avg=48.09, stdev=18.20
clat (usec): min=6, max=25435, avg=164.91, stdev=104.69
lat (usec): min=66, max=25512, avg=213.03, stdev=101.19
clat percentiles (usec):
| 1.00th=[ 60], 5.00th=[ 68], 10.00th=[ 78], 20.00th=[ 94],
| 30.00th=[ 108], 40.00th=[ 127], 50.00th=[ 143], 60.00th=[ 180],
| 70.00th=[ 210], 80.00th=[ 227], 90.00th=[ 265], 95.00th=[ 302],
| 99.00th=[ 379], 99.50th=[ 441], 99.90th=[ 562], 99.95th=[ 611],
| 99.99th=[ 816]
bw ( MiB/s): min= 289, max= 1802, per=81.00%, avg=1225.72, stdev=173.92, samples=197
iops : min=37086, max=230692, avg=156891.42, stdev=22261.80, samples=197
lat (usec) : 10=0.01%, 50=0.09%, 100=25.74%, 250=60.50%, 500=13.46%
lat (usec) : 750=0.19%, 1000=0.01%
lat (msec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=31.80%, sys=26.04%, ctx=793383, majf=0, minf=1301
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=66.7%, >=64=33.3%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=50.0%, 32=50.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,19369087,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=1513MiB/s (1587MB/s), 1513MiB/s-1513MiB/s (1587MB/s-1587MB/s), io=148GiB (159GB), run=100001-100001msec
Disk stats (read/write):
nvme0n1: ios=0/127095, merge=0/15, ticks=0/4936953, in_queue=4873341, util=55.82%
But that's still worse than the plain sync case, and it doesn't work
well for reads.
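
For reference, the batching those iodepth_batch_* options ask for looks
roughly like this at the liburing level (again just a sketch, with a
single reused buffer, placeholder fd/len/nblocks and no error handling):

#include <liburing.h>
#include <sys/uio.h>

#define QD              64
#define BATCH_SUBMIT    32
#define BATCH_COMPLETE  16

static void write_loop_batched(int fd, char *buf, size_t len, long long nblocks)
{
        struct io_uring ring;
        struct io_uring_cqe *cqes[QD];
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        long long next = 0, done = 0;

        io_uring_queue_init(QD, &ring, 0);

        while (done < nblocks) {
                int queued = 0;
                long long inflight;
                unsigned wait_nr, reaped;

                /* queue up to BATCH_SUBMIT writes, then submit them with one syscall */
                while (next < nblocks && queued < BATCH_SUBMIT && next - done < QD) {
                        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

                        io_uring_prep_writev(sqe, fd, &iov, 1, next * len);
                        next++;
                        queued++;
                }
                if (queued)
                        io_uring_submit(&ring);

                /* reap completions in batches rather than one at a time */
                inflight = next - done;
                wait_nr = inflight < BATCH_COMPLETE ? inflight : BATCH_COMPLETE;
                io_uring_wait_cqe_nr(&ring, &cqes[0], wait_nr);
                reaped = io_uring_peek_batch_cqe(&ring, cqes, QD);
                io_uring_cq_advance(&ring, reaped);
                done += reaped;
        }
        io_uring_queue_exit(&ring);
}

That amortizes the io_uring_enter() calls, but as the numbers show it
doesn't get rid of the per-IO cost on the worker side.
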
Now, it's clear that there's some overhead in backgrounding small IOs to
the async context, but this seems awfully high. There's still a slowdown
when using considerably larger pieces of IO (128k).
For the 128k, iodepth 1 case I get:
sync:
- 97.77% 0.04% fio fio [.] fio_syncio_queue
97.73% fio_syncio_queue
- __libc_write
- __libc_write
- 97.50% entry_SYSCALL_64
- 97.31% do_syscall_64
- 97.25% ksys_write
- 97.10% vfs_write
- 96.91% new_sync_write
- 96.82% xfs_file_buffered_aio_write
- 96.13% iomap_file_buffered_write
- iomap_apply
- 95.52% iomap_write_actor
+ 37.66% iov_iter_copy_from_user_atomic
+ 26.83% iomap_write_begin
+ 26.64% iomap_write_end.isra.0
2.26% iov_iter_fault_in_readable
0.87% balance_dirty_pages_ratelimited
and --no-children shows:
+ 36.57% fio [kernel.vmlinux] [k] copy_user_enhanced_fast_string
+ 14.31% fio [kernel.vmlinux] [k] iomap_set_range_uptodate
+ 3.85% fio [kernel.vmlinux] [k] get_page_from_freelist
+ 3.34% fio [kernel.vmlinux] [k] queued_spin_lock_slowpath
+ 2.94% fio [kernel.vmlinux] [k] xas_load
+ 2.66% fio [kernel.vmlinux] [k] __add_to_page_cache_locked
+ 2.51% fio [kernel.vmlinux] [k] _raw_spin_lock_irqsave
+ 2.28% fio [kernel.vmlinux] [k] iov_iter_fault_in_readable
+ 1.87% fio [kernel.vmlinux] [k] __lru_cache_add
+ 1.80% fio [kernel.vmlinux] [k] __pagevec_lru_add_fn
+ 1.59% fio [kernel.vmlinux] [k] node_dirty_ok
+ 1.30% fio [kernel.vmlinux] [k] _raw_spin_lock_irq
+ 1.30% fio [kernel.vmlinux] [k] iomap_set_page_dirty
+ 1.25% fio fio [.] get_io_u
+ 1.18% fio [kernel.vmlinux] [k] xas_start
io_uring (the io_wqe_worker-0):
- 100.00% 22.64% io_wqe_worker-0 [kernel.vmlinux] [k] io_wqe_worker
- 77.36% io_wqe_worker
- 77.18% io_worker_handle_work
- 75.80% io_rw_async
- io_wq_submit_work
- 75.57% io_issue_sqe
- io_write
- 70.96% xfs_file_buffered_aio_write
- 70.37% iomap_file_buffered_write
- iomap_apply
- 70.08% iomap_write_actor
+ 30.05% iov_iter_copy_from_user_atomic
- 18.38% iomap_write_begin
- 17.64% grab_cache_page_write_begin
+ 16.73% pagecache_get_page
0.51% wait_for_stable_page
- 16.96% iomap_write_end.isra.0
8.72% iomap_set_range_uptodate
+ 7.35% iomap_set_page_dirty
3.45% iov_iter_fault_in_readable
0.56% balance_dirty_pages_ratelimited
- 4.36% kiocb_done
- 3.56% io_cqring_ev_posted
+ __wake_up_common_lock
0.61% io_cqring_add_event
0.56% io_free_req
--no-children:
+ 29.55% io_wqe_worker-0 [kernel.vmlinux] [k] copy_user_enhanced_fast_string
+ 22.64% io_wqe_worker-0 [kernel.vmlinux] [k] io_wqe_worker
+ 8.71% io_wqe_worker-0 [kernel.vmlinux] [k] iomap_set_range_uptodate
+ 3.41% io_wqe_worker-0 [kernel.vmlinux] [k] iov_iter_fault_in_readable
+ 2.50% io_wqe_worker-0 [kernel.vmlinux] [k] get_page_from_freelist
+ 2.24% io_wqe_worker-0 [kernel.vmlinux] [k] xas_load
+ 1.82% io_wqe_worker-0 [kernel.vmlinux] [k] queued_spin_lock_slowpath
+ 1.80% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irqsave
+ 1.72% io_wqe_worker-0 [kernel.vmlinux] [k] __add_to_page_cache_locked
+ 1.58% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irq
+ 1.29% io_wqe_worker-0 [kernel.vmlinux] [k] __lru_cache_add
+ 1.27% io_wqe_worker-0 [kernel.vmlinux] [k] __pagevec_lru_add_fn
+ 1.08% io_wqe_worker-0 [kernel.vmlinux] [k] xas_start
So it seems the biggest difference is the CPU time in io_wqe_worker
itself, which in turn appears to mostly be:
│ mov %rbp,%rdi
│ → callq io_worker_handle_work
│ mov $0x3e7,%eax
│ ↓ jmp 28c
│ rep_nop():
│ }
│
│ /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
│ static __always_inline void rep_nop(void)
│ {
│ asm volatile("rep; nop" ::: "memory");
74.50 │281: pause
│ io_worker_spin_for_work():
│ while (++i < 1000) {
11.15 │ sub $0x1,%eax
│ ↑ je 9b
│ __read_once_size():
│ __READ_ONCE_SIZE;
9.53 │28c: mov 0x8(%rbx),%rdx
│ io_wqe_run_queue():
│ if (!wq_list_empty(&wqe->work_list) &&
│ test %rdx,%rdx
2.68 │ ↓ je 29f
│ io_worker_spin_for_work():
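
Reconstructed from that annotation (my paraphrase of the disassembly,
not the actual source), the loop is essentially:

/* after finishing its work the worker busy-polls the work list up to
 * ~1000 times, executing pause (rep_nop/cpu_relax) each round, before
 * it considers going to sleep */
static inline void io_worker_spin_for_work(struct io_wqe *wqe)
{
        int i = 0;

        while (++i < 1000) {
                if (io_wqe_run_queue(wqe))      /* work list non-empty, go handle it */
                        break;
                cpu_relax();                    /* the "pause" dominating the profile */
        }
}

I.e. the bulk of the io_wqe_worker self time is that pause loop,
spinning while waiting for the next work item.
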
With 8k it looks similar, but more extreme:
sync (fio itself):
+ 29.89% fio [kernel.vmlinux] [k] copy_user_enhanced_fast_string
+ 14.10% fio [kernel.vmlinux] [k] iomap_set_range_uptodate
+ 2.80% fio [kernel.vmlinux] [k] get_page_from_freelist
+ 2.66% fio [kernel.vmlinux] [k] queued_spin_lock_slowpath
+ 2.46% fio [kernel.vmlinux] [k] xas_load
+ 1.97% fio [kernel.vmlinux] [k] _raw_spin_lock_irqsave
+ 1.94% fio [kernel.vmlinux] [k] __add_to_page_cache_locked
+ 1.79% fio [kernel.vmlinux] [k] iomap_set_page_dirty
+ 1.64% fio [kernel.vmlinux] [k] iov_iter_fault_in_readable
+ 1.61% fio fio [.] __fio_gettime
+ 1.49% fio [kernel.vmlinux] [k] __pagevec_lru_add_fn
+ 1.39% fio [kernel.vmlinux] [k] __lru_cache_add
1.30% fio fio [.] get_io_u
+ 1.11% fio [kernel.vmlinux] [k] up_write
+ 1.10% fio [kernel.vmlinux] [k] node_dirty_ok
+ 1.09% fio [kernel.vmlinux] [k] down_write
io_uring (io_wqe_worker-0 again):
+ 54.13% io_wqe_worker-0 [kernel.vmlinux] [k] io_wqe_worker
+ 11.01% io_wqe_worker-0 [kernel.vmlinux] [k] copy_user_enhanced_fast_string
+ 2.70% io_wqe_worker-0 [kernel.vmlinux] [k] iomap_set_range_uptodate
+ 2.62% io_wqe_worker-0 [kernel.vmlinux] [k] iov_iter_fault_in_readable
+ 1.91% io_wqe_worker-0 [kernel.vmlinux] [k] __slab_free
+ 1.85% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irqsave
+ 1.66% io_wqe_worker-0 [kernel.vmlinux] [k] try_to_wake_up
+ 1.62% io_wqe_worker-0 [kernel.vmlinux] [k] io_cqring_fill_event
+ 1.55% io_wqe_worker-0 [kernel.vmlinux] [k] io_rw_async
+ 1.43% io_wqe_worker-0 [kernel.vmlinux] [k] __wake_up_common
+ 1.37% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irq
+ 1.20% io_wqe_worker-0 [kernel.vmlinux] [k] rw_verify_area
which I think is pretty clear evidence we're hitting fairly significant
contention on the queue lock.
I originally hit this in postgres, not fio, but I thought it'd be easier
to reproduce this way. There's obviously benefit to doing things in the
background, but it requires odd logic around deciding when to use
io_uring and when not.

To be clear, none of this happens with DIO, but I don't foresee
switching to DIO for all IO by default ever (too high demands on
accurate configuration).
Greetings,
Andres Freund