* Buffered IO async context overhead
From: Andres Freund @ 2020-02-14 19:50 UTC
To: io-uring, axboe
Hi,
For workloads where a lot of buffered IO is done, I see considerably
higher overhead when going through io_uring than when just doing the IO
synchronously.
This example is on a samsung 970 pro nvme (i.e. decent, but not top
class). I've limited the amount of dirty buffers allowed to 2/4 GB to
keep the test times in check. This is with a relatively recent kernel,
33b40134e5cfbbccad7f3040d1919889537a3df7 .
fio --name=t --time_based=1 --runtime=100 --ioengine=io_uring --rw=write --bs=8k --filesize=300G --iodepth=1
Jobs: 1 (f=1): [W(1)][100.0%][w=1049MiB/s][w=134k IOPS][eta 00m:00s]
t: (groupid=0, jobs=1): err= 0: pid=46625: Fri Feb 14 11:22:05 2020
write: IOPS=73.4k, BW=573MiB/s (601MB/s)(55.0GiB/100001msec); 0 zone resets
slat (nsec): min=571, max=607779, avg=2927.18, stdev=1213.47
clat (nsec): min=40, max=1400.7k, avg=9947.92, stdev=2606.65
lat (usec): min=4, max=1403, avg=12.94, stdev= 3.28
clat percentiles (nsec):
| 1.00th=[ 4960], 5.00th=[ 6368], 10.00th=[ 8032], 20.00th=[ 8640],
| 30.00th=[ 9024], 40.00th=[ 9280], 50.00th=[ 9664], 60.00th=[10048],
| 70.00th=[10560], 80.00th=[11072], 90.00th=[11968], 95.00th=[13120],
| 99.00th=[21632], 99.50th=[23424], 99.90th=[26752], 99.95th=[28288],
| 99.99th=[35584]
bw ( KiB/s): min=192408, max=1031483, per=79.92%, avg=468998.16, stdev=108691.70, samples=199
iops : min=24051, max=128935, avg=58624.37, stdev=13586.54, samples=199
lat (nsec) : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
lat (nsec) : 1000=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 10=56.98%, 20=41.67%, 50=1.32%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
lat (msec) : 2=0.01%
cpu : usr=15.63%, sys=34.79%, ctx=7333387, majf=0, minf=41
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,7335827,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=573MiB/s (601MB/s), 573MiB/s-573MiB/s (601MB/s-601MB/s), io=55.0GiB (60.1GB), run=100001-100001msec
Disk stats (read/write):
nvme0n1: ios=0/49033, merge=0/12, ticks=0/1741787, in_queue=1717299, util=20.89%
During this, most of the time there are a number of tasks with decent CPU usage:
46627 root 20 0 0 0 0 R 99.7 0.0 1:07.74 io_wqe_worker-0
46625 root 20 0 756920 6708 1984 S 51.0 0.0 0:34.56 fio
42818 root 20 0 0 0 0 I 3.3 0.0 0:03.23 kworker/u82:6-flush-259:0
43338 root 20 0 0 0 0 I 1.3 0.0 0:02.06 kworker/u82:18-events_unbound
Comparing that to using sync:
fio --name=t --time_based=1 --runtime=100 --ioengine=sync --rw=write --bs=8k --filesize=300G --iodepth=1
Jobs: 1 (f=1): [W(1)][100.0%][w=1607MiB/s][w=206k IOPS][eta 00m:00s]
t: (groupid=0, jobs=1): err= 0: pid=46690: Fri Feb 14 11:24:17 2020
write: IOPS=234k, BW=1831MiB/s (1920MB/s)(179GiB/100001msec); 0 zone resets
clat (nsec): min=1612, max=6649.7k, avg=3913.39, stdev=3170.65
lat (nsec): min=1653, max=6649.7k, avg=3955.30, stdev=3172.59
clat percentiles (nsec):
| 1.00th=[ 2512], 5.00th=[ 2544], 10.00th=[ 2576], 20.00th=[ 2640],
| 30.00th=[ 2672], 40.00th=[ 2768], 50.00th=[ 3376], 60.00th=[ 3792],
| 70.00th=[ 4192], 80.00th=[ 4832], 90.00th=[ 5792], 95.00th=[ 7008],
| 99.00th=[11968], 99.50th=[15552], 99.90th=[22656], 99.95th=[26496],
| 99.99th=[40192]
bw ( MiB/s): min= 989, max= 2118, per=82.26%, avg=1506.28, stdev=169.56, samples=193
iops : min=126684, max=271118, avg=192803.37, stdev=21703.94, samples=193
lat (usec) : 2=0.01%, 4=66.33%, 10=32.14%, 20=1.34%, 50=0.19%
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 10=0.01%
cpu : usr=8.35%, sys=91.55%, ctx=6183, majf=0, minf=74
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,23437892,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1831MiB/s (1920MB/s), 1831MiB/s-1831MiB/s (1920MB/s-1920MB/s), io=179GiB (192GB), run=100001-100001msec
Disk stats (read/write):
nvme0n1: ios=0/152371, merge=0/587772, ticks=0/5657236, in_queue=5580764, util=67.57%
Tasks:
46690 root 20 0 756912 6524 1800 R 100.0 0.0 1:24.64 fio
2913 root 20 0 0 0 0 S 32.2 0.0 0:09.57 kswapd1
45683 root 20 0 0 0 0 I 13.2 0.0 0:08.57 kworker/u82:16-events_unbound
42816 root 20 0 0 0 0 I 11.2 0.0 0:09.71 kworker/u82:1-flush-259:0
36087 root 20 0 0 0 0 I 1.6 0.0 0:01.51 kworker/27:2-xfs-conv/nvme0n1
I.e. sync did about three times the throughput of io_uring.
It's possible to make io_uring perform better, by batching
submissions/completions:
fio --name=t --time_based=1 --runtime=100 --ioengine=io_uring --rw=write --bs=8k --filesize=300G --iodepth=64 --iodepth_batch_submit=32 --iodepth_batch_complete_min=16 --iodepth_low=16
t: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=io_uring, iodepth=64
fio-3.17
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=1382MiB/s][w=177k IOPS][eta 00m:00s]
t: (groupid=0, jobs=1): err= 0: pid=46849: Fri Feb 14 11:28:13 2020
write: IOPS=194k, BW=1513MiB/s (1587MB/s)(148GiB/100001msec); 0 zone resets
slat (usec): min=4, max=455, avg=48.09, stdev=18.20
clat (usec): min=6, max=25435, avg=164.91, stdev=104.69
lat (usec): min=66, max=25512, avg=213.03, stdev=101.19
clat percentiles (usec):
| 1.00th=[ 60], 5.00th=[ 68], 10.00th=[ 78], 20.00th=[ 94],
| 30.00th=[ 108], 40.00th=[ 127], 50.00th=[ 143], 60.00th=[ 180],
| 70.00th=[ 210], 80.00th=[ 227], 90.00th=[ 265], 95.00th=[ 302],
| 99.00th=[ 379], 99.50th=[ 441], 99.90th=[ 562], 99.95th=[ 611],
| 99.99th=[ 816]
bw ( MiB/s): min= 289, max= 1802, per=81.00%, avg=1225.72, stdev=173.92, samples=197
iops : min=37086, max=230692, avg=156891.42, stdev=22261.80, samples=197
lat (usec) : 10=0.01%, 50=0.09%, 100=25.74%, 250=60.50%, 500=13.46%
lat (usec) : 750=0.19%, 1000=0.01%
lat (msec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
cpu : usr=31.80%, sys=26.04%, ctx=793383, majf=0, minf=1301
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=66.7%, >=64=33.3%
submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=50.0%, 32=50.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,19369087,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=1513MiB/s (1587MB/s), 1513MiB/s-1513MiB/s (1587MB/s-1587MB/s), io=148GiB (159GB), run=100001-100001msec
Disk stats (read/write):
nvme0n1: ios=0/127095, merge=0/15, ticks=0/4936953, in_queue=4873341, util=55.82%
but that's still worse than the plain sync one. And doesn't work well
for reads.
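Roughly, the batching above corresponds to doing something like the following from
userspace - a minimal liburing sketch with the fd/buffer setup assumed and error
handling omitted, not what fio does internally:

/*
 * Minimal sketch: batch SQE submission and CQE reaping for buffered
 * writes with liburing.  fd, buf, bs, off and nr are assumed to be
 * set up by the caller; error handling omitted for brevity.
 */
#include <liburing.h>

#define QD    64
#define BATCH 32

static void write_batched(int fd, char *buf, size_t bs, off_t off, long nr)
{
	struct io_uring ring;
	struct io_uring_cqe *cqes[BATCH];
	struct iovec iov[BATCH];
	long inflight = 0;

	io_uring_queue_init(QD, &ring, 0);

	while (nr > 0 || inflight > 0) {
		unsigned queued = 0, got;

		/* queue up to BATCH writes, then enter the kernel once */
		while (nr > 0 && inflight < QD && queued < BATCH) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

			iov[queued].iov_base = buf;
			iov[queued].iov_len = bs;
			io_uring_prep_writev(sqe, fd, &iov[queued], 1, off);
			off += bs;
			nr--; inflight++; queued++;
		}
		if (queued)
			io_uring_submit(&ring);

		/* reap completions in batches as well */
		got = io_uring_peek_batch_cqe(&ring, cqes, BATCH);
		if (!got && inflight) {
			io_uring_wait_cqe(&ring, &cqes[0]);
			got = 1;
		}
		io_uring_cq_advance(&ring, got);
		inflight -= got;
	}

	io_uring_queue_exit(&ring);
}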
Now, it's clear that there's some overhead in backgrounding small IOs to
the async context, but this seems awfully high. There's still a slowdown
when using considerably larger pieces of IO (128k).
For the 128k, iodepth 1 case I get:
sync:
- 97.77% 0.04% fio fio [.] fio_syncio_queue
97.73% fio_syncio_queue
- __libc_write
- __libc_write
- 97.50% entry_SYSCALL_64
- 97.31% do_syscall_64
- 97.25% ksys_write
- 97.10% vfs_write
- 96.91% new_sync_write
- 96.82% xfs_file_buffered_aio_write
- 96.13% iomap_file_buffered_write
- iomap_apply
- 95.52% iomap_write_actor
+ 37.66% iov_iter_copy_from_user_atomic
+ 26.83% iomap_write_begin
+ 26.64% iomap_write_end.isra.0
2.26% iov_iter_fault_in_readable
0.87% balance_dirty_pages_ratelimited
and --no-children shows:
+ 36.57% fio [kernel.vmlinux] [k] copy_user_enhanced_fast_string
+ 14.31% fio [kernel.vmlinux] [k] iomap_set_range_uptodate
+ 3.85% fio [kernel.vmlinux] [k] get_page_from_freelist
+ 3.34% fio [kernel.vmlinux] [k] queued_spin_lock_slowpath
+ 2.94% fio [kernel.vmlinux] [k] xas_load
+ 2.66% fio [kernel.vmlinux] [k] __add_to_page_cache_locked
+ 2.51% fio [kernel.vmlinux] [k] _raw_spin_lock_irqsave
+ 2.28% fio [kernel.vmlinux] [k] iov_iter_fault_in_readable
+ 1.87% fio [kernel.vmlinux] [k] __lru_cache_add
+ 1.80% fio [kernel.vmlinux] [k] __pagevec_lru_add_fn
+ 1.59% fio [kernel.vmlinux] [k] node_dirty_ok
+ 1.30% fio [kernel.vmlinux] [k] _raw_spin_lock_irq
+ 1.30% fio [kernel.vmlinux] [k] iomap_set_page_dirty
+ 1.25% fio fio [.] get_io_u
+ 1.18% fio [kernel.vmlinux] [k] xas_start
io_uring (the io_wqe_worker-0):
- 100.00% 22.64% io_wqe_worker-0 [kernel.vmlinux] [k] io_wqe_worker
- 77.36% io_wqe_worker
- 77.18% io_worker_handle_work
- 75.80% io_rw_async
- io_wq_submit_work
- 75.57% io_issue_sqe
- io_write
- 70.96% xfs_file_buffered_aio_write
- 70.37% iomap_file_buffered_write
- iomap_apply
- 70.08% iomap_write_actor
+ 30.05% iov_iter_copy_from_user_atomic
- 18.38% iomap_write_begin
- 17.64% grab_cache_page_write_begin
+ 16.73% pagecache_get_page
0.51% wait_for_stable_page
- 16.96% iomap_write_end.isra.0
8.72% iomap_set_range_uptodate
+ 7.35% iomap_set_page_dirty
3.45% iov_iter_fault_in_readable
0.56% balance_dirty_pages_ratelimited
- 4.36% kiocb_done
- 3.56% io_cqring_ev_posted
+ __wake_up_common_lock
0.61% io_cqring_add_event
0.56% io_free_req
--no-children:
+ 29.55% io_wqe_worker-0 [kernel.vmlinux] [k] copy_user_enhanced_fast_string
+ 22.64% io_wqe_worker-0 [kernel.vmlinux] [k] io_wqe_worker
+ 8.71% io_wqe_worker-0 [kernel.vmlinux] [k] iomap_set_range_uptodate
+ 3.41% io_wqe_worker-0 [kernel.vmlinux] [k] iov_iter_fault_in_readable
+ 2.50% io_wqe_worker-0 [kernel.vmlinux] [k] get_page_from_freelist
+ 2.24% io_wqe_worker-0 [kernel.vmlinux] [k] xas_load
+ 1.82% io_wqe_worker-0 [kernel.vmlinux] [k] queued_spin_lock_slowpath
+ 1.80% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irqsave
+ 1.72% io_wqe_worker-0 [kernel.vmlinux] [k] __add_to_page_cache_locked
+ 1.58% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irq
+ 1.29% io_wqe_worker-0 [kernel.vmlinux] [k] __lru_cache_add
+ 1.27% io_wqe_worker-0 [kernel.vmlinux] [k] __pagevec_lru_add_fn
+ 1.08% io_wqe_worker-0 [kernel.vmlinux] [k] xas_start
So it seems the biggest difference is the cpu time in io_wqe_worker
itself. Which in turn appears to mostly be:
│ mov %rbp,%rdi
│ → callq io_worker_handle_work
│ mov $0x3e7,%eax
│ ↓ jmp 28c
│ rep_nop():
│ }
│
│ /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
│ static __always_inline void rep_nop(void)
│ {
│ asm volatile("rep; nop" ::: "memory");
74.50 │281: pause
│ io_worker_spin_for_work():
│ while (++i < 1000) {
11.15 │ sub $0x1,%eax
│ ↑ je 9b
│ __read_once_size():
│ __READ_ONCE_SIZE;
9.53 │28c: mov 0x8(%rbx),%rdx
│ io_wqe_run_queue():
│ if (!wq_list_empty(&wqe->work_list) &&
│ test %rdx,%rdx
2.68 │ ↓ je 29f
│ io_worker_spin_for_work():
With 8k it looks similar, but more extreme:
sync (fio itself):
+ 29.89% fio [kernel.vmlinux] [k] copy_user_enhanced_fast_string
+ 14.10% fio [kernel.vmlinux] [k] iomap_set_range_uptodate
+ 2.80% fio [kernel.vmlinux] [k] get_page_from_freelist
+ 2.66% fio [kernel.vmlinux] [k] queued_spin_lock_slowpath
+ 2.46% fio [kernel.vmlinux] [k] xas_load
+ 1.97% fio [kernel.vmlinux] [k] _raw_spin_lock_irqsave
+ 1.94% fio [kernel.vmlinux] [k] __add_to_page_cache_locked
+ 1.79% fio [kernel.vmlinux] [k] iomap_set_page_dirty
+ 1.64% fio [kernel.vmlinux] [k] iov_iter_fault_in_readable
+ 1.61% fio fio [.] __fio_gettime
+ 1.49% fio [kernel.vmlinux] [k] __pagevec_lru_add_fn
+ 1.39% fio [kernel.vmlinux] [k] __lru_cache_add
1.30% fio fio [.] get_io_u
+ 1.11% fio [kernel.vmlinux] [k] up_write
+ 1.10% fio [kernel.vmlinux] [k] node_dirty_ok
+ 1.09% fio [kernel.vmlinux] [k] down_write
io_uring (io_wqe_worker-0 again):
+ 54.13% io_wqe_worker-0 [kernel.vmlinux] [k] io_wqe_worker
+ 11.01% io_wqe_worker-0 [kernel.vmlinux] [k] copy_user_enhanced_fast_string
+ 2.70% io_wqe_worker-0 [kernel.vmlinux] [k] iomap_set_range_uptodate
+ 2.62% io_wqe_worker-0 [kernel.vmlinux] [k] iov_iter_fault_in_readable
+ 1.91% io_wqe_worker-0 [kernel.vmlinux] [k] __slab_free
+ 1.85% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irqsave
+ 1.66% io_wqe_worker-0 [kernel.vmlinux] [k] try_to_wake_up
+ 1.62% io_wqe_worker-0 [kernel.vmlinux] [k] io_cqring_fill_event
+ 1.55% io_wqe_worker-0 [kernel.vmlinux] [k] io_rw_async
+ 1.43% io_wqe_worker-0 [kernel.vmlinux] [k] __wake_up_common
+ 1.37% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irq
+ 1.20% io_wqe_worker-0 [kernel.vmlinux] [k] rw_verify_area
which I think is pretty clear evidence we're hitting fairly significant
contention on the queue lock.
I am hitting this in postgres originally, not fio, but I thought it's
easier to reproduce this way. There's obviously benefit to doing things
in the background - but it requires odd logic around deciding when to
use io_uring, and when not.
To be clear, none of this happens with DIO, but I don't foresee switching
to DIO for all IO by default ever (too high demands on accurate
configuration).
Greetings,
Andres Freund
* Re: Buffered IO async context overhead
From: Jens Axboe @ 2020-02-14 20:13 UTC
To: Andres Freund, io-uring
On 2/14/20 12:50 PM, Andres Freund wrote:
> Hi,
>
> For workloads where a lot of buffered IO is done, I see considerably
> higher overhead when going through io_uring than when just doing the IO
> synchronously.
>
> This example is on a samsung 970 pro nvme (i.e. decent, but not top
> class). I've limited the amount of dirty buffers allowed to 2/4 GB to
> keep the test times in check. This is with a relatively recent kernel,
> 33b40134e5cfbbccad7f3040d1919889537a3df7 .
>
> fio --name=t --time_based=1 --runtime=100 --ioengine=io_uring --rw=write --bs=8k --filesize=300G --iodepth=1
>
> Jobs: 1 (f=1): [W(1)][100.0%][w=1049MiB/s][w=134k IOPS][eta 00m:00s]
> t: (groupid=0, jobs=1): err= 0: pid=46625: Fri Feb 14 11:22:05 2020
> write: IOPS=73.4k, BW=573MiB/s (601MB/s)(55.0GiB/100001msec); 0 zone resets
> slat (nsec): min=571, max=607779, avg=2927.18, stdev=1213.47
> clat (nsec): min=40, max=1400.7k, avg=9947.92, stdev=2606.65
> lat (usec): min=4, max=1403, avg=12.94, stdev= 3.28
> clat percentiles (nsec):
> | 1.00th=[ 4960], 5.00th=[ 6368], 10.00th=[ 8032], 20.00th=[ 8640],
> | 30.00th=[ 9024], 40.00th=[ 9280], 50.00th=[ 9664], 60.00th=[10048],
> | 70.00th=[10560], 80.00th=[11072], 90.00th=[11968], 95.00th=[13120],
> | 99.00th=[21632], 99.50th=[23424], 99.90th=[26752], 99.95th=[28288],
> | 99.99th=[35584]
> bw ( KiB/s): min=192408, max=1031483, per=79.92%, avg=468998.16, stdev=108691.70, samples=199
> iops : min=24051, max=128935, avg=58624.37, stdev=13586.54, samples=199
> lat (nsec) : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
> lat (nsec) : 1000=0.01%
> lat (usec) : 2=0.01%, 4=0.01%, 10=56.98%, 20=41.67%, 50=1.32%
> lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
> lat (msec) : 2=0.01%
> cpu : usr=15.63%, sys=34.79%, ctx=7333387, majf=0, minf=41
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued rwts: total=0,7335827,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
> WRITE: bw=573MiB/s (601MB/s), 573MiB/s-573MiB/s (601MB/s-601MB/s), io=55.0GiB (60.1GB), run=100001-100001msec
>
> Disk stats (read/write):
> nvme0n1: ios=0/49033, merge=0/12, ticks=0/1741787, in_queue=1717299, util=20.89%
>
>
> During this, most of the time there are a number of tasks with decent CPU usage:
> 46627 root 20 0 0 0 0 R 99.7 0.0 1:07.74 io_wqe_worker-0
> 46625 root 20 0 756920 6708 1984 S 51.0 0.0 0:34.56 fio
> 42818 root 20 0 0 0 0 I 3.3 0.0 0:03.23 kworker/u82:6-flush-259:0
> 43338 root 20 0 0 0 0 I 1.3 0.0 0:02.06 kworker/u82:18-events_unbound
>
>
> Comparing that to using sync:
> fio --name=t --time_based=1 --runtime=100 --ioengine=sync --rw=write --bs=8k --filesize=300G --iodepth=1
> Jobs: 1 (f=1): [W(1)][100.0%][w=1607MiB/s][w=206k IOPS][eta 00m:00s]
> t: (groupid=0, jobs=1): err= 0: pid=46690: Fri Feb 14 11:24:17 2020
> write: IOPS=234k, BW=1831MiB/s (1920MB/s)(179GiB/100001msec); 0 zone resets
> clat (nsec): min=1612, max=6649.7k, avg=3913.39, stdev=3170.65
> lat (nsec): min=1653, max=6649.7k, avg=3955.30, stdev=3172.59
> clat percentiles (nsec):
> | 1.00th=[ 2512], 5.00th=[ 2544], 10.00th=[ 2576], 20.00th=[ 2640],
> | 30.00th=[ 2672], 40.00th=[ 2768], 50.00th=[ 3376], 60.00th=[ 3792],
> | 70.00th=[ 4192], 80.00th=[ 4832], 90.00th=[ 5792], 95.00th=[ 7008],
> | 99.00th=[11968], 99.50th=[15552], 99.90th=[22656], 99.95th=[26496],
> | 99.99th=[40192]
> bw ( MiB/s): min= 989, max= 2118, per=82.26%, avg=1506.28, stdev=169.56, samples=193
> iops : min=126684, max=271118, avg=192803.37, stdev=21703.94, samples=193
> lat (usec) : 2=0.01%, 4=66.33%, 10=32.14%, 20=1.34%, 50=0.19%
> lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 1000=0.01%
> lat (msec) : 2=0.01%, 10=0.01%
> cpu : usr=8.35%, sys=91.55%, ctx=6183, majf=0, minf=74
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued rwts: total=0,23437892,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
> WRITE: bw=1831MiB/s (1920MB/s), 1831MiB/s-1831MiB/s (1920MB/s-1920MB/s), io=179GiB (192GB), run=100001-100001msec
>
> Disk stats (read/write):
> nvme0n1: ios=0/152371, merge=0/587772, ticks=0/5657236, in_queue=5580764, util=67.57%
>
> Tasks:
> 46690 root 20 0 756912 6524 1800 R 100.0 0.0 1:24.64 fio
> 2913 root 20 0 0 0 0 S 32.2 0.0 0:09.57 kswapd1
> 45683 root 20 0 0 0 0 I 13.2 0.0 0:08.57 kworker/u82:16-events_unbound
> 42816 root 20 0 0 0 0 I 11.2 0.0 0:09.71 kworker/u82:1-flush-259:0
> 36087 root 20 0 0 0 0 I 1.6 0.0 0:01.51 kworker/27:2-xfs-conv/nvme0n1
>
> I.e. sync did about three times the throughput of io_uring.
>
>
> It's possible to make io_uring perform better, by batching
> submissions/completions:
> fio --name=t --time_based=1 --runtime=100 --ioengine=io_uring --rw=write --bs=8k --filesize=300G --iodepth=64 --iodepth_batch_submit=32 --iodepth_batch_complete_min=16 --iodepth_low=16
> t: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=io_uring, iodepth=64
> fio-3.17
> Starting 1 process
> Jobs: 1 (f=1): [W(1)][100.0%][w=1382MiB/s][w=177k IOPS][eta 00m:00s]
> t: (groupid=0, jobs=1): err= 0: pid=46849: Fri Feb 14 11:28:13 2020
> write: IOPS=194k, BW=1513MiB/s (1587MB/s)(148GiB/100001msec); 0 zone resets
> slat (usec): min=4, max=455, avg=48.09, stdev=18.20
> clat (usec): min=6, max=25435, avg=164.91, stdev=104.69
> lat (usec): min=66, max=25512, avg=213.03, stdev=101.19
> clat percentiles (usec):
> | 1.00th=[ 60], 5.00th=[ 68], 10.00th=[ 78], 20.00th=[ 94],
> | 30.00th=[ 108], 40.00th=[ 127], 50.00th=[ 143], 60.00th=[ 180],
> | 70.00th=[ 210], 80.00th=[ 227], 90.00th=[ 265], 95.00th=[ 302],
> | 99.00th=[ 379], 99.50th=[ 441], 99.90th=[ 562], 99.95th=[ 611],
> | 99.99th=[ 816]
> bw ( MiB/s): min= 289, max= 1802, per=81.00%, avg=1225.72, stdev=173.92, samples=197
> iops : min=37086, max=230692, avg=156891.42, stdev=22261.80, samples=197
> lat (usec) : 10=0.01%, 50=0.09%, 100=25.74%, 250=60.50%, 500=13.46%
> lat (usec) : 750=0.19%, 1000=0.01%
> lat (msec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
> cpu : usr=31.80%, sys=26.04%, ctx=793383, majf=0, minf=1301
> IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=66.7%, >=64=33.3%
> submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=50.0%, 32=50.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued rwts: total=0,19369087,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
> WRITE: bw=1513MiB/s (1587MB/s), 1513MiB/s-1513MiB/s (1587MB/s-1587MB/s), io=148GiB (159GB), run=100001-100001msec
>
> Disk stats (read/write):
> nvme0n1: ios=0/127095, merge=0/15, ticks=0/4936953, in_queue=4873341, util=55.82%
>
> but that's still worse than the plain sync one. And doesn't work well
> for reads.
>
>
> Now, it's clear that there's some overhead in backgrounding small IOs to
> the async context, but this seems awfully high. There's still a slowdown
> when using considerably larger pieces of IO (128k).
>
> For the 128k, iodepth 1 case I get:
> sync:
> - 97.77% 0.04% fio fio [.] fio_syncio_queue
> 97.73% fio_syncio_queue
> - __libc_write
> - __libc_write
> - 97.50% entry_SYSCALL_64
> - 97.31% do_syscall_64
> - 97.25% ksys_write
> - 97.10% vfs_write
> - 96.91% new_sync_write
> - 96.82% xfs_file_buffered_aio_write
> - 96.13% iomap_file_buffered_write
> - iomap_apply
> - 95.52% iomap_write_actor
> + 37.66% iov_iter_copy_from_user_atomic
> + 26.83% iomap_write_begin
> + 26.64% iomap_write_end.isra.0
> 2.26% iov_iter_fault_in_readable
> 0.87% balance_dirty_pages_ratelimited
> and --no-children shows:
> + 36.57% fio [kernel.vmlinux] [k] copy_user_enhanced_fast_string
> + 14.31% fio [kernel.vmlinux] [k] iomap_set_range_uptodate
> + 3.85% fio [kernel.vmlinux] [k] get_page_from_freelist
> + 3.34% fio [kernel.vmlinux] [k] queued_spin_lock_slowpath
> + 2.94% fio [kernel.vmlinux] [k] xas_load
> + 2.66% fio [kernel.vmlinux] [k] __add_to_page_cache_locked
> + 2.51% fio [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> + 2.28% fio [kernel.vmlinux] [k] iov_iter_fault_in_readable
> + 1.87% fio [kernel.vmlinux] [k] __lru_cache_add
> + 1.80% fio [kernel.vmlinux] [k] __pagevec_lru_add_fn
> + 1.59% fio [kernel.vmlinux] [k] node_dirty_ok
> + 1.30% fio [kernel.vmlinux] [k] _raw_spin_lock_irq
> + 1.30% fio [kernel.vmlinux] [k] iomap_set_page_dirty
> + 1.25% fio fio [.] get_io_u
> + 1.18% fio [kernel.vmlinux] [k] xas_start
>
> io_uring (the io_wqe_worker-0):
> - 100.00% 22.64% io_wqe_worker-0 [kernel.vmlinux] [k] io_wqe_worker
> - 77.36% io_wqe_worker
> - 77.18% io_worker_handle_work
> - 75.80% io_rw_async
> - io_wq_submit_work
> - 75.57% io_issue_sqe
> - io_write
> - 70.96% xfs_file_buffered_aio_write
> - 70.37% iomap_file_buffered_write
> - iomap_apply
> - 70.08% iomap_write_actor
> + 30.05% iov_iter_copy_from_user_atomic
> - 18.38% iomap_write_begin
> - 17.64% grab_cache_page_write_begin
> + 16.73% pagecache_get_page
> 0.51% wait_for_stable_page
> - 16.96% iomap_write_end.isra.0
> 8.72% iomap_set_range_uptodate
> + 7.35% iomap_set_page_dirty
> 3.45% iov_iter_fault_in_readable
> 0.56% balance_dirty_pages_ratelimited
> - 4.36% kiocb_done
> - 3.56% io_cqring_ev_posted
> + __wake_up_common_lock
> 0.61% io_cqring_add_event
> 0.56% io_free_req
> --no-children:
> + 29.55% io_wqe_worker-0 [kernel.vmlinux] [k] copy_user_enhanced_fast_string
> + 22.64% io_wqe_worker-0 [kernel.vmlinux] [k] io_wqe_worker
> + 8.71% io_wqe_worker-0 [kernel.vmlinux] [k] iomap_set_range_uptodate
> + 3.41% io_wqe_worker-0 [kernel.vmlinux] [k] iov_iter_fault_in_readable
> + 2.50% io_wqe_worker-0 [kernel.vmlinux] [k] get_page_from_freelist
> + 2.24% io_wqe_worker-0 [kernel.vmlinux] [k] xas_load
> + 1.82% io_wqe_worker-0 [kernel.vmlinux] [k] queued_spin_lock_slowpath
> + 1.80% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> + 1.72% io_wqe_worker-0 [kernel.vmlinux] [k] __add_to_page_cache_locked
> + 1.58% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irq
> + 1.29% io_wqe_worker-0 [kernel.vmlinux] [k] __lru_cache_add
> + 1.27% io_wqe_worker-0 [kernel.vmlinux] [k] __pagevec_lru_add_fn
> + 1.08% io_wqe_worker-0 [kernel.vmlinux] [k] xas_start
>
> So it seems the biggest difference is the cpu time in io_wqe_worker
> itself. Which in turn appears to mostly be:
>
> │ mov %rbp,%rdi
> │ → callq io_worker_handle_work
> │ mov $0x3e7,%eax
> │ ↓ jmp 28c
> │ rep_nop():
> │ }
> │
> │ /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
> │ static __always_inline void rep_nop(void)
> │ {
> │ asm volatile("rep; nop" ::: "memory");
> 74.50 │281: pause
> │ io_worker_spin_for_work():
> │ while (++i < 1000) {
> 11.15 │ sub $0x1,%eax
> │ ↑ je 9b
> │ __read_once_size():
> │ __READ_ONCE_SIZE;
> 9.53 │28c: mov 0x8(%rbx),%rdx
> │ io_wqe_run_queue():
> │ if (!wq_list_empty(&wqe->work_list) &&
> │ test %rdx,%rdx
> 2.68 │ ↓ je 29f
> │ io_worker_spin_for_work():
>
>
> With 8k it looks similar, but more extreme:
> sync (fio itself):
> + 29.89% fio [kernel.vmlinux] [k] copy_user_enhanced_fast_string
> + 14.10% fio [kernel.vmlinux] [k] iomap_set_range_uptodate
> + 2.80% fio [kernel.vmlinux] [k] get_page_from_freelist
> + 2.66% fio [kernel.vmlinux] [k] queued_spin_lock_slowpath
> + 2.46% fio [kernel.vmlinux] [k] xas_load
> + 1.97% fio [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> + 1.94% fio [kernel.vmlinux] [k] __add_to_page_cache_locked
> + 1.79% fio [kernel.vmlinux] [k] iomap_set_page_dirty
> + 1.64% fio [kernel.vmlinux] [k] iov_iter_fault_in_readable
> + 1.61% fio fio [.] __fio_gettime
> + 1.49% fio [kernel.vmlinux] [k] __pagevec_lru_add_fn
> + 1.39% fio [kernel.vmlinux] [k] __lru_cache_add
> 1.30% fio fio [.] get_io_u
> + 1.11% fio [kernel.vmlinux] [k] up_write
> + 1.10% fio [kernel.vmlinux] [k] node_dirty_ok
> + 1.09% fio [kernel.vmlinux] [k] down_write
>
> io_uring (io_wqe_worker-0 again):
> + 54.13% io_wqe_worker-0 [kernel.vmlinux] [k] io_wqe_worker
> + 11.01% io_wqe_worker-0 [kernel.vmlinux] [k] copy_user_enhanced_fast_string
> + 2.70% io_wqe_worker-0 [kernel.vmlinux] [k] iomap_set_range_uptodate
> + 2.62% io_wqe_worker-0 [kernel.vmlinux] [k] iov_iter_fault_in_readable
> + 1.91% io_wqe_worker-0 [kernel.vmlinux] [k] __slab_free
> + 1.85% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> + 1.66% io_wqe_worker-0 [kernel.vmlinux] [k] try_to_wake_up
> + 1.62% io_wqe_worker-0 [kernel.vmlinux] [k] io_cqring_fill_event
> + 1.55% io_wqe_worker-0 [kernel.vmlinux] [k] io_rw_async
> + 1.43% io_wqe_worker-0 [kernel.vmlinux] [k] __wake_up_common
> + 1.37% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irq
> + 1.20% io_wqe_worker-0 [kernel.vmlinux] [k] rw_verify_area
>
> which I think is pretty clear evidence we're hitting fairly significant
> contention on the queue lock.
>
>
> I am hitting this in postgres originally, not fio, but I thought it's
> easier to reproduce this way. There's obviously benefit to doing things
> in the background - but it requires odd logic around deciding when to
> use io_uring, and when not.
>
> To be clear, none of this happens with DIO, but I don't foresee switching
> to DIO for all IO by default ever (too high demands on accurate
> configuration).
Can you try with this added?
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 76cbf474c184..207daf83f209 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -620,6 +620,7 @@ static const struct io_op_def io_op_defs[] = {
.async_ctx = 1,
.needs_mm = 1,
.needs_file = 1,
+ .hash_reg_file = 1,
.unbound_nonreg_file = 1,
},
[IORING_OP_WRITEV] = {
@@ -634,6 +635,7 @@ static const struct io_op_def io_op_defs[] = {
},
[IORING_OP_READ_FIXED] = {
.needs_file = 1,
+ .hash_reg_file = 1,
.unbound_nonreg_file = 1,
},
[IORING_OP_WRITE_FIXED] = {
@@ -711,11 +713,13 @@ static const struct io_op_def io_op_defs[] = {
[IORING_OP_READ] = {
.needs_mm = 1,
.needs_file = 1,
+ .hash_reg_file = 1,
.unbound_nonreg_file = 1,
},
[IORING_OP_WRITE] = {
.needs_mm = 1,
.needs_file = 1,
+ .hash_reg_file = 1,
.unbound_nonreg_file = 1,
},
[IORING_OP_FADVISE] = {
@@ -955,7 +959,7 @@ static inline bool io_prep_async_work(struct io_kiocb *req,
bool do_hashed = false;
if (req->flags & REQ_F_ISREG) {
- if (def->hash_reg_file)
+ if (!(req->kiocb->ki_flags & IOCB_DIRECT) && def->hash_reg_file)
do_hashed = true;
} else {
if (def->unbound_nonreg_file)
--
Jens Axboe
* Re: Buffered IO async context overhead
From: Andres Freund @ 2020-02-14 20:31 UTC
To: Jens Axboe; +Cc: io-uring
Hi,
On 2020-02-14 13:13:35 -0700, Jens Axboe wrote:
> On 2/14/20 12:50 PM, Andres Freund wrote:
> > which I think is pretty clear evidence we're hitting fairly significant
> > contention on the queue lock.
> >
> >
> > I am hitting this in postgres originally, not fio, but I thought it's
> > easier to reproduce this way. There's obviously benefit to doing things
> > in the background - but it requires odd logic around deciding when to
> > use io_uring, and when not.
> >
> > To be clear, none of this happens with DIO, but I don't foresee switching
> > to DIO for all IO by default ever (too high demands on accurate
> > configuration).
>
> Can you try with this added?
>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 76cbf474c184..207daf83f209 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -620,6 +620,7 @@ static const struct io_op_def io_op_defs[] = {
> .async_ctx = 1,
> .needs_mm = 1,
> .needs_file = 1,
> + .hash_reg_file = 1,
> .unbound_nonreg_file = 1,
> },
> [IORING_OP_WRITEV] = {
> @@ -634,6 +635,7 @@ static const struct io_op_def io_op_defs[] = {
> },
> [IORING_OP_READ_FIXED] = {
> .needs_file = 1,
> + .hash_reg_file = 1,
> .unbound_nonreg_file = 1,
> },
> [IORING_OP_WRITE_FIXED] = {
> @@ -711,11 +713,13 @@ static const struct io_op_def io_op_defs[] = {
> [IORING_OP_READ] = {
> .needs_mm = 1,
> .needs_file = 1,
> + .hash_reg_file = 1,
> .unbound_nonreg_file = 1,
> },
> [IORING_OP_WRITE] = {
> .needs_mm = 1,
> .needs_file = 1,
> + .hash_reg_file = 1,
> .unbound_nonreg_file = 1,
> },
> [IORING_OP_FADVISE] = {
> @@ -955,7 +959,7 @@ static inline bool io_prep_async_work(struct io_kiocb *req,
> bool do_hashed = false;
>
> if (req->flags & REQ_F_ISREG) {
> - if (def->hash_reg_file)
> + if (!(req->kiocb->ki_flags & IOCB_DIRECT) && def->hash_reg_file)
> do_hashed = true;
> } else {
> if (def->unbound_nonreg_file)
I can (will do Sunday, on the road till then). But I'm a bit doubtful
it'll help. This is using WRITEV after all, and I only see a single
worker?
Greetings,
Andres Freund
* Re: Buffered IO async context overhead
From: Jens Axboe @ 2020-02-14 20:49 UTC
To: Andres Freund; +Cc: io-uring
On 2/14/20 1:31 PM, Andres Freund wrote:
> Hi,
>
> On 2020-02-14 13:13:35 -0700, Jens Axboe wrote:
>> On 2/14/20 12:50 PM, Andres Freund wrote:
>>> which I think is pretty clear evidence we're hitting fairly significant
>>> contention on the queue lock.
>>>
>>>
>>> I am hitting this in postgres originally, not fio, but I thought it's
>>> easier to reproduce this way. There's obviously benefit to doing things
>>> in the background - but it requires odd logic around deciding when to
>>> use io_uring, and when not.
>>>
>>> To be clear, none of this happens with DIO, but I don't foresee switching
>>> to DIO for all IO by default ever (too high demands on accurate
>>> configuration).
>>
>> Can you try with this added?
>>
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 76cbf474c184..207daf83f209 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -620,6 +620,7 @@ static const struct io_op_def io_op_defs[] = {
>> .async_ctx = 1,
>> .needs_mm = 1,
>> .needs_file = 1,
>> + .hash_reg_file = 1,
>> .unbound_nonreg_file = 1,
>> },
>> [IORING_OP_WRITEV] = {
>> @@ -634,6 +635,7 @@ static const struct io_op_def io_op_defs[] = {
>> },
>> [IORING_OP_READ_FIXED] = {
>> .needs_file = 1,
>> + .hash_reg_file = 1,
>> .unbound_nonreg_file = 1,
>> },
>> [IORING_OP_WRITE_FIXED] = {
>> @@ -711,11 +713,13 @@ static const struct io_op_def io_op_defs[] = {
>> [IORING_OP_READ] = {
>> .needs_mm = 1,
>> .needs_file = 1,
>> + .hash_reg_file = 1,
>> .unbound_nonreg_file = 1,
>> },
>> [IORING_OP_WRITE] = {
>> .needs_mm = 1,
>> .needs_file = 1,
>> + .hash_reg_file = 1,
>> .unbound_nonreg_file = 1,
>> },
>> [IORING_OP_FADVISE] = {
>> @@ -955,7 +959,7 @@ static inline bool io_prep_async_work(struct io_kiocb *req,
>> bool do_hashed = false;
>>
>> if (req->flags & REQ_F_ISREG) {
>> - if (def->hash_reg_file)
>> + if (!(req->kiocb->ki_flags & IOCB_DIRECT) && def->hash_reg_file)
>> do_hashed = true;
>> } else {
>> if (def->unbound_nonreg_file)
>
> I can (will do Sunday, on the road till then). But I'm a bit doubtful
> it'll help. This is using WRITEV after all, and I only see a single
> worker?
Because I'm working on other items, I didn't read carefully enough. Yes
this won't change the situation for writes. I'll take a look at this when
I get time, maybe there's something we can do to improve the situation.
--
Jens Axboe
* Re: Buffered IO async context overhead
From: Andres Freund @ 2020-02-24 9:35 UTC
To: Jens Axboe; +Cc: io-uring
Hi,
On 2020-02-14 13:49:31 -0700, Jens Axboe wrote:
> [description of buffered write workloads being slower via io_uring
> than plain writes]
> Because I'm working on other items, I didn't read carefully enough. Yes
> this won't change the situation for writes. I'll take a look at this when
> I get time, maybe there's something we can do to improve the situation.
I looked a bit into this.
I think one issue is the spinning the workers do:
static int io_wqe_worker(void *data)
{

	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
		set_current_state(TASK_INTERRUPTIBLE);
loop:
		if (did_work)
			io_worker_spin_for_work(wqe);
		spin_lock_irq(&wqe->lock);
		if (io_wqe_run_queue(wqe)) {

static inline void io_worker_spin_for_work(struct io_wqe *wqe)
{
	int i = 0;

	while (++i < 1000) {
		if (io_wqe_run_queue(wqe))
			break;
		if (need_resched())
			break;
		cpu_relax();
	}
}
even with the cpu_relax(), that causes quite a lot of cross socket
traffic, slowing down the submission side. Which after all frequently
needs to take the wqe->lock, just to be able to submit a queue
entry.
lock, work_list, flags all reside in one cacheline, so it's pretty
likely that a single io_wqe_enqueue would get the cacheline "stolen"
several times during one enqueue - without allowing any progress in the
worker, of course.
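For reference, the relevant part of struct io_wqe looks roughly like this
(simplified sketch from fs/io-wq.c around this kernel, not the verbatim
definition):

struct io_wqe {
	struct {
		spinlock_t lock;                  /* taken by io_wqe_enqueue() and the workers */
		struct io_wq_work_list work_list; /* pending work */
		unsigned long hash_map;
		unsigned flags;                   /* polled by io_wqe_run_queue() */
	} ____cacheline_aligned_in_smp;
	/* ... per-node worker bookkeeping, wq backpointer, etc. ... */
};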
I also wonder if we can't avoid dequeuing entries one-by-one within the
worker, at least for the IO_WQ_WORK_HASHED case. Especially when writes
are just hitting the page cache, they're pretty fast, making it
plausible to cause pretty bad contention on the spinlock (even without
the spinning above). And whereas the submission side is at least somewhat
likely to be able to submit several queue entries while the worker is
processing one job, the same is pretty unlikely for the workers.
In the hashed case there shouldn't be another worker processing entries
for the same hash. So it seems quite possible for the wqe to drain a few
of the entries for that hash within one spinlock acquisition, and then
process them one-by-one?
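Something along these lines, purely as a sketch (not the actual io-wq code;
io_wqe_next_hashed(), the batch_next field and io_run_work() are made-up
placeholders):

static void io_worker_drain_hashed(struct io_wqe *wqe, unsigned int hash)
{
	struct io_wq_work *work, *next;
	struct io_wq_work *batch = NULL, **tail = &batch;

	/*
	 * One lock acquisition to splice everything currently queued for
	 * this hash onto a private list...
	 */
	spin_lock_irq(&wqe->lock);
	while ((work = io_wqe_next_hashed(wqe, hash)) != NULL) {
		*tail = work;
		tail = &work->batch_next;
	}
	*tail = NULL;
	spin_unlock_irq(&wqe->lock);

	/* ... then process the batch without touching wqe->lock again. */
	for (work = batch; work; work = next) {
		next = work->batch_next;
		io_run_work(work);
	}
}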
Greetings,
Andres Freund
* Re: Buffered IO async context overhead
From: Jens Axboe @ 2020-02-24 15:22 UTC
To: Andres Freund; +Cc: io-uring
On 2/24/20 2:35 AM, Andres Freund wrote:
> Hi,
>
> On 2020-02-14 13:49:31 -0700, Jens Axboe wrote:
>> [description of buffered write workloads being slower via io_uring
>> than plain writes]
>> Because I'm working on other items, I didn't read carefully enough. Yes
>> this won't change the situation for writes. I'll take a look at this when
>> I get time, maybe there's something we can do to improve the situation.
>
> I looked a bit into this.
>
> I think one issue is the spinning the workers do:
>
> static int io_wqe_worker(void *data)
> {
>
> 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
> 		set_current_state(TASK_INTERRUPTIBLE);
> loop:
> 		if (did_work)
> 			io_worker_spin_for_work(wqe);
> 		spin_lock_irq(&wqe->lock);
> 		if (io_wqe_run_queue(wqe)) {
>
> static inline void io_worker_spin_for_work(struct io_wqe *wqe)
> {
> 	int i = 0;
>
> 	while (++i < 1000) {
> 		if (io_wqe_run_queue(wqe))
> 			break;
> 		if (need_resched())
> 			break;
> 		cpu_relax();
> 	}
> }
>
> even with the cpu_relax(), that causes quite a lot of cross socket
> traffic, slowing down the submission side. Which after all frequently
> needs to take the wqe->lock, just to be able to submit a queue
> entry.
>
> lock, work_list, flags all reside in one cacheline, so it's pretty
> likely that a single io_wqe_enqueue would get the cacheline "stolen"
> several times during one enqueue - without allowing any progress in the
> worker, of course.
Since it's provably harmful for this case, and the gain was small (but
noticeable) on single issue cases, I think we should just kill it. With
the poll retry stuff for 5.7, there'll be even less of a need for it.
Care to send a patch for 5.6 to kill it?
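Removing it should just amount to something like the below (untested sketch;
the now-unused did_work bookkeeping in io_wqe_worker() would go with it):

diff --git a/fs/io-wq.c b/fs/io-wq.c
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ ... @@
-static inline void io_worker_spin_for_work(struct io_wqe *wqe)
-{
-	int i = 0;
-
-	while (++i < 1000) {
-		if (io_wqe_run_queue(wqe))
-			break;
-		if (need_resched())
-			break;
-		cpu_relax();
-	}
-}
-
@@ ... @@ static int io_wqe_worker(void *data)
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		set_current_state(TASK_INTERRUPTIBLE);
 loop:
-		if (did_work)
-			io_worker_spin_for_work(wqe);
 		spin_lock_irq(&wqe->lock);
 		if (io_wqe_run_queue(wqe)) {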
> I also wonder if we can't avoid dequeuing entries one-by-one within the
> worker, at least for the IO_WQ_WORK_HASHED case. Especially when writes
> are just hitting the page cache, they're pretty fast, making it
> plausible to cause pretty bad contention on the spinlock (even without
> the spinning above). And whereas the submission side is at least somewhat
> likely to be able to submit several queue entries while the worker is
> processing one job, the same is pretty unlikely for the workers.
>
> In the hashed case there shouldn't be another worker processing entries
> for the same hash. So it seems quite possible for the wqe to drain a few
> of the entries for that hash within one spinlock acquisition, and then
> process them one-by-one?
Yeah, I think that'd be a good optimization for hashed work. Work N+1
can't make any progress before work N is done anyway, so might as well
grab a batch at a time.
--
Jens Axboe
* Re: Buffered IO async context overhead
From: Pavel Begunkov @ 2020-03-09 20:03 UTC
To: Jens Axboe, Andres Freund; +Cc: io-uring
On 24/02/2020 18:22, Jens Axboe wrote:
> On 2/24/20 2:35 AM, Andres Freund wrote:
>> Hi,
>>
>> On 2020-02-14 13:49:31 -0700, Jens Axboe wrote:
>>> [description of buffered write workloads being slower via io_uring
>>> than plain writes]
>>> Because I'm working on other items, I didn't read carefully enough. Yes
>>> this won't change the situation for writes. I'll take a look at this when
>>> I get time, maybe there's something we can do to improve the situation.
>>
>> I looked a bit into this.
>>
>> I think one issue is the spinning the workers do:
>>
>> static int io_wqe_worker(void *data)
>> {
>>
>> 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
>> 		set_current_state(TASK_INTERRUPTIBLE);
>> loop:
>> 		if (did_work)
>> 			io_worker_spin_for_work(wqe);
>> 		spin_lock_irq(&wqe->lock);
>> 		if (io_wqe_run_queue(wqe)) {
>>
>> static inline void io_worker_spin_for_work(struct io_wqe *wqe)
>> {
>> 	int i = 0;
>>
>> 	while (++i < 1000) {
>> 		if (io_wqe_run_queue(wqe))
>> 			break;
>> 		if (need_resched())
>> 			break;
>> 		cpu_relax();
>> 	}
>> }
>>
>> even with the cpu_relax(), that causes quite a lot of cross socket
>> traffic, slowing down the submission side. Which after all frequently
>> needs to take the wqe->lock, just to be able to submit a queue
>> entry.
>>
>> lock, work_list, flags all reside in one cacheline, so it's pretty
>> likely that a single io_wqe_enqueue would get the cacheline "stolen"
>> several times during one enqueue - without allowing any progress in the
>> worker, of course.
>
> Since it's provably harmful for this case, and the gain was small (but
> noticeable) on single issue cases, I think we should just kill it. With
> the poll retry stuff for 5.7, there'll be even less of a need for it.
>
> Care to send a patch for 5.6 to kill it?
>
>> I also wonder if we can't avoid dequeuing entries one-by-one within the
>> worker, at least for the IO_WQ_WORK_HASHED case. Especially when writes
>> are just hitting the page cache, they're pretty fast, making it
>> plausible to cause pretty bad contention on the spinlock (even without
>> the spinning above). And whereas the submission side is at least somewhat
>> likely to be able to submit several queue entries while the worker is
>> processing one job, the same is pretty unlikely for the workers.
>>
>> In the hashed case there shouldn't be another worker processing entries
>> for the same hash. So it seems quite possible for the wqe to drain a few
>> of the entries for that hash within one spinlock acquisition, and then
>> process them one-by-one?
>
> Yeah, I think that'd be a good optimization for hashed work. Work N+1
> can't make any progress before work N is done anyway, so might as well
> grab a batch at a time.
>
A problem here is that we actually have a 2D array of works because of linked
requests.
We could io_wqe_enqueue() dependent works when we have hashed requests, delegating
them to other threads. But if the work->list is not per-core, that will hurt
locality. Alternatively, re-enqueue the hashed ones if there is a dependent work.
I need to think about how to do this better.
--
Pavel Begunkov
* Re: Buffered IO async context overhead
From: Jens Axboe @ 2020-03-09 20:41 UTC
To: Pavel Begunkov, Andres Freund; +Cc: io-uring
On 3/9/20 2:03 PM, Pavel Begunkov wrote:
> On 24/02/2020 18:22, Jens Axboe wrote:
>> On 2/24/20 2:35 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2020-02-14 13:49:31 -0700, Jens Axboe wrote:
>>>> [description of buffered write workloads being slower via io_uring
>>>> than plain writes]
>>>> Because I'm working on other items, I didn't read carefully enough. Yes
>>>> this won't change the situation for writes. I'll take a look at this when
>>>> I get time, maybe there's something we can do to improve the situation.
>>>
>>> I looked a bit into this.
>>>
>>> I think one issue is the spinning the workers do:
>>>
>>> static int io_wqe_worker(void *data)
>>> {
>>>
>>> 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
>>> 		set_current_state(TASK_INTERRUPTIBLE);
>>> loop:
>>> 		if (did_work)
>>> 			io_worker_spin_for_work(wqe);
>>> 		spin_lock_irq(&wqe->lock);
>>> 		if (io_wqe_run_queue(wqe)) {
>>>
>>> static inline void io_worker_spin_for_work(struct io_wqe *wqe)
>>> {
>>> 	int i = 0;
>>>
>>> 	while (++i < 1000) {
>>> 		if (io_wqe_run_queue(wqe))
>>> 			break;
>>> 		if (need_resched())
>>> 			break;
>>> 		cpu_relax();
>>> 	}
>>> }
>>>
>>> even with the cpu_relax(), that causes quite a lot of cross socket
>>> traffic, slowing down the submission side. Which after all frequently
>>> needs to take the wqe->lock, just to be able to submit a queue
>>> entry.
>>>
>>> lock, work_list, flags all reside in one cacheline, so it's pretty
>>> likely that a single io_wqe_enqueue would get the cacheline "stolen"
>>> several times during one enqueue - without allowing any progress in the
>>> worker, of course.
>>
>> Since it's provably harmful for this case, and the gain was small (but
>> noticeable) on single issue cases, I think we should just kill it. With
>> the poll retry stuff for 5.7, there'll be even less of a need for it.
>>
>> Care to send a patch for 5.6 to kill it?
>>
>>> I also wonder if we can't avoid dequeuing entries one-by-one within the
>>> worker, at least for the IO_WQ_WORK_HASHED case. Especially when writes
>>> are just hitting the page cache, they're pretty fast, making it
>>> plausible to cause pretty bad contention on the spinlock (even without
>>> the spinning above). And whereas the submission side is at least somewhat
>>> likely to be able to submit several queue entries while the worker is
>>> processing one job, the same is pretty unlikely for the workers.
>>>
>>> In the hashed case there shouldn't be another worker processing entries
>>> for the same hash. So it seems quite possible for the wqe to drain a few
>>> of the entries for that hash within one spinlock acquisition, and then
>>> process them one-by-one?
>>
>> Yeah, I think that'd be a good optimization for hashed work. Work N+1
>> can't make any progress before work N is done anyway, so might as well
>> grab a batch at a time.
>>
>
> A problem here is that we actually have a 2D array of works because of linked
> requests.
You could either skip anything with a link, or even just ignore it and
simply re-queue the dependent link when the work is done, if that link
isn't hashed and the work was grabbed in a batch.
> We could io_wqe_enqueue() dependent works when we have hashed requests, delegating
> them to other threads. But if the work->list is not per-core, that will hurt
> locality. Alternatively, re-enqueue the hashed ones if there is a dependent work.
> I need to think about how to do this better.
If we ignore links for a second, I think we can all agree that it'd be a
big win to do the batch.
With links, worst case would then be something where every other link is
hashed.
For a first patch, I'd be quite happy to just stop the batch if there's
a link on a request. The normal case here is buffered writes, and
that'll handle that case perfectly. Links will be no worse than before.
Seems like a no-brainer to me.
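Concretely, the batch-grab loop would just terminate early when it pulls a work
item that has a link hanging off it, roughly like this (sketch only;
io_work_has_link() is a placeholder for however that ends up being detected, and
the list handling mirrors the batching sketch earlier in the thread):

	spin_lock_irq(&wqe->lock);
	while ((work = io_wqe_next_hashed(wqe, hash)) != NULL) {
		*tail = work;
		tail = &work->batch_next;
		/* keep it simple for now: stop the batch at a linked request */
		if (io_work_has_link(work))
			break;
	}
	*tail = NULL;
	spin_unlock_irq(&wqe->lock);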
--
Jens Axboe
* Re: Buffered IO async context overhead
From: Pavel Begunkov @ 2020-03-09 21:02 UTC
To: Jens Axboe, Andres Freund; +Cc: io-uring
On 09/03/2020 23:41, Jens Axboe wrote:
> On 3/9/20 2:03 PM, Pavel Begunkov wrote:
>> On 24/02/2020 18:22, Jens Axboe wrote:
>> A problem here is that we actually have a 2D array of works because of linked
>> requests.
>
> You could either skip anything with a link, or even just ignore it and
> simply re-queue the dependent link when the work is done, if that link
> isn't hashed and the work was grabbed in a batch.
>
>> We could io_wqe_enqueue() dependent works when we have hashed requests, delegating
>> them to other threads. But if the work->list is not per-core, that will hurt
>> locality. Alternatively, re-enqueue the hashed ones if there is a dependent work.
>> I need to think about how to do this better.
>
> If we ignore links for a second, I think we can all agree that it'd be a
> big win to do the batch.
Definitely
>
> With links, worst case would then be something where every other link is
> hashed.
>
> For a first patch, I'd be quite happy to just stop the batch if there's
> a link on a request. The normal case here is buffered writes, and
> that'll handle that case perfectly. Links will be no worse than before.
> Seems like a no-brainer to me.
That isn't really a problem, I'm just pointing out that there could be
optimisations for different cases.
--
Pavel Begunkov
* Re: Buffered IO async context overhead
From: Jens Axboe @ 2020-03-09 21:29 UTC
To: Pavel Begunkov, Andres Freund; +Cc: io-uring
On 3/9/20 3:02 PM, Pavel Begunkov wrote:
> On 09/03/2020 23:41, Jens Axboe wrote:
>> On 3/9/20 2:03 PM, Pavel Begunkov wrote:
>>> On 24/02/2020 18:22, Jens Axboe wrote:
>>> A problem here is that we actually have a 2D array of works because of linked
>>> requests.
>>
>> You could either skip anything with a link, or even just ignore it and
>> simply re-queue the dependent link when the work is done, if that link
>> isn't hashed and the work was grabbed in a batch.
>>
>>> We could io_wqe_enqueue() dependent works when we have hashed requests, delegating
>>> them to other threads. But if the work->list is not per-core, that will hurt
>>> locality. Alternatively, re-enqueue the hashed ones if there is a dependent work.
>>> I need to think about how to do this better.
>>
>> If we ignore links for a second, I think we can all agree that it'd be a
>> big win to do the batch.
>
> Definitely
>
>>
>> With links, worst case would then be something where every other link is
>> hashed.
>>
>> For a first patch, I'd be quite happy to just stop the batch if there's
>> a link on a request. The normal case here is buffered writes, and
>> that'll handle that case perfectly. Links will be no worse than before.
>> Seems like a no-brainer to me.
>
> That isn't really a problem, I'm just pointing out that there could be
> optimisations for different cases.
Definitely. In case it wasn't clear, my suggestion was merely to go for
the main win and ignore the link side for now. It's best done separately
anyway, with the link stuff tweaking the batch behavior.
--
Jens Axboe