From: Jens Axboe <[email protected]>
To: Andres Freund <[email protected]>, [email protected]
Subject: Re: Buffered IO async context overhead
Date: Fri, 14 Feb 2020 13:13:35 -0700
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On 2/14/20 12:50 PM, Andres Freund wrote:
> Hi,
>
> For workloads where a lot of buffered IO is done, I see considerably
> higher overhead when going through io_uring than when just doing the IO
> synchronously.
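>
> (To make it concrete, roughly what the iodepth=1 io_uring case boils
> down to, sketched with liburing; the file name, sizes and loop count are
> just placeholders, and error handling is omitted:)
>
> #include <fcntl.h>
> #include <liburing.h>
>
> int main(void)
> {
>         struct io_uring ring;
>         struct io_uring_cqe *cqe;
>         static char buf[8192];
>         off_t off = 0;
>         int fd, i;
>
>         io_uring_queue_init(8, &ring, 0);
>         fd = open("testfile", O_WRONLY | O_CREAT, 0644);
>
>         for (i = 0; i < 1000; i++) {
>                 struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
>
>                 io_uring_prep_write(sqe, fd, buf, sizeof(buf), off);
>                 io_uring_submit(&ring);         /* one io_uring_enter() per IO */
>                 io_uring_wait_cqe(&ring, &cqe); /* and one wait per completion */
>                 if (cqe->res < 0)
>                         return 1;
>                 off += cqe->res;
>                 io_uring_cqe_seen(&ring, cqe);
>         }
>         io_uring_queue_exit(&ring);
>         return 0;
> }
>
> The sync engine, by contrast, is just a plain write() per 8k block in
> the submitting task itself.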
>
> This example is on a Samsung 970 Pro NVMe (i.e. decent, but not top
> class). I've limited the amount of dirty buffers allowed to 2/4 GB to
> keep the test times in check. This is with a relatively recent kernel,
> commit 33b40134e5cfbbccad7f3040d1919889537a3df7.
>
> fio --name=t --time_based=1 --runtime=100 --ioengine=io_uring --rw=write --bs=8k --filesize=300G --iodepth=1
>
> Jobs: 1 (f=1): [W(1)][100.0%][w=1049MiB/s][w=134k IOPS][eta 00m:00s]
> t: (groupid=0, jobs=1): err= 0: pid=46625: Fri Feb 14 11:22:05 2020
> write: IOPS=73.4k, BW=573MiB/s (601MB/s)(55.0GiB/100001msec); 0 zone resets
> slat (nsec): min=571, max=607779, avg=2927.18, stdev=1213.47
> clat (nsec): min=40, max=1400.7k, avg=9947.92, stdev=2606.65
> lat (usec): min=4, max=1403, avg=12.94, stdev= 3.28
> clat percentiles (nsec):
> | 1.00th=[ 4960], 5.00th=[ 6368], 10.00th=[ 8032], 20.00th=[ 8640],
> | 30.00th=[ 9024], 40.00th=[ 9280], 50.00th=[ 9664], 60.00th=[10048],
> | 70.00th=[10560], 80.00th=[11072], 90.00th=[11968], 95.00th=[13120],
> | 99.00th=[21632], 99.50th=[23424], 99.90th=[26752], 99.95th=[28288],
> | 99.99th=[35584]
> bw ( KiB/s): min=192408, max=1031483, per=79.92%, avg=468998.16, stdev=108691.70, samples=199
> iops : min=24051, max=128935, avg=58624.37, stdev=13586.54, samples=199
> lat (nsec) : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
> lat (nsec) : 1000=0.01%
> lat (usec) : 2=0.01%, 4=0.01%, 10=56.98%, 20=41.67%, 50=1.32%
> lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
> lat (msec) : 2=0.01%
> cpu : usr=15.63%, sys=34.79%, ctx=7333387, majf=0, minf=41
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued rwts: total=0,7335827,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
> WRITE: bw=573MiB/s (601MB/s), 573MiB/s-573MiB/s (601MB/s-601MB/s), io=55.0GiB (60.1GB), run=100001-100001msec
>
> Disk stats (read/write):
> nvme0n1: ios=0/49033, merge=0/12, ticks=0/1741787, in_queue=1717299, util=20.89%
>
>
> During this, most of the time there are a number of tasks with decent CPU usage:
> 46627 root 20 0 0 0 0 R 99.7 0.0 1:07.74 io_wqe_worker-0
> 46625 root 20 0 756920 6708 1984 S 51.0 0.0 0:34.56 fio
> 42818 root 20 0 0 0 0 I 3.3 0.0 0:03.23 kworker/u82:6-flush-259:0
> 43338 root 20 0 0 0 0 I 1.3 0.0 0:02.06 kworker/u82:18-events_unbound
>
>
> Comparing that to using sync:
> fio --name=t --time_based=1 --runtime=100 --ioengine=sync --rw=write --bs=8k --filesize=300G --iodepth=1
> Jobs: 1 (f=1): [W(1)][100.0%][w=1607MiB/s][w=206k IOPS][eta 00m:00s]
> t: (groupid=0, jobs=1): err= 0: pid=46690: Fri Feb 14 11:24:17 2020
> write: IOPS=234k, BW=1831MiB/s (1920MB/s)(179GiB/100001msec); 0 zone resets
> clat (nsec): min=1612, max=6649.7k, avg=3913.39, stdev=3170.65
> lat (nsec): min=1653, max=6649.7k, avg=3955.30, stdev=3172.59
> clat percentiles (nsec):
> | 1.00th=[ 2512], 5.00th=[ 2544], 10.00th=[ 2576], 20.00th=[ 2640],
> | 30.00th=[ 2672], 40.00th=[ 2768], 50.00th=[ 3376], 60.00th=[ 3792],
> | 70.00th=[ 4192], 80.00th=[ 4832], 90.00th=[ 5792], 95.00th=[ 7008],
> | 99.00th=[11968], 99.50th=[15552], 99.90th=[22656], 99.95th=[26496],
> | 99.99th=[40192]
> bw ( MiB/s): min= 989, max= 2118, per=82.26%, avg=1506.28, stdev=169.56, samples=193
> iops : min=126684, max=271118, avg=192803.37, stdev=21703.94, samples=193
> lat (usec) : 2=0.01%, 4=66.33%, 10=32.14%, 20=1.34%, 50=0.19%
> lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 1000=0.01%
> lat (msec) : 2=0.01%, 10=0.01%
> cpu : usr=8.35%, sys=91.55%, ctx=6183, majf=0, minf=74
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued rwts: total=0,23437892,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
> WRITE: bw=1831MiB/s (1920MB/s), 1831MiB/s-1831MiB/s (1920MB/s-1920MB/s), io=179GiB (192GB), run=100001-100001msec
>
> Disk stats (read/write):
> nvme0n1: ios=0/152371, merge=0/587772, ticks=0/5657236, in_queue=5580764, util=67.57%
>
> Tasks:
> 46690 root 20 0 756912 6524 1800 R 100.0 0.0 1:24.64 fio
> 2913 root 20 0 0 0 0 S 32.2 0.0 0:09.57 kswapd1
> 45683 root 20 0 0 0 0 I 13.2 0.0 0:08.57 kworker/u82:16-events_unbound
> 42816 root 20 0 0 0 0 I 11.2 0.0 0:09.71 kworker/u82:1-flush-259:0
> 36087 root 20 0 0 0 0 I 1.6 0.0 0:01.51 kworker/27:2-xfs-conv/nvme0n1
>
> I.e. sync did about three times the throughput of io_uring.
>
>
> It's possible to make io_uring perform better by batching
> submissions/completions:
> fio --name=t --time_based=1 --runtime=100 --ioengine=io_uring --rw=write --bs=8k --filesize=300G --iodepth=64 --iodepth_batch_submit=32 --iodepth_batch_complete_min=16 --iodepth_low=16
> t: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=io_uring, iodepth=64
> fio-3.17
> Starting 1 process
> Jobs: 1 (f=1): [W(1)][100.0%][w=1382MiB/s][w=177k IOPS][eta 00m:00s]
> t: (groupid=0, jobs=1): err= 0: pid=46849: Fri Feb 14 11:28:13 2020
> write: IOPS=194k, BW=1513MiB/s (1587MB/s)(148GiB/100001msec); 0 zone resets
> slat (usec): min=4, max=455, avg=48.09, stdev=18.20
> clat (usec): min=6, max=25435, avg=164.91, stdev=104.69
> lat (usec): min=66, max=25512, avg=213.03, stdev=101.19
> clat percentiles (usec):
> | 1.00th=[ 60], 5.00th=[ 68], 10.00th=[ 78], 20.00th=[ 94],
> | 30.00th=[ 108], 40.00th=[ 127], 50.00th=[ 143], 60.00th=[ 180],
> | 70.00th=[ 210], 80.00th=[ 227], 90.00th=[ 265], 95.00th=[ 302],
> | 99.00th=[ 379], 99.50th=[ 441], 99.90th=[ 562], 99.95th=[ 611],
> | 99.99th=[ 816]
> bw ( MiB/s): min= 289, max= 1802, per=81.00%, avg=1225.72, stdev=173.92, samples=197
> iops : min=37086, max=230692, avg=156891.42, stdev=22261.80, samples=197
> lat (usec) : 10=0.01%, 50=0.09%, 100=25.74%, 250=60.50%, 500=13.46%
> lat (usec) : 750=0.19%, 1000=0.01%
> lat (msec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
> cpu : usr=31.80%, sys=26.04%, ctx=793383, majf=0, minf=1301
> IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=66.7%, >=64=33.3%
> submit : 0=0.0%, 4=0.0%, 8=0.0%, 16=50.0%, 32=50.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.1%, 64=0.0%, >=64=0.0%
> issued rwts: total=0,19369087,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
> WRITE: bw=1513MiB/s (1587MB/s), 1513MiB/s-1513MiB/s (1587MB/s-1587MB/s), io=148GiB (159GB), run=100001-100001msec
>
> Disk stats (read/write):
> nvme0n1: ios=0/127095, merge=0/15, ticks=0/4936953, in_queue=4873341, util=55.82%
>
> but that's still worse than the plain sync run, and it doesn't work well
> for reads.
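>
> (For reference, a hedged sketch of what that batched mode amounts to
> with liburing - the batch sizes mirror the fio options above, everything
> else is illustrative:)
>
> #include <liburing.h>
>
> static void write_batch(struct io_uring *ring, int fd, char *buf, off_t *off)
> {
>         struct io_uring_cqe *cqes[16];
>         unsigned i, got;
>
>         for (i = 0; i < 32; i++) {
>                 struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
>
>                 if (!sqe)
>                         break;
>                 io_uring_prep_write(sqe, fd, buf, 8192, *off);
>                 *off += 8192;
>         }
>         io_uring_submit(ring);                    /* one enter() for up to 32 SQEs */
>
>         io_uring_wait_cqe_nr(ring, &cqes[0], 16); /* wait for >= 16 CQEs */
>         got = io_uring_peek_batch_cqe(ring, cqes, 16);
>         io_uring_cq_advance(ring, got);           /* reap them in one go */
> }
>
> The point being that the syscall cost gets amortized over the batch
> instead of being paid per 8k block.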
>
>
> Now, it's clear that there's some overhead in backgrounding small IOs to
> the async context, but this seems awfully high. There's still a slowdown
> when using considerably larger IO sizes (128k).
>
> For the 128k, iodepth 1 case I get:
> sync:
> - 97.77% 0.04% fio fio [.] fio_syncio_queue
> 97.73% fio_syncio_queue
> - __libc_write
> - __libc_write
> - 97.50% entry_SYSCALL_64
> - 97.31% do_syscall_64
> - 97.25% ksys_write
> - 97.10% vfs_write
> - 96.91% new_sync_write
> - 96.82% xfs_file_buffered_aio_write
> - 96.13% iomap_file_buffered_write
> - iomap_apply
> - 95.52% iomap_write_actor
> + 37.66% iov_iter_copy_from_user_atomic
> + 26.83% iomap_write_begin
> + 26.64% iomap_write_end.isra.0
> 2.26% iov_iter_fault_in_readable
> 0.87% balance_dirty_pages_ratelimited
> and --no-children shows:
> + 36.57% fio [kernel.vmlinux] [k] copy_user_enhanced_fast_string
> + 14.31% fio [kernel.vmlinux] [k] iomap_set_range_uptodate
> + 3.85% fio [kernel.vmlinux] [k] get_page_from_freelist
> + 3.34% fio [kernel.vmlinux] [k] queued_spin_lock_slowpath
> + 2.94% fio [kernel.vmlinux] [k] xas_load
> + 2.66% fio [kernel.vmlinux] [k] __add_to_page_cache_locked
> + 2.51% fio [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> + 2.28% fio [kernel.vmlinux] [k] iov_iter_fault_in_readable
> + 1.87% fio [kernel.vmlinux] [k] __lru_cache_add
> + 1.80% fio [kernel.vmlinux] [k] __pagevec_lru_add_fn
> + 1.59% fio [kernel.vmlinux] [k] node_dirty_ok
> + 1.30% fio [kernel.vmlinux] [k] _raw_spin_lock_irq
> + 1.30% fio [kernel.vmlinux] [k] iomap_set_page_dirty
> + 1.25% fio fio [.] get_io_u
> + 1.18% fio [kernel.vmlinux] [k] xas_start
>
> io_uring (the io_wqe_worker-0):
> - 100.00% 22.64% io_wqe_worker-0 [kernel.vmlinux] [k] io_wqe_worker
> - 77.36% io_wqe_worker
> - 77.18% io_worker_handle_work
> - 75.80% io_rw_async
> - io_wq_submit_work
> - 75.57% io_issue_sqe
> - io_write
> - 70.96% xfs_file_buffered_aio_write
> - 70.37% iomap_file_buffered_write
> - iomap_apply
> - 70.08% iomap_write_actor
> + 30.05% iov_iter_copy_from_user_atomic
> - 18.38% iomap_write_begin
> - 17.64% grab_cache_page_write_begin
> + 16.73% pagecache_get_page
> 0.51% wait_for_stable_page
> - 16.96% iomap_write_end.isra.0
> 8.72% iomap_set_range_uptodate
> + 7.35% iomap_set_page_dirty
> 3.45% iov_iter_fault_in_readable
> 0.56% balance_dirty_pages_ratelimited
> - 4.36% kiocb_done
> - 3.56% io_cqring_ev_posted
> + __wake_up_common_lock
> 0.61% io_cqring_add_event
> 0.56% io_free_req
> --no-children:
> + 29.55% io_wqe_worker-0 [kernel.vmlinux] [k] copy_user_enhanced_fast_string
> + 22.64% io_wqe_worker-0 [kernel.vmlinux] [k] io_wqe_worker
> + 8.71% io_wqe_worker-0 [kernel.vmlinux] [k] iomap_set_range_uptodate
> + 3.41% io_wqe_worker-0 [kernel.vmlinux] [k] iov_iter_fault_in_readable
> + 2.50% io_wqe_worker-0 [kernel.vmlinux] [k] get_page_from_freelist
> + 2.24% io_wqe_worker-0 [kernel.vmlinux] [k] xas_load
> + 1.82% io_wqe_worker-0 [kernel.vmlinux] [k] queued_spin_lock_slowpath
> + 1.80% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> + 1.72% io_wqe_worker-0 [kernel.vmlinux] [k] __add_to_page_cache_locked
> + 1.58% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irq
> + 1.29% io_wqe_worker-0 [kernel.vmlinux] [k] __lru_cache_add
> + 1.27% io_wqe_worker-0 [kernel.vmlinux] [k] __pagevec_lru_add_fn
> + 1.08% io_wqe_worker-0 [kernel.vmlinux] [k] xas_start
>
> So it seems the biggest difference is the CPU time in io_wqe_worker
> itself, which in turn appears to mostly be:
>
> │ mov %rbp,%rdi
> │ → callq io_worker_handle_work
> │ mov $0x3e7,%eax
> │ ↓ jmp 28c
> │ rep_nop():
> │ }
> │
> │ /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
> │ static __always_inline void rep_nop(void)
> │ {
> │ asm volatile("rep; nop" ::: "memory");
> 74.50 │281: pause
> │ io_worker_spin_for_work():
> │ while (++i < 1000) {
> 11.15 │ sub $0x1,%eax
> │ ↑ je 9b
> │ __read_once_size():
> │ __READ_ONCE_SIZE;
> 9.53 │28c: mov 0x8(%rbx),%rdx
> │ io_wqe_run_queue():
> │ if (!wq_list_empty(&wqe->work_list) &&
> │ test %rdx,%rdx
> 2.68 │ ↓ je 29f
> │ io_worker_spin_for_work():
>
>
> With 8k it looks similar, but more extreme:
> sync (fio itself):
> + 29.89% fio [kernel.vmlinux] [k] copy_user_enhanced_fast_string
> + 14.10% fio [kernel.vmlinux] [k] iomap_set_range_uptodate
> + 2.80% fio [kernel.vmlinux] [k] get_page_from_freelist
> + 2.66% fio [kernel.vmlinux] [k] queued_spin_lock_slowpath
> + 2.46% fio [kernel.vmlinux] [k] xas_load
> + 1.97% fio [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> + 1.94% fio [kernel.vmlinux] [k] __add_to_page_cache_locked
> + 1.79% fio [kernel.vmlinux] [k] iomap_set_page_dirty
> + 1.64% fio [kernel.vmlinux] [k] iov_iter_fault_in_readable
> + 1.61% fio fio [.] __fio_gettime
> + 1.49% fio [kernel.vmlinux] [k] __pagevec_lru_add_fn
> + 1.39% fio [kernel.vmlinux] [k] __lru_cache_add
> 1.30% fio fio [.] get_io_u
> + 1.11% fio [kernel.vmlinux] [k] up_write
> + 1.10% fio [kernel.vmlinux] [k] node_dirty_ok
> + 1.09% fio [kernel.vmlinux] [k] down_write
>
> io_uring (io_wqe_worker-0 again):
> + 54.13% io_wqe_worker-0 [kernel.vmlinux] [k] io_wqe_worker
> + 11.01% io_wqe_worker-0 [kernel.vmlinux] [k] copy_user_enhanced_fast_string
> + 2.70% io_wqe_worker-0 [kernel.vmlinux] [k] iomap_set_range_uptodate
> + 2.62% io_wqe_worker-0 [kernel.vmlinux] [k] iov_iter_fault_in_readable
> + 1.91% io_wqe_worker-0 [kernel.vmlinux] [k] __slab_free
> + 1.85% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> + 1.66% io_wqe_worker-0 [kernel.vmlinux] [k] try_to_wake_up
> + 1.62% io_wqe_worker-0 [kernel.vmlinux] [k] io_cqring_fill_event
> + 1.55% io_wqe_worker-0 [kernel.vmlinux] [k] io_rw_async
> + 1.43% io_wqe_worker-0 [kernel.vmlinux] [k] __wake_up_common
> + 1.37% io_wqe_worker-0 [kernel.vmlinux] [k] _raw_spin_lock_irq
> + 1.20% io_wqe_worker-0 [kernel.vmlinux] [k] rw_verify_area
>
> which I think is pretty clear evidence we're hitting fairly significant
> contention on the queue lock.
>
>
> I originally hit this in postgres, not fio, but I thought it would be
> easier to reproduce this way. There's obviously a benefit to doing things
> in the background - but it requires odd logic around deciding when to use
> io_uring and when not.
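>
> (Purely for illustration - the kind of dispatch logic I mean; the 128k
> threshold and the queue_on_io_uring() helper are made up, none of this
> is actual postgres code:)
>
> #include <stdbool.h>
> #include <unistd.h>
>
> void queue_on_io_uring(int fd, const void *buf, size_t len, off_t off); /* hypothetical */
>
> static void issue_write(int fd, const void *buf, size_t len, off_t off,
>                         bool use_direct_io)
> {
>         if (use_direct_io || len >= 128 * 1024)
>                 queue_on_io_uring(fd, buf, len, off);   /* async path pays off */
>         else
>                 pwrite(fd, buf, len, off);              /* small buffered IO: cheaper inline */
> }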
>
> To be clear, none of this happens with DIO, but I don't foresee switching
> to DIO for all IO by default ever (it places too high demands on accurate
> configuration).
Can you try with this added?
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 76cbf474c184..207daf83f209 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -620,6 +620,7 @@ static const struct io_op_def io_op_defs[] = {
.async_ctx = 1,
.needs_mm = 1,
.needs_file = 1,
+ .hash_reg_file = 1,
.unbound_nonreg_file = 1,
},
[IORING_OP_WRITEV] = {
@@ -634,6 +635,7 @@ static const struct io_op_def io_op_defs[] = {
},
[IORING_OP_READ_FIXED] = {
.needs_file = 1,
+ .hash_reg_file = 1,
.unbound_nonreg_file = 1,
},
[IORING_OP_WRITE_FIXED] = {
@@ -711,11 +713,13 @@ static const struct io_op_def io_op_defs[] = {
[IORING_OP_READ] = {
.needs_mm = 1,
.needs_file = 1,
+ .hash_reg_file = 1,
.unbound_nonreg_file = 1,
},
[IORING_OP_WRITE] = {
.needs_mm = 1,
.needs_file = 1,
+ .hash_reg_file = 1,
.unbound_nonreg_file = 1,
},
[IORING_OP_FADVISE] = {
@@ -955,7 +959,7 @@ static inline bool io_prep_async_work(struct io_kiocb *req,
bool do_hashed = false;
if (req->flags & REQ_F_ISREG) {
- if (def->hash_reg_file)
+ if (!(req->kiocb->ki_flags & IOCB_DIRECT) && def->hash_reg_file)
do_hashed = true;
} else {
if (def->unbound_nonreg_file)
--
Jens Axboe
Thread overview: 10+ messages
2020-02-14 19:50 Buffered IO async context overhead Andres Freund
2020-02-14 20:13 ` Jens Axboe [this message]
2020-02-14 20:31 ` Andres Freund
2020-02-14 20:49 ` Jens Axboe
2020-02-24 9:35 ` Andres Freund
2020-02-24 15:22 ` Jens Axboe
2020-03-09 20:03 ` Pavel Begunkov
2020-03-09 20:41 ` Jens Axboe
2020-03-09 21:02 ` Pavel Begunkov
2020-03-09 21:29 ` Jens Axboe