Date: Fri, 14 Feb 2020 11:50:30 -0800
From: Andres Freund
To: io-uring@vger.kernel.org, axboe@kernel.dk
Subject: Buffered IO async context overhead
Message-ID: <20200214195030.cbnr6msktdl3tqhn@alap3.anarazel.de>

Hi,

For workloads where a lot of buffered IO is done, I see considerably
higher overhead when going through io_uring than when just doing the IO
synchronously.

This example is on a Samsung 970 Pro NVMe (i.e. decent, but not top
class). I've limited the amount of dirty buffers allowed to 2/4 GB to
keep the test times in check. This is with a relatively recent kernel,
33b40134e5cfbbccad7f3040d1919889537a3df7.
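For reference, the per-IO pattern the two fio engines compared below exercise
is roughly the following. This is a minimal sketch using liburing, not fio's
actual code; the file name, buffer contents and error handling are placeholders:

    /* write_compare.c - sketch of the two per-IO patterns compared below.
     * Build (assuming liburing is installed): gcc write_compare.c -luring */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>
    #include <liburing.h>

    #define BS 8192

    /* What ioengine=sync does per 8k block: one plain buffered pwrite(2). */
    static void sync_write_one(int fd, const char *buf, off_t off)
    {
            if (pwrite(fd, buf, BS, off) != BS)
                    perror("pwrite");
    }

    /* What ioengine=io_uring does per 8k block at iodepth=1: prep one SQE,
     * submit it, wait for the CQE. */
    static void uring_write_one(struct io_uring *ring, int fd, const char *buf, off_t off)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            struct io_uring_cqe *cqe;
            struct iovec iov = { .iov_base = (void *)buf, .iov_len = BS };

            io_uring_prep_writev(sqe, fd, &iov, 1, off);
            io_uring_submit(ring);
            if (io_uring_wait_cqe(ring, &cqe) == 0) {
                    if (cqe->res != BS)
                            fprintf(stderr, "write failed: %d\n", cqe->res);
                    io_uring_cqe_seen(ring, cqe);
            }
    }

    int main(void)
    {
            char buf[BS];
            struct io_uring ring;
            int fd = open("testfile", O_WRONLY | O_CREAT, 0644); /* buffered, no O_DIRECT */

            if (fd < 0 || io_uring_queue_init(64, &ring, 0) < 0) {
                    perror("setup");
                    return 1;
            }
            memset(buf, 'x', sizeof(buf));

            sync_write_one(fd, buf, 0);
            uring_write_one(&ring, fd, buf, BS);

            io_uring_queue_exit(&ring);
            close(fd);
            return 0;
    }

At iodepth=1 there is nothing to amortize the submit/complete round trip, and
for buffered writes the kernel ends up doing the actual write in an io-wq
worker thread (the io_wqe_worker-0 visible in the task listing below).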
fio --name=t --time_based=1 --runtime=100 --ioengine=io_uring --rw=write --bs=8k --filesize=300G --iodepth=1

Jobs: 1 (f=1): [W(1)][100.0%][w=1049MiB/s][w=134k IOPS][eta 00m:00s]
t: (groupid=0, jobs=1): err= 0: pid=46625: Fri Feb 14 11:22:05 2020
  write: IOPS=73.4k, BW=573MiB/s (601MB/s)(55.0GiB/100001msec); 0 zone resets
    slat (nsec): min=571, max=607779, avg=2927.18, stdev=1213.47
    clat (nsec): min=40, max=1400.7k, avg=9947.92, stdev=2606.65
     lat (usec): min=4, max=1403, avg=12.94, stdev= 3.28
    clat percentiles (nsec):
     |  1.00th=[ 4960],  5.00th=[ 6368], 10.00th=[ 8032], 20.00th=[ 8640],
     | 30.00th=[ 9024], 40.00th=[ 9280], 50.00th=[ 9664], 60.00th=[10048],
     | 70.00th=[10560], 80.00th=[11072], 90.00th=[11968], 95.00th=[13120],
     | 99.00th=[21632], 99.50th=[23424], 99.90th=[26752], 99.95th=[28288],
     | 99.99th=[35584]
   bw (  KiB/s): min=192408, max=1031483, per=79.92%, avg=468998.16, stdev=108691.70, samples=199
   iops        : min=24051, max=128935, avg=58624.37, stdev=13586.54, samples=199
  lat (nsec)   : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=56.98%, 20=41.67%, 50=1.32%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=15.63%, sys=34.79%, ctx=7333387, majf=0, minf=41
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,7335827,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=573MiB/s (601MB/s), 573MiB/s-573MiB/s (601MB/s-601MB/s), io=55.0GiB (60.1GB), run=100001-100001msec

Disk stats (read/write):
  nvme0n1: ios=0/49033, merge=0/12, ticks=0/1741787, in_queue=1717299, util=20.89%

During this, most of the time there's a number of tasks with decent CPU usage:

46627 root      20   0       0      0      0 R  99.7   0.0   1:07.74 io_wqe_worker-0
46625 root      20   0  756920   6708   1984 S  51.0   0.0   0:34.56 fio
42818 root      20   0       0      0      0 I   3.3   0.0   0:03.23 kworker/u82:6-flush-259:0
43338 root      20   0       0      0      0 I   1.3   0.0   0:02.06 kworker/u82:18-events_unbound

Comparing that to using sync:

fio --name=t --time_based=1 --runtime=100 --ioengine=sync --rw=write --bs=8k --filesize=300G --iodepth=1

Jobs: 1 (f=1): [W(1)][100.0%][w=1607MiB/s][w=206k IOPS][eta 00m:00s]
t: (groupid=0, jobs=1): err= 0: pid=46690: Fri Feb 14 11:24:17 2020
  write: IOPS=234k, BW=1831MiB/s (1920MB/s)(179GiB/100001msec); 0 zone resets
    clat (nsec): min=1612, max=6649.7k, avg=3913.39, stdev=3170.65
     lat (nsec): min=1653, max=6649.7k, avg=3955.30, stdev=3172.59
    clat percentiles (nsec):
     |  1.00th=[ 2512],  5.00th=[ 2544], 10.00th=[ 2576], 20.00th=[ 2640],
     | 30.00th=[ 2672], 40.00th=[ 2768], 50.00th=[ 3376], 60.00th=[ 3792],
     | 70.00th=[ 4192], 80.00th=[ 4832], 90.00th=[ 5792], 95.00th=[ 7008],
     | 99.00th=[11968], 99.50th=[15552], 99.90th=[22656], 99.95th=[26496],
     | 99.99th=[40192]
   bw (  MiB/s): min=  989, max= 2118, per=82.26%, avg=1506.28, stdev=169.56, samples=193
   iops        : min=126684, max=271118, avg=192803.37, stdev=21703.94, samples=193
  lat (usec)   : 2=0.01%, 4=66.33%, 10=32.14%, 20=1.34%, 50=0.19%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 10=0.01%
  cpu          : usr=8.35%, sys=91.55%, ctx=6183, majf=0, minf=74
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,23437892,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1831MiB/s (1920MB/s), 1831MiB/s-1831MiB/s (1920MB/s-1920MB/s), io=179GiB (192GB), run=100001-100001msec

Disk stats (read/write):
  nvme0n1: ios=0/152371, merge=0/587772, ticks=0/5657236, in_queue=5580764, util=67.57%

Tasks:

46690 root      20   0  756912   6524   1800 R 100.0   0.0   1:24.64 fio
 2913 root      20   0       0      0      0 S  32.2   0.0   0:09.57 kswapd1
45683 root      20   0       0      0      0 I  13.2   0.0   0:08.57 kworker/u82:16-events_unbound
42816 root      20   0       0      0      0 I  11.2   0.0   0:09.71 kworker/u82:1-flush-259:0
36087 root      20   0       0      0      0 I   1.6   0.0   0:01.51 kworker/27:2-xfs-conv/nvme0n1

I.e. sync did about three times the throughput of io_uring.

It's possible to make io_uring perform better by batching
submissions/completions:

fio --name=t --time_based=1 --runtime=100 --ioengine=io_uring --rw=write --bs=8k --filesize=300G --iodepth=64 --iodepth_batch_submit=32 --iodepth_batch_complete_min=16 --iodepth_low=16

t: (g=0): rw=write, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=io_uring, iodepth=64
fio-3.17
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=1382MiB/s][w=177k IOPS][eta 00m:00s]
t: (groupid=0, jobs=1): err= 0: pid=46849: Fri Feb 14 11:28:13 2020
  write: IOPS=194k, BW=1513MiB/s (1587MB/s)(148GiB/100001msec); 0 zone resets
    slat (usec): min=4, max=455, avg=48.09, stdev=18.20
    clat (usec): min=6, max=25435, avg=164.91, stdev=104.69
     lat (usec): min=66, max=25512, avg=213.03, stdev=101.19
    clat percentiles (usec):
     |  1.00th=[   60],  5.00th=[   68], 10.00th=[   78], 20.00th=[   94],
     | 30.00th=[  108], 40.00th=[  127], 50.00th=[  143], 60.00th=[  180],
     | 70.00th=[  210], 80.00th=[  227], 90.00th=[  265], 95.00th=[  302],
     | 99.00th=[  379], 99.50th=[  441], 99.90th=[  562], 99.95th=[  611],
     | 99.99th=[  816]
   bw (  MiB/s): min=  289, max= 1802, per=81.00%, avg=1225.72, stdev=173.92, samples=197
   iops        : min=37086, max=230692, avg=156891.42, stdev=22261.80, samples=197
  lat (usec)   : 10=0.01%, 50=0.09%, 100=25.74%, 250=60.50%, 500=13.46%
  lat (usec)   : 750=0.19%, 1000=0.01%
  lat (msec)   : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=31.80%, sys=26.04%, ctx=793383, majf=0, minf=1301
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=66.7%, >=64=33.3%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=50.0%, 32=50.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,19369087,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=1513MiB/s (1587MB/s), 1513MiB/s-1513MiB/s (1587MB/s-1587MB/s), io=148GiB (159GB), run=100001-100001msec

Disk stats (read/write):
  nvme0n1: ios=0/127095, merge=0/15, ticks=0/4936953, in_queue=4873341, util=55.82%

but that's still worse than the plain sync one, and it doesn't work well
for reads.
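In application code, the batched fio options above correspond roughly to the
following submission pattern. This is a sketch using liburing; the fd, buffers
and error handling are placeholders, and the batch sizes are only illustrative:

    /* Rough equivalent of --iodepth_batch_submit / --iodepth_batch_complete_min:
     * queue a batch of SQEs, submit them with one syscall, then reap a batch
     * of CQEs before queueing more. */
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <liburing.h>

    #define BATCH 32
    #define BS    8192

    static int submit_batch(struct io_uring *ring, int fd, struct iovec *iovs, off_t off)
    {
            struct io_uring_cqe *cqe;
            unsigned i, completed = 0;

            for (i = 0; i < BATCH; i++) {
                    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                    if (!sqe)
                            break;
                    io_uring_prep_writev(sqe, fd, &iovs[i], 1, off + i * BS);
            }

            /* One submission syscall for the whole batch. */
            io_uring_submit(ring);

            /* Reap at least half the batch before queueing more, similar in
             * spirit to --iodepth_batch_complete_min=16. */
            while (completed < BATCH / 2) {
                    if (io_uring_wait_cqe(ring, &cqe) < 0)
                            break;
                    if (cqe->res != BS)
                            return -1;
                    io_uring_cqe_seen(ring, cqe);
                    completed++;
            }
            return completed;
    }

This mainly reduces the number of io_uring_enter() calls per IO; the buffered
writes themselves are still executed the same way kernel-side.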
Now, it's clear that there's some overhead in backgrounding small IOs to
the async context, but this seems awfully high. There's still a slowdown
when using considerably larger IOs (128k).

For the 128k, iodepth=1 case I get:

sync:

-   97.77%     0.04%  fio              fio                [.] fio_syncio_queue
     97.73% fio_syncio_queue
      - __libc_write
         - __libc_write
            - 97.50% entry_SYSCALL_64
               - 97.31% do_syscall_64
                  - 97.25% ksys_write
                     - 97.10% vfs_write
                        - 96.91% new_sync_write
                           - 96.82% xfs_file_buffered_aio_write
                              - 96.13% iomap_file_buffered_write
                                 - iomap_apply
                                    - 95.52% iomap_write_actor
                                       + 37.66% iov_iter_copy_from_user_atomic
                                       + 26.83% iomap_write_begin
                                       + 26.64% iomap_write_end.isra.0
                                         2.26% iov_iter_fault_in_readable
                                         0.87% balance_dirty_pages_ratelimited

and --no-children shows:

+ 36.57%  fio  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
+ 14.31%  fio  [kernel.vmlinux]  [k] iomap_set_range_uptodate
+  3.85%  fio  [kernel.vmlinux]  [k] get_page_from_freelist
+  3.34%  fio  [kernel.vmlinux]  [k] queued_spin_lock_slowpath
+  2.94%  fio  [kernel.vmlinux]  [k] xas_load
+  2.66%  fio  [kernel.vmlinux]  [k] __add_to_page_cache_locked
+  2.51%  fio  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
+  2.28%  fio  [kernel.vmlinux]  [k] iov_iter_fault_in_readable
+  1.87%  fio  [kernel.vmlinux]  [k] __lru_cache_add
+  1.80%  fio  [kernel.vmlinux]  [k] __pagevec_lru_add_fn
+  1.59%  fio  [kernel.vmlinux]  [k] node_dirty_ok
+  1.30%  fio  [kernel.vmlinux]  [k] _raw_spin_lock_irq
+  1.30%  fio  [kernel.vmlinux]  [k] iomap_set_page_dirty
+  1.25%  fio  fio               [.] get_io_u
+  1.18%  fio  [kernel.vmlinux]  [k] xas_start

io_uring (the io_wqe_worker-0):

-  100.00%    22.64%  io_wqe_worker-0  [kernel.vmlinux]  [k] io_wqe_worker
   - 77.36% io_wqe_worker
      - 77.18% io_worker_handle_work
         - 75.80% io_rw_async
            - io_wq_submit_work
               - 75.57% io_issue_sqe
                  - io_write
                     - 70.96% xfs_file_buffered_aio_write
                        - 70.37% iomap_file_buffered_write
                           - iomap_apply
                              - 70.08% iomap_write_actor
                                 + 30.05% iov_iter_copy_from_user_atomic
                                 - 18.38% iomap_write_begin
                                    - 17.64% grab_cache_page_write_begin
                                       + 16.73% pagecache_get_page
                                         0.51% wait_for_stable_page
                                 - 16.96% iomap_write_end.isra.0
                                      8.72% iomap_set_range_uptodate
                                    + 7.35% iomap_set_page_dirty
                                   3.45% iov_iter_fault_in_readable
                                   0.56% balance_dirty_pages_ratelimited
                     - 4.36% kiocb_done
                        - 3.56% io_cqring_ev_posted
                             + __wake_up_common_lock
                          0.61% io_cqring_add_event
                       0.56% io_free_req

--no-children:

+ 29.55%  io_wqe_worker-0  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
+ 22.64%  io_wqe_worker-0  [kernel.vmlinux]  [k] io_wqe_worker
+  8.71%  io_wqe_worker-0  [kernel.vmlinux]  [k] iomap_set_range_uptodate
+  3.41%  io_wqe_worker-0  [kernel.vmlinux]  [k] iov_iter_fault_in_readable
+  2.50%  io_wqe_worker-0  [kernel.vmlinux]  [k] get_page_from_freelist
+  2.24%  io_wqe_worker-0  [kernel.vmlinux]  [k] xas_load
+  1.82%  io_wqe_worker-0  [kernel.vmlinux]  [k] queued_spin_lock_slowpath
+  1.80%  io_wqe_worker-0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
+  1.72%  io_wqe_worker-0  [kernel.vmlinux]  [k] __add_to_page_cache_locked
+  1.58%  io_wqe_worker-0  [kernel.vmlinux]  [k] _raw_spin_lock_irq
+  1.29%  io_wqe_worker-0  [kernel.vmlinux]  [k] __lru_cache_add
+  1.27%  io_wqe_worker-0  [kernel.vmlinux]  [k] __pagevec_lru_add_fn
+  1.08%  io_wqe_worker-0  [kernel.vmlinux]  [k] xas_start

So it seems the biggest difference is the CPU time spent in io_wqe_worker
itself, which in turn appears to mostly be:

       │       mov    %rbp,%rdi
       │     → callq  io_worker_handle_work
       │       mov    $0x3e7,%eax
       │     ↓ jmp    28c
       │     rep_nop():
       │     }
       │
       │     /* REP NOP (PAUSE) is a good thing to insert into busy-wait loops. */
       │     static __always_inline void rep_nop(void)
       │     {
       │             asm volatile("rep; nop" ::: "memory");
 74.50 │281:   pause
       │     io_worker_spin_for_work():
       │             while (++i < 1000) {
 11.15 │       sub    $0x1,%eax
       │     ↑ je     9b
       │     __read_once_size():
       │             __READ_ONCE_SIZE;
  9.53 │28c:   mov    0x8(%rbx),%rdx
       │     io_wqe_run_queue():
       │             if (!wq_list_empty(&wqe->work_list) &&
       │       test   %rdx,%rdx
  2.68 │     ↓ je     29f
       │     io_worker_spin_for_work():
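In C terms, the loop perf is pointing at corresponds roughly to the following.
This is a simplified sketch reconstructed from the annotation above, not the
verbatim kernel source, and it obviously only builds inside the kernel tree:

    /* Before sleeping, the worker spins up to ~1000 iterations, executing
     * PAUSE and re-checking the wqe work list for newly queued work. */
    static inline bool io_wqe_run_queue(struct io_wqe *wqe)
    {
            /* roughly: is there work queued on this wqe? */
            return !wq_list_empty(&wqe->work_list);
    }

    static inline void io_worker_spin_for_work(struct io_wqe *wqe)
    {
            int i = 0;

            while (++i < 1000) {
                    if (io_wqe_run_queue(wqe))
                            break;
                    cpu_relax();    /* "rep; nop" (PAUSE) - the 74% hot spot above */
            }
    }

I.e. the worker spends most of its cycles executing PAUSE while polling the
work list.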
With 8k it looks similar, but more extreme:

sync (fio itself):

+ 29.89%  fio  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
+ 14.10%  fio  [kernel.vmlinux]  [k] iomap_set_range_uptodate
+  2.80%  fio  [kernel.vmlinux]  [k] get_page_from_freelist
+  2.66%  fio  [kernel.vmlinux]  [k] queued_spin_lock_slowpath
+  2.46%  fio  [kernel.vmlinux]  [k] xas_load
+  1.97%  fio  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
+  1.94%  fio  [kernel.vmlinux]  [k] __add_to_page_cache_locked
+  1.79%  fio  [kernel.vmlinux]  [k] iomap_set_page_dirty
+  1.64%  fio  [kernel.vmlinux]  [k] iov_iter_fault_in_readable
+  1.61%  fio  fio               [.] __fio_gettime
+  1.49%  fio  [kernel.vmlinux]  [k] __pagevec_lru_add_fn
+  1.39%  fio  [kernel.vmlinux]  [k] __lru_cache_add
   1.30%  fio  fio               [.] get_io_u
+  1.11%  fio  [kernel.vmlinux]  [k] up_write
+  1.10%  fio  [kernel.vmlinux]  [k] node_dirty_ok
+  1.09%  fio  [kernel.vmlinux]  [k] down_write

io_uring (io_wqe_worker-0 again):

+ 54.13%  io_wqe_worker-0  [kernel.vmlinux]  [k] io_wqe_worker
+ 11.01%  io_wqe_worker-0  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
+  2.70%  io_wqe_worker-0  [kernel.vmlinux]  [k] iomap_set_range_uptodate
+  2.62%  io_wqe_worker-0  [kernel.vmlinux]  [k] iov_iter_fault_in_readable
+  1.91%  io_wqe_worker-0  [kernel.vmlinux]  [k] __slab_free
+  1.85%  io_wqe_worker-0  [kernel.vmlinux]  [k] _raw_spin_lock_irqsave
+  1.66%  io_wqe_worker-0  [kernel.vmlinux]  [k] try_to_wake_up
+  1.62%  io_wqe_worker-0  [kernel.vmlinux]  [k] io_cqring_fill_event
+  1.55%  io_wqe_worker-0  [kernel.vmlinux]  [k] io_rw_async
+  1.43%  io_wqe_worker-0  [kernel.vmlinux]  [k] __wake_up_common
+  1.37%  io_wqe_worker-0  [kernel.vmlinux]  [k] _raw_spin_lock_irq
+  1.20%  io_wqe_worker-0  [kernel.vmlinux]  [k] rw_verify_area

which I think is pretty clear evidence that we're hitting fairly significant
contention on the queue lock.

I originally hit this in postgres, not fio, but I thought it would be easier
to reproduce this way.

There's obviously benefit to doing things in the background - but it
requires odd logic around deciding when to use io_uring, and when not.

To be clear, none of this happens with DIO, but I don't foresee switching
to DIO for all IO by default ever (too high demands on accurate
configuration).

Greetings,

Andres Freund