From: Jens Axboe <[email protected]>
To: Keith Busch <[email protected]>,
[email protected], [email protected],
[email protected]
Cc: Keith Busch <[email protected]>
Subject: Re: [PATCHv3] io_uring: set plug tags for same file
Date: Fri, 11 Aug 2023 13:24:17 -0600 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 7/31/23 2:39 PM, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> io_uring tries to optimize allocating tags by hinting to the plug how
> many it expects to need for a batch instead of allocating each tag
> individually. But io_uring submission queues may have a mix of many
> devices for IO, so the number of IOs counted may be overestimated. This
> can lead to allocating too many tags, which adds overhead to finding
> that many contiguous tags, freeing up the ones we didn't use, and may
> starve out other users that can actually use them.
>
> When starting a new batch of uring commands, count only commands that
> match the file descriptor of the first one seen for this optimization. This
> avoids having to call the unlikely "blk_mq_free_plug_rqs()" at the end of
> a submission when multiple devices are used in a batch.
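For anyone skimming the thread, the core of the idea is roughly the
following (hand-wavy sketch only, not the actual diff: count_same_fd()
is a made-up helper, and the real patch hooks into the submission state
rather than taking an SQE array like this):

#include <uapi/linux/io_uring.h>	/* struct io_uring_sqe */
#include <linux/blkdev.h>		/* blk_start_plug_nr_ios() */

/*
 * Count only the SQEs that target the same fd as the first one in the
 * batch, so the plug's tag hint reflects what a single device will
 * actually see rather than the whole mixed-device batch.
 */
static unsigned int count_same_fd(const struct io_uring_sqe *sqes,
				  unsigned int nr)
{
	unsigned int i, count = 0;

	for (i = 0; i < nr; i++) {
		if (sqes[i].fd == sqes[0].fd)
			count++;
	}
	return count;
}

/* then, instead of hinting the full batch size to the plug: */
/*	blk_start_plug_nr_ios(&plug, count_same_fd(sqes, nr)); */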
Wanted to run this through both the peak IOPS and networking testing, and
started with the former. Here's a peak run with -git + pending 6.5
changes + pending 6.6 changes:
IOPS=125.88M, BW=61.46GiB/s, IOS/call=32/31
IOPS=125.39M, BW=61.23GiB/s, IOS/call=31/31
IOPS=124.97M, BW=61.02GiB/s, IOS/call=32/32
IOPS=124.60M, BW=60.84GiB/s, IOS/call=32/32
IOPS=124.27M, BW=60.68GiB/s, IOS/call=31/31
IOPS=124.00M, BW=60.54GiB/s, IOS/call=32/31
and here's one with the patch:
IOPS=121.69M, BW=59.42GiB/s, IOS/call=31/32
IOPS=121.26M, BW=59.21GiB/s, IOS/call=32/32
IOPS=120.87M, BW=59.02GiB/s, IOS/call=31/31
IOPS=120.87M, BW=59.02GiB/s, IOS/call=32/32
IOPS=121.02M, BW=59.09GiB/s, IOS/call=32/32
IOPS=121.63M, BW=59.39GiB/s, IOS/call=31/32
IOPS=121.48M, BW=59.32GiB/s, IOS/call=31/31
Running a quick profile, here's the top diff:
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ................ ...........................................
#
6.69% -3.03% [kernel.vmlinux] [k] io_issue_sqe
9.65% +2.30% [nvme] [k] nvme_poll_cq
4.86% -1.55% [kernel.vmlinux] [k] io_submit_sqes
4.61% +1.40% [kernel.vmlinux] [k] blk_mq_submit_bio
4.79% -0.98% [kernel.vmlinux] [k] io_read
0.56% +0.97% [kernel.vmlinux] [k] blkdev_dio_unaligned.isra.0
3.61% +0.52% [kernel.vmlinux] [k] dma_unmap_page_attrs
2.04% -0.45% [kernel.vmlinux] [k] blk_add_rq_to_plug
Note that this is perf.data.old being the kernel with your patch, and
perf.data being the "stock" kernel mentioned above. The main difference
looks like we're spending more time in io_issue_sqe() and io_submit_sqes(),
and conversely less time polling. Usually when profiling a
polled workload, having more time in the polling function is good, as it
shows us we're spending less time everywhere else.
This is what I'm using:
sudo t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R0 -X1 -n24 -P1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme12n1 /dev/nvme13n1 /dev/nvme14n1 /dev/nvme15n1 /dev/nvme16n1 /dev/nvme17n1 /dev/nvme18n1 /dev/nvme19n1 /dev/nvme20n1 /dev/nvme21n1 /dev/nvme22n1 /dev/nvme23n1
which is submitting 32 requests at a time. Obviously we don't expect a
win in this case, as each thread is handling just a single NVMe device,
so every request in a batch targets the same queue and the stock kernel
will not over-allocate tags either.
If we change that to -n12 instead, meaning each thread will drive two
devices, here's what the stock kernel gets:
IOPS=60.95M, BW=29.76GiB/s, IOS/call=31/32
IOPS=60.99M, BW=29.78GiB/s, IOS/call=31/32
IOPS=60.96M, BW=29.77GiB/s, IOS/call=31/31
IOPS=60.95M, BW=29.76GiB/s, IOS/call=31/31
IOPS=60.91M, BW=29.74GiB/s, IOS/call=32/32
and with the patch:
IOPS=59.64M, BW=29.12GiB/s, IOS/call=32/31
IOPS=59.63M, BW=29.12GiB/s, IOS/call=31/32
IOPS=59.57M, BW=29.09GiB/s, IOS/call=31/31
IOPS=59.57M, BW=29.09GiB/s, IOS/call=32/32
IOPS=59.65M, BW=29.12GiB/s, IOS/call=31/31
Now these are both obviously lower, but I haven't done anything to
ensure that the two-drives-per-poller workload is optimized. For all I
know, the NUMA layout is now messed up too. Just noting that as a caveat;
the two runs are still comparable to each other.
Perf diff again looks similar; note that this time it's perf.data.old
that's the stock kernel and perf.data that's the one with your patch:
3.51% +2.84% [kernel.vmlinux] [k] io_issue_sqe
3.24% +1.35% [kernel.vmlinux] [k] io_submit_sqes
On the kernel without your patch, I also looked for tag flush overhead
but didn't find much:
0.02% io_uring [kernel.vmlinux] [k] blk_mq_free_plug_rqs
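For context, that path only triggers when the plug still has cached
(pre-allocated but unused) requests left over at unplug time - roughly
this bit of __blk_flush_plug(), quoting from memory so possibly not
verbatim:

	/* leftover pre-allocated requests are freed when the plug is flushed */
	if (unlikely(!rq_list_empty(plug->cached_rq)))
		blk_mq_free_plug_rqs(plug);

which is what your patch is trying to avoid having to do in the first
place.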
Outside of the peak performance concern with the patch, do you have a
workload that we should test this on?
--
Jens Axboe