From: Xiaoguang Wang <[email protected]>
To: Keith Busch <[email protected]>
Cc: [email protected], [email protected],
Christoph Hellwig <[email protected]>, Jens Axboe <[email protected]>
Subject: Re: another nvme passthrough design based on nvme hardware queue file abstraction
Date: Fri, 28 Apr 2023 10:42:32 +0800 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
hi,
> On Thu, Apr 27, 2023 at 08:17:30PM +0800, Xiaoguang Wang wrote:
>>> On Wed, Apr 26, 2023 at 09:19:57PM +0800, Xiaoguang Wang wrote:
>>>> hi all,
>>>>
>>>> Recently we started to test the nvme passthrough feature, which is based on
>>>> io_uring. Originally we thought its performance would be much better than a
>>>> normal polled nvme test, but the test results show that it's not:
>>>> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O0 -n1 -u1 /dev/ng1n1
>>>> IOPS=891.49K, BW=435MiB/s, IOS/call=32/31
>>>> IOPS=891.07K, BW=435MiB/s, IOS/call=31/31
>>>>
>>>> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O1 -n1 /dev/nvme1n1
>>>> IOPS=807.81K, BW=394MiB/s, IOS/call=32/31
>>>> IOPS=808.13K, BW=394MiB/s, IOS/call=32/32
>>>>
>>>> about a 10% IOPS improvement. I'm not saying it's not good, I had just
>>>> thought it would perform much better.
>>> What did you think it should be? What is the maximum 512b read IOPs your device
>>> is capable of producing?
>> From the naming of this feature, I thought it would bypass the block layer
>> entirely and hence gain much higher performance. For myself, if this feature
>> could improve IOPS by 25% or more, that would be much more attractive, and
>> users would like to try it. Again, I'm not saying this feature is not good,
>> I just thought it would perform much better for small IO.
> It does bypass the block layer. The driver just uses library functions provided
> by the block layer for things it doesn't want to duplicate. Reimplementing that
> functionality in the driver isn't going to improve anything.
>
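I see, thanks. To check that I understand which library functions you mean, below
is my rough reading of the uring_cmd passthrough submission path, collapsed into a
single function for illustration (the real code is spread across
nvme_uring_cmd_io()/nvme_alloc_user_request()/nvme_map_user_request() in
drivers/nvme/host/ioctl.c); error handling, fixed buffers and metadata are all
omitted, so this is only a sketch and please correct me if I misread anything:

/* Simplified sketch only -- imagine it sitting next to the real nvme_uring_cmd_io(). */
static int nvme_uring_cmd_io_sketch(struct nvme_ns *ns, struct nvme_command *c,
                                    void __user *ubuf, unsigned int len,
                                    struct io_uring_cmd *ioucmd)
{
        struct request_queue *q = ns->queue;
        struct request *req;
        int ret;

        /* block layer "library" calls: allocate a passthrough request ... */
        req = blk_mq_alloc_request(q, nvme_req_op(c), 0);
        if (IS_ERR(req))
                return PTR_ERR(req);
        nvme_init_request(req, c);      /* copy the NVMe command into the request */

        /* ... and map the user buffer; this allocates the bio that later shows
         * up in bio_associate_blkg() and bio_poll() */
        ret = blk_rq_map_user(q, req, NULL, ubuf, len, GFP_KERNEL);
        if (ret) {
                blk_mq_free_request(req);
                return ret;
        }

        /* no I/O scheduler or merging involved: dispatch straight to the nvme
         * hardware queue; the end_io callback later posts the io_uring CQE */
        req->end_io_data = ioucmd;
        req->end_io = nvme_uring_cmd_end_io;
        blk_execute_rq_nowait(req, false);
        return -EIOCBQUEUED;
}

If that is roughly right, then what is left of the block layer on this path is
mostly the request allocation, the user-page mapping and the blkg/stat hooks,
which seems to match the profile quoted below.
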
>>>> In our kernel config there are no active q->stats->callbacks, but we still
>>>> see this overhead.
>>>>
>>>> 2. 0.97% io_uring [kernel.vmlinux] [k] bio_associate_blkg_from_css
>>>> 0.85% io_uring [kernel.vmlinux] [k] bio_associate_blkg
>>>> 0.74% io_uring [kernel.vmlinux] [k] blkg_lookup_create
>>>> The nvme passthrough feature dispatches nvme commands to the nvme
>>>> controller directly, so it should be able to get rid of these overheads.
>>>>
>>>> 3. 3.19% io_uring [kernel.vmlinux] [k] __rcu_read_unlock
>>>> 2.65% io_uring [kernel.vmlinux] [k] __rcu_read_lock
>>>> Frequent rcu_read_lock/unlock overhead; not sure whether we can improve this a bit.
>>>>
>>>> 4. 7.90% io_uring [nvme] [k] nvme_poll
>>>> 3.59% io_uring [nvme_core] [k] nvme_ns_chr_uring_cmd_iopoll
>>>> 2.63% io_uring [kernel.vmlinux] [k] blk_mq_poll_classic
>>>> 1.88% io_uring [nvme] [k] nvme_poll_cq
>>>> 1.74% io_uring [kernel.vmlinux] [k] bio_poll
>>>> 1.89% io_uring [kernel.vmlinux] [k] xas_load
>>>> 0.86% io_uring [kernel.vmlinux] [k] xas_start
>>>> 0.80% io_uring [kernel.vmlinux] [k] xas_start
>>>> Seems that the block poll operation call chain is somewhat deep, also
>>> It's not really that deep, though the xarray lookups are unfortunate.
>>>
>>> And if you were to remove the block layer, it looks like you'd end up just shifting
>>> the CPU utilization to a different polling function without increasing IOPs.
>>> Your hardware doesn't look fast enough for this software overhead to be a
>>> concern.
>> No, I'm afraid I don't agree with you here, sorry. Real products (unlike the
>> t/io_uring tool, which just polls the block layer once IOs are issued) have
>> plenty of other work to do, such as network processing. If we can cut the nvme
>> passthrough overhead further, the saved CPU can be used for other useful work.
> You initiated this thread with supposedly underwhelming IOPS improvements from
> the io engine, but now you've shifted your criteria.
Sorry, but how did you come to the conclusion that I have shifted my criteria?
I'm not a native English speaker and may not have expressed my thoughts clearly.
I also forgot to mention that in real products, one CPU may indeed manage more
than one nvme SSD (the software is pinned to its CPU with taskset), so I think
the software overhead would be a concern.
No offense at all; I initiated this thread just to discuss whether we can
improve nvme passthrough performance further. For myself, I also need to
understand the nvme code more.
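For example, here is my current picture of the polled completion path for one
passthrough command, paraphrased and simplified from
nvme_ns_chr_uring_cmd_iopoll() in a recent tree, so some details may be off;
please point out anything I have gotten wrong:

static int nvme_uring_cmd_iopoll_sketch(struct io_uring_cmd *ioucmd,
                                        struct io_comp_batch *iob,
                                        unsigned int poll_flags)
{
        struct nvme_ns *ns = container_of(file_inode(ioucmd->file)->i_cdev,
                                          struct nvme_ns, cdev);
        struct request_queue *q = ns->queue;
        struct bio *bio;
        int ret = 0;

        rcu_read_lock();                        /* part of the rcu hits above */
        bio = READ_ONCE(ioucmd->cookie);        /* bio stashed at submission time */
        if (test_bit(QUEUE_FLAG_POLL, &q->queue_flags) && bio && bio->bi_bdev)
                /*
                 * bio_poll() -> blk_mq_poll_classic() looks up the hctx via
                 * xa_load() on q->hctx_table (the xas_load/xas_start samples),
                 * then calls q->mq_ops->poll == nvme_poll() -> nvme_poll_cq(),
                 * which reaps CQEs and completes the request.
                 */
                ret = bio_poll(bio, iob, poll_flags);
        rcu_read_unlock();
        return ret;
}

That is where I got the impression that the chain from the io_uring iopoll loop
down to nvme_poll_cq() is somewhat deep for a passthrough command.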
>
> You can always turn off the kernel's stats and cgroups if you don't find them
> useful.
Taking cgroups as an example, do you mean disabling CONFIG_BLK_CGROUP?
I'm not sure that will work: a physical machine may have many disk drives,
and the other drives may still need blkcg.
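Also, if I read block/blk-cgroup.c correctly, the bio_associate_blkg()/
blkg_lookup_create() samples in the profile come from the per-bio blkg
association that is done for every bio bound to a block device once
CONFIG_BLK_CGROUP is built in, whether or not any blkcg policy is attached to
that queue. A rough paraphrase, not the exact kernel code:

void bio_associate_blkg(struct bio *bio)        /* called from bio_init() when
                                                 * the bio has a bdev */
{
        struct cgroup_subsys_state *css;

        rcu_read_lock();                        /* more of the rcu overhead */
        css = blkcg_css();                      /* current task's blkcg css */
        bio_associate_blkg_from_css(bio, css);  /* -> blkg_tryget_closest()
                                                 *    -> blkg_lookup_create() */
        rcu_read_unlock();
}

So rebuilding without CONFIG_BLK_CGROUP would indeed avoid it, but as said
above that is hard to do when other drives on the same machine still rely on
blkcg.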
Regards,
Xiaoguang Wang
Thread overview: 7+ messages
[not found] <CGME20230426132010epcas5p4ad551f7bdebd6841e2004ba47ab468b3@epcas5p4.samsung.com>
2023-04-26 13:19 ` another nvme passthrough design based on nvme hardware queue file abstraction Xiaoguang Wang
2023-04-26 13:59 ` Kanchan Joshi
2023-04-27 11:00 ` Xiaoguang Wang
2023-04-26 14:12 ` Keith Busch
2023-04-27 12:17 ` Xiaoguang Wang
2023-04-27 15:03 ` Keith Busch
2023-04-28 2:42 ` Xiaoguang Wang [this message]