Re: another nvme pssthrough design based on nvme hardware queue file abstraction

public inbox for [email protected]
 help / color / mirror / Atom feed

From: Xiaoguang Wang <[email protected]>
To: Keith Busch <[email protected]>
Cc: [email protected], [email protected],
	Christoph Hellwig <[email protected]>, Jens Axboe <[email protected]>
Subject: Re: another nvme pssthrough design based on nvme hardware queue file abstraction
Date: Fri, 28 Apr 2023 10:42:32 +0800	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

hi,

> On Thu, Apr 27, 2023 at 08:17:30PM +0800, Xiaoguang Wang wrote:
>>> On Wed, Apr 26, 2023 at 09:19:57PM +0800, Xiaoguang Wang wrote:
>>>> hi all,
>>>>
>>>> Recently we start to test nvme passthrough feature, which is based on io_uring. Originally we
>>>> thought its performance would be much better than normal polled nvme test, but test results
>>>> show that it's not:
>>>> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O0 -n1  -u1 /dev/ng1n1
>>>> IOPS=891.49K, BW=435MiB/s, IOS/call=32/31
>>>> IOPS=891.07K, BW=435MiB/s, IOS/call=31/31
>>>>
>>>> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O1 -n1 /dev/nvme1n1
>>>> IOPS=807.81K, BW=394MiB/s, IOS/call=32/31
>>>> IOPS=808.13K, BW=394MiB/s, IOS/call=32/32
>>>>
>>>> about 10% iops improvement, I'm not saying its not good, just had thought it should
>>>> perform much better.
>>> What did you think it should be? What is the maximum 512b read IOPs your device
>>> is capable of producing?
>> From the naming of this feature, I thought it would bypass blocker thoroughly, hence
>> would gain much higher performance, for myself, if this feature can improves 25% higher
>> or more, that would be much more attractive, and users would like to try it. Again, I'm
>> not saying this feature is not good, just thought it would perform much better for small io.
> It does bypass the block layer. The driver just uses library functions provided
> by the block layer for things it doesn't want to duplicate. Reimplementing that
> functionality in driver isn't going to improve anything.
>
>>>> In our kernel config, no active q->stats->callbacks, but still has this overhead.
>>>>
>>>> 2. 0.97%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg_from_css
>>>>     0.85%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg
>>>>     0.74%  io_uring  [kernel.vmlinux]  [k] blkg_lookup_create
>>>> For nvme passthrough feature, it tries to dispatch nvme commands to nvme
>>>> controller directly, so should get rid of these overheads.
>>>>
>>>> 3. 3.19%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
>>>>     2.65%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock
>>>> Frequent rcu_read_lock/unlcok overheads, not sure whether we can improve a bit.
>>>>
>>>> 4. 7.90%  io_uring  [nvme]            [k] nvme_poll
>>>>     3.59%  io_uring  [nvme_core]       [k] nvme_ns_chr_uring_cmd_iopoll
>>>>     2.63%  io_uring  [kernel.vmlinux]  [k] blk_mq_poll_classic
>>>>     1.88%  io_uring  [nvme]            [k] nvme_poll_cq
>>>>     1.74%  io_uring  [kernel.vmlinux]  [k] bio_poll
>>>>     1.89%  io_uring  [kernel.vmlinux]  [k] xas_load
>>>>     0.86%  io_uring  [kernel.vmlinux]  [k] xas_start
>>>>     0.80%  io_uring  [kernel.vmlinux]  [k] xas_start
>>>> Seems that the block poll operation call chain is somewhat deep, also
>>> It's not really that deep, though the xarray lookups are unfortunate.
>>>
>>> And if you were to remove block layer, it looks like you'd end up just shifting
>>> the CPU utilization to a different polling function without increasing IOPs.
>>> Your hardware doesn't look fast enough for this software overhead to be a
>>> concern.
>> No, I may not agree with you here, sorry. Real products(not like t/io_uring tools,
>> which just polls block layer when ios are issued) will have many other work
>> to run, such as network work. If we can cut the nvme passthrough overhead more,
>> saved cpu will use to do other useful work.
> You initiated this thread with supposed underwhelming IOPs improvements from
> the io engine, but now you've shifted your criteria.
Sorry, but how did you come to this conclusion that I have shifted my criteria...
I'm not a native english speaker, may not express my thoughts clearly. And
I forgot to mention that indeed in real products, they may manage more than
one nvme ssd with one cpu(software is taskseted to corresponding cpu), so
I think software overhead would be a concern.

No offense at all, I initiated this thread just to discuss whether we can improve
nvme passthrough performance more. For myself, also need to understand
nvme codes more.
>
> You can always turn off the kernel's stats and cgroups if you don't find them
> useful.
In example of cgroups, do you mean disable CONFIG_BLK_CGROUP?
I'm not sure it will work, a physical machine may have many disk drives,
others drives may need blkcg.

Regards,
Xiaoguang Wang

     prev parent reply	other threads:[~2023-04-28  2:42 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <CGME20230426132010epcas5p4ad551f7bdebd6841e2004ba47ab468b3@epcas5p4.samsung.com>
2023-04-26 13:19 ` another nvme pssthrough design based on nvme hardware queue file abstraction Xiaoguang Wang
2023-04-26 13:59   ` Kanchan Joshi
2023-04-27 11:00     ` Xiaoguang Wang
2023-04-26 14:12   ` Keith Busch
2023-04-27 12:17     ` Xiaoguang Wang
2023-04-27 15:03       ` Keith Busch
2023-04-28  2:42         ` Xiaoguang Wang [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4ab4b11f-d775-1c75-127a-c62535de897b@linux.alibaba.com \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox