public inbox for [email protected]
From: Keith Busch <[email protected]>
To: Xiaoguang Wang <[email protected]>
Cc: [email protected], [email protected],
	Christoph Hellwig <[email protected]>, Jens Axboe <[email protected]>
Subject: Re: another nvme passthrough design based on nvme hardware queue file abstraction
Date: Thu, 27 Apr 2023 09:03:39 -0600	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

On Thu, Apr 27, 2023 at 08:17:30PM +0800, Xiaoguang Wang wrote:
> > On Wed, Apr 26, 2023 at 09:19:57PM +0800, Xiaoguang Wang wrote:
> >> hi all,
> >>
> >> Recently we started testing the nvme passthrough feature, which is based on io_uring. Originally we
> >> thought its performance would be much better than a normal polled nvme test, but the test results
> >> show that it's not:
> >> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O0 -n1  -u1 /dev/ng1n1
> >> IOPS=891.49K, BW=435MiB/s, IOS/call=32/31
> >> IOPS=891.07K, BW=435MiB/s, IOS/call=31/31
> >>
> >> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O1 -n1 /dev/nvme1n1
> >> IOPS=807.81K, BW=394MiB/s, IOS/call=32/31
> >> IOPS=808.13K, BW=394MiB/s, IOS/call=32/32
> >>
> >> about a 10% IOPS improvement. I'm not saying it's not good, I just thought it would
> >> perform much better.
> > What did you think it should be? What is the maximum 512b read IOPs your device
> > is capable of producing?
> From the naming of this feature, I thought it would bypass the block layer entirely and hence
> gain much higher performance. For me, if this feature could improve performance by 25% or
> more, it would be much more attractive and users would want to try it. Again, I'm
> not saying this feature is not good, I just thought it would perform much better for small IO.

It does bypass the block layer. The driver just uses library functions provided
by the block layer for things it doesn't want to duplicate. Reimplementing that
functionality in the driver isn't going to improve anything.
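
To make the comparison concrete, the -u1 passthrough path in t/io_uring boils
down to submitting an IORING_OP_URING_CMD SQE that carries a struct
nvme_uring_cmd (NVME_URING_CMD_IO) to the /dev/ng char device. A rough,
untested sketch of a single 512b read follows; the namespace id, LBA, and
missing error handling are simplifying assumptions, and it needs a 5.19+
kernel plus matching liburing/uapi headers:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct nvme_uring_cmd *cmd;
        void *buf;

        /* Passthrough commands need big SQEs; t/io_uring sets big CQEs too. */
        if (io_uring_queue_init(8, &ring,
                                IORING_SETUP_SQE128 | IORING_SETUP_CQE32))
                return 1;

        int fd = open("/dev/ng1n1", O_RDONLY);  /* char device, not the block device */
        if (fd < 0 || posix_memalign(&buf, 4096, 512))
                return 1;

        sqe = io_uring_get_sqe(&ring);
        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_URING_CMD;
        sqe->fd = fd;
        sqe->cmd_op = NVME_URING_CMD_IO;        /* passthrough I/O command */
        sqe->user_data = 1;

        /* The NVMe command itself lives in the extended part of the SQE. */
        cmd = (struct nvme_uring_cmd *)sqe->cmd;
        memset(cmd, 0, sizeof(*cmd));
        cmd->opcode = 0x02;                     /* NVMe Read */
        cmd->nsid = 1;                          /* assumed namespace id for ng1n1 */
        cmd->addr = (__u64)(uintptr_t)buf;
        cmd->data_len = 512;
        cmd->cdw10 = 0;                         /* starting LBA, low 32 bits (assumed 0) */
        cmd->cdw11 = 0;                         /* starting LBA, high 32 bits */
        cmd->cdw12 = 0;                         /* NLB is 0-based: 0 means one block */

        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);
        printf("res %d\n", cqe->res);           /* 0 on success */
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
}

Everything the block layer contributes to your profile happens between that
submit and the driver turning the command into a hardware SQE; those are the
library functions mentioned above (request allocation, mapping the user
buffer, polling plumbing), not a second full block-layer I/O path.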

> >> In our kernel config there are no active q->stats->callbacks, but this overhead is still there.
> >>
> >> 2. 0.97%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg_from_css
> >>     0.85%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg
> >>     0.74%  io_uring  [kernel.vmlinux]  [k] blkg_lookup_create
> >> The nvme passthrough feature is supposed to dispatch nvme commands to the nvme
> >> controller directly, so it should get rid of these overheads.
> >>
> >> 3. 3.19%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
> >>     2.65%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock
> >> Frequent rcu_read_lock/unlock overhead; not sure whether we can improve this a bit.
> >>
> >> 4. 7.90%  io_uring  [nvme]            [k] nvme_poll
> >>     3.59%  io_uring  [nvme_core]       [k] nvme_ns_chr_uring_cmd_iopoll
> >>     2.63%  io_uring  [kernel.vmlinux]  [k] blk_mq_poll_classic
> >>     1.88%  io_uring  [nvme]            [k] nvme_poll_cq
> >>     1.74%  io_uring  [kernel.vmlinux]  [k] bio_poll
> >>     1.89%  io_uring  [kernel.vmlinux]  [k] xas_load
> >>     0.86%  io_uring  [kernel.vmlinux]  [k] xas_start
> >>     0.80%  io_uring  [kernel.vmlinux]  [k] xas_start
> >> Seems that the block poll operation call chain is somewhat deep, also
> > It's not really that deep, though the xarray lookups are unfortunate.
> >
> > And if you were to remove the block layer, it looks like you'd end up just shifting
> > the CPU utilization to a different polling function without increasing IOPs.
> > Your hardware doesn't look fast enough for this software overhead to be a
> > concern.
> No, I may not agree with you here, sorry. Real products (unlike the t/io_uring tool,
> which just polls the block layer once IOs are issued) have a lot of other work
> to do, such as network processing. If we can cut the nvme passthrough overhead further,
> the saved CPU can be used to do other useful work.

You initiated this thread citing supposedly underwhelming IOPS improvements from
the io engine, but now you've shifted your criteria.

You can always turn off the kernel's stats and cgroups if you don't find them
useful.
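
To be concrete about the first part: the generic per-queue I/O accounting can
be flipped off at runtime through the queue's iostats sysfs attribute, and the
blkg association you profiled only exists when CONFIG_BLK_CGROUP is built in
(the io controller can also be left out at boot with cgroup_disable=io). A
rough sketch, assuming the nvme1n1 name from your test and doing the
equivalent of "echo 0 > /sys/block/nvme1n1/queue/iostats":

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Device name is an assumption taken from the earlier test command. */
        const char *knob = "/sys/block/nvme1n1/queue/iostats";
        int fd = open(knob, O_WRONLY);

        if (fd < 0) {
                perror(knob);
                return 1;
        }
        /* "0" disables per-I/O diskstats accounting for this queue. */
        if (write(fd, "0", 1) != 1) {
                perror("write");
                close(fd);
                return 1;
        }
        close(fd);
        return 0;
}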

Thread overview: 7+ messages
     [not found] <CGME20230426132010epcas5p4ad551f7bdebd6841e2004ba47ab468b3@epcas5p4.samsung.com>
2023-04-26 13:19 ` another nvme passthrough design based on nvme hardware queue file abstraction Xiaoguang Wang
2023-04-26 13:59   ` Kanchan Joshi
2023-04-27 11:00     ` Xiaoguang Wang
2023-04-26 14:12   ` Keith Busch
2023-04-27 12:17     ` Xiaoguang Wang
2023-04-27 15:03       ` Keith Busch [this message]
2023-04-28  2:42         ` Xiaoguang Wang
