public inbox for [email protected]
* another nvme passthrough design based on nvme hardware queue file abstraction
@ 2023-04-26 13:19 ` Xiaoguang Wang
  2023-04-26 13:59   ` Kanchan Joshi
  2023-04-26 14:12   ` Keith Busch
  0 siblings, 2 replies; 7+ messages in thread
From: Xiaoguang Wang @ 2023-04-26 13:19 UTC (permalink / raw)
  To: linux-block, io-uring; +Cc: Christoph Hellwig, Jens Axboe

hi all,

Recently we started to test the nvme passthrough feature, which is based on io_uring. Originally we
thought its performance would be much better than a normal polled nvme test, but the test results
show that it's not:
$ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O0 -n1  -u1 /dev/ng1n1
IOPS=891.49K, BW=435MiB/s, IOS/call=32/31
IOPS=891.07K, BW=435MiB/s, IOS/call=31/31

$ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O1 -n1 /dev/nvme1n1
IOPS=807.81K, BW=394MiB/s, IOS/call=32/31
IOPS=808.13K, BW=394MiB/s, IOS/call=32/32

That is about a 10% iops improvement. I'm not saying it's not good, I just had thought it would
perform much better. After reading the code, I find that this nvme passthrough feature
is still based on blk-mq; using the perf tool to analyse it, there are some block layer
overheads that seem somewhat big:
1. 2.48%  io_uring  [kernel.vmlinux]  [k] blk_stat_add
In our kernel config there are no active q->stats->callbacks, but this overhead is still there.

2. 0.97%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg_from_css
    0.85%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg
    0.74%  io_uring  [kernel.vmlinux]  [k] blkg_lookup_create
The nvme passthrough feature dispatches nvme commands to the nvme
controller directly, so it should be able to get rid of these overheads.

3. 3.19%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
    2.65%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock
Frequent rcu_read_lock/unlock overheads; not sure whether we can improve this a bit.

4. 7.90%  io_uring  [nvme]            [k] nvme_poll
    3.59%  io_uring  [nvme_core]       [k] nvme_ns_chr_uring_cmd_iopoll
    2.63%  io_uring  [kernel.vmlinux]  [k] blk_mq_poll_classic
    1.88%  io_uring  [nvme]            [k] nvme_poll_cq
    1.74%  io_uring  [kernel.vmlinux]  [k] bio_poll
    1.89%  io_uring  [kernel.vmlinux]  [k] xas_load
    0.86%  io_uring  [kernel.vmlinux]  [k] xas_start
    0.80%  io_uring  [kernel.vmlinux]  [k] xas_start
The block poll operation call chain seems somewhat deep; again, not sure
whether we can improve it a bit. The xas overheads also look big; they were
introduced by https://lore.kernel.org/all/[email protected]/,
which fixed a use-after-free bug.

5. Other block layer overheads I haven't spent time looking into yet.

Some of our clients are interested in the nvme passthrough feature. They access
nvme devices with open(2) and read(2)/write(2) on the nvme device files directly, bypassing the
filesystem, so they'd like to try the nvme passthrough feature to gain higher iops, but
currently the performance is not that good. And they don't want to use spdk yet; they
still try to build fast storage on the linux kernel io stack, for various reasons :)

So I'd like to propose a new nvme passthrough design here, which may improve
performance a lot. These are just rough design ideas; I have not started to code yet.
  1. There are three types of nvme hardware queues, "default", "write" and "poll";
currently all of these queues are visible to the block layer, and blk-mq maps them
properly. This new design will add two new types of nvme hardware queues, named
"user_irq" and "user_poll", which will need two new nvme module parameters,
similar to the current "write_queues" and "poll_queues".
  2. "user_irq" and "user_poll" queues will not be visible to block layer, and will create
corresponding char device file for them,  that means nvme hardware queues will be
abstracted as linux file, not sure whether to support read_iter or write_iter, but
uring_cmd() interface will be supported firstly. user_irq queue will still have irq, user_poll
queue will support poll.
  3. Finally, the data flow will look like the following, taking the user_irq queue as an example:
io issue: io_uring uring_cmd >> prep the nvme command in the char device's uring_cmd() >> submit to nvme.
io reap: find the io_uring request by nvme command id, and call uring_cmd_done for it.
Yes, we need to build an association between the io_uring request and the nvme command id
(a rough sketch of this association follows the list).
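
To make that association concrete, here is a minimal sketch (all names are hypothetical,
this is not existing kernel code): each user queue could keep a table, indexed by nvme
command id, that points back to the owning io_uring command.

#include <linux/io_uring.h>
#include <linux/types.h>

/*
 * Hypothetical sketch of the io_uring request <-> nvme command id association.
 * The table is sized to the queue depth; command ids are allocated per queue.
 */
struct user_queue {
	u16 q_depth;
	struct io_uring_cmd **cmds;	/* indexed by nvme command id */
};

/* io issue: remember which uring_cmd owns this command id */
static void uq_track(struct user_queue *uq, u16 cid, struct io_uring_cmd *cmd)
{
	uq->cmds[cid] = cmd;
}

/* io reap: a cqe carrying this command id completes the associated uring_cmd */
static struct io_uring_cmd *uq_untrack(struct user_queue *uq, u16 cid)
{
	struct io_uring_cmd *cmd = uq->cmds[cid];

	uq->cmds[cid] = NULL;
	return cmd;
}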

Possible advantages:
1. Bypass the block layer entirely.
2. Since we now have a file abstraction, it can support the mmap operation: we can mmap an
nvme hardware queue's cqes to user space and then implement a much more efficient
poll. We could run nvme_cqe_pending()'s logic in user space, and only when it shows that nvme
requests have completed would we enter the kernel to reap them (a user-space sketch of this
check follows the list). As I said before, the current kernel poll chain looks deep; with this
method we could eliminate many useless iopoll operations. In my t/io_uring tests, the bpftrace
script below shows that half of the iopoll operations are useless:
BEGIN
{
    // @a counts nvme_poll calls that found no completions (wasted polls),
    // @b counts calls that reaped at least one completion.
    @a = 0;
    @b = 0;
}

kretprobe:nvme_poll
{
    if (retval == 0) {
        @a++;
    } else {
        @b++;
    }
}

3. With a file based hardware queue abstraction, we may implement various qos
strategies in user space based on queue depth control, or even more flexible control: user
space apps could map cpus to hardware queues arbitrarily, unlike the current blk-mq,
which has a fixed mapping strategy.
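
As a user-space sketch of the check mentioned in point 2, assuming the completion queue has
been mmap()ed read-only (the structure layout follows the NVMe spec; the mmap interface
itself is part of this proposal and does not exist today):

#include <stdbool.h>
#include <stdint.h>

/* 16-byte NVMe completion queue entry, laid out per the NVMe specification. */
struct nvme_cqe {
	uint32_t result;
	uint32_t rsvd;
	uint16_t sq_head;
	uint16_t sq_id;
	uint16_t command_id;
	uint16_t status;	/* bit 0 is the phase tag */
};

/*
 * True when the entry at 'head' has been written by the controller, i.e. its
 * phase tag matches the phase expected for the current pass over the queue.
 * Only when this returns true would the application enter the kernel to reap.
 */
static bool user_cqe_pending(const volatile struct nvme_cqe *cq,
			     uint16_t head, uint8_t phase)
{
	return (cq[head].status & 1) == phase;
}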

Finally, as I said before, these are currently just rough ideas, and there will definitely be
overlapping functionality with blk-mq; at the very least this new char device needs
to map user-space addresses to pages, so that nvme sgls or prps can be set up properly.
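
For that last step, here is a minimal sketch of the page mapping (an assumption about how it
could be done, using the existing pin_user_pages_fast() helper; error handling and unpinning
are elided):

#include <linux/kernel.h>
#include <linux/mm.h>

/* Pin the user buffer so that prp/sgl entries can reference its pages. */
static int pin_user_buf(unsigned long uaddr, size_t len, struct page **pages)
{
	int nr_pages = DIV_ROUND_UP((uaddr & ~PAGE_MASK) + len, PAGE_SIZE);

	/* FOLL_WRITE because the device may DMA into the buffer on reads. */
	return pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
}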

Any feedback is welcome, thanks.

Regards,
Xiaoguang Wang


* Re: another nvme passthrough design based on nvme hardware queue file abstraction
  2023-04-26 13:19 ` another nvme passthrough design based on nvme hardware queue file abstraction Xiaoguang Wang
@ 2023-04-26 13:59   ` Kanchan Joshi
  2023-04-27 11:00     ` Xiaoguang Wang
  2023-04-26 14:12   ` Keith Busch
  1 sibling, 1 reply; 7+ messages in thread
From: Kanchan Joshi @ 2023-04-26 13:59 UTC (permalink / raw)
  To: Xiaoguang Wang; +Cc: linux-block, io-uring, Christoph Hellwig, Jens Axboe


On Wed, Apr 26, 2023 at 09:19:57PM +0800, Xiaoguang Wang wrote:

Good to see this.
So I have a prototype that tries to address some of the overheads you
mentioned. This was briefly discussed here [1], as a precursor to LSFMM.

The PoC is nearly in shape. I should be able to post it this week.

[1] fourth point at
https://lore.kernel.org/linux-nvme/[email protected]/






* Re: another nvme passthrough design based on nvme hardware queue file abstraction
  2023-04-26 13:19 ` another nvme passthrough design based on nvme hardware queue file abstraction Xiaoguang Wang
  2023-04-26 13:59   ` Kanchan Joshi
@ 2023-04-26 14:12   ` Keith Busch
  2023-04-27 12:17     ` Xiaoguang Wang
  1 sibling, 1 reply; 7+ messages in thread
From: Keith Busch @ 2023-04-26 14:12 UTC (permalink / raw)
  To: Xiaoguang Wang; +Cc: linux-block, io-uring, Christoph Hellwig, Jens Axboe

On Wed, Apr 26, 2023 at 09:19:57PM +0800, Xiaoguang Wang wrote:
> hi all,
> 
> Recently we start to test nvme passthrough feature, which is based on io_uring. Originally we
> thought its performance would be much better than normal polled nvme test, but test results
> show that it's not:
> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O0 -n1  -u1 /dev/ng1n1
> IOPS=891.49K, BW=435MiB/s, IOS/call=32/31
> IOPS=891.07K, BW=435MiB/s, IOS/call=31/31
> 
> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O1 -n1 /dev/nvme1n1
> IOPS=807.81K, BW=394MiB/s, IOS/call=32/31
> IOPS=808.13K, BW=394MiB/s, IOS/call=32/32
> 
> about 10% iops improvement, I'm not saying its not good, just had thought it should
> perform much better.

What did you think it should be? What is the maximum 512b read IOPs your device
is capable of producing?

> After reading codes, I finds that this nvme passthrough feature
> is still based on blk-mq, use perf tool to analyse and there are some block layer
> overheads that seems somewhat big:
> 1. 2.48%  io_uring  [kernel.vmlinux]  [k] blk_stat_add
> In our kernel config, no active q->stats->callbacks, but still has this overhead.
> 
> 2. 0.97%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg_from_css
>     0.85%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg
>     0.74%  io_uring  [kernel.vmlinux]  [k] blkg_lookup_create
> For nvme passthrough feature, it tries to dispatch nvme commands to nvme
> controller directly, so should get rid of these overheads.
> 
> 3. 3.19%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
>     2.65%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock
> Frequent rcu_read_lock/unlcok overheads, not sure whether we can improve a bit.
> 
> 4. 7.90%  io_uring  [nvme]            [k] nvme_poll
>     3.59%  io_uring  [nvme_core]       [k] nvme_ns_chr_uring_cmd_iopoll
>     2.63%  io_uring  [kernel.vmlinux]  [k] blk_mq_poll_classic
>     1.88%  io_uring  [nvme]            [k] nvme_poll_cq
>     1.74%  io_uring  [kernel.vmlinux]  [k] bio_poll
>     1.89%  io_uring  [kernel.vmlinux]  [k] xas_load
>     0.86%  io_uring  [kernel.vmlinux]  [k] xas_start
>     0.80%  io_uring  [kernel.vmlinux]  [k] xas_start
> Seems that the block poll operation call chain is somewhat deep, also

It's not really that deep, though the xarray lookups are unfortunate.

And if you were to remove block layer, it looks like you'd end up just shifting
the CPU utilization to a different polling function without increasing IOPs.
Your hardware doesn't look fast enough for this software overhead to be a
concern.

> not sure whether we can improve it a bit, and the xas overheads also
> looks big, it's introduced by https://lore.kernel.org/all/[email protected]/
> which fixed one use-after-free bug.
> 
> 5. other blocker overhead I don't spend time to look into.
> 
> Some of our clients are interested in nvme passthrough feature, they visit
> nvme devices by open(2) and read(2)/write(2) nvme device files directly, bypass
> filesystem, so they'd like to try nvme passthrough feature, to gain bigger iops, but
> currenty performance seems not that good. And they don't want to use spdk yet,
> still try to build fast storage based on linux kernel io stack for various reasons  :)
> 
> So I'd like to propose a new nvme passthrough design here, which may improve
> performance a lot. Here are just rough design ideas here, not start to code yet.
>   1. There are three types of nvme hardware queues, "default", "write" and "poll",
> currently all these queues are visible to block layer, blk-mq will map these queues
> properly.  Here this new design will add two new nvme hardware queues, name them
> "user_irq" and "user_poll" queues, which will need to add two nvme module parameters,
> similar to current "write_queues" and "poll_queues".
>   2. "user_irq" and "user_poll" queues will not be visible to block layer, and will create
> corresponding char device file for them,  that means nvme hardware queues will be
> abstracted as linux file, not sure whether to support read_iter or write_iter, but
> uring_cmd() interface will be supported firstly. user_irq queue will still have irq, user_poll
> queue will support poll.
>   3. Finally the data flow will look like below in example of user_irq queue:
> io issue: io_uring  uring_cmd >> prep nvme command in its char device's uring_cmd() >> submit to nvme.
> io reap: find io_uring request by nvme command id, and call uring_cmd_done for it.
> Yeah, need to build association between io_uring request and nvme command id.
> 
> Possible advantages:
> 1. Bypass block layer thoroughly.

blk-mq has common solutions that we don't want to duplicate in driver. It
provides safe access to shared tags across multiple processes, ensures queue
live-ness during a controller reset, tracks commands for timeouts, among other
things.


* Re: another nvme passthrough design based on nvme hardware queue file abstraction
  2023-04-26 13:59   ` Kanchan Joshi
@ 2023-04-27 11:00     ` Xiaoguang Wang
  0 siblings, 0 replies; 7+ messages in thread
From: Xiaoguang Wang @ 2023-04-27 11:00 UTC (permalink / raw)
  To: Kanchan Joshi; +Cc: linux-block, io-uring, Christoph Hellwig, Jens Axboe

hi,

> On Wed, Apr 26, 2023 at 09:19:57PM +0800, Xiaoguang Wang wrote:
>
> Good to see this.
> So I have a prototype that tries to address some of the overheads you
> mentioned. This was briefly discussed here [1], as a precursor to LSFMM.
Cool, and I'll go through your discussions later, thanks.

Regards,
Xiaoguang Wang
>
> PoC is nearly in shape. I should be able to post in this week.
>
> [1] fourth point at
> https://lore.kernel.org/linux-nvme/[email protected]/
>
>



* Re: another nvme passthrough design based on nvme hardware queue file abstraction
  2023-04-26 14:12   ` Keith Busch
@ 2023-04-27 12:17     ` Xiaoguang Wang
  2023-04-27 15:03       ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: Xiaoguang Wang @ 2023-04-27 12:17 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, io-uring, Christoph Hellwig, Jens Axboe

hi,

> On Wed, Apr 26, 2023 at 09:19:57PM +0800, Xiaoguang Wang wrote:
>> hi all,
>>
>> Recently we start to test nvme passthrough feature, which is based on io_uring. Originally we
>> thought its performance would be much better than normal polled nvme test, but test results
>> show that it's not:
>> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O0 -n1  -u1 /dev/ng1n1
>> IOPS=891.49K, BW=435MiB/s, IOS/call=32/31
>> IOPS=891.07K, BW=435MiB/s, IOS/call=31/31
>>
>> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O1 -n1 /dev/nvme1n1
>> IOPS=807.81K, BW=394MiB/s, IOS/call=32/31
>> IOPS=808.13K, BW=394MiB/s, IOS/call=32/32
>>
>> about 10% iops improvement, I'm not saying its not good, just had thought it should
>> perform much better.
> What did you think it should be? What is the maximum 512b read IOPs your device
> is capable of producing?
From the naming of this feature, I thought it would bypass the block layer entirely and hence
would gain much higher performance. For myself, if this feature could improve iops by 25%
or more, that would be much more attractive, and users would like to try it. Again, I'm
not saying this feature is not good, I just thought it would perform much better for small io.

My test environment has one intel p4510 nvme ssd and one intel p4800x nvme ssd.
According to the specs, the p4510's random read iops is about 640000, and the p4800x's is 550000.
To maximize device performance, I do one discard before each test, that is,
sudo blkdiscard /dev/nvme0n1 or /dev/nvme1n1.

In 6.3.0-rc2, taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O0 -n1  -u1 /dev/ng1n1
shows 890k iops. I modified the code a bit to get rid of the blkcg overhead, and iops increases to 920k.
And if I benchmark /dev/ng0n1 and /dev/ng1n1 at the same time, the total iops is
about 1150k, which cannot utilize the maximum capacity of these two devices.

>
>> After reading codes, I finds that this nvme passthrough feature
>> is still based on blk-mq, use perf tool to analyse and there are some block layer
>> overheads that seems somewhat big:
>> 1. 2.48%  io_uring  [kernel.vmlinux]  [k] blk_stat_add
>> In our kernel config, no active q->stats->callbacks, but still has this overhead.
>>
>> 2. 0.97%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg_from_css
>>     0.85%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg
>>     0.74%  io_uring  [kernel.vmlinux]  [k] blkg_lookup_create
>> For nvme passthrough feature, it tries to dispatch nvme commands to nvme
>> controller directly, so should get rid of these overheads.
>>
>> 3. 3.19%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
>>     2.65%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock
>> Frequent rcu_read_lock/unlcok overheads, not sure whether we can improve a bit.
>>
>> 4. 7.90%  io_uring  [nvme]            [k] nvme_poll
>>     3.59%  io_uring  [nvme_core]       [k] nvme_ns_chr_uring_cmd_iopoll
>>     2.63%  io_uring  [kernel.vmlinux]  [k] blk_mq_poll_classic
>>     1.88%  io_uring  [nvme]            [k] nvme_poll_cq
>>     1.74%  io_uring  [kernel.vmlinux]  [k] bio_poll
>>     1.89%  io_uring  [kernel.vmlinux]  [k] xas_load
>>     0.86%  io_uring  [kernel.vmlinux]  [k] xas_start
>>     0.80%  io_uring  [kernel.vmlinux]  [k] xas_start
>> Seems that the block poll operation call chain is somewhat deep, also
> It's not really that deep, though the xarray lookups are unfortunate.
>
> And if you were to remove block layer, it looks like you'd end up just shifting
> the CPU utilization to a different polling function without increasing IOPs.
> Your hardware doesn't look fast enough for this software overhead to be a
> concern.
No, I may not agree with you here, sorry. Real products (unlike the t/io_uring tool,
which just polls the block layer once ios are issued) have much other work
to run, such as network work. If we can cut the nvme passthrough overhead further, the
saved cpu can be used to do other useful work.

For example, some products poll both storage and network; if we can finish the storage
poll quicker, we can poll the network earlier, which may reduce network
latency. As I said in the section below, if we can map nvme cqes to user space, we
can check whether an io has completed in user space, and only do the kernel block iopoll
when necessary.

>
>> not sure whether we can improve it a bit, and the xas overheads also
>> looks big, it's introduced by https://lore.kernel.org/all/[email protected]/
>> which fixed one use-after-free bug.
>>
>> 5. other blocker overhead I don't spend time to look into.
>>
>> Some of our clients are interested in nvme passthrough feature, they visit
>> nvme devices by open(2) and read(2)/write(2) nvme device files directly, bypass
>> filesystem, so they'd like to try nvme passthrough feature, to gain bigger iops, but
>> currenty performance seems not that good. And they don't want to use spdk yet,
>> still try to build fast storage based on linux kernel io stack for various reasons  :)
>>
>> So I'd like to propose a new nvme passthrough design here, which may improve
>> performance a lot. Here are just rough design ideas here, not start to code yet.
>>   1. There are three types of nvme hardware queues, "default", "write" and "poll",
>> currently all these queues are visible to block layer, blk-mq will map these queues
>> properly.  Here this new design will add two new nvme hardware queues, name them
>> "user_irq" and "user_poll" queues, which will need to add two nvme module parameters,
>> similar to current "write_queues" and "poll_queues".
>>   2. "user_irq" and "user_poll" queues will not be visible to block layer, and will create
>> corresponding char device file for them,  that means nvme hardware queues will be
>> abstracted as linux file, not sure whether to support read_iter or write_iter, but
>> uring_cmd() interface will be supported firstly. user_irq queue will still have irq, user_poll
>> queue will support poll.
>>   3. Finally the data flow will look like below in example of user_irq queue:
>> io issue: io_uring  uring_cmd >> prep nvme command in its char device's uring_cmd() >> submit to nvme.
>> io reap: find io_uring request by nvme command id, and call uring_cmd_done for it.
>> Yeah, need to build association between io_uring request and nvme command id.
>>
>> Possible advantages:
>> 1. Bypass block layer thoroughly.
> blk-mq has common solutions that we don't want to duplicate in driver. It
> provides safe access to shared tags across multiple processes, ensures queue
> live-ness during a controller reset, tracks commands for timeouts, among other
> things.
Yeah, I agree there will be some duplicated functionality with blk-mq. I have not
started a detailed design yet (will do later), but I think there may not be
much. I'd like to implement a prototype first for you to review, to see
what performance we can get. If the performance data is really impressive, I
think it may justify the duplication.

Regards,
Xiaoguang Wang



* Re: another nvme passthrough design based on nvme hardware queue file abstraction
  2023-04-27 12:17     ` Xiaoguang Wang
@ 2023-04-27 15:03       ` Keith Busch
  2023-04-28  2:42         ` Xiaoguang Wang
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2023-04-27 15:03 UTC (permalink / raw)
  To: Xiaoguang Wang; +Cc: linux-block, io-uring, Christoph Hellwig, Jens Axboe

On Thu, Apr 27, 2023 at 08:17:30PM +0800, Xiaoguang Wang wrote:
> > On Wed, Apr 26, 2023 at 09:19:57PM +0800, Xiaoguang Wang wrote:
> >> hi all,
> >>
> >> Recently we start to test nvme passthrough feature, which is based on io_uring. Originally we
> >> thought its performance would be much better than normal polled nvme test, but test results
> >> show that it's not:
> >> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O0 -n1  -u1 /dev/ng1n1
> >> IOPS=891.49K, BW=435MiB/s, IOS/call=32/31
> >> IOPS=891.07K, BW=435MiB/s, IOS/call=31/31
> >>
> >> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O1 -n1 /dev/nvme1n1
> >> IOPS=807.81K, BW=394MiB/s, IOS/call=32/31
> >> IOPS=808.13K, BW=394MiB/s, IOS/call=32/32
> >>
> >> about 10% iops improvement, I'm not saying its not good, just had thought it should
> >> perform much better.
> > What did you think it should be? What is the maximum 512b read IOPs your device
> > is capable of producing?
> From the naming of this feature, I thought it would bypass blocker thoroughly, hence
> would gain much higher performance, for myself, if this feature can improves 25% higher
> or more, that would be much more attractive, and users would like to try it. Again, I'm
> not saying this feature is not good, just thought it would perform much better for small io.

It does bypass the block layer. The driver just uses library functions provided
by the block layer for things it doesn't want to duplicate. Reimplementing that
functionality in driver isn't going to improve anything.

> >> In our kernel config, no active q->stats->callbacks, but still has this overhead.
> >>
> >> 2. 0.97%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg_from_css
> >>     0.85%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg
> >>     0.74%  io_uring  [kernel.vmlinux]  [k] blkg_lookup_create
> >> For nvme passthrough feature, it tries to dispatch nvme commands to nvme
> >> controller directly, so should get rid of these overheads.
> >>
> >> 3. 3.19%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
> >>     2.65%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock
> >> Frequent rcu_read_lock/unlcok overheads, not sure whether we can improve a bit.
> >>
> >> 4. 7.90%  io_uring  [nvme]            [k] nvme_poll
> >>     3.59%  io_uring  [nvme_core]       [k] nvme_ns_chr_uring_cmd_iopoll
> >>     2.63%  io_uring  [kernel.vmlinux]  [k] blk_mq_poll_classic
> >>     1.88%  io_uring  [nvme]            [k] nvme_poll_cq
> >>     1.74%  io_uring  [kernel.vmlinux]  [k] bio_poll
> >>     1.89%  io_uring  [kernel.vmlinux]  [k] xas_load
> >>     0.86%  io_uring  [kernel.vmlinux]  [k] xas_start
> >>     0.80%  io_uring  [kernel.vmlinux]  [k] xas_start
> >> Seems that the block poll operation call chain is somewhat deep, also
> > It's not really that deep, though the xarray lookups are unfortunate.
> >
> > And if you were to remove block layer, it looks like you'd end up just shifting
> > the CPU utilization to a different polling function without increasing IOPs.
> > Your hardware doesn't look fast enough for this software overhead to be a
> > concern.
> No, I may not agree with you here, sorry. Real products(not like t/io_uring tools,
> which just polls block layer when ios are issued) will have many other work
> to run, such as network work. If we can cut the nvme passthrough overhead more,
> saved cpu will use to do other useful work.

You initiated this thread with supposed underwhelming IOPs improvements from
the io engine, but now you've shifted your criteria.

You can always turn off the kernel's stats and cgroups if you don't find them
useful.


* Re: another nvme passthrough design based on nvme hardware queue file abstraction
  2023-04-27 15:03       ` Keith Busch
@ 2023-04-28  2:42         ` Xiaoguang Wang
  0 siblings, 0 replies; 7+ messages in thread
From: Xiaoguang Wang @ 2023-04-28  2:42 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-block, io-uring, Christoph Hellwig, Jens Axboe

hi,

> On Thu, Apr 27, 2023 at 08:17:30PM +0800, Xiaoguang Wang wrote:
>>> On Wed, Apr 26, 2023 at 09:19:57PM +0800, Xiaoguang Wang wrote:
>>>> hi all,
>>>>
>>>> Recently we start to test nvme passthrough feature, which is based on io_uring. Originally we
>>>> thought its performance would be much better than normal polled nvme test, but test results
>>>> show that it's not:
>>>> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O0 -n1  -u1 /dev/ng1n1
>>>> IOPS=891.49K, BW=435MiB/s, IOS/call=32/31
>>>> IOPS=891.07K, BW=435MiB/s, IOS/call=31/31
>>>>
>>>> $ sudo taskset -c 1 /home/feiman.wxg/fio/t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -O1 -n1 /dev/nvme1n1
>>>> IOPS=807.81K, BW=394MiB/s, IOS/call=32/31
>>>> IOPS=808.13K, BW=394MiB/s, IOS/call=32/32
>>>>
>>>> about 10% iops improvement, I'm not saying its not good, just had thought it should
>>>> perform much better.
>>> What did you think it should be? What is the maximum 512b read IOPs your device
>>> is capable of producing?
>> From the naming of this feature, I thought it would bypass blocker thoroughly, hence
>> would gain much higher performance, for myself, if this feature can improves 25% higher
>> or more, that would be much more attractive, and users would like to try it. Again, I'm
>> not saying this feature is not good, just thought it would perform much better for small io.
> It does bypass the block layer. The driver just uses library functions provided
> by the block layer for things it doesn't want to duplicate. Reimplementing that
> functionality in driver isn't going to improve anything.
>
>>>> In our kernel config, no active q->stats->callbacks, but still has this overhead.
>>>>
>>>> 2. 0.97%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg_from_css
>>>>     0.85%  io_uring  [kernel.vmlinux]  [k] bio_associate_blkg
>>>>     0.74%  io_uring  [kernel.vmlinux]  [k] blkg_lookup_create
>>>> For nvme passthrough feature, it tries to dispatch nvme commands to nvme
>>>> controller directly, so should get rid of these overheads.
>>>>
>>>> 3. 3.19%  io_uring  [kernel.vmlinux]  [k] __rcu_read_unlock
>>>>     2.65%  io_uring  [kernel.vmlinux]  [k] __rcu_read_lock
>>>> Frequent rcu_read_lock/unlcok overheads, not sure whether we can improve a bit.
>>>>
>>>> 4. 7.90%  io_uring  [nvme]            [k] nvme_poll
>>>>     3.59%  io_uring  [nvme_core]       [k] nvme_ns_chr_uring_cmd_iopoll
>>>>     2.63%  io_uring  [kernel.vmlinux]  [k] blk_mq_poll_classic
>>>>     1.88%  io_uring  [nvme]            [k] nvme_poll_cq
>>>>     1.74%  io_uring  [kernel.vmlinux]  [k] bio_poll
>>>>     1.89%  io_uring  [kernel.vmlinux]  [k] xas_load
>>>>     0.86%  io_uring  [kernel.vmlinux]  [k] xas_start
>>>>     0.80%  io_uring  [kernel.vmlinux]  [k] xas_start
>>>> Seems that the block poll operation call chain is somewhat deep, also
>>> It's not really that deep, though the xarray lookups are unfortunate.
>>>
>>> And if you were to remove block layer, it looks like you'd end up just shifting
>>> the CPU utilization to a different polling function without increasing IOPs.
>>> Your hardware doesn't look fast enough for this software overhead to be a
>>> concern.
>> No, I may not agree with you here, sorry. Real products(not like t/io_uring tools,
>> which just polls block layer when ios are issued) will have many other work
>> to run, such as network work. If we can cut the nvme passthrough overhead more,
>> saved cpu will use to do other useful work.
> You initiated this thread with supposed underwhelming IOPs improvements from
> the io engine, but now you've shifted your criteria.
Sorry, but how did you come to the conclusion that I have shifted my criteria...
I'm not a native English speaker and may not express my thoughts clearly. And
I forgot to mention that in real products, one cpu may indeed manage more than
one nvme ssd (the software is pinned to the corresponding cpu with taskset), so
I think the software overhead would be a concern.

No offense at all, I initiated this thread just to discuss whether we can improve
nvme passthrough performance further. For myself, I also need to understand the
nvme code more.
>
> You can always turn off the kernel's stats and cgroups if you don't find them
> useful.
Taking cgroups as an example, do you mean disabling CONFIG_BLK_CGROUP?
I'm not sure that would work: a physical machine may have many disk drives, and
the other drives may still need blkcg.

Regards,
Xiaoguang Wang




