public inbox for [email protected]
* [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
       [not found] <CGME20230210180226epcas5p1bd2e1150de067f8af61de2bbf571594d@epcas5p1.samsung.com>
@ 2023-02-10 18:00 ` Kanchan Joshi
  2023-02-10 18:18   ` Bart Van Assche
                     ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Kanchan Joshi @ 2023-02-10 18:00 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-block, linux-nvme, io-uring, axboe, hch, kbusch, ming.lei,
	Kanchan Joshi

is getting more common than it used to be.
NVMe is no longer tied to block storage. Command sets in the NVMe 2.0
spec opened up an excellent way to present non-block interfaces to the
host. ZNS and KV came along with it, and new command sets keep emerging.

OTOH, kernel IO advances have historically centered around the block IO
path. A passthrough IO path existed, but it stayed far behind those
advances, be it new features or performance.

Current state & discussion points:
---------------------------------
The status quo changed in the recent past with the new passthrough path
(the /dev/ng char interface + io_uring command). Feature parity does not
exist yet, but performance parity does.
Adoption brings feature requests with it. I propose a session that
covers a few of those voices and tries to find a path forward for some
of the ideas too.

1. Command cancellation: while support for the Abort command is
mandatory in NVMe, we do not have a way to trigger it from user space.
There are ways to go about it (with or without the uring-cancel
interface), but not without certain tradeoffs. It will be good to
discuss the choices in person.

2. Cgroups: this works only for block devices at the moment. Are there
outright objections to extending it to char-interface IO?

3. DMA cost: this is high in the presence of an IOMMU. Keith posted
work[1] for the block IO path last year. I imagine the plumbing gets a
bit simpler with passthrough-only support. But what else must be sorted
out to make progress on moving the DMA cost out of the fast path?

4. Direct NVMe queues: would there be interest in io_uring-managed NVMe
queues? Sort of a new ring, for which I/O is destaged from the io_uring
SQE to the NVMe SQE without having to go through intermediate
constructs (i.e., bio/request). Hopefully, that can further amp up IO
efficiency.

5. <anything else that might be of interest to folks>

I hope to send some code/PoC to discuss the stuff better.
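
(For reference, below is a minimal sketch of what the current path looks
like from user space: one 512b read submitted as an io_uring command on
a placeholder /dev/ng0n1 char node with nsid 1. It assumes a recent
kernel and liburing with big-SQE/CQE support and NVME_URING_CMD_IO,
needs suitable privileges, and trims all error handling.)

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct nvme_uring_cmd *cmd;
	void *buf;
	int fd;

	/* uring_cmd passthrough needs the big SQE (128B) and CQE (32B) */
	if (io_uring_queue_init(8, &ring,
				IORING_SETUP_SQE128 | IORING_SETUP_CQE32))
		return 1;

	fd = open("/dev/ng0n1", O_RDONLY);	/* char (non-block) node */
	if (fd < 0 || posix_memalign(&buf, 4096, 512))
		return 1;

	sqe = io_uring_get_sqe(&ring);
	memset(sqe, 0, 2 * sizeof(*sqe));	/* SQE128: clear all 128 bytes */
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = fd;
	sqe->cmd_op = NVME_URING_CMD_IO;

	/* The NVMe command is built in user space, carried in the SQE itself */
	cmd = (struct nvme_uring_cmd *)sqe->cmd;
	cmd->opcode = 0x02;			/* NVMe Read */
	cmd->nsid = 1;
	cmd->addr = (__u64)(uintptr_t)buf;
	cmd->data_len = 512;
	cmd->cdw10 = 0;				/* SLBA (lower 32 bits) */
	cmd->cdw12 = 0;				/* NLB = 0 -> one LBA */

	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	printf("res=%d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}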

[1]https://lore.kernel.org/linux-nvme/[email protected]/




* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-02-10 18:00 ` [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO Kanchan Joshi
@ 2023-02-10 18:18   ` Bart Van Assche
  2023-02-10 19:34     ` Kanchan Joshi
                       ` (2 more replies)
  2023-02-10 19:53   ` Jens Axboe
                     ` (4 subsequent siblings)
  5 siblings, 3 replies; 19+ messages in thread
From: Bart Van Assche @ 2023-02-10 18:18 UTC (permalink / raw)
  To: Kanchan Joshi, lsf-pc
  Cc: linux-block, linux-nvme, io-uring, axboe, hch, kbusch, ming.lei

On 2/10/23 10:00, Kanchan Joshi wrote:
> 3. DMA cost: is high in presence of IOMMU. Keith posted the work[1],
> with block IO path, last year. I imagine plumbing to get a bit simpler
> with passthrough-only support. But what are the other things that must
> be sorted out to have progress on moving DMA cost out of the fast path?

Are performance numbers available?

Isn't IOMMU cost something that has already been solved? From 
https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf: 
"Evaluation of our designs under Linux shows that (1)
they achieve 88.5%–100% of the performance obtained
without an IOMMU".

Thanks,

Bart.



* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-02-10 18:18   ` Bart Van Assche
@ 2023-02-10 19:34     ` Kanchan Joshi
  2023-02-13 20:24       ` Bart Van Assche
  2023-02-10 19:47     ` Jens Axboe
  2023-02-14 10:33     ` John Garry
  2 siblings, 1 reply; 19+ messages in thread
From: Kanchan Joshi @ 2023-02-10 19:34 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: lsf-pc, linux-block, linux-nvme, io-uring, axboe, hch, kbusch, ming.lei


On Fri, Feb 10, 2023 at 10:18:08AM -0800, Bart Van Assche wrote:
>On 2/10/23 10:00, Kanchan Joshi wrote:
>>3. DMA cost: is high in presence of IOMMU. Keith posted the work[1],
>>with block IO path, last year. I imagine plumbing to get a bit simpler
>>with passthrough-only support. But what are the other things that must
>>be sorted out to have progress on moving DMA cost out of the fast path?
>
>Are performance numbers available?

Around a 55% delta when I checked last (6.1-rcX kernel).
512b randread IOPS with Optane, on an AMD Ryzen 9 box:
when iommu is set to lazy (default config) = 3.1M
when iommu is disabled or in passthrough mode = 4.9M

>Isn't IOMMU cost something that has already been solved? From https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf: 
>"Evaluation of our designs under Linux shows that (1)
>they achieve 88.5%–100% of the performance obtained
>without an IOMMU".

Since the above numbers are more recent than the paper, this is
evidently yet to be solved.





* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-02-10 18:18   ` Bart Van Assche
  2023-02-10 19:34     ` Kanchan Joshi
@ 2023-02-10 19:47     ` Jens Axboe
  2023-02-14 10:33     ` John Garry
  2 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2023-02-10 19:47 UTC (permalink / raw)
  To: Bart Van Assche, Kanchan Joshi, lsf-pc
  Cc: linux-block, linux-nvme, io-uring, hch, kbusch, ming.lei

On 2/10/23 11:18 AM, Bart Van Assche wrote:
> On 2/10/23 10:00, Kanchan Joshi wrote:
>> 3. DMA cost: is high in presence of IOMMU. Keith posted the work[1],
>> with block IO path, last year. I imagine plumbing to get a bit simpler
>> with passthrough-only support. But what are the other things that must
>> be sorted out to have progress on moving DMA cost out of the fast path?
> 
> Are performance numbers available?
> 
> Isn't IOMMU cost something that has already been solved? From https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf: "Evaluation of our designs under Linux shows that (1)
> they achieve 88.5%–100% of the performance obtained
> without an IOMMU".

Sorry, no, IOMMU cost is definitely not a solved problem; it adds
considerable overhead. Caveat that I didn't read that paper, but I'm
speaking from practical experience. Let's not be naive here.

-- 
Jens Axboe



* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-02-10 18:00 ` [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO Kanchan Joshi
  2023-02-10 18:18   ` Bart Van Assche
@ 2023-02-10 19:53   ` Jens Axboe
  2023-02-13 11:54     ` Sagi Grimberg
  2023-04-11 22:48     ` Kanchan Joshi
  2023-02-10 20:07   ` Clay Mayers
                     ` (3 subsequent siblings)
  5 siblings, 2 replies; 19+ messages in thread
From: Jens Axboe @ 2023-02-10 19:53 UTC (permalink / raw)
  To: Kanchan Joshi, lsf-pc
  Cc: linux-block, linux-nvme, io-uring, hch, kbusch, ming.lei

On 2/10/23 11:00 AM, Kanchan Joshi wrote:
> is getting more common than it used to be.
> NVMe is no longer tied to block storage. Command sets in NVMe 2.0 spec
> opened an excellent way to present non-block interfaces to the Host. ZNS
> and KV came along with it, and some new command sets are emerging.
> 
> OTOH, Kernel IO advances historically centered around the block IO path.
> Passthrough IO path existed, but it stayed far from all the advances, be
> it new features or performance.
> 
> Current state & discussion points:
> ---------------------------------
> Status-quo changed in the recent past with the new passthrough path (ng
> char interface + io_uring command). Feature parity does not exist, but
> performance parity does.
> Adoption draws asks. I propose a session covering a few voices and
> finding a path-forward for some ideas too.
> 
> 1. Command cancellation: while NVMe mandatorily supports the abort
> command, we do not have a way to trigger that from user-space. There
> are ways to go about it (with or without the uring-cancel interface) but
> not without certain tradeoffs. It will be good to discuss the choices in
> person.
> 
> 2. Cgroups: works for only block dev at the moment. Are there outright
> objections to extending this to char-interface IO?
> 
> 3. DMA cost: is high in presence of IOMMU. Keith posted the work[1],
> with block IO path, last year. I imagine plumbing to get a bit simpler
> with passthrough-only support. But what are the other things that must
> be sorted out to have progress on moving DMA cost out of the fast path?

Yeah, this one is still pending... Would be nice to make some progress
there at some point.

> 4. Direct NVMe queues - will there be interest in having io_uring
> managed NVMe queues?  Sort of a new ring, for which I/O is destaged from
> io_uring SQE to NVMe SQE without having to go through intermediate
> constructs (i.e., bio/request). Hopefully,that can further amp up the
> efficiency of IO.

This is interesting, and I've pondered something like that before too. I
think it's worth investigating and hacking up a prototype. I recently
had one user of IOPOLL assume that setting up a ring with IOPOLL would
automatically create a polled queue on the driver side and that is what
would be used for IO. And while that's not how it currently works, it
definitely does make sense and we could make some things faster like
that. It would also potentially make it easier to enable the cancelation
referenced in #1 above, if it's restricted to the queue(s) that the ring
"owns".

-- 
Jens Axboe



* RE: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-02-10 18:00 ` [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO Kanchan Joshi
  2023-02-10 18:18   ` Bart Van Assche
  2023-02-10 19:53   ` Jens Axboe
@ 2023-02-10 20:07   ` Clay Mayers
  2023-02-11  3:33   ` Ming Lei
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Clay Mayers @ 2023-02-10 20:07 UTC (permalink / raw)
  To: Kanchan Joshi, lsf-pc
  Cc: linux-block, linux-nvme, io-uring, axboe, hch, kbusch, ming.lei

> From Kanchan Joshi
> Sent: Friday, February 10, 2023 10:01 AM
> To: [email protected]
> Cc: [email protected]; [email protected]; io-
> [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; Kanchan Joshi <[email protected]>
> Subject: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
> 
> is getting more common than it used to be.
> NVMe is no longer tied to block storage. Command sets in NVMe 2.0 spec
> opened an excellent way to present non-block interfaces to the Host. ZNS
> and KV came along with it, and some new command sets are emerging.

Some command sets require NVMe features the kernel doesn't support;
fused commands and some AENs, for example. It would be very useful to
work with non-block command sets without modifying the NVMe driver,
maintaining a custom NVMe driver per command set, or resorting to SPDK.



* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-02-10 18:00 ` [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO Kanchan Joshi
                     ` (2 preceding siblings ...)
  2023-02-10 20:07   ` Clay Mayers
@ 2023-02-11  3:33   ` Ming Lei
  2023-02-11 12:06   ` Hannes Reinecke
  2023-02-28 16:05   ` John Meneghini
  5 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2023-02-11  3:33 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: lsf-pc, linux-block, linux-nvme, io-uring, axboe, hch, kbusch, ming.lei

On Fri, Feb 10, 2023 at 11:30:33PM +0530, Kanchan Joshi wrote:
> is getting more common than it used to be.
> NVMe is no longer tied to block storage. Command sets in NVMe 2.0 spec
> opened an excellent way to present non-block interfaces to the Host. ZNS
> and KV came along with it, and some new command sets are emerging.
> 
> OTOH, Kernel IO advances historically centered around the block IO path.
> Passthrough IO path existed, but it stayed far from all the advances, be
> it new features or performance.
> 
> Current state & discussion points:
> ---------------------------------
> Status-quo changed in the recent past with the new passthrough path (ng
> char interface + io_uring command). Feature parity does not exist, but
> performance parity does.
> Adoption draws asks. I propose a session covering a few voices and
> finding a path-forward for some ideas too.
> 
> 1. Command cancellation: while NVMe mandatorily supports the abort
> command, we do not have a way to trigger that from user-space. There
> are ways to go about it (with or without the uring-cancel interface) but
> not without certain tradeoffs. It will be good to discuss the choices in
> person.
> 
> 2. Cgroups: works for only block dev at the moment. Are there outright
> objections to extending this to char-interface IO?

But blk-cgroup has recently been changing to associate with the disk
only, which may move it further away from supporting cgroups for
passthrough IO.

Another thing is the IO scheduler; I guess it isn't important for NVMe
any more?

Also IO accounting.

> 
> 3. DMA cost: is high in presence of IOMMU. Keith posted the work[1],
> with block IO path, last year. I imagine plumbing to get a bit simpler
> with passthrough-only support. But what are the other things that must
> be sorted out to have progress on moving DMA cost out of the fast path?
> 
> 4. Direct NVMe queues - will there be interest in having io_uring
> managed NVMe queues?  Sort of a new ring, for which I/O is destaged from
> io_uring SQE to NVMe SQE without having to go through intermediate
> constructs (i.e., bio/request). Hopefully,that can further amp up the
> efficiency of IO.

Interesting!

There is no bio for NVMe io_uring command passthrough, but the request
is still there. If the SQE can provide a unique ID, the request may
reuse it as a tag.


Thanks,
Ming



* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-02-10 18:00 ` [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO Kanchan Joshi
                     ` (3 preceding siblings ...)
  2023-02-11  3:33   ` Ming Lei
@ 2023-02-11 12:06   ` Hannes Reinecke
  2023-02-28 16:05   ` John Meneghini
  5 siblings, 0 replies; 19+ messages in thread
From: Hannes Reinecke @ 2023-02-11 12:06 UTC (permalink / raw)
  To: Kanchan Joshi, lsf-pc
  Cc: linux-block, linux-nvme, io-uring, axboe, hch, kbusch, ming.lei

On 2/10/23 19:00, Kanchan Joshi wrote:
> is getting more common than it used to be.
> NVMe is no longer tied to block storage. Command sets in NVMe 2.0 spec
> opened an excellent way to present non-block interfaces to the Host. ZNS
> and KV came along with it, and some new command sets are emerging.
> 
> OTOH, Kernel IO advances historically centered around the block IO path.
> Passthrough IO path existed, but it stayed far from all the advances, be
> it new features or performance.
> 
> Current state & discussion points:
> ---------------------------------
> Status-quo changed in the recent past with the new passthrough path (ng
> char interface + io_uring command). Feature parity does not exist, but
> performance parity does.
> Adoption draws asks. I propose a session covering a few voices and
> finding a path-forward for some ideas too.
> 
> 1. Command cancellation: while NVMe mandatorily supports the abort
> command, we do not have a way to trigger that from user-space. There
> are ways to go about it (with or without the uring-cancel interface) but
> not without certain tradeoffs. It will be good to discuss the choices in
> person.
> 
I would love to have this discussion; that's something which has been on 
my personal to-do list for a long time, and io_uring might finally be a 
solution to it.

Alternatively, we could look at CDL for NVMe; that would be another
approach.
Maybe it's even worthwhile to schedule a separate meeting for it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
[email protected]                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-02-10 19:53   ` Jens Axboe
@ 2023-02-13 11:54     ` Sagi Grimberg
  2023-04-11 22:48     ` Kanchan Joshi
  1 sibling, 0 replies; 19+ messages in thread
From: Sagi Grimberg @ 2023-02-13 11:54 UTC (permalink / raw)
  To: Jens Axboe, Kanchan Joshi, lsf-pc
  Cc: linux-block, linux-nvme, io-uring, hch, kbusch, ming.lei



On 2/10/23 21:53, Jens Axboe wrote:
> On 2/10/23 11:00 AM, Kanchan Joshi wrote:
>> is getting more common than it used to be.
>> NVMe is no longer tied to block storage. Command sets in NVMe 2.0 spec
>> opened an excellent way to present non-block interfaces to the Host. ZNS
>> and KV came along with it, and some new command sets are emerging.
>>
>> OTOH, Kernel IO advances historically centered around the block IO path.
>> Passthrough IO path existed, but it stayed far from all the advances, be
>> it new features or performance.
>>
>> Current state & discussion points:
>> ---------------------------------
>> Status-quo changed in the recent past with the new passthrough path (ng
>> char interface + io_uring command). Feature parity does not exist, but
>> performance parity does.
>> Adoption draws asks. I propose a session covering a few voices and
>> finding a path-forward for some ideas too.
>>
>> 1. Command cancellation: while NVMe mandatorily supports the abort
>> command, we do not have a way to trigger that from user-space. There
>> are ways to go about it (with or without the uring-cancel interface) but
>> not without certain tradeoffs. It will be good to discuss the choices in
>> person.

This would require some rework of how the driver handles aborts today.
I'm unsure what cancellation guarantees io_uring provides, but we need
to understand whether they fit with the guarantees that nvme provides.

It is also unclear to me how this would work if different namespaces
are handed to different users, who would all submit aborts on
the admin queue. How do you even differentiate which user sent which
command?

>>
>> 2. Cgroups: works for only block dev at the moment. Are there outright
>> objections to extending this to char-interface IO?
>>
>> 3. DMA cost: is high in presence of IOMMU. Keith posted the work[1],
>> with block IO path, last year. I imagine plumbing to get a bit simpler
>> with passthrough-only support. But what are the other things that must
>> be sorted out to have progress on moving DMA cost out of the fast path?
> 
> Yeah, this one is still pending... Would be nice to make some progress
> there at some point.
> 
>> 4. Direct NVMe queues - will there be interest in having io_uring
>> managed NVMe queues?  Sort of a new ring, for which I/O is destaged from
>> io_uring SQE to NVMe SQE without having to go through intermediate
>> constructs (i.e., bio/request). Hopefully,that can further amp up the
>> efficiency of IO.
> 
> This is interesting, and I've pondered something like that before too. I
> think it's worth investigating and hacking up a prototype. I recently
> had one user of IOPOLL assume that setting up a ring with IOPOLL would
> automatically create a polled queue on the driver side and that is what
> would be used for IO. And while that's not how it currently works, it
> definitely does make sense and we could make some things faster like
> that.

I also think it can make sense; I'd use it if it were available.
Though io_uring may need to abstract the fact that the device may be
limited in the number of queues it supports. This would also need an
interface from the driver, which would have to understand how to
coordinate controller reset/teardown in the presence of "alien" queues.

> It would also potentially easier enable cancelation referenced in
> #1 above, if it's restricted to the queue(s) that the ring "owns".
> 

That could be a potential enforcement point, correlating the command with
the dedicated queue. It still feels dangerous, because if admin abort(s)
time out, the driver really needs to reset the entire controller...
So it is not really "isolated" when it comes to aborts/cancellations.


* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-02-10 19:34     ` Kanchan Joshi
@ 2023-02-13 20:24       ` Bart Van Assche
  0 siblings, 0 replies; 19+ messages in thread
From: Bart Van Assche @ 2023-02-13 20:24 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: lsf-pc, linux-block, linux-nvme, io-uring, axboe, hch, kbusch, ming.lei

On 2/10/23 11:34, Kanchan Joshi wrote:
> On Fri, Feb 10, 2023 at 10:18:08AM -0800, Bart Van Assche wrote:
>> On 2/10/23 10:00, Kanchan Joshi wrote:
>>> 3. DMA cost: is high in presence of IOMMU. Keith posted the work[1],
>>> with block IO path, last year. I imagine plumbing to get a bit simpler
>>> with passthrough-only support. But what are the other things that must
>>> be sorted out to have progress on moving DMA cost out of the fast path?
>>
>> Are performance numbers available?
> 
> Around 55% decline when I checked last (6.1-rcX kernel).
> 512b randread IOPS with optane, on AMD ryzen 9 box -
> when iommu is set to lazy (default config)= 3.1M
> when iommmu is disabled or in passthrough mode = 4.9M

Hi Kanchan,

Thank you for sharing these numbers. More information would be
welcome, e.g. the latency impact of the IOMMU on a QD=1 test, the queue
depth of the test results mentioned above, and how much additional
CPU time is needed with the IOMMU enabled. I'm wondering whether the
IOMMU cost is dominated by the IOMMU hardware or by software bottlenecks
(e.g. spinlocks).

Thanks,

Bart.



* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-02-10 18:18   ` Bart Van Assche
  2023-02-10 19:34     ` Kanchan Joshi
  2023-02-10 19:47     ` Jens Axboe
@ 2023-02-14 10:33     ` John Garry
  2 siblings, 0 replies; 19+ messages in thread
From: John Garry @ 2023-02-14 10:33 UTC (permalink / raw)
  To: Bart Van Assche, Kanchan Joshi, lsf-pc
  Cc: linux-block, linux-nvme, io-uring, axboe, hch, kbusch, ming.lei

On 10/02/2023 18:18, Bart Van Assche wrote:
> On 2/10/23 10:00, Kanchan Joshi wrote:
>> 3. DMA cost: is high in presence of IOMMU. Keith posted the work[1],
>> with block IO path, last year. I imagine plumbing to get a bit simpler
>> with passthrough-only support. But what are the other things that must
>> be sorted out to have progress on moving DMA cost out of the fast path?
> 
> Are performance numbers available?
> 
> Isn't IOMMU cost something that has already been solved? From 
> https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf: "Evaluation of our designs under Linux shows that (1)
> they achieve 88.5%–100% of the performance obtained
> without an IOMMU".

That paper is ~8 years old now. Some of its recommendations have since
been implemented in the kernel, like per-CPU IOVA caching and
per-IOMMU-domain IOTLB flushing with per-CPU queues (the latter is
relevant to lazy mode only).

Thanks,
John


* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-02-10 18:00 ` [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO Kanchan Joshi
                     ` (4 preceding siblings ...)
  2023-02-11 12:06   ` Hannes Reinecke
@ 2023-02-28 16:05   ` John Meneghini
  5 siblings, 0 replies; 19+ messages in thread
From: John Meneghini @ 2023-02-28 16:05 UTC (permalink / raw)
  To: Kanchan Joshi, lsf-pc
  Cc: linux-block, linux-nvme, io-uring, axboe, hch, kbusch, ming.lei

On 2/10/23 13:00, Kanchan Joshi wrote:
> 1. Command cancellation: while NVMe mandatorily supports the abort
> command, we do not have a way to trigger that from user-space. There
> are ways to go about it (with or without the uring-cancel interface) but
> not without certain tradeoffs. It will be good to discuss the choices in
> person.

As one of the principal authors of TP4097a, and the author of the one
NVMe controller implementation that supports the NVMe Cancel command, I
would like to attend LSF/MM this year and talk about this.

See my SDC presentation, where I describe all of the problems with the
NVMe Abort command and demonstrate a Linux host sending NVMe Abort and
Cancel commands to an IO controller:

https://www.youtube.com/watch?v=vRrAD1U0IRw


/John



* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-02-10 19:53   ` Jens Axboe
  2023-02-13 11:54     ` Sagi Grimberg
@ 2023-04-11 22:48     ` Kanchan Joshi
  2023-04-11 22:53       ` Jens Axboe
  2023-04-12  2:33       ` Ming Lei
  1 sibling, 2 replies; 19+ messages in thread
From: Kanchan Joshi @ 2023-04-11 22:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Kanchan Joshi, lsf-pc, linux-block, linux-nvme, io-uring, hch,
	kbusch, ming.lei

> > 4. Direct NVMe queues - will there be interest in having io_uring
> > managed NVMe queues?  Sort of a new ring, for which I/O is destaged from
> > io_uring SQE to NVMe SQE without having to go through intermediate
> > constructs (i.e., bio/request). Hopefully,that can further amp up the
> > efficiency of IO.
>
> This is interesting, and I've pondered something like that before too. I
> think it's worth investigating and hacking up a prototype. I recently
> had one user of IOPOLL assume that setting up a ring with IOPOLL would
> automatically create a polled queue on the driver side and that is what
> would be used for IO. And while that's not how it currently works, it
> definitely does make sense and we could make some things faster like
> that. It would also potentially easier enable cancelation referenced in
> #1 above, if it's restricted to the queue(s) that the ring "owns".

So I am looking at prototyping it, exclusively for the polled-io case.
And for that, is there already a way to ensure that there are no
concurrent submissions to this ring (set up with the IORING_SETUP_IOPOLL
flag)?
That will generally be the case (submissions happen under the
uring_lock mutex), but a submission may still get punted to io-wq
worker(s), which do not take that mutex.
So the original task and a worker may end up doing concurrent submissions.

The IORING_SETUP_SINGLE_ISSUER flag - that is not for this case, or is it?
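
(For reference, the ring setup I have in mind for the prototype - just a
sketch with placeholder sizes. As far as I can tell, nothing in these
flags by itself prevents the io-wq punting described above, which is why
I am asking.)

#include <liburing.h>

/* Sketch: ring setup assumed for the polled passthrough prototype */
static int setup_polled_ring(struct io_uring *ring)
{
	struct io_uring_params p = { };

	p.flags = IORING_SETUP_IOPOLL |	/* completions reaped by polling */
		  IORING_SETUP_SQE128 |	/* big SQE to carry the NVMe command */
		  IORING_SETUP_CQE32;	/* big CQE for the 64-bit result */
	return io_uring_queue_init_params(256, ring, &p);
}

/* Reap one completion; with IOPOLL this drives the driver's poll path */
static int reap_one(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	int ret = io_uring_wait_cqe(ring, &cqe);

	if (!ret)
		io_uring_cqe_seen(ring, cqe);
	return ret;
}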


* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-04-11 22:48     ` Kanchan Joshi
@ 2023-04-11 22:53       ` Jens Axboe
  2023-04-11 23:28         ` Kanchan Joshi
  2023-04-12  2:33       ` Ming Lei
  1 sibling, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2023-04-11 22:53 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Kanchan Joshi, lsf-pc, linux-block, linux-nvme, io-uring, hch,
	kbusch, ming.lei

On 4/11/23 4:48 PM, Kanchan Joshi wrote:
>>> 4. Direct NVMe queues - will there be interest in having io_uring
>>> managed NVMe queues?  Sort of a new ring, for which I/O is destaged from
>>> io_uring SQE to NVMe SQE without having to go through intermediate
>>> constructs (i.e., bio/request). Hopefully,that can further amp up the
>>> efficiency of IO.
>>
>> This is interesting, and I've pondered something like that before too. I
>> think it's worth investigating and hacking up a prototype. I recently
>> had one user of IOPOLL assume that setting up a ring with IOPOLL would
>> automatically create a polled queue on the driver side and that is what
>> would be used for IO. And while that's not how it currently works, it
>> definitely does make sense and we could make some things faster like
>> that. It would also potentially easier enable cancelation referenced in
>> #1 above, if it's restricted to the queue(s) that the ring "owns".
> 
> So I am looking at prototyping it, exclusively for the polled-io case.
> And for that, is there already a way to ensure that there are no
> concurrent submissions to this ring (set with IORING_SETUP_IOPOLL
> flag)?
> That will be the case generally (and submissions happen under
> uring_lock mutex), but submission may still get punted to io-wq
> worker(s) which do not take that mutex.
> So the original task and worker may get into doing concurrent submissions.

io-wq may indeed get in your way. But I think for something like this,
you'd never want to punt to io-wq to begin with. If userspace is managing
the queue, then by definition you cannot run out of tags. If there are
other conditions for this kind of request that may run into out-of-memory
conditions, then the error just needs to be returned.

With that, you have exclusive submits on that ring and lower down.

> The flag IORING_SETUP_SINGLE_ISSUER - is not for this case, or is it?

It's not, it enables optimizations around the ring creator saying that
only one userspace task is submitting requests on this ring.

-- 
Jens Axboe




* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-04-11 22:53       ` Jens Axboe
@ 2023-04-11 23:28         ` Kanchan Joshi
  2023-04-12  2:12           ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Kanchan Joshi @ 2023-04-11 23:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Kanchan Joshi, lsf-pc, linux-block, linux-nvme, io-uring, hch,
	kbusch, ming.lei

On Wed, Apr 12, 2023 at 4:23 AM Jens Axboe <[email protected]> wrote:
>
> On 4/11/23 4:48 PM, Kanchan Joshi wrote:
> >>> 4. Direct NVMe queues - will there be interest in having io_uring
> >>> managed NVMe queues?  Sort of a new ring, for which I/O is destaged from
> >>> io_uring SQE to NVMe SQE without having to go through intermediate
> >>> constructs (i.e., bio/request). Hopefully,that can further amp up the
> >>> efficiency of IO.
> >>
> >> This is interesting, and I've pondered something like that before too. I
> >> think it's worth investigating and hacking up a prototype. I recently
> >> had one user of IOPOLL assume that setting up a ring with IOPOLL would
> >> automatically create a polled queue on the driver side and that is what
> >> would be used for IO. And while that's not how it currently works, it
> >> definitely does make sense and we could make some things faster like
> >> that. It would also potentially easier enable cancelation referenced in
> >> #1 above, if it's restricted to the queue(s) that the ring "owns".
> >
> > So I am looking at prototyping it, exclusively for the polled-io case.
> > And for that, is there already a way to ensure that there are no
> > concurrent submissions to this ring (set with IORING_SETUP_IOPOLL
> > flag)?
> > That will be the case generally (and submissions happen under
> > uring_lock mutex), but submission may still get punted to io-wq
> > worker(s) which do not take that mutex.
> > So the original task and worker may get into doing concurrent submissions.
>
> io-wq may indeed get in your way. But I think for something like this,
> you'd never want to punt to io-wq to begin with. If userspace is managing
> the queue, then by definition you cannot run out of tags.

Unfortunately we have lifetime differences between io_uring and NVMe.
An NVMe tag remains valid/occupied until completion (we do not have a
nice sq->head to look at to decide).
For io_uring, an SQE can be reused much earlier, i.e. just after submission.
So a tag shortage is possible.

>If there are
> other conditions for this kind of request that may run into out-of-memory
> conditions, then the error just needs to be returned.

I see, and IOSQE_ASYNC can also be flagged as an error/not-supported. Thanks.


* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-04-11 23:28         ` Kanchan Joshi
@ 2023-04-12  2:12           ` Jens Axboe
  0 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2023-04-12  2:12 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Kanchan Joshi, lsf-pc, linux-block, linux-nvme, io-uring, hch,
	kbusch, ming.lei

On 4/11/23 5:28 PM, Kanchan Joshi wrote:
> On Wed, Apr 12, 2023 at 4:23 AM Jens Axboe <[email protected]> wrote:
>>
>> On 4/11/23 4:48 PM, Kanchan Joshi wrote:
>>>>> 4. Direct NVMe queues - will there be interest in having io_uring
>>>>> managed NVMe queues?  Sort of a new ring, for which I/O is destaged from
>>>>> io_uring SQE to NVMe SQE without having to go through intermediate
>>>>> constructs (i.e., bio/request). Hopefully,that can further amp up the
>>>>> efficiency of IO.
>>>>
>>>> This is interesting, and I've pondered something like that before too. I
>>>> think it's worth investigating and hacking up a prototype. I recently
>>>> had one user of IOPOLL assume that setting up a ring with IOPOLL would
>>>> automatically create a polled queue on the driver side and that is what
>>>> would be used for IO. And while that's not how it currently works, it
>>>> definitely does make sense and we could make some things faster like
>>>> that. It would also potentially easier enable cancelation referenced in
>>>> #1 above, if it's restricted to the queue(s) that the ring "owns".
>>>
>>> So I am looking at prototyping it, exclusively for the polled-io case.
>>> And for that, is there already a way to ensure that there are no
>>> concurrent submissions to this ring (set with IORING_SETUP_IOPOLL
>>> flag)?
>>> That will be the case generally (and submissions happen under
>>> uring_lock mutex), but submission may still get punted to io-wq
>>> worker(s) which do not take that mutex.
>>> So the original task and worker may get into doing concurrent submissions.
>>
>> io-wq may indeed get in your way. But I think for something like this,
>> you'd never want to punt to io-wq to begin with. If userspace is managing
>> the queue, then by definition you cannot run out of tags.
> 
> Unfortunately we have lifetime differences between io_uring and NVMe.
> NVMe tag remains valid/occupied until completion (we do not have a
> nice sq->head to look at and decide).
> For io_uring, it can be reused much earlier i.e. just after submission.
> So tag shortage is possible.

The sqe cannot be the tag, the tag has to be generated separately. It
doesn't make sense to tie the sqe and tag together, as one is consumed
in order and the other one is not.
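
(For illustration, something as simple as a per-ring bitmap sized to the
NVMe queue depth would do - a hypothetical sketch, not an existing
interface; SQEs are consumed in order, while tags complete and get
recycled out of order.)

#include <stdint.h>

#define QD 64			/* NVMe SQ depth owned by this ring */

struct tag_pool {
	uint64_t free_mask;	/* bit i set => tag i is free (QD <= 64) */
};

static inline void tag_pool_init(struct tag_pool *tp)
{
	tp->free_mask = (QD >= 64) ? ~0ULL : (1ULL << QD) - 1;
}

static inline int tag_alloc(struct tag_pool *tp)
{
	if (!tp->free_mask)
		return -1;	/* NVMe SQ full: fail, don't punt to io-wq */
	int tag = __builtin_ctzll(tp->free_mask);
	tp->free_mask &= ~(1ULL << tag);
	return tag;		/* used as the NVMe command identifier (CID) */
}

static inline void tag_free(struct tag_pool *tp, int tag)
{
	tp->free_mask |= 1ULL << tag;	/* on seeing the CQE for this CID */
}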

>> If there are
>> other conditions for this kind of request that may run into out-of-memory
>> conditions, then the error just needs to be returned.
> 
> I see, and IOSQE_ASYNC can also be flagged as an error/not-supported.

Yep!

-- 
Jens Axboe



* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-04-11 22:48     ` Kanchan Joshi
  2023-04-11 22:53       ` Jens Axboe
@ 2023-04-12  2:33       ` Ming Lei
  2023-04-12 13:26         ` Kanchan Joshi
  1 sibling, 1 reply; 19+ messages in thread
From: Ming Lei @ 2023-04-12  2:33 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Jens Axboe, Kanchan Joshi, lsf-pc, linux-block, linux-nvme,
	io-uring, hch, kbusch, ming.lei

On Wed, Apr 12, 2023 at 04:18:16AM +0530, Kanchan Joshi wrote:
> > > 4. Direct NVMe queues - will there be interest in having io_uring
> > > managed NVMe queues?  Sort of a new ring, for which I/O is destaged from
> > > io_uring SQE to NVMe SQE without having to go through intermediate
> > > constructs (i.e., bio/request). Hopefully,that can further amp up the
> > > efficiency of IO.
> >
> > This is interesting, and I've pondered something like that before too. I
> > think it's worth investigating and hacking up a prototype. I recently
> > had one user of IOPOLL assume that setting up a ring with IOPOLL would
> > automatically create a polled queue on the driver side and that is what
> > would be used for IO. And while that's not how it currently works, it
> > definitely does make sense and we could make some things faster like
> > that. It would also potentially easier enable cancelation referenced in
> > #1 above, if it's restricted to the queue(s) that the ring "owns".
> 
> So I am looking at prototyping it, exclusively for the polled-io case.
> And for that, is there already a way to ensure that there are no
> concurrent submissions to this ring (set with IORING_SETUP_IOPOLL
> flag)?
> That will be the case generally (and submissions happen under
> uring_lock mutex), but submission may still get punted to io-wq
> worker(s) which do not take that mutex.
> So the original task and worker may get into doing concurrent submissions.

It seems like a defect in uring command support, since io_ring_ctx and
io_ring_submit_lock() can't be exported to the driver.

It could be triggered if the request is in a link chain too.

Probably the issue can be worked around by:

	if (issue_flags & IO_URING_F_UNLOCKED)
		io_uring_cmd_complete_in_task(task_work_cb);
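
In a ->uring_cmd() handler that would look roughly like the sketch
below. Names are made up, error handling is skipped, and the exact
io_uring_cmd helper/callback prototypes differ across kernel versions,
so treat it as a sketch only:

#include <linux/io_uring.h>	/* linux/io_uring/cmd.h on newer kernels */

static int foo_queue_submit(struct io_uring_cmd *ioucmd);	/* hypothetical */

/* Runs in the submitting task's context via task_work */
static void foo_submit_in_task(struct io_uring_cmd *ioucmd,
			       unsigned int issue_flags)
{
	foo_queue_submit(ioucmd);
}

static int foo_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
{
	if (issue_flags & IO_URING_F_UNLOCKED) {
		/* io-wq path: bounce submission back to the issuing task */
		io_uring_cmd_complete_in_task(ioucmd, foo_submit_in_task);
		return -EIOCBQUEUED;
	}
	return foo_queue_submit(ioucmd);
}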


Thanks,
Ming



* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-04-12  2:33       ` Ming Lei
@ 2023-04-12 13:26         ` Kanchan Joshi
  2023-04-12 13:47           ` Ming Lei
  0 siblings, 1 reply; 19+ messages in thread
From: Kanchan Joshi @ 2023-04-12 13:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kanchan Joshi, Jens Axboe, lsf-pc, linux-block, linux-nvme,
	io-uring, hch, kbusch


On Wed, Apr 12, 2023 at 10:33:40AM +0800, Ming Lei wrote:
>On Wed, Apr 12, 2023 at 04:18:16AM +0530, Kanchan Joshi wrote:
>> > > 4. Direct NVMe queues - will there be interest in having io_uring
>> > > managed NVMe queues?  Sort of a new ring, for which I/O is destaged from
>> > > io_uring SQE to NVMe SQE without having to go through intermediate
>> > > constructs (i.e., bio/request). Hopefully,that can further amp up the
>> > > efficiency of IO.
>> >
>> > This is interesting, and I've pondered something like that before too. I
>> > think it's worth investigating and hacking up a prototype. I recently
>> > had one user of IOPOLL assume that setting up a ring with IOPOLL would
>> > automatically create a polled queue on the driver side and that is what
>> > would be used for IO. And while that's not how it currently works, it
>> > definitely does make sense and we could make some things faster like
>> > that. It would also potentially easier enable cancelation referenced in
>> > #1 above, if it's restricted to the queue(s) that the ring "owns".
>>
>> So I am looking at prototyping it, exclusively for the polled-io case.
>> And for that, is there already a way to ensure that there are no
>> concurrent submissions to this ring (set with IORING_SETUP_IOPOLL
>> flag)?
>> That will be the case generally (and submissions happen under
>> uring_lock mutex), but submission may still get punted to io-wq
>> worker(s) which do not take that mutex.
>> So the original task and worker may get into doing concurrent submissions.
>
>It seems one defect for uring command support, since io_ring_ctx and
>io_ring_submit_lock() can't be exported for driver.

Sorry, I did not follow the defect part.
io-wq not acquiring uring_lock in the case of uring-cmd - is that a defect?
The same happens for direct block-io too.
Or do you mean something else here?





* Re: [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO
  2023-04-12 13:26         ` Kanchan Joshi
@ 2023-04-12 13:47           ` Ming Lei
  0 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2023-04-12 13:47 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Kanchan Joshi, Jens Axboe, lsf-pc, linux-block, linux-nvme,
	io-uring, hch, kbusch, ming.lei

On Wed, Apr 12, 2023 at 06:56:15PM +0530, Kanchan Joshi wrote:
> On Wed, Apr 12, 2023 at 10:33:40AM +0800, Ming Lei wrote:
> > On Wed, Apr 12, 2023 at 04:18:16AM +0530, Kanchan Joshi wrote:
> > > > > 4. Direct NVMe queues - will there be interest in having io_uring
> > > > > managed NVMe queues?  Sort of a new ring, for which I/O is destaged from
> > > > > io_uring SQE to NVMe SQE without having to go through intermediate
> > > > > constructs (i.e., bio/request). Hopefully,that can further amp up the
> > > > > efficiency of IO.
> > > >
> > > > This is interesting, and I've pondered something like that before too. I
> > > > think it's worth investigating and hacking up a prototype. I recently
> > > > had one user of IOPOLL assume that setting up a ring with IOPOLL would
> > > > automatically create a polled queue on the driver side and that is what
> > > > would be used for IO. And while that's not how it currently works, it
> > > > definitely does make sense and we could make some things faster like
> > > > that. It would also potentially easier enable cancelation referenced in
> > > > #1 above, if it's restricted to the queue(s) that the ring "owns".
> > > 
> > > So I am looking at prototyping it, exclusively for the polled-io case.
> > > And for that, is there already a way to ensure that there are no
> > > concurrent submissions to this ring (set with IORING_SETUP_IOPOLL
> > > flag)?
> > > That will be the case generally (and submissions happen under
> > > uring_lock mutex), but submission may still get punted to io-wq
> > > worker(s) which do not take that mutex.
> > > So the original task and worker may get into doing concurrent submissions.
> > 
> > It seems one defect for uring command support, since io_ring_ctx and
> > io_ring_submit_lock() can't be exported for driver.
> 
> Sorry, did not follow the defect part.
> io-wq not acquring uring_lock in case of uring-cmd - is a defect? The same
> happens for direct block-io too.
> Or do you mean anything else here?

Maybe "defect" isn't the accurate word here.

I meant that ->uring_cmd() is the only driver/fs callback in which
issue_flags is exposed, so IO_URING_F_UNLOCKED is visible to the
driver, but io_ring_submit_lock() can't be called inside the driver.

There is no such problem for direct IO, since the above io_uring
details aren't exposed to the direct IO code.


Thanks, 
Ming



end of thread, other threads:[~2023-04-12 13:48 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20230210180226epcas5p1bd2e1150de067f8af61de2bbf571594d@epcas5p1.samsung.com>
2023-02-10 18:00 ` [LSF/MM/BPF ATTEND][LSF/MM/BPF Topic] Non-block IO Kanchan Joshi
2023-02-10 18:18   ` Bart Van Assche
2023-02-10 19:34     ` Kanchan Joshi
2023-02-13 20:24       ` Bart Van Assche
2023-02-10 19:47     ` Jens Axboe
2023-02-14 10:33     ` John Garry
2023-02-10 19:53   ` Jens Axboe
2023-02-13 11:54     ` Sagi Grimberg
2023-04-11 22:48     ` Kanchan Joshi
2023-04-11 22:53       ` Jens Axboe
2023-04-11 23:28         ` Kanchan Joshi
2023-04-12  2:12           ` Jens Axboe
2023-04-12  2:33       ` Ming Lei
2023-04-12 13:26         ` Kanchan Joshi
2023-04-12 13:47           ` Ming Lei
2023-02-10 20:07   ` Clay Mayers
2023-02-11  3:33   ` Ming Lei
2023-02-11 12:06   ` Hannes Reinecke
2023-02-28 16:05   ` John Meneghini
