* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-03 14:29 ` [LSF/MM/BPF TOPIC] dmabuf backed read/write Pavel Begunkov
@ 2026-02-03 18:07 ` Keith Busch
2026-02-04 6:07 ` Anuj Gupta/Anuj Gupta
2026-02-04 11:38 ` Pavel Begunkov
2026-02-04 15:26 ` Nitesh Shetty
` (3 subsequent siblings)
4 siblings, 2 replies; 25+ messages in thread
From: Keith Busch @ 2026-02-03 18:07 UTC (permalink / raw)
To: Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
> Good day everyone,
>
> dma-buf is a powerful abstraction for managing buffers and DMA mappings,
> and there is growing interest in extending it to the read/write path to
> enable device-to-device transfers without bouncing data through system
> memory. I was encouraged to submit it to LSF/MM/BPF as that might be
> useful to mull over details and what capabilities and features people
> may need.
>
> The proposal consists of two parts. The first is a small in-kernel
> framework that allows a dma-buf to be registered against a given file
> and returns an object representing a DMA mapping. The actual mapping
> creation is delegated to the target subsystem (e.g. NVMe). This
> abstraction centralises request accounting, mapping management, dynamic
> recreation, etc. The resulting mapping object is passed through the I/O
> stack via a new iov_iter type.
>
> As for the user API, a dma-buf is installed as an io_uring registered
> buffer for a specific file. Once registered, the buffer can be used by
> read / write io_uring requests as normal. io_uring will enforce that the
> buffer is only used with "compatible files", which is for now restricted
> to the target registration file, but will be expanded in the future.
> Notably, io_uring is a consumer of the framework rather than a
> dependency, and the infrastructure can be reused.
>
> It took a couple of iterations on the list to get it to the current
> design; v2 of the series can be found at [1], which implements the
> infrastructure and initial wiring for NVMe. It slightly diverges from
> the description above, as some of the framework bits are block specific,
> and I'll be working on refining that and simplifying some of the
> interfaces for v3. A good chunk of the block handling is based on prior
> work from Keith on pre-DMA-mapping buffers [2].
>
> Tushar was helping and mentioned he got good numbers for P2P transfers
> compared to bouncing it via RAM. Anuj, Kanchan and Nitesh also
> previously reported encouraging results for system memory backed
> dma-buf for optimising IOMMU overhead, quoting Anuj:
>
> - STRICT: before = 570 KIOPS, after = 5.01 MIOPS
> - LAZY: before = 1.93 MIOPS, after = 5.01 MIOPS
> - PASSTHROUGH: before = 5.01 MIOPS, after = 5.01 MIOPS
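(For context, the STRICT, LAZY and PASSTHROUGH modes above are the kernel's
IOMMU DMA strategies; on a typical system they are selected at boot roughly
as follows. The command lines are illustrative and not part of Anuj's report:

    iommu.strict=1        # strict: IOTLB invalidated synchronously on unmap
    iommu.strict=0        # lazy: invalidations are batched and deferred
    iommu.passthrough=1   # passthrough: DMA bypasses IOMMU translation
)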
Thanks for submitting the topic. The performance wins look great, but
I'm a little surprised passthrough didn't show any difference. We're
still skipping some transformations with the dmabuf compared to not
having it, so maybe it's just a matter of crafting the right benchmark
to show the benefit.
Anyway, I look forward to the next version of this feature. I promise to
have more cycles to review and test the v3.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-03 18:07 ` Keith Busch
@ 2026-02-04 6:07 ` Anuj Gupta/Anuj Gupta
2026-02-04 11:38 ` Pavel Begunkov
1 sibling, 0 replies; 25+ messages in thread
From: Anuj Gupta/Anuj Gupta @ 2026-02-04 6:07 UTC (permalink / raw)
To: Keith Busch, Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Christian König, Christoph Hellwig, Kanchan Joshi,
Nitesh Shetty, lsf-pc@lists.linux-foundation.org
On 2/3/2026 11:37 PM, Keith Busch wrote:
> Thanks for submitting the topic. The performance wins look great, but
> I'm a little surprised passthrough didn't show any difference. We're
> still skipping some transformations with the dmabuf compared to not
> having it, so maybe it's just a matter of crafting the right benchmark
> to show the benefit.
>
Those numbers were from a drive that saturates at ~5M IOPS,
sopassthrough didn’t have much headroom. I did a quick run with two such
drives and saw a small improvement (~2–3%): ~5.97 MIOPS -> ~6.13 MIOPS,
but I’ll try tweaking the kernel config a bit to see if there’s more
headroom.
+1 on the topic - I'm interested in attending the discussion and
reviewing/testing v3 when it lands.
Thanks,
Anuj
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-03 18:07 ` Keith Busch
2026-02-04 6:07 ` Anuj Gupta/Anuj Gupta
@ 2026-02-04 11:38 ` Pavel Begunkov
1 sibling, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2026-02-04 11:38 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On 2/3/26 18:07, Keith Busch wrote:
> On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
>> Good day everyone,
>>
...
>> Tushar was helping and mentioned he got good numbers for P2P transfers
>> compared to bouncing it via RAM. Anuj, Kanchan and Nitesh also
>> previously reported encouraging results for system memory backed
>> dma-buf for optimising IOMMU overhead, quoting Anuj:
>>
>> - STRICT: before = 570 KIOPS, after = 5.01 MIOPS
>> - LAZY: before = 1.93 MIOPS, after = 5.01 MIOPS
>> - PASSTHROUGH: before = 5.01 MIOPS, after = 5.01 MIOPS
>
> Thanks for submitting the topic. The performance wins look great, but
> I'm a little surprised passthrough didn't show any difference. We're
> still skipping some transformations with the dmabuf compared to not
> having it, so maybe it's just a matter of crafting the right benchmark
> to show the benefit.
My first thought was that the hardware couldn't push more and that it
would be great to have idle numbers, but Anuj already demystified it.
> Anyway, I look forward to the next version of this feature. I promise to
> have more cycles to review and test the v3.
Thanks! And in general, IMHO at this point waiting for the next
version would be more time-efficient for reviewers.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-03 14:29 ` [LSF/MM/BPF TOPIC] dmabuf backed read/write Pavel Begunkov
2026-02-03 18:07 ` Keith Busch
@ 2026-02-04 15:26 ` Nitesh Shetty
2026-02-09 11:15 ` Pavel Begunkov
2026-02-05 3:12 ` Ming Lei
` (2 subsequent siblings)
4 siblings, 1 reply; 25+ messages in thread
From: Nitesh Shetty @ 2026-02-04 15:26 UTC (permalink / raw)
To: Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Christian König, Christoph Hellwig, Kanchan Joshi,
Anuj Gupta, lsf-pc@lists.linux-foundation.org
On 03/02/26 02:29PM, Pavel Begunkov wrote:
>Good day everyone,
>
>dma-buf is a powerful abstraction for managing buffers and DMA mappings,
>and there is growing interest in extending it to the read/write path to
>enable device-to-device transfers without bouncing data through system
>memory. I was encouraged to submit it to LSF/MM/BPF as that might be
>useful to mull over details and what capabilities and features people
>may need.
>
>The proposal consists of two parts. The first is a small in-kernel
>framework that allows a dma-buf to be registered against a given file
>and returns an object representing a DMA mapping. The actual mapping
>creation is delegated to the target subsystem (e.g. NVMe). This
>abstraction centralises request accounting, mapping management, dynamic
>recreation, etc. The resulting mapping object is passed through the I/O
>stack via a new iov_iter type.
>
>As for the user API, a dma-buf is installed as an io_uring registered
>buffer for a specific file. Once registered, the buffer can be used by
>read / write io_uring requests as normal. io_uring will enforce that the
>buffer is only used with "compatible files", which is for now restricted
>to the target registration file, but will be expanded in the future.
>Notably, io_uring is a consumer of the framework rather than a
>dependency, and the infrastructure can be reused.
>
We have been following the series; it's interesting from a couple of angles:
- IOPS-wise we see a major improvement, especially with an IOMMU
- The series provides a way to do p2pdma to accelerator memory
Here are a few topics I am looking into specifically:
- Right now the series uses a PRP list. We need a good way to keep the
sg_table info around and decide on the fly whether to expose the buffer
as a PRP list or an SG list, depending on the I/O size (a rough sketch
follows below).
- Possibility of further optimization of the new iov_iter type to reduce
per-I/O cost
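A minimal sketch of the kind of size-based decision described above,
assuming the sg_table is kept around as suggested; the helper, enum and
threshold are hypothetical, not code from the series or the NVMe driver:

    /*
     * Hypothetical helper: pick the descriptor format for one I/O from its
     * payload size and segment count. The threshold is illustrative only.
     */
    #include <linux/blk-mq.h>

    enum nvme_desc_fmt { NVME_DESC_PRP, NVME_DESC_SGL };

    static enum nvme_desc_fmt pick_desc_fmt(struct request *req,
                                            unsigned int nr_segs,
                                            bool ctrl_supports_sgl)
    {
            unsigned int avg_seg;

            if (!ctrl_supports_sgl || !nr_segs)
                    return NVME_DESC_PRP;

            avg_seg = blk_rq_payload_bytes(req) / nr_segs;
            /* Small, page-sized chunks map cheaply onto PRP entries. */
            if (avg_seg <= PAGE_SIZE)
                    return NVME_DESC_PRP;
            /* Larger, scattered I/O is more compact as an SGL. */
            return NVME_DESC_SGL;
    }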
Thanks,
Nitesh
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-04 15:26 ` Nitesh Shetty
@ 2026-02-09 11:15 ` Pavel Begunkov
0 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2026-02-09 11:15 UTC (permalink / raw)
To: Nitesh Shetty
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Christian König, Christoph Hellwig, Kanchan Joshi,
Anuj Gupta, lsf-pc@lists.linux-foundation.org
On 2/4/26 15:26, Nitesh Shetty wrote:
> On 03/02/26 02:29PM, Pavel Begunkov wrote:
>> Good day everyone,
>>
>> dma-buf is a powerful abstraction for managing buffers and DMA mappings,
>> and there is growing interest in extending it to the read/write path to
>> enable device-to-device transfers without bouncing data through system
>> memory. I was encouraged to submit it to LSF/MM/BPF as that might be
>> useful to mull over details and what capabilities and features people
>> may need.
>>
>> The proposal consists of two parts. The first is a small in-kernel
>> framework that allows a dma-buf to be registered against a given file
>> and returns an object representing a DMA mapping. The actual mapping
>> creation is delegated to the target subsystem (e.g. NVMe). This
>> abstraction centralises request accounting, mapping management, dynamic
>> recreation, etc. The resulting mapping object is passed through the I/O
>> stack via a new iov_iter type.
>>
>> As for the user API, a dma-buf is installed as an io_uring registered
>> buffer for a specific file. Once registered, the buffer can be used by
>> read / write io_uring requests as normal. io_uring will enforce that the
>> buffer is only used with "compatible files", which is for now restricted
>> to the target registration file, but will be expanded in the future.
>> Notably, io_uring is a consumer of the framework rather than a
>> dependency, and the infrastructure can be reused.
>>
> We have been following the series; it's interesting from a couple of angles:
> - IOPS-wise we see a major improvement, especially with an IOMMU
> - The series provides a way to do p2pdma to accelerator memory
>
> Here are a few topics I am looking into specifically:
> - Right now the series uses a PRP list. We need a good way to keep the
> sg_table info around and decide on the fly whether to expose the buffer
> as a PRP list or an SG list, depending on the I/O size.
> - Possibility of further optimization of the new iov_iter type to reduce
> per-I/O cost
There are a bunch of improvements we can make on the NVMe driver
side; just take a look at what Keith was doing in his series ([2] in the
first email in the thread), which looked very exciting (I dropped it for
simplicity). I was planning to take a closer look at optimising the driver
part afterwards, but if someone wants to take it off my hands, it'll
definitely be welcome!
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-03 14:29 ` [LSF/MM/BPF TOPIC] dmabuf backed read/write Pavel Begunkov
2026-02-03 18:07 ` Keith Busch
2026-02-04 15:26 ` Nitesh Shetty
@ 2026-02-05 3:12 ` Ming Lei
2026-02-05 18:13 ` Pavel Begunkov
2026-02-05 17:41 ` Jason Gunthorpe
2026-02-09 10:04 ` Kanchan Joshi
4 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2026-02-05 3:12 UTC (permalink / raw)
To: Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
> Good day everyone,
>
> dma-buf is a powerful abstraction for managing buffers and DMA mappings,
> and there is growing interest in extending it to the read/write path to
> enable device-to-device transfers without bouncing data through system
> memory. I was encouraged to submit it to LSF/MM/BPF as that might be
> useful to mull over details and what capabilities and features people
> may need.
>
> The proposal consists of two parts. The first is a small in-kernel
> framework that allows a dma-buf to be registered against a given file
> and returns an object representing a DMA mapping. The actual mapping
> creation is delegated to the target subsystem (e.g. NVMe). This
> abstraction centralises request accounting, mapping management, dynamic
> recreation, etc. The resulting mapping object is passed through the I/O
> stack via a new iov_iter type.
>
> As for the user API, a dma-buf is installed as an io_uring registered
> buffer for a specific file. Once registered, the buffer can be used by
> read / write io_uring requests as normal. io_uring will enforce that the
> buffer is only used with "compatible files", which is for now restricted
> to the target registration file, but will be expanded in the future.
> Notably, io_uring is a consumer of the framework rather than a
> dependency, and the infrastructure can be reused.
I am interested in this topic.
Given that dma-buf is inherently designed for sharing, I hope the io_uring
interface can be generic enough to cover:
- read/write with the same dma-buf can be submitted to multiple devices
- read/write with a dma-buf can cross stackable devices (device mapper,
  raid, ...)
Thanks,
Ming
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-05 3:12 ` Ming Lei
@ 2026-02-05 18:13 ` Pavel Begunkov
0 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2026-02-05 18:13 UTC (permalink / raw)
To: Ming Lei
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On 2/5/26 03:12, Ming Lei wrote:
> On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
>> Good day everyone,
>>
>> dma-buf is a powerful abstraction for managing buffers and DMA mappings,
>> and there is growing interest in extending it to the read/write path to
>> enable device-to-device transfers without bouncing data through system
>> memory. I was encouraged to submit it to LSF/MM/BPF as that might be
>> useful to mull over details and what capabilities and features people
>> may need.
>>
>> The proposal consists of two parts. The first is a small in-kernel
>> framework that allows a dma-buf to be registered against a given file
>> and returns an object representing a DMA mapping. The actual mapping
>> creation is delegated to the target subsystem (e.g. NVMe). This
>> abstraction centralises request accounting, mapping management, dynamic
>> recreation, etc. The resulting mapping object is passed through the I/O
>> stack via a new iov_iter type.
>>
>> As for the user API, a dma-buf is installed as an io_uring registered
>> buffer for a specific file. Once registered, the buffer can be used by
>> read / write io_uring requests as normal. io_uring will enforce that the
>> buffer is only used with "compatible files", which is for now restricted
>> to the target registration file, but will be expanded in the future.
>> Notably, io_uring is a consumer of the framework rather than a
>> dependency, and the infrastructure can be reused.
>
> I am interested in this topic.
>
> Given that dma-buf is inherently designed for sharing, I hope the io_uring
> interface can be generic enough to cover:
>
> - read/write with the same dma-buf can be submitted to multiple devices
>
> - read/write with a dma-buf can cross stackable devices (device mapper,
>   raid, ...)
Yes, those should be possible to do; IIRC Christoph mentioned it
as well while asking to change the design of v1. The
implementation will need to forward the registration down and create
a dma-buf attachment for each device.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-03 14:29 ` [LSF/MM/BPF TOPIC] dmabuf backed read/write Pavel Begunkov
` (2 preceding siblings ...)
2026-02-05 3:12 ` Ming Lei
@ 2026-02-05 17:41 ` Jason Gunthorpe
2026-02-05 19:06 ` Pavel Begunkov
2026-02-09 10:04 ` Kanchan Joshi
4 siblings, 1 reply; 25+ messages in thread
From: Jason Gunthorpe @ 2026-02-05 17:41 UTC (permalink / raw)
To: Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
> The proposal consists of two parts. The first is a small in-kernel
> framework that allows a dma-buf to be registered against a given file
> and returns an object representing a DMA mapping.
What is this about and why would you need something like this?
The rest makes more sense - pass a DMABUF (or even memfd) to io_uring
and pre-setup the DMA mapping to get dma_addr_t, then directly use
dma_addr_t through the entire block stack right into the eventual
driver.
> Tushar was helping and mentioned he got good numbers for P2P transfers
> compared to bouncing it via RAM.
We can already avoid the bouncing, it seems the main improvements here
are avoiding the DMA map per-io and allowing the use of P2P without
also creating struct page. Meaningful wins for sure.
Jason
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-05 17:41 ` Jason Gunthorpe
@ 2026-02-05 19:06 ` Pavel Begunkov
2026-02-05 23:56 ` Jason Gunthorpe
0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2026-02-05 19:06 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On 2/5/26 17:41, Jason Gunthorpe wrote:
> On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
>
>> The proposal consists of two parts. The first is a small in-kernel
>> framework that allows a dma-buf to be registered against a given file
>> and returns an object representing a DMA mapping.
>
> What is this about and why would you need something like this?
>
> The rest makes more sense - pass a DMABUF (or even memfd) to io_uring
> and pre-setup the DMA mapping to get dma_addr_t, then directly use
> dma_addr_t through the entire block stack right into the eventual
> driver.
That's more or less what I tried to do in v1, but 1) people didn't like
the idea of passing raw dma addresses directly, and having it wrapped
into a black box gives more flexibility like potentially supporting
multi-device filesystems. And 2) dma-buf folks want dynamic attachments,
and it makes it quite a bit more complicated when you might be asked to
shoot down DMA mappings at any moment, so I'm isolating all that
into something that can be reused.
>> Tushar was helping and mentioned he got good numbers for P2P transfers
>> compared to bouncing it via RAM.
>
> We can already avoid the bouncing, it seems the main improvements here
> are avoiding the DMA map per-io and allowing the use of P2P without
> also creating struct page. Meaningful wins for sure.
Yes, and it should probably be nicer for frameworks that already
expose dma-bufs.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-05 19:06 ` Pavel Begunkov
@ 2026-02-05 23:56 ` Jason Gunthorpe
2026-02-06 15:08 ` Pavel Begunkov
0 siblings, 1 reply; 25+ messages in thread
From: Jason Gunthorpe @ 2026-02-05 23:56 UTC (permalink / raw)
To: Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On Thu, Feb 05, 2026 at 07:06:03PM +0000, Pavel Begunkov wrote:
> On 2/5/26 17:41, Jason Gunthorpe wrote:
> > On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
> >
> > > The proposal consists of two parts. The first is a small in-kernel
> > > framework that allows a dma-buf to be registered against a given file
> > > and returns an object representing a DMA mapping.
> >
> > What is this about and why would you need something like this?
> >
> > The rest makes more sense - pass a DMABUF (or even memfd) to io_uring
> > and pre-setup the DMA mapping to get dma_addr_t, then directly use
> > dma_addr_t through the entire block stack right into the eventual
> > driver.
>
> That's more or less what I tried to do in v1, but 1) people didn't like
> the idea of passing raw dma addresses directly, and having it wrapped
> into a black box gives more flexibility like potentially supporting
> multi-device filesystems.
Ok.. but what does that have to do with a user space visible file?
> 2) dma-buf folks want dynamic attachments,
> and it makes it quite a bit more complicated when you might be asked to
> shoot down DMA mappings at any moment, so I'm isolating all that
> into something that can be reused.
IMHO there is probably nothing really reusable here. The logic to
fence any usage is entirely unique to whoever is using it, and the
locking tends to be really hard.
You should review the email threads linked to this patch and all its
prior versions as the expected importer behavior for pinned dmabufs is
not well understood.
https://lore.kernel.org/all/20260131-dmabuf-revoke-v7-0-463d956bd527@nvidia.com/
> > > Tushar was helping and mentioned he got good numbers for P2P transfers
> > > compared to bouncing it via RAM.
> >
> > We can already avoid the bouncing, it seems the main improvements here
> > are avoiding the DMA map per-io and allowing the use of P2P without
> > also creating struct page. Meaningful wins for sure.
>
> Yes, and it should probably be nicer for frameworks that already
> expose dma-bufs.
I'm not sure what this means?
Jason
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-05 23:56 ` Jason Gunthorpe
@ 2026-02-06 15:08 ` Pavel Begunkov
2026-02-06 15:20 ` Jason Gunthorpe
0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2026-02-06 15:08 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On 2/5/26 23:56, Jason Gunthorpe wrote:
> On Thu, Feb 05, 2026 at 07:06:03PM +0000, Pavel Begunkov wrote:
>> On 2/5/26 17:41, Jason Gunthorpe wrote:
>>> On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
>>>
>>>> The proposal consists of two parts. The first is a small in-kernel
>>>> framework that allows a dma-buf to be registered against a given file
>>>> and returns an object representing a DMA mapping.
>>>
>>> What is this about and why would you need something like this?
>>>
>>> The rest makes more sense - pass a DMABUF (or even memfd) to io_uring
>>> and pre-setup the DMA mapping to get dma_addr_t, then directly use
>>> dma_addr_t through the entire block stack right into the eventual
>>> driver.
>>
>> That's more or less what I tried to do in v1, but 1) people didn't like
>> the idea of passing raw dma addresses directly, and having it wrapped
>> into a black box gives more flexibility like potentially supporting
>> multi-device filesystems.
>
> Ok.. but what does that have to do with a user space visible file?
If you're referring to registration taking a file, it's used to forward
this registration to the right driver, which knows about devices and can
create dma-buf attachment[s]. The abstraction users get is not just a
buffer but rather a buffer registered for a "subsystem" represented by
the passed file. With nvme raw bdev as the only importer in the patch set,
it simply converges to "registered for the file", but the notion will
need to be expanded later, e.g. to accommodate filesystems.
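A rough sketch of that flow, where only the dma-buf/dma-resv calls are the
real kernel API; the mapping object and the blk_* helpers are hypothetical
stand-ins for the framework, not code from the series:

    #include <linux/dma-buf.h>
    #include <linux/dma-resv.h>

    static void blk_dmavec_move_notify(struct dma_buf_attachment *attach);

    static const struct dma_buf_attach_ops blk_attach_ops = {
            .allow_peer2peer = true,
            .move_notify     = blk_dmavec_move_notify,
    };

    /* Illustrative: how a block importer might service a registration
     * forwarded from io_uring (error handling elided). */
    static struct blk_dma_mapping *blk_register_dmabuf(struct device *dma_dev,
                                                       int dmabuf_fd)
    {
            struct blk_dma_mapping *map = blk_dma_mapping_alloc(); /* hypothetical */
            struct dma_buf *dmabuf = dma_buf_get(dmabuf_fd);
            struct dma_buf_attachment *attach;
            struct sg_table *sgt;

            attach = dma_buf_dynamic_attach(dmabuf, dma_dev, &blk_attach_ops, map);

            /* Dynamic importers map under the dma-buf's reservation lock. */
            dma_resv_lock(dmabuf->resv, NULL);
            sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
            dma_resv_unlock(dmabuf->resv);

            /* Wrap attachment + mapping into the opaque object io_uring holds. */
            blk_dma_mapping_init(map, attach, sgt); /* hypothetical */
            return map;
    }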
>> 2) dma-buf folks want dynamic attachments,
>> and it makes it quite a bit more complicated when you might be asked to
>> shoot down DMA mappings at any moment, so I'm isolating all that
>> into something that can be reused.
>
> IMHO there is probably nothing really reusable here. The logic to
> fence any usage is entirely unique to whoever is using it, and the
> locking tends to be really hard.
>
> You should review the email threads linked to this patch and all its
> prior versions as the expected importer behavior for pinned dmabufs is
> not well understood.
I'm not pinning it (i.e. no dma_buf_pin()); it should be a proper
dynamic implementation. In short, it adds a fence on move_notify
and signals it once all requests using the mapping are gone. New
requests will then try to create a new mapping (and wait for fences).
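A minimal sketch of that shape, reusing the hypothetical names from the
sketch above; only the dma-buf/dma-resv calls are real API, and fence
allocation, slot reservation and request refcounting are elided:

    static void blk_dmavec_move_notify(struct dma_buf_attachment *attach)
    {
            struct blk_dma_mapping *map = attach->importer_priv;

            /* Called with the dma-buf's reservation lock already held. */
            WRITE_ONCE(map->stale, true);   /* new I/O must re-create the mapping */

            /*
             * Publish a fence that signals once all in-flight requests still
             * using the old mapping have completed; the exporter waits on it
             * before actually moving the buffer.
             */
            dma_resv_add_fence(attach->dmabuf->resv, map->quiesce_fence,
                               DMA_RESV_USAGE_BOOKKEEP);
    }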
> https://lore.kernel.org/all/20260131-dmabuf-revoke-v7-0-463d956bd527@nvidia.com/
>
>>>> Tushar was helping and mentioned he got good numbers for P2P transfers
>>>> compared to bouncing it via RAM.
>>>
>>> We can already avoid the bouncing, it seems the main improvements here
>>> are avoiding the DMA map per-io and allowing the use of P2P without
>>> also creating struct page. Meaningful wins for sure.
>>
>> Yes, and it should probably be nicer for frameworks that already
>> expose dma-bufs.
>
> I'm not sure what this means?
I'm saying that when a user app can easily get or already has a
dma-buf fd, it should be easier to just use it instead of finding
its way to FOLL_PCI_P2PDMA. I'm actually curious, is there a way
to somehow create a MEMORY_DEVICE_PCI_P2PDMA mapping out of a random
dma-buf? From a quick glance, I only see nvme cmb and some accelerator
being registered to P2PDMA, but maybe I'm missing something.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-06 15:08 ` Pavel Begunkov
@ 2026-02-06 15:20 ` Jason Gunthorpe
2026-02-06 17:57 ` Pavel Begunkov
2026-02-09 9:54 ` Kanchan Joshi
0 siblings, 2 replies; 25+ messages in thread
From: Jason Gunthorpe @ 2026-02-06 15:20 UTC (permalink / raw)
To: Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On Fri, Feb 06, 2026 at 03:08:25PM +0000, Pavel Begunkov wrote:
> On 2/5/26 23:56, Jason Gunthorpe wrote:
> > On Thu, Feb 05, 2026 at 07:06:03PM +0000, Pavel Begunkov wrote:
> > > On 2/5/26 17:41, Jason Gunthorpe wrote:
> > > > On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
> > > >
> > > > > The proposal consists of two parts. The first is a small in-kernel
> > > > > framework that allows a dma-buf to be registered against a given file
> > > > > and returns an object representing a DMA mapping.
> > > >
> > > > What is this about and why would you need something like this?
> > > >
> > > > The rest makes more sense - pass a DMABUF (or even memfd) to io_uring
> > > > and pre-setup the DMA mapping to get dma_addr_t, then directly use
> > > > dma_addr_t through the entire block stack right into the eventual
> > > > driver.
> > >
> > > That's more or less what I tried to do in v1, but 1) people didn't like
> > > the idea of passing raw dma addresses directly, and having it wrapped
> > > into a black box gives more flexibility like potentially supporting
> > > multi-device filesystems.
> >
> > Ok.. but what does that have to do with a user space visible file?
>
> If you're referring to registration taking a file, it's used to forward
> this registration to the right driver, which knows about devices and can
> create dma-buf attachment[s]. The abstraction users get is not just a
> buffer but rather a buffer registered for a "subsystem" represented by
> the passed file. With nvme raw bdev as the only importer in the patch set,
> it simply converges to "registered for the file", but the notion will
> need to be expanded later, e.g. to accommodate filesystems.
Sounds completely goofy to me. A wrapper around DMABUF that lets you
attach to DMABUFs? Huh?
I feel like io uring should be dealing with this internally somehow not
creating more and more uapi..
The longer term goal has been to get page * out of the io stack and
start using phys_addr_t, if we could pass the DMABUF's MMIO as a
phys_addr_t around the IO stack then we only need to close the gap of
getting the p2p provider into the final DMA mapping.
A lot of this has improved in the past few cycles, where the main issue
now is carrying the provider and phys_addr_t through the I/O path to the
nvme driver, vs. when you started this and even that fundamental
infrastructure was missing.
> > > > > Tushar was helping and mentioned he got good numbers for P2P transfers
> > > > > compared to bouncing it via RAM.
> > > >
> > > > We can already avoid the bouncing, it seems the main improvements here
> > > > are avoiding the DMA map per-io and allowing the use of P2P without
> > > > also creating struct page. Meaningful wins for sure.
> > >
> > > Yes, and it should probably be nicer for frameworks that already
> > > expose dma-bufs.
> >
> > I'm not sure what this means?
>
> I'm saying that when a user app can easily get or already has a
> dma-buf fd, it should be easier to just use it instead of finding
> its way to FOLL_PCI_P2PDMA.
But that all exists already and this proposal does nothing to improve
it..
> I'm actually curious, is there a way to somehow create a
> MEMORY_DEVICE_PCI_P2PDMA mapping out of a random dma-buf?
No. The driver owning the P2P MMIO has to do this during its probe and
then it has to provide a VMA with normal pages so GUP works. This is
usually not hard on the exporting driver side.
It costs some memory but then everything works naturally in the IO
stack.
Your project is interesting and would be a nice improvement, but I
also don't entirely understand why you are bothering when the P2PDMA
solution is already fully there ready to go... Is something preventing
you from creating the P2PDMA pages for your exporting driver?
Jason
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-06 15:20 ` Jason Gunthorpe
@ 2026-02-06 17:57 ` Pavel Begunkov
2026-02-06 18:37 ` Jason Gunthorpe
2026-02-09 9:54 ` Kanchan Joshi
1 sibling, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2026-02-06 17:57 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On 2/6/26 15:20, Jason Gunthorpe wrote:
> On Fri, Feb 06, 2026 at 03:08:25PM +0000, Pavel Begunkov wrote:
>> On 2/5/26 23:56, Jason Gunthorpe wrote:
>>> On Thu, Feb 05, 2026 at 07:06:03PM +0000, Pavel Begunkov wrote:
>>>> On 2/5/26 17:41, Jason Gunthorpe wrote:
>>>>> On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
>>>>>
>>>>>> The proposal consists of two parts. The first is a small in-kernel
>>>>>> framework that allows a dma-buf to be registered against a given file
>>>>>> and returns an object representing a DMA mapping.
>>>>>
>>>>> What is this about and why would you need something like this?
>>>>>
>>>>> The rest makes more sense - pass a DMABUF (or even memfd) to io_uring
>>>>> and pre-setup the DMA mapping to get dma_addr_t, then directly use
>>>>> dma_addr_t through the entire block stack right into the eventual
>>>>> driver.
>>>>
>>>> That's more or less what I tried to do in v1, but 1) people didn't like
>>>> the idea of passing raw dma addresses directly, and having it wrapped
>>>> into a black box gives more flexibility like potentially supporting
>>>> multi-device filesystems.
>>>
>>> Ok.. but what does that have to do with a user space visible file?
>>
>> If you're referring to registration taking a file, it's used to forward
>> this registration to the right driver, which knows about devices and can
>> create dma-buf attachment[s]. The abstraction users get is not just a
>> buffer but rather a buffer registered for a "subsystem" represented by
>> the passed file. With nvme raw bdev as the only importer in the patch set,
>> it's simply converges to "registered for the file", but the notion will
>> need to be expanded later, e.g. to accommodate filesystems.
>
> Sounds completely goofy to me.
Hmm... the discussion is not going to be productive, is it?
> A wrapper around DMABUF that lets you
> attach to DMABUFs? Huh?
I have no idea what you mean and what "attach to DMABUFs" is.
dma-buf is passed to the driver, which attaches it (as in
calls dma_buf_dynamic_attach()).
> I feel like io uring should be dealing with this internally somehow not
> creating more and more uapi..
uapi changes are already minimal and outside of the IO path.
> The longer term goal has been to get page * out of the io stack and
> start using phys_addr_t, if we could pass the DMABUF's MMIO as a
Except that I already tried passing device-mapped addresses directly,
and it was rejected because it wouldn't be able to handle more complicated
cases like multi-device filesystems, and probably for other reasons.
Or would it be mapping it for each IO?
> phys_addr_t around the IO stack then we only need to close the gap of
> getting the p2p provider into the final DMA mapping.
>
> A lot of this has improved in the past few cycles, where the main issue
> now is carrying the provider and phys_addr_t through the I/O path to the
> nvme driver, vs. when you started this and even that fundamental
> infrastructure was missing.
>
>>>>>> Tushar was helping and mentioned he got good numbers for P2P transfers
>>>>>> compared to bouncing it via RAM.
>>>>>
>>>>> We can already avoid the bouncing, it seems the main improvements here
>>>>> are avoiding the DMA map per-io and allowing the use of P2P without
>>>>> also creating struct page. Meaningful wins for sure.
>>>>
>>>> Yes, and it should probably be nicer for frameworks that already
>>>> expose dma-bufs.
>>>
>>> I'm not sure what this means?
>>
>> I'm saying that when a user app can easily get or already has a
>> dma-buf fd, it should be easier to just use it instead of finding
>> its way to FOLL_PCI_P2PDMA.
>
> But that all exists already and this proposal does nothing to improve
> it..
dma-buf already exists as well, and I'm ashamed to admit,
but I don't know how a user program can read into / write from
memory provided by dma-buf.
>> I'm actually curious, is there a way to somehow create a
>> MEMORY_DEVICE_PCI_P2PDMA mapping out of a random dma-buf?
>
> No. The driver owning the P2P MMIO has to do this during its probe and
> then it has to provide a VMA with normal pages so GUP works. This is
> usually not hard on the exporting driver side.
>
> It costs some memory but then everything works naturally in the IO
> stack.
>
> Your project is interesting and would be a nice improvement, but I
> also don't entirely understand why you are bothering when the P2PDMA
> solution is already fully there ready to go... Is something preventing
> you from creating the P2PDMA pages for your exporting driver?
I'm not doing it for any particular driver but rather trying
to reuse what's already there, i.e. good coverage of existing
dma-buf exporters and the infrastructure dma-buf provides, e.g.
move_notify. And trying to do that efficiently: avoiding GUP
(which io_uring can already do for normal memory), keeping
long-term mappings (modulo move_notify), and so on. That includes
optimising the cost of system memory reads/writes with an IOMMU.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-06 17:57 ` Pavel Begunkov
@ 2026-02-06 18:37 ` Jason Gunthorpe
2026-02-09 10:59 ` Pavel Begunkov
0 siblings, 1 reply; 25+ messages in thread
From: Jason Gunthorpe @ 2026-02-06 18:37 UTC (permalink / raw)
To: Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On Fri, Feb 06, 2026 at 05:57:14PM +0000, Pavel Begunkov wrote:
> On 2/6/26 15:20, Jason Gunthorpe wrote:
> > On Fri, Feb 06, 2026 at 03:08:25PM +0000, Pavel Begunkov wrote:
> > > On 2/5/26 23:56, Jason Gunthorpe wrote:
> > > > On Thu, Feb 05, 2026 at 07:06:03PM +0000, Pavel Begunkov wrote:
> > > > > On 2/5/26 17:41, Jason Gunthorpe wrote:
> > > > > > On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
> > > > > >
> > > > > > > The proposal consists of two parts. The first is a small in-kernel
> > > > > > > framework that allows a dma-buf to be registered against a given file
> > > > > > > and returns an object representing a DMA mapping.
> > > > > >
> > > > > > What is this about and why would you need something like this?
> > > > > >
> > > > > > The rest makes more sense - pass a DMABUF (or even memfd) to io_uring
> > > > > > and pre-setup the DMA mapping to get dma_addr_t, then directly use
> > > > > > dma_addr_t through the entire block stack right into the eventual
> > > > > > driver.
> > > > >
> > > > > That's more or less what I tried to do in v1, but 1) people didn't like
> > > > > the idea of passing raw dma addresses directly, and having it wrapped
> > > > > into a black box gives more flexibility like potentially supporting
> > > > > multi-device filesystems.
> > > >
> > > > Ok.. but what does that have to do with a user space visible file?
> > >
> > > If you're referring to registration taking a file, it's used to forward
> > > this registration to the right driver, which knows about devices and can
> > > create dma-buf attachment[s]. The abstraction users get is not just a
> > > buffer but rather a buffer registered for a "subsystem" represented by
> > > the passed file. With nvme raw bdev as the only importer in the patch set,
> > > it's simply converges to "registered for the file", but the notion will
> > > need to be expanded later, e.g. to accommodate filesystems.
> >
> > Sounds completely goofy to me.
>
> Hmm... the discussion is not going to be productive, is it?
Well, this FD thing is very confounding and, sorry I don't see much
logic to this design. I understand the problems you are explaining but
not this solution.
> Or would it be mapping it for each IO?
mapping for each IO could be possible with a phys_addr_t path.
> dma-buf already exists as well, and I'm ashamed to admit,
> but I don't know how a user program can read into / write from
> memory provided by dma-buf.
You can mmap them. It can even be used with read() write() system
calls if the dma buf exporter is using P2P pages.
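A userspace sketch of that path, assuming an exporter whose dma-buf
implements mmap and is backed by P2PDMA-capable pages; the device path and
helper are illustrative, not from the thread:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Read from an NVMe block device straight into a dma-buf mapping. */
    static int read_into_dmabuf(int dmabuf_fd, size_t len)
    {
            /* len must respect O_DIRECT alignment (logical block size). */
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                             dmabuf_fd, 0);
            if (buf == MAP_FAILED)
                    return -1;

            int nvme_fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
            if (nvme_fd < 0) {
                    munmap(buf, len);
                    return -1;
            }

            /* GUP on the O_DIRECT path resolves the P2P pages behind buf. */
            ssize_t ret = pread(nvme_fd, buf, len, 0);

            close(nvme_fd);
            munmap(buf, len);
            return ret < 0 ? -1 : 0;
    }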
> I'm not doing it for any particular driver but rather trying
> to reuse what's already there, i.e. good coverage of existing
> dma-buf exporters and the infrastructure dma-buf provides, e.g.
> move_notify. And trying to do that efficiently: avoiding GUP
> (which io_uring can already do for normal memory), keeping
> long-term mappings (modulo move_notify), and so on. That includes
> optimising the cost of system memory reads/writes with an IOMMU.
I would suggest leading with these reasons to frame why you are trying
to do this. It seems the main motivation is to create a pre-registered
and pre-IOMMU-mapped io uring pool of MMIO memory, and
indeed you cannot do that with the existing mechanisms at all.
As a step forward I could imagine having a DMABUF handing out P2P
pages and allowing io uring to "register" it complete with move
notify. This would get you half the way there and doesn't require
major changes to the block stack since you can still be pushing
unmapped struct page backed addresses and everything will work
fine. It is a good way to sidestep the FOLL_LONGTERM issue.
Pre-iommu-mapping the pool seems like an orthogonal project as it
applies to everything coming from pre-registered io uring buffers,
even normal cpu memory. You could have a next step of pre-mapping the
P2P pages and CPU pages equally.
Finally you could try a project to remove the P2P page requirement for
cases that use the pre-iommu-mapping flow.
It would probably be helpful not to mixup those three things..
Jason
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-06 18:37 ` Jason Gunthorpe
@ 2026-02-09 10:59 ` Pavel Begunkov
2026-02-09 13:06 ` Jason Gunthorpe
0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2026-02-09 10:59 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On 2/6/26 18:37, Jason Gunthorpe wrote:
> On Fri, Feb 06, 2026 at 05:57:14PM +0000, Pavel Begunkov wrote:
>> On 2/6/26 15:20, Jason Gunthorpe wrote:
>>> On Fri, Feb 06, 2026 at 03:08:25PM +0000, Pavel Begunkov wrote:
>>>> On 2/5/26 23:56, Jason Gunthorpe wrote:
>>>>> On Thu, Feb 05, 2026 at 07:06:03PM +0000, Pavel Begunkov wrote:
>>>>>> On 2/5/26 17:41, Jason Gunthorpe wrote:
>>>>>>> On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
>>>>>>>
>>>>>>>> The proposal consists of two parts. The first is a small in-kernel
>>>>>>>> framework that allows a dma-buf to be registered against a given file
>>>>>>>> and returns an object representing a DMA mapping.
>>>>>>>
>>>>>>> What is this about and why would you need something like this?
>>>>>>>
>>>>>>> The rest makes more sense - pass a DMABUF (or even memfd) to io_uring
>>>>>>> and pre-setup the DMA mapping to get dma_addr_t, then directly use
>>>>>>> dma_addr_t through the entire block stack right into the eventual
>>>>>>> driver.
>>>>>>
>>>>>> That's more or less what I tried to do in v1, but 1) people didn't like
>>>>>> the idea of passing raw dma addresses directly, and having it wrapped
>>>>>> into a black box gives more flexibility like potentially supporting
>>>>>> multi-device filesystems.
>>>>>
>>>>> Ok.. but what does that have to do with a user space visible file?
>>>>
>>>> If you're referring to registration taking a file, it's used to forward
>>>> this registration to the right driver, which knows about devices and can
>>>> create dma-buf attachment[s]. The abstraction users get is not just a
>>>> buffer but rather a buffer registered for a "subsystem" represented by
>>>> the passed file. With nvme raw bdev as the only importer in the patch set,
>>>> it's simply converges to "registered for the file", but the notion will
>>>> need to be expanded later, e.g. to accommodate filesystems.
>>>
>>> Sounds completely goofy to me.
>>
>> Hmm... the discussion is not going to be productive, is it?
>
> Well, this FD thing is very confounding and, sorry I don't see much
> logic to this design. I understand the problems you are explaining but
> not this solution.
We won't agree on this one. If an abstraction linking two entities
makes sense and is useful, it should be fine to have that. It's
functionally the same as well, just register the buffer multiple times.
In fact, I could get rid of the fd argument, and keep and dynamically
update a map in io_uring for all files / subsystems the registration
can be used with, but all such generality has additional cost while
not bringing anything new to the table.
>> Or would it be mapping it for each IO?
>
> mapping for each IO could be possible with a phys_addr_t path.
>
>> dma-buf already exists as well, and I'm ashamed to admit,
>> but I don't know how a user program can read into / write from
>> memory provided by dma-buf.
>
> You can mmap them. It can even be used with read() write() system
> calls if the dma buf exporter is using P2P pages.
>
>> I'm not doing it for any particular driver but rather trying
>> to reuse what's already there, i.e. good coverage of existing
>> dma-buf exporters and the infrastructure dma-buf provides, e.g.
>> move_notify. And trying to do that efficiently: avoiding GUP
>> (which io_uring can already do for normal memory), keeping
>> long-term mappings (modulo move_notify), and so on. That includes
>> optimising the cost of system memory reads/writes with an IOMMU.
>
> I would suggest leading with these reasons to frame why you are trying
> to do this. It seems the main motivation is to create a pre-registered
> and pre-IOMMU-mapped io uring pool of MMIO memory, and
> indeed you cannot do that with the existing mechanisms at all.
My bad, I didn't try to go into "why" well enough. I want to be
able to use dma-buf as a first class citizen in the path, i.e.
keeping a clear notion of a buffer for the user and being able to
interact with any existing exporter, all that while maximising
performance, which definitely includes pre-mapping memory.
> As a step forward I could imagine having a DMABUF handing out P2P
> pages and allowing io uring to "register" it complete with move
Forcing dma-buf to have pages is a big step back, IMHO
> notify. This would get you half the way there and doesn't require
> major changes to the block stack since you can still be pushing
> unmapped struct page backed addresses and everything will work
> fine. It is a good way to sidestep the FOLL_LONGTERM issue.
>
> Pre-iommu-mapping the pool seems like an orthogonal project as it
> applies to everything coming from pre-registered io uring buffers,
> even normal cpu memory. You could have a next step of pre-mapping the
> P2P pages and CPU pages equally.
It was already tried for normal user memory (not by me), but
the verdict was that it should be dma-buf based.
> Finally you could try a project to remove the P2P page requirement for
> cases that use the pre-iommu-mapping flow.
>
> It would probably be helpful not to mixup those three things..
I agree that pre-mapping and p2p can in theory be decoupled here,
but in practice dma-buf already provides the right abstractions and
infrastructure to cover both in one go. I really don't think it's a
good idea to re-engineer parts of dma-buf that are responsible for
interaction with the importer device.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-09 10:59 ` Pavel Begunkov
@ 2026-02-09 13:06 ` Jason Gunthorpe
2026-02-09 13:09 ` Christian König
0 siblings, 1 reply; 25+ messages in thread
From: Jason Gunthorpe @ 2026-02-09 13:06 UTC (permalink / raw)
To: Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On Mon, Feb 09, 2026 at 10:59:53AM +0000, Pavel Begunkov wrote:
> > As a step forward I could imagine having a DMABUF handing out P2P
> > pages and allowing io uring to "register" it complete with move
>
> Forcing dma-buf to have pages is a big step back, IMHO
Naw, some drivers already have them anyhow, and we are already looking
at optional ways to allow a very limited select group of importers to
access the underlying physical.
It is not a big leap from there to say io_uring pre-registration is a
special importer that only interworks with drivers providing P2P
pages.
It could immediately address everything except pre-registration. And
do you really care about pre-registration? Why? Running performance
workloads with the iommu doing a DMA mapping is pretty unusual.
> > Pre-iommu-mapping the pool seems like an orthogonal project as it
> > applies to everything coming from pre-registered io uring buffers,
> > even normal cpu memory. You could have a next step of pre-mapping the
> > P2P pages and CPU pages equally.
>
> It was already tried for normal user memory (not by me), but
> the verdict was that it should be dma-buf based.
I'm not sure how DMA-buf helps anything here. It is the io uring layer
that should be interacting with DMA-buf, the lower level stuff
shouldn't touch it.
Jason
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-09 13:06 ` Jason Gunthorpe
@ 2026-02-09 13:09 ` Christian König
2026-02-09 13:24 ` Jason Gunthorpe
0 siblings, 1 reply; 25+ messages in thread
From: Christian König @ 2026-02-09 13:09 UTC (permalink / raw)
To: Jason Gunthorpe, Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christoph Hellwig, Kanchan Joshi, Anuj Gupta,
Nitesh Shetty, lsf-pc@lists.linux-foundation.org
On 2/9/26 14:06, Jason Gunthorpe wrote:
> On Mon, Feb 09, 2026 at 10:59:53AM +0000, Pavel Begunkov wrote:
>
>>> As a step forward I could imagine having a DMABUF handing out P2P
>>> pages and allowing io uring to "register" it complete with move
>>
>> Forcing dma-buf to have pages is a big step back, IMHO
>
> Naw, some drivers already have them anyhow, and we are already looking
> at optional ways to allow a very limited select group of importers to
> access the underlying physical.
That is just between two specific exporters/importers and certainly won't be allowed as a common interface.
> It is not a big leap from there to say io_uring pre-registration is a
> special importer that only interworks with drivers providing P2P
> pages.
Complete NAK from my side to that approach.
We have exercised and discussed this in absolute detail and it is not going to fly anywhere.
The struct page based approach is fundamentally incompatible with driver managed exporters.
Regards,
Christian.
>
> It could immediately address everything except pre-registration. And
> do you really care about pre-registration? Why? Running performance
> workloads with the iommu doing a DMA mapping is pretty unusual.
>
>>> Pre-iommu-mapping the pool seems like an orthogonal project as it
>>> applies to everything coming from pre-registered io uring buffers,
>>> even normal cpu memory. You could have a next step of pre-mapping the
>>> P2P pages and CPU pages equally.
>>
>> It was already tried for normal user memory (not by me), but
>> the verdict was that it should be dma-buf based.
>
> I'm not sure how DMA-buf helps anything here. It is the io uring layer
> that should be interacting with DMA-buf, the lower level stuff
> shouldn't touch it.
>
> Jason
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-09 13:09 ` Christian König
@ 2026-02-09 13:24 ` Jason Gunthorpe
2026-02-09 13:55 ` Christian König
0 siblings, 1 reply; 25+ messages in thread
From: Jason Gunthorpe @ 2026-02-09 13:24 UTC (permalink / raw)
To: Christian König
Cc: Pavel Begunkov, linux-block, io-uring,
linux-nvme@lists.infradead.org, Gohad, Tushar, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On Mon, Feb 09, 2026 at 02:09:24PM +0100, Christian König wrote:
> We have exercised and discussed this in absolute detail and it is
> not going to fly anywhere.
Yes, I understand your concerns with struct page from past abuses.
> The struct page based approach is fundamentally incompatible with
> driver managed exporters.
The *general* struct page system is incompatible - but that is not
what I'm suggesting. I'm suggesting io_uring, and only io_uring, could
use this, with it fully implementing all the lifecycle rules that are
needed, including move_notify and fences so that the driver managed
exporter has no issue.
Reworking the block stack to not rely on page is also a good path, but
probably a lot harder. :\
Jason
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-09 13:24 ` Jason Gunthorpe
@ 2026-02-09 13:55 ` Christian König
2026-02-09 14:01 ` Jason Gunthorpe
0 siblings, 1 reply; 25+ messages in thread
From: Christian König @ 2026-02-09 13:55 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pavel Begunkov, linux-block, io-uring,
linux-nvme@lists.infradead.org, Gohad, Tushar, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On 2/9/26 14:24, Jason Gunthorpe wrote:
> On Mon, Feb 09, 2026 at 02:09:24PM +0100, Christian König wrote:
>
>> We have exercised and discussed this in absolute detail and it is
>> not going to fly anywhere.
>
> Yes, I understand you concerns with struct page from past abuses.
>
>> The struct page based approach is fundamentally incompatible with
>> driver managed exporters.
>
> The *general* struct page system is incompatible - but that is not
> what I'm suggesting. I'm suggesting io_uring, and only io_uring, could
> use this, with it fully implementing all the lifecycle rules that are
> needed, including move_notify and fences so that the driver managed
> exporter has no issue.
Yeah, that is basically what everybody currently does with out of tree code.
The problem is that this requires internal knowledge of the exported buffer and how the I/O path is using it.
So to generalize this for upstreaming it would need something like a giant whitelist of exporter/importer combinations which are known to work together and not crash the kernel in surprising and hard to track down ways.
I had this conversation multiple times with both AMD-internal and external people, and just using an exporter-specific io_uring (or whatever approach the exporter uses) implementation is simpler.
> Reworking the block stack to not rely on page is also a good path, but
> probably a lot harder. :\
Yeah, that would be really really nice to have and the latest patches for extending the struct file stuff actually looked quite promising.
Christian.
>
> Jason
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-09 13:55 ` Christian König
@ 2026-02-09 14:01 ` Jason Gunthorpe
0 siblings, 0 replies; 25+ messages in thread
From: Jason Gunthorpe @ 2026-02-09 14:01 UTC (permalink / raw)
To: Christian König
Cc: Pavel Begunkov, linux-block, io-uring,
linux-nvme@lists.infradead.org, Gohad, Tushar, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On Mon, Feb 09, 2026 at 02:55:26PM +0100, Christian König wrote:
> Yeah, that is basically what everybody currently does with out of tree code.
:\
> The problem is that this requires internal knowledge of the exported
> buffer and how the I/O path is using it.
Well here I am saying the buffer has a P2P struct page so that is all
you actually need to know. It just follows the existing proven path in
the IO stack.
> So to generalize this for upstreaming it would need something like a
> giant whitelist of exporter/importer combinations which are known to
> work together and not crash the kernel in surprising and hard to
> track down ways.
Well I think the mapping type proposal goes a long way toward dealing
with this problem. Let's shelve the discussion until after we discuss
that with patches.
> > Reworking the block stack to not rely on page is also a good path, but
> > probably a lot harder. :\
>
> Yeah, that would be really really nice to have and the latest
> patches for extending the struct file stuff actually looked quite
> promising.
Yeah, I thought the dma token through the IO stack looked very
interesting too. I hope it eventually succeeds!
Jason
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-06 15:20 ` Jason Gunthorpe
2026-02-06 17:57 ` Pavel Begunkov
@ 2026-02-09 9:54 ` Kanchan Joshi
2026-02-09 10:13 ` Christian König
1 sibling, 1 reply; 25+ messages in thread
From: Kanchan Joshi @ 2026-02-09 9:54 UTC (permalink / raw)
To: Jason Gunthorpe, Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Christian König, Christoph Hellwig, Anuj Gupta,
Nitesh Shetty, lsf-pc@lists.linux-foundation.org
On 2/6/2026 8:50 PM, Jason Gunthorpe wrote:
>> I'm actually curious, is there a way to somehow create a
>> MEMORY_DEVICE_PCI_P2PDMA mapping out of a random dma-buf?
> No. The driver owning the P2P MMIO has to do this during its probe and
> then it has to provide a VMA with normal pages so GUP works. This is
> usually not hard on the exporting driver side.
>
> It costs some memory but then everything works naturally in the IO
> stack.
>
> Your project is interesting and would be a nice improvement, but I
> also don't entirely understand why you are bothering when the P2PDMA
> solution is already fully there ready to go... Is something preventing
> you from creating the P2PDMA pages for your exporting driver?
The exporter driver may have opted out of the P2PDMA struct page path
(the MEMORY_DEVICE_PCI_P2PDMA route). This may be a design choice to avoid
the system RAM overhead.
As an example, for an H100 GPU with 80 GB of VRAM and a 4 KB system page
size: we would need ~20 million entries, and with each 'struct page' being
64 bytes in size, this would amount to an extra ~1.2 GB of RAM tax.
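(For reference, the arithmetic: 80 GB / 4 KiB gives roughly 20 million
pages, and 20M * 64 B comes to about 1.2-1.3 GB of struct page metadata.)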
At this point, the series does not introduce any change on the
exporter side and that is a good thing. No?
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-09 9:54 ` Kanchan Joshi
@ 2026-02-09 10:13 ` Christian König
2026-02-09 12:54 ` Jason Gunthorpe
0 siblings, 1 reply; 25+ messages in thread
From: Christian König @ 2026-02-09 10:13 UTC (permalink / raw)
To: Kanchan Joshi, Jason Gunthorpe, Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Christoph Hellwig, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On 2/9/26 10:54, Kanchan Joshi wrote:
> On 2/6/2026 8:50 PM, Jason Gunthorpe wrote:
>>> I'm actually curious, is there a way to somehow create a
>>> MEMORY_DEVICE_PCI_P2PDMA mapping out of a random dma-buf?
>> No. The driver owning the P2P MMIO has to do this during its probe and
>> then it has to provide a VMA with normal pages so GUP works. This is
>> usually not hard on the exporting driver side.
>>
>> It costs some memory but then everything works naturally in the IO
>> stack.
>>
>> Your project is interesting and would be a nice improvement, but I
>> also don't entirely understand why you are bothering when the P2PDMA
>> solution is already fully there ready to go... Is something preventing
>> you from creating the P2PDMA pages for your exporting driver?
>
> The exporter driver may have opted out of the P2PDMA struct page path
> (the MEMORY_DEVICE_PCI_P2PDMA route). This may be a design choice to avoid
> the system RAM overhead.
> As an example, for an H100 GPU with 80 GB of VRAM and a 4 KB system page
> size: we would need ~20 million entries, and with each 'struct page' being
> 64 bytes in size, this would amount to an extra ~1.2 GB of RAM tax.
That is a good point, but the killer argument for DMA-buf to not use pages (or folios) is that the exported resource is sometimes not even memory.
For example, we have MMIO doorbells which are exported between devices to signal to firmware that a certain event is done and follow-up processing can start.
Using the struct page based approach to manage the lifetime of such exports would completely break such use cases.
> At this point, the series does not introduce any change on the
> exporter side and that is a good thing. No?
I need something like a free month to wrap my head around all that stuff again, but from the DMA-buf side the last patch set I've seen looked pretty straightforward.
So yes, that no exporter or framework changes are necessary is definitely a good thing.
Regards,
Christian.
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-09 10:13 ` Christian König
@ 2026-02-09 12:54 ` Jason Gunthorpe
0 siblings, 0 replies; 25+ messages in thread
From: Jason Gunthorpe @ 2026-02-09 12:54 UTC (permalink / raw)
To: Christian König
Cc: Kanchan Joshi, Pavel Begunkov, linux-block, io-uring,
linux-nvme@lists.infradead.org, Christoph Hellwig, Anuj Gupta,
Nitesh Shetty, lsf-pc@lists.linux-foundation.org
On Mon, Feb 09, 2026 at 11:13:42AM +0100, Christian König wrote:
> On 2/9/26 10:54, Kanchan Joshi wrote:
> > On 2/6/2026 8:50 PM, Jason Gunthorpe wrote:
> >>> I'm actually curious, is there a way to somehow create a
> >>> MEMORY_DEVICE_PCI_P2PDMA mapping out of a random dma-buf?
> >> No. The driver owning the P2P MMIO has to do this during its probe and
> >> then it has to provide a VMA with normal pages so GUP works. This is
> >> usually not hard on the exporting driver side.
> >>
> >> It costs some memory but then everything works naturally in the IO
> >> stack.
> >>
> >> Your project is interesting and would be a nice improvement, but I
> >> also don't entirely understand why you are bothering when the P2PDMA
> >> solution is already fully there ready to go... Is something preventing
> >> you from creating the P2PDMA pages for your exporting driver?
> >
> > The exporter driver may have opted out of the P2PDMA struct page path
> > (the MEMORY_DEVICE_PCI_P2PDMA route). This may be a design choice to avoid
> > the system RAM overhead.
Currently you have to pay this tax to use the block stack.
It is certainly bad on x86, but, for example, a 64k page size ARM system
pays only 83MB for the same configuration.
> That is a good point, but the killer argument for DMA-buf to
> not use pages (or folios) is that the exported resource is sometimes
> not even memory.
I don't think anyone is saying that all DMA-buf must use pages, just
that if you want to use the MMIO with the *block stack* then a page
based approach already exists and is already being used. Usually
through VMAs.
I'm aware of all the downsides, but this proposal doesn't explain
which ones are motivating the work. Is the lack of pre-registration or
the tax the main motivation?
Jason
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
2026-02-03 14:29 ` [LSF/MM/BPF TOPIC] dmabuf backed read/write Pavel Begunkov
` (3 preceding siblings ...)
2026-02-05 17:41 ` Jason Gunthorpe
@ 2026-02-09 10:04 ` Kanchan Joshi
4 siblings, 0 replies; 25+ messages in thread
From: Kanchan Joshi @ 2026-02-09 10:04 UTC (permalink / raw)
To: Pavel Begunkov, linux-block
Cc: io-uring, linux-nvme@lists.infradead.org, Christian König,
Christoph Hellwig, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On 2/3/2026 7:59 PM, Pavel Begunkov wrote:
> Good day everyone,
>
> dma-buf is a powerful abstraction for managing buffers and DMA mappings,
> and there is growing interest in extending it to the read/write path to
> enable device-to-device transfers without bouncing data through system
> memory. I was encouraged to submit it to LSF/MM/BPF as that might be
> useful to mull over details and what capabilities and features people
> may need.
Guilty as charged, I'm interested in the topic. Thanks for posting.
We've had several attempts to move the DMA mapping cost out of the fast
path; hopefully, this will be the final one.
^ permalink raw reply [flat|nested] 25+ messages in thread