* [LSF/MM/BPF TOPIC] dmabuf backed read/write
From: Pavel Begunkov @ 2026-02-03 14:29 UTC (permalink / raw)
To: linux-block
Cc: io-uring, linux-nvme@lists.infradead.org, Gohad, Tushar,
Christian König, Christoph Hellwig, Kanchan Joshi,
Anuj Gupta, Nitesh Shetty, lsf-pc@lists.linux-foundation.org
Good day everyone,
dma-buf is a powerful abstraction for managing buffers and DMA mappings,
and there is growing interest in extending it to the read/write path to
enable device-to-device transfers without bouncing data through system
memory. I was encouraged to submit it to LSF/MM/BPF as that might be
useful to mull over details and what capabilities and features people
may need.
The proposal consists of two parts. The first is a small in-kernel
framework that allows a dma-buf to be registered against a given file
and returns an object representing a DMA mapping. The actual mapping
creation is delegated to the target subsystem (e.g. NVMe). This
abstraction centralises request accounting, mapping management, dynamic
recreation, etc. The resulting mapping object is passed through the I/O
stack via a new iov_iter type.
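To make the first part a bit more concrete, here is a rough sketch of the
shape such an interface could take. All names below (dmabuf_io_region,
dmabuf_io_ops, iov_iter_dmabuf, ...) are purely illustrative and are not
taken from the series:

    /* Illustrative only -- not the series' actual interface. */
    #include <linux/dma-buf.h>
    #include <linux/fs.h>
    #include <linux/uio.h>

    struct dmabuf_io_region;        /* opaque handle representing the mapping */

    struct dmabuf_io_ops {
            /* the target subsystem (e.g. NVMe) creates the device mapping */
            int (*map)(struct dmabuf_io_region *rgn,
                       struct dma_buf_attachment *attach);
            /* called on unregistration or when the exporter moves the buffer */
            void (*unmap)(struct dmabuf_io_region *rgn);
    };

    /* register @dmabuf against @file; the core handles attachment lifetime,
     * request accounting and dynamic remapping */
    struct dmabuf_io_region *dmabuf_io_register(struct file *file,
                                                struct dma_buf *dmabuf,
                                                const struct dmabuf_io_ops *ops);
    void dmabuf_io_unregister(struct dmabuf_io_region *rgn);

    /* I/O paths consume the mapping via a new iov_iter type */
    void iov_iter_dmabuf(struct iov_iter *iter, unsigned int direction,
                         struct dmabuf_io_region *rgn, size_t offset, size_t len);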
As for the user API, a dma-buf is installed as an io_uring registered
buffer for a specific file. Once registered, the buffer can be used by
read / write io_uring requests as normal. io_uring will enforce that the
buffer is only used with "compatible files", which is for now restricted
to the target registration file, but will be expanded in the future.
Notably, io_uring is a consumer of the framework rather than a
dependency, and the infrastructure can be reused.
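On the consumer side, a minimal sketch of how a read could look with liburing
is below. The registration ABI is still being settled, so it is only described
in a comment, and the assumption that the addr field is treated as an offset
into the dma-buf is mine:

    #include <errno.h>
    #include <liburing.h>

    static int dmabuf_read(struct io_uring *ring, int data_fd, size_t len,
                           off_t file_off)
    {
            struct io_uring_sqe *sqe;

            /*
             * Registration not shown: a dma-buf fd is installed as a
             * registered buffer for data_fd via a new registration opcode
             * (exact ABI in flux). Assume it ended up at buffer index 0.
             */
            sqe = io_uring_get_sqe(ring);
            if (!sqe)
                    return -EBUSY;
            /* assumption: addr is an offset into the registered dma-buf */
            io_uring_prep_read_fixed(sqe, data_fd, (void *)0, len, file_off, 0);
            return io_uring_submit(ring);
    }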
It took a couple of iterations on the list to get to the current design; v2
of the series can be found at [1], which implements the infrastructure and
the initial wiring for NVMe. It slightly diverges from the description above,
as some of the framework bits are block-specific, and I'll be working on
refining that and simplifying some of the interfaces for v3. A good chunk of
the block handling is based on prior work from Keith on pre-DMA-mapping
buffers [2].
Tushar has been helping out and mentioned that he got good numbers for P2P
transfers compared to bouncing the data through RAM. Anuj, Kanchan and Nitesh
also previously reported encouraging results for system-memory-backed
dma-bufs used to reduce IOMMU overhead; quoting Anuj:
- STRICT: before = 570 KIOPS, after = 5.01 MIOPS
- LAZY: before = 1.93 MIOPS, after = 5.01 MIOPS
- PASSTHROUGH: before = 5.01 MIOPS, after = 5.01 MIOPS
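For context, the three rows correspond to the kernel's IOMMU DMA modes, which
can be selected at boot time (exact defaults vary by arch and config), e.g.:

    iommu.passthrough=1    # identity-mapped DMA, no per-IO IOVA management
    iommu.strict=1         # translated DMA, synchronous IOTLB invalidation
    iommu.strict=0         # translated DMA, deferred (lazy) invalidation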
[1] https://lore.kernel.org/io-uring/cover.1763725387.git.asml.silence@gmail.com/
[2] https://lore.kernel.org/io-uring/20220805162444.3985535-1-kbusch@fb.com/
--
Pavel Begunkov
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
From: Keith Busch @ 2026-02-03 18:07 UTC (permalink / raw)
To: Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
> Good day everyone,
>
> dma-buf is a powerful abstraction for managing buffers and DMA mappings,
> and there is growing interest in extending it to the read/write path to
> enable device-to-device transfers without bouncing data through system
> memory. I was encouraged to submit it to LSF/MM/BPF as that might be
> useful to mull over details and what capabilities and features people
> may need.
>
> The proposal consists of two parts. The first is a small in-kernel
> framework that allows a dma-buf to be registered against a given file
> and returns an object representing a DMA mapping. The actual mapping
> creation is delegated to the target subsystem (e.g. NVMe). This
> abstraction centralises request accounting, mapping management, dynamic
> recreation, etc. The resulting mapping object is passed through the I/O
> stack via a new iov_iter type.
>
> As for the user API, a dma-buf is installed as an io_uring registered
> buffer for a specific file. Once registered, the buffer can be used by
> read / write io_uring requests as normal. io_uring will enforce that the
> buffer is only used with "compatible files", which is for now restricted
> to the target registration file, but will be expanded in the future.
> Notably, io_uring is a consumer of the framework rather than a
> dependency, and the infrastructure can be reused.
>
> It took a couple of iterations on the list to get to the current design; v2
> of the series can be found at [1], which implements the infrastructure and
> the initial wiring for NVMe. It slightly diverges from the description above,
> as some of the framework bits are block-specific, and I'll be working on
> refining that and simplifying some of the interfaces for v3. A good chunk of
> the block handling is based on prior work from Keith on pre-DMA-mapping
> buffers [2].
>
> Tushar has been helping out and mentioned that he got good numbers for P2P
> transfers compared to bouncing the data through RAM. Anuj, Kanchan and Nitesh
> also previously reported encouraging results for system-memory-backed
> dma-bufs used to reduce IOMMU overhead; quoting Anuj:
>
> - STRICT: before = 570 KIOPS, after = 5.01 MIOPS
> - LAZY: before = 1.93 MIOPS, after = 5.01 MIOPS
> - PASSTHROUGH: before = 5.01 MIOPS, after = 5.01 MIOPS
Thanks for submitting the topic. The performance wins look great, but
I'm a little surprised passthrough didn't show any difference. We're
still skipping some of the transformations with the dmabuf compared to not
having it, so maybe it's just a matter of crafting the right benchmark
to show the benefit.
Anyway, I look forward to the next version of this feature. I promise to
have more cycles to review and test v3.
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
From: Anuj Gupta @ 2026-02-04 6:07 UTC (permalink / raw)
To: Keith Busch, Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Christian König, Christoph Hellwig, Kanchan Joshi,
Nitesh Shetty, lsf-pc@lists.linux-foundation.org
On 2/3/2026 11:37 PM, Keith Busch wrote:
> Thanks for submitting the topic. The performance wins look great, but
> I'm a little surprised passthrough didn't show any difference. We're
> still skipping some of the transformations with the dmabuf compared to not
> having it, so maybe it's just a matter of crafting the right benchmark
> to show the benefit.
>
Those numbers were from a drive that saturates at ~5M IOPS, so
passthrough didn't have much headroom. I did a quick run with two such
drives and saw a small improvement (~2-3%): ~5.97 MIOPS -> ~6.13 MIOPS,
but I'll try tweaking the kernel config a bit to see if there's more
headroom.
+1 on the topic - I'm interested in attending the discussion and
reviewing/testing v3 when it lands.
Thanks,
Anuj
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
From: Pavel Begunkov @ 2026-02-04 11:38 UTC (permalink / raw)
To: Keith Busch
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On 2/3/26 18:07, Keith Busch wrote:
> On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
>> Good day everyone,
>>
...
>> Tushar has been helping out and mentioned that he got good numbers for P2P
>> transfers compared to bouncing the data through RAM. Anuj, Kanchan and Nitesh
>> also previously reported encouraging results for system-memory-backed
>> dma-bufs used to reduce IOMMU overhead; quoting Anuj:
>>
>> - STRICT: before = 570 KIOPS, after = 5.01 MIOPS
>> - LAZY: before = 1.93 MIOPS, after = 5.01 MIOPS
>> - PASSTHROUGH: before = 5.01 MIOPS, after = 5.01 MIOPS
>
> Thanks for submitting the topic. The performance wins look great, but
> I'm a little surprised passthrough didn't show any difference. We're
> still skipping some of the transformations with the dmabuf compared to not
> having it, so maybe it's just a matter of crafting the right benchmark
> to show the benefit.
My first thought was that the hardware couldn't push more and that it
would be great to have idle numbers, but Anuj already demystified it.
> Anyway, I look forward to the next version of this feature. I promise to
> have more cycles to review and test v3.
Thanks! And in general, IMHO at this point waiting for the next
version would be more time-efficient for reviewers.
--
Pavel Begunkov
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
From: Nitesh Shetty @ 2026-02-04 15:26 UTC (permalink / raw)
To: Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Christian König, Christoph Hellwig, Kanchan Joshi,
Anuj Gupta, lsf-pc@lists.linux-foundation.org
On 03/02/26 02:29PM, Pavel Begunkov wrote:
>Good day everyone,
>
>dma-buf is a powerful abstraction for managing buffers and DMA mappings,
>and there is growing interest in extending it to the read/write path to
>enable device-to-device transfers without bouncing data through system
>memory. I was encouraged to submit it to LSF/MM/BPF as that might be
>useful to mull over details and what capabilities and features people
>may need.
>
>The proposal consists of two parts. The first is a small in-kernel
>framework that allows a dma-buf to be registered against a given file
>and returns an object representing a DMA mapping. The actual mapping
>creation is delegated to the target subsystem (e.g. NVMe). This
>abstraction centralises request accounting, mapping management, dynamic
>recreation, etc. The resulting mapping object is passed through the I/O
>stack via a new iov_iter type.
>
>As for the user API, a dma-buf is installed as an io_uring registered
>buffer for a specific file. Once registered, the buffer can be used by
>read / write io_uring requests as normal. io_uring will enforce that the
>buffer is only used with "compatible files", which is for now restricted
>to the target registration file, but will be expanded in the future.
>Notably, io_uring is a consumer of the framework rather than a
>dependency, and the infrastructure can be reused.
>
We have been following the series; it's interesting from a couple of angles:
- IOPS-wise we see a major improvement, especially with an IOMMU
- the series provides a way to do p2pdma to accelerator memory
Here are a few topics which I am looking into specifically:
- Right now the series uses a PRP list. We need a good way to keep the
  sg_table info around and decide on the fly whether to expose the buffer
  as a PRP list or an SG list, depending on the I/O size (see the sketch
  below).
- the possibility of further optimisation of the new iov_iter type to
  reduce the per-IO cost
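To illustrate the kind of per-IO decision I have in mind, a simplified sketch
(the helper name and the 32K threshold are made up for illustration; this is
neither the series' nor the driver's exact code):

    #include <linux/sizes.h>
    #include <linux/types.h>

    /* pick SGL vs PRP based on the shape of the already-mapped buffer */
    static bool nvme_io_prefers_sgl(bool ctrl_supports_sgl,
                                    unsigned int nr_segs, size_t io_bytes)
    {
            if (!ctrl_supports_sgl)
                    return false;           /* PRP list is the only option */
            if (nr_segs == 1)
                    return false;           /* a single PRP pair suffices */
            /* larger average segments amortise the SGL descriptor overhead */
            return io_bytes / nr_segs >= SZ_32K;
    }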
Thanks,
Nitesh
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
From: Ming Lei @ 2026-02-05 3:12 UTC (permalink / raw)
To: Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
> Good day everyone,
>
> dma-buf is a powerful abstraction for managing buffers and DMA mappings,
> and there is growing interest in extending it to the read/write path to
> enable device-to-device transfers without bouncing data through system
> memory. I was encouraged to submit it to LSF/MM/BPF as that might be
> useful to mull over details and what capabilities and features people
> may need.
>
> The proposal consists of two parts. The first is a small in-kernel
> framework that allows a dma-buf to be registered against a given file
> and returns an object representing a DMA mapping. The actual mapping
> creation is delegated to the target subsystem (e.g. NVMe). This
> abstraction centralises request accounting, mapping management, dynamic
> recreation, etc. The resulting mapping object is passed through the I/O
> stack via a new iov_iter type.
>
> As for the user API, a dma-buf is installed as an io_uring registered
> buffer for a specific file. Once registered, the buffer can be used by
> read / write io_uring requests as normal. io_uring will enforce that the
> buffer is only used with "compatible files", which is for now restricted
> to the target registration file, but will be expanded in the future.
> Notably, io_uring is a consumer of the framework rather than a
> dependency, and the infrastructure can be reused.
I am interested in this topic.
Given that dma-buf is inherently designed for sharing, I hope the io_uring
interface can be generic enough to cover:
- read/write with the same dma-buf can be submitted to multiple devices
- read/write with a dma-buf can cross stackable devices (device mapper,
  raid, ...)
Thanks,
Ming
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
From: Jason Gunthorpe @ 2026-02-05 17:41 UTC (permalink / raw)
To: Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
> The proposal consists of two parts. The first is a small in-kernel
> framework that allows a dma-buf to be registered against a given file
> and returns an object representing a DMA mapping.
What is this about and why would you need something like this?
The rest makes more sense - pass a DMABUF (or even a memfd) to io_uring
and set up the DMA mapping ahead of time to get dma_addr_t, then directly
use dma_addr_t through the entire block stack right into the eventual
driver.
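Something along the lines of the usual importer flow (sketch only; static
attachment and minimal error handling for brevity, teardown not shown):

    #include <linux/dma-buf.h>
    #include <linux/dma-mapping.h>
    #include <linux/err.h>
    #include <linux/scatterlist.h>

    static struct sg_table *premap_dmabuf(int dmabuf_fd, struct device *dev,
                                          struct dma_buf_attachment **out)
    {
            struct dma_buf *dmabuf = dma_buf_get(dmabuf_fd);
            struct dma_buf_attachment *attach;
            struct sg_table *sgt;

            if (IS_ERR(dmabuf))
                    return ERR_CAST(dmabuf);

            attach = dma_buf_attach(dmabuf, dev);
            if (IS_ERR(attach)) {
                    dma_buf_put(dmabuf);
                    return ERR_CAST(attach);
            }

            /* each entry of sgt now carries sg_dma_address()/sg_dma_len(),
             * which is what the block stack would hand down to the driver */
            sgt = dma_buf_map_attachment_unlocked(attach, DMA_BIDIRECTIONAL);
            if (IS_ERR(sgt)) {
                    dma_buf_detach(dmabuf, attach);
                    dma_buf_put(dmabuf);
                    return ERR_CAST(sgt);
            }
            *out = attach;
            return sgt;
    }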
> Tushar has been helping out and mentioned that he got good numbers for P2P
> transfers compared to bouncing the data through RAM.
We can already avoid the bouncing; it seems the main improvements here
are avoiding the per-IO DMA map and allowing the use of P2P without
also creating struct page. Meaningful wins for sure.
Jason
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
From: Pavel Begunkov @ 2026-02-05 18:13 UTC (permalink / raw)
To: Ming Lei
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On 2/5/26 03:12, Ming Lei wrote:
> On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
>> Good day everyone,
>>
>> dma-buf is a powerful abstraction for managing buffers and DMA mappings,
>> and there is growing interest in extending it to the read/write path to
>> enable device-to-device transfers without bouncing data through system
>> memory. I was encouraged to submit it to LSF/MM/BPF as that might be
>> useful to mull over details and what capabilities and features people
>> may need.
>>
>> The proposal consists of two parts. The first is a small in-kernel
>> framework that allows a dma-buf to be registered against a given file
>> and returns an object representing a DMA mapping. The actual mapping
>> creation is delegated to the target subsystem (e.g. NVMe). This
>> abstraction centralises request accounting, mapping management, dynamic
>> recreation, etc. The resulting mapping object is passed through the I/O
>> stack via a new iov_iter type.
>>
>> As for the user API, a dma-buf is installed as an io_uring registered
>> buffer for a specific file. Once registered, the buffer can be used by
>> read / write io_uring requests as normal. io_uring will enforce that the
>> buffer is only used with "compatible files", which is for now restricted
>> to the target registration file, but will be expanded in the future.
>> Notably, io_uring is a consumer of the framework rather than a
>> dependency, and the infrastructure can be reused.
>
> I am interested in this topic.
>
> Given dma-buf is inherently designed for sharing, I hope the io-uring
> interface can be generic for covering:
>
> - read/write with same dma-buf can be submitted to multiple devices
>
> - read/write with dma-buf can cross stackable devices(device mapper, raid,
> ...)
Yes, those should be possible to do; IIRC Christoph mentioned it
as well while asking to change the design of v1. The
implementation will need to forward the registration down and create
a dma-buf attachment for each device.
--
Pavel Begunkov
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
From: Pavel Begunkov @ 2026-02-05 19:06 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On 2/5/26 17:41, Jason Gunthorpe wrote:
> On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
>
>> The proposal consists of two parts. The first is a small in-kernel
>> framework that allows a dma-buf to be registered against a given file
>> and returns an object representing a DMA mapping.
>
> What is this about and why would you need something like this?
>
> The rest makes more sense - pass a DMABUF (or even a memfd) to io_uring
> and set up the DMA mapping ahead of time to get dma_addr_t, then directly
> use dma_addr_t through the entire block stack right into the eventual
> driver.
That's more or less what I tried to do in v1, but 1) people didn't like
the idea of passing raw dma addresses around directly, and wrapping it
into a black box gives more flexibility, like potentially supporting
multi-device filesystems. And 2) dma-buf folks want dynamic attachments,
which makes it quite a bit more complicated when you might be asked to
shoot down DMA mappings at any moment, so I'm isolating all of that
into something that can be reused.
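For reference, the dynamic-attachment hooks in question look roughly like the
sketch below; the importer state and the revalidation policy here are made up,
and the hard part is quiescing in-flight I/O before the exporter moves the
buffer, which is exactly what I'd like to keep in one place:

    #include <linux/dma-buf.h>

    struct my_importer {                    /* illustrative importer state */
            struct dma_buf_attachment *attach;
            bool mapping_valid;
    };

    /* called by the exporter, with the dma-buf's reservation lock held */
    static void my_move_notify(struct dma_buf_attachment *attach)
    {
            struct my_importer *imp = attach->importer_priv;

            /* in-flight I/O must be quiesced; new I/O has to re-map first */
            imp->mapping_valid = false;
    }

    static const struct dma_buf_attach_ops my_importer_ops = {
            .allow_peer2peer = true,        /* P2P without struct page */
            .move_notify     = my_move_notify,
    };

    /* attach = dma_buf_dynamic_attach(dmabuf, dev, &my_importer_ops, imp); */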
>> Tushar has been helping out and mentioned that he got good numbers for P2P
>> transfers compared to bouncing the data through RAM.
>
> We can already avoid the bouncing; it seems the main improvements here
> are avoiding the per-IO DMA map and allowing the use of P2P without
> also creating struct page. Meaningful wins for sure.
Yes, and it should probably be nicer for frameworks that already
expose dma-bufs.
--
Pavel Begunkov
* Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
From: Jason Gunthorpe @ 2026-02-05 23:56 UTC (permalink / raw)
To: Pavel Begunkov
Cc: linux-block, io-uring, linux-nvme@lists.infradead.org,
Gohad, Tushar, Christian König, Christoph Hellwig,
Kanchan Joshi, Anuj Gupta, Nitesh Shetty,
lsf-pc@lists.linux-foundation.org
On Thu, Feb 05, 2026 at 07:06:03PM +0000, Pavel Begunkov wrote:
> On 2/5/26 17:41, Jason Gunthorpe wrote:
> > On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
> >
> > > The proposal consists of two parts. The first is a small in-kernel
> > > framework that allows a dma-buf to be registered against a given file
> > > and returns an object representing a DMA mapping.
> >
> > What is this about and why would you need something like this?
> >
> > The rest makes more sense - pass a DMABUF (or even a memfd) to io_uring
> > and set up the DMA mapping ahead of time to get dma_addr_t, then directly
> > use dma_addr_t through the entire block stack right into the eventual
> > driver.
>
> That's more or less what I tried to do in v1, but 1) people didn't like
> the idea of passing raw dma addresses around directly, and wrapping it
> into a black box gives more flexibility, like potentially supporting
> multi-device filesystems.
Ok.. but what does that have to do with a userspace-visible file?
> 2) dma-buf folks want dynamic attachments,
> which makes it quite a bit more complicated when you might be asked to
> shoot down DMA mappings at any moment, so I'm isolating all of that
> into something that can be reused.
IMHO there is probably nothing really reusable here. The logic to
fence any usage is entirely unique to whoever is using it, and the
locking tends to be really hard.
You should review the email threads linked to this patch and all its
prior versions, as the expected importer behavior for pinned dmabufs is
not well understood.
https://lore.kernel.org/all/20260131-dmabuf-revoke-v7-0-463d956bd527@nvidia.com/
> > > Tushar has been helping out and mentioned that he got good numbers for P2P
> > > transfers compared to bouncing the data through RAM.
> >
> > We can already avoid the bouncing; it seems the main improvements here
> > are avoiding the per-IO DMA map and allowing the use of P2P without
> > also creating struct page. Meaningful wins for sure.
>
> Yes, and it should probably be nicer for frameworks that already
> expose dma-bufs.
I'm not sure what this means?
Jason