From: Pavel Begunkov <asml.silence@gmail.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: linux-block@vger.kernel.org, io-uring <io-uring@vger.kernel.org>,
"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
"Gohad, Tushar" <tushar.gohad@intel.com>,
"Christian König" <christian.koenig@amd.com>,
"Christoph Hellwig" <hch@lst.de>,
"Kanchan Joshi" <joshi.k@samsung.com>,
"Anuj Gupta" <anuj20.g@samsung.com>,
"Nitesh Shetty" <nj.shetty@samsung.com>,
"lsf-pc@lists.linux-foundation.org"
<lsf-pc@lists.linux-foundation.org>
Subject: Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
Date: Mon, 9 Feb 2026 10:59:53 +0000 [thread overview]
Message-ID: <4736af5c-4bf8-4e9b-8f82-4b89a75d2cdb@gmail.com> (raw)
In-Reply-To: <20260206183756.GB1874040@nvidia.com>
On 2/6/26 18:37, Jason Gunthorpe wrote:
> On Fri, Feb 06, 2026 at 05:57:14PM +0000, Pavel Begunkov wrote:
>> On 2/6/26 15:20, Jason Gunthorpe wrote:
>>> On Fri, Feb 06, 2026 at 03:08:25PM +0000, Pavel Begunkov wrote:
>>>> On 2/5/26 23:56, Jason Gunthorpe wrote:
>>>>> On Thu, Feb 05, 2026 at 07:06:03PM +0000, Pavel Begunkov wrote:
>>>>>> On 2/5/26 17:41, Jason Gunthorpe wrote:
>>>>>>> On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
>>>>>>>
>>>>>>>> The proposal consists of two parts. The first is a small in-kernel
>>>>>>>> framework that allows a dma-buf to be registered against a given file
>>>>>>>> and returns an object representing a DMA mapping.
>>>>>>>
>>>>>>> What is this about and why would you need something like this?
>>>>>>>
>>>>>>> The rest makes more sense - pass a DMABUF (or even memfd) to iouring
>>>>>>> and pre-setup the DMA mapping to get dma_addr_t, then directly use
>>>>>>> dma_addr_t through the entire block stack right into the eventual
>>>>>>> driver.
>>>>>>
>>>>>> That's more or less what I tried to do in v1, but 1) people didn't like
>>>>>> the idea of passing raw dma addresses directly, and 2) having it wrapped
>>>>>> into a black box gives more flexibility, like potentially supporting
>>>>>> multi-device filesystems.
>>>>>
>>>>> Ok.. but what does that have to do with a user space visible file?
>>>>
>>>> If you're referring to registration taking a file, it's used to forward
>>>> this registration to the right driver, which knows about devices and can
>>>> create dma-buf attachment[s]. The abstraction users get is not just a
>>>> buffer but rather a buffer registered for a "subsystem" represented by
>>>> the passed file. With nvme raw bdev as the only importer in the patch set,
>>>> it simply converges to "registered for the file", but the notion will
>>>> need to be expanded later, e.g. to accommodate filesystems.
>>>
>>> Sounds completely goofy to me.
>>
>> Hmm... the discussion is not going to be productive, is it?
>
> Well, this FD thing is very confounding and, sorry, I don't see much
> logic to this design. I understand the problems you are explaining but
> not this solution.
We won't agree on this one. If an abstraction linking two entities
makes sense and is useful, it should be fine to have it. It's
functionally the same anyway, just register the buffer multiple times.
In fact, I could drop the fd argument and instead keep a dynamically
updated map in io_uring of all the files / subsystems a registration
can be used with, but that extra generality has additional cost while
not bringing anything new to the table.
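
To make it more concrete, here is a rough sketch of what the userspace
side could look like. The IORING_REGISTER_DMABUF opcode and struct
io_uring_dmabuf_reg below are made up for illustration and are not the
uAPI from the series; only the fixed-buffer read at the end is the
interface that exists today:

#include <liburing.h>
#include <linux/types.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IORING_REGISTER_DMABUF  100     /* hypothetical opcode */

struct io_uring_dmabuf_reg {            /* hypothetical layout */
        __u32 dmabuf_fd;        /* fd from the dma-buf exporter */
        __u32 target_fd;        /* file the buffer is registered against */
        __u32 buf_index;        /* slot in the registered buffer table */
        __u32 resv;
};

static int register_dmabuf(struct io_uring *ring, int dmabuf_fd, int bdev_fd)
{
        struct io_uring_dmabuf_reg reg = {
                .dmabuf_fd = dmabuf_fd,
                .target_fd = bdev_fd,
                .buf_index = 0,
        };

        /* same shape as IORING_REGISTER_BUFFERS, but fd based */
        return syscall(__NR_io_uring_register, ring->ring_fd,
                       IORING_REGISTER_DMABUF, &reg, 1);
}

static void queue_read(struct io_uring *ring, int bdev_fd, unsigned len)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        /* for a dma-buf slot the "address" is an offset into the buffer */
        io_uring_prep_read_fixed(sqe, bdev_fd, 0, len, 0, 0);
}
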
>> Or would it be mapping it for each IO?
>
> mapping for each IO could be possible with a phys_addr_t path.
>
>> dma-buf already exists as well, and I'm ashamed to admit,
>> but I don't know how a user program can read into / write from
>> memory provided by dma-buf.
>
> You can mmap them. It can even be used with read() write() system
> calls if the dma buf exporter is using P2P pages.
>
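Ah, that explains it, thanks. So the baseline (unregistered) path would
be roughly the below; whether it actually works depends on the exporter,
and CPU access may also need DMA_BUF_IOCTL_SYNC bracketing that I'm
omitting here:

#include <sys/mman.h>
#include <unistd.h>

/* read() file data straight into dma-buf backed memory via mmap() */
static int read_into_dmabuf(int file_fd, int dmabuf_fd, size_t len)
{
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         dmabuf_fd, 0);
        ssize_t ret;

        if (buf == MAP_FAILED)
                return -1;

        /* pinning / GUP happens per I/O, nothing is pre-mapped */
        ret = pread(file_fd, buf, len, 0);

        munmap(buf, len);
        return ret < 0 ? -1 : 0;
}
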
>> I'm not doing it for any particular driver but rather trying
>> to reuse what's already there, i.e. a good coverage of existing
>> dma-buf exporters, and infrastructure dma-buf provides, e.g.
>> move_notify. And trying to do that efficiently, avoiding GUP
>> (what io_uring can already do for normal memory), keeping long
>> term mappings (modulo move_notify), and so on. That includes
>> optimising the cost of system memory reads/writes with an IOMMU.
>
> I would suggest leading with these reasons to frame why you are trying
> to do this. It seems the main motivation is to create a pre-registered
> and pre-IOMMU-mapped io uring pool of MMIO memory, and
> indeed you cannot do that with the existing mechanisms at all.
My bad, I didn't explain the "why" well enough. I want to be
able to use dma-buf as a first-class citizen in the I/O path, i.e.
keeping a clear notion of a buffer for the user and being able to
interact with any existing exporter, all while maximising
performance, which definitely includes pre-mapping memory.
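
For reference, the importer-side infrastructure I want to reuse looks
roughly like the below (a sketch with error paths and locking details
trimmed, not code from the series): attach on behalf of the target
device, map once, and get invalidated through move_notify instead of
holding long-term pins.

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/dma-resv.h>

static void my_move_notify(struct dma_buf_attachment *attach)
{
        /* exporter is moving the buffer: quiesce I/O, remap on next use */
}

static const struct dma_buf_attach_ops my_attach_ops = {
        .allow_peer2peer = true,
        .move_notify     = my_move_notify,
};

static struct sg_table *map_for_device(struct dma_buf *dmabuf,
                                       struct device *dev)
{
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;

        attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, NULL);
        if (IS_ERR(attach))
                return ERR_CAST(attach);

        dma_resv_lock(dmabuf->resv, NULL);
        sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
        dma_resv_unlock(dmabuf->resv);

        /* sgt now holds the device-visible dma addresses */
        return sgt;
}
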
> As a step forward I could imagine having a DMABUF handing out P2P
> pages and allowing io uring to "register" it complete with move
Forcing dma-buf to have pages is a big step back, IMHO.
> notify. This would get you half the way there and doesn't require
> major changes to the block stack since you can still be pushing
> unmapped struct page backed addresses and everything will work
> fine. It is a good way to sidestep the FOLL_LONGTERM issue.
>
> Pre-iommu-mapping the pool seems like an orthogonal project as it
> applies to everything coming from pre-registered io uring buffers,
> even normal cpu memory. You could have a next step of pre-mapping the
> P2P pages and CPU pages equally.
Pre-mapping was already tried for normal user memory (not by me), but
the verdict was that it should be dma-buf based.
> Finally you could try a project to remove the P2P page requirement for
> cases that use the pre-iommu-mapping flow.
>
> It would probably be helpful not to mix up those three things.
I agree that pre-mapping and p2p can in theory be decoupled here,
but in practice dma-buf already provides the right abstractions and
infrastructure to cover both in one go. I really don't think it's a
good idea to re-engineer the parts of dma-buf that are responsible for
interaction with the importer device.
--
Pavel Begunkov