From: Pavel Begunkov <asml.silence@gmail.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: linux-block@vger.kernel.org, io-uring <io-uring@vger.kernel.org>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"Gohad, Tushar" <tushar.gohad@intel.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Christoph Hellwig" <hch@lst.de>,
	"Kanchan Joshi" <joshi.k@samsung.com>,
	"Anuj Gupta" <anuj20.g@samsung.com>,
	"Nitesh Shetty" <nj.shetty@samsung.com>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>
Subject: Re: [LSF/MM/BPF TOPIC] dmabuf backed read/write
Date: Fri, 6 Feb 2026 17:57:14 +0000
Message-ID: <df7fe4d7-ca28-408e-bed3-bd1fa23e7588@gmail.com>
In-Reply-To: <20260206152041.GA1874040@nvidia.com>

On 2/6/26 15:20, Jason Gunthorpe wrote:
> On Fri, Feb 06, 2026 at 03:08:25PM +0000, Pavel Begunkov wrote:
>> On 2/5/26 23:56, Jason Gunthorpe wrote:
>>> On Thu, Feb 05, 2026 at 07:06:03PM +0000, Pavel Begunkov wrote:
>>>> On 2/5/26 17:41, Jason Gunthorpe wrote:
>>>>> On Tue, Feb 03, 2026 at 02:29:55PM +0000, Pavel Begunkov wrote:
>>>>>
>>>>>> The proposal consists of two parts. The first is a small in-kernel
>>>>>> framework that allows a dma-buf to be registered against a given file
>>>>>> and returns an object representing a DMA mapping.
>>>>>
>>>>> What is this about and why would you need something like this?
>>>>>
>>>>> The rest makes more sense - pass a DMABUF (or even memfd) to iouring
>>>>> and pre-setup the DMA mapping to get dma_addr_t, then directly use
>>>>> dma_addr_t through the entire block stack right into the eventual
>>>>> driver.
>>>>
>>>> That's more or less what I tried to do in v1, but 1) people didn't like
>>>> the idea of passing raw dma addresses directly, and 2) having it wrapped
>>>> into a black box gives more flexibility, like potentially supporting
>>>> multi-device filesystems.
>>>
>>> Ok.. but what does that have to do with a user space visible file?
>>
>> If you're referring to registration taking a file, it's used to forward
>> this registration to the right driver, which knows about devices and can
>> create dma-buf attachment[s]. The abstraction users get is not just a
>> buffer but rather a buffer registered for a "subsystem" represented by
>> the passed file. With nvme raw bdev as the only importer in the patch set,
>> it simply converges to "registered for the file", but the notion will
>> need to be expanded later, e.g. to accommodate filesystems.
> 
> Sounds completely goofy to me.

Hmm... this discussion is not going to be productive, is it?

> A wrapper around DMABUF that lets you
> attach to DMABUFs? Huh?

I have no idea what you mean or what "attach to DMABUFs" is. The
dma-buf is passed to the driver, which attaches it (i.e. calls
dma_buf_dynamic_attach()).
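
On the importer side it's roughly the usual dynamic-attach pattern. A
minimal sketch with error handling trimmed; my_move_notify and
my_dmabuf_map are illustrative names, not the patch set code:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/err.h>

static void my_move_notify(struct dma_buf_attachment *attach)
{
        /* the exporter is moving the buffer: invalidate any cached
         * DMA mapping derived from this attachment here */
}

static const struct dma_buf_attach_ops my_attach_ops = {
        .allow_peer2peer = true,
        .move_notify     = my_move_notify,
};

/* attach the dma-buf for a specific device and map it once, up front */
static struct sg_table *my_dmabuf_map(struct dma_buf *dmabuf,
                                      struct device *dev)
{
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;

        attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, NULL);
        if (IS_ERR(attach))
                return ERR_CAST(attach);

        dma_resv_lock(dmabuf->resv, NULL);
        sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
        dma_resv_unlock(dmabuf->resv);
        return sgt;
}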

> I feel like io_uring should be dealing with this internally somehow, not
> creating more and more uapi...

uapi changes are already minimal and outside of the IO path.
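
To illustrate "outside of the IO path": only the registration step is
new, submission keeps using the existing fixed-buffer opcode. A
userspace sketch, where io_uring_register_dmabuf() is a made-up
placeholder rather than the actual uapi:

#include <liburing.h>
#include <errno.h>

/*
 * Made-up placeholder for the registration uapi (the real interface is
 * what's being discussed): associate a dma-buf fd with a target file
 * and a fixed buffer index, entirely outside of the IO path.
 */
int io_uring_register_dmabuf(struct io_uring *ring, int target_fd,
                             int dmabuf_fd, unsigned int buf_index);

static int setup(struct io_uring *ring, int bdev_fd, int dmabuf_fd)
{
        /* one-off registration, done once, outside of the IO path */
        return io_uring_register_dmabuf(ring, bdev_fd, dmabuf_fd, 0);
}

static int queue_read(struct io_uring *ring, int bdev_fd,
                      unsigned long buf_off, unsigned int len,
                      __u64 file_off, unsigned int buf_index)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
                return -EBUSY;
        /*
         * The IO path itself is unchanged: a plain fixed-buffer read
         * against the registered index.  (Assumption for illustration:
         * the address is interpreted as an offset into the dma-buf.)
         */
        io_uring_prep_read_fixed(sqe, bdev_fd, (void *)buf_off, len,
                                 file_off, buf_index);
        return io_uring_submit(ring);
}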

> The longer term goal has been to get page * out of the io stack and
> start using phys_addr_t, if we could pass the DMABUF's MMIO as a

Except that I already tried passing device-mapped addresses directly,
and it was rejected because it wouldn't be able to handle more complicated
cases like multi-device filesystems, and probably for other reasons as
well. Or would it be mapping it for each IO?

> phys_addr_t around the IO stack then we only need to close the gap of
> getting the p2p provider into the final DMA mapping.
> 
> A lot of this has improved in the past few cycles; the main issue now is
> carrying the provider and phys_addr_t through the IO path to the nvme
> driver, vs. when you started this, when even that fundamental
> infrastructure was missing.
> 
>>>>>> Tushar was helping and mentioned he got good numbers for P2P transfers
>>>>>> compared to bouncing it via RAM.
>>>>>
>>>>> We can already avoid the bouncing; it seems the main improvements here
>>>>> are avoiding the DMA map per-IO and allowing the use of P2P without
>>>>> also creating struct page. Meaningful wins for sure.
>>>>
>>>> Yes, and it should probably be nicer for frameworks that already
>>>> expose dma-bufs.
>>>
>>> I'm not sure what this means?
>>
>> I'm saying that when a user app can easily get or already has a
>> dma-buf fd, it should be easier to just use it instead of finding
>> its way to FOLL_PCI_P2PDMA.
> 
> But that all exists already and this proposal does nothing to improve
> it..

dma-buf already exists as well, and, I'm ashamed to admit, I don't
know how a user program can read into / write from memory provided
by a dma-buf.

>> I'm actually curious, is there a way to somehow create a
>> MEMORY_DEVICE_PCI_P2PDMA mapping out of a random dma-buf?
> 
> No. The driver owning the P2P MMIO has to do this during its probe and
> then it has to provide a VMA with normal pages so GUP works. This is
> usually not hard on the exporting driver side.
> 
> It costs some memory but then everything works naturally in the IO
> stack.
> 
> Your project is interesting and would be a nice improvement, but I
> also don't entirely understand why you are bothering when the P2PDMA
> solution is already fully there and ready to go... Is something preventing
> you from creating the P2PDMA pages for your exporting driver?

I'm not doing it for any particular driver but rather trying to reuse
what's already there, i.e. the good coverage of existing dma-buf
exporters and the infrastructure dma-buf provides, e.g. move_notify.
And I'm trying to do that efficiently: avoiding GUP (which io_uring
can already do for normal memory), keeping long-term mappings (modulo
move_notify), and so on. That includes optimising the cost of
system-memory reads/writes under an IOMMU.
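
For contrast, the probe-time P2PDMA setup you describe would look
roughly like the below on the exporting driver side. A sketch of the
existing pci_p2pdma API, not code from any driver discussed here:

#include <linux/pci.h>
#include <linux/pci-p2pdma.h>
#include <linux/err.h>

/* From the exporting PCI driver's probe(): give a BAR ZONE_DEVICE
 * struct pages so it can later be mmap'ed and GUP'ed with
 * FOLL_PCI_P2PDMA like normal memory. */
static void *my_probe_p2pmem(struct pci_dev *pdev, int bar, size_t size)
{
        void *vaddr;
        int ret;

        ret = pci_p2pdma_add_resource(pdev, bar, size, 0);
        if (ret)
                return ERR_PTR(ret);

        vaddr = pci_alloc_p2pmem(pdev, size);
        if (!vaddr)
                return ERR_PTR(-ENOMEM);

        /* make the region discoverable by other p2pdma users */
        pci_p2pmem_publish(pdev, true);

        /* the driver still has to expose this to userspace via its own
         * mmap so the pages can reach the IO stack through GUP */
        return vaddr;
}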

-- 
Pavel Begunkov

