* [PATCH v3 00/10] Add dmabuf read/write via io_uring
@ 2026-04-29 15:25 Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 01/10] file: add callback for creating long-term dmabuf maps Pavel Begunkov
` (9 more replies)
0 siblings, 10 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-29 15:25 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: asml.silence, Nitesh Shetty, Kanchan Joshi, Anuj Gupta,
Tushar Gohad, William Power, Phil Cayton, Jason Gunthorpe
The patch set allows registering a dmabuf with an io_uring instance for
a specified file and using it with io_uring read / write requests. The
infrastructure is not tied to io_uring, and there could be more users
in the future. A similar idea was attempted some years ago by Keith [1],
from which I borrowed a good number of changes, and it was later brought
up again by Tushar and Vishal from Intel.
It's an opt-in feature for files, which need to implement a new file
operation to support it. Only NVMe block devices are supported in this
series. The user API is built on top of io_uring's "registered buffers":
a dmabuf is registered in a special way, but afterwards it can be used
like any other "registered buffer" with IORING_OP_{READ,WRITE}_FIXED
requests. The map is created via a new file operation, and the resulting
map is then passed through the I/O stack in a new iterator type. There
is some additional infrastructure binding it all together, which also
counts requests using a dmabuf map and manages lifetimes; that is what
implements map invalidation.
It was tested with GPU <-> NVMe transfers. Also, since it maintains a
long-term dma mapping, it avoids the per-IO IOMMU mapping cost. The
numbers below are for udmabuf reads, previously run by Anuj with
different IOMMU modes:
- STRICT: before = 570 KIOPS, after = 5.01 MIOPS
- LAZY: before = 1.93 MIOPS, after = 5.01 MIOPS
- PASSTHROUGH: before = 5.01 MIOPS, after = 5.01 MIOPS
There are some liburing tests that can serve as an example:
git: https://github.com/isilence/liburing.git rw-dmabuf-tests-v3
url: https://github.com/isilence/liburing/tree/rw-dmabuf-tests-v3
[1] https://lore.kernel.org/io-uring/20220805162444.3985535-1-kbusch@fb.com/
v3: - Rework io_uring registration
- Move token/map infrastructure code out of blk-mq
- Simplify callbacks: remove a separate blk-mq table, which was
mostly just forwarding calls (to nvme).
- Don't skip dma sync depending on request direction
- Fix a couple of hangs
- Rename s/dma/dmabuf/
- Other small changes
v2: - Don't pass raw dma addresses, wrap them into a driver specific object
- Split into two objects: token and map
- Implement move_notify
Pavel Begunkov (10):
file: add callback for creating long-term dmabuf maps
iov_iter: add iterator type for dmabuf maps
block: move bvec init into __bio_clone
block: introduce dma map backed bio type
lib: add dmabuf token infrastructure
block: forward create_dmabuf_token to drivers
nvme-pci: implement dma_token backed requests
io_uring/rsrc: introduce buf registration structure
io_uring/rsrc: extend buffer update
io_uring/rsrc: add dmabuf backed registered buffers
block/bio.c | 28 +++-
block/blk-merge.c | 14 ++
block/blk.h | 3 +-
block/fops.c | 16 ++
drivers/nvme/host/pci.c | 282 ++++++++++++++++++++++++++++++++
include/linux/bio.h | 19 ++-
include/linux/blk-mq.h | 9 +
include/linux/blk_types.h | 8 +-
include/linux/fs.h | 2 +
include/linux/io_dmabuf_token.h | 92 +++++++++++
include/linux/io_uring_types.h | 5 +
include/linux/uio.h | 11 ++
include/uapi/linux/io_uring.h | 31 +++-
io_uring/io_uring.c | 3 +-
io_uring/rsrc.c | 266 +++++++++++++++++++++++++-----
io_uring/rsrc.h | 30 +++-
io_uring/rw.c | 4 +-
lib/Kconfig | 4 +
lib/Makefile | 2 +
lib/io_dmabuf_token.c | 272 ++++++++++++++++++++++++++++++
lib/iov_iter.c | 29 +++-
21 files changed, 1071 insertions(+), 59 deletions(-)
create mode 100644 include/linux/io_dmabuf_token.h
create mode 100644 lib/io_dmabuf_token.c
--
2.53.0
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH v3 01/10] file: add callback for creating long-term dmabuf maps
2026-04-29 15:25 [PATCH v3 00/10] Add dmabuf read/write via io_uring Pavel Begunkov
@ 2026-04-29 15:25 ` Pavel Begunkov
2026-04-30 6:03 ` Christian König
2026-04-29 15:25 ` [PATCH v3 02/10] iov_iter: add iterator type for " Pavel Begunkov
` (8 subsequent siblings)
9 siblings, 1 reply; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-29 15:25 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: asml.silence, Nitesh Shetty, Kanchan Joshi, Anuj Gupta,
Tushar Gohad, William Power, Phil Cayton, Jason Gunthorpe
Introduce a new file callback that allows creating long-term dma
mappings. All necessary information, together with a dmabuf, is passed
in the second argument of type struct io_dmabuf_token, which is defined
in the following patches.
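For illustration, an importer is expected to wire it up roughly as below
(a sketch; "foo" is a hypothetical driver, and the token fields filled
in here are introduced by the infrastructure patch later in the series):

	static int foo_create_dmabuf_token(struct file *file,
					   struct io_dmabuf_token *token)
	{
		/* attach the device to token->dmabuf, then publish the
		 * driver's ops table and private data */
		token->dev_priv = foo_data;
		token->dev_ops = &foo_dmabuf_token_ops;
		return 0;
	}

	static const struct file_operations foo_fops = {
		...
		.create_dmabuf_token	= foo_create_dmabuf_token,
	};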
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/linux/fs.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b5b01bb22d12..c5558aab4628 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1920,6 +1920,7 @@ struct dir_context {
struct io_uring_cmd;
struct offset_ctx;
+struct io_dmabuf_token;
typedef unsigned int __bitwise fop_flags_t;
@@ -1967,6 +1968,7 @@ struct file_operations {
int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
unsigned int poll_flags);
int (*mmap_prepare)(struct vm_area_desc *);
+ int (*create_dmabuf_token)(struct file *, struct io_dmabuf_token *);
} __randomize_layout;
/* Supports async buffered reads */
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 02/10] iov_iter: add iterator type for dmabuf maps
2026-04-29 15:25 [PATCH v3 00/10] Add dmabuf read/write via io_uring Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 01/10] file: add callback for creating long-term dmabuf maps Pavel Begunkov
@ 2026-04-29 15:25 ` Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 03/10] block: move bvec init into __bio_clone Pavel Begunkov
` (7 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-29 15:25 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: asml.silence, Nitesh Shetty, Kanchan Joshi, Anuj Gupta,
Tushar Gohad, William Power, Phil Cayton, Jason Gunthorpe
Introduce a new iterator type for dmabuf maps. The map is an opaque
object whose internals and format are specific to the owning subsystem /
driver, and only that subsystem / driver can use it for issuing IO. The
task of the middle layers is to pass the map / iterator further down,
possibly doing basic splitting and length checking. The iterator can
only be used by operations of the file the associated map was created
for.
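For example, once the owning file operation obtains the map,
initialising an iterator over a byte range of it could look like this
(a sketch; `map`, `off` and `len` are placeholders):

	struct iov_iter iter;

	/* READ: the premapped dmabuf is the data destination */
	iov_iter_dmabuf_map(&iter, READ, map, off, len);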
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/linux/uio.h | 11 +++++++++++
lib/iov_iter.c | 29 +++++++++++++++++++++++------
2 files changed, 34 insertions(+), 6 deletions(-)
diff --git a/include/linux/uio.h b/include/linux/uio.h
index a9bc5b3067e3..75051aed70de 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -12,6 +12,7 @@
struct page;
struct folio_queue;
+struct io_dmabuf_map;
typedef unsigned int __bitwise iov_iter_extraction_t;
@@ -29,6 +30,7 @@ enum iter_type {
ITER_FOLIOQ,
ITER_XARRAY,
ITER_DISCARD,
+ ITER_DMABUF_MAP,
};
#define ITER_SOURCE 1 // == WRITE
@@ -71,6 +73,7 @@ struct iov_iter {
const struct folio_queue *folioq;
struct xarray *xarray;
void __user *ubuf;
+ struct io_dmabuf_map *dmabuf_map;
};
size_t count;
};
@@ -155,6 +158,11 @@ static inline bool iov_iter_is_xarray(const struct iov_iter *i)
return iov_iter_type(i) == ITER_XARRAY;
}
+static inline bool iov_iter_is_dmabuf_map(const struct iov_iter *i)
+{
+ return iov_iter_type(i) == ITER_DMABUF_MAP;
+}
+
static inline unsigned char iov_iter_rw(const struct iov_iter *i)
{
return i->data_source ? WRITE : READ;
@@ -300,6 +308,9 @@ void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
unsigned int first_slot, unsigned int offset, size_t count);
void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
loff_t start, size_t count);
+void iov_iter_dmabuf_map(struct iov_iter *i, unsigned int direction,
+ struct io_dmabuf_map *map,
+ loff_t off, size_t count);
ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 243662af1af7..e2253684b991 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -575,7 +575,8 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
{
if (unlikely(i->count < size))
size = i->count;
- if (likely(iter_is_ubuf(i)) || unlikely(iov_iter_is_xarray(i))) {
+ if (likely(iter_is_ubuf(i)) || unlikely(iov_iter_is_xarray(i)) ||
+ unlikely(iov_iter_is_dmabuf_map(i))) {
i->iov_offset += size;
i->count -= size;
} else if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i))) {
@@ -631,7 +632,8 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
return;
}
unroll -= i->iov_offset;
- if (iov_iter_is_xarray(i) || iter_is_ubuf(i)) {
+ if (iov_iter_is_xarray(i) || iter_is_ubuf(i) ||
+ iov_iter_is_dmabuf_map(i)) {
BUG(); /* We should never go beyond the start of the specified
* range since we might then be straying into pages that
* aren't pinned.
@@ -775,6 +777,20 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction,
}
EXPORT_SYMBOL(iov_iter_xarray);
+void iov_iter_dmabuf_map(struct iov_iter *i, unsigned int direction,
+ struct io_dmabuf_map *map,
+ loff_t off, size_t count)
+{
+ WARN_ON(direction & ~(READ | WRITE));
+ *i = (struct iov_iter){
+ .iter_type = ITER_DMABUF_MAP,
+ .data_source = direction,
+ .dmabuf_map = map,
+ .count = count,
+ .iov_offset = off,
+ };
+}
+
/**
* iov_iter_discard - Initialise an I/O iterator that discards data
* @i: The iterator to initialise.
@@ -841,7 +857,7 @@ static unsigned long iov_iter_alignment_bvec(const struct iov_iter *i)
unsigned long iov_iter_alignment(const struct iov_iter *i)
{
- if (likely(iter_is_ubuf(i))) {
+ if (likely(iter_is_ubuf(i)) || iov_iter_is_dmabuf_map(i)) {
size_t size = i->count;
if (size)
return ((unsigned long)i->ubuf + i->iov_offset) | size;
@@ -872,7 +888,7 @@ unsigned long iov_iter_gap_alignment(const struct iov_iter *i)
size_t size = i->count;
unsigned k;
- if (iter_is_ubuf(i))
+ if (iter_is_ubuf(i) || iov_iter_is_dmabuf_map(i))
return 0;
if (WARN_ON(!iter_is_iovec(i)))
@@ -1469,11 +1485,12 @@ EXPORT_SYMBOL_GPL(import_ubuf);
void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
{
if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i) &&
- !iter_is_ubuf(i)) && !iov_iter_is_kvec(i))
+ !iter_is_ubuf(i) && !iov_iter_is_kvec(i) &&
+ !iov_iter_is_dmabuf_map(i)))
return;
i->iov_offset = state->iov_offset;
i->count = state->count;
- if (iter_is_ubuf(i))
+ if (iter_is_ubuf(i) || iov_iter_is_dmabuf_map(i))
return;
/*
* For the *vec iters, nr_segs + iov is constant - if we increment
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 03/10] block: move bvec init into __bio_clone
2026-04-29 15:25 [PATCH v3 00/10] Add dmabuf read/write via io_uring Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 01/10] file: add callback for creating long-term dmabuf maps Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 02/10] iov_iter: add iterator type for " Pavel Begunkov
@ 2026-04-29 15:25 ` Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 04/10] block: introduce dma map backed bio type Pavel Begunkov
` (6 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-29 15:25 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: asml.silence, Nitesh Shetty, Kanchan Joshi, Anuj Gupta,
Tushar Gohad, William Power, Phil Cayton, Jason Gunthorpe
To quote Christoph: "Historically __bio_clone itself does not clone the
payload, just the bio. But we got rid of the callers that want to clone
a bio but not the payload a long time ago". So let's move the
->bi_io_vec assignment into __bio_clone(), so there is a single point
where it's set.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
block/bio.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 4d46af0cd256..0734b50d4992 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -851,6 +851,7 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
bio->bi_write_hint = bio_src->bi_write_hint;
bio->bi_write_stream = bio_src->bi_write_stream;
bio->bi_iter = bio_src->bi_iter;
+ bio->bi_io_vec = bio_src->bi_io_vec;
if (bio->bi_bdev) {
if (bio->bi_bdev == bio_src->bi_bdev &&
@@ -893,8 +894,6 @@ struct bio *bio_alloc_clone(struct block_device *bdev, struct bio *bio_src,
bio_put(bio);
return NULL;
}
- bio->bi_io_vec = bio_src->bi_io_vec;
-
return bio;
}
EXPORT_SYMBOL(bio_alloc_clone);
@@ -914,7 +913,7 @@ int bio_init_clone(struct block_device *bdev, struct bio *bio,
{
int ret;
- bio_init(bio, bdev, bio_src->bi_io_vec, 0, bio_src->bi_opf);
+ bio_init(bio, bdev, NULL, 0, bio_src->bi_opf);
ret = __bio_clone(bio, bio_src, gfp);
if (ret)
bio_uninit(bio);
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 04/10] block: introduce dma map backed bio type
2026-04-29 15:25 [PATCH v3 00/10] Add dmabuf read/write via io_uring Pavel Begunkov
` (2 preceding siblings ...)
2026-04-29 15:25 ` [PATCH v3 03/10] block: move bvec init into __bio_clone Pavel Begunkov
@ 2026-04-29 15:25 ` Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 05/10] lib: add dmabuf token infrastructure Pavel Begunkov
` (5 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-29 15:25 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: asml.silence, Nitesh Shetty, Kanchan Joshi, Anuj Gupta,
Tushar Gohad, William Power, Phil Cayton, Jason Gunthorpe
Premapped buffers don't require a generic bio_vec array since they have
already been dma mapped. Repurpose the bi_io_vec space to store dmabuf
maps, as the two are mutually exclusive.
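Consumers must check the flag before interpreting the union, e.g.
(a sketch):

	if (bio_flagged(bio, BIO_DMABUF_MAP))
		map = bio->dmabuf_map;	/* premapped, no bvec array */
	else
		bv = &bio->bi_io_vec[0];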
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
block/bio.c | 25 ++++++++++++++++++++++++-
block/blk-merge.c | 14 ++++++++++++++
block/blk.h | 3 ++-
block/fops.c | 2 ++
include/linux/bio.h | 19 ++++++++++++++++---
include/linux/blk_types.h | 8 +++++++-
6 files changed, 65 insertions(+), 6 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 0734b50d4992..bdc91777c288 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -851,7 +851,13 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
bio->bi_write_hint = bio_src->bi_write_hint;
bio->bi_write_stream = bio_src->bi_write_stream;
bio->bi_iter = bio_src->bi_iter;
- bio->bi_io_vec = bio_src->bi_io_vec;
+
+ if (!bio_flagged(bio_src, BIO_DMABUF_MAP)) {
+ bio->bi_io_vec = bio_src->bi_io_vec;
+ } else {
+ bio->dmabuf_map = bio_src->dmabuf_map;
+ bio_set_flag(bio, BIO_DMABUF_MAP);
+ }
if (bio->bi_bdev) {
if (bio->bi_bdev == bio_src->bi_bdev &&
@@ -1183,6 +1189,18 @@ void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter)
bio_set_flag(bio, BIO_CLONED);
}
+void bio_dmabuf_map_set(struct bio *bio, struct iov_iter *iter)
+{
+ WARN_ON_ONCE(bio->bi_max_vecs);
+
+ bio->dmabuf_map = iter->dmabuf_map;
+ bio->bi_vcnt = 0;
+ bio->bi_iter.bi_bvec_done = iter->iov_offset;
+ bio->bi_iter.bi_size = iov_iter_count(iter);
+ bio->bi_opf |= REQ_NOMERGE;
+ bio_set_flag(bio, BIO_DMABUF_MAP);
+}
+
/*
* Aligns the bio size to the len_align_mask, releasing excessive bio vecs that
* __bio_iov_iter_get_pages may have inserted, and reverts the trimmed length
@@ -1252,6 +1270,11 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
iov_iter_advance(iter, bio->bi_iter.bi_size);
return 0;
}
+ if (iov_iter_is_dmabuf_map(iter)) {
+ bio_dmabuf_map_set(bio, iter);
+ iov_iter_advance(iter, bio->bi_iter.bi_size);
+ return 0;
+ }
if (iov_iter_extract_will_pin(iter))
bio_set_flag(bio, BIO_PAGE_PINNED);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index fcf09325b22e..fc2c0c428001 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -348,6 +348,19 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
len_align_mask |= (bc->bc_key->crypto_cfg.data_unit_size - 1);
}
+ if (bio_flagged(bio, BIO_DMABUF_MAP)) {
+ nsegs = 1;
+
+ if ((bio->bi_iter.bi_bvec_done & lim->dma_alignment) ||
+ (bio->bi_iter.bi_size & len_align_mask))
+ return -EINVAL;
+ if (bio->bi_iter.bi_size > max_bytes) {
+ bytes = max_bytes;
+ goto split;
+ }
+ goto out;
+ }
+
bio_for_each_bvec(bv, bio, iter) {
if (bv.bv_offset & start_align_mask ||
bv.bv_len & len_align_mask)
@@ -378,6 +391,7 @@ int bio_split_io_at(struct bio *bio, const struct queue_limits *lim,
bvprvp = &bvprv;
}
+out:
*segs = nsegs;
bio->bi_bvec_gap_bit = ffs(gaps);
return 0;
diff --git a/block/blk.h b/block/blk.h
index b998a7761faf..b4b09abebce8 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -424,7 +424,8 @@ static inline struct bio *__bio_split_to_limits(struct bio *bio,
switch (bio_op(bio)) {
case REQ_OP_READ:
case REQ_OP_WRITE:
- if (bio_may_need_split(bio, lim))
+ if (bio_may_need_split(bio, lim) ||
+ bio_flagged(bio, BIO_DMABUF_MAP))
return bio_split_rw(bio, lim, nr_segs);
*nr_segs = 1;
return bio;
diff --git a/block/fops.c b/block/fops.c
index bb6642b45937..713a3ba3f457 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -349,6 +349,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
* bio_iov_iter_get_pages() and set the bvec directly.
*/
bio_iov_bvec_set(bio, iter);
+ } else if (iov_iter_is_dmabuf_map(iter)) {
+ bio_dmabuf_map_set(bio, iter);
} else {
ret = blkdev_iov_iter_get_pages(bio, iter, bdev);
if (unlikely(ret))
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 97d747320b35..0c43fa6b0900 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -108,16 +108,26 @@ static inline bool bio_next_segment(const struct bio *bio,
#define bio_for_each_segment_all(bvl, bio, iter) \
for (bvl = bvec_init_iter_all(&iter); bio_next_segment((bio), &iter); )
+static inline void bio_advance_iter_dmabuf_map(struct bvec_iter *iter,
+ unsigned int bytes)
+{
+ iter->bi_bvec_done += bytes;
+ iter->bi_size -= bytes;
+}
+
static inline void bio_advance_iter(const struct bio *bio,
struct bvec_iter *iter, unsigned int bytes)
{
iter->bi_sector += bytes >> 9;
- if (bio_no_advance_iter(bio))
+ if (bio_no_advance_iter(bio)) {
iter->bi_size -= bytes;
- else
+ } else if (bio_flagged(bio, BIO_DMABUF_MAP)) {
+ bio_advance_iter_dmabuf_map(iter, bytes);
+ } else {
bvec_iter_advance(bio->bi_io_vec, iter, bytes);
/* TODO: It is reasonable to complete bio with error here. */
+ }
}
/* @bytes should be less or equal to bvec[i->bi_idx].bv_len */
@@ -129,6 +139,8 @@ static inline void bio_advance_iter_single(const struct bio *bio,
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
+ else if (bio_flagged(bio, BIO_DMABUF_MAP))
+ bio_advance_iter_dmabuf_map(iter, bytes);
else
bvec_iter_advance_single(bio->bi_io_vec, iter, bytes);
}
@@ -391,7 +403,7 @@ static inline void bio_wouldblock_error(struct bio *bio)
*/
static inline int bio_iov_vecs_to_alloc(struct iov_iter *iter, int max_segs)
{
- if (iov_iter_is_bvec(iter))
+ if (iov_iter_is_bvec(iter) || iov_iter_is_dmabuf_map(iter))
return 0;
return iov_iter_npages(iter, max_segs);
}
@@ -471,6 +483,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter,
unsigned len_align_mask);
void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter);
+void bio_dmabuf_map_set(struct bio *bio, struct iov_iter *iter);
void __bio_release_pages(struct bio *bio, bool mark_dirty);
extern void bio_set_pages_dirty(struct bio *bio);
extern void bio_check_pages_dirty(struct bio *bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8808ee76e73c..d5ad085b701d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -233,7 +233,12 @@ struct bio {
atomic_t __bi_remaining;
/* The actual vec list, preserved by bio_reset() */
- struct bio_vec *bi_io_vec;
+ union {
+ struct bio_vec *bi_io_vec;
+ /* Driver specific dma map, present only with BIO_DMABUF_MAP */
+ struct io_dmabuf_map *dmabuf_map;
+ };
+
struct bvec_iter bi_iter;
union {
@@ -322,6 +327,7 @@ enum {
BIO_REMAPPED,
BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
+	BIO_DMABUF_MAP,	/* Using premapped dma buffers */
BIO_FLAG_LAST
};
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 05/10] lib: add dmabuf token infrastructure
2026-04-29 15:25 [PATCH v3 00/10] Add dmabuf read/write via io_uring Pavel Begunkov
` (3 preceding siblings ...)
2026-04-29 15:25 ` [PATCH v3 04/10] block: introduce dma map backed bio type Pavel Begunkov
@ 2026-04-29 15:25 ` Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 06/10] block: forward create_dmabuf_token to drivers Pavel Begunkov
` (4 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-29 15:25 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: asml.silence, Nitesh Shetty, Kanchan Joshi, Anuj Gupta,
Tushar Gohad, William Power, Phil Cayton, Jason Gunthorpe
There are two main objects: struct io_dmabuf_token and struct
io_dmabuf_map. The token is used during initial registration and serves
as an interface between an upper layer user like io_uring and the
importer subsystem / driver. io_dmabuf_map represents the actual dma map
established for the target device[s] with dma_buf_map_attachment() and
stored in a device specific format.
The separation into two different objects exists to support map
invalidation (see dma_buf_invalidate_mappings()). A token can create
multiple maps during its lifetime, but there can only be one (active)
map attached to it at a time. It's also possible to have no active map.
Invalidation drops the active map if present, and creation of the next
map is only attempted once there is a new request that wants to use the
token.
The primary task of the io_dmabuf_map object is to count all requests
currently using it, which is done with percpu refcounts. When a map is
invalidated, it is removed from the token so that no new requests can
take it, and a fence is added to the dmabuf reservation object. Once
all the requests complete, the fence is signalled and the map is
unmapped.
[un]mapping and any work with dma addresses is delegated to the
importer driver via an ops table stored in the token, see struct
io_dmabuf_token_dev_ops. That's required because the generic layer has
no knowledge of the device the map is going to be used with, and there
will be more complex use cases involving multiple devices.
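From a request's point of view the expected usage is roughly the
following (a sketch based on the API below; error handling trimmed):

	struct io_dmabuf_map *map;

	map = io_dmabuf_get_map(token);
	if (!map) {
		/* no active map, e.g. after invalidation: build a new one */
		map = io_dmabuf_create_map(token);
		if (IS_ERR(map))
			return PTR_ERR(map);
	}

	/* ... issue IO carrying the map ... */

	io_dmabuf_map_drop(map);	/* on IO completion */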
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/linux/io_dmabuf_token.h | 92 +++++++++++
lib/Kconfig | 4 +
lib/Makefile | 2 +
lib/io_dmabuf_token.c | 272 ++++++++++++++++++++++++++++++++
4 files changed, 370 insertions(+)
create mode 100644 include/linux/io_dmabuf_token.h
create mode 100644 lib/io_dmabuf_token.c
diff --git a/include/linux/io_dmabuf_token.h b/include/linux/io_dmabuf_token.h
new file mode 100644
index 000000000000..b94bda684812
--- /dev/null
+++ b/include/linux/io_dmabuf_token.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_DMA_TOKEN_H
+#define _LINUX_DMA_TOKEN_H
+
+#include <linux/dma-buf.h>
+
+struct io_dmabuf_fence;
+struct io_dmabuf_token;
+struct io_dmabuf_map;
+
+struct io_dmabuf_token_dev_ops {
+ /*
+ * Create a new map for the given token. It should be initialised
+ * with io_dmabuf_init_map(). The callback is executed with the
+ * reservation lock held.
+ */
+ struct io_dmabuf_map *(*map)(struct io_dmabuf_token *);
+
+ /*
+ * Clean up device specific parts of the map. The callback is
+ * executed with the reservation lock held.
+ */
+ void (*unmap)(struct io_dmabuf_token *, struct io_dmabuf_map *);
+
+ /*
+ * The user tries to destroy the token. Release all device specific
+ * parts of the token.
+ */
+ void (*release)(struct io_dmabuf_token *);
+};
+
+struct io_dmabuf_map {
+ /*
+ * Counts attached requests and other users. Device specific unmapping
+ * is deferred until all refs are dropped.
+ */
+ struct percpu_ref refs;
+
+ struct work_struct release_work;
+ struct io_dmabuf_fence *fence;
+ struct io_dmabuf_token *token;
+};
+
+struct io_dmabuf_token {
+ struct io_dmabuf_map __rcu *map;
+ struct dma_buf *dmabuf;
+ enum dma_data_direction dir;
+
+ atomic_t fence_seq;
+ u64 fence_ctx;
+ struct work_struct release_work;
+ refcount_t refs;
+
+ void *dev_priv;
+ const struct io_dmabuf_token_dev_ops *dev_ops;
+};
+
+int io_dmabuf_token_create(struct file *file,
+ struct io_dmabuf_token *token,
+ struct dma_buf *dmabuf,
+ enum dma_data_direction dir);
+void io_dmabuf_token_release(struct io_dmabuf_token *token);
+
+struct io_dmabuf_map *io_dmabuf_create_map(struct io_dmabuf_token *token);
+
+static inline struct io_dmabuf_map *io_dmabuf_get_map(struct io_dmabuf_token *token)
+{
+ struct io_dmabuf_map *map;
+
+ guard(rcu)();
+
+ map = rcu_dereference(token->map);
+ if (unlikely(!map || !percpu_ref_tryget_live_rcu(&map->refs)))
+ return NULL;
+
+ return map;
+}
+
+static inline void io_dmabuf_map_drop(struct io_dmabuf_map *map)
+{
+ percpu_ref_put(&map->refs);
+}
+
+/*
+ * Device API
+ */
+
+void io_dmabuf_token_invalidate_mappings(struct io_dmabuf_token *token);
+int io_dmabuf_init_map(struct io_dmabuf_token *token, struct io_dmabuf_map *map);
+
+
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index 0f2fb9610647..853f10bf8e1a 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -636,3 +636,7 @@ config UNION_FIND
config MIN_HEAP
bool
+
+config DMABUF_TOKEN
+ def_bool y
+ depends on DMA_SHARED_BUFFER
diff --git a/lib/Makefile b/lib/Makefile
index ea660cca04f4..4a42cfcaa80c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -246,6 +246,8 @@ obj-$(CONFIG_IRQ_POLL) += irq_poll.o
obj-$(CONFIG_POLYNOMIAL) += polynomial.o
+obj-$(CONFIG_DMABUF_TOKEN) += io_dmabuf_token.o
+
# stackdepot.c should not be instrumented or call instrumented functions.
# Prevent the compiler from calling builtins like memcmp() or bcmp() from this
# file.
diff --git a/lib/io_dmabuf_token.c b/lib/io_dmabuf_token.c
new file mode 100644
index 000000000000..808b5ad33dbc
--- /dev/null
+++ b/lib/io_dmabuf_token.c
@@ -0,0 +1,272 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Common infrastructure for supporting dma-buf in the I/O path.
+ *
+ * Copyright (C) 2026 Pavel Begunkov <asml.silence@gmail.com>
+ */
+#include <linux/io_dmabuf_token.h>
+#include <linux/dma-resv.h>
+
+struct io_dmabuf_fence {
+ struct dma_fence base;
+ spinlock_t lock;
+};
+
+static const char *io_dmabuf_fence_drv_name(struct dma_fence *fence)
+{
+ /* default fence release kfree's the base pointer */
+ BUILD_BUG_ON(offsetof(struct io_dmabuf_fence, base));
+
+ return "DMABUF token";
+}
+
+static const char *io_dmabuf_fence_timeline_name(struct dma_fence *fence)
+{
+ return "DMABUF token";
+}
+
+const struct dma_fence_ops io_dmabuf_fence_ops = {
+ .get_driver_name = io_dmabuf_fence_drv_name,
+ .get_timeline_name = io_dmabuf_fence_timeline_name,
+};
+
+static void io_dmabuf_token_destroy_work(struct work_struct *work)
+{
+ struct io_dmabuf_token *token = container_of(work, struct io_dmabuf_token,
+ release_work);
+
+ if (WARN_ON_ONCE(refcount_read(&token->refs)))
+ return;
+
+ token->dev_ops->release(token);
+ dma_buf_put(token->dmabuf);
+ kfree(token);
+}
+
+static void io_dmabuf_map_release_work(struct work_struct *work)
+{
+ struct io_dmabuf_map *map = container_of(work, struct io_dmabuf_map,
+ release_work);
+ struct io_dmabuf_fence *fence = map->fence;
+ struct io_dmabuf_token *token = map->token;
+ struct dma_buf *dmabuf = token->dmabuf;
+
+ /* the release path must wait for fences */
+ if (WARN_ON_ONCE(refcount_read(&token->refs) == 0))
+ return;
+
+	/* Prevent the token from being destroyed while unmapping */
+ refcount_inc(&token->refs);
+
+ /*
+ * There are no more requests using the map, we can signal the fence.
+ * It should be done before taking the resv lock as someone could be
+ * waiting for the fence while holding the lock.
+ */
+ dma_fence_signal(&fence->base);
+
+ dma_resv_lock(dmabuf->resv, NULL);
+ token->dev_ops->unmap(token, map);
+ dma_resv_unlock(dmabuf->resv);
+
+ dma_fence_put(&fence->base);
+ percpu_ref_exit(&map->refs);
+ kfree(map);
+
+ if (refcount_dec_and_test(&token->refs)) {
+ /*
+ * Destruction needs to wait for I/O and dma fences. Defer it to
+ * simplify locking.
+ */
+ INIT_WORK(&token->release_work, io_dmabuf_token_destroy_work);
+ queue_work(system_wq, &token->release_work);
+ }
+}
+
+static void io_dmabuf_map_refs_release(struct percpu_ref *ref)
+{
+ struct io_dmabuf_map *map = container_of(ref, struct io_dmabuf_map, refs);
+
+ /* might sleep, use a worker */
+ INIT_WORK(&map->release_work, io_dmabuf_map_release_work);
+ queue_work(system_wq, &map->release_work);
+}
+
+int io_dmabuf_init_map(struct io_dmabuf_token *token, struct io_dmabuf_map *map)
+{
+ struct io_dmabuf_fence *fence = NULL;
+ int ret;
+
+ fence = kzalloc(sizeof(*fence), GFP_KERNEL);
+ if (!fence)
+ return -ENOMEM;
+
+ ret = percpu_ref_init(&map->refs, io_dmabuf_map_refs_release, 0, GFP_KERNEL);
+ if (ret) {
+ kfree(fence);
+ return ret;
+ }
+
+ spin_lock_init(&fence->lock);
+ dma_fence_init(&fence->base, &io_dmabuf_fence_ops, &fence->lock,
+ token->fence_ctx, atomic_inc_return(&token->fence_seq));
+ map->fence = fence;
+ map->token = token;
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(io_dmabuf_init_map, "DMA_BUF");
+
+struct io_dmabuf_map *io_dmabuf_create_map(struct io_dmabuf_token *token)
+{
+ struct dma_buf *dmabuf = token->dmabuf;
+ struct io_dmabuf_map *map;
+ long ret;
+
+retry:
+ /*
+	 * ->map() will be calling dma_buf_map_attachment(), for which
+	 * we'll need to wait for fences. Be nice and try to wait
+	 * without the resv lock first.
+ */
+ ret = dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_KERNEL,
+ true, MAX_SCHEDULE_TIMEOUT);
+ if (!ret)
+ ret = -EAGAIN;
+ if (ret < 0)
+ return ERR_PTR(ret);
+
+ dma_resv_lock(dmabuf->resv, NULL);
+ map = io_dmabuf_get_map(token);
+ if (map) {
+ ret = 0;
+ goto out;
+ }
+
+ if (dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_KERNEL,
+ true, 0) < 0) {
+ dma_resv_unlock(dmabuf->resv);
+ goto retry;
+ }
+
+ map = token->dev_ops->map(token);
+ if (IS_ERR(map)) {
+ ret = PTR_ERR(map);
+ goto out;
+ }
+
+ percpu_ref_get(&map->refs);
+ rcu_assign_pointer(token->map, map);
+out:
+ dma_resv_unlock(dmabuf->resv);
+ if (ret < 0)
+ return ERR_PTR(ret);
+ return map;
+}
+
+static void io_dmabuf_drop_map(struct io_dmabuf_token *token)
+{
+ struct dma_buf *dmabuf = token->dmabuf;
+ struct io_dmabuf_map *map;
+ int ret;
+
+ dma_resv_assert_held(dmabuf->resv);
+
+ map = rcu_dereference_protected(token->map,
+ dma_resv_held(dmabuf->resv));
+ if (!map)
+ return;
+ rcu_assign_pointer(token->map, NULL);
+
+ ret = dma_resv_reserve_fences(dmabuf->resv, 1);
+ if (WARN_ON_ONCE(ret)) {
+ struct dma_fence *fence = &map->fence->base;
+
+ dma_fence_get(fence);
+ percpu_ref_kill(&map->refs);
+ dma_fence_wait(fence, false);
+ dma_fence_put(fence);
+ return;
+ }
+
+ dma_resv_add_fence(dmabuf->resv, &map->fence->base,
+ DMA_RESV_USAGE_KERNEL);
+ /*
+ * Delay destruction until all inflight requests using the map are
+ * gone. It'll also signal the fence then.
+ */
+ percpu_ref_kill(&map->refs);
+}
+
+void io_dmabuf_token_invalidate_mappings(struct io_dmabuf_token *token)
+{
+ io_dmabuf_drop_map(token);
+}
+EXPORT_SYMBOL_NS_GPL(io_dmabuf_token_invalidate_mappings, "DMA_BUF");
+
+static void io_dmabuf_token_release_work(struct work_struct *work)
+{
+ struct io_dmabuf_token *token = container_of(work, struct io_dmabuf_token,
+ release_work);
+ struct dma_buf *dmabuf = token->dmabuf;
+ long ret;
+
+ dma_resv_lock(dmabuf->resv, NULL);
+ /* Remove the last map, there should be no new ones going forward. */
+ io_dmabuf_drop_map(token);
+ dma_resv_unlock(dmabuf->resv);
+
+ /* Wait until all maps are destroyed. */
+ ret = dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_KERNEL,
+ false, MAX_SCHEDULE_TIMEOUT);
+
+ if (WARN_ON_ONCE(ret <= 0))
+ return;
+ if (WARN_ON_ONCE(rcu_dereference_protected(token->map, true)))
+ return;
+
+ if (refcount_dec_and_test(&token->refs))
+ io_dmabuf_token_destroy_work(&token->release_work);
+}
+
+void io_dmabuf_token_release(struct io_dmabuf_token *token)
+{
+ /*
+ * Destruction needs to wait for I/O and dma fences. Defer it to
+ * simplify locking.
+ */
+ INIT_WORK(&token->release_work, io_dmabuf_token_release_work);
+ queue_work(system_wq, &token->release_work);
+}
+
+int io_dmabuf_token_create(struct file *file,
+ struct io_dmabuf_token *token,
+ struct dma_buf *dmabuf,
+ enum dma_data_direction dir)
+{
+ int ret;
+
+ if (!file->f_op->create_dmabuf_token)
+ return -EOPNOTSUPP;
+
+ memset(token, 0, sizeof(*token));
+ token->fence_ctx = dma_fence_context_alloc(1);
+ token->dir = dir;
+ token->dmabuf = dmabuf;
+ refcount_set(&token->refs, 1);
+ get_dma_buf(dmabuf);
+
+ ret = file->f_op->create_dmabuf_token(file, token);
+ if (ret) {
+ memset(token, 0, sizeof(*token));
+ dma_buf_put(dmabuf);
+ return ret;
+ }
+
+ if (WARN_ON_ONCE(!token->dev_ops ||
+ !token->dev_ops->map ||
+ !token->dev_ops->unmap ||
+ !token->dev_ops->release))
+ return -EINVAL;
+
+ return ret;
+}
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 06/10] block: forward create_dmabuf_token to drivers
2026-04-29 15:25 [PATCH v3 00/10] Add dmabuf read/write via io_uring Pavel Begunkov
` (4 preceding siblings ...)
2026-04-29 15:25 ` [PATCH v3 05/10] lib: add dmabuf token infrastructure Pavel Begunkov
@ 2026-04-29 15:25 ` Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 07/10] nvme-pci: implement dma_token backed requests Pavel Begunkov
` (3 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-29 15:25 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: asml.silence, Nitesh Shetty, Kanchan Joshi, Anuj Gupta,
Tushar Gohad, William Power, Phil Cayton, Jason Gunthorpe
Add a trivial implementation of the create_dmabuf_token callback for
block devices that forwards the call to a new blk-mq callback when one
is available.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
block/fops.c | 14 ++++++++++++++
include/linux/blk-mq.h | 9 +++++++++
2 files changed, 23 insertions(+)
diff --git a/block/fops.c b/block/fops.c
index 713a3ba3f457..3d8a48a7d645 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -951,6 +951,19 @@ static int blkdev_mmap_prepare(struct vm_area_desc *desc)
return generic_file_mmap_prepare(desc);
}
+static int blkdev_create_dmabuf_token(struct file *file,
+ struct io_dmabuf_token *token)
+{
+ struct request_queue *q = bdev_get_queue(file_bdev(file));
+
+ if (!(file->f_flags & O_DIRECT))
+ return -EINVAL;
+ if (!q->mq_ops || !q->mq_ops->create_dmabuf_token)
+ return -EINVAL;
+
+ return q->mq_ops->create_dmabuf_token(q, token);
+}
+
const struct file_operations def_blk_fops = {
.open = blkdev_open,
.release = blkdev_release,
@@ -969,6 +982,7 @@ const struct file_operations def_blk_fops = {
.fallocate = blkdev_fallocate,
.uring_cmd = blkdev_uring_cmd,
.fop_flags = FOP_BUFFER_RASYNC,
+ .create_dmabuf_token = blkdev_create_dmabuf_token,
};
static __init int blkdev_init(void)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..ee31fb3ada10 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -15,6 +15,8 @@ struct blk_mq_tags;
struct blk_flush_queue;
struct io_comp_batch;
+struct io_dmabuf_token;
+
#define BLKDEV_MIN_RQ 4
#define BLKDEV_DEFAULT_RQ 128
@@ -684,6 +686,13 @@ struct blk_mq_ops {
*/
void (*show_rq)(struct seq_file *m, struct request *rq);
#endif
+
+ /**
+	 * @create_dmabuf_token: Create a dmabuf token, which will be used
+	 * to map a dmabuf for IO requests.
+ */
+ int (*create_dmabuf_token)(struct request_queue *,
+ struct io_dmabuf_token *token);
};
/* Keep hctx_flag_name[] in sync with the definitions below */
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 07/10] nvme-pci: implement dma_token backed requests
2026-04-29 15:25 [PATCH v3 00/10] Add dmabuf read/write via io_uring Pavel Begunkov
` (5 preceding siblings ...)
2026-04-29 15:25 ` [PATCH v3 06/10] block: forward create_dmabuf_token to drivers Pavel Begunkov
@ 2026-04-29 15:25 ` Pavel Begunkov
2026-04-29 15:29 ` Pavel Begunkov
2026-04-29 16:07 ` Maurizio Lombardi
2026-04-29 15:25 ` [PATCH v3 08/10] io_uring/rsrc: introduce buf registration structure Pavel Begunkov
` (2 subsequent siblings)
9 siblings, 2 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-29 15:25 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: asml.silence, Nitesh Shetty, Kanchan Joshi, Anuj Gupta,
Tushar Gohad, William Power, Phil Cayton, Jason Gunthorpe
Enable BIO_DMABUF_MAP backed requests. A prp list is created for the
dmabuf when it's mapped and is then used to initialise requests.
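The map's dma_list is a flat array of NVME_CTRL_PAGE_SIZE sized dma
addresses covering the whole dmabuf, so resolving a request at byte
offset `off` within the buffer boils down to (a sketch mirroring the
code below):

	map_idx = off / NVME_CTRL_PAGE_SIZE;
	prp1 = dma_list[map_idx] + (off & (NVME_CTRL_PAGE_SIZE - 1));
	/* subsequent PRP entries are dma_list[map_idx + 1], ... */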
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
drivers/nvme/host/pci.c | 282 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 282 insertions(+)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index db5fc9bf6627..d2629853a972 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -27,6 +27,8 @@
#include <linux/io-64-nonatomic-lo-hi.h>
#include <linux/io-64-nonatomic-hi-lo.h>
#include <linux/sed-opal.h>
+#include <linux/io_dmabuf_token.h>
+#include <linux/dma-resv.h>
#include "trace.h"
#include "nvme.h"
@@ -393,6 +395,17 @@ struct nvme_queue {
struct completion delete_done;
};
+struct nvme_dmabuf_token {
+ struct dma_buf_attachment *attach;
+};
+
+struct nvme_dmabuf_map {
+ struct io_dmabuf_map base;
+ dma_addr_t *dma_list;
+ struct sg_table *sgt;
+ unsigned nr_entries;
+};
+
/* bits for iod->flags */
enum nvme_iod_flags {
/* this command has been aborted by the timeout handler */
@@ -854,6 +867,134 @@ static void nvme_free_descriptors(struct request *req)
}
}
+static void nvme_dmabuf_map_sync(struct nvme_dev *nvme_dev, struct request *req,
+ bool for_cpu)
+{
+ int length = blk_rq_payload_bytes(req);
+ struct device *dev = nvme_dev->dev;
+ enum dma_data_direction dma_dir;
+ struct bio *bio = req->bio;
+ struct nvme_dmabuf_map *map;
+ dma_addr_t *dma_list;
+ int offset, map_idx;
+
+ dma_dir = rq_data_dir(req) == READ ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
+ map = container_of(bio->dmabuf_map, struct nvme_dmabuf_map, base);
+ dma_list = map->dma_list;
+
+ offset = bio->bi_iter.bi_bvec_done;
+ map_idx = offset / NVME_CTRL_PAGE_SIZE;
+ length += offset & (NVME_CTRL_PAGE_SIZE - 1);
+
+ while (length > 0) {
+ u64 dma_addr = dma_list[map_idx++];
+
+ if (for_cpu)
+ __dma_sync_single_for_cpu(dev, dma_addr,
+ NVME_CTRL_PAGE_SIZE, dma_dir);
+ else
+ __dma_sync_single_for_device(dev, dma_addr,
+ NVME_CTRL_PAGE_SIZE,
+ dma_dir);
+ length -= NVME_CTRL_PAGE_SIZE;
+ }
+}
+
+static void nvme_rq_clean_dmabuf_map(struct nvme_dev *dev,
+ struct request *req)
+{
+ struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+
+ nvme_dmabuf_map_sync(dev, req, true);
+
+ if (!(iod->flags & IOD_SINGLE_SEGMENT))
+ nvme_free_descriptors(req);
+}
+
+static blk_status_t nvme_rq_setup_dmabuf_map(struct request *req,
+ struct nvme_queue *nvmeq)
+{
+ struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ int length = blk_rq_payload_bytes(req);
+ u64 dma_addr, prp1_dma, prp2_dma;
+ struct bio *bio = req->bio;
+ struct nvme_dmabuf_map *map;
+ dma_addr_t *dma_list;
+ dma_addr_t prp_dma;
+ __le64 *prp_list;
+ int i, map_idx;
+ int offset;
+
+ nvme_dmabuf_map_sync(nvmeq->dev, req, false);
+
+ map = container_of(bio->dmabuf_map, struct nvme_dmabuf_map, base);
+ dma_list = map->dma_list;
+
+ offset = bio->bi_iter.bi_bvec_done;
+ map_idx = offset / NVME_CTRL_PAGE_SIZE;
+ offset &= (NVME_CTRL_PAGE_SIZE - 1);
+ prp1_dma = dma_list[map_idx++] + offset;
+
+ length -= (NVME_CTRL_PAGE_SIZE - offset);
+ if (length <= 0) {
+ prp2_dma = 0;
+ goto done;
+ }
+
+ if (length <= NVME_CTRL_PAGE_SIZE) {
+ prp2_dma = dma_list[map_idx];
+ goto done;
+ }
+
+ if (DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE) <=
+ NVME_SMALL_POOL_SIZE / sizeof(__le64))
+ iod->flags |= IOD_SMALL_DESCRIPTOR;
+
+ prp_list = dma_pool_alloc(nvme_dma_pool(nvmeq, iod), GFP_ATOMIC,
+ &prp_dma);
+ if (!prp_list)
+ return BLK_STS_RESOURCE;
+
+ iod->descriptors[iod->nr_descriptors++] = prp_list;
+ prp2_dma = prp_dma;
+ i = 0;
+ for (;;) {
+ if (i == NVME_CTRL_PAGE_SIZE >> 3) {
+ __le64 *old_prp_list = prp_list;
+
+ prp_list = dma_pool_alloc(nvmeq->descriptor_pools.large,
+ GFP_ATOMIC, &prp_dma);
+ if (!prp_list)
+ goto free_prps;
+ iod->descriptors[iod->nr_descriptors++] = prp_list;
+ prp_list[0] = old_prp_list[i - 1];
+ old_prp_list[i - 1] = cpu_to_le64(prp_dma);
+ i = 1;
+ }
+
+ dma_addr = dma_list[map_idx++];
+ prp_list[i++] = cpu_to_le64(dma_addr);
+
+ length -= NVME_CTRL_PAGE_SIZE;
+ if (length <= 0)
+ break;
+ }
+done:
+ iod->cmd.common.dptr.prp1 = cpu_to_le64(prp1_dma);
+ iod->cmd.common.dptr.prp2 = cpu_to_le64(prp2_dma);
+ return BLK_STS_OK;
+free_prps:
+ nvme_free_descriptors(req);
+ return BLK_STS_RESOURCE;
+}
+
+static inline bool nvme_rq_is_dmabuf_attached(struct request *req)
+{
+ if (!IS_ENABLED(CONFIG_DMABUF_TOKEN))
+ return false;
+ return req->bio && bio_flagged(req->bio, BIO_DMABUF_MAP);
+}
+
static void nvme_free_prps(struct request *req, unsigned int attrs)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -932,6 +1073,11 @@ static void nvme_unmap_data(struct request *req)
struct device *dma_dev = nvmeq->dev->dev;
unsigned int attrs = 0;
+ if (nvme_rq_is_dmabuf_attached(req)) {
+ nvme_rq_clean_dmabuf_map(nvmeq->dev, req);
+ return;
+ }
+
if (iod->flags & IOD_SINGLE_SEGMENT) {
static_assert(offsetof(union nvme_data_ptr, prp1) ==
offsetof(union nvme_data_ptr, sgl.addr));
@@ -1222,6 +1368,9 @@ static blk_status_t nvme_map_data(struct request *req)
struct blk_dma_iter iter;
blk_status_t ret;
+ if (nvme_rq_is_dmabuf_attached(req))
+ return nvme_rq_setup_dmabuf_map(req, nvmeq);
+
/*
* Try to skip the DMA iterator for single segment requests, as that
* significantly improves performances for small I/O sizes.
@@ -2238,6 +2387,134 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
return result;
}
+#ifdef CONFIG_DMABUF_TOKEN
+static void nvme_dmabuf_invalidate_mappings(struct dma_buf_attachment *attach)
+{
+ struct io_dmabuf_token *token = attach->importer_priv;
+
+ io_dmabuf_token_invalidate_mappings(token);
+}
+
+const struct dma_buf_attach_ops nvme_dmabuf_importer_ops = {
+ .invalidate_mappings = nvme_dmabuf_invalidate_mappings,
+ .allow_peer2peer = true,
+};
+
+static struct io_dmabuf_map *nvme_dmabuf_token_map(struct io_dmabuf_token *token)
+{
+ struct nvme_dmabuf_token *data = token->dev_priv;
+ struct dma_buf_attachment *attach = data->attach;
+ dma_addr_t *dma_list = NULL;
+ unsigned long tmp, i = 0;
+ struct nvme_dmabuf_map *map;
+ struct scatterlist *sg;
+ struct sg_table *sgt;
+ unsigned nr_entries;
+ int ret;
+
+ dma_resv_assert_held(token->dmabuf->resv);
+
+ map = kmalloc(sizeof(*map), GFP_KERNEL);
+ if (!map)
+ return ERR_PTR(-ENOMEM);
+
+ nr_entries = token->dmabuf->size / NVME_CTRL_PAGE_SIZE;
+ dma_list = kmalloc_array(nr_entries, sizeof(dma_list[0]), GFP_KERNEL);
+ if (!dma_list) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ sgt = dma_buf_map_attachment(attach, token->dir);
+ if (IS_ERR(sgt)) {
+ ret = PTR_ERR(sgt);
+ sgt = NULL;
+ goto err;
+ }
+
+ for_each_sgtable_dma_sg(sgt, sg, tmp) {
+ dma_addr_t dma_addr = sg_dma_address(sg);
+ unsigned long sg_len = sg_dma_len(sg);
+
+ if (sg_len % NVME_CTRL_PAGE_SIZE) {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ while (sg_len) {
+ dma_list[i++] = dma_addr;
+ dma_addr += NVME_CTRL_PAGE_SIZE;
+ sg_len -= NVME_CTRL_PAGE_SIZE;
+ }
+ }
+
+ ret = io_dmabuf_init_map(token, &map->base);
+ if (ret)
+ goto err;
+ map->nr_entries = nr_entries;
+ map->dma_list = dma_list;
+ map->sgt = sgt;
+ return &map->base;
+err:
+ if (sgt)
+ dma_buf_unmap_attachment(attach, sgt, token->dir);
+ kfree(map);
+ kfree(dma_list);
+ return ERR_PTR(ret);
+}
+
+static void nvme_dmabuf_token_unmap(struct io_dmabuf_token *token,
+ struct io_dmabuf_map *map_base)
+{
+ struct nvme_dmabuf_token *data = token->dev_priv;
+ struct nvme_dmabuf_map *map = container_of(map_base,
+ struct nvme_dmabuf_map, base);
+
+ dma_resv_assert_held(token->dmabuf->resv);
+
+ dma_buf_unmap_attachment(data->attach, map->sgt, token->dir);
+ kfree(map->dma_list);
+}
+
+static void nvme_dmabuf_token_release(struct io_dmabuf_token *token)
+{
+ struct nvme_dmabuf_token *data = token->dev_priv;
+
+ dma_buf_detach(token->dmabuf, data->attach);
+ kfree(data);
+}
+
+const struct io_dmabuf_token_dev_ops nvme_dma_token_ops = {
+ .map = nvme_dmabuf_token_map,
+ .unmap = nvme_dmabuf_token_unmap,
+ .release = nvme_dmabuf_token_release,
+};
+
+static int nvme_create_dmabuf_token(struct request_queue *q,
+ struct io_dmabuf_token *token)
+{
+ struct nvme_dmabuf_token *data;
+ struct dma_buf_attachment *attach;
+ struct nvme_ns *ns = q->queuedata;
+ struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
+ struct dma_buf *dmabuf = token->dmabuf;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+ if (!data)
+ return -ENOMEM;
+
+ token->dev_priv = data;
+ token->dev_ops = &nvme_dma_token_ops;
+
+ attach = dma_buf_dynamic_attach(dmabuf, dev->dev,
+ &nvme_dmabuf_importer_ops, token);
+ if (IS_ERR(attach))
+ return PTR_ERR(attach);
+ data->attach = attach;
+ return 0;
+}
+#endif
+
static const struct blk_mq_ops nvme_mq_admin_ops = {
.queue_rq = nvme_queue_rq,
.complete = nvme_pci_complete_rq,
@@ -2256,6 +2533,10 @@ static const struct blk_mq_ops nvme_mq_ops = {
.map_queues = nvme_pci_map_queues,
.timeout = nvme_timeout,
.poll = nvme_poll,
+
+#ifdef CONFIG_DMABUF_TOKEN
+ .create_dmabuf_token = nvme_create_dmabuf_token,
+#endif
};
static void nvme_dev_remove_admin(struct nvme_dev *dev)
@@ -4289,5 +4570,6 @@ MODULE_AUTHOR("Matthew Wilcox <willy@linux.intel.com>");
MODULE_LICENSE("GPL");
MODULE_VERSION("1.0");
MODULE_DESCRIPTION("NVMe host PCIe transport driver");
+MODULE_IMPORT_NS("DMA_BUF");
module_init(nvme_init);
module_exit(nvme_exit);
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 08/10] io_uring/rsrc: introduce buf registration structure
2026-04-29 15:25 [PATCH v3 00/10] Add dmabuf read/write via io_uring Pavel Begunkov
` (6 preceding siblings ...)
2026-04-29 15:25 ` [PATCH v3 07/10] nvme-pci: implement dma_token backed requests Pavel Begunkov
@ 2026-04-29 15:25 ` Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 09/10] io_uring/rsrc: extend buffer update Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 10/10] io_uring/rsrc: add dmabuf backed registered buffers Pavel Begunkov
9 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-29 15:25 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: asml.silence, Nitesh Shetty, Kanchan Joshi, Anuj Gupta,
Tushar Gohad, William Power, Phil Cayton, Jason Gunthorpe
In preparation for the following changes, instead of passing an iovec
for buffer registration, introduce a new structure. It'll be moved to
the uapi later, but for now it's initialised early from a user provided
iovec.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
io_uring/rsrc.c | 50 +++++++++++++++++++++++++++++++++----------------
1 file changed, 34 insertions(+), 16 deletions(-)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index c4a7a77d1ee9..ba00238941ed 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -27,8 +27,14 @@ struct io_rsrc_update {
u32 offset;
};
+struct io_uring_regbuf_desc {
+ __u64 uaddr;
+ __u64 size;
+};
+
static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
- struct iovec *iov, struct page **last_hpage);
+ struct io_uring_regbuf_desc *desc,
+ struct page **last_hpage);
/* only define max */
#define IORING_MAX_FIXED_FILES (1U << 20)
@@ -36,6 +42,15 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
#define IO_CACHED_BVECS_SEGS 32
+static void io_iov_to_regbuf_desc(const struct iovec *iov,
+ struct io_uring_regbuf_desc *desc)
+{
+ *desc = (struct io_uring_regbuf_desc) {
+ .uaddr = (u64)iov->iov_base,
+ .size = iov->iov_len,
+ };
+}
+
int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
{
unsigned long page_limit, cur_pages, new_pages;
@@ -291,6 +306,7 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
return -EINVAL;
for (done = 0; done < nr_args; done++) {
+ struct io_uring_regbuf_desc desc;
struct io_rsrc_node *node;
u64 tag = 0;
@@ -304,7 +320,9 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
err = -EFAULT;
break;
}
- node = io_sqe_buffer_register(ctx, iov, &last_hpage);
+
+ io_iov_to_regbuf_desc(iov, &desc);
+ node = io_sqe_buffer_register(ctx, &desc, &last_hpage);
if (IS_ERR(node)) {
err = PTR_ERR(node);
break;
@@ -760,27 +778,27 @@ bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
}
static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
- struct iovec *iov,
- struct page **last_hpage)
+ struct io_uring_regbuf_desc *desc,
+ struct page **last_hpage)
{
+ unsigned long uaddr = (unsigned long)desc->uaddr;
+ size_t size = desc->size;
struct io_mapped_ubuf *imu = NULL;
struct page **pages = NULL;
struct io_rsrc_node *node;
unsigned long off;
- size_t size;
int ret, nr_pages, i;
struct io_imu_folio_data data;
bool coalesced = false;
- if (!iov->iov_base) {
- if (iov->iov_len)
+ if (!uaddr) {
+ if (size)
return ERR_PTR(-EFAULT);
/* remove the buffer without installing a new one */
return NULL;
}
- ret = io_validate_user_buf_range((unsigned long)iov->iov_base,
- iov->iov_len);
+ ret = io_validate_user_buf_range(uaddr, size);
if (ret)
return ERR_PTR(ret);
@@ -789,8 +807,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
return ERR_PTR(-ENOMEM);
ret = -ENOMEM;
- pages = io_pin_pages((unsigned long) iov->iov_base, iov->iov_len,
- &nr_pages);
+ pages = io_pin_pages(uaddr, size, &nr_pages);
if (IS_ERR(pages)) {
ret = PTR_ERR(pages);
pages = NULL;
@@ -812,10 +829,9 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
if (ret)
goto done;
- size = iov->iov_len;
/* store original address for later verification */
- imu->ubuf = (unsigned long) iov->iov_base;
- imu->len = iov->iov_len;
+ imu->ubuf = uaddr;
+ imu->len = size;
imu->folio_shift = PAGE_SHIFT;
imu->release = io_release_ubuf;
imu->priv = imu;
@@ -825,7 +841,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
imu->folio_shift = data.folio_shift;
refcount_set(&imu->refs, 1);
- off = (unsigned long)iov->iov_base & ~PAGE_MASK;
+ off = uaddr & ~PAGE_MASK;
if (coalesced)
off += data.first_folio_page_idx << PAGE_SHIFT;
@@ -878,6 +894,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
memset(iov, 0, sizeof(*iov));
for (i = 0; i < nr_args; i++) {
+ struct io_uring_regbuf_desc desc;
struct io_rsrc_node *node;
u64 tag = 0;
@@ -901,7 +918,8 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
}
}
- node = io_sqe_buffer_register(ctx, iov, &last_hpage);
+ io_iov_to_regbuf_desc(iov, &desc);
+ node = io_sqe_buffer_register(ctx, &desc, &last_hpage);
if (IS_ERR(node)) {
ret = PTR_ERR(node);
break;
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 09/10] io_uring/rsrc: extend buffer update
2026-04-29 15:25 [PATCH v3 00/10] Add dmabuf read/write via io_uring Pavel Begunkov
` (7 preceding siblings ...)
2026-04-29 15:25 ` [PATCH v3 08/10] io_uring/rsrc: introduce buf registration structure Pavel Begunkov
@ 2026-04-29 15:25 ` Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 10/10] io_uring/rsrc: add dmabuf backed registered buffers Pavel Begunkov
9 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-29 15:25 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: asml.silence, Nitesh Shetty, Kanchan Joshi, Anuj Gupta,
Tushar Gohad, William Power, Phil Cayton, Jason Gunthorpe
We need to pass more information to buffer registration than fits into
a single struct iovec. This patch allows users to optionally pass
struct io_uring_regbuf_desc. Apart from leaving more space for future
use cases, it also introduces registration types.
Currently, the type can be either IO_REGBUF_TYPE_UADDR, which mirrors
the iovec path, or IO_REGBUF_TYPE_EMPTY for leaving a buffer table slot
empty. The next patch introduces a dmabuf backed type, and the scheme
can also be useful for other extensions like splicing a list of user
addresses (i.e. iovec[]), interoperability with zcrx, or kernel
allocated memory as was brought up by Christoph. Note that the type
only represents a registration option, which is distinct from how
io_uring stores the buffer internally.
The flags field is not used yet but is useful to have, e.g. we can
encode read-only / write-only restrictions with it.
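As an example, updating a slot with the extended format looks roughly
like this from userspace (a sketch using the raw register syscall;
`slot`, `buf` and `buf_len` are placeholders, and liburing helpers live
in the test branch from the cover letter):

	struct io_uring_regbuf_desc desc = {
		.type	= IO_REGBUF_TYPE_UADDR,
		.size	= buf_len,
		.uaddr	= (__u64)(uintptr_t)buf,
	};
	struct io_uring_rsrc_update2 up = {
		.offset	= slot,
		.flags	= IORING_RSRC_UPDATE_EXTENDED,
		.data	= (__u64)(uintptr_t)&desc,
		.nr	= 1,
	};

	syscall(__NR_io_uring_register, ring_fd,
		IORING_REGISTER_BUFFERS_UPDATE, &up, sizeof(up));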
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/uapi/linux/io_uring.h | 27 +++++++++++++-
io_uring/rsrc.c | 69 ++++++++++++++++++++++-------------
2 files changed, 69 insertions(+), 27 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 17ac1b785440..05c3fd078767 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -790,13 +790,38 @@ struct io_uring_rsrc_update {
struct io_uring_rsrc_update2 {
__u32 offset;
- __u32 resv;
+ __u32 flags;
__aligned_u64 data;
__aligned_u64 tags;
__u32 nr;
__u32 resv2;
};
+/* struct io_uring_rsrc_update2::flags */
+enum io_uring_rsrc_reg_flags {
+ /*
+ * Use the extended descriptor format for buffer updates,
+ * see struct io_uring_regbuf_desc
+ */
+ IORING_RSRC_UPDATE_EXTENDED = 1U << 1,
+};
+
+/* Buffer registration type, passed in struct io_uring_regbuf_desc::type */
+enum io_uring_regbuf_type {
+ IO_REGBUF_TYPE_EMPTY,
+ IO_REGBUF_TYPE_UADDR,
+
+ __IO_REGBUF_TYPE_MAX,
+};
+
+struct io_uring_regbuf_desc {
+ __u32 type; /* enum io_uring_regbuf_type */
+ __u32 flags;
+ __u64 size;
+ __u64 uaddr;
+ __u64 __resv[7];
+};
+
/* Skip updating fd indexes set to this value in the fd table */
#define IORING_REGISTER_FILES_SKIP (-2)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index ba00238941ed..f8696b01cb54 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -27,11 +27,6 @@ struct io_rsrc_update {
u32 offset;
};
-struct io_uring_regbuf_desc {
- __u64 uaddr;
- __u64 size;
-};
-
static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
struct io_uring_regbuf_desc *desc,
struct page **last_hpage);
@@ -46,9 +41,12 @@ static void io_iov_to_regbuf_desc(const struct iovec *iov,
struct io_uring_regbuf_desc *desc)
{
*desc = (struct io_uring_regbuf_desc) {
+ .type = IO_REGBUF_TYPE_UADDR,
.uaddr = (u64)iov->iov_base,
.size = iov->iov_len,
};
+ if (!desc->uaddr)
+ desc->type = IO_REGBUF_TYPE_EMPTY;
}
int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
@@ -236,6 +234,8 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
return -ENXIO;
if (up->offset + nr_args > ctx->file_table.data.nr)
return -EINVAL;
+ if (up->flags)
+ return -EINVAL;
for (done = 0; done < nr_args; done++) {
u64 tag = 0;
@@ -292,10 +292,9 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
struct io_uring_rsrc_update2 *up,
unsigned int nr_args)
{
+ bool extended = up->flags & IORING_RSRC_UPDATE_EXTENDED;
u64 __user *tags = u64_to_user_ptr(up->tags);
- struct iovec fast_iov, *iov;
struct page *last_hpage = NULL;
- struct iovec __user *uvec;
u64 user_data = up->data;
__u32 done;
int i, err;
@@ -304,29 +303,49 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
return -ENXIO;
if (up->offset + nr_args > ctx->buf_table.nr)
return -EINVAL;
+ if (up->flags & ~IORING_RSRC_UPDATE_EXTENDED)
+ return -EINVAL;
for (done = 0; done < nr_args; done++) {
struct io_uring_regbuf_desc desc;
struct io_rsrc_node *node;
u64 tag = 0;
- uvec = u64_to_user_ptr(user_data);
- iov = iovec_from_user(uvec, 1, 1, &fast_iov, io_is_compat(ctx));
- if (IS_ERR(iov)) {
- err = PTR_ERR(iov);
- break;
- }
if (tags && copy_from_user(&tag, &tags[done], sizeof(tag))) {
err = -EFAULT;
break;
}
- io_iov_to_regbuf_desc(iov, &desc);
+ if (extended) {
+ if (copy_from_user(&desc, u64_to_user_ptr(user_data),
+ sizeof(desc))) {
+ err = -EFAULT;
+ break;
+ }
+ user_data += sizeof(desc);
+ } else {
+ struct iovec __user *uvec = u64_to_user_ptr(user_data);
+ struct iovec fast_iov, *iov;
+
+ if (io_is_compat(ctx))
+ user_data += sizeof(struct compat_iovec);
+ else
+ user_data += sizeof(struct iovec);
+
+ iov = iovec_from_user(uvec, 1, 1, &fast_iov, io_is_compat(ctx));
+ if (IS_ERR(iov)) {
+ err = PTR_ERR(iov);
+ break;
+ }
+ io_iov_to_regbuf_desc(iov, &desc);
+ }
+
node = io_sqe_buffer_register(ctx, &desc, &last_hpage);
if (IS_ERR(node)) {
err = PTR_ERR(node);
break;
}
+
if (tag) {
if (!node) {
err = -EINVAL;
@@ -337,10 +356,6 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
io_reset_rsrc_node(ctx, &ctx->buf_table, i);
ctx->buf_table.nodes[i] = node;
- if (io_is_compat(ctx))
- user_data += sizeof(struct compat_iovec);
- else
- user_data += sizeof(struct iovec);
}
return done ? done : err;
}
@@ -375,7 +390,7 @@ int io_register_files_update(struct io_ring_ctx *ctx, void __user *arg,
memset(&up, 0, sizeof(up));
if (copy_from_user(&up, arg, sizeof(struct io_uring_rsrc_update)))
return -EFAULT;
- if (up.resv || up.resv2)
+ if (up.resv2)
return -EINVAL;
return __io_register_rsrc_update(ctx, IORING_RSRC_FILE, &up, nr_args);
}
@@ -389,7 +404,7 @@ int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg,
return -EINVAL;
if (copy_from_user(&up, arg, sizeof(up)))
return -EFAULT;
- if (!up.nr || up.resv || up.resv2)
+ if (!up.nr || up.resv2)
return -EINVAL;
return __io_register_rsrc_update(ctx, type, &up, up.nr);
}
@@ -489,12 +504,9 @@ int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
struct io_uring_rsrc_update2 up2;
int ret;
+ memset(&up2, 0, sizeof(up2));
up2.offset = up->offset;
up2.data = up->arg;
- up2.nr = 0;
- up2.tags = 0;
- up2.resv = 0;
- up2.resv2 = 0;
if (up->offset == IORING_FILE_INDEX_ALLOC) {
ret = io_files_update_with_index_alloc(req, issue_flags);
@@ -791,8 +803,13 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
struct io_imu_folio_data data;
bool coalesced = false;
- if (!uaddr) {
- if (size)
+ if (desc->type >= __IO_REGBUF_TYPE_MAX)
+ return ERR_PTR(-EINVAL);
+ if (!mem_is_zero(&desc->__resv, sizeof(desc->__resv)))
+ return ERR_PTR(-EINVAL);
+
+ if (desc->type == IO_REGBUF_TYPE_EMPTY) {
+ if (uaddr || size)
return ERR_PTR(-EFAULT);
/* remove the buffer without installing a new one */
return NULL;
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH v3 10/10] io_uring/rsrc: add dmabuf backed registered buffers
2026-04-29 15:25 [PATCH v3 00/10] Add dmabuf read/write via io_uring Pavel Begunkov
` (8 preceding siblings ...)
2026-04-29 15:25 ` [PATCH v3 09/10] io_uring/rsrc: extend buffer update Pavel Begunkov
@ 2026-04-29 15:25 ` Pavel Begunkov
9 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-29 15:25 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: asml.silence, Nitesh Shetty, Kanchan Joshi, Anuj Gupta,
Tushar Gohad, William Power, Phil Cayton, Jason Gunthorpe
Implement dmabuf backed registered buffers. To register one, the user
should specify IO_REGBUF_TYPE_DMABUF for the registration and pass the
desired dmabuf fd along with an fd for the file the buffer should be
registered against.
From there, it can be used with io_uring read/write requests
(IORING_OP_{READ,WRITE}_FIXED) as normal. Requests should be issued
against the file specified during registration, otherwise they will
fail. The user should also be prepared to handle spurious -EAGAIN by
reissuing the request.
Internally, dmabuf registered buffers are an opt-in feature for io_uring
request opcodes, which have to pass a special flag on import to use them.
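For illustration (not part of the patch): a rough userspace sketch
combining the extended registration from the previous patch with a fixed
read. dmabuf_fd, nvme_fd, slot, len and ring are placeholders, liburing's
standard helpers are assumed, and the kernel must have CONFIG_DMABUF_TOKEN
enabled.
	struct io_uring_regbuf_desc desc = {
		.type      = IO_REGBUF_TYPE_DMABUF,
		.dmabuf_fd = dmabuf_fd,	/* e.g. exported by udmabuf or a GPU driver */
		.target_fd = nvme_fd,	/* file the buffer is bound to */
	};
	struct io_uring_rsrc_update2 up = {
		.offset = slot,
		.flags  = IORING_RSRC_UPDATE_EXTENDED,
		.data   = (__u64)(uintptr_t)&desc,
		.nr     = 1,
	};
	if (syscall(__NR_io_uring_register, ring.ring_fd,
		    IORING_REGISTER_BUFFERS_UPDATE, &up, sizeof(up)) < 0)
		return -1;
	/*
	 * Read from the device into the dmabuf; the "address" is an
	 * offset within the dmabuf, since these buffers start at 0.
	 */
	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, nvme_fd, NULL /* dmabuf offset 0 */,
				 len, 0 /* file offset */, slot);
	io_uring_submit(&ring);
	/* a spurious -EAGAIN (e.g. after move_notify) means resubmit */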
Suggested-by: David Wei <dw@davidwei.uk>
Suggested-by: Vishal Verma <vishal1.verma@intel.com>
Suggested-by: Tushar Gohad <tushar.gohad@intel.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/linux/io_uring_types.h | 5 +
include/uapi/linux/io_uring.h | 6 +-
io_uring/io_uring.c | 3 +-
io_uring/rsrc.c | 163 +++++++++++++++++++++++++++++++--
io_uring/rsrc.h | 30 +++++-
io_uring/rw.c | 4 +-
6 files changed, 200 insertions(+), 11 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 7aee83e5ea0e..f9a33099421a 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -10,6 +10,7 @@
struct iou_loop_params;
struct io_uring_bpf_ops;
+struct io_dmabuf_map;
enum {
/*
@@ -567,6 +568,7 @@ enum {
REQ_F_IMPORT_BUFFER_BIT,
REQ_F_SQE_COPIED_BIT,
REQ_F_IOPOLL_BIT,
+ REQ_F_DROP_DMABUF_BIT,
/* not a real bit, just to check we're not overflowing the space */
__REQ_F_LAST_BIT,
@@ -662,6 +664,8 @@ enum {
REQ_F_SQE_COPIED = IO_REQ_FLAG(REQ_F_SQE_COPIED_BIT),
/* request must be iopolled to completion (set in ->issue()) */
REQ_F_IOPOLL = IO_REQ_FLAG(REQ_F_IOPOLL_BIT),
+ /* there is a dma map attached to request that needs to be dropped */
+ REQ_F_DROP_DMABUF = IO_REQ_FLAG(REQ_F_DROP_DMABUF_BIT),
};
struct io_tw_req {
@@ -786,6 +790,7 @@ struct io_kiocb {
/* custom credentials, valid IFF REQ_F_CREDS is set */
const struct cred *creds;
struct io_wq_work work;
+ struct io_dmabuf_map *dmabuf_map;
struct io_big_cqe {
u64 extra1;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 05c3fd078767..3cd6ce28f9f5 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -810,6 +810,7 @@ enum io_uring_rsrc_reg_flags {
enum io_uring_regbuf_type {
IO_REGBUF_TYPE_EMPTY,
IO_REGBUF_TYPE_UADDR,
+ IO_REGBUF_TYPE_DMABUF,
__IO_REGBUF_TYPE_MAX,
};
@@ -819,7 +820,10 @@ struct io_uring_regbuf_desc {
__u32 flags;
__u64 size;
__u64 uaddr;
- __u64 __resv[7];
+
+ __s32 dmabuf_fd;
+ __s32 target_fd;
+ __u64 __resv[6];
};
/* Skip updating fd indexes set to this value in the fd table */
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 6068448a5aaa..e8a8eef45c3f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -108,7 +108,7 @@
#define IO_REQ_CLEAN_SLOW_FLAGS (REQ_F_REFCOUNT | IO_REQ_LINK_FLAGS | \
REQ_F_REISSUE | REQ_F_POLLED | \
- IO_REQ_CLEAN_FLAGS)
+ IO_REQ_CLEAN_FLAGS | REQ_F_DROP_DMABUF)
#define IO_TCTX_REFS_CACHE_NR (1U << 10)
@@ -1115,6 +1115,7 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
io_queue_next(req);
if (unlikely(req->flags & IO_REQ_CLEAN_FLAGS))
io_clean_op(req);
+ io_req_drop_dmabuf(req);
}
io_put_file(req);
io_req_put_rsrc_nodes(req);
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index f8696b01cb54..bb61de308543 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -10,6 +10,7 @@
#include <linux/compat.h>
#include <linux/io_uring.h>
#include <linux/io_uring/cmd.h>
+#include <linux/io_dmabuf_token.h>
#include <uapi/linux/io_uring.h>
@@ -789,6 +790,93 @@ bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
return true;
}
+struct io_regbuf_dma {
+ struct io_dmabuf_token token;
+ struct file *target_file;
+};
+
+static void io_release_reg_dmabuf(void *priv)
+{
+ struct io_regbuf_dma *db = priv;
+
+ fput(db->target_file);
+ io_dmabuf_token_release(&db->token);
+}
+
+static struct io_rsrc_node *io_register_dmabuf(struct io_ring_ctx *ctx,
+ struct io_uring_regbuf_desc *desc)
+{
+ struct io_rsrc_node *node = NULL;
+ struct io_mapped_ubuf *imu = NULL;
+ struct io_regbuf_dma *regbuf = NULL;
+ struct file *target_file = NULL;
+ struct dma_buf *dmabuf = NULL;
+ int ret;
+
+ if (!IS_ENABLED(CONFIG_DMABUF_TOKEN))
+ return ERR_PTR(-EOPNOTSUPP);
+ if (desc->uaddr || desc->size)
+ return ERR_PTR(-EINVAL);
+
+ ret = -ENOMEM;
+ node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
+ if (!node)
+ return ERR_PTR(-ENOMEM);
+ imu = io_alloc_imu(ctx, 0);
+ if (!imu)
+ goto err;
+ regbuf = kzalloc(sizeof(*regbuf), GFP_KERNEL);
+ if (!regbuf)
+ goto err;
+
+ ret = -EBADF;
+ target_file = fget(desc->target_fd);
+ if (!target_file)
+ goto err;
+
+ dmabuf = dma_buf_get(desc->dmabuf_fd);
+ if (IS_ERR(dmabuf)) {
+ ret = PTR_ERR(dmabuf);
+ dmabuf = NULL;
+ goto err;
+ }
+ if (dmabuf->size > SZ_1G) {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ ret = io_dmabuf_token_create(target_file, &regbuf->token, dmabuf,
+ DMA_BIDIRECTIONAL);
+ if (ret)
+ goto err;
+
+ regbuf->target_file = target_file;
+ imu->nr_bvecs = 1;
+ imu->ubuf = 0;
+ imu->len = dmabuf->size;
+ imu->folio_shift = 0;
+ imu->release = io_release_reg_dmabuf;
+ imu->priv = regbuf;
+ imu->flags = IO_REGBUF_F_DMABUF;
+ imu->dir = IO_BUF_DEST | IO_BUF_SOURCE;
+ refcount_set(&imu->refs, 1);
+ node->buf = imu;
+ dma_buf_put(dmabuf);
+ return node;
+err:
+ kfree(regbuf);
+ if (imu)
+ io_free_imu(ctx, imu);
+ if (node)
+ io_cache_free(&ctx->node_cache, node);
+ if (target_file)
+ fput(target_file);
+ if (dmabuf)
+ dma_buf_put(dmabuf);
+ return ERR_PTR(ret);
+}
+
+
static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
struct io_uring_regbuf_desc *desc,
struct page **last_hpage)
@@ -808,6 +896,12 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
if (!mem_is_zero(&desc->__resv, sizeof(desc->__resv)))
return ERR_PTR(-EINVAL);
+ if (desc->type == IO_REGBUF_TYPE_DMABUF)
+ return io_register_dmabuf(ctx, desc);
+
+ if (desc->dmabuf_fd || desc->target_fd)
+ return ERR_PTR(-EINVAL);
+
if (desc->type == IO_REGBUF_TYPE_EMPTY) {
if (uaddr || size)
return ERR_PTR(-EFAULT);
@@ -1134,9 +1228,57 @@ static int io_import_kbuf(int ddir, struct iov_iter *iter,
return 0;
}
-static int io_import_fixed(int ddir, struct iov_iter *iter,
+void io_drop_dmabuf_node(struct io_kiocb *req)
+{
+ struct io_mapped_ubuf *imu;
+
+ if (!IS_ENABLED(CONFIG_DMABUF_TOKEN))
+ return;
+ if (WARN_ON_ONCE(req->buf_node->type != IORING_RSRC_BUFFER))
+ return;
+ imu = req->buf_node->buf;
+ if (WARN_ON_ONCE(!(imu->flags & IO_REGBUF_F_DMABUF)))
+ return;
+ io_dmabuf_map_drop(req->dmabuf_map);
+}
+
+static int io_import_dmabuf(struct io_kiocb *req,
+ int ddir, struct iov_iter *iter,
struct io_mapped_ubuf *imu,
- u64 buf_addr, size_t len)
+ size_t len, size_t offset,
+ unsigned issue_flags)
+{
+ struct io_regbuf_dma *db = imu->priv;
+ struct io_dmabuf_map *map;
+
+ if (!IS_ENABLED(CONFIG_DMABUF_TOKEN))
+ return -EOPNOTSUPP;
+ if (!len)
+ return -EFAULT;
+ if (req->file != db->target_file)
+ return -EBADF;
+
+ map = io_dmabuf_get_map(&db->token);
+ if (unlikely(!map)) {
+ if (!(issue_flags & IO_URING_F_UNLOCKED))
+ return -EAGAIN;
+ map = io_dmabuf_create_map(&db->token);
+ if (IS_ERR(map))
+ return PTR_ERR(map);
+ }
+
+ req->dmabuf_map = map;
+ req->flags |= REQ_F_DROP_DMABUF;
+ iov_iter_dmabuf_map(iter, ddir, map, offset, len);
+ return 0;
+}
+
+static int io_import_fixed(struct io_kiocb *req,
+ int ddir, struct iov_iter *iter,
+ struct io_mapped_ubuf *imu,
+ u64 buf_addr, size_t len,
+ unsigned issue_flags,
+ unsigned import_flags)
{
const struct bio_vec *bvec;
size_t folio_mask;
@@ -1156,6 +1298,12 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
offset = buf_addr - imu->ubuf;
+ if (imu->flags & IO_REGBUF_F_DMABUF) {
+ if (!(import_flags & IO_REGBUF_IMPORT_ALLOW_DMABUF))
+ return -EFAULT;
+ return io_import_dmabuf(req, ddir, iter, imu, len, offset,
+ issue_flags);
+ }
if (imu->flags & IO_REGBUF_F_KBUF)
return io_import_kbuf(ddir, iter, imu, len, offset);
@@ -1209,16 +1357,17 @@ inline struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
return NULL;
}
-int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
+int __io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
u64 buf_addr, size_t len, int ddir,
- unsigned issue_flags)
+ unsigned issue_flags, unsigned import_flags)
{
struct io_rsrc_node *node;
node = io_find_buf_node(req, issue_flags);
if (!node)
return -EFAULT;
- return io_import_fixed(ddir, iter, node->buf, buf_addr, len);
+ return io_import_fixed(req, ddir, iter, node->buf, buf_addr, len,
+ issue_flags, import_flags);
}
/* Lock two rings at once. The rings must be different! */
@@ -1577,7 +1726,9 @@ int io_import_reg_vec(int ddir, struct iov_iter *iter,
iovec_off = vec->nr - nr_iovs;
iov = vec->iovec + iovec_off;
- if (imu->flags & IO_REGBUF_F_KBUF) {
+ if (imu->flags & IO_REGBUF_F_DMABUF) {
+ return -EOPNOTSUPP;
+ } else if (imu->flags & IO_REGBUF_F_KBUF) {
int ret = io_kern_bvec_size(iov, nr_iovs, imu, &nr_segs);
if (unlikely(ret))
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 8d48195faf9d..005a273ba107 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -25,6 +25,11 @@ struct io_rsrc_node {
enum {
IO_REGBUF_F_KBUF = 1,
+ IO_REGBUF_F_DMABUF = 2,
+};
+
+enum {
+ IO_REGBUF_IMPORT_ALLOW_DMABUF = 1,
};
struct io_mapped_ubuf {
@@ -60,9 +65,19 @@ int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr);
struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
unsigned issue_flags);
+int __io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
+ u64 buf_addr, size_t len, int ddir,
+ unsigned issue_flags, unsigned import_flags);
+
+static inline
int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
u64 buf_addr, size_t len, int ddir,
- unsigned issue_flags);
+ unsigned issue_flags)
+{
+ return __io_import_reg_buf(req, iter, buf_addr, len, ddir,
+ issue_flags, 0);
+}
+
int io_import_reg_vec(int ddir, struct iov_iter *iter,
struct io_kiocb *req, struct iou_vec *vec,
unsigned nr_iovs, unsigned issue_flags);
@@ -147,4 +162,17 @@ static inline void io_alloc_cache_vec_kasan(struct iou_vec *iv)
io_vec_free(iv);
}
+void io_drop_dmabuf_node(struct io_kiocb *req);
+
+static inline void io_req_drop_dmabuf(struct io_kiocb *req)
+{
+ if (!IS_ENABLED(CONFIG_DMABUF_TOKEN))
+ return;
+ if (!(req->flags & REQ_F_DROP_DMABUF))
+ return;
+ if (WARN_ON_ONCE(!(req->flags & REQ_F_BUF_NODE)))
+ return;
+ io_drop_dmabuf_node(req);
+}
+
#endif
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 20654deff84d..d50da5fa8bb9 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -380,8 +380,8 @@ static int io_init_rw_fixed(struct io_kiocb *req, unsigned int issue_flags,
if (io->bytes_done)
return 0;
- ret = io_import_reg_buf(req, &io->iter, rw->addr, rw->len, ddir,
- issue_flags);
+ ret = __io_import_reg_buf(req, &io->iter, rw->addr, rw->len, ddir,
+ issue_flags, IO_REGBUF_IMPORT_ALLOW_DMABUF);
iov_iter_save_state(&io->iter, &io->iter_state);
return ret;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v3 07/10] nvme-pci: implement dma_token backed requests
2026-04-29 15:25 ` [PATCH v3 07/10] nvme-pci: implement dma_token backed requests Pavel Begunkov
@ 2026-04-29 15:29 ` Pavel Begunkov
2026-04-29 16:07 ` Maurizio Lombardi
1 sibling, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-29 15:29 UTC (permalink / raw)
To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
Alexander Viro, Christian Brauner, Andrew Morton, Sumit Semwal,
Christian König, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On 4/29/26 16:25, Pavel Begunkov wrote:
> Enable BIO_DMABUF_MAP backed requests. It creates a prp list for the
> dmabuf when it's mapped, which is then used to initialise requests.
I left the nvme request / map setup as it was in Keith's work for the
most part (apart from rebases, adapting it, etc.), so it still builds
prp lists; I know Kanchan, Nitesh and Anuj already have patches adding
sgl support and some other optimisations on top. Hopefully, I addressed
most of the feedback from v2.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 07/10] nvme-pci: implement dma_token backed requests
2026-04-29 15:25 ` [PATCH v3 07/10] nvme-pci: implement dma_token backed requests Pavel Begunkov
2026-04-29 15:29 ` Pavel Begunkov
@ 2026-04-29 16:07 ` Maurizio Lombardi
2026-04-30 18:18 ` Pavel Begunkov
1 sibling, 1 reply; 16+ messages in thread
From: Maurizio Lombardi @ 2026-04-29 16:07 UTC (permalink / raw)
To: Pavel Begunkov, Jens Axboe, Keith Busch, Christoph Hellwig,
Sagi Grimberg, Alexander Viro, Christian Brauner, Andrew Morton,
Sumit Semwal, Christian König, linux-block, linux-kernel,
linux-nvme, linux-fsdevel, io-uring, linux-media, dri-devel,
linaro-mm-sig
Cc: Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On Wed Apr 29, 2026 at 5:25 PM CEST, Pavel Begunkov wrote:
> Enable BIO_DMABUF_MAP backed requests. It creates a prp list for the
> dmabuf when it's mapped, which is then used to initialise requests.
>
> Suggested-by: Keith Busch <kbusch@kernel.org>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> drivers/nvme/host/pci.c | 282 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 282 insertions(+)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index db5fc9bf6627..d2629853a972 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -27,6 +27,8 @@
> #include <linux/io-64-nonatomic-lo-hi.h>
> #include <linux/io-64-nonatomic-hi-lo.h>
> #include <linux/sed-opal.h>
> +#include <linux/io_dmabuf_token.h>
> +#include <linux/dma-resv.h>
>
> #include "trace.h"
> #include "nvme.h"
> @@ -393,6 +395,17 @@ struct nvme_queue {
> struct completion delete_done;
> };
>
> +struct nvme_dmabuf_token {
> + struct dma_buf_attachment *attach;
> +};
> +
> +struct nvme_dmabuf_map {
> + struct io_dmabuf_map base;
> + dma_addr_t *dma_list;
> + struct sg_table *sgt;
> + unsigned nr_entries;
> +};
> +
> /* bits for iod->flags */
> enum nvme_iod_flags {
> /* this command has been aborted by the timeout handler */
> @@ -854,6 +867,134 @@ static void nvme_free_descriptors(struct request *req)
> }
> }
>
> +static void nvme_dmabuf_map_sync(struct nvme_dev *nvme_dev, struct request *req,
> + bool for_cpu)
> +{
> + int length = blk_rq_payload_bytes(req);
> + struct device *dev = nvme_dev->dev;
> + enum dma_data_direction dma_dir;
> + struct bio *bio = req->bio;
> + struct nvme_dmabuf_map *map;
> + dma_addr_t *dma_list;
> + int offset, map_idx;
> +
> + dma_dir = rq_data_dir(req) == READ ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
> + map = container_of(bio->dmabuf_map, struct nvme_dmabuf_map, base);
> + dma_list = map->dma_list;
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + length += offset & (NVME_CTRL_PAGE_SIZE - 1);
> +
> + while (length > 0) {
> + u64 dma_addr = dma_list[map_idx++];
> +
> + if (for_cpu)
> + __dma_sync_single_for_cpu(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE, dma_dir);
> + else
> + __dma_sync_single_for_device(dev, dma_addr,
> + NVME_CTRL_PAGE_SIZE,
> + dma_dir);
> + length -= NVME_CTRL_PAGE_SIZE;
> + }
> +}
> +
> +static void nvme_rq_clean_dmabuf_map(struct nvme_dev *dev,
> + struct request *req)
> +{
> + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +
> + nvme_dmabuf_map_sync(dev, req, true);
> +
> + if (!(iod->flags & IOD_SINGLE_SEGMENT))
> + nvme_free_descriptors(req);
> +}
> +
> +static blk_status_t nvme_rq_setup_dmabuf_map(struct request *req,
> + struct nvme_queue *nvmeq)
> +{
> + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> + int length = blk_rq_payload_bytes(req);
> + u64 dma_addr, prp1_dma, prp2_dma;
> + struct bio *bio = req->bio;
> + struct nvme_dmabuf_map *map;
> + dma_addr_t *dma_list;
> + dma_addr_t prp_dma;
> + __le64 *prp_list;
> + int i, map_idx;
> + int offset;
> +
> + nvme_dmabuf_map_sync(nvmeq->dev, req, false);
> +
> + map = container_of(bio->dmabuf_map, struct nvme_dmabuf_map, base);
> + dma_list = map->dma_list;
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + map_idx = offset / NVME_CTRL_PAGE_SIZE;
> + offset &= (NVME_CTRL_PAGE_SIZE - 1);
> + prp1_dma = dma_list[map_idx++] + offset;
> +
> + length -= (NVME_CTRL_PAGE_SIZE - offset);
> + if (length <= 0) {
> + prp2_dma = 0;
> + goto done;
> + }
> +
> + if (length <= NVME_CTRL_PAGE_SIZE) {
> + prp2_dma = dma_list[map_idx];
> + goto done;
> + }
> +
> + if (DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE) <=
> + NVME_SMALL_POOL_SIZE / sizeof(__le64))
> + iod->flags |= IOD_SMALL_DESCRIPTOR;
> +
> + prp_list = dma_pool_alloc(nvme_dma_pool(nvmeq, iod), GFP_ATOMIC,
> + &prp_dma);
> + if (!prp_list)
> + return BLK_STS_RESOURCE;
> +
> + iod->descriptors[iod->nr_descriptors++] = prp_list;
> + prp2_dma = prp_dma;
> + i = 0;
> + for (;;) {
> + if (i == NVME_CTRL_PAGE_SIZE >> 3) {
> + __le64 *old_prp_list = prp_list;
> +
> + prp_list = dma_pool_alloc(nvmeq->descriptor_pools.large,
> + GFP_ATOMIC, &prp_dma);
> + if (!prp_list)
> + goto free_prps;
> + iod->descriptors[iod->nr_descriptors++] = prp_list;
> + prp_list[0] = old_prp_list[i - 1];
> + old_prp_list[i - 1] = cpu_to_le64(prp_dma);
> + i = 1;
> + }
> +
> + dma_addr = dma_list[map_idx++];
> + prp_list[i++] = cpu_to_le64(dma_addr);
> +
> + length -= NVME_CTRL_PAGE_SIZE;
> + if (length <= 0)
> + break;
> + }
> +done:
> + iod->cmd.common.dptr.prp1 = cpu_to_le64(prp1_dma);
> + iod->cmd.common.dptr.prp2 = cpu_to_le64(prp2_dma);
> + return BLK_STS_OK;
> +free_prps:
> + nvme_free_descriptors(req);
> + return BLK_STS_RESOURCE;
> +}
> +
> +static inline bool nvme_rq_is_dmabuf_attached(struct request *req)
> +{
> + if (!IS_ENABLED(CONFIG_DMABUF_TOKEN))
> + return false;
> + return req->bio && bio_flagged(req->bio, BIO_DMABUF_MAP);
> +}
> +
> static void nvme_free_prps(struct request *req, unsigned int attrs)
> {
> struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> @@ -932,6 +1073,11 @@ static void nvme_unmap_data(struct request *req)
> struct device *dma_dev = nvmeq->dev->dev;
> unsigned int attrs = 0;
>
> + if (nvme_rq_is_dmabuf_attached(req)) {
> + nvme_rq_clean_dmabuf_map(nvmeq->dev, req);
> + return;
> + }
> +
> if (iod->flags & IOD_SINGLE_SEGMENT) {
> static_assert(offsetof(union nvme_data_ptr, prp1) ==
> offsetof(union nvme_data_ptr, sgl.addr));
> @@ -1222,6 +1368,9 @@ static blk_status_t nvme_map_data(struct request *req)
> struct blk_dma_iter iter;
> blk_status_t ret;
>
> + if (nvme_rq_is_dmabuf_attached(req))
> + return nvme_rq_setup_dmabuf_map(req, nvmeq);
> +
> /*
> * Try to skip the DMA iterator for single segment requests, as that
> * significantly improves performances for small I/O sizes.
> @@ -2238,6 +2387,134 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
> return result;
> }
>
> +#ifdef CONFIG_DMABUF_TOKEN
> +static void nvme_dmabuf_invalidate_mappings(struct dma_buf_attachment *attach)
> +{
> + struct io_dmabuf_token *token = attach->importer_priv;
> +
> + io_dmabuf_token_invalidate_mappings(token);
> +}
> +
> +const struct dma_buf_attach_ops nvme_dmabuf_importer_ops = {
> + .invalidate_mappings = nvme_dmabuf_invalidate_mappings,
> + .allow_peer2peer = true,
> +};
> +
> +static struct io_dmabuf_map *nvme_dmabuf_token_map(struct io_dmabuf_token *token)
> +{
> + struct nvme_dmabuf_token *data = token->dev_priv;
> + struct dma_buf_attachment *attach = data->attach;
> + dma_addr_t *dma_list = NULL;
> + unsigned long tmp, i = 0;
> + struct nvme_dmabuf_map *map;
> + struct scatterlist *sg;
> + struct sg_table *sgt;
> + unsigned nr_entries;
> + int ret;
> +
> + dma_resv_assert_held(token->dmabuf->resv);
> +
> + map = kmalloc(sizeof(*map), GFP_KERNEL);
> + if (!map)
> + return ERR_PTR(-ENOMEM);
> +
> + nr_entries = token->dmabuf->size / NVME_CTRL_PAGE_SIZE;
> + dma_list = kmalloc_array(nr_entries, sizeof(dma_list[0]), GFP_KERNEL);
> + if (!dma_list) {
> + ret = -ENOMEM;
> + goto err;
> + }
> +
> + sgt = dma_buf_map_attachment(attach, token->dir);
> + if (IS_ERR(sgt)) {
> + ret = PTR_ERR(sgt);
> + sgt = NULL;
> + goto err;
> + }
> +
> + for_each_sgtable_dma_sg(sgt, sg, tmp) {
> + dma_addr_t dma_addr = sg_dma_address(sg);
> + unsigned long sg_len = sg_dma_len(sg);
> +
> + if (sg_len % NVME_CTRL_PAGE_SIZE) {
> + ret = -EINVAL;
> + goto err;
> + }
> +
> + while (sg_len) {
> + dma_list[i++] = dma_addr;
> + dma_addr += NVME_CTRL_PAGE_SIZE;
> + sg_len -= NVME_CTRL_PAGE_SIZE;
> + }
> + }
> +
> + ret = io_dmabuf_init_map(token, &map->base);
> + if (ret)
> + goto err;
> + map->nr_entries = nr_entries;
> + map->dma_list = dma_list;
> + map->sgt = sgt;
> + return &map->base;
> +err:
> + if (sgt)
> + dma_buf_unmap_attachment(attach, sgt, token->dir);
> + kfree(map);
> + kfree(dma_list);
> + return ERR_PTR(ret);
> +}
> +
> +static void nvme_dmabuf_token_unmap(struct io_dmabuf_token *token,
> + struct io_dmabuf_map *map_base)
> +{
> + struct nvme_dmabuf_token *data = token->dev_priv;
> + struct nvme_dmabuf_map *map = container_of(map_base,
> + struct nvme_dmabuf_map, base);
> +
> + dma_resv_assert_held(token->dmabuf->resv);
> +
> + dma_buf_unmap_attachment(data->attach, map->sgt, token->dir);
> + kfree(map->dma_list);
> +}
> +
> +static void nvme_dmabuf_token_release(struct io_dmabuf_token *token)
> +{
> + struct nvme_dmabuf_token *data = token->dev_priv;
> +
> + dma_buf_detach(token->dmabuf, data->attach);
> + kfree(data);
> +}
> +
> +const struct io_dmabuf_token_dev_ops nvme_dma_token_ops = {
> + .map = nvme_dmabuf_token_map,
> + .unmap = nvme_dmabuf_token_unmap,
> + .release = nvme_dmabuf_token_release,
> +};
> +
> +static int nvme_create_dmabuf_token(struct request_queue *q,
> + struct io_dmabuf_token *token)
> +{
> + struct nvme_dmabuf_token *data;
> + struct dma_buf_attachment *attach;
> + struct nvme_ns *ns = q->queuedata;
> + struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
> + struct dma_buf *dmabuf = token->dmabuf;
> +
> + data = kzalloc(sizeof(data), GFP_KERNEL);
> + if (!data)
> + return -ENOMEM;
Shouldn't this be kzalloc(sizeof(*data), ...)?
Also, checkpatch generates a warning here because kzalloc_obj() should
be preferred over kzalloc() for this kind of memory allocation.
> +
> + token->dev_priv = data;
> + token->dev_ops = &nvme_dma_token_ops;
> +
> + attach = dma_buf_dynamic_attach(dmabuf, dev->dev,
> + &nvme_dmabuf_importer_ops, token);
> + if (IS_ERR(attach))
> + return PTR_ERR(attach);
And if dma_buf_dynamic_attach() returns an error, won't the 'data'
pointer be leaked?
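Something along these lines would address both points (untested sketch;
I'm assuming the success path simply stores the attachment and returns
0, as in the rest of the hunk):
	data = kzalloc(sizeof(*data), GFP_KERNEL);
	if (!data)
		return -ENOMEM;
	attach = dma_buf_dynamic_attach(dmabuf, dev->dev,
					&nvme_dmabuf_importer_ops, token);
	if (IS_ERR(attach)) {
		kfree(data);
		return PTR_ERR(attach);
	}
	data->attach = attach;
	token->dev_priv = data;
	token->dev_ops = &nvme_dma_token_ops;
Assigning token->dev_priv only after the attach succeeds also avoids
leaving a stale pointer behind on the error path.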
Maurizio
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 01/10] file: add callback for creating long-term dmabuf maps
2026-04-29 15:25 ` [PATCH v3 01/10] file: add callback for creating long-term dmabuf maps Pavel Begunkov
@ 2026-04-30 6:03 ` Christian König
2026-04-30 18:33 ` Pavel Begunkov
0 siblings, 1 reply; 16+ messages in thread
From: Christian König @ 2026-04-30 6:03 UTC (permalink / raw)
To: Pavel Begunkov, Jens Axboe, Keith Busch, Christoph Hellwig,
Sagi Grimberg, Alexander Viro, Christian Brauner, Andrew Morton,
Sumit Semwal, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On 4/29/26 17:25, Pavel Begunkov wrote:
> Introduce a new file callback that allows creating a long-term dma
> mapping. All necessary information, together with a dmabuf, will be passed
> in the second argument of type struct io_dmabuf_token, which will be
> defined in the following patches.
Well, first of all the naming is probably not the best. Maybe call it a dma-buf attachment, context or mapping instead.
Then the patch should probably define the full interface, and not just add the callback here and the structure in a follow-up patch.
Regards,
Christian.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> include/linux/fs.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b5b01bb22d12..c5558aab4628 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1920,6 +1920,7 @@ struct dir_context {
>
> struct io_uring_cmd;
> struct offset_ctx;
> +struct io_dmabuf_token;
>
> typedef unsigned int __bitwise fop_flags_t;
>
> @@ -1967,6 +1968,7 @@ struct file_operations {
> int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> unsigned int poll_flags);
> int (*mmap_prepare)(struct vm_area_desc *);
> + int (*create_dmabuf_token)(struct file *, struct io_dmabuf_token *);
> } __randomize_layout;
>
> /* Supports async buffered reads */
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 07/10] nvme-pci: implement dma_token backed requests
2026-04-29 16:07 ` Maurizio Lombardi
@ 2026-04-30 18:18 ` Pavel Begunkov
0 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-30 18:18 UTC (permalink / raw)
To: Maurizio Lombardi, Jens Axboe, Keith Busch, Christoph Hellwig,
Sagi Grimberg, Alexander Viro, Christian Brauner, Andrew Morton,
Sumit Semwal, Christian König, linux-block, linux-kernel,
linux-nvme, linux-fsdevel, io-uring, linux-media, dri-devel,
linaro-mm-sig
Cc: Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On 4/29/26 17:07, Maurizio Lombardi wrote:
> On Wed Apr 29, 2026 at 5:25 PM CEST, Pavel Begunkov wrote:
>> Enable BIO_DMABUF_MAP backed requests. It creates a prp list for the
>> dmabuf when it's mapped, which is then used to initialise requests.
>>
>> Suggested-by: Keith Busch <kbusch@kernel.org>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
...>> +
>> +static int nvme_create_dmabuf_token(struct request_queue *q,
>> + struct io_dmabuf_token *token)
>> +{
>> + struct nvme_dmabuf_token *data;
>> + struct dma_buf_attachment *attach;
>> + struct nvme_ns *ns = q->queuedata;
>> + struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
>> + struct dma_buf *dmabuf = token->dmabuf;
>> +
>> + data = kzalloc(sizeof(data), GFP_KERNEL);
>> + if (!data)
>> + return -ENOMEM;
>
> Shouldn't this be kzalloc(sizeof(*data)...) ?
Good catch, I'll apply it all for next version
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3 01/10] file: add callback for creating long-term dmabuf maps
2026-04-30 6:03 ` Christian König
@ 2026-04-30 18:33 ` Pavel Begunkov
0 siblings, 0 replies; 16+ messages in thread
From: Pavel Begunkov @ 2026-04-30 18:33 UTC (permalink / raw)
To: Christian König, Jens Axboe, Keith Busch, Christoph Hellwig,
Sagi Grimberg, Alexander Viro, Christian Brauner, Andrew Morton,
Sumit Semwal, linux-block, linux-kernel, linux-nvme,
linux-fsdevel, io-uring, linux-media, dri-devel, linaro-mm-sig
Cc: Nitesh Shetty, Kanchan Joshi, Anuj Gupta, Tushar Gohad,
William Power, Phil Cayton, Jason Gunthorpe
On 4/30/26 07:03, Christian König wrote:
> On 4/29/26 17:25, Pavel Begunkov wrote:
>> Introduce a new file callback that allows creating a long-term dma
>> mapping. All necessary information, together with a dmabuf, will be passed
>> in the second argument of type struct io_dmabuf_token, which will be
>> defined in the following patches.
>
> Well, first of all the naming is probably not the best. Maybe call it a dma-buf attachment, context or mapping instead.
"Mapping" or "attachment" would be confusing as maps are created lazily
together with struct io_dmabuf_map. I can name it create_dmabuf_ctx(),
but I decided to use "token" not to collide with dmabuf terminology.
e.g. I wouldn't be surprised to see some dmabuf ctx in the dmabuf
implementation code. Maybe "*io_ctx" would be better.
> Then the patch should probably define the full interface, and not just add the callback here and the structure in a follow-up patch.
I strongly prefer splitting patches so that they touch one tree at
a time whenever possible. tbh, I don't see much of a problem with it
being undefined early on, as it's only forwarded in the first patches,
but I can shuffle the series around so that the definitions come first.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread
Thread overview: 16+ messages
2026-04-29 15:25 [PATCH v3 00/10] Add dmabuf read/write via io_uring Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 01/10] file: add callback for creating long-term dmabuf maps Pavel Begunkov
2026-04-30 6:03 ` Christian König
2026-04-30 18:33 ` Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 02/10] iov_iter: add iterator type for " Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 03/10] block: move bvec init into __bio_clone Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 04/10] block: introduce dma map backed bio type Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 05/10] lib: add dmabuf token infrastructure Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 06/10] block: forward create_dmabuf_token to drivers Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 07/10] nvme-pci: implement dma_token backed requests Pavel Begunkov
2026-04-29 15:29 ` Pavel Begunkov
2026-04-29 16:07 ` Maurizio Lombardi
2026-04-30 18:18 ` Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 08/10] io_uring/rsrc: introduce buf registration structure Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 09/10] io_uring/rsrc: extend buffer update Pavel Begunkov
2026-04-29 15:25 ` [PATCH v3 10/10] io_uring/rsrc: add dmabuf backed registered buffers Pavel Begunkov