* [RFC 00/12] io_uring dmabuf read/write support
@ 2025-06-27 15:10 Pavel Begunkov
2025-06-27 15:10 ` [RFC 01/12] file: add callback returning dev for dma operations Pavel Begunkov
` (12 more replies)
0 siblings, 13 replies; 19+ messages in thread
From: Pavel Begunkov @ 2025-06-27 15:10 UTC (permalink / raw)
To: io-uring, linux-block, linux-nvme
Cc: linux-fsdevel, Keith Busch, David Wei, Vishal Verma, asml.silence
Disclaimer: this hasn't been tested well enough yet and needs more beating
For the past couple of months David Wei, Vishal Verma and other folks
have been asking for dmabuf support for read/write and other io_uring
operations. The topic is not new; it has been discussed many times in
different contexts, including networking. The most recent relevant
attempt was premapped dma tags by Keith [1], and this patch set
borrows a lot from it.
This series implements it for read/write io_uring requests. The uAPI
mirrors normal registered buffers: the user first registers a dmabuf
in io_uring and then uses it like any other registered buffer. At
registration time the user also specifies the file the dmabuf should
be mapped for.
// register
io_uring_update_buffers(ring, { dma_buf_fd, target_file_fd });
// use
reg_buf_idx = 0;
io_uring_prep_read_fixed(sqe, target_file_fd, buffer_offset,
buffer_size, file_offset, reg_buf_idx);
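Expanded a little with the raw uAPI structures this series adds (a
sketch only: the helpers above are pseudocode, the actual liburing
interface lives in the branch below, and a buffer table has to exist
upfront, e.g. registered sparse, before it can be updated):

	/* the dmabuf provides the memory, so the iovec must be empty */
	struct iovec empty = { .iov_base = NULL, .iov_len = 0 };
	struct io_uring_reg_buffer rb = {
		.iov_uaddr = (__u64)(unsigned long)&empty,
		.target_fd = target_file_fd,	/* e.g. an O_DIRECT block device file */
		.dmabuf_fd = dma_buf_fd,	/* fd from the dmabuf exporter */
	};
	struct io_uring_rsrc_update2 up = {
		.offset = reg_buf_idx,		/* buffer table slot to update */
		.flags  = IORING_RSRC_F_EXTENDED_UPDATE,
		.data   = (__u64)(unsigned long)&rb,
		.nr     = 1,
	};

	syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_BUFFERS_UPDATE,
		&up, sizeof(up));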
This is an RFC to discuss the overall direction. The series is still
missing parts like bio splitting and nvme SGL support, and there are
other rough edges and likely problems that will need more testing and
attention.
[1] https://lore.kernel.org/io-uring/20220805162444.3985535-1-kbusch@fb.com/
simple liburing based example:
git: https://github.com/isilence/liburing.git dmabuf-rw
link: https://github.com/isilence/liburing/tree/dmabuf-rw
kernel branch:
git: https://github.com/isilence/linux.git dmabuf-rw-v1
Pavel Begunkov (12):
file: add callback returning dev for dma operations
iov_iter: introduce iter type for pre-registered dma
block: move around bio flagging helpers
block: introduce dmavec bio type
block: implement ->get_dma_device callback
nvme-pci: add support for user passed dma vectors
io_uring/rsrc: extended reg buffer registration
io_uring: add basic dmabuf helpers
io_uring/rsrc: add imu flags
io_uring/rsrc: add dmabuf-backed buffer registration
io_uring/rsrc: implement dmabuf regbuf import
io_uring/rw: enable dma registered buffers
block/bdev.c | 11 ++
block/bio.c | 21 ++++
block/blk-merge.c | 32 +++++
block/blk.h | 2 +-
block/fops.c | 3 +
drivers/nvme/host/pci.c | 158 +++++++++++++++++++++++
include/linux/bio.h | 59 ++++++---
include/linux/blk-mq.h | 2 +
include/linux/blk_types.h | 6 +-
include/linux/blkdev.h | 2 +
include/linux/fs.h | 2 +
include/linux/uio.h | 14 +++
include/uapi/linux/io_uring.h | 13 +-
io_uring/Makefile | 1 +
io_uring/dmabuf.c | 60 +++++++++
io_uring/dmabuf.h | 34 +++++
io_uring/rsrc.c | 230 ++++++++++++++++++++++++++++++----
io_uring/rsrc.h | 23 +++-
io_uring/rw.c | 7 +-
lib/iov_iter.c | 70 ++++++++++-
20 files changed, 701 insertions(+), 49 deletions(-)
create mode 100644 io_uring/dmabuf.c
create mode 100644 io_uring/dmabuf.h
--
2.49.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC 01/12] file: add callback returning dev for dma operations
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
@ 2025-06-27 15:10 ` Pavel Begunkov
2025-06-27 15:10 ` [RFC 02/12] iov_iter: introduce iter type for pre-registered dma Pavel Begunkov
` (11 subsequent siblings)
12 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2025-06-27 15:10 UTC (permalink / raw)
To: io-uring, linux-block, linux-nvme
Cc: linux-fsdevel, Keith Busch, David Wei, Vishal Verma, asml.silence
Add a file callback returning the device that can be used for
pre-mapping DMA buffers. There should be only one device per file,
and a file should never return different devices during its lifetime.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/linux/fs.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 96c7925a6551..9ab5aa413c62 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -82,6 +82,7 @@ struct fs_context;
struct fs_parameter_spec;
struct fileattr;
struct iomap_ops;
+struct device;
extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -2190,6 +2191,7 @@ struct file_operations {
int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
unsigned int poll_flags);
int (*mmap_prepare)(struct vm_area_desc *);
+ struct device *(*get_dma_device)(struct file *);
} __randomize_layout;
/* Supports async buffered reads */
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC 02/12] iov_iter: introduce iter type for pre-registered dma
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
2025-06-27 15:10 ` [RFC 01/12] file: add callback returning dev for dma operations Pavel Begunkov
@ 2025-06-27 15:10 ` Pavel Begunkov
2025-06-27 15:10 ` [RFC 03/12] block: move around bio flagging helpers Pavel Begunkov
` (10 subsequent siblings)
12 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2025-06-27 15:10 UTC (permalink / raw)
To: io-uring, linux-block, linux-nvme
Cc: linux-fsdevel, Keith Busch, David Wei, Vishal Verma, asml.silence
Introduce a new iterator type representing vectors with pre-registered
DMA addresses. It carries an array of struct dmavec, which is just a
{dma addr, dma len} pair. It'll be used to pass dmabuf buffers from
io_uring and other interfaces operating with iterators.
The vector is mapped for the device returned by the file's
->get_dma_device() callback, and the caller should only pass the
iterator to that file's methods. That should also prevent ITER_DMAVEC
iterators from reaching unaware files.
Note, the drivers are responsible for cpu-device memory synchronisation
and should use dma_sync_single_for_{device,cpu} when appropriate.
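For illustration only, a minimal sketch (not part of this patch) of
how a consumer owning the right device could walk such an iterator
and do the synchronisation mentioned above before the device reads
the memory (i.e. the write direction):

	static void dmavec_iter_to_hw(struct device *dev, struct iov_iter *iter)
	{
		const struct dmavec *dmav = iter->dmavec;
		size_t left = iov_iter_count(iter);
		size_t skip = iter->iov_offset;

		while (left) {
			dma_addr_t addr = dmav->addr + skip;
			size_t len = min_t(size_t, dmav->len - skip, left);

			/* make CPU writes visible to the device */
			dma_sync_single_for_device(dev, addr, len, DMA_TO_DEVICE);

			/* ... program addr/len into the hardware ... */

			left -= len;
			skip = 0;
			dmav++;
		}
	}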
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/linux/uio.h | 14 +++++++++
lib/iov_iter.c | 70 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 83 insertions(+), 1 deletion(-)
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 2e86c653186c..d68148508ef7 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -29,11 +29,17 @@ enum iter_type {
ITER_FOLIOQ,
ITER_XARRAY,
ITER_DISCARD,
+ ITER_DMAVEC,
};
#define ITER_SOURCE 1 // == WRITE
#define ITER_DEST 0 // == READ
+struct dmavec {
+ dma_addr_t addr;
+ int len;
+};
+
struct iov_iter_state {
size_t iov_offset;
size_t count;
@@ -71,6 +77,7 @@ struct iov_iter {
const struct folio_queue *folioq;
struct xarray *xarray;
void __user *ubuf;
+ const struct dmavec *dmavec;
};
size_t count;
};
@@ -155,6 +162,11 @@ static inline bool iov_iter_is_xarray(const struct iov_iter *i)
return iov_iter_type(i) == ITER_XARRAY;
}
+static inline bool iov_iter_is_dma(const struct iov_iter *i)
+{
+ return iov_iter_type(i) == ITER_DMAVEC;
+}
+
static inline unsigned char iov_iter_rw(const struct iov_iter *i)
{
return i->data_source ? WRITE : READ;
@@ -302,6 +314,8 @@ void iov_iter_folio_queue(struct iov_iter *i, unsigned int direction,
unsigned int first_slot, unsigned int offset, size_t count);
void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
loff_t start, size_t count);
+void iov_iter_dma(struct iov_iter *i, unsigned int direction,
+ struct dmavec *dmavec, unsigned nr_segs, size_t count);
ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f9193f952f49..b7740f9aa279 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -559,6 +559,26 @@ static void iov_iter_folioq_advance(struct iov_iter *i, size_t size)
i->folioq = folioq;
}
+static void iov_iter_dma_advance(struct iov_iter *i, size_t size)
+{
+ const struct dmavec *dmav, *end;
+
+ if (!i->count)
+ return;
+ i->count -= size;
+
+ size += i->iov_offset;
+
+ for (dmav = i->dmavec, end = dmav + i->nr_segs; dmav < end; dmav++) {
+ if (likely(size < dmav->len))
+ break;
+ size -= dmav->len;
+ }
+ i->iov_offset = size;
+ i->nr_segs -= dmav - i->dmavec;
+ i->dmavec = dmav;
+}
+
void iov_iter_advance(struct iov_iter *i, size_t size)
{
if (unlikely(i->count < size))
@@ -575,6 +595,8 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
iov_iter_folioq_advance(i, size);
} else if (iov_iter_is_discard(i)) {
i->count -= size;
+ } else if (iov_iter_is_dma(i)) {
+ iov_iter_dma_advance(i, size);
}
}
EXPORT_SYMBOL(iov_iter_advance);
@@ -763,6 +785,20 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction,
}
EXPORT_SYMBOL(iov_iter_xarray);
+void iov_iter_dma(struct iov_iter *i, unsigned int direction,
+ struct dmavec *dmavec, unsigned nr_segs, size_t count)
+{
+ WARN_ON(direction & ~(READ | WRITE));
+ *i = (struct iov_iter){
+ .iter_type = ITER_DMAVEC,
+ .data_source = direction,
+ .dmavec = dmavec,
+ .nr_segs = nr_segs,
+ .iov_offset = 0,
+ .count = count
+ };
+}
+
/**
* iov_iter_discard - Initialise an I/O iterator that discards data
* @i: The iterator to initialise.
@@ -834,6 +870,32 @@ static bool iov_iter_aligned_bvec(const struct iov_iter *i, unsigned addr_mask,
return true;
}
+static bool iov_iter_aligned_dma(const struct iov_iter *i, unsigned addr_mask,
+ unsigned len_mask)
+{
+ const struct dmavec *dmav = i->dmavec;
+ unsigned skip = i->iov_offset;
+ size_t size = i->count;
+
+ do {
+ size_t len = dmav->len - skip;
+
+ if (len > size)
+ len = size;
+ if (len & len_mask)
+ return false;
+ if ((unsigned long)(dmav->addr + skip) & addr_mask)
+ return false;
+
+ dmav++;
+ size -= len;
+ skip = 0;
+ } while (size);
+
+ return true;
+}
+
+
/**
* iov_iter_is_aligned() - Check if the addresses and lengths of each segments
* are aligned to the parameters.
@@ -875,6 +937,9 @@ bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
return false;
}
+ if (iov_iter_is_dma(i))
+ return iov_iter_aligned_dma(i, addr_mask, len_mask);
+
return true;
}
EXPORT_SYMBOL_GPL(iov_iter_is_aligned);
@@ -1552,7 +1617,8 @@ EXPORT_SYMBOL_GPL(import_ubuf);
void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
{
if (WARN_ON_ONCE(!iov_iter_is_bvec(i) && !iter_is_iovec(i) &&
- !iter_is_ubuf(i)) && !iov_iter_is_kvec(i))
+ !iter_is_ubuf(i)) && !iov_iter_is_kvec(i) &&
+ !iov_iter_is_dma(i))
return;
i->iov_offset = state->iov_offset;
i->count = state->count;
@@ -1570,6 +1636,8 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
BUILD_BUG_ON(sizeof(struct iovec) != sizeof(struct kvec));
if (iov_iter_is_bvec(i))
i->bvec -= state->nr_segs - i->nr_segs;
+ else if (iov_iter_is_dma(i))
+ i->dmavec -= state->nr_segs - i->nr_segs;
else
i->__iov -= state->nr_segs - i->nr_segs;
i->nr_segs = state->nr_segs;
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC 03/12] block: move around bio flagging helpers
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
2025-06-27 15:10 ` [RFC 01/12] file: add callback returning dev for dma operations Pavel Begunkov
2025-06-27 15:10 ` [RFC 02/12] iov_iter: introduce iter type for pre-registered dma Pavel Begunkov
@ 2025-06-27 15:10 ` Pavel Begunkov
2025-06-27 15:10 ` [RFC 04/12] block: introduce dmavec bio type Pavel Begunkov
` (9 subsequent siblings)
12 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2025-06-27 15:10 UTC (permalink / raw)
To: io-uring, linux-block, linux-nvme
Cc: linux-fsdevel, Keith Busch, David Wei, Vishal Verma, asml.silence
We'll need bio_flagged() earlier in bio.h in the next patch, so move
it together with all related helpers, and mark bio_flagged()'s bio
argument as const.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/linux/bio.h | 30 +++++++++++++++---------------
1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 9c37c66ef9ca..8349569414ed 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -46,6 +46,21 @@ static inline unsigned int bio_max_segs(unsigned int nr_segs)
#define bio_data_dir(bio) \
(op_is_write(bio_op(bio)) ? WRITE : READ)
+static inline bool bio_flagged(const struct bio *bio, unsigned int bit)
+{
+ return bio->bi_flags & (1U << bit);
+}
+
+static inline void bio_set_flag(struct bio *bio, unsigned int bit)
+{
+ bio->bi_flags |= (1U << bit);
+}
+
+static inline void bio_clear_flag(struct bio *bio, unsigned int bit)
+{
+ bio->bi_flags &= ~(1U << bit);
+}
+
/*
* Check whether this bio carries any data or not. A NULL bio is allowed.
*/
@@ -225,21 +240,6 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
atomic_set(&bio->__bi_cnt, count);
}
-static inline bool bio_flagged(struct bio *bio, unsigned int bit)
-{
- return bio->bi_flags & (1U << bit);
-}
-
-static inline void bio_set_flag(struct bio *bio, unsigned int bit)
-{
- bio->bi_flags |= (1U << bit);
-}
-
-static inline void bio_clear_flag(struct bio *bio, unsigned int bit)
-{
- bio->bi_flags &= ~(1U << bit);
-}
-
static inline struct bio_vec *bio_first_bvec_all(struct bio *bio)
{
WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC 04/12] block: introduce dmavec bio type
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
` (2 preceding siblings ...)
2025-06-27 15:10 ` [RFC 03/12] block: move around bio flagging helpers Pavel Begunkov
@ 2025-06-27 15:10 ` Pavel Begunkov
2025-06-27 15:10 ` [RFC 05/12] block: implement ->get_dma_device callback Pavel Begunkov
` (8 subsequent siblings)
12 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2025-06-27 15:10 UTC (permalink / raw)
To: io-uring, linux-block, linux-nvme
Cc: linux-fsdevel, Keith Busch, David Wei, Vishal Verma, asml.silence
Premapped buffers don't require a generic bio_vec since they have
already been DMA mapped. Repurpose bi_io_vec for the dma vector, as
the two are mutually exclusive, and provide the setup helpers needed
to support dma vectors.
In order to use this, a driver must implement the .get_dma_device()
blk-mq op. If the driver provides this callback, then it must be
aware that any given bio may carry a dmavec array instead of a
bio_vec array.
Note, splitting is not implemented just yet.
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
block/bio.c | 21 +++++++++++++++++++++
block/blk-merge.c | 32 ++++++++++++++++++++++++++++++++
block/blk.h | 2 +-
block/fops.c | 2 ++
include/linux/bio.h | 29 ++++++++++++++++++++++++++---
include/linux/blk_types.h | 6 +++++-
6 files changed, 87 insertions(+), 5 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 3c0a558c90f5..440f89b9e7de 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -838,6 +838,9 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
bio_clone_blkg_association(bio, bio_src);
}
+ if (bio_flagged(bio_src, BIO_DMAVEC))
+ bio_set_flag(bio, BIO_DMAVEC);
+
if (bio_crypt_clone(bio, bio_src, gfp) < 0)
return -ENOMEM;
if (bio_integrity(bio_src) &&
@@ -1156,6 +1159,18 @@ void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter)
bio_set_flag(bio, BIO_CLONED);
}
+void bio_iov_dmavec_set(struct bio *bio, struct iov_iter *iter)
+{
+ WARN_ON_ONCE(bio->bi_max_vecs);
+
+ bio->bi_vcnt = iter->nr_segs;
+ bio->bi_dmavec = (struct dmavec *)iter->dmavec;
+ bio->bi_iter.bi_bvec_done = iter->iov_offset;
+ bio->bi_iter.bi_size = iov_iter_count(iter);
+ bio->bi_opf |= REQ_NOMERGE;
+ bio_set_flag(bio, BIO_DMAVEC);
+}
+
static unsigned int get_contig_folio_len(unsigned int *num_pages,
struct page **pages, unsigned int i,
struct folio *folio, size_t left,
@@ -1322,6 +1337,10 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
bio_iov_bvec_set(bio, iter);
iov_iter_advance(iter, bio->bi_iter.bi_size);
return 0;
+ } else if (iov_iter_is_dma(iter)) {
+ bio_iov_dmavec_set(bio, iter);
+ iov_iter_advance(iter, bio->bi_iter.bi_size);
+ return 0;
}
if (iov_iter_extract_will_pin(iter))
@@ -1673,6 +1692,8 @@ struct bio *bio_split(struct bio *bio, int sectors,
/* Zone append commands cannot be split */
if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
return ERR_PTR(-EINVAL);
+ if (WARN_ON_ONCE(bio_flagged(bio, BIO_DMAVEC)))
+ return ERR_PTR(-EINVAL);
/* atomic writes cannot be split */
if (bio->bi_opf & REQ_ATOMIC)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3af1d284add5..f932ed61dcb5 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -278,6 +278,35 @@ static unsigned int bio_split_alignment(struct bio *bio,
return lim->logical_block_size;
}
+static int bio_split_rw_at_dmavec(struct bio *bio, const struct queue_limits *lim,
+ unsigned *segs, unsigned max_bytes)
+{
+ struct dmavec *dmav;
+ int offset, length;
+
+ /* Aggressively refuse any splitting, should be improved */
+
+ if (!lim->virt_boundary_mask)
+ return -EINVAL;
+ if (bio->bi_vcnt > lim->max_segments)
+ return -EINVAL;
+ if (bio->bi_iter.bi_size > max_bytes)
+ return -EINVAL;
+
+ dmav = &bio->bi_dmavec[bio->bi_iter.bi_idx];
+ offset = bio->bi_iter.bi_bvec_done;
+ length = bio->bi_iter.bi_size;
+ while (length > 0) {
+ if (dmav->len - offset > lim->max_segment_size)
+ return -EINVAL;
+ length -= dmav->len;
+ dmav++;
+ }
+ *segs = bio->bi_vcnt;
+ return 0;
+}
+
+
/**
* bio_split_rw_at - check if and where to split a read/write bio
* @bio: [in] bio to be split
@@ -297,6 +326,9 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
struct bvec_iter iter;
unsigned nsegs = 0, bytes = 0;
+ if (bio_flagged(bio, BIO_DMAVEC))
+ return bio_split_rw_at_dmavec(bio, lim, segs, max_bytes);
+
bio_for_each_bvec(bv, bio, iter) {
/*
* If the queue doesn't support SG gaps and adding this
diff --git a/block/blk.h b/block/blk.h
index 37ec459fe656..85429f542bd2 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -386,7 +386,7 @@ static inline struct bio *__bio_split_to_limits(struct bio *bio,
switch (bio_op(bio)) {
case REQ_OP_READ:
case REQ_OP_WRITE:
- if (bio_may_need_split(bio, lim))
+ if (bio_may_need_split(bio, lim) || bio_flagged(bio, BIO_DMAVEC))
return bio_split_rw(bio, lim, nr_segs);
*nr_segs = 1;
return bio;
diff --git a/block/fops.c b/block/fops.c
index 1309861d4c2c..388acfe82b6c 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -347,6 +347,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
* bio_iov_iter_get_pages() and set the bvec directly.
*/
bio_iov_bvec_set(bio, iter);
+ } else if (iov_iter_is_dma(iter)) {
+ bio_iov_dmavec_set(bio, iter);
} else {
ret = bio_iov_iter_get_pages(bio, iter);
if (unlikely(ret))
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8349569414ed..49f6b20d77f6 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -108,16 +108,36 @@ static inline bool bio_next_segment(const struct bio *bio,
#define bio_for_each_segment_all(bvl, bio, iter) \
for (bvl = bvec_init_iter_all(&iter); bio_next_segment((bio), &iter); )
+static inline void bio_advance_iter_dma(const struct bio *bio,
+ struct bvec_iter *iter, unsigned int bytes)
+{
+ unsigned int idx = iter->bi_idx;
+
+ iter->bi_size -= bytes;
+ bytes += iter->bi_bvec_done;
+
+ while (bytes && bytes >= bio->bi_dmavec[idx].len) {
+ bytes -= bio->bi_dmavec[idx].len;
+ idx++;
+ }
+
+ iter->bi_idx = idx;
+ iter->bi_bvec_done = bytes;
+}
+
static inline void bio_advance_iter(const struct bio *bio,
struct bvec_iter *iter, unsigned int bytes)
{
iter->bi_sector += bytes >> 9;
- if (bio_no_advance_iter(bio))
+ if (bio_no_advance_iter(bio)) {
iter->bi_size -= bytes;
- else
+ } else if (bio_flagged(bio, BIO_DMAVEC)) {
+ bio_advance_iter_dma(bio, iter, bytes);
+ } else {
bvec_iter_advance(bio->bi_io_vec, iter, bytes);
/* TODO: It is reasonable to complete bio with error here. */
+ }
}
/* @bytes should be less or equal to bvec[i->bi_idx].bv_len */
@@ -129,6 +149,8 @@ static inline void bio_advance_iter_single(const struct bio *bio,
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
+ else if (bio_flagged(bio, BIO_DMAVEC))
+ bio_advance_iter_dma(bio, iter, bytes);
else
bvec_iter_advance_single(bio->bi_io_vec, iter, bytes);
}
@@ -396,7 +418,7 @@ static inline void bio_wouldblock_error(struct bio *bio)
*/
static inline int bio_iov_vecs_to_alloc(struct iov_iter *iter, int max_segs)
{
- if (iov_iter_is_bvec(iter))
+ if (iov_iter_is_bvec(iter) || iov_iter_is_dma(iter))
return 0;
return iov_iter_npages(iter, max_segs);
}
@@ -443,6 +465,7 @@ int bdev_rw_virt(struct block_device *bdev, sector_t sector, void *data,
int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
void bio_iov_bvec_set(struct bio *bio, const struct iov_iter *iter);
+void bio_iov_dmavec_set(struct bio *bio, struct iov_iter *iter);
void __bio_release_pages(struct bio *bio, bool mark_dirty);
extern void bio_set_pages_dirty(struct bio *bio);
extern void bio_check_pages_dirty(struct bio *bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3d1577f07c1c..dc2a5945604a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -266,7 +266,10 @@ struct bio {
atomic_t __bi_cnt; /* pin count */
- struct bio_vec *bi_io_vec; /* the actual vec list */
+ union {
+ struct bio_vec *bi_io_vec; /* the actual vec list */
+ struct dmavec *bi_dmavec;
+ };
struct bio_set *bi_pool;
@@ -308,6 +311,7 @@ enum {
BIO_REMAPPED,
BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
+ BIO_DMAVEC, /* Using premapped dma buffers */
BIO_FLAG_LAST
};
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC 05/12] block: implement ->get_dma_device callback
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
` (3 preceding siblings ...)
2025-06-27 15:10 ` [RFC 04/12] block: introduce dmavec bio type Pavel Begunkov
@ 2025-06-27 15:10 ` Pavel Begunkov
2025-06-27 15:10 ` [RFC 06/12] nvme-pci: add support for user passed dma vectors Pavel Begunkov
` (7 subsequent siblings)
12 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2025-06-27 15:10 UTC (permalink / raw)
To: io-uring, linux-block, linux-nvme
Cc: linux-fsdevel, Keith Busch, David Wei, Vishal Verma, asml.silence
Implement the ->get_dma_device callback for block files by forwarding it
to drivers via a new blk-mq ops callback.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
block/bdev.c | 11 +++++++++++
block/fops.c | 1 +
include/linux/blk-mq.h | 2 ++
include/linux/blkdev.h | 2 ++
4 files changed, 16 insertions(+)
diff --git a/block/bdev.c b/block/bdev.c
index b77ddd12dc06..28850cc0125c 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -61,6 +61,17 @@ struct block_device *file_bdev(struct file *bdev_file)
}
EXPORT_SYMBOL(file_bdev);
+struct device *block_get_dma_device(struct file *file)
+{
+ struct request_queue *q = bdev_get_queue(file_bdev(file));
+
+ if (!(file->f_flags & O_DIRECT))
+ return ERR_PTR(-EINVAL);
+ if (q->mq_ops && q->mq_ops->get_dma_device)
+ return q->mq_ops->get_dma_device(q);
+ return ERR_PTR(-EINVAL);
+}
+
static void bdev_write_inode(struct block_device *bdev)
{
struct inode *inode = BD_INODE(bdev);
diff --git a/block/fops.c b/block/fops.c
index 388acfe82b6c..cb22ffdec7ef 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -941,6 +941,7 @@ const struct file_operations def_blk_fops = {
.fallocate = blkdev_fallocate,
.uring_cmd = blkdev_uring_cmd,
.fop_flags = FOP_BUFFER_RASYNC,
+ .get_dma_device = block_get_dma_device,
};
static __init int blkdev_init(void)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index de8c85a03bb7..1c878b9f5b4c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -656,6 +656,8 @@ struct blk_mq_ops {
*/
void (*map_queues)(struct blk_mq_tag_set *set);
+ struct device *(*get_dma_device)(struct request_queue *q);
+
#ifdef CONFIG_BLK_DEBUG_FS
/**
* @show_rq: Used by the debugfs implementation to show driver-specific
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a59880c809c7..54630e23a419 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1720,6 +1720,8 @@ struct block_device *file_bdev(struct file *bdev_file);
bool disk_live(struct gendisk *disk);
unsigned int block_size(struct block_device *bdev);
+struct device *block_get_dma_device(struct file *file);
+
#ifdef CONFIG_BLOCK
void invalidate_bdev(struct block_device *bdev);
int sync_blockdev(struct block_device *bdev);
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC 06/12] nvme-pci: add support for user passed dma vectors
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
` (4 preceding siblings ...)
2025-06-27 15:10 ` [RFC 05/12] block: implement ->get_dma_device callback Pavel Begunkov
@ 2025-06-27 15:10 ` Pavel Begunkov
2025-06-27 15:10 ` [RFC 07/12] io_uring/rsrc: extended reg buffer registration Pavel Begunkov
` (6 subsequent siblings)
12 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2025-06-27 15:10 UTC (permalink / raw)
To: io-uring, linux-block, linux-nvme
Cc: linux-fsdevel, Keith Busch, David Wei, Vishal Verma, asml.silence
Implement the ->get_dma_device blk-mq callback and add BIO_DMAVEC
handling. If the driver sees BIO_DMAVEC, instead of mapping pages, it
directly populates the PRP list with the provided dma addresses.
Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
drivers/nvme/host/pci.c | 158 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 158 insertions(+)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8ff12e415cb5..44a6366f2d9a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -637,11 +637,59 @@ static void nvme_free_descriptors(struct nvme_queue *nvmeq, struct request *req)
}
}
+static void nvme_sync_dma(struct nvme_dev *nvme_dev, struct request *req,
+ enum dma_data_direction dir)
+{
+ bool for_cpu = dir == DMA_FROM_DEVICE;
+ struct device *dev = nvme_dev->dev;
+ struct bio *bio = req->bio;
+ int offset, length;
+ struct dmavec *dmav;
+
+ if (!dma_dev_need_sync(dev))
+ return;
+
+ offset = bio->bi_iter.bi_bvec_done;
+ length = blk_rq_payload_bytes(req);
+ dmav = &bio->bi_dmavec[bio->bi_iter.bi_idx];
+
+ while (length) {
+ u64 dma_addr = dmav->addr + offset;
+ int dma_len = min(dmav->len - offset, length);
+
+ if (for_cpu)
+ __dma_sync_single_for_cpu(dev, dma_addr, dma_len, dir);
+ else
+ __dma_sync_single_for_device(dev, dma_addr,
+ dma_len, dir);
+
+ length -= dma_len;
+ }
+}
+
+static void nvme_unmap_premapped_data(struct nvme_dev *dev,
+ struct nvme_queue *nvmeq,
+ struct request *req)
+{
+ struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+
+ if (rq_data_dir(req) == READ)
+ nvme_sync_dma(dev, req, DMA_FROM_DEVICE);
+
+ if (!iod->dma_len)
+ nvme_free_descriptors(nvmeq, req);
+}
+
static void nvme_unmap_data(struct nvme_dev *dev, struct nvme_queue *nvmeq,
struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ if (req->bio && bio_flagged(req->bio, BIO_DMAVEC)) {
+ nvme_unmap_premapped_data(dev, nvmeq, req);
+ return;
+ }
+
if (iod->dma_len) {
dma_unmap_page(dev->dev, iod->first_dma, iod->dma_len,
rq_dma_dir(req));
@@ -846,6 +894,104 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
return BLK_STS_OK;
}
+static blk_status_t nvme_dma_premapped(struct nvme_dev *dev, struct request *req,
+ struct nvme_queue *nvmeq,
+ struct nvme_rw_command *cmnd)
+{
+ struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ int length = blk_rq_payload_bytes(req);
+ u64 dma_addr, first_dma_addr;
+ struct bio *bio = req->bio;
+ int dma_len, offset;
+ struct dmavec *dmav;
+ dma_addr_t prp_dma;
+ __le64 *prp_list;
+ int i;
+
+ if (rq_data_dir(req) == WRITE)
+ nvme_sync_dma(dev, req, DMA_TO_DEVICE);
+
+ offset = bio->bi_iter.bi_bvec_done;
+ dmav = &bio->bi_dmavec[bio->bi_iter.bi_idx];
+ dma_addr = dmav->addr + offset;
+ dma_len = dmav->len - offset;
+ first_dma_addr = dma_addr;
+ offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
+
+ length -= (NVME_CTRL_PAGE_SIZE - offset);
+ if (length <= 0) {
+ iod->first_dma = 0;
+ goto done;
+ }
+
+ dma_len -= (NVME_CTRL_PAGE_SIZE - offset);
+ if (dma_len) {
+ dma_addr += (NVME_CTRL_PAGE_SIZE - offset);
+ } else {
+ dmav++;
+ dma_addr = dmav->addr;
+ dma_len = dmav->len;
+ }
+
+ if (length <= NVME_CTRL_PAGE_SIZE) {
+ iod->first_dma = dma_addr;
+ goto done;
+ }
+
+ if (DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE) <=
+ NVME_SMALL_POOL_SIZE / sizeof(__le64))
+ iod->flags |= IOD_SMALL_DESCRIPTOR;
+
+ prp_list = dma_pool_alloc(nvme_dma_pool(nvmeq, iod), GFP_ATOMIC,
+ &prp_dma);
+ if (!prp_list)
+ return BLK_STS_RESOURCE;
+
+ iod->descriptors[iod->nr_descriptors++] = prp_list;
+ iod->first_dma = prp_dma;
+ i = 0;
+ for (;;) {
+ if (i == NVME_CTRL_PAGE_SIZE >> 3) {
+ __le64 *old_prp_list = prp_list;
+
+ prp_list = dma_pool_alloc(nvmeq->descriptor_pools.large,
+ GFP_ATOMIC, &prp_dma);
+ if (!prp_list)
+ goto free_prps;
+ iod->descriptors[iod->nr_descriptors++] = prp_list;
+ prp_list[0] = old_prp_list[i - 1];
+ old_prp_list[i - 1] = cpu_to_le64(prp_dma);
+ i = 1;
+ }
+
+ prp_list[i++] = cpu_to_le64(dma_addr);
+ dma_len -= NVME_CTRL_PAGE_SIZE;
+ dma_addr += NVME_CTRL_PAGE_SIZE;
+ length -= NVME_CTRL_PAGE_SIZE;
+ if (length <= 0)
+ break;
+ if (dma_len > 0)
+ continue;
+ if (unlikely(dma_len < 0))
+ goto bad_sgl;
+ dmav++;
+ dma_addr = dmav->addr;
+ dma_len = dmav->len;
+ }
+done:
+ cmnd->dptr.prp1 = cpu_to_le64(first_dma_addr);
+ cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
+ return BLK_STS_OK;
+free_prps:
+ nvme_free_descriptors(nvmeq, req);
+ return BLK_STS_RESOURCE;
+bad_sgl:
+ WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
+ "Invalid SGL for payload:%d nents:%d\n",
+ blk_rq_payload_bytes(req), iod->sgt.nents);
+ return BLK_STS_IOERR;
+}
+
static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
struct nvme_command *cmnd)
{
@@ -854,6 +1000,9 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
blk_status_t ret = BLK_STS_RESOURCE;
int rc;
+ if (req->bio && bio_flagged(req->bio, BIO_DMAVEC))
+ return nvme_dma_premapped(dev, req, nvmeq, &cmnd->rw);
+
if (blk_rq_nr_phys_segments(req) == 1) {
struct bio_vec bv = req_bvec(req);
@@ -1874,6 +2023,14 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid, bool polled)
return result;
}
+static struct device *nvme_pci_get_dma_device(struct request_queue *q)
+{
+ struct nvme_ns *ns = q->queuedata;
+ struct nvme_dev *dev = to_nvme_dev(ns->ctrl);
+
+ return dev->dev;
+}
+
static const struct blk_mq_ops nvme_mq_admin_ops = {
.queue_rq = nvme_queue_rq,
.complete = nvme_pci_complete_rq,
@@ -1892,6 +2049,7 @@ static const struct blk_mq_ops nvme_mq_ops = {
.map_queues = nvme_pci_map_queues,
.timeout = nvme_timeout,
.poll = nvme_poll,
+ .get_dma_device = nvme_pci_get_dma_device,
};
static void nvme_dev_remove_admin(struct nvme_dev *dev)
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC 07/12] io_uring/rsrc: extended reg buffer registration
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
` (5 preceding siblings ...)
2025-06-27 15:10 ` [RFC 06/12] nvme-pci: add support for user passed dma vectors Pavel Begunkov
@ 2025-06-27 15:10 ` Pavel Begunkov
2025-06-27 15:10 ` [RFC 08/12] io_uring: add basic dmabuf helpers Pavel Begunkov
` (5 subsequent siblings)
12 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2025-06-27 15:10 UTC (permalink / raw)
To: io-uring, linux-block, linux-nvme
Cc: linux-fsdevel, Keith Busch, David Wei, Vishal Verma, asml.silence
We'll need to pass extra information for buffer registration apart
from an iovec, so add a flag to struct io_uring_rsrc_update2 telling
that its data field points to an extended registration structure,
i.e. struct io_uring_reg_buffer. To do a normal registration the user
has to set the target_fd and dmabuf_fd fields to -1; any other
combination is currently rejected.
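For clarity, this is what a single extended entry looks like from the
userspace side (illustrative sketch; with both fds set to -1 it is
equivalent to passing the iovec directly, dmabuf registration comes
in a later patch):

	struct io_uring_reg_buffer rb = {
		.iov_uaddr = (__u64)(unsigned long)&iov,	/* same iovec as before */
		.target_fd = -1,				/* no target file */
		.dmabuf_fd = -1,				/* no dmabuf */
	};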
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/uapi/linux/io_uring.h | 13 ++++++++-
io_uring/rsrc.c | 53 +++++++++++++++++++++++++++--------
2 files changed, 54 insertions(+), 12 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index cfd17e382082..596cb71bd214 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -725,15 +725,26 @@ struct io_uring_rsrc_update {
__aligned_u64 data;
};
+/* struct io_uring_rsrc_update2::flags */
+enum io_uring_rsrc_reg_flags {
+ IORING_RSRC_F_EXTENDED_UPDATE = 1,
+};
+
struct io_uring_rsrc_update2 {
__u32 offset;
- __u32 resv;
+ __u32 flags;
__aligned_u64 data;
__aligned_u64 tags;
__u32 nr;
__u32 resv2;
};
+struct io_uring_reg_buffer {
+ __aligned_u64 iov_uaddr;
+ __s32 target_fd;
+ __s32 dmabuf_fd;
+};
+
/* Skip updating fd indexes set to this value in the fd table */
#define IORING_REGISTER_FILES_SKIP (-2)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index c592ceace97d..21f4932ecafa 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -26,7 +26,8 @@ struct io_rsrc_update {
u32 offset;
};
-static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
+static struct io_rsrc_node *
+io_sqe_buffer_register(struct io_ring_ctx *ctx, struct io_uring_reg_buffer *rb,
struct iovec *iov, struct page **last_hpage);
/* only define max */
@@ -226,6 +227,8 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
if (!ctx->file_table.data.nr)
return -ENXIO;
+ if (up->flags)
+ return -EINVAL;
if (up->offset + nr_args > ctx->file_table.data.nr)
return -EINVAL;
@@ -280,10 +283,18 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
return done ? done : err;
}
+static inline void io_default_reg_buf(struct io_uring_reg_buffer *rb)
+{
+ memset(rb, 0, sizeof(*rb));
+ rb->target_fd = -1;
+ rb->dmabuf_fd = -1;
+}
+
static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
struct io_uring_rsrc_update2 *up,
unsigned int nr_args)
{
+ bool extended_entry = up->flags & IORING_RSRC_F_EXTENDED_UPDATE;
u64 __user *tags = u64_to_user_ptr(up->tags);
struct iovec fast_iov, *iov;
struct page *last_hpage = NULL;
@@ -294,14 +305,32 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
if (!ctx->buf_table.nr)
return -ENXIO;
+ if (up->flags & ~IORING_RSRC_F_EXTENDED_UPDATE)
+ return -EINVAL;
if (up->offset + nr_args > ctx->buf_table.nr)
return -EINVAL;
for (done = 0; done < nr_args; done++) {
+ struct io_uring_reg_buffer rb;
struct io_rsrc_node *node;
u64 tag = 0;
- uvec = u64_to_user_ptr(user_data);
+ if (extended_entry) {
+ if (copy_from_user(&rb, u64_to_user_ptr(user_data),
+ sizeof(rb)))
+ return -EFAULT;
+ user_data += sizeof(rb);
+ } else {
+ io_default_reg_buf(&rb);
+ rb.iov_uaddr = user_data;
+
+ if (ctx->compat)
+ user_data += sizeof(struct compat_iovec);
+ else
+ user_data += sizeof(struct iovec);
+ }
+
+ uvec = u64_to_user_ptr(rb.iov_uaddr);
iov = iovec_from_user(uvec, 1, 1, &fast_iov, ctx->compat);
if (IS_ERR(iov)) {
err = PTR_ERR(iov);
@@ -314,7 +343,7 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
err = io_buffer_validate(iov);
if (err)
break;
- node = io_sqe_buffer_register(ctx, iov, &last_hpage);
+ node = io_sqe_buffer_register(ctx, &rb, iov, &last_hpage);
if (IS_ERR(node)) {
err = PTR_ERR(node);
break;
@@ -329,10 +358,6 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
io_reset_rsrc_node(ctx, &ctx->buf_table, i);
ctx->buf_table.nodes[i] = node;
- if (ctx->compat)
- user_data += sizeof(struct compat_iovec);
- else
- user_data += sizeof(struct iovec);
}
return done ? done : err;
}
@@ -367,7 +392,7 @@ int io_register_files_update(struct io_ring_ctx *ctx, void __user *arg,
memset(&up, 0, sizeof(up));
if (copy_from_user(&up, arg, sizeof(struct io_uring_rsrc_update)))
return -EFAULT;
- if (up.resv || up.resv2)
+ if (up.resv2)
return -EINVAL;
return __io_register_rsrc_update(ctx, IORING_RSRC_FILE, &up, nr_args);
}
@@ -381,7 +406,7 @@ int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg,
return -EINVAL;
if (copy_from_user(&up, arg, sizeof(up)))
return -EFAULT;
- if (!up.nr || up.resv || up.resv2)
+ if (!up.nr || up.resv2)
return -EINVAL;
return __io_register_rsrc_update(ctx, type, &up, up.nr);
}
@@ -485,7 +510,7 @@ int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
up2.data = up->arg;
up2.nr = 0;
up2.tags = 0;
- up2.resv = 0;
+ up2.flags = 0;
up2.resv2 = 0;
if (up->offset == IORING_FILE_INDEX_ALLOC) {
@@ -769,6 +794,7 @@ bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
}
static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
+ struct io_uring_reg_buffer *rb,
struct iovec *iov,
struct page **last_hpage)
{
@@ -781,6 +807,9 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
struct io_imu_folio_data data;
bool coalesced = false;
+ if (rb->dmabuf_fd != -1 || rb->target_fd != -1)
+ return NULL;
+
if (!iov->iov_base)
return NULL;
@@ -872,6 +901,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
memset(iov, 0, sizeof(*iov));
for (i = 0; i < nr_args; i++) {
+ struct io_uring_reg_buffer rb;
struct io_rsrc_node *node;
u64 tag = 0;
@@ -898,7 +928,8 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
}
}
- node = io_sqe_buffer_register(ctx, iov, &last_hpage);
+ io_default_reg_buf(&rb);
+ node = io_sqe_buffer_register(ctx, &rb, iov, &last_hpage);
if (IS_ERR(node)) {
ret = PTR_ERR(node);
break;
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC 08/12] io_uring: add basic dmabuf helpers
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
` (6 preceding siblings ...)
2025-06-27 15:10 ` [RFC 07/12] io_uring/rsrc: extended reg buffer registration Pavel Begunkov
@ 2025-06-27 15:10 ` Pavel Begunkov
2025-06-27 15:10 ` [RFC 09/12] io_uring/rsrc: add imu flags Pavel Begunkov
` (4 subsequent siblings)
12 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2025-06-27 15:10 UTC (permalink / raw)
To: io-uring, linux-block, linux-nvme
Cc: linux-fsdevel, Keith Busch, David Wei, Vishal Verma, asml.silence
Add basic dmabuf helpers and structures for io_uring, which will be
used for dmabuf buffer registration. They can also be reused in other
places in io_uring, which is omitted from this series.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
io_uring/Makefile | 1 +
io_uring/dmabuf.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++
io_uring/dmabuf.h | 34 +++++++++++++++++++++++++++
3 files changed, 95 insertions(+)
create mode 100644 io_uring/dmabuf.c
create mode 100644 io_uring/dmabuf.h
diff --git a/io_uring/Makefile b/io_uring/Makefile
index d97c6b51d584..0f5a7ec38452 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -21,3 +21,4 @@ obj-$(CONFIG_EPOLL) += epoll.o
obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
obj-$(CONFIG_NET) += net.o cmd_net.o
obj-$(CONFIG_PROC_FS) += fdinfo.o
+obj-$(CONFIG_DMA_SHARED_BUFFER) += dmabuf.o
diff --git a/io_uring/dmabuf.c b/io_uring/dmabuf.c
new file mode 100644
index 000000000000..cb9d8bb5d5b3
--- /dev/null
+++ b/io_uring/dmabuf.c
@@ -0,0 +1,60 @@
+#include "dmabuf.h"
+
+void io_dmabuf_release(struct io_dmabuf *buf)
+{
+ if (buf->sgt)
+ dma_buf_unmap_attachment_unlocked(buf->attach, buf->sgt,
+ buf->dir);
+ if (buf->attach)
+ dma_buf_detach(buf->dmabuf, buf->attach);
+ if (buf->dmabuf)
+ dma_buf_put(buf->dmabuf);
+ if (buf->dev)
+ put_device(buf->dev);
+
+ memset(buf, 0, sizeof(*buf));
+}
+
+int io_dmabuf_import(struct io_dmabuf *buf, int dmabuf_fd,
+ struct device *dev, enum dma_data_direction dir)
+{
+ unsigned long total_size = 0;
+ struct scatterlist *sg;
+ int i, ret;
+
+ if (WARN_ON_ONCE(!dev))
+ return -EFAULT;
+
+ buf->dir = dir;
+ buf->dmabuf = dma_buf_get(dmabuf_fd);
+ if (IS_ERR(buf->dmabuf)) {
+ ret = PTR_ERR(buf->dmabuf);
+ buf->dmabuf = NULL;
+ goto err;
+ }
+
+ buf->attach = dma_buf_attach(buf->dmabuf, dev);
+ if (IS_ERR(buf->attach)) {
+ ret = PTR_ERR(buf->attach);
+ buf->attach = NULL;
+ goto err;
+ }
+
+ buf->sgt = dma_buf_map_attachment_unlocked(buf->attach, dir);
+ if (IS_ERR(buf->sgt)) {
+ ret = PTR_ERR(buf->sgt);
+ buf->sgt = NULL;
+ goto err;
+ }
+
+ for_each_sgtable_dma_sg(buf->sgt, sg, i)
+ total_size += sg_dma_len(sg);
+
+ buf->dir = dir;
+ buf->dev = get_device(dev);
+ buf->len = total_size;
+ return 0;
+err:
+ io_dmabuf_release(buf);
+ return ret;
+}
diff --git a/io_uring/dmabuf.h b/io_uring/dmabuf.h
new file mode 100644
index 000000000000..c785ccfe0b9e
--- /dev/null
+++ b/io_uring/dmabuf.h
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_DMABUF_H
+#define IOU_DMABUF_H
+
+#include <linux/io_uring_types.h>
+#include <linux/dma-buf.h>
+
+struct io_dmabuf {
+ size_t len;
+ struct dma_buf_attachment *attach;
+ struct dma_buf *dmabuf;
+ struct sg_table *sgt;
+ struct device *dev;
+ enum dma_data_direction dir;
+};
+
+#ifdef CONFIG_DMA_SHARED_BUFFER
+void io_dmabuf_release(struct io_dmabuf *buf);
+int io_dmabuf_import(struct io_dmabuf *buf, int dmabuf_fd,
+ struct device *dev, enum dma_data_direction dir);
+
+#else
+static inline void io_dmabuf_release(struct io_dmabuf *buf)
+{
+}
+
+static inline int io_dmabuf_import(struct io_dmabuf *buf, int dmabuf_fd,
+ struct device *dev, enum dma_data_direction dir)
+{
+ return -EOPNOTSUPP;
+}
+#endif
+
+#endif
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC 09/12] io_uring/rsrc: add imu flags
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
` (7 preceding siblings ...)
2025-06-27 15:10 ` [RFC 08/12] io_uring: add basic dmabuf helpers Pavel Begunkov
@ 2025-06-27 15:10 ` Pavel Begunkov
2025-06-27 15:10 ` [RFC 10/12] io_uring/rsrc: add dmabuf-backed buffer registration Pavel Begunkov
` (3 subsequent siblings)
12 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2025-06-27 15:10 UTC (permalink / raw)
To: io-uring, linux-block, linux-nvme
Cc: linux-fsdevel, Keith Busch, David Wei, Vishal Verma, asml.silence
Replace is_kbuf with a flags field in io_mapped_ubuf. There will be new
flags shortly, and bit fields are often not as convenient to work with.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
io_uring/rsrc.c | 12 ++++++------
io_uring/rsrc.h | 6 +++++-
io_uring/rw.c | 3 ++-
3 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 21f4932ecafa..274274b80b96 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -850,7 +850,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
imu->folio_shift = PAGE_SHIFT;
imu->release = io_release_ubuf;
imu->priv = imu;
- imu->is_kbuf = false;
+ imu->flags = 0;
imu->dir = IO_IMU_DEST | IO_IMU_SOURCE;
if (coalesced)
imu->folio_shift = data.folio_shift;
@@ -999,7 +999,7 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
refcount_set(&imu->refs, 1);
imu->release = release;
imu->priv = rq;
- imu->is_kbuf = true;
+ imu->flags = IO_IMU_F_KBUF;
imu->dir = 1 << rq_data_dir(rq);
bvec = imu->bvec;
@@ -1034,7 +1034,7 @@ int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
ret = -EINVAL;
goto unlock;
}
- if (!node->buf->is_kbuf) {
+ if (!(node->buf->flags & IO_IMU_F_KBUF)) {
ret = -EBUSY;
goto unlock;
}
@@ -1100,7 +1100,7 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
offset = buf_addr - imu->ubuf;
- if (imu->is_kbuf)
+ if (imu->flags & IO_IMU_F_KBUF)
return io_import_kbuf(ddir, iter, imu, len, offset);
/*
@@ -1509,7 +1509,7 @@ int io_import_reg_vec(int ddir, struct iov_iter *iter,
iovec_off = vec->nr - nr_iovs;
iov = vec->iovec + iovec_off;
- if (imu->is_kbuf) {
+ if (imu->flags & IO_IMU_F_KBUF) {
int ret = io_kern_bvec_size(iov, nr_iovs, imu, &nr_segs);
if (unlikely(ret))
@@ -1543,7 +1543,7 @@ int io_import_reg_vec(int ddir, struct iov_iter *iter,
req->flags |= REQ_F_NEED_CLEANUP;
}
- if (imu->is_kbuf)
+ if (imu->flags & IO_IMU_F_KBUF)
return io_vec_fill_kern_bvec(ddir, iter, imu, iov, nr_iovs, vec);
return io_vec_fill_bvec(ddir, iter, imu, iov, nr_iovs, vec);
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 0d2138f16322..15ad4a885ae5 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -28,6 +28,10 @@ enum {
IO_IMU_SOURCE = 1 << ITER_SOURCE,
};
+enum {
+ IO_IMU_F_KBUF = 1,
+};
+
struct io_mapped_ubuf {
u64 ubuf;
unsigned int len;
@@ -37,7 +41,7 @@ struct io_mapped_ubuf {
unsigned long acct_pages;
void (*release)(void *);
void *priv;
- bool is_kbuf;
+ u8 flags;
u8 dir;
struct bio_vec bvec[] __counted_by(nr_bvecs);
};
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 710d8cd53ebb..cfcd7d26d8dc 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -696,7 +696,8 @@ static ssize_t loop_rw_iter(int ddir, struct io_rw *rw, struct iov_iter *iter)
if ((kiocb->ki_flags & IOCB_NOWAIT) &&
!(kiocb->ki_filp->f_flags & O_NONBLOCK))
return -EAGAIN;
- if ((req->flags & REQ_F_BUF_NODE) && req->buf_node->buf->is_kbuf)
+ if ((req->flags & REQ_F_BUF_NODE) &&
+ (req->buf_node->buf->flags & IO_IMU_F_KBUF))
return -EFAULT;
ppos = io_kiocb_ppos(kiocb);
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC 10/12] io_uring/rsrc: add dmabuf-backed buffer registration
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
` (8 preceding siblings ...)
2025-06-27 15:10 ` [RFC 09/12] io_uring/rsrc: add imu flags Pavel Begunkov
@ 2025-06-27 15:10 ` Pavel Begunkov
2025-06-27 15:10 ` [RFC 11/12] io_uring/rsrc: implement dmabuf regbuf import Pavel Begunkov
` (2 subsequent siblings)
12 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2025-06-27 15:10 UTC (permalink / raw)
To: io-uring, linux-block, linux-nvme
Cc: linux-fsdevel, Keith Busch, David Wei, Vishal Verma, asml.silence
Add the ability to register a dmabuf-backed io_uring buffer. It also
needs to know which device to use for attachment; for that it takes
target_fd and extracts the device through the new file op. Unlike
normal buffers, it also retains the target file so that imports from
ineligible requests can be rejected in the next patches.
Suggested-by: Vishal Verma <vishal1.verma@intel.com>
Suggested-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
io_uring/rsrc.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++-
io_uring/rsrc.h | 1 +
2 files changed, 118 insertions(+), 1 deletion(-)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 274274b80b96..f44aa2670bc5 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -10,6 +10,8 @@
#include <linux/compat.h>
#include <linux/io_uring.h>
#include <linux/io_uring/cmd.h>
+#include <linux/dma-map-ops.h>
+#include <linux/dma-buf.h>
#include <uapi/linux/io_uring.h>
@@ -18,6 +20,7 @@
#include "rsrc.h"
#include "memmap.h"
#include "register.h"
+#include "dmabuf.h"
struct io_rsrc_update {
struct file *file;
@@ -793,6 +796,117 @@ bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
return true;
}
+struct io_regbuf_dma {
+ struct io_dmabuf dmabuf;
+ struct dmavec *dmav;
+ struct file *target_file;
+};
+
+static void io_release_reg_dmabuf(struct io_regbuf_dma *db)
+{
+ if (db->dmav)
+ kfree(db->dmav);
+ io_dmabuf_release(&db->dmabuf);
+ if (db->target_file)
+ fput(db->target_file);
+
+ kfree(db);
+}
+
+static void io_release_reg_dmabuf_cb(void *priv)
+{
+ io_release_reg_dmabuf(priv);
+}
+
+static struct io_rsrc_node *io_register_dmabuf(struct io_ring_ctx *ctx,
+ struct io_uring_reg_buffer *rb,
+ struct iovec *iov)
+{
+ struct io_rsrc_node *node = NULL;
+ struct io_mapped_ubuf *imu = NULL;
+ struct io_regbuf_dma *regbuf;
+ struct file *target_file;
+ struct scatterlist *sg;
+ struct device *dev;
+ unsigned int segments;
+ int ret, i;
+
+ if (iov->iov_base || iov->iov_len)
+ return ERR_PTR(-EFAULT);
+
+ regbuf = kzalloc(sizeof(*regbuf), GFP_KERNEL);
+ if (!regbuf) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ target_file = fget(rb->target_fd);
+ if (!target_file) {
+ ret = -EBADF;
+ goto err;
+ }
+ regbuf->target_file = target_file;
+
+ if (!target_file->f_op->get_dma_device) {
+ ret = -EOPNOTSUPP;
+ goto err;
+ }
+ dev = target_file->f_op->get_dma_device(target_file);
+ if (IS_ERR(dev)) {
+ ret = PTR_ERR(dev);
+ goto err;
+ }
+
+ ret = io_dmabuf_import(&regbuf->dmabuf, rb->dmabuf_fd, dev,
+ DMA_BIDIRECTIONAL);
+ if (ret)
+ goto err;
+
+ segments = regbuf->dmabuf.sgt->nents;
+ regbuf->dmav = kmalloc_array(segments, sizeof(regbuf->dmav[0]),
+ GFP_KERNEL_ACCOUNT);
+ if (!regbuf->dmav) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ for_each_sgtable_dma_sg(regbuf->dmabuf.sgt, sg, i) {
+ regbuf->dmav[i].addr = sg_dma_address(sg);
+ regbuf->dmav[i].len = sg_dma_len(sg);
+ }
+
+ node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
+ if (!node) {
+ ret = -ENOMEM;
+ goto err;
+ }
+ imu = io_alloc_imu(ctx, 0);
+ if (!imu) {
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ imu->nr_bvecs = segments;
+ imu->ubuf = 0;
+ imu->len = regbuf->dmabuf.len;
+ imu->folio_shift = 0;
+ imu->release = io_release_reg_dmabuf_cb;
+ imu->priv = regbuf;
+ imu->flags = IO_IMU_F_DMA;
+ imu->dir = IO_IMU_DEST | IO_IMU_SOURCE;
+ refcount_set(&imu->refs, 1);
+ node->buf = imu;
+ return node;
+err:
+ if (regbuf)
+ io_release_reg_dmabuf(regbuf);
+ if (imu)
+ io_free_imu(ctx, imu);
+ if (node)
+ io_cache_free(&ctx->node_cache, node);
+ return ERR_PTR(ret);
+}
+
static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
struct io_uring_reg_buffer *rb,
struct iovec *iov,
@@ -808,7 +922,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
bool coalesced = false;
if (rb->dmabuf_fd != -1 || rb->target_fd != -1)
- return NULL;
+ return io_register_dmabuf(ctx, rb, iov);
if (!iov->iov_base)
return NULL;
@@ -1100,6 +1214,8 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
offset = buf_addr - imu->ubuf;
+ if (imu->flags & IO_IMU_F_DMA)
+ return -EOPNOTSUPP;
if (imu->flags & IO_IMU_F_KBUF)
return io_import_kbuf(ddir, iter, imu, len, offset);
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 15ad4a885ae5..f567ad82b76c 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -30,6 +30,7 @@ enum {
enum {
IO_IMU_F_KBUF = 1,
+ IO_IMU_F_DMA = 2,
};
struct io_mapped_ubuf {
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC 11/12] io_uring/rsrc: implement dmabuf regbuf import
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
` (9 preceding siblings ...)
2025-06-27 15:10 ` [RFC 10/12] io_uring/rsrc: add dmabuf-backed buffer registration Pavel Begunkov
@ 2025-06-27 15:10 ` Pavel Begunkov
2025-06-27 15:10 ` [RFC 12/12] io_uring/rw: enable dma registered buffers Pavel Begunkov
2025-07-03 14:23 ` [RFC 00/12] io_uring dmabuf read/write support Christoph Hellwig
12 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2025-06-27 15:10 UTC (permalink / raw)
To: io-uring, linux-block, linux-nvme
Cc: linux-fsdevel, Keith Busch, David Wei, Vishal Verma, asml.silence
Allow importing dmabuf-backed registered buffers. It's an opt-in
feature for requests, and they need to pass a flag allowing it.
Furthermore, the import will fail if the request's file doesn't match
the file for which the buffer was registered. That way it's also
limited to files that support the feature by implementing the
corresponding file op.
Suggested-by: David Wei <dw@davidwei.uk>
Suggested-by: Vishal Verma <vishal1.verma@intel.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
io_uring/rsrc.c | 53 ++++++++++++++++++++++++++++++++++++++++++-------
io_uring/rsrc.h | 16 ++++++++++++++-
2 files changed, 61 insertions(+), 8 deletions(-)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index f44aa2670bc5..11107491145c 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1196,9 +1196,44 @@ static int io_import_kbuf(int ddir, struct iov_iter *iter,
return 0;
}
-static int io_import_fixed(int ddir, struct iov_iter *iter,
+static int io_import_dmabuf(struct io_kiocb *req,
+ int ddir, struct iov_iter *iter,
struct io_mapped_ubuf *imu,
- u64 buf_addr, size_t len)
+ size_t len, size_t offset)
+{
+ struct io_regbuf_dma *db = imu->priv;
+ struct dmavec *dmavec = db->dmav;
+ int i = 0, start_idx, nr_segs;
+ ssize_t len_left;
+
+ if (req->file != db->target_file)
+ return -EBADF;
+ if (!len)
+ return -EFAULT;
+
+ while (offset >= dmavec[i].len) {
+ offset -= dmavec[i].len;
+ i++;
+ }
+ start_idx = i;
+
+ len_left = len;
+ while (len_left > 0) {
+ len_left -= dmavec[i].len;
+ i++;
+ }
+
+ nr_segs = i - start_idx;
+ iov_iter_dma(iter, ddir, dmavec + start_idx, nr_segs, len);
+ iter->iov_offset = offset;
+ return 0;
+}
+
+static int io_import_fixed(struct io_kiocb *req,
+ int ddir, struct iov_iter *iter,
+ struct io_mapped_ubuf *imu,
+ u64 buf_addr, size_t len,
+ unsigned import_flags)
{
const struct bio_vec *bvec;
size_t folio_mask;
@@ -1214,8 +1249,11 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
offset = buf_addr - imu->ubuf;
- if (imu->flags & IO_IMU_F_DMA)
- return -EOPNOTSUPP;
+ if (imu->flags & IO_IMU_F_DMA) {
+ if (!(import_flags & IO_REGBUF_IMPORT_ALLOW_DMA))
+ return -EFAULT;
+ return io_import_dmabuf(req, ddir, iter, imu, len, offset);
+ }
if (imu->flags & IO_IMU_F_KBUF)
return io_import_kbuf(ddir, iter, imu, len, offset);
@@ -1269,16 +1307,17 @@ inline struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
return NULL;
}
-int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
+int __io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
u64 buf_addr, size_t len, int ddir,
- unsigned issue_flags)
+ unsigned issue_flags, unsigned import_flags)
{
struct io_rsrc_node *node;
node = io_find_buf_node(req, issue_flags);
if (!node)
return -EFAULT;
- return io_import_fixed(ddir, iter, node->buf, buf_addr, len);
+ return io_import_fixed(req, ddir, iter, node->buf, buf_addr, len,
+ import_flags);
}
/* Lock two rings at once. The rings must be different! */
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index f567ad82b76c..64b7444b7899 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -33,6 +33,10 @@ enum {
IO_IMU_F_DMA = 2,
};
+enum {
+ IO_REGBUF_IMPORT_ALLOW_DMA = 1,
+};
+
struct io_mapped_ubuf {
u64 ubuf;
unsigned int len;
@@ -65,9 +69,19 @@ int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr);
struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
unsigned issue_flags);
+int __io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
+ u64 buf_addr, size_t len, int ddir,
+ unsigned issue_flags, unsigned import_flags);
+
+static inline
int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
u64 buf_addr, size_t len, int ddir,
- unsigned issue_flags);
+ unsigned issue_flags)
+{
+ return __io_import_reg_buf(req, iter, buf_addr, len, ddir,
+ issue_flags, 0);
+}
+
int io_import_reg_vec(int ddir, struct iov_iter *iter,
struct io_kiocb *req, struct iou_vec *vec,
unsigned nr_iovs, unsigned issue_flags);
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [RFC 12/12] io_uring/rw: enable dma registered buffers
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
` (10 preceding siblings ...)
2025-06-27 15:10 ` [RFC 11/12] io_uring/rsrc: implement dmabuf regbuf import Pavel Begunkov
@ 2025-06-27 15:10 ` Pavel Begunkov
2025-07-03 14:23 ` [RFC 00/12] io_uring dmabuf read/write support Christoph Hellwig
12 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2025-06-27 15:10 UTC (permalink / raw)
To: io-uring, linux-block, linux-nvme
Cc: linux-fsdevel, Keith Busch, David Wei, Vishal Verma, asml.silence
Enable dmabuf registered buffers in the read/write path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
io_uring/rw.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/io_uring/rw.c b/io_uring/rw.c
index cfcd7d26d8dc..78ac6a86521c 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -372,8 +372,8 @@ static int io_init_rw_fixed(struct io_kiocb *req, unsigned int issue_flags,
if (io->bytes_done)
return 0;
- ret = io_import_reg_buf(req, &io->iter, rw->addr, rw->len, ddir,
- issue_flags);
+ ret = __io_import_reg_buf(req, &io->iter, rw->addr, rw->len, ddir,
+ issue_flags, IO_REGBUF_IMPORT_ALLOW_DMA);
iov_iter_save_state(&io->iter, &io->iter_state);
return ret;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC 00/12] io_uring dmabuf read/write support
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
` (11 preceding siblings ...)
2025-06-27 15:10 ` [RFC 12/12] io_uring/rw: enable dma registered buffers Pavel Begunkov
@ 2025-07-03 14:23 ` Christoph Hellwig
2025-07-03 14:37 ` Christian König
2025-07-07 11:15 ` Pavel Begunkov
12 siblings, 2 replies; 19+ messages in thread
From: Christoph Hellwig @ 2025-07-03 14:23 UTC (permalink / raw)
To: Pavel Begunkov
Cc: io-uring, linux-block, linux-nvme, linux-fsdevel, Keith Busch,
David Wei, Vishal Verma, Sumit Semwal, Christian König,
linux-media, dri-devel, linaro-mm-sig
[Note: it would be really useful to Cc all relevant maintainers]
On Fri, Jun 27, 2025 at 04:10:27PM +0100, Pavel Begunkov wrote:
> This series implements it for read/write io_uring requests. The uAPI
> looks similar to normal registered buffers, the user will need to
> register a dmabuf in io_uring first and then use it as any other
> registered buffer. On registration the user also specifies a file
> to map the dmabuf for.
Just commenting from the in-kernel POV here, where the interface
feels wrong.
You can't just expose 'the DMA device' up through the file operations, because
there can be and often is more than one. Similarly stuffing a
dma_addr_t into an iovec is rather dangerous.
The model that should work much better is to have file operations
to attach to / detach from a dma_buf, and then have an iter that
specifies a dmabuf and offsets into it. That way the code behind the
file operations can forward the attachment to all the needed
devices (including more/less while it remains attached to the file)
and can pick the right dma address for each device.
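A rough sketch of what that shape could look like (all names here are
purely illustrative, nothing from this series):

	/* hypothetical file_operations additions */
	int  (*dmabuf_attach)(struct file *file, struct dma_buf *dmabuf);
	void (*dmabuf_detach)(struct file *file, struct dma_buf *dmabuf);

	/* hypothetical iter payload: names the buffer, carries no addresses */
	struct dmabuf_iov {
		struct dma_buf	*dmabuf;	/* what to read from / write to */
		size_t		offset;		/* offset into the dma_buf */
		size_t		len;
	};

The code behind ->dmabuf_attach() can then attach the same dma_buf to
however many devices sit behind the file and resolve the per-device
dma addresses itself, rather than the submitter picking a single
'DMA device' and stuffing dma_addr_t values into an iovec.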
I also remember some discussion that new dma-buf importers should
use the dynamic importer model for long-term imports, but as I'm
everything but an expert in that area I'll let the dma-buf folks
speak.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC 00/12] io_uring dmabuf read/write support
2025-07-03 14:23 ` [RFC 00/12] io_uring dmabuf read/write support Christoph Hellwig
@ 2025-07-03 14:37 ` Christian König
2025-07-07 11:15 ` Pavel Begunkov
1 sibling, 0 replies; 19+ messages in thread
From: Christian König @ 2025-07-03 14:37 UTC (permalink / raw)
To: Christoph Hellwig, Pavel Begunkov
Cc: io-uring, linux-block, linux-nvme, linux-fsdevel, Keith Busch,
David Wei, Vishal Verma, Sumit Semwal, linux-media, dri-devel,
linaro-mm-sig
On 03.07.25 16:23, Christoph Hellwig wrote:
> [Note: it would be really useful to Cc all relevant maintainers]
>
> On Fri, Jun 27, 2025 at 04:10:27PM +0100, Pavel Begunkov wrote:
>> This series implements it for read/write io_uring requests. The uAPI
>> looks similar to normal registered buffers, the user will need to
>> register a dmabuf in io_uring first and then use it as any other
>> registered buffer. On registration the user also specifies a file
>> to map the dmabuf for.
>
> Just commenting from the in-kernel POV here, where the interface
> feels wrong.
>
> You can't just expose 'the DMA device' up through the file operations, because
> there can be and often is more than one. Similarly stuffing a
> dma_addr_t into an iovec is rather dangerous.
>
> The model that should work much better is to have file operations
> to attach to / detach from a dma_buf, and then have an iter that
> specifies a dmabuf and offsets into it. That way the code behind the
> file operations can forward the attachment to all the needed
> devices (including more/less while it remains attached to the file)
> and can pick the right dma address for each device.
>
> I also remember some discussion that new dma-buf importers should
> use the dynamic importer model for long-term imports, but as I'm
> everything but an expert in that area I'll let the dma-buf folks
> speak.
Completely correct.
As long as you don't have a really good explanation and some mechanism to prevent abuse, long-term pinning of DMA-bufs should be avoided.
Regards,
Christian.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC 00/12] io_uring dmabuf read/write support
2025-07-03 14:23 ` [RFC 00/12] io_uring dmabuf read/write support Christoph Hellwig
2025-07-03 14:37 ` Christian König
@ 2025-07-07 11:15 ` Pavel Begunkov
2025-07-07 14:48 ` Christoph Hellwig
1 sibling, 1 reply; 19+ messages in thread
From: Pavel Begunkov @ 2025-07-07 11:15 UTC (permalink / raw)
To: Christoph Hellwig
Cc: io-uring, linux-block, linux-nvme, linux-fsdevel, Keith Busch,
David Wei, Vishal Verma, Sumit Semwal, Christian König,
linux-media, dri-devel, linaro-mm-sig
On 7/3/25 15:23, Christoph Hellwig wrote:
> [Note: it would be really useful to Cc all relevant maintainers]
Will do next time
> On Fri, Jun 27, 2025 at 04:10:27PM +0100, Pavel Begunkov wrote:
>> This series implements it for read/write io_uring requests. The uAPI
>> looks similar to normal registered buffers, the user will need to
>> register a dmabuf in io_uring first and then use it as any other
>> registered buffer. On registration the user also specifies a file
>> to map the dmabuf for.
>
> Just commenting from the in-kernel POV here, where the interface
> feels wrong.
>
> You can't just expose 'the DMA device' up through the file operations, because
> there can be and often is more than one. Similarly stuffing a
> dma_addr_t into an iovec is rather dangerous.
>
> The model that should work much better is to have file operations
> to attach to / detach from a dma_buf, and then have an iter that
> specifies a dmabuf and offsets into it. That way the code behind the
> file operations can forward the attachment to all the needed
> devices (including more/less while it remains attached to the file)
> and can pick the right dma address for each device.
By "iter that specifies a dmabuf" do you mean an opaque file-specific
structure allocated inside the new fop? Akin to what Keith proposed back
then. That sounds good and has more potential for various optimisations.
My concern would be growing struct iov_iter by an extra pointer:
struct dma_seg {
	size_t off;
	unsigned len;
};

struct iov_iter {
	union {
		struct iovec *iov;
		struct dma_seg *dmav;
		...
	};
	void *dma_token;
};
But maybe that's fine. It's 40B -> 48B, and it'll get back to
40 when / if xarray_start / ITER_XARRAY is removed.
> I also remember some discussion that new dma-buf importers should
> use the dynamic importer model for long-term imports, but as I'm
> everything but an expert in that area I'll let the dma-buf folks
> speak.
I'll take a look
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC 00/12] io_uring dmabuf read/write support
2025-07-07 11:15 ` Pavel Begunkov
@ 2025-07-07 14:48 ` Christoph Hellwig
2025-07-07 15:41 ` Pavel Begunkov
0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-07-07 14:48 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Christoph Hellwig, io-uring, linux-block, linux-nvme,
linux-fsdevel, Keith Busch, David Wei, Vishal Verma, Sumit Semwal,
Christian König, linux-media, dri-devel, linaro-mm-sig
On Mon, Jul 07, 2025 at 12:15:54PM +0100, Pavel Begunkov wrote:
> > to attach to / detach from a dma_buf, and then have an iter that
> > specifies a dmabuf and offsets into it. That way the code behind the
> > file operations can forward the attachment to all the needed
> > devices (including more/less while it remains attached to the file)
> > and can pick the right dma address for each device.
>
> By "iter that specifies a dmabuf" do you mean an opaque file-specific
> structure allocated inside the new fop?
I mean a reference to the actual dma_buf (probably indirect through the file
* for it, but listen to the dma_buf experts for that and not me).
> Akin to what Keith proposed back
> then. That sounds good and has more potential for various optimisations.
> My concern would be growing struct iov_iter by an extra pointer:
> struct iov_iter {
> 	union {
> 		struct iovec *iov;
> 		struct dma_seg *dmav;
> 		...
> 	};
> 	void *dma_token;
> };
>
> But maybe that's fine. It's 40B -> 48B,
Alternatively we could have the union point to a struct that has the dma_buf
pointer and a variable length array of dma_segs. Not sure if that would
create a mess in the callers, though.
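Something like this, I guess (illustrative only):

	struct dma_iter_buf {
		struct dma_buf	*dmabuf;	/* or an opaque token for it */
		unsigned int	nr_segs;
		struct dma_seg	segs[];		/* the off/len pairs from your sketch */
	};

so the iter keeps a single union pointer plus the usual count/offset
bookkeeping, at the cost of one more indirection on every walk.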
> and it'll get back to
> 40 when / if xarray_start / ITER_XARRAY is removed.
Would it? At least for 64-bit architectures nr_segs is the same size.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC 00/12] io_uring dmabuf read/write support
2025-07-07 14:48 ` Christoph Hellwig
@ 2025-07-07 15:41 ` Pavel Begunkov
2025-07-08 9:45 ` Christoph Hellwig
0 siblings, 1 reply; 19+ messages in thread
From: Pavel Begunkov @ 2025-07-07 15:41 UTC (permalink / raw)
To: Christoph Hellwig
Cc: io-uring, linux-block, linux-nvme, linux-fsdevel, Keith Busch,
David Wei, Vishal Verma, Sumit Semwal, Christian König,
linux-media, dri-devel, linaro-mm-sig
On 7/7/25 15:48, Christoph Hellwig wrote:
> On Mon, Jul 07, 2025 at 12:15:54PM +0100, Pavel Begunkov wrote:
>>> to attach to / detach from a dma_buf, and then have an iter that
>>> specifies a dmabuf and offsets into it. That way the code behind the
>>> file operations can forward the attachment to all the needed
>>> devices (including more/less while it remains attached to the file)
>>> and can pick the right dma address for each device.
>>
>> By "iter that specifies a dmabuf" do you mean an opaque file-specific
>> structure allocated inside the new fop?
>
> I mean a reference to the actual dma_buf (probably indirect through the file
> * for it, but listen to the dma_buf experts for that and not me).
My expectation is that io_uring would pass struct dma_buf to the
file during registration, so that it can do a bunch of work upfront,
but iterators will carry something already pre-attached and pre-DMA-mapped,
probably in a file specific format hiding details for multi-device
support, and possibly bundled with the dma-buf pointer if necessary.
(All modulo move notify which I need to look into first).
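For the registration half, I'd imagine something roughly like this
(just a sketch; io_reg_dmabuf() and the ->dmabuf_attach() fop are made
up along the lines discussed above, while dma_buf_get()/dma_buf_put()
are the existing dma-buf API):

	static int io_reg_dmabuf(struct file *file, int dma_buf_fd)
	{
		struct dma_buf *dmabuf;
		int ret;

		/* existing dma-buf API: fd -> struct dma_buf reference */
		dmabuf = dma_buf_get(dma_buf_fd);
		if (IS_ERR(dmabuf))
			return PTR_ERR(dmabuf);

		/*
		 * Hypothetical fop: the file attaches / maps for its
		 * device(s) up front. On success the dma_buf reference
		 * stays with the registration (not shown here), and the
		 * import path only ever sees opaque pre-mapped state.
		 */
		ret = file->f_op->dmabuf_attach(file, dmabuf);
		if (ret)
			dma_buf_put(dmabuf);
		return ret;
	}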
>> Akin to what Keith proposed back
>> then. That sounds good and has more potential for various optimisations.
>> My concern would be growing struct iov_iter by an extra pointer:
>
>> struct iov_iter {
>> 	union {
>> 		struct iovec *iov;
>> 		struct dma_seg *dmav;
>> 		...
>> 	};
>> 	void *dma_token;
>> };
>>
>> But maybe that's fine. It's 40B -> 48B,
>
> Alternatively we could have the union point to a struct that has the dma_buf
> pointer and a variable length array of dma_segs. Not sure if that would
> create a mess in the callers, though.
Iteration helpers adjust the pointer, so either it needs to store
the pointer directly in iter or keep the current index. It could rely
solely on offsets, but that'll be a mess with nested loops (where the
inner one would walk some kind of sg table).
>> and it'll get back to
>> 40 when / if xarray_start / ITER_XARRAY is removed.
>
> Would it? At least for 64-bit architectures nr_segs is the same size.
Ah yes
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC 00/12] io_uring dmabuf read/write support
2025-07-07 15:41 ` Pavel Begunkov
@ 2025-07-08 9:45 ` Christoph Hellwig
0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2025-07-08 9:45 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Christoph Hellwig, io-uring, linux-block, linux-nvme,
linux-fsdevel, Keith Busch, David Wei, Vishal Verma, Sumit Semwal,
Christian König, linux-media, dri-devel, linaro-mm-sig
On Mon, Jul 07, 2025 at 04:41:23PM +0100, Pavel Begunkov wrote:
> > I mean a reference to the actual dma_buf (probably indirect through the file
> > * for it, but listen to the dma_buf experts for that and not me).
>
> My expectation is that io_uring would pass struct dma_buf to the
io_uring isn't the only user. We've already had one other use case
come up very recently: pre-loading media files on mobile. It's
also a really good interface for P2P transfers of any kind.
> file during registration, so that it can do a bunch of work upfront,
> but iterators will carry something already pre-attached and pre-DMA-mapped,
> probably in a file specific format hiding details for multi-device
> support, and possibly bundled with the dma-buf pointer if necessary.
> (All modulo move notify which I need to look into first).
I'd expect that the exporter passes around the dma_buf, and something
that has access to it then imports it into the file. This could be
directly forwarded to the device for the initial scope in your series,
where you only support it for block device files.
Now we have two variants:
1) the file instance returns a cookie for the registration that the
caller has to pass into every read/write
2) the file instance tracks said cookie itself and matches it on
every read/write
1) sounds faster, 2) has more sanity checking and could prevent things
from going wrong.
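In pseudo-code, just to spell the two out (do_dmabuf_read() and the
cookie are made up; variant 1 implies the attach hook hands something
back rather than just returning 0/-errno):

	/* 1) caller-held cookie, passed back in on every read/write */
	cookie = file->f_op->dmabuf_attach(file, dmabuf);
	ret = do_dmabuf_read(file, cookie, off, len, pos);

	/* 2) file-held registration, matched against the dma_buf (or an
	 *    index for it) on every read/write */
	file->f_op->dmabuf_attach(file, dmabuf);
	ret = do_dmabuf_read(file, dmabuf, off, len, pos);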
(all this is based on my limited dma_buf understanding, corrections
always welcome).
> > > But maybe that's fine. It's 40B -> 48B,
> >
> > Alternatively we could have the union point to a struct that has the dma_buf
> > pointer and a variable length array of dma_segs. Not sure if that would
> > create a mess in the callers, though.
>
> Iteration helpers adjust the pointer, so either it needs to store
> the pointer directly in iter or keep the current index. It could rely
> solely on offsets, but that'll be a mess with nested loops (where the
> inner one would walk some kind of sg table).
Yeah. Maybe just keep it as a separate pointer, growing the structure,
and see if anyone screams.
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2025-07-08  9:45 UTC | newest]
Thread overview: 19+ messages
2025-06-27 15:10 [RFC 00/12] io_uring dmabuf read/write support Pavel Begunkov
2025-06-27 15:10 ` [RFC 01/12] file: add callback returning dev for dma operations Pavel Begunkov
2025-06-27 15:10 ` [RFC 02/12] iov_iter: introduce iter type for pre-registered dma Pavel Begunkov
2025-06-27 15:10 ` [RFC 03/12] block: move around bio flagging helpers Pavel Begunkov
2025-06-27 15:10 ` [RFC 04/12] block: introduce dmavec bio type Pavel Begunkov
2025-06-27 15:10 ` [RFC 05/12] block: implement ->get_dma_device callback Pavel Begunkov
2025-06-27 15:10 ` [RFC 06/12] nvme-pci: add support for user passed dma vectors Pavel Begunkov
2025-06-27 15:10 ` [RFC 07/12] io_uring/rsrc: extended reg buffer registration Pavel Begunkov
2025-06-27 15:10 ` [RFC 08/12] io_uring: add basic dmabuf helpers Pavel Begunkov
2025-06-27 15:10 ` [RFC 09/12] io_uring/rsrc: add imu flags Pavel Begunkov
2025-06-27 15:10 ` [RFC 10/12] io_uring/rsrc: add dmabuf-backed buffer registeration Pavel Begunkov
2025-06-27 15:10 ` [RFC 11/12] io_uring/rsrc: implement dmabuf regbuf import Pavel Begunkov
2025-06-27 15:10 ` [RFC 12/12] io_uring/rw: enable dma registered buffers Pavel Begunkov
2025-07-03 14:23 ` [RFC 00/12] io_uring dmabuf read/write support Christoph Hellwig
2025-07-03 14:37 ` Christian König
2025-07-07 11:15 ` Pavel Begunkov
2025-07-07 14:48 ` Christoph Hellwig
2025-07-07 15:41 ` Pavel Begunkov
2025-07-08 9:45 ` Christoph Hellwig