public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCHv2 0/6] ublk zero-copy support
@ 2025-02-11  0:56 Keith Busch
  2025-02-11  0:56 ` [PATCHv2 1/6] io_uring: use node for import Keith Busch
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Keith Busch @ 2025-02-11  0:56 UTC (permalink / raw)
  To: ming.lei, asml.silence, axboe, linux-block, io-uring; +Cc: bernd, Keith Busch

From: Keith Busch <[email protected]>

Previous version was discussed here:

  https://lore.kernel.org/linux-block/[email protected]/

The same ublksrv reference code in that link was used to test the kernel
side changes.

Before listing what has changed, I want to mention what is the same: the
reliance on the ring ctx lock to serialize the register ahead of any
use. I'm not ignoring the feedback; I just don't have a solid answer
right now, and want to progress on the other fronts in the meantime.

Here's what's different from the previous:

 - Introduced an optional 'release' callback when the resource node is
   no longer referenced. The callback addresses any buggy applications
   that may complete their request and unregister their index while IO
   is in flight. This obviates any need to take extra page references
   since it prevents the request from completing.

 - Removed peeking into the io_cache element size and instead use a
   more intuitive bvec segment count limit to decide if we're caching
   the imu (suggested by Pavel).

 - Dropped the const request changes; it's not needed.

 - Merged up to latest block uring branch

Jens Axboe (1):
  io_uring: use node for import

Keith Busch (5):
  io_uring: create resource release callback
  io_uring: add support for kernel registered bvecs
  ublk: zc register/unregister bvec
  io_uring: add abstraction for buf_table rsrc data
  io_uring: cache nodes and mapped buffers

 drivers/block/ublk_drv.c       | 145 ++++++++++++++-----
 include/linux/io_uring.h       |   1 +
 include/linux/io_uring_types.h |  28 ++--
 include/uapi/linux/ublk_cmd.h  |   4 +
 io_uring/fdinfo.c              |   8 +-
 io_uring/filetable.c           |   2 +-
 io_uring/net.c                 |   5 +-
 io_uring/nop.c                 |   2 +-
 io_uring/register.c            |   2 +-
 io_uring/rsrc.c                | 252 ++++++++++++++++++++++++++-------
 io_uring/rsrc.h                |  11 +-
 io_uring/rw.c                  |   4 +-
 io_uring/uring_cmd.c           |   4 +-
 13 files changed, 355 insertions(+), 113 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCHv2 1/6] io_uring: use node for import
  2025-02-11  0:56 [PATCHv2 0/6] ublk zero-copy support Keith Busch
@ 2025-02-11  0:56 ` Keith Busch
  2025-02-11  0:56 ` [PATCHv2 2/6] io_uring: create resource release callback Keith Busch
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2025-02-11  0:56 UTC (permalink / raw)
  To: ming.lei, asml.silence, axboe, linux-block, io-uring; +Cc: bernd, Keith Busch

From: Jens Axboe <[email protected]>

Replace the mapped buffer to the parent node. This is preparing for a
future for different types with specific handling considerations.

Signed-off-by: Keith Busch <[email protected]>
---
 io_uring/net.c       | 3 +--
 io_uring/rsrc.c      | 6 +++---
 io_uring/rsrc.h      | 5 ++---
 io_uring/rw.c        | 2 +-
 io_uring/uring_cmd.c | 2 +-
 5 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index 10344b3a6d89c..280d576e89249 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1377,8 +1377,7 @@ static int io_send_zc_import(struct io_kiocb *req, unsigned int issue_flags)
 			return ret;
 
 		ret = io_import_fixed(ITER_SOURCE, &kmsg->msg.msg_iter,
-					node->buf, (u64)(uintptr_t)sr->buf,
-					sr->len);
+					node, (u64)(uintptr_t)sr->buf, sr->len);
 		if (unlikely(ret))
 			return ret;
 		kmsg->msg.sg_from_iter = io_sg_from_iter;
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index af39b69eb4fde..4d0e1c06c8bc6 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -860,10 +860,10 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
-int io_import_fixed(int ddir, struct iov_iter *iter,
-			   struct io_mapped_ubuf *imu,
-			   u64 buf_addr, size_t len)
+int io_import_fixed(int ddir, struct iov_iter *iter, struct io_rsrc_node *node,
+		    u64 buf_addr, size_t len)
 {
+	struct io_mapped_ubuf *imu = node->buf;
 	u64 buf_end;
 	size_t offset;
 
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 190f7ee45de93..abd0d5d42c3e1 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -50,9 +50,8 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node);
 void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data);
 int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr);
 
-int io_import_fixed(int ddir, struct iov_iter *iter,
-			   struct io_mapped_ubuf *imu,
-			   u64 buf_addr, size_t len);
+int io_import_fixed(int ddir, struct iov_iter *iter, struct io_rsrc_node *node,
+		    u64 buf_addr, size_t len);
 
 int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg);
 int io_sqe_buffers_unregister(struct io_ring_ctx *ctx);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 7aa1e4c9f64a3..c25e0ab5c996b 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -369,7 +369,7 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
 	io_req_assign_buf_node(req, node);
 
 	io = req->async_data;
-	ret = io_import_fixed(ddir, &io->iter, node->buf, rw->addr, rw->len);
+	ret = io_import_fixed(ddir, &io->iter, node, rw->addr, rw->len);
 	iov_iter_save_state(&io->iter, &io->iter_state);
 	return ret;
 }
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 1f6a82128b475..aebbe2a4c7183 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -274,7 +274,7 @@ int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 
 	/* Must have had rsrc_node assigned at prep time */
 	if (node)
-		return io_import_fixed(rw, iter, node->buf, ubuf, len);
+		return io_import_fixed(rw, iter, node, ubuf, len);
 
 	return -EFAULT;
 }
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCHv2 2/6] io_uring: create resource release callback
  2025-02-11  0:56 [PATCHv2 0/6] ublk zero-copy support Keith Busch
  2025-02-11  0:56 ` [PATCHv2 1/6] io_uring: use node for import Keith Busch
@ 2025-02-11  0:56 ` Keith Busch
  2025-02-13  1:31   ` Pavel Begunkov
  2025-02-11  0:56 ` [PATCHv2 3/6] io_uring: add support for kernel registered bvecs Keith Busch
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Keith Busch @ 2025-02-11  0:56 UTC (permalink / raw)
  To: ming.lei, asml.silence, axboe, linux-block, io-uring; +Cc: bernd, Keith Busch

From: Keith Busch <[email protected]>

When a registered resource is about to be freed, check if it has
registered a release function, and call it if so. This is preparing for
resources that are related to an external component.

Signed-off-by: Keith Busch <[email protected]>
---
 io_uring/rsrc.c | 2 ++
 io_uring/rsrc.h | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 4d0e1c06c8bc6..30f08cf13ef60 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -455,6 +455,8 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 	case IORING_RSRC_BUFFER:
 		if (node->buf)
 			io_buffer_unmap(ctx, node);
+		if (node->release)
+			node->release(node->priv);
 		break;
 	default:
 		WARN_ON_ONCE(1);
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index abd0d5d42c3e1..a3826ab84e666 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -24,6 +24,9 @@ struct io_rsrc_node {
 		unsigned long file_ptr;
 		struct io_mapped_ubuf *buf;
 	};
+
+	void (*release)(void *);
+	void *priv;
 };
 
 struct io_mapped_ubuf {
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCHv2 3/6] io_uring: add support for kernel registered bvecs
  2025-02-11  0:56 [PATCHv2 0/6] ublk zero-copy support Keith Busch
  2025-02-11  0:56 ` [PATCHv2 1/6] io_uring: use node for import Keith Busch
  2025-02-11  0:56 ` [PATCHv2 2/6] io_uring: create resource release callback Keith Busch
@ 2025-02-11  0:56 ` Keith Busch
  2025-02-13  1:33   ` Pavel Begunkov
  2025-02-14  3:30   ` Ming Lei
  2025-02-11  0:56 ` [PATCHv2 4/6] ublk: zc register/unregister bvec Keith Busch
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Keith Busch @ 2025-02-11  0:56 UTC (permalink / raw)
  To: ming.lei, asml.silence, axboe, linux-block, io-uring; +Cc: bernd, Keith Busch

From: Keith Busch <[email protected]>

Provide an interface for the kernel to leverage the existing
pre-registered buffers that io_uring provides. User space can reference
these later to achieve zero-copy IO.

User space must register an empty fixed buffer table with io_uring in
order for the kernel to make use of it.

Signed-off-by: Keith Busch <[email protected]>
---
 include/linux/io_uring.h       |   1 +
 include/linux/io_uring_types.h |   4 ++
 io_uring/rsrc.c                | 100 +++++++++++++++++++++++++++++++--
 io_uring/rsrc.h                |   1 +
 4 files changed, 100 insertions(+), 6 deletions(-)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 85fe4e6b275c7..b5637a2aae340 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -5,6 +5,7 @@
 #include <linux/sched.h>
 #include <linux/xarray.h>
 #include <uapi/linux/io_uring.h>
+#include <linux/blk-mq.h>
 
 #if defined(CONFIG_IO_URING)
 void __io_uring_cancel(bool cancel_all);
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index e2fef264ff8b8..99aac2d52fbae 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -693,4 +693,8 @@ static inline bool io_ctx_cqe32(struct io_ring_ctx *ctx)
 	return ctx->flags & IORING_SETUP_CQE32;
 }
 
+int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
+			    void (*release)(void *), unsigned int index);
+void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int tag);
+
 #endif
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 30f08cf13ef60..14efec8587888 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -110,8 +110,9 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 
 		if (!refcount_dec_and_test(&imu->refs))
 			return;
-		for (i = 0; i < imu->nr_bvecs; i++)
-			unpin_user_page(imu->bvec[i].bv_page);
+		if (node->type == IORING_RSRC_BUFFER)
+			for (i = 0; i < imu->nr_bvecs; i++)
+				unpin_user_page(imu->bvec[i].bv_page);
 		if (imu->acct_pages)
 			io_unaccount_mem(ctx, imu->acct_pages);
 		kvfree(imu);
@@ -240,6 +241,13 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 		struct io_rsrc_node *node;
 		u64 tag = 0;
 
+		i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
+		node = io_rsrc_node_lookup(&ctx->buf_table, i);
+		if (node && node->type != IORING_RSRC_BUFFER) {
+			err = -EBUSY;
+			break;
+		}
+
 		uvec = u64_to_user_ptr(user_data);
 		iov = iovec_from_user(uvec, 1, 1, &fast_iov, ctx->compat);
 		if (IS_ERR(iov)) {
@@ -265,7 +273,6 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 			}
 			node->tag = tag;
 		}
-		i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
 		io_reset_rsrc_node(ctx, &ctx->buf_table, i);
 		ctx->buf_table.nodes[i] = node;
 		if (ctx->compat)
@@ -452,6 +459,7 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 		if (io_slot_file(node))
 			fput(io_slot_file(node));
 		break;
+	case IORING_RSRC_KBUFFER:
 	case IORING_RSRC_BUFFER:
 		if (node->buf)
 			io_buffer_unmap(ctx, node);
@@ -862,6 +870,79 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
+int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
+			    void (*release)(void *), unsigned int index)
+{
+	struct io_rsrc_data *data = &ctx->buf_table;
+	struct req_iterator rq_iter;
+	struct io_mapped_ubuf *imu;
+	struct io_rsrc_node *node;
+	struct bio_vec bv;
+	u16 nr_bvecs;
+	int i = 0;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	if (!data->nr)
+		return -EINVAL;
+	if (index >= data->nr)
+		return -EINVAL;
+
+	node = data->nodes[index];
+	if (node)
+		return -EBUSY;
+
+	node = io_rsrc_node_alloc(IORING_RSRC_KBUFFER);
+	if (!node)
+		return -ENOMEM;
+
+	node->release = release;
+	node->priv = rq;
+
+	nr_bvecs = blk_rq_nr_phys_segments(rq);
+	imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
+	if (!imu) {
+		kfree(node);
+		return -ENOMEM;
+	}
+
+	imu->ubuf = 0;
+	imu->len = blk_rq_bytes(rq);
+	imu->acct_pages = 0;
+	imu->nr_bvecs = nr_bvecs;
+	refcount_set(&imu->refs, 1);
+	node->buf = imu;
+
+	rq_for_each_bvec(bv, rq, rq_iter)
+		bvec_set_page(&node->buf->bvec[i++], bv.bv_page, bv.bv_len,
+			      bv.bv_offset);
+	data->nodes[index] = node;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
+
+void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index)
+{
+	struct io_rsrc_data *data = &ctx->buf_table;
+	struct io_rsrc_node *node;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	if (!data->nr)
+		return;
+	if (index >= data->nr)
+		return;
+
+	node = data->nodes[index];
+	if (!node || !node->buf)
+		return;
+	if (node->type != IORING_RSRC_KBUFFER)
+		return;
+	io_reset_rsrc_node(ctx, data, index);
+}
+EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
+
 int io_import_fixed(int ddir, struct iov_iter *iter, struct io_rsrc_node *node,
 		    u64 buf_addr, size_t len)
 {
@@ -888,8 +969,8 @@ int io_import_fixed(int ddir, struct iov_iter *iter, struct io_rsrc_node *node,
 		/*
 		 * Don't use iov_iter_advance() here, as it's really slow for
 		 * using the latter parts of a big fixed buffer - it iterates
-		 * over each segment manually. We can cheat a bit here, because
-		 * we know that:
+		 * over each segment manually. We can cheat a bit here for user
+		 * registered nodes, because we know that:
 		 *
 		 * 1) it's a BVEC iter, we set it up
 		 * 2) all bvecs are the same in size, except potentially the
@@ -903,8 +984,15 @@ int io_import_fixed(int ddir, struct iov_iter *iter, struct io_rsrc_node *node,
 		 */
 		const struct bio_vec *bvec = imu->bvec;
 
+		/*
+		 * Kernel buffer bvecs, on the other hand, don't necessarily
+		 * have the size property of user registered ones, so we have
+		 * to use the slow iter advance.
+		 */
 		if (offset < bvec->bv_len) {
 			iter->iov_offset = offset;
+		} else if (node->type == IORING_RSRC_KBUFFER) {
+			iov_iter_advance(iter, offset);
 		} else {
 			unsigned long seg_skip;
 
@@ -1004,7 +1092,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 		if (!src_node) {
 			dst_node = NULL;
 		} else {
-			dst_node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+			dst_node = io_rsrc_node_alloc(src_node->type);
 			if (!dst_node) {
 				ret = -ENOMEM;
 				goto out_free;
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index a3826ab84e666..8147dfc26f737 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -13,6 +13,7 @@
 enum {
 	IORING_RSRC_FILE		= 0,
 	IORING_RSRC_BUFFER		= 1,
+	IORING_RSRC_KBUFFER		= 2,
 };
 
 struct io_rsrc_node {
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCHv2 4/6] ublk: zc register/unregister bvec
  2025-02-11  0:56 [PATCHv2 0/6] ublk zero-copy support Keith Busch
                   ` (2 preceding siblings ...)
  2025-02-11  0:56 ` [PATCHv2 3/6] io_uring: add support for kernel registered bvecs Keith Busch
@ 2025-02-11  0:56 ` Keith Busch
  2025-02-12  2:49   ` Ming Lei
  2025-02-13  2:12   ` Pavel Begunkov
  2025-02-11  0:56 ` [PATCHv2 5/6] io_uring: add abstraction for buf_table rsrc data Keith Busch
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 30+ messages in thread
From: Keith Busch @ 2025-02-11  0:56 UTC (permalink / raw)
  To: ming.lei, asml.silence, axboe, linux-block, io-uring; +Cc: bernd, Keith Busch

From: Keith Busch <[email protected]>

Provide new operations for the user to request mapping an active request
to an io uring instance's buf_table. The user has to provide the index
it wants to install the buffer.

A reference count is taken on the request to ensure it can't be
completed while it is active in a ring's buf_table.

Signed-off-by: Keith Busch <[email protected]>
---
 drivers/block/ublk_drv.c      | 145 +++++++++++++++++++++++++---------
 include/uapi/linux/ublk_cmd.h |   4 +
 2 files changed, 113 insertions(+), 36 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 529085181f355..ccfda7b2c24da 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -51,6 +51,9 @@
 /* private ioctl command mirror */
 #define UBLK_CMD_DEL_DEV_ASYNC	_IOC_NR(UBLK_U_CMD_DEL_DEV_ASYNC)
 
+#define UBLK_IO_REGISTER_IO_BUF		_IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
+#define UBLK_IO_UNREGISTER_IO_BUF	_IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)
+
 /* All UBLK_F_* have to be included into UBLK_F_ALL */
 #define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \
 		| UBLK_F_URING_CMD_COMP_IN_TASK \
@@ -76,6 +79,9 @@ struct ublk_rq_data {
 	struct llist_node node;
 
 	struct kref ref;
+
+#define UBLK_ZC_REGISTERED 0
+	unsigned long flags;
 };
 
 struct ublk_uring_cmd_pdu {
@@ -201,7 +207,7 @@ static inline struct ublksrv_io_desc *ublk_get_iod(struct ublk_queue *ubq,
 						   int tag);
 static inline bool ublk_dev_is_user_copy(const struct ublk_device *ub)
 {
-	return ub->dev_info.flags & UBLK_F_USER_COPY;
+	return ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
 }
 
 static inline bool ublk_dev_is_zoned(const struct ublk_device *ub)
@@ -581,7 +587,7 @@ static void ublk_apply_params(struct ublk_device *ub)
 
 static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
 {
-	return ubq->flags & UBLK_F_USER_COPY;
+	return ubq->flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
 }
 
 static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
@@ -1747,6 +1753,102 @@ static inline void ublk_prep_cancel(struct io_uring_cmd *cmd,
 	io_uring_cmd_mark_cancelable(cmd, issue_flags);
 }
 
+static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
+		struct ublk_queue *ubq, int tag, size_t offset)
+{
+	struct request *req;
+
+	if (!ublk_need_req_ref(ubq))
+		return NULL;
+
+	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+	if (!req)
+		return NULL;
+
+	if (!ublk_get_req_ref(ubq, req))
+		return NULL;
+
+	if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
+		goto fail_put;
+
+	if (!ublk_rq_has_data(req))
+		goto fail_put;
+
+	if (offset > blk_rq_bytes(req))
+		goto fail_put;
+
+	return req;
+fail_put:
+	ublk_put_req_ref(ubq, req);
+	return NULL;
+}
+
+static void ublk_io_release(void *priv)
+{
+	struct request *rq = priv;
+	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
+
+	ublk_put_req_ref(ubq, rq);
+}
+
+static int ublk_register_io_buf(struct io_uring_cmd *cmd,
+				struct ublk_queue *ubq, int tag,
+				const struct ublksrv_io_cmd *ub_cmd)
+{
+	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+	struct ublk_device *ub = cmd->file->private_data;
+	int index = (int)ub_cmd->addr, ret;
+	struct ublk_rq_data *data;
+	struct request *req;
+
+	if (!ub)
+		return -EPERM;
+
+	req = __ublk_check_and_get_req(ub, ubq, tag, 0);
+	if (!req)
+		return -EINVAL;
+
+	data = blk_mq_rq_to_pdu(req);
+	if (test_and_set_bit(UBLK_ZC_REGISTERED, &data->flags)) {
+		ublk_put_req_ref(ubq, req);
+		return -EBUSY;
+	}
+
+	ret = io_buffer_register_bvec(ctx, req, ublk_io_release, index);
+	if (ret) {
+		clear_bit(UBLK_ZC_REGISTERED, &data->flags);
+		ublk_put_req_ref(ubq, req);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
+				  struct ublk_queue *ubq, int tag,
+				  const struct ublksrv_io_cmd *ub_cmd)
+{
+	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+	struct ublk_device *ub = cmd->file->private_data;
+	int index = (int)ub_cmd->addr;
+	struct ublk_rq_data *data;
+	struct request *req;
+
+	if (!ub)
+		return -EPERM;
+
+	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+	if (!req)
+		return -EINVAL;
+
+	data = blk_mq_rq_to_pdu(req);
+	if (!test_and_clear_bit(UBLK_ZC_REGISTERED, &data->flags))
+		return -EINVAL;
+
+	io_buffer_unregister_bvec(ctx, index);
+	return 0;
+}
+
 static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 			       unsigned int issue_flags,
 			       const struct ublksrv_io_cmd *ub_cmd)
@@ -1798,6 +1900,10 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 
 	ret = -EINVAL;
 	switch (_IOC_NR(cmd_op)) {
+	case UBLK_IO_REGISTER_IO_BUF:
+		return ublk_register_io_buf(cmd, ubq, tag, ub_cmd);
+	case UBLK_IO_UNREGISTER_IO_BUF:
+		return ublk_unregister_io_buf(cmd, ubq, tag, ub_cmd);
 	case UBLK_IO_FETCH_REQ:
 		/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
 		if (ublk_queue_ready(ubq)) {
@@ -1872,36 +1978,6 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 	return -EIOCBQUEUED;
 }
 
-static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
-		struct ublk_queue *ubq, int tag, size_t offset)
-{
-	struct request *req;
-
-	if (!ublk_need_req_ref(ubq))
-		return NULL;
-
-	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
-	if (!req)
-		return NULL;
-
-	if (!ublk_get_req_ref(ubq, req))
-		return NULL;
-
-	if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
-		goto fail_put;
-
-	if (!ublk_rq_has_data(req))
-		goto fail_put;
-
-	if (offset > blk_rq_bytes(req))
-		goto fail_put;
-
-	return req;
-fail_put:
-	ublk_put_req_ref(ubq, req);
-	return NULL;
-}
-
 static inline int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd,
 		unsigned int issue_flags)
 {
@@ -2527,9 +2603,6 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 		goto out_free_dev_number;
 	}
 
-	/* We are not ready to support zero copy */
-	ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
-
 	ub->dev_info.nr_hw_queues = min_t(unsigned int,
 			ub->dev_info.nr_hw_queues, nr_cpu_ids);
 	ublk_align_max_io_size(ub);
@@ -2860,7 +2933,7 @@ static int ublk_ctrl_get_features(struct io_uring_cmd *cmd)
 {
 	const struct ublksrv_ctrl_cmd *header = io_uring_sqe_cmd(cmd->sqe);
 	void __user *argp = (void __user *)(unsigned long)header->addr;
-	u64 features = UBLK_F_ALL & ~UBLK_F_SUPPORT_ZERO_COPY;
+	u64 features = UBLK_F_ALL;
 
 	if (header->len != UBLK_FEATURES_LEN || !header->addr)
 		return -EINVAL;
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index a8bc98bb69fce..74246c926b55f 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -94,6 +94,10 @@
 	_IOWR('u', UBLK_IO_COMMIT_AND_FETCH_REQ, struct ublksrv_io_cmd)
 #define	UBLK_U_IO_NEED_GET_DATA		\
 	_IOWR('u', UBLK_IO_NEED_GET_DATA, struct ublksrv_io_cmd)
+#define	UBLK_U_IO_REGISTER_IO_BUF	\
+	_IOWR('u', 0x23, struct ublksrv_io_cmd)
+#define	UBLK_U_IO_UNREGISTER_IO_BUF	\
+	_IOWR('u', 0x24, struct ublksrv_io_cmd)
 
 /* only ABORT means that no re-fetch */
 #define UBLK_IO_RES_OK			0
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCHv2 5/6] io_uring: add abstraction for buf_table rsrc data
  2025-02-11  0:56 [PATCHv2 0/6] ublk zero-copy support Keith Busch
                   ` (3 preceding siblings ...)
  2025-02-11  0:56 ` [PATCHv2 4/6] ublk: zc register/unregister bvec Keith Busch
@ 2025-02-11  0:56 ` Keith Busch
  2025-02-11  0:56 ` [PATCHv2 6/6] io_uring: cache nodes and mapped buffers Keith Busch
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2025-02-11  0:56 UTC (permalink / raw)
  To: ming.lei, asml.silence, axboe, linux-block, io-uring; +Cc: bernd, Keith Busch

From: Keith Busch <[email protected]>

We'll need to add more fields specific to the registered buffers, so
make a layer for it now. No functional change in this patch.

Signed-off-by: Keith Busch <[email protected]>
---
 include/linux/io_uring_types.h |  6 +++-
 io_uring/fdinfo.c              |  8 +++---
 io_uring/net.c                 |  2 +-
 io_uring/nop.c                 |  2 +-
 io_uring/register.c            |  2 +-
 io_uring/rsrc.c                | 51 +++++++++++++++++-----------------
 io_uring/rw.c                  |  2 +-
 io_uring/uring_cmd.c           |  2 +-
 8 files changed, 39 insertions(+), 36 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 99aac2d52fbae..4f4b7ad21500d 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -67,6 +67,10 @@ struct io_file_table {
 	unsigned int alloc_hint;
 };
 
+struct io_buf_table {
+	struct io_rsrc_data	data;
+};
+
 struct io_hash_bucket {
 	struct hlist_head	list;
 } ____cacheline_aligned_in_smp;
@@ -291,7 +295,7 @@ struct io_ring_ctx {
 		struct io_wq_work_list	iopoll_list;
 
 		struct io_file_table	file_table;
-		struct io_rsrc_data	buf_table;
+		struct io_buf_table	buf_table;
 
 		struct io_submit_state	submit_state;
 
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index f60d0a9d505e2..d389c06cbce10 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -217,12 +217,12 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *file)
 			seq_puts(m, "\n");
 		}
 	}
-	seq_printf(m, "UserBufs:\t%u\n", ctx->buf_table.nr);
-	for (i = 0; has_lock && i < ctx->buf_table.nr; i++) {
+	seq_printf(m, "UserBufs:\t%u\n", ctx->buf_table.data.nr);
+	for (i = 0; has_lock && i < ctx->buf_table.data.nr; i++) {
 		struct io_mapped_ubuf *buf = NULL;
 
-		if (ctx->buf_table.nodes[i])
-			buf = ctx->buf_table.nodes[i]->buf;
+		if (ctx->buf_table.data.nodes[i])
+			buf = ctx->buf_table.data.nodes[i]->buf;
 		if (buf)
 			seq_printf(m, "%5u: 0x%llx/%u\n", i, buf->ubuf, buf->len);
 		else
diff --git a/io_uring/net.c b/io_uring/net.c
index 280d576e89249..c1020c857333d 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1366,7 +1366,7 @@ static int io_send_zc_import(struct io_kiocb *req, unsigned int issue_flags)
 
 		ret = -EFAULT;
 		io_ring_submit_lock(ctx, issue_flags);
-		node = io_rsrc_node_lookup(&ctx->buf_table, sr->buf_index);
+		node = io_rsrc_node_lookup(&ctx->buf_table.data, sr->buf_index);
 		if (node) {
 			io_req_assign_buf_node(sr->notif, node);
 			ret = 0;
diff --git a/io_uring/nop.c b/io_uring/nop.c
index 5e5196df650a1..e3ebe5f019076 100644
--- a/io_uring/nop.c
+++ b/io_uring/nop.c
@@ -69,7 +69,7 @@ int io_nop(struct io_kiocb *req, unsigned int issue_flags)
 
 		ret = -EFAULT;
 		io_ring_submit_lock(ctx, issue_flags);
-		node = io_rsrc_node_lookup(&ctx->buf_table, nop->buffer);
+		node = io_rsrc_node_lookup(&ctx->buf_table.data, nop->buffer);
 		if (node) {
 			io_req_assign_buf_node(req, node);
 			ret = 0;
diff --git a/io_uring/register.c b/io_uring/register.c
index 9a4d2fbce4aec..fa922b1b26583 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -919,7 +919,7 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
 	ret = __io_uring_register(ctx, opcode, arg, nr_args);
 
 	trace_io_uring_register(ctx, opcode, ctx->file_table.data.nr,
-				ctx->buf_table.nr, ret);
+				ctx->buf_table.data.nr, ret);
 	mutex_unlock(&ctx->uring_lock);
 
 	fput(file);
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 14efec8587888..b3f36f1b2a668 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -232,17 +232,17 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 	__u32 done;
 	int i, err;
 
-	if (!ctx->buf_table.nr)
+	if (!ctx->buf_table.data.nr)
 		return -ENXIO;
-	if (up->offset + nr_args > ctx->buf_table.nr)
+	if (up->offset + nr_args > ctx->buf_table.data.nr)
 		return -EINVAL;
 
 	for (done = 0; done < nr_args; done++) {
 		struct io_rsrc_node *node;
 		u64 tag = 0;
 
-		i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
-		node = io_rsrc_node_lookup(&ctx->buf_table, i);
+		i = array_index_nospec(up->offset + done, ctx->buf_table.data.nr);
+		node = io_rsrc_node_lookup(&ctx->buf_table.data, i);
 		if (node && node->type != IORING_RSRC_BUFFER) {
 			err = -EBUSY;
 			break;
@@ -273,8 +273,8 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 			}
 			node->tag = tag;
 		}
-		io_reset_rsrc_node(ctx, &ctx->buf_table, i);
-		ctx->buf_table.nodes[i] = node;
+		io_reset_rsrc_node(ctx, &ctx->buf_table.data, i);
+		ctx->buf_table.data.nodes[i] = node;
 		if (ctx->compat)
 			user_data += sizeof(struct compat_iovec);
 		else
@@ -555,9 +555,9 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 
 int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 {
-	if (!ctx->buf_table.nr)
+	if (!ctx->buf_table.data.nr)
 		return -ENXIO;
-	io_rsrc_data_free(ctx, &ctx->buf_table);
+	io_rsrc_data_free(ctx, &ctx->buf_table.data);
 	return 0;
 }
 
@@ -584,8 +584,8 @@ static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
 	}
 
 	/* check previously registered pages */
-	for (i = 0; i < ctx->buf_table.nr; i++) {
-		struct io_rsrc_node *node = ctx->buf_table.nodes[i];
+	for (i = 0; i < ctx->buf_table.data.nr; i++) {
+		struct io_rsrc_node *node = ctx->buf_table.data.nodes[i];
 		struct io_mapped_ubuf *imu;
 
 		if (!node)
@@ -811,7 +811,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 
 	BUILD_BUG_ON(IORING_MAX_REG_BUFFERS >= (1u << 16));
 
-	if (ctx->buf_table.nr)
+	if (ctx->buf_table.data.nr)
 		return -EBUSY;
 	if (!nr_args || nr_args > IORING_MAX_REG_BUFFERS)
 		return -EINVAL;
@@ -864,7 +864,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 		data.nodes[i] = node;
 	}
 
-	ctx->buf_table = data;
+	ctx->buf_table.data = data;
 	if (ret)
 		io_sqe_buffers_unregister(ctx);
 	return ret;
@@ -873,7 +873,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
 			    void (*release)(void *), unsigned int index)
 {
-	struct io_rsrc_data *data = &ctx->buf_table;
+	struct io_rsrc_data *data = &ctx->buf_table.data;
 	struct req_iterator rq_iter;
 	struct io_mapped_ubuf *imu;
 	struct io_rsrc_node *node;
@@ -924,7 +924,7 @@ EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
 
 void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index)
 {
-	struct io_rsrc_data *data = &ctx->buf_table;
+	struct io_rsrc_data *data = &ctx->buf_table.data;
 	struct io_rsrc_node *node;
 
 	lockdep_assert_held(&ctx->uring_lock);
@@ -1040,10 +1040,10 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	if (!arg->nr && (arg->dst_off || arg->src_off))
 		return -EINVAL;
 	/* not allowed unless REPLACE is set */
-	if (ctx->buf_table.nr && !(arg->flags & IORING_REGISTER_DST_REPLACE))
+	if (ctx->buf_table.data.nr && !(arg->flags & IORING_REGISTER_DST_REPLACE))
 		return -EBUSY;
 
-	nbufs = src_ctx->buf_table.nr;
+	nbufs = src_ctx->buf_table.data.nr;
 	if (!arg->nr)
 		arg->nr = nbufs;
 	else if (arg->nr > nbufs)
@@ -1053,13 +1053,13 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	if (check_add_overflow(arg->nr, arg->dst_off, &nbufs))
 		return -EOVERFLOW;
 
-	ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.nr));
+	ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.data.nr));
 	if (ret)
 		return ret;
 
 	/* Fill entries in data from dst that won't overlap with src */
-	for (i = 0; i < min(arg->dst_off, ctx->buf_table.nr); i++) {
-		struct io_rsrc_node *src_node = ctx->buf_table.nodes[i];
+	for (i = 0; i < min(arg->dst_off, ctx->buf_table.data.nr); i++) {
+		struct io_rsrc_node *src_node = ctx->buf_table.data.nodes[i];
 
 		if (src_node) {
 			data.nodes[i] = src_node;
@@ -1068,7 +1068,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	}
 
 	ret = -ENXIO;
-	nbufs = src_ctx->buf_table.nr;
+	nbufs = src_ctx->buf_table.data.nr;
 	if (!nbufs)
 		goto out_free;
 	ret = -EINVAL;
@@ -1088,7 +1088,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	while (nr--) {
 		struct io_rsrc_node *dst_node, *src_node;
 
-		src_node = io_rsrc_node_lookup(&src_ctx->buf_table, i);
+		src_node = io_rsrc_node_lookup(&src_ctx->buf_table.data, i);
 		if (!src_node) {
 			dst_node = NULL;
 		} else {
@@ -1110,7 +1110,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	 * old and new nodes at this point.
 	 */
 	if (arg->flags & IORING_REGISTER_DST_REPLACE)
-		io_rsrc_data_free(ctx, &ctx->buf_table);
+		io_sqe_buffers_unregister(ctx);
 
 	/*
 	 * ctx->buf_table must be empty now - either the contents are being
@@ -1118,10 +1118,9 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	 * copied to a ring that does not have buffers yet (checked at function
 	 * entry).
 	 */
-	WARN_ON_ONCE(ctx->buf_table.nr);
-	ctx->buf_table = data;
+	WARN_ON_ONCE(ctx->buf_table.data.nr);
+	ctx->buf_table.data = data;
 	return 0;
-
 out_free:
 	io_rsrc_data_free(ctx, &data);
 	return ret;
@@ -1146,7 +1145,7 @@ int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg)
 		return -EFAULT;
 	if (buf.flags & ~(IORING_REGISTER_SRC_REGISTERED|IORING_REGISTER_DST_REPLACE))
 		return -EINVAL;
-	if (!(buf.flags & IORING_REGISTER_DST_REPLACE) && ctx->buf_table.nr)
+	if (!(buf.flags & IORING_REGISTER_DST_REPLACE) && ctx->buf_table.data.nr)
 		return -EBUSY;
 	if (memchr_inv(buf.pad, 0, sizeof(buf.pad)))
 		return -EINVAL;
diff --git a/io_uring/rw.c b/io_uring/rw.c
index c25e0ab5c996b..38ec32401a558 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -363,7 +363,7 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
 	if (unlikely(ret))
 		return ret;
 
-	node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
+	node = io_rsrc_node_lookup(&ctx->buf_table.data, req->buf_index);
 	if (!node)
 		return -EFAULT;
 	io_req_assign_buf_node(req, node);
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index aebbe2a4c7183..5d9719402b49b 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -206,7 +206,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		struct io_rsrc_node *node;
 		u16 index = READ_ONCE(sqe->buf_index);
 
-		node = io_rsrc_node_lookup(&ctx->buf_table, index);
+		node = io_rsrc_node_lookup(&ctx->buf_table.data, index);
 		if (unlikely(!node))
 			return -EFAULT;
 		/*
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCHv2 6/6] io_uring: cache nodes and mapped buffers
  2025-02-11  0:56 [PATCHv2 0/6] ublk zero-copy support Keith Busch
                   ` (4 preceding siblings ...)
  2025-02-11  0:56 ` [PATCHv2 5/6] io_uring: add abstraction for buf_table rsrc data Keith Busch
@ 2025-02-11  0:56 ` Keith Busch
  2025-02-11 16:47   ` Keith Busch
  2025-02-12  2:29 ` [PATCHv2 0/6] ublk zero-copy support Ming Lei
  2025-02-13 15:12 ` lizetao
  7 siblings, 1 reply; 30+ messages in thread
From: Keith Busch @ 2025-02-11  0:56 UTC (permalink / raw)
  To: ming.lei, asml.silence, axboe, linux-block, io-uring; +Cc: bernd, Keith Busch

From: Keith Busch <[email protected]>

Frequent alloc/free cycles on these is pretty costly. Use an io cache to
more efficiently reuse these buffers.

Signed-off-by: Keith Busch <[email protected]>
---
 include/linux/io_uring_types.h |  18 +++---
 io_uring/filetable.c           |   2 +-
 io_uring/rsrc.c                | 115 +++++++++++++++++++++++++--------
 io_uring/rsrc.h                |   2 +-
 4 files changed, 101 insertions(+), 36 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 4f4b7ad21500d..a6e525b756d10 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -67,8 +67,18 @@ struct io_file_table {
 	unsigned int alloc_hint;
 };
 
+struct io_alloc_cache {
+	void			**entries;
+	unsigned int		nr_cached;
+	unsigned int		max_cached;
+	size_t			elem_size;
+	unsigned int		init_clear;
+};
+
 struct io_buf_table {
 	struct io_rsrc_data	data;
+	struct io_alloc_cache	node_cache;
+	struct io_alloc_cache	imu_cache;
 };
 
 struct io_hash_bucket {
@@ -222,14 +232,6 @@ struct io_submit_state {
 	struct blk_plug		plug;
 };
 
-struct io_alloc_cache {
-	void			**entries;
-	unsigned int		nr_cached;
-	unsigned int		max_cached;
-	unsigned int		elem_size;
-	unsigned int		init_clear;
-};
-
 struct io_ring_ctx {
 	/* const or read-mostly hot data */
 	struct {
diff --git a/io_uring/filetable.c b/io_uring/filetable.c
index dd8eeec97acf6..a21660e3145ab 100644
--- a/io_uring/filetable.c
+++ b/io_uring/filetable.c
@@ -68,7 +68,7 @@ static int io_install_fixed_file(struct io_ring_ctx *ctx, struct file *file,
 	if (slot_index >= ctx->file_table.data.nr)
 		return -EINVAL;
 
-	node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+	node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
 	if (!node)
 		return -ENOMEM;
 
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index b3f36f1b2a668..88a67590c67d4 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -32,6 +32,8 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 #define IORING_MAX_FIXED_FILES	(1U << 20)
 #define IORING_MAX_REG_BUFFERS	(1U << 14)
 
+#define IO_CACHED_BVECS_SEGS	30
+
 int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
 {
 	unsigned long page_limit, cur_pages, new_pages;
@@ -119,19 +121,35 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 	}
 }
 
-struct io_rsrc_node *io_rsrc_node_alloc(int type)
+
+struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type)
 {
 	struct io_rsrc_node *node;
 
-	node = kzalloc(sizeof(*node), GFP_KERNEL);
+	if (type == IORING_RSRC_FILE)
+		node = kmalloc(sizeof(*node), GFP_KERNEL);
+	else
+		node = io_cache_alloc(&ctx->buf_table.node_cache, GFP_KERNEL);
 	if (node) {
 		node->type = type;
 		node->refs = 1;
+		node->tag = 0;
+		node->file_ptr = 0;
+		node->release = NULL;
+		node->priv = NULL;
 	}
 	return node;
 }
 
-__cold void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data)
+static __cold void __io_rsrc_data_free(struct io_rsrc_data *data)
+{
+	kvfree(data->nodes);
+	data->nodes = NULL;
+	data->nr = 0;
+}
+
+__cold void io_rsrc_data_free(struct io_ring_ctx *ctx,
+			      struct io_rsrc_data *data)
 {
 	if (!data->nr)
 		return;
@@ -139,9 +157,7 @@ __cold void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data
 		if (data->nodes[data->nr])
 			io_put_rsrc_node(ctx, data->nodes[data->nr]);
 	}
-	kvfree(data->nodes);
-	data->nodes = NULL;
-	data->nr = 0;
+	__io_rsrc_data_free(data);
 }
 
 __cold int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr)
@@ -155,6 +171,34 @@ __cold int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr)
 	return -ENOMEM;
 }
 
+static __cold int io_rsrc_buffer_alloc(struct io_buf_table *table, unsigned nr)
+{
+	const int imu_cache_size = struct_size_t(struct io_mapped_ubuf, bvec,
+						 IO_CACHED_BVECS_SEGS);
+	int ret;
+
+	BUILD_BUG_ON(imu_cache_size != 512);
+	ret = io_rsrc_data_alloc(&table->data, nr);
+	if (ret)
+		return ret;
+
+	ret = io_alloc_cache_init(&table->node_cache, nr,
+				  sizeof(struct io_rsrc_node), 0);
+	if (ret)
+		goto out_1;
+
+	ret = io_alloc_cache_init(&table->imu_cache, nr, imu_cache_size, 0);
+	if (ret)
+		goto out_2;
+
+	return 0;
+out_2:
+	io_alloc_cache_free(&table->node_cache, kfree);
+out_1:
+	__io_rsrc_data_free(&table->data);
+	return ret;
+}
+
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 				 struct io_uring_rsrc_update2 *up,
 				 unsigned nr_args)
@@ -204,7 +248,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 				err = -EBADF;
 				break;
 			}
-			node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+			node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
 			if (!node) {
 				err = -ENOMEM;
 				fput(file);
@@ -465,6 +509,8 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 			io_buffer_unmap(ctx, node);
 		if (node->release)
 			node->release(node->priv);
+		if (io_alloc_cache_put(&ctx->buf_table.node_cache, node))
+			return;
 		break;
 	default:
 		WARN_ON_ONCE(1);
@@ -533,7 +579,7 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 			goto fail;
 		}
 		ret = -ENOMEM;
-		node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+		node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
 		if (!node) {
 			fput(file);
 			goto fail;
@@ -553,11 +599,19 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
+static void io_rsrc_buffer_free(struct io_ring_ctx *ctx,
+				struct io_buf_table *table)
+{
+	io_rsrc_data_free(ctx, &table->data);
+	io_alloc_cache_free(&table->node_cache, kfree);
+	io_alloc_cache_free(&table->imu_cache, kfree);
+}
+
 int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 {
 	if (!ctx->buf_table.data.nr)
 		return -ENXIO;
-	io_rsrc_data_free(ctx, &ctx->buf_table.data);
+	io_rsrc_buffer_free(ctx, &ctx->buf_table);
 	return 0;
 }
 
@@ -722,6 +776,15 @@ bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
 	return true;
 }
 
+static struct io_mapped_ubuf *io_alloc_imu(struct io_ring_ctx *ctx,
+					   int nr_bvecs)
+{
+	if (nr_bvecs <= IO_CACHED_BVECS_SEGS)
+		return io_cache_alloc(&ctx->buf_table.imu_cache, GFP_KERNEL);
+	return kvmalloc(struct_size_t(struct io_mapped_ubuf, bvec, nr_bvecs),
+			GFP_KERNEL);
+}
+
 static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 						   struct iovec *iov,
 						   struct page **last_hpage)
@@ -738,7 +801,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 	if (!iov->iov_base)
 		return NULL;
 
-	node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+	node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
 	if (!node)
 		return ERR_PTR(-ENOMEM);
 	node->buf = NULL;
@@ -758,7 +821,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 			coalesced = io_coalesce_buffer(&pages, &nr_pages, &data);
 	}
 
-	imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
+	imu = io_alloc_imu(ctx, nr_pages);
 	if (!imu)
 		goto done;
 
@@ -804,9 +867,9 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			    unsigned int nr_args, u64 __user *tags)
 {
 	struct page *last_hpage = NULL;
-	struct io_rsrc_data data;
 	struct iovec fast_iov, *iov = &fast_iov;
 	const struct iovec __user *uvec;
+	struct io_buf_table table;
 	int i, ret;
 
 	BUILD_BUG_ON(IORING_MAX_REG_BUFFERS >= (1u << 16));
@@ -815,13 +878,14 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 		return -EBUSY;
 	if (!nr_args || nr_args > IORING_MAX_REG_BUFFERS)
 		return -EINVAL;
-	ret = io_rsrc_data_alloc(&data, nr_args);
+	ret = io_rsrc_buffer_alloc(&table, nr_args);
 	if (ret)
 		return ret;
 
 	if (!arg)
 		memset(iov, 0, sizeof(*iov));
 
+	ctx->buf_table = table;
 	for (i = 0; i < nr_args; i++) {
 		struct io_rsrc_node *node;
 		u64 tag = 0;
@@ -861,10 +925,8 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			}
 			node->tag = tag;
 		}
-		data.nodes[i] = node;
+		table.data.nodes[i] = node;
 	}
-
-	ctx->buf_table.data = data;
 	if (ret)
 		io_sqe_buffers_unregister(ctx);
 	return ret;
@@ -892,7 +954,7 @@ int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
 	if (node)
 		return -EBUSY;
 
-	node = io_rsrc_node_alloc(IORING_RSRC_KBUFFER);
+	node = io_rsrc_node_alloc(ctx, IORING_RSRC_KBUFFER);
 	if (!node)
 		return -ENOMEM;
 
@@ -900,7 +962,8 @@ int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
 	node->priv = rq;
 
 	nr_bvecs = blk_rq_nr_phys_segments(rq);
-	imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
+
+	imu = io_alloc_imu(ctx, nr_bvecs);
 	if (!imu) {
 		kfree(node);
 		return -ENOMEM;
@@ -1022,7 +1085,7 @@ static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *ctx2)
 static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx,
 			    struct io_uring_clone_buffers *arg)
 {
-	struct io_rsrc_data data;
+	struct io_buf_table table;
 	int i, ret, off, nr;
 	unsigned int nbufs;
 
@@ -1053,7 +1116,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	if (check_add_overflow(arg->nr, arg->dst_off, &nbufs))
 		return -EOVERFLOW;
 
-	ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.data.nr));
+	ret = io_rsrc_buffer_alloc(&table, max(nbufs, ctx->buf_table.data.nr));
 	if (ret)
 		return ret;
 
@@ -1062,7 +1125,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 		struct io_rsrc_node *src_node = ctx->buf_table.data.nodes[i];
 
 		if (src_node) {
-			data.nodes[i] = src_node;
+			table.data.nodes[i] = src_node;
 			src_node->refs++;
 		}
 	}
@@ -1092,7 +1155,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 		if (!src_node) {
 			dst_node = NULL;
 		} else {
-			dst_node = io_rsrc_node_alloc(src_node->type);
+			dst_node = io_rsrc_node_alloc(ctx, src_node->type);
 			if (!dst_node) {
 				ret = -ENOMEM;
 				goto out_free;
@@ -1101,12 +1164,12 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 			refcount_inc(&src_node->buf->refs);
 			dst_node->buf = src_node->buf;
 		}
-		data.nodes[off++] = dst_node;
+		table.data.nodes[off++] = dst_node;
 		i++;
 	}
 
 	/*
-	 * If asked for replace, put the old table. data->nodes[] holds both
+	 * If asked for replace, put the old table. table.data->nodes[] holds both
 	 * old and new nodes at this point.
 	 */
 	if (arg->flags & IORING_REGISTER_DST_REPLACE)
@@ -1119,10 +1182,10 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	 * entry).
 	 */
 	WARN_ON_ONCE(ctx->buf_table.data.nr);
-	ctx->buf_table.data = data;
+	ctx->buf_table = table;
 	return 0;
 out_free:
-	io_rsrc_data_free(ctx, &data);
+	io_rsrc_buffer_free(ctx, &table);
 	return ret;
 }
 
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 8147dfc26f737..751db2ce9affb 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -49,7 +49,7 @@ struct io_imu_folio_data {
 	unsigned int	nr_folios;
 };
 
-struct io_rsrc_node *io_rsrc_node_alloc(int type);
+struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type);
 void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node);
 void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data);
 int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 6/6] io_uring: cache nodes and mapped buffers
  2025-02-11  0:56 ` [PATCHv2 6/6] io_uring: cache nodes and mapped buffers Keith Busch
@ 2025-02-11 16:47   ` Keith Busch
  0 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2025-02-11 16:47 UTC (permalink / raw)
  To: Keith Busch; +Cc: ming.lei, asml.silence, axboe, linux-block, io-uring, bernd

On Mon, Feb 10, 2025 at 04:56:46PM -0800, Keith Busch wrote:
> +static __cold int io_rsrc_buffer_alloc(struct io_buf_table *table, unsigned nr)
> +{
> +	const int imu_cache_size = struct_size_t(struct io_mapped_ubuf, bvec,
> +						 IO_CACHED_BVECS_SEGS);
> +	int ret;
> +
> +	BUILD_BUG_ON(imu_cache_size != 512);

This isn't accurate for 32-bit machines.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 0/6] ublk zero-copy support
  2025-02-11  0:56 [PATCHv2 0/6] ublk zero-copy support Keith Busch
                   ` (5 preceding siblings ...)
  2025-02-11  0:56 ` [PATCHv2 6/6] io_uring: cache nodes and mapped buffers Keith Busch
@ 2025-02-12  2:29 ` Ming Lei
  2025-02-12 15:28   ` Keith Busch
  2025-02-13 15:12 ` lizetao
  7 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-02-12  2:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: asml.silence, axboe, linux-block, io-uring, bernd, Keith Busch

On Mon, Feb 10, 2025 at 04:56:40PM -0800, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> Previous version was discussed here:
> 
>   https://lore.kernel.org/linux-block/[email protected]/
> 
> The same ublksrv reference code in that link was used to test the kernel
> side changes.
> 
> Before listing what has changed, I want to mention what is the same: the
> reliance on the ring ctx lock to serialize the register ahead of any
> use. I'm not ignoring the feedback; I just don't have a solid answer
> right now, and want to progress on the other fronts in the meantime.

It is explained in the following links:

https://lore.kernel.org/linux-block/[email protected]/

- node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
  in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
  ->issue() when the two are submitted in same io_uring_enter(), so you
  need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's ->prep()
  to ->issue() first.

- secondly, ->issue() order is only respected by IO_LINK, and io_uring
  can't provide such guarantee without using IO_LINK:

  Pavel explained it in the following link:

  https://lore.kernel.org/linux-block/[email protected]/

  There are also other examples, such as, register buffer stays in one
  link chain, and the consumer OP isn't in this chain, the consumer OP
  can still be issued before issuing register_buffer.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 4/6] ublk: zc register/unregister bvec
  2025-02-11  0:56 ` [PATCHv2 4/6] ublk: zc register/unregister bvec Keith Busch
@ 2025-02-12  2:49   ` Ming Lei
  2025-02-12  4:11     ` Keith Busch
  2025-02-13  2:12   ` Pavel Begunkov
  1 sibling, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-02-12  2:49 UTC (permalink / raw)
  To: Keith Busch
  Cc: asml.silence, axboe, linux-block, io-uring, bernd, Keith Busch

On Mon, Feb 10, 2025 at 04:56:44PM -0800, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> Provide new operations for the user to request mapping an active request
> to an io uring instance's buf_table. The user has to provide the index
> it wants to install the buffer.
> 
> A reference count is taken on the request to ensure it can't be
> completed while it is active in a ring's buf_table.
> 
> Signed-off-by: Keith Busch <[email protected]>
> ---
>  drivers/block/ublk_drv.c      | 145 +++++++++++++++++++++++++---------
>  include/uapi/linux/ublk_cmd.h |   4 +
>  2 files changed, 113 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 529085181f355..ccfda7b2c24da 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -51,6 +51,9 @@
>  /* private ioctl command mirror */
>  #define UBLK_CMD_DEL_DEV_ASYNC	_IOC_NR(UBLK_U_CMD_DEL_DEV_ASYNC)
>  
> +#define UBLK_IO_REGISTER_IO_BUF		_IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
> +#define UBLK_IO_UNREGISTER_IO_BUF	_IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)

UBLK_IO_REGISTER_IO_BUF command may be completed, and buffer isn't used
by RW_FIXED yet in the following cases:

- application doesn't submit any RW_FIXED consumer OP

- io_uring_enter() only issued UBLK_IO_REGISTER_IO_BUF, and the other
  OPs can't be issued because of out of resource 

...

Then io_uring_enter() returns, and the application is panic or killed,
how to avoid buffer leak?

It need to deal with in io_uring cancel code for calling ->release() if
the kbuffer node isn't released.

UBLK_IO_UNREGISTER_IO_BUF still need to call ->release() if the node
buffer isn't used.

> +
>  /* All UBLK_F_* have to be included into UBLK_F_ALL */
>  #define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \
>  		| UBLK_F_URING_CMD_COMP_IN_TASK \
> @@ -76,6 +79,9 @@ struct ublk_rq_data {
>  	struct llist_node node;
>  
>  	struct kref ref;
> +
> +#define UBLK_ZC_REGISTERED 0
> +	unsigned long flags;
>  };
>  
>  struct ublk_uring_cmd_pdu {
> @@ -201,7 +207,7 @@ static inline struct ublksrv_io_desc *ublk_get_iod(struct ublk_queue *ubq,
>  						   int tag);
>  static inline bool ublk_dev_is_user_copy(const struct ublk_device *ub)
>  {
> -	return ub->dev_info.flags & UBLK_F_USER_COPY;
> +	return ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
>  }
>  
>  static inline bool ublk_dev_is_zoned(const struct ublk_device *ub)
> @@ -581,7 +587,7 @@ static void ublk_apply_params(struct ublk_device *ub)
>  
>  static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
>  {
> -	return ubq->flags & UBLK_F_USER_COPY;
> +	return ubq->flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
>  }
>  
>  static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
> @@ -1747,6 +1753,102 @@ static inline void ublk_prep_cancel(struct io_uring_cmd *cmd,
>  	io_uring_cmd_mark_cancelable(cmd, issue_flags);
>  }
>  
> +static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> +		struct ublk_queue *ubq, int tag, size_t offset)
> +{
> +	struct request *req;
> +
> +	if (!ublk_need_req_ref(ubq))
> +		return NULL;
> +
> +	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> +	if (!req)
> +		return NULL;
> +
> +	if (!ublk_get_req_ref(ubq, req))
> +		return NULL;
> +
> +	if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
> +		goto fail_put;
> +
> +	if (!ublk_rq_has_data(req))
> +		goto fail_put;
> +
> +	if (offset > blk_rq_bytes(req))
> +		goto fail_put;
> +
> +	return req;
> +fail_put:
> +	ublk_put_req_ref(ubq, req);
> +	return NULL;
> +}
> +
> +static void ublk_io_release(void *priv)
> +{
> +	struct request *rq = priv;
> +	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> +
> +	ublk_put_req_ref(ubq, rq);
> +}

It isn't enough to just get & put request reference here between registering
buffer and freeing the registered node buf, because the same reference can be
dropped from ublk_commit_completion() which is from queueing
UBLK_IO_COMMIT_AND_FETCH_REQ, and buggy app may queue this command multiple
times for freeing the request.

One solution is to not allow request completion until the ->release() is
returned.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 4/6] ublk: zc register/unregister bvec
  2025-02-12  2:49   ` Ming Lei
@ 2025-02-12  4:11     ` Keith Busch
  2025-02-12  9:24       ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Keith Busch @ 2025-02-12  4:11 UTC (permalink / raw)
  To: Ming Lei; +Cc: Keith Busch, asml.silence, axboe, linux-block, io-uring, bernd

On Wed, Feb 12, 2025 at 10:49:15AM +0800, Ming Lei wrote:
> On Mon, Feb 10, 2025 at 04:56:44PM -0800, Keith Busch wrote:
> > From: Keith Busch <[email protected]>
> > 
> > Provide new operations for the user to request mapping an active request
> > to an io uring instance's buf_table. The user has to provide the index
> > it wants to install the buffer.
> > 
> > A reference count is taken on the request to ensure it can't be
> > completed while it is active in a ring's buf_table.
> > 
> > Signed-off-by: Keith Busch <[email protected]>
> > ---
> >  drivers/block/ublk_drv.c      | 145 +++++++++++++++++++++++++---------
> >  include/uapi/linux/ublk_cmd.h |   4 +
> >  2 files changed, 113 insertions(+), 36 deletions(-)
> > 
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 529085181f355..ccfda7b2c24da 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -51,6 +51,9 @@
> >  /* private ioctl command mirror */
> >  #define UBLK_CMD_DEL_DEV_ASYNC	_IOC_NR(UBLK_U_CMD_DEL_DEV_ASYNC)
> >  
> > +#define UBLK_IO_REGISTER_IO_BUF		_IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
> > +#define UBLK_IO_UNREGISTER_IO_BUF	_IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)
> 
> UBLK_IO_REGISTER_IO_BUF command may be completed, and buffer isn't used
> by RW_FIXED yet in the following cases:
> 
> - application doesn't submit any RW_FIXED consumer OP
> 
> - io_uring_enter() only issued UBLK_IO_REGISTER_IO_BUF, and the other
>   OPs can't be issued because of out of resource 
> 
> ...
> 
> Then io_uring_enter() returns, and the application is panic or killed,
> how to avoid buffer leak?

The death of the uring that registered the node tears down the table
that it's registered with, which releases its reference. All good.
 
> It need to deal with in io_uring cancel code for calling ->release() if
> the kbuffer node isn't released.

There should be no situation here where it isn't released after its use
is completed. Either the resource was gracefully unregistered or the
ring close while it was still active, but either one drops its
reference.

> UBLK_IO_UNREGISTER_IO_BUF still need to call ->release() if the node
> buffer isn't used.

Only once the last reference is dropped. Which should happen no matter
which way the node is freed.

> > +static void ublk_io_release(void *priv)
> > +{
> > +	struct request *rq = priv;
> > +	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> > +
> > +	ublk_put_req_ref(ubq, rq);
> > +}
> 
> It isn't enough to just get & put request reference here between registering
> buffer and freeing the registered node buf, because the same reference can be
> dropped from ublk_commit_completion() which is from queueing
> UBLK_IO_COMMIT_AND_FETCH_REQ, and buggy app may queue this command multiple
> times for freeing the request.
> 
> One solution is to not allow request completion until the ->release() is
> returned.

Double completions are tricky because the same request id can be reused
pretty quickly and there's no immediate way to tell if the 2nd
completion is a double or a genuine completion of the reused request.

We have rotating sequence numbers in the nvme driver to try to detect a
similar situation. So far it hasn't revealed any real bugs as far as I
know. This feels like the other side screwed up and that's their fault.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 4/6] ublk: zc register/unregister bvec
  2025-02-12  4:11     ` Keith Busch
@ 2025-02-12  9:24       ` Ming Lei
  2025-02-12 14:59         ` Keith Busch
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-02-12  9:24 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, asml.silence, axboe, linux-block, io-uring, bernd

On Tue, Feb 11, 2025 at 09:11:10PM -0700, Keith Busch wrote:
> On Wed, Feb 12, 2025 at 10:49:15AM +0800, Ming Lei wrote:
> > On Mon, Feb 10, 2025 at 04:56:44PM -0800, Keith Busch wrote:
> > > From: Keith Busch <[email protected]>
> > > 
> > > Provide new operations for the user to request mapping an active request
> > > to an io uring instance's buf_table. The user has to provide the index
> > > it wants to install the buffer.
> > > 
> > > A reference count is taken on the request to ensure it can't be
> > > completed while it is active in a ring's buf_table.
> > > 
> > > Signed-off-by: Keith Busch <[email protected]>
> > > ---
> > >  drivers/block/ublk_drv.c      | 145 +++++++++++++++++++++++++---------
> > >  include/uapi/linux/ublk_cmd.h |   4 +
> > >  2 files changed, 113 insertions(+), 36 deletions(-)
> > > 
> > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > index 529085181f355..ccfda7b2c24da 100644
> > > --- a/drivers/block/ublk_drv.c
> > > +++ b/drivers/block/ublk_drv.c
> > > @@ -51,6 +51,9 @@
> > >  /* private ioctl command mirror */
> > >  #define UBLK_CMD_DEL_DEV_ASYNC	_IOC_NR(UBLK_U_CMD_DEL_DEV_ASYNC)
> > >  
> > > +#define UBLK_IO_REGISTER_IO_BUF		_IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
> > > +#define UBLK_IO_UNREGISTER_IO_BUF	_IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)
> > 
> > UBLK_IO_REGISTER_IO_BUF command may be completed, and buffer isn't used
> > by RW_FIXED yet in the following cases:
> > 
> > - application doesn't submit any RW_FIXED consumer OP
> > 
> > - io_uring_enter() only issued UBLK_IO_REGISTER_IO_BUF, and the other
> >   OPs can't be issued because of out of resource 
> > 
> > ...
> > 
> > Then io_uring_enter() returns, and the application is panic or killed,
> > how to avoid buffer leak?
> 
> The death of the uring that registered the node tears down the table
> that it's registered with, which releases its reference. All good.

OK, looks I miss the point.

io_sqe_buffers_unregister() is called from io_ring_ctx_free(), when the
registered buffer can be released.

However, it still may cause use-after-free on this request which has
been failed from io_uring_try_cancel_uring_cmd(), and please see the
following code path:

io_uring_try_cancel_requests
	io_uring_try_cancel_uring_cmd
		ublk_uring_cmd_cancel_fn
			ublk_abort_requests
				ublk_abort_queue
					__ublk_fail_req
						ublk_put_req_ref

The above race needs to be covered.

>  
> > It need to deal with in io_uring cancel code for calling ->release() if
> > the kbuffer node isn't released.
> 
> There should be no situation here where it isn't released after its use
> is completed. Either the resource was gracefully unregistered or the
> ring close while it was still active, but either one drops its
> reference.
> 
> > UBLK_IO_UNREGISTER_IO_BUF still need to call ->release() if the node
> > buffer isn't used.
> 
> Only once the last reference is dropped. Which should happen no matter
> which way the node is freed.
> 
> > > +static void ublk_io_release(void *priv)
> > > +{
> > > +	struct request *rq = priv;
> > > +	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> > > +
> > > +	ublk_put_req_ref(ubq, rq);
> > > +}
> > 
> > It isn't enough to just get & put request reference here between registering
> > buffer and freeing the registered node buf, because the same reference can be
> > dropped from ublk_commit_completion() which is from queueing
> > UBLK_IO_COMMIT_AND_FETCH_REQ, and buggy app may queue this command multiple
> > times for freeing the request.
> > 
> > One solution is to not allow request completion until the ->release() is
> > returned.
> 
> Double completions are tricky because the same request id can be reused
> pretty quickly and there's no immediate way to tell if the 2nd
> completion is a double or a genuine completion of the reused request.
> 
> We have rotating sequence numbers in the nvme driver to try to detect a
> similar situation. So far it hasn't revealed any real bugs as far as I
> know. This feels like the other side screwed up and that's their fault.

Not same with nvme, in which nvme controller won't run DMA on this buffer
after the 1st completion.

The ublk request buffer has been leased to io_uring for running read_fixed/write_fixed,
meantime it is freed and reused by kernel for other purpose.

As I mentioned, it can be solved by not allowing to complete the IO
command if the buffer is leased to io_uring.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 4/6] ublk: zc register/unregister bvec
  2025-02-12  9:24       ` Ming Lei
@ 2025-02-12 14:59         ` Keith Busch
  0 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2025-02-12 14:59 UTC (permalink / raw)
  To: Ming Lei; +Cc: Keith Busch, asml.silence, axboe, linux-block, io-uring, bernd

On Wed, Feb 12, 2025 at 05:24:34PM +0800, Ming Lei wrote:
> On Tue, Feb 11, 2025 at 09:11:10PM -0700, Keith Busch wrote:
> 
> However, it still may cause use-after-free on this request which has
> been failed from io_uring_try_cancel_uring_cmd(), and please see the
> following code path:
> 
> io_uring_try_cancel_requests
> 	io_uring_try_cancel_uring_cmd
> 		ublk_uring_cmd_cancel_fn
> 			ublk_abort_requests
> 				ublk_abort_queue
> 					__ublk_fail_req
> 						ublk_put_req_ref
> 
> The above race needs to be covered.

This race is covered. The ublk request has one reference taken when it
notifies userspace, then a second reference taken when user registers
the bvec.

The first reference is dropped from the abort, but the request won't be
completed because the second reference prevents that. That second
request reference can't be dropped until the ->release callback happens,
and that can't happen until two conditions are met:

  The bvec is unregistered
  All IO using the index completes

I think all the bases are covered here.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 0/6] ublk zero-copy support
  2025-02-12  2:29 ` [PATCHv2 0/6] ublk zero-copy support Ming Lei
@ 2025-02-12 15:28   ` Keith Busch
  2025-02-12 16:06     ` Pavel Begunkov
  0 siblings, 1 reply; 30+ messages in thread
From: Keith Busch @ 2025-02-12 15:28 UTC (permalink / raw)
  To: Ming Lei; +Cc: Keith Busch, asml.silence, axboe, linux-block, io-uring, bernd

On Wed, Feb 12, 2025 at 10:29:32AM +0800, Ming Lei wrote:
> It is explained in the following links:
> 
> https://lore.kernel.org/linux-block/[email protected]/
> 
> - node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
>   in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
>   ->issue() when the two are submitted in same io_uring_enter(), so you
>   need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's ->prep()
>   to ->issue() first.

I don't think that's accurate, at least in practice. In a normal flow,
we'll have this sequence:

 io_submit_sqes
   io_submit_sqe (uring_cmd ublk register)
     io_init_req
       ->prep()
     io_queue_sqe
       ->issue()
   io_submit_sqe (read/write_fixed)
     io_init_req
       ->prep()
     io_queue_sqe
      ->issue()

The first SQE is handled in its entirety before even looking at the
subsequent SQE. Since the register is first, then the read/write_fixed's
prep will have a valid index. Testing this patch series appears to show
this reliably works.
 
> - secondly, ->issue() order is only respected by IO_LINK, and io_uring
>   can't provide such guarantee without using IO_LINK:
> 
>   Pavel explained it in the following link:
> 
>   https://lore.kernel.org/linux-block/[email protected]/
> 
>   There are also other examples, such as, register buffer stays in one
>   link chain, and the consumer OP isn't in this chain, the consumer OP
>   can still be issued before issuing register_buffer.

Yep, I got that. Linking is just something I was hoping to avoid. I
understand there are conditions that can break the normal flow I'm
relying on regarding  the ordering. This hasn't appeared to be a problem
in practice, but I agree this needs to be handled.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 0/6] ublk zero-copy support
  2025-02-12 15:28   ` Keith Busch
@ 2025-02-12 16:06     ` Pavel Begunkov
  2025-02-13  1:52       ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2025-02-12 16:06 UTC (permalink / raw)
  To: Keith Busch, Ming Lei; +Cc: Keith Busch, axboe, linux-block, io-uring, bernd

On 2/12/25 15:28, Keith Busch wrote:
> On Wed, Feb 12, 2025 at 10:29:32AM +0800, Ming Lei wrote:
>> It is explained in the following links:
>>
>> https://lore.kernel.org/linux-block/[email protected]/
>>
>> - node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
>>    in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
>>    ->issue() when the two are submitted in same io_uring_enter(), so you
>>    need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's ->prep()
>>    to ->issue() first.
> 
> I don't think that's accurate, at least in practice. In a normal flow,
> we'll have this sequence:
> 
>   io_submit_sqes
>     io_submit_sqe (uring_cmd ublk register)
>       io_init_req
>         ->prep()
>       io_queue_sqe
>         ->issue()
>     io_submit_sqe (read/write_fixed)
>       io_init_req
>         ->prep()
>       io_queue_sqe
>        ->issue()
> 
> The first SQE is handled in its entirety before even looking at the
> subsequent SQE. Since the register is first, then the read/write_fixed's
> prep will have a valid index. Testing this patch series appears to show
> this reliably works.

Ming describes how it works for links. This one is indeed how
non links are normally executed. Though I'd repeat it's an
implementation detail and not a part of the uapi. Interestingly,
Keith, you sent some patches changing the ordering here quite a
while ago, just as an example of how it can change.


>> - secondly, ->issue() order is only respected by IO_LINK, and io_uring
>>    can't provide such guarantee without using IO_LINK:
>>
>>    Pavel explained it in the following link:
>>
>>    https://lore.kernel.org/linux-block/[email protected]/
>>
>>    There are also other examples, such as, register buffer stays in one
>>    link chain, and the consumer OP isn't in this chain, the consumer OP
>>    can still be issued before issuing register_buffer.
> 
> Yep, I got that. Linking is just something I was hoping to avoid. I
> understand there are conditions that can break the normal flow I'm
> relying on regarding  the ordering. This hasn't appeared to be a problem
> in practice, but I agree this needs to be handled.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 2/6] io_uring: create resource release callback
  2025-02-11  0:56 ` [PATCHv2 2/6] io_uring: create resource release callback Keith Busch
@ 2025-02-13  1:31   ` Pavel Begunkov
  2025-02-13  1:58     ` Keith Busch
  0 siblings, 1 reply; 30+ messages in thread
From: Pavel Begunkov @ 2025-02-13  1:31 UTC (permalink / raw)
  To: Keith Busch, ming.lei, axboe, linux-block, io-uring; +Cc: bernd, Keith Busch

On 2/11/25 00:56, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> When a registered resource is about to be freed, check if it has
> registered a release function, and call it if so. This is preparing for
> resources that are related to an external component.
> 
> Signed-off-by: Keith Busch <[email protected]>
> ---
>   io_uring/rsrc.c | 2 ++
>   io_uring/rsrc.h | 3 +++
>   2 files changed, 5 insertions(+)
> 
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 4d0e1c06c8bc6..30f08cf13ef60 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -455,6 +455,8 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
>   	case IORING_RSRC_BUFFER:
>   		if (node->buf)
>   			io_buffer_unmap(ctx, node);
> +		if (node->release)
> +			node->release(node->priv);
>   		break;
>   	default:
>   		WARN_ON_ONCE(1);
> diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
> index abd0d5d42c3e1..a3826ab84e666 100644
> --- a/io_uring/rsrc.h
> +++ b/io_uring/rsrc.h
> @@ -24,6 +24,9 @@ struct io_rsrc_node {
>   		unsigned long file_ptr;
>   		struct io_mapped_ubuf *buf;
>   	};
> +
> +	void (*release)(void *);
> +	void *priv;

Nodes are more generic than buffers. Unless we want to reuse it
for files, and I very much doubt that, it should move into
io_mapped_ubuf.

>   };
>   
>   struct io_mapped_ubuf {

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 3/6] io_uring: add support for kernel registered bvecs
  2025-02-11  0:56 ` [PATCHv2 3/6] io_uring: add support for kernel registered bvecs Keith Busch
@ 2025-02-13  1:33   ` Pavel Begunkov
  2025-02-14  3:30   ` Ming Lei
  1 sibling, 0 replies; 30+ messages in thread
From: Pavel Begunkov @ 2025-02-13  1:33 UTC (permalink / raw)
  To: Keith Busch, ming.lei, axboe, linux-block, io-uring; +Cc: bernd, Keith Busch

On 2/11/25 00:56, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> Provide an interface for the kernel to leverage the existing
> pre-registered buffers that io_uring provides. User space can reference
> these later to achieve zero-copy IO.
> 
> User space must register an empty fixed buffer table with io_uring in
> order for the kernel to make use of it.
> 
> Signed-off-by: Keith Busch <[email protected]>
> ---
>   include/linux/io_uring.h       |   1 +
>   include/linux/io_uring_types.h |   4 ++
>   io_uring/rsrc.c                | 100 +++++++++++++++++++++++++++++++--
>   io_uring/rsrc.h                |   1 +
>   4 files changed, 100 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> index 85fe4e6b275c7..b5637a2aae340 100644
> --- a/include/linux/io_uring.h
> +++ b/include/linux/io_uring.h
> @@ -5,6 +5,7 @@
>   #include <linux/sched.h>
>   #include <linux/xarray.h>
>   #include <uapi/linux/io_uring.h>
> +#include <linux/blk-mq.h>
>   
>   #if defined(CONFIG_IO_URING)
>   void __io_uring_cancel(bool cancel_all);
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index e2fef264ff8b8..99aac2d52fbae 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -693,4 +693,8 @@ static inline bool io_ctx_cqe32(struct io_ring_ctx *ctx)
>   	return ctx->flags & IORING_SETUP_CQE32;
>   }
>   
> +int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
> +			    void (*release)(void *), unsigned int index);
> +void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int tag);
> +
>   #endif
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 30f08cf13ef60..14efec8587888 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -110,8 +110,9 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
>   
>   		if (!refcount_dec_and_test(&imu->refs))
>   			return;
> -		for (i = 0; i < imu->nr_bvecs; i++)
> -			unpin_user_page(imu->bvec[i].bv_page);
> +		if (node->type == IORING_RSRC_BUFFER)
> +			for (i = 0; i < imu->nr_bvecs; i++)
> +				unpin_user_page(imu->bvec[i].bv_page);
>   		if (imu->acct_pages)
>   			io_unaccount_mem(ctx, imu->acct_pages);
>   		kvfree(imu);
> @@ -240,6 +241,13 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
>   		struct io_rsrc_node *node;
>   		u64 tag = 0;
>   
> +		i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
> +		node = io_rsrc_node_lookup(&ctx->buf_table, i);
> +		if (node && node->type != IORING_RSRC_BUFFER) {

Please drop the special rule. It's handled, so if the user wants
to do it, it can be allowed in a generic way.

> +			err = -EBUSY;
> +			break;
> +		}
> +
...
> +int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
> +			    void (*release)(void *), unsigned int index)
> +{
> +	struct io_rsrc_data *data = &ctx->buf_table;
> +	struct req_iterator rq_iter;
> +	struct io_mapped_ubuf *imu;
> +	struct io_rsrc_node *node;
> +	struct bio_vec bv;
> +	u16 nr_bvecs;
> +	int i = 0;
> +
> +	lockdep_assert_held(&ctx->uring_lock);
> +
> +	if (!data->nr)
> +		return -EINVAL;
> +	if (index >= data->nr)
> +		return -EINVAL;

array_index_nospec

> +
> +	node = data->nodes[index];
> +	if (node)
> +		return -EBUSY;
...> +void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index)
> +{
> +	struct io_rsrc_data *data = &ctx->buf_table;
> +	struct io_rsrc_node *node;
> +
> +	lockdep_assert_held(&ctx->uring_lock);
> +
> +	if (!data->nr)
> +		return;
> +	if (index >= data->nr)
> +		return;

array_index_nospec

> +
> +	node = data->nodes[index];
> +	if (!node || !node->buf)

How can node->buf be NULL?

> +		return;
> +	if (node->type != IORING_RSRC_KBUFFER)
> +		return;
> +	io_reset_rsrc_node(ctx, data, index);
> +}
> +EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
...> diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
> index a3826ab84e666..8147dfc26f737 100644
> --- a/io_uring/rsrc.h
> +++ b/io_uring/rsrc.h
> @@ -13,6 +13,7 @@
>   enum {
>   	IORING_RSRC_FILE		= 0,
>   	IORING_RSRC_BUFFER		= 1,
> +	IORING_RSRC_KBUFFER		= 2,

nit: you don't even need it, just check for presence of the
callback in io_buffer_unmap()

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 0/6] ublk zero-copy support
  2025-02-12 16:06     ` Pavel Begunkov
@ 2025-02-13  1:52       ` Ming Lei
  0 siblings, 0 replies; 30+ messages in thread
From: Ming Lei @ 2025-02-13  1:52 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Keith Busch, Keith Busch, axboe, linux-block, io-uring, bernd

On Wed, Feb 12, 2025 at 04:06:58PM +0000, Pavel Begunkov wrote:
> On 2/12/25 15:28, Keith Busch wrote:
> > On Wed, Feb 12, 2025 at 10:29:32AM +0800, Ming Lei wrote:
> > > It is explained in the following links:
> > > 
> > > https://lore.kernel.org/linux-block/[email protected]/
> > > 
> > > - node kbuffer is registered in ublk uring_cmd's ->issue(), but lookup
> > >    in RW_FIXED OP's ->prep(), and ->prep() is always called before calling
> > >    ->issue() when the two are submitted in same io_uring_enter(), so you
> > >    need to move io_rsrc_node_lookup() & buffer importing from RW_FIXED's ->prep()
> > >    to ->issue() first.
> > 
> > I don't think that's accurate, at least in practice. In a normal flow,
> > we'll have this sequence:
> > 
> >   io_submit_sqes
> >     io_submit_sqe (uring_cmd ublk register)
> >       io_init_req
> >         ->prep()
> >       io_queue_sqe
> >         ->issue()
> >     io_submit_sqe (read/write_fixed)
> >       io_init_req
> >         ->prep()
> >       io_queue_sqe
> >        ->issue()
> > 
> > The first SQE is handled in its entirety before even looking at the
> > subsequent SQE. Since the register is first, then the read/write_fixed's
> > prep will have a valid index. Testing this patch series appears to show
> > this reliably works.
> 
> Ming describes how it works for links. This one is indeed how
> non links are normally executed. Though I'd repeat it's an
> implementation detail and not a part of the uapi. Interestingly,
> Keith, you sent some patches changing the ordering here quite a
> while ago, just as an example of how it can change.

My fault, I should have provided the link or async background.

> 
> 
> > > - secondly, ->issue() order is only respected by IO_LINK, and io_uring
> > >    can't provide such guarantee without using IO_LINK:
> > > 
> > >    Pavel explained it in the following link:
> > > 
> > >    https://lore.kernel.org/linux-block/[email protected]/
> > > 
> > >    There are also other examples, such as, register buffer stays in one
> > >    link chain, and the consumer OP isn't in this chain, the consumer OP
> > >    can still be issued before issuing register_buffer.
> > 
> > Yep, I got that. Linking is just something I was hoping to avoid. I
> > understand there are conditions that can break the normal flow I'm
> > relying on regarding  the ordering. This hasn't appeared to be a problem
> > in practice, but I agree this needs to be handled.

LINK/ASYNC needs to be supported, and sometimes they are useful.

- IO_LINK is the only way for respecting IO order

  io_uring only supports non-link or link all in one batch

- ASYNC sometimes can avoid to call two ->issue() unnecessarily if you
  know that the OP can't be dealt with async way in advance, maybe not
  one problem for ublk uring_cmd, but it is helpful for some FS write
  (un-allocated write)


Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 2/6] io_uring: create resource release callback
  2025-02-13  1:31   ` Pavel Begunkov
@ 2025-02-13  1:58     ` Keith Busch
  2025-02-13 13:06       ` Pavel Begunkov
  0 siblings, 1 reply; 30+ messages in thread
From: Keith Busch @ 2025-02-13  1:58 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Keith Busch, ming.lei, axboe, linux-block, io-uring, bernd

On Thu, Feb 13, 2025 at 01:31:07AM +0000, Pavel Begunkov wrote:
> On 2/11/25 00:56, Keith Busch wrote:
> > @@ -24,6 +24,9 @@ struct io_rsrc_node {
> >   		unsigned long file_ptr;
> >   		struct io_mapped_ubuf *buf;
> >   	};
> > +
> > +	void (*release)(void *);
> > +	void *priv;
> 
> Nodes are more generic than buffers. Unless we want to reuse it
> for files, and I very much doubt that, it should move into
> io_mapped_ubuf.

Good point. Cloning is another reason it should be in the imu.

If we want to save space, the "KBUFFER" type doesn't need ubuf or
folio_shift, so we could even union the new fields to prevent imu from
getting any bigger. The downside would be we can't use "release" to
distinguish kernel vs user buffers, which you suggested in that patch's
thread. Which option do you prefer?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 4/6] ublk: zc register/unregister bvec
  2025-02-11  0:56 ` [PATCHv2 4/6] ublk: zc register/unregister bvec Keith Busch
  2025-02-12  2:49   ` Ming Lei
@ 2025-02-13  2:12   ` Pavel Begunkov
  1 sibling, 0 replies; 30+ messages in thread
From: Pavel Begunkov @ 2025-02-13  2:12 UTC (permalink / raw)
  To: Keith Busch, ming.lei, axboe, linux-block, io-uring; +Cc: bernd, Keith Busch

On 2/11/25 00:56, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> Provide new operations for the user to request mapping an active request
> to an io uring instance's buf_table. The user has to provide the index
> it wants to install the buffer.
> 
> A reference count is taken on the request to ensure it can't be
> completed while it is active in a ring's buf_table.
> 
> Signed-off-by: Keith Busch <[email protected]>
> ---
>   drivers/block/ublk_drv.c      | 145 +++++++++++++++++++++++++---------
>   include/uapi/linux/ublk_cmd.h |   4 +
>   2 files changed, 113 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 529085181f355..ccfda7b2c24da 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -51,6 +51,9 @@
...
> +static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
> +				  struct ublk_queue *ubq, int tag,
> +				  const struct ublksrv_io_cmd *ub_cmd)
> +{
> +	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> +	struct ublk_device *ub = cmd->file->private_data;
> +	int index = (int)ub_cmd->addr;
> +	struct ublk_rq_data *data;
> +	struct request *req;
> +
> +	if (!ub)
> +		return -EPERM;
> +
> +	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> +	if (!req)
> +		return -EINVAL;
> +
> +	data = blk_mq_rq_to_pdu(req);
> +	if (!test_and_clear_bit(UBLK_ZC_REGISTERED, &data->flags))
> +		return -EINVAL;

Why is it cleared here but not when it's auto removed?
Do you even need it? For example, take the option of not having
UBLK_U_IO_UNREGISTER_IO_BUF and doing all unregistrations the
normal io_uring way. Then you install it, and you know it'll
be released once and no protection needed.

For ublk_unregister_io_buf() it prevents multiple unregistrations,
even though io_uring would tolerate that, and I don't see
anything else meaningful going on here on the ublk side.

Btw, if you do it via ublk like this, then I agree, it's an
additional callback, though it can be made fancier in the
future. E.g. peeking at the {release,priv} and avoid flagging
above, and so on (though maybe flagging helps with ref
overflows).

All that aside, it looks fine in general. The only concern
is ublk going away before all buffers are released, but
maybe it does waits for all its requests to compelte?

> +
> +	io_buffer_unregister_bvec(ctx, index);

Please pass issue_flags into the helper and do conditional
locking inside. I understand that you rely on IO_URING_F_UNLOCKED
checks in ublk, but those are an abuse of io_uring internal
details by ublk. ublk should've never been allowed to interpret
issue_flags.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 2/6] io_uring: create resource release callback
  2025-02-13  1:58     ` Keith Busch
@ 2025-02-13 13:06       ` Pavel Begunkov
  0 siblings, 0 replies; 30+ messages in thread
From: Pavel Begunkov @ 2025-02-13 13:06 UTC (permalink / raw)
  To: Keith Busch; +Cc: Keith Busch, ming.lei, axboe, linux-block, io-uring, bernd

On 2/13/25 01:58, Keith Busch wrote:
> On Thu, Feb 13, 2025 at 01:31:07AM +0000, Pavel Begunkov wrote:
>> On 2/11/25 00:56, Keith Busch wrote:
>>> @@ -24,6 +24,9 @@ struct io_rsrc_node {
>>>    		unsigned long file_ptr;
>>>    		struct io_mapped_ubuf *buf;
>>>    	};
>>> +
>>> +	void (*release)(void *);
>>> +	void *priv;
>>
>> Nodes are more generic than buffers. Unless we want to reuse it
>> for files, and I very much doubt that, it should move into
>> io_mapped_ubuf.
> 
> Good point. Cloning is another reason it should be in the imu.
> 
> If we want to save space, the "KBUFFER" type doesn't need ubuf or
> folio_shift, so we could even union the new fields to prevent imu from
> getting any bigger. The downside would be we can't use "release" to
> distinguish kernel vs user buffers, which you suggested in that patch's
> thread. Which option do you prefer?

I think we can just grow the struct for now, it's not a big
problem, and follow up on that separately. Should make it
easier for the series as well.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCHv2 0/6] ublk zero-copy support
  2025-02-11  0:56 [PATCHv2 0/6] ublk zero-copy support Keith Busch
                   ` (6 preceding siblings ...)
  2025-02-12  2:29 ` [PATCHv2 0/6] ublk zero-copy support Ming Lei
@ 2025-02-13 15:12 ` lizetao
  2025-02-13 16:06   ` Keith Busch
  2025-02-14  2:41   ` Ming Lei
  7 siblings, 2 replies; 30+ messages in thread
From: lizetao @ 2025-02-13 15:12 UTC (permalink / raw)
  To: Keith Busch; +Cc: [email protected], [email protected]

Hi,

> -----Original Message-----
> From: Keith Busch <[email protected]>
> Sent: Tuesday, February 11, 2025 8:57 AM
> To: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]
> Cc: [email protected]; Keith Busch <[email protected]>
> Subject: [PATCHv2 0/6] ublk zero-copy support
> 
> From: Keith Busch <[email protected]>
> 
> Previous version was discussed here:
> 
>   https://lore.kernel.org/linux-block/20250203154517.937623-1-
> [email protected]/
> 
> The same ublksrv reference code in that link was used to test the kernel side
> changes.
> 
> Before listing what has changed, I want to mention what is the same: the
> reliance on the ring ctx lock to serialize the register ahead of any use. I'm not
> ignoring the feedback; I just don't have a solid answer right now, and want to
> progress on the other fronts in the meantime.
> 
> Here's what's different from the previous:
> 
>  - Introduced an optional 'release' callback when the resource node is
>    no longer referenced. The callback addresses any buggy applications
>    that may complete their request and unregister their index while IO
>    is in flight. This obviates any need to take extra page references
>    since it prevents the request from completing.
> 
>  - Removed peeking into the io_cache element size and instead use a
>    more intuitive bvec segment count limit to decide if we're caching
>    the imu (suggested by Pavel).
> 
>  - Dropped the const request changes; it's not needed.

I tested this patch set. When I use null as the device, the test results are like your v1.
When the bs is 4k, there is a slight improvement; when the bs is 64k, there is a significant improvement.
However, when I used loop as the device, I found that there was no improvement, whether using 4k or 64k. As follow:

  ublk add -t loop -f ./ublk-loop.img 
  ublk add -t loop -f ./ublk-loop-zerocopy.img

  fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
    read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)

  fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
    read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)


So, this patch set is optimized for null type devices? Or if I've missed any key information, please let me know.


---
Li Zetao


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 0/6] ublk zero-copy support
  2025-02-13 15:12 ` lizetao
@ 2025-02-13 16:06   ` Keith Busch
  2025-02-14  3:39     ` lizetao
  2025-02-14  2:41   ` Ming Lei
  1 sibling, 1 reply; 30+ messages in thread
From: Keith Busch @ 2025-02-13 16:06 UTC (permalink / raw)
  To: lizetao; +Cc: Keith Busch, [email protected], [email protected]

On Thu, Feb 13, 2025 at 03:12:43PM +0000, lizetao wrote:
> I tested this patch set. When I use null as the device, the test results are like your v1.
> When the bs is 4k, there is a slight improvement; when the bs is 64k, there is a significant improvement.
> However, when I used loop as the device, I found that there was no improvement, whether using 4k or 64k. As follow:
> 
>   ublk add -t loop -f ./ublk-loop.img 
>   ublk add -t loop -f ./ublk-loop-zerocopy.img
> 
>   fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
>     read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)
> 
>   fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
>     read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)
> 
> 
> So, this patch set is optimized for null type devices? Or if I've missed any key information, please let me know.

What do you get if if you run your fio job directly on your
ublk-loop.img file?

Throughput should improve until you've saturated the backend device.
Once you hit that point, the primary benefit of zero-copy come from
decreased memory and CPU utilizations.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 0/6] ublk zero-copy support
  2025-02-13 15:12 ` lizetao
  2025-02-13 16:06   ` Keith Busch
@ 2025-02-14  2:41   ` Ming Lei
  2025-02-14  4:21     ` lizetao
  1 sibling, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-02-14  2:41 UTC (permalink / raw)
  To: lizetao; +Cc: Keith Busch, [email protected], [email protected], Ming Lei

On Thu, Feb 13, 2025 at 11:24 PM lizetao <[email protected]> wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Keith Busch <[email protected]>
> > Sent: Tuesday, February 11, 2025 8:57 AM
> > To: [email protected]; [email protected]; [email protected]; linux-
> > [email protected]; [email protected]
> > Cc: [email protected]; Keith Busch <[email protected]>
> > Subject: [PATCHv2 0/6] ublk zero-copy support
> >
> > From: Keith Busch <[email protected]>
> >
> > Previous version was discussed here:
> >
> >   https://lore.kernel.org/linux-block/20250203154517.937623-1-
> > [email protected]/
> >
> > The same ublksrv reference code in that link was used to test the kernel side
> > changes.
> >
> > Before listing what has changed, I want to mention what is the same: the
> > reliance on the ring ctx lock to serialize the register ahead of any use. I'm not
> > ignoring the feedback; I just don't have a solid answer right now, and want to
> > progress on the other fronts in the meantime.
> >
> > Here's what's different from the previous:
> >
> >  - Introduced an optional 'release' callback when the resource node is
> >    no longer referenced. The callback addresses any buggy applications
> >    that may complete their request and unregister their index while IO
> >    is in flight. This obviates any need to take extra page references
> >    since it prevents the request from completing.
> >
> >  - Removed peeking into the io_cache element size and instead use a
> >    more intuitive bvec segment count limit to decide if we're caching
> >    the imu (suggested by Pavel).
> >
> >  - Dropped the const request changes; it's not needed.
>
> I tested this patch set. When I use null as the device, the test results are like your v1.
> When the bs is 4k, there is a slight improvement; when the bs is 64k, there is a significant improvement.

Yes,  the improvement is usually more obvious with a big IO size(>= 64K).

> However, when I used loop as the device, I found that there was no improvement, whether using 4k or 64k. As follow:
>
>   ublk add -t loop -f ./ublk-loop.img
>   ublk add -t loop -f ./ublk-loop-zerocopy.img
>
>   fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
>     read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)
>
>   fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
>     read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)
>
>
> So, this patch set is optimized for null type devices? Or if I've missed any key information, please let me know.

Latency may have decreased a bit.

System sources can't be saturated in single queue depth, please run
the same test with
high queue depth per Keith's suggestion:

        --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16

Also if you set up the backing file as ramfs image, the improvement
should be pretty
obvious, I observed IOPS doubled in this way.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 3/6] io_uring: add support for kernel registered bvecs
  2025-02-11  0:56 ` [PATCHv2 3/6] io_uring: add support for kernel registered bvecs Keith Busch
  2025-02-13  1:33   ` Pavel Begunkov
@ 2025-02-14  3:30   ` Ming Lei
  2025-02-14 15:26     ` Keith Busch
  1 sibling, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-02-14  3:30 UTC (permalink / raw)
  To: Keith Busch
  Cc: asml.silence, axboe, linux-block, io-uring, bernd, Keith Busch

On Mon, Feb 10, 2025 at 04:56:43PM -0800, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> Provide an interface for the kernel to leverage the existing
> pre-registered buffers that io_uring provides. User space can reference
> these later to achieve zero-copy IO.
> 
> User space must register an empty fixed buffer table with io_uring in
> order for the kernel to make use of it.
> 
> Signed-off-by: Keith Busch <[email protected]>
> ---

...

>  
> +int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
> +			    void (*release)(void *), unsigned int index)
> +{
> +	struct io_rsrc_data *data = &ctx->buf_table;
> +	struct req_iterator rq_iter;
> +	struct io_mapped_ubuf *imu;
> +	struct io_rsrc_node *node;
> +	struct bio_vec bv;
> +	u16 nr_bvecs;
> +	int i = 0;
> +
> +	lockdep_assert_held(&ctx->uring_lock);
> +
> +	if (!data->nr)
> +		return -EINVAL;
> +	if (index >= data->nr)
> +		return -EINVAL;
> +
> +	node = data->nodes[index];
> +	if (node)
> +		return -EBUSY;
> +
> +	node = io_rsrc_node_alloc(IORING_RSRC_KBUFFER);
> +	if (!node)
> +		return -ENOMEM;
> +
> +	node->release = release;
> +	node->priv = rq;
> +
> +	nr_bvecs = blk_rq_nr_phys_segments(rq);
> +	imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
> +	if (!imu) {
> +		kfree(node);
> +		return -ENOMEM;
> +	}
> +
> +	imu->ubuf = 0;
> +	imu->len = blk_rq_bytes(rq);
> +	imu->acct_pages = 0;
> +	imu->nr_bvecs = nr_bvecs;
> +	refcount_set(&imu->refs, 1);
> +	node->buf = imu;

request buffer direction needs to be stored in `imu`, for READ,
the buffer is write-only, and for WRITE, the buffer is read-only,
which isn't different with user mapped buffer.

Meantime in read_fixed/write_fixed side or buffer lookup abstraction
helper, the buffer direction needs to be validated.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCHv2 0/6] ublk zero-copy support
  2025-02-13 16:06   ` Keith Busch
@ 2025-02-14  3:39     ` lizetao
  0 siblings, 0 replies; 30+ messages in thread
From: lizetao @ 2025-02-14  3:39 UTC (permalink / raw)
  To: Keith Busch; +Cc: Keith Busch, [email protected], [email protected]

Hi,

> -----Original Message-----
> From: Keith Busch <[email protected]>
> Sent: Friday, February 14, 2025 12:07 AM
> To: lizetao <[email protected]>
> Cc: Keith Busch <[email protected]>; [email protected];
> [email protected]
> Subject: Re: [PATCHv2 0/6] ublk zero-copy support
> 
> On Thu, Feb 13, 2025 at 03:12:43PM +0000, lizetao wrote:
> > I tested this patch set. When I use null as the device, the test results are like
> your v1.
> > When the bs is 4k, there is a slight improvement; when the bs is 64k, there is
> a significant improvement.
> > However, when I used loop as the device, I found that there was no
> improvement, whether using 4k or 64k. As follow:
> >
> >   ublk add -t loop -f ./ublk-loop.img
> >   ublk add -t loop -f ./ublk-loop-zerocopy.img
> >
> >   fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring
> -bs=128k -size=5G
> >     read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)
> >
> >   fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring
> -bs=128k -size=5G
> >     read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)
> >
> >
> > So, this patch set is optimized for null type devices? Or if I've missed any key
> information, please let me know.
> 
> What do you get if if you run your fio job directly on your ublk-loop.img file?

I test it directly on ublk-loop.img, and the result is as follow:
  
  fio -filename=./ublk-loop.img -direct=1 -rw=read -iodepth=1 -ioengine=io_uring -bs=128k -size=5G
  read: IOPS=1005, BW=126MiB/s (132MB/s)(1258MiB/10009msec)

From the test results, this seems to be limited by the performance limitations of the file system
where ./ublk-loop.img is located.

> Throughput should improve until you've saturated the backend device.
> Once you hit that point, the primary benefit of zero-copy come from decreased
> memory and CPU utilizations.

---
Li Zetao

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCHv2 0/6] ublk zero-copy support
  2025-02-14  2:41   ` Ming Lei
@ 2025-02-14  4:21     ` lizetao
  0 siblings, 0 replies; 30+ messages in thread
From: lizetao @ 2025-02-14  4:21 UTC (permalink / raw)
  To: Ming Lei; +Cc: Keith Busch, [email protected], [email protected]

Hi,

> -----Original Message-----
> From: Ming Lei <[email protected]>
> Sent: Friday, February 14, 2025 10:41 AM
> To: lizetao <[email protected]>
> Cc: Keith Busch <[email protected]>; [email protected];
> [email protected]; Ming Lei <[email protected]>
> Subject: Re: [PATCHv2 0/6] ublk zero-copy support
> 
> On Thu, Feb 13, 2025 at 11:24 PM lizetao <[email protected]> wrote:
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: Keith Busch <[email protected]>
> > > Sent: Tuesday, February 11, 2025 8:57 AM
> > > To: [email protected]; [email protected]; [email protected];
> > > linux- [email protected]; [email protected]
> > > Cc: [email protected]; Keith Busch <[email protected]>
> > > Subject: [PATCHv2 0/6] ublk zero-copy support
> > >
> > > From: Keith Busch <[email protected]>
> > >
> > > Previous version was discussed here:
> > >
> > >   https://lore.kernel.org/linux-block/20250203154517.937623-1-
> > > [email protected]/
> > >
> > > The same ublksrv reference code in that link was used to test the
> > > kernel side changes.
> > >
> > > Before listing what has changed, I want to mention what is the same:
> > > the reliance on the ring ctx lock to serialize the register ahead of
> > > any use. I'm not ignoring the feedback; I just don't have a solid
> > > answer right now, and want to progress on the other fronts in the
> meantime.
> > >
> > > Here's what's different from the previous:
> > >
> > >  - Introduced an optional 'release' callback when the resource node is
> > >    no longer referenced. The callback addresses any buggy applications
> > >    that may complete their request and unregister their index while IO
> > >    is in flight. This obviates any need to take extra page references
> > >    since it prevents the request from completing.
> > >
> > >  - Removed peeking into the io_cache element size and instead use a
> > >    more intuitive bvec segment count limit to decide if we're caching
> > >    the imu (suggested by Pavel).
> > >
> > >  - Dropped the const request changes; it's not needed.
> >
> > I tested this patch set. When I use null as the device, the test results are like
> your v1.
> > When the bs is 4k, there is a slight improvement; when the bs is 64k, there is
> a significant improvement.
> 
> Yes,  the improvement is usually more obvious with a big IO size(>= 64K).
> 
> > However, when I used loop as the device, I found that there was no
> improvement, whether using 4k or 64k. As follow:
> >
> >   ublk add -t loop -f ./ublk-loop.img
> >   ublk add -t loop -f ./ublk-loop-zerocopy.img
> >
> >   fio -filename=/dev/ublkb0 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring
> -bs=128k -size=5G
> >     read: IOPS=2015, BW=126MiB/s (132MB/s)(1260MiB/10005msec)
> >
> >   fio -filename=/dev/ublkb1 -direct=1 -rw=read -iodepth=1 -ioengine=io_uring
> -bs=128k -size=5G
> >     read: IOPS=1998, BW=125MiB/s (131MB/s)(1250MiB/10005msec)
> >
> >
> > So, this patch set is optimized for null type devices? Or if I've missed any key
> information, please let me know.
> 
> Latency may have decreased a bit.
> 
> System sources can't be saturated in single queue depth, please run the same
> test with high queue depth per Keith's suggestion:
> 
>         --iodepth=128 --iodepth_batch_submit=16 --
> iodepth_batch_complete_min=16

I tested it with these settings, but the result is similar to iodepth=1:

  fio -filename=/dev/ublkb0 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
    read: IOPS=2182, BW=136MiB/s (143MB/s)(1440MiB/10558msec)
  
  fio -filename=/dev/ublkb1 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
    read: IOPS=2174, BW=136MiB/s (143MB/s)(1438MiB/10580msec)

So I believe this is limited by the performance limitations of the file system where ./ublk-loop.img is located.

> 
> Also if you set up the backing file as ramfs image, the improvement should be
> pretty obvious, I observed IOPS doubled in this way.

This is true, I tested it in /tmp/ and got a large optimizations. The results as follow:

  fio -filename=/dev/ublkb0 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
    read: IOPS=95.8k, BW=5985MiB/s (6276MB/s)(58.5GiB/10014msec)

  fio -filename=/dev/ublkb1 -direct=1 -rw=read --iodepth=128 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 -ioengine=io_uring -bs=64k -size=8G -numjobs=10
    read: IOPS=170k, BW=10.4GiB/s (11.1GB/s)(80.0GiB/7721msec)

So this test result is in line with expectations.

---
Li Zetao

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 3/6] io_uring: add support for kernel registered bvecs
  2025-02-14  3:30   ` Ming Lei
@ 2025-02-14 15:26     ` Keith Busch
  2025-02-15  1:34       ` Ming Lei
  0 siblings, 1 reply; 30+ messages in thread
From: Keith Busch @ 2025-02-14 15:26 UTC (permalink / raw)
  To: Ming Lei; +Cc: Keith Busch, asml.silence, axboe, linux-block, io-uring, bernd

On Fri, Feb 14, 2025 at 11:30:11AM +0800, Ming Lei wrote:
> On Mon, Feb 10, 2025 at 04:56:43PM -0800, Keith Busch wrote:
> > +
> > +	node->release = release;
> > +	node->priv = rq;
> > +
> > +	nr_bvecs = blk_rq_nr_phys_segments(rq);
> > +	imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
> > +	if (!imu) {
> > +		kfree(node);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	imu->ubuf = 0;
> > +	imu->len = blk_rq_bytes(rq);
> > +	imu->acct_pages = 0;
> > +	imu->nr_bvecs = nr_bvecs;
> > +	refcount_set(&imu->refs, 1);
> > +	node->buf = imu;
> 
> request buffer direction needs to be stored in `imu`, for READ,
> the buffer is write-only, and for WRITE, the buffer is read-only,
> which isn't different with user mapped buffer.
> 
> Meantime in read_fixed/write_fixed side or buffer lookup abstraction
> helper, the buffer direction needs to be validated.

I suppose we could add that check, but the primary use case doesn't even
use those operations. They're using uring_cmd with the FIXED flag, and
io_uring can't readily validate the data direction from that interface.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 3/6] io_uring: add support for kernel registered bvecs
  2025-02-14 15:26     ` Keith Busch
@ 2025-02-15  1:34       ` Ming Lei
  2025-02-18 20:34         ` Keith Busch
  0 siblings, 1 reply; 30+ messages in thread
From: Ming Lei @ 2025-02-15  1:34 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, asml.silence, axboe, linux-block, io-uring, bernd

On Fri, Feb 14, 2025 at 08:26:17AM -0700, Keith Busch wrote:
> On Fri, Feb 14, 2025 at 11:30:11AM +0800, Ming Lei wrote:
> > On Mon, Feb 10, 2025 at 04:56:43PM -0800, Keith Busch wrote:
> > > +
> > > +	node->release = release;
> > > +	node->priv = rq;
> > > +
> > > +	nr_bvecs = blk_rq_nr_phys_segments(rq);
> > > +	imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
> > > +	if (!imu) {
> > > +		kfree(node);
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	imu->ubuf = 0;
> > > +	imu->len = blk_rq_bytes(rq);
> > > +	imu->acct_pages = 0;
> > > +	imu->nr_bvecs = nr_bvecs;
> > > +	refcount_set(&imu->refs, 1);
> > > +	node->buf = imu;
> > 
> > request buffer direction needs to be stored in `imu`, for READ,
> > the buffer is write-only, and for WRITE, the buffer is read-only,
> > which isn't different with user mapped buffer.
> > 
> > Meantime in read_fixed/write_fixed side or buffer lookup abstraction
> > helper, the buffer direction needs to be validated.
> 
> I suppose we could add that check, but the primary use case doesn't even
> use those operations. They're using uring_cmd with the FIXED flag, and
> io_uring can't readily validate the data direction from that interface.

The check can be added to io_import_fixed().

It is a security trouble. Without the validation:

- kernel data can be redirected to user file via write_fixed,

- kernel page data is over-written unexpectedly via read_fixed, cause fs corruption or
even kernel panic.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCHv2 3/6] io_uring: add support for kernel registered bvecs
  2025-02-15  1:34       ` Ming Lei
@ 2025-02-18 20:34         ` Keith Busch
  0 siblings, 0 replies; 30+ messages in thread
From: Keith Busch @ 2025-02-18 20:34 UTC (permalink / raw)
  To: Ming Lei; +Cc: Keith Busch, asml.silence, axboe, linux-block, io-uring, bernd

On Sat, Feb 15, 2025 at 09:34:42AM +0800, Ming Lei wrote:
> > I suppose we could add that check, but the primary use case doesn't even
> > use those operations. They're using uring_cmd with the FIXED flag, and
> > io_uring can't readily validate the data direction from that interface.
> 
> The check can be added to io_import_fixed().

Good point, will incorporate this change.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-02-18 20:34 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-11  0:56 [PATCHv2 0/6] ublk zero-copy support Keith Busch
2025-02-11  0:56 ` [PATCHv2 1/6] io_uring: use node for import Keith Busch
2025-02-11  0:56 ` [PATCHv2 2/6] io_uring: create resource release callback Keith Busch
2025-02-13  1:31   ` Pavel Begunkov
2025-02-13  1:58     ` Keith Busch
2025-02-13 13:06       ` Pavel Begunkov
2025-02-11  0:56 ` [PATCHv2 3/6] io_uring: add support for kernel registered bvecs Keith Busch
2025-02-13  1:33   ` Pavel Begunkov
2025-02-14  3:30   ` Ming Lei
2025-02-14 15:26     ` Keith Busch
2025-02-15  1:34       ` Ming Lei
2025-02-18 20:34         ` Keith Busch
2025-02-11  0:56 ` [PATCHv2 4/6] ublk: zc register/unregister bvec Keith Busch
2025-02-12  2:49   ` Ming Lei
2025-02-12  4:11     ` Keith Busch
2025-02-12  9:24       ` Ming Lei
2025-02-12 14:59         ` Keith Busch
2025-02-13  2:12   ` Pavel Begunkov
2025-02-11  0:56 ` [PATCHv2 5/6] io_uring: add abstraction for buf_table rsrc data Keith Busch
2025-02-11  0:56 ` [PATCHv2 6/6] io_uring: cache nodes and mapped buffers Keith Busch
2025-02-11 16:47   ` Keith Busch
2025-02-12  2:29 ` [PATCHv2 0/6] ublk zero-copy support Ming Lei
2025-02-12 15:28   ` Keith Busch
2025-02-12 16:06     ` Pavel Begunkov
2025-02-13  1:52       ` Ming Lei
2025-02-13 15:12 ` lizetao
2025-02-13 16:06   ` Keith Busch
2025-02-14  3:39     ` lizetao
2025-02-14  2:41   ` Ming Lei
2025-02-14  4:21     ` lizetao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox