public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/7] ublk: support to register bvec buffer automatically
@ 2025-04-28  9:44 Ming Lei
  2025-04-28  9:44 ` [RFC PATCH 1/7] io_uring: add 'struct io_buf_data' for register/unregister bvec buffer Ming Lei
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: Ming Lei @ 2025-04-28  9:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring, Pavel Begunkov, linux-block
  Cc: Uday Shankar, Caleb Sander Mateos, Keith Busch, Ming Lei

Hello Jens,

This patchset addresses limitations of the in-tree ublk zero-copy support:

- each IO needs two extra uring_cmds to register and unregister the bvec
  buffer, which is inefficient

- the two uring_cmds introduce an ordering dependency, so the buffer
  consumer has to be linked with them, which is hard to use and less
  efficient

This patchset adds the feature UBLK_F_AUTO_BUF_REG:

- register the request buffer automatically before delivering the io
  command to the ublk server

- unregister the request buffer automatically when completing the request

- both the io_uring context and the buffer index can be specified via the
  ublk uring_cmd header
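As a userspace illustration of the last point, the io_uring FD, buffer
index and flags have to be carried in the fixed-size uring_cmd header
field that previously held the buffer address. The real layout is defined
by 'struct ublk_auto_buf_reg' in include/uapi/linux/ublk_cmd.h (not shown
in this excerpt); the field widths and bit positions below are assumptions
for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wire layout; the real struct ublk_auto_buf_reg lives in
 * include/uapi/linux/ublk_cmd.h and may differ. */
struct auto_buf_reg {
	int32_t  ring_fd;  /* io_uring FD to register the buffer into */
	uint16_t index;    /* fixed buffer table index */
	uint8_t  flags;    /* e.g. "ring_fd is a registered ring FD" */
};

/* Pack into the 64-bit header field that previously carried 'addr'. */
static uint64_t auto_buf_reg_pack(struct auto_buf_reg r)
{
	return (uint64_t)(uint32_t)r.ring_fd |
	       ((uint64_t)r.index << 32) |
	       ((uint64_t)r.flags << 48);
}

static struct auto_buf_reg auto_buf_reg_unpack(uint64_t v)
{
	struct auto_buf_reg r = {
		.ring_fd = (int32_t)(uint32_t)(v & 0xffffffffu),
		.index   = (uint16_t)((v >> 32) & 0xffff),
		.flags   = (uint8_t)((v >> 48) & 0xff),
	};
	return r;
}
```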

With this approach, 'fio/t/io_uring -p0 /dev/ublkb0' shows that IOPS is
improved by 50% compared with UBLK_F_SUPPORT_ZERO_COPY in my test VM.

Kernel selftests are added to cover both functional and stress testing.

Please review & comment!

Thanks,
Ming

Ming Lei (7):
  io_uring: add 'struct io_buf_data' for register/unregister bvec buffer
  io_uring: add helper __io_buffer_[un]register_bvec
  io_uring: support to register bvec buffer to specified io_uring
  ublk: convert to refcount_t
  ublk: prepare for supporting to register request buffer automatically
  ublk: register buffer to specified io_uring & buf index via
    UBLK_F_AUTO_BUF_REG
  selftests: ublk: support UBLK_F_AUTO_BUF_REG

 drivers/block/ublk_drv.c                      | 165 ++++++++++++++----
 include/linux/io_uring/cmd.h                  |  15 +-
 include/uapi/linux/ublk_cmd.h                 |  38 ++++
 io_uring/rsrc.c                               | 141 ++++++++++-----
 tools/testing/selftests/ublk/Makefile         |   2 +
 tools/testing/selftests/ublk/file_backed.c    |  12 +-
 tools/testing/selftests/ublk/kublk.c          |  24 ++-
 tools/testing/selftests/ublk/kublk.h          |   6 +
 tools/testing/selftests/ublk/null.c           |  43 +++--
 tools/testing/selftests/ublk/stripe.c         |  21 +--
 .../testing/selftests/ublk/test_generic_08.sh |  32 ++++
 .../testing/selftests/ublk/test_stress_03.sh  |   6 +
 .../testing/selftests/ublk/test_stress_04.sh  |   6 +
 .../testing/selftests/ublk/test_stress_05.sh  |   8 +
 14 files changed, 409 insertions(+), 110 deletions(-)
 create mode 100755 tools/testing/selftests/ublk/test_generic_08.sh

-- 
2.47.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH 1/7] io_uring: add 'struct io_buf_data' for register/unregister bvec buffer
  2025-04-28  9:44 [RFC PATCH 0/7] ublk: support to register bvec buffer automatically Ming Lei
@ 2025-04-28  9:44 ` Ming Lei
  2025-04-29  0:35   ` Caleb Sander Mateos
  2025-04-28  9:44 ` [RFC PATCH 2/7] io_uring: add helper __io_buffer_[un]register_bvec Ming Lei
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2025-04-28  9:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring, Pavel Begunkov, linux-block
  Cc: Uday Shankar, Caleb Sander Mateos, Keith Busch, Ming Lei

Add 'struct io_buf_data' for registering/unregistering a bvec buffer, and
prepare for supporting registration of a buffer into a specified io_uring
context identified by its FD.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c     | 13 ++++++++++---
 include/linux/io_uring/cmd.h | 11 ++++++++---
 io_uring/rsrc.c              | 12 ++++++++----
 3 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 0d82014679f8..ac56482b55f5 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1925,6 +1925,10 @@ static int ublk_register_io_buf(struct io_uring_cmd *cmd,
 {
 	struct ublk_device *ub = cmd->file->private_data;
 	const struct ublk_io *io = &ubq->ios[tag];
+	struct io_buf_data data = {
+		.index = index,
+		.release = ublk_io_release,
+	};
 	struct request *req;
 	int ret;
 
@@ -1938,8 +1942,8 @@ static int ublk_register_io_buf(struct io_uring_cmd *cmd,
 	if (!req)
 		return -EINVAL;
 
-	ret = io_buffer_register_bvec(cmd, req, ublk_io_release, index,
-				      issue_flags);
+	data.rq = req;
+	ret = io_buffer_register_bvec(cmd, &data, issue_flags);
 	if (ret) {
 		ublk_put_req_ref(ubq, req);
 		return ret;
@@ -1953,6 +1957,9 @@ static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
 				  unsigned int index, unsigned int issue_flags)
 {
 	const struct ublk_io *io = &ubq->ios[tag];
+	struct io_buf_data data = {
+		.index = index,
+	};
 
 	if (!ublk_support_zero_copy(ubq))
 		return -EINVAL;
@@ -1960,7 +1967,7 @@ static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
 	if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
 		return -EINVAL;
 
-	return io_buffer_unregister_bvec(cmd, index, issue_flags);
+	return io_buffer_unregister_bvec(cmd, &data, issue_flags);
 }
 
 static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq,
diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
index 0634a3de1782..78fa336a284b 100644
--- a/include/linux/io_uring/cmd.h
+++ b/include/linux/io_uring/cmd.h
@@ -23,6 +23,12 @@ struct io_uring_cmd_data {
 	void			*op_data;
 };
 
+struct io_buf_data {
+	unsigned short index;
+	struct request *rq;
+	void (*release)(void *);
+};
+
 static inline const void *io_uring_sqe_cmd(const struct io_uring_sqe *sqe)
 {
 	return sqe->cmd;
@@ -140,10 +146,9 @@ static inline struct io_uring_cmd_data *io_uring_cmd_get_async_data(struct io_ur
 	return cmd_to_io_kiocb(cmd)->async_data;
 }
 
-int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
-			    void (*release)(void *), unsigned int index,
+int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct io_buf_data *data,
 			    unsigned int issue_flags);
-int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
+int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, struct io_buf_data *data,
 			      unsigned int issue_flags);
 
 #endif /* _LINUX_IO_URING_CMD_H */
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index b4c5f3ee8855..66d2c11e2f46 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -918,12 +918,14 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
-int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
-			    void (*release)(void *), unsigned int index,
+int io_buffer_register_bvec(struct io_uring_cmd *cmd,
+			    struct io_buf_data *buf,
 			    unsigned int issue_flags)
 {
 	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
 	struct io_rsrc_data *data = &ctx->buf_table;
+	unsigned int index = buf->index;
+	struct request *rq = buf->rq;
 	struct req_iterator rq_iter;
 	struct io_mapped_ubuf *imu;
 	struct io_rsrc_node *node;
@@ -963,7 +965,7 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
 	imu->folio_shift = PAGE_SHIFT;
 	imu->nr_bvecs = nr_bvecs;
 	refcount_set(&imu->refs, 1);
-	imu->release = release;
+	imu->release = buf->release;
 	imu->priv = rq;
 	imu->is_kbuf = true;
 	imu->dir = 1 << rq_data_dir(rq);
@@ -980,11 +982,13 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
 }
 EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
 
-int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
+int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
+			      struct io_buf_data *buf,
 			      unsigned int issue_flags)
 {
 	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
 	struct io_rsrc_data *data = &ctx->buf_table;
+	unsigned index = buf->index;
 	struct io_rsrc_node *node;
 	int ret = 0;
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 2/7] io_uring: add helper __io_buffer_[un]register_bvec
  2025-04-28  9:44 [RFC PATCH 0/7] ublk: support to register bvec buffer automatically Ming Lei
  2025-04-28  9:44 ` [RFC PATCH 1/7] io_uring: add 'struct io_buf_data' for register/unregister bvec buffer Ming Lei
@ 2025-04-28  9:44 ` Ming Lei
  2025-04-29  0:36   ` Caleb Sander Mateos
  2025-04-28  9:44 ` [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring Ming Lei
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2025-04-28  9:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring, Pavel Begunkov, linux-block
  Cc: Uday Shankar, Caleb Sander Mateos, Keith Busch, Ming Lei

Add the helpers __io_buffer_[un]register_bvec() and prepare for supporting
registration of a bvec buffer into a specified io_uring context and buffer
index.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 io_uring/rsrc.c | 88 ++++++++++++++++++++++++++-----------------------
 1 file changed, 46 insertions(+), 42 deletions(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 66d2c11e2f46..5f8ab130a573 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -918,11 +918,9 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
-int io_buffer_register_bvec(struct io_uring_cmd *cmd,
-			    struct io_buf_data *buf,
-			    unsigned int issue_flags)
+static int __io_buffer_register_bvec(struct io_ring_ctx *ctx,
+				     struct io_buf_data *buf)
 {
-	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
 	struct io_rsrc_data *data = &ctx->buf_table;
 	unsigned int index = buf->index;
 	struct request *rq = buf->rq;
@@ -931,32 +929,23 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd,
 	struct io_rsrc_node *node;
 	struct bio_vec bv, *bvec;
 	u16 nr_bvecs;
-	int ret = 0;
 
-	io_ring_submit_lock(ctx, issue_flags);
-	if (index >= data->nr) {
-		ret = -EINVAL;
-		goto unlock;
-	}
-	index = array_index_nospec(index, data->nr);
+	if (index >= data->nr)
+		return -EINVAL;
 
-	if (data->nodes[index]) {
-		ret = -EBUSY;
-		goto unlock;
-	}
+	index = array_index_nospec(index, data->nr);
+	if (data->nodes[index])
+		return -EBUSY;
 
 	node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
-	if (!node) {
-		ret = -ENOMEM;
-		goto unlock;
-	}
+	if (!node)
+		return -ENOMEM;
 
 	nr_bvecs = blk_rq_nr_phys_segments(rq);
 	imu = io_alloc_imu(ctx, nr_bvecs);
 	if (!imu) {
 		kfree(node);
-		ret = -ENOMEM;
-		goto unlock;
+		return -ENOMEM;
 	}
 
 	imu->ubuf = 0;
@@ -976,43 +965,58 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd,
 
 	node->buf = imu;
 	data->nodes[index] = node;
-unlock:
+
+	return 0;
+}
+
+int io_buffer_register_bvec(struct io_uring_cmd *cmd,
+			    struct io_buf_data *buf,
+			    unsigned int issue_flags)
+{
+	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+	int ret;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	ret = __io_buffer_register_bvec(ctx, buf);
 	io_ring_submit_unlock(ctx, issue_flags);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
 
-int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
-			      struct io_buf_data *buf,
-			      unsigned int issue_flags)
+static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
+				       struct io_buf_data *buf)
 {
-	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
 	struct io_rsrc_data *data = &ctx->buf_table;
 	unsigned index = buf->index;
 	struct io_rsrc_node *node;
-	int ret = 0;
 
-	io_ring_submit_lock(ctx, issue_flags);
-	if (index >= data->nr) {
-		ret = -EINVAL;
-		goto unlock;
-	}
-	index = array_index_nospec(index, data->nr);
+	if (index >= data->nr)
+		return -EINVAL;
 
+	index = array_index_nospec(index, data->nr);
 	node = data->nodes[index];
-	if (!node) {
-		ret = -EINVAL;
-		goto unlock;
-	}
-	if (!node->buf->is_kbuf) {
-		ret = -EBUSY;
-		goto unlock;
-	}
+	if (!node)
+		return -EINVAL;
+	if (!node->buf->is_kbuf)
+		return -EBUSY;
 
 	io_put_rsrc_node(ctx, node);
 	data->nodes[index] = NULL;
-unlock:
+	return 0;
+}
+
+int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
+			      struct io_buf_data *buf,
+			      unsigned int issue_flags)
+{
+	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+	int ret;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	ret = __io_buffer_unregister_bvec(ctx, buf);
 	io_ring_submit_unlock(ctx, issue_flags);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-04-28  9:44 [RFC PATCH 0/7] ublk: support to register bvec buffer automatically Ming Lei
  2025-04-28  9:44 ` [RFC PATCH 1/7] io_uring: add 'struct io_buf_data' for register/unregister bvec buffer Ming Lei
  2025-04-28  9:44 ` [RFC PATCH 2/7] io_uring: add helper __io_buffer_[un]register_bvec Ming Lei
@ 2025-04-28  9:44 ` Ming Lei
  2025-04-28 10:28   ` Pavel Begunkov
  2025-04-29  0:43   ` Caleb Sander Mateos
  2025-04-28  9:44 ` [RFC PATCH 4/7] ublk: convert to refcount_t Ming Lei
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 29+ messages in thread
From: Ming Lei @ 2025-04-28  9:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring, Pavel Begunkov, linux-block
  Cc: Uday Shankar, Caleb Sander Mateos, Keith Busch, Ming Lei

Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() to
support registering/unregistering a bvec buffer to a specified io_uring
context, whose FD is usually passed from userspace.
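The interesting part of this patch is the locking dance in
io_buffer_reg_unreg_bvec(): when the buffer targets a different ring, the
caller's uring_lock is dropped before the remote ring's lock is taken, so
the two locks are never held at once and no ABBA deadlock between two
rings is possible. A toy userspace model of that pattern, with pthread
mutexes standing in for uring_lock (all names here are illustrative, not
kernel APIs):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct ring_ctx {
	pthread_mutex_t uring_lock;
	int nr_registered;
};

/* caller must hold ctx->uring_lock */
static void do_register(struct ring_ctx *ctx)
{
	ctx->nr_registered++;
}

static void register_to(struct ring_ctx *local, struct ring_ctx *remote,
			bool local_locked)
{
	if (remote == local) {
		/* same ring: operate under the (possibly held) local lock */
		if (!local_locked)
			pthread_mutex_lock(&local->uring_lock);
		do_register(local);
		if (!local_locked)
			pthread_mutex_unlock(&local->uring_lock);
		return;
	}
	/* different ring: drop our lock before taking the remote one */
	if (local_locked)
		pthread_mutex_unlock(&local->uring_lock);
	pthread_mutex_lock(&remote->uring_lock);
	do_register(remote);
	pthread_mutex_unlock(&remote->uring_lock);
	/* restore the caller's locking state before returning */
	if (local_locked)
		pthread_mutex_lock(&local->uring_lock);
}
```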

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/io_uring/cmd.h |  4 ++
 io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
 2 files changed, 67 insertions(+), 20 deletions(-)

diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
index 78fa336a284b..7516fe5cd606 100644
--- a/include/linux/io_uring/cmd.h
+++ b/include/linux/io_uring/cmd.h
@@ -25,6 +25,10 @@ struct io_uring_cmd_data {
 
 struct io_buf_data {
 	unsigned short index;
+	bool has_fd;
+	bool registered_fd;
+
+	int ring_fd;
 	struct request *rq;
 	void (*release)(void *);
 };
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 5f8ab130a573..701dd33fecf7 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -969,21 +969,6 @@ static int __io_buffer_register_bvec(struct io_ring_ctx *ctx,
 	return 0;
 }
 
-int io_buffer_register_bvec(struct io_uring_cmd *cmd,
-			    struct io_buf_data *buf,
-			    unsigned int issue_flags)
-{
-	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
-	int ret;
-
-	io_ring_submit_lock(ctx, issue_flags);
-	ret = __io_buffer_register_bvec(ctx, buf);
-	io_ring_submit_unlock(ctx, issue_flags);
-
-	return ret;
-}
-EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
-
 static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
 				       struct io_buf_data *buf)
 {
@@ -1006,19 +991,77 @@ static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
 	return 0;
 }
 
-int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
-			      struct io_buf_data *buf,
-			      unsigned int issue_flags)
+static inline int do_reg_unreg_bvec(struct io_ring_ctx *ctx,
+				    struct io_buf_data *buf,
+				    unsigned int issue_flags,
+				    bool reg)
 {
-	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
 	int ret;
 
 	io_ring_submit_lock(ctx, issue_flags);
-	ret = __io_buffer_unregister_bvec(ctx, buf);
+	if (reg)
+		ret = __io_buffer_register_bvec(ctx, buf);
+	else
+		ret = __io_buffer_unregister_bvec(ctx, buf);
 	io_ring_submit_unlock(ctx, issue_flags);
 
 	return ret;
 }
+
+static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
+				    struct io_buf_data *buf,
+				    unsigned int issue_flags,
+				    bool reg)
+{
+	struct io_ring_ctx *remote_ctx = ctx;
+	struct file *file = NULL;
+	int ret;
+
+	if (buf->has_fd) {
+		file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);
+		if (IS_ERR(file))
+			return PTR_ERR(file);
+		remote_ctx = file->private_data;
+		if (!remote_ctx) {
+			fput(file);
+			return -EINVAL;
+		}
+	}
+
+	if (remote_ctx == ctx) {
+		ret = do_reg_unreg_bvec(ctx, buf, issue_flags, reg);
+	} else {
+		if (!(issue_flags & IO_URING_F_UNLOCKED))
+			mutex_unlock(&ctx->uring_lock);
+
+		ret = do_reg_unreg_bvec(remote_ctx, buf, IO_URING_F_UNLOCKED, reg);
+
+		if (!(issue_flags & IO_URING_F_UNLOCKED))
+			mutex_lock(&ctx->uring_lock);
+	}
+
+	if (file)
+		fput(file);
+
+	return ret;
+}
+
+int io_buffer_register_bvec(struct io_uring_cmd *cmd,
+			    struct io_buf_data *buf,
+			    unsigned int issue_flags)
+{
+	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+
+	return io_buffer_reg_unreg_bvec(ctx, buf, issue_flags, true);
+}
+EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
+
+int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
+			      struct io_buf_data *buf,
+			      unsigned int issue_flags)
+{
+	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+
+	return io_buffer_reg_unreg_bvec(ctx, buf, issue_flags, false);
+}
 EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
 
 static int validate_fixed_range(u64 buf_addr, size_t len,
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 4/7] ublk: convert to refcount_t
  2025-04-28  9:44 [RFC PATCH 0/7] ublk: support to register bvec buffer automatically Ming Lei
                   ` (2 preceding siblings ...)
  2025-04-28  9:44 ` [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring Ming Lei
@ 2025-04-28  9:44 ` Ming Lei
  2025-04-28 17:13   ` Caleb Sander Mateos
  2025-04-28  9:44 ` [RFC PATCH 5/7] ublk: prepare for supporting to register request buffer automatically Ming Lei
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2025-04-28  9:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring, Pavel Begunkov, linux-block
  Cc: Uday Shankar, Caleb Sander Mateos, Keith Busch, Ming Lei

Convert the per-request reference counter from kref to refcount_t.
Automatic bvec buffer registration will need to initialize the reference
counter to 2, and kref doesn't provide such an interface.
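For readers less familiar with the kernel refcount API, a minimal
userspace sketch of the three operations this patch switches to, built on
C11 atomics (the kernel implementation additionally saturates and warns on
over/underflow, which is omitted here):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's refcount_t; unlike kref_init(),
 * refcount_set() can start the counter at any value, e.g. 2. */
typedef struct { atomic_int refs; } u_refcount_t;

static void u_refcount_set(u_refcount_t *r, int n)
{
	atomic_store(&r->refs, n);
}

/* Take a reference only if the object is still alive (count != 0). */
static bool u_refcount_inc_not_zero(u_refcount_t *r)
{
	int old = atomic_load(&r->refs);

	while (old != 0)
		if (atomic_compare_exchange_weak(&r->refs, &old, old + 1))
			return true;
	return false;
}

/* Drop a reference; returns true when this was the last one. */
static bool u_refcount_dec_and_test(u_refcount_t *r)
{
	return atomic_fetch_sub(&r->refs, 1) == 1;
}
```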

Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 19 +++++--------------
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index ac56482b55f5..9cd331d12fa6 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -79,7 +79,7 @@
 	 UBLK_PARAM_TYPE_DMA_ALIGN | UBLK_PARAM_TYPE_SEGMENT)
 
 struct ublk_rq_data {
-	struct kref ref;
+	refcount_t ref;
 };
 
 struct ublk_uring_cmd_pdu {
@@ -484,7 +484,6 @@ static blk_status_t ublk_setup_iod_zoned(struct ublk_queue *ubq,
 #endif
 
 static inline void __ublk_complete_rq(struct request *req);
-static void ublk_complete_rq(struct kref *ref);
 
 static dev_t ublk_chr_devt;
 static const struct class ublk_chr_class = {
@@ -644,7 +643,7 @@ static inline void ublk_init_req_ref(const struct ublk_queue *ubq,
 	if (ublk_need_req_ref(ubq)) {
 		struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
 
-		kref_init(&data->ref);
+		refcount_set(&data->ref, 1);
 	}
 }
 
@@ -654,7 +653,7 @@ static inline bool ublk_get_req_ref(const struct ublk_queue *ubq,
 	if (ublk_need_req_ref(ubq)) {
 		struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
 
-		return kref_get_unless_zero(&data->ref);
+		return refcount_inc_not_zero(&data->ref);
 	}
 
 	return true;
@@ -666,7 +665,8 @@ static inline void ublk_put_req_ref(const struct ublk_queue *ubq,
 	if (ublk_need_req_ref(ubq)) {
 		struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
 
-		kref_put(&data->ref, ublk_complete_rq);
+		if (refcount_dec_and_test(&data->ref))
+			__ublk_complete_rq(req);
 	} else {
 		__ublk_complete_rq(req);
 	}
@@ -1124,15 +1124,6 @@ static inline void __ublk_complete_rq(struct request *req)
 	blk_mq_end_request(req, res);
 }
 
-static void ublk_complete_rq(struct kref *ref)
-{
-	struct ublk_rq_data *data = container_of(ref, struct ublk_rq_data,
-			ref);
-	struct request *req = blk_mq_rq_from_pdu(data);
-
-	__ublk_complete_rq(req);
-}
-
 static void ublk_complete_io_cmd(struct ublk_io *io, struct request *req,
 				 int res, unsigned issue_flags)
 {
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 5/7] ublk: prepare for supporting to register request buffer automatically
  2025-04-28  9:44 [RFC PATCH 0/7] ublk: support to register bvec buffer automatically Ming Lei
                   ` (3 preceding siblings ...)
  2025-04-28  9:44 ` [RFC PATCH 4/7] ublk: convert to refcount_t Ming Lei
@ 2025-04-28  9:44 ` Ming Lei
  2025-04-29  0:50   ` Caleb Sander Mateos
  2025-04-28  9:44 ` [RFC PATCH 6/7] ublk: register buffer to specified io_uring & buf index via UBLK_F_AUTO_BUF_REG Ming Lei
  2025-04-28  9:44 ` [RFC PATCH 7/7] selftests: ublk: support UBLK_F_AUTO_BUF_REG Ming Lei
  6 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2025-04-28  9:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring, Pavel Begunkov, linux-block
  Cc: Uday Shankar, Caleb Sander Mateos, Keith Busch, Ming Lei

UBLK_F_SUPPORT_ZERO_COPY requires the ublk server to issue explicit buffer
register/unregister uring_cmds for each IO. This is not only inefficient,
but also introduces a dependency between the buffer consumer and the
buffer register/unregister uring_cmds; see
tools/testing/selftests/ublk/stripe.c, in which backing-file IO has to be
issued one by one via IOSQE_IO_LINK.

Prepare for adding the feature UBLK_F_AUTO_BUF_REG to address the existing
zero-copy limitations:

- register the request buffer automatically to the ublk uring_cmd's
  io_uring context before delivering the io command to the ublk server

- unregister the request buffer automatically from the ublk uring_cmd's
  io_uring context when completing the request

- io_uring unregisters the buffer automatically when the ring is exiting,
  so we needn't worry about an unexpected ublk server exit

To use this feature, the ublk server has to create one sparse buffer table.
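The commit message above implies a two-reference lifecycle: the request
starts with refcount 2, the buffer-table release callback drops one
reference when the buffer is unregistered, COMMIT_AND_FETCH drops the
other, and the request completes only when the count reaches zero. A toy
userspace walk-through of that scheme (all names illustrative, not the
driver's actual functions):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_req {
	int  refs;
	bool completed;
};

static void put_ref(struct toy_req *req)
{
	/* models ublk_put_req_ref() -> __ublk_complete_rq() */
	if (--req->refs == 0)
		req->completed = true;
}

static void dispatch_with_auto_reg(struct toy_req *req)
{
	/* one reference for the io_uring buffer table, one for completion */
	req->refs = 2;
	req->completed = false;
}

/* release callback invoked when the buffer leaves the buffer table */
static void buffer_release(struct toy_req *req)
{
	put_ref(req);
}

/* server commits the result via UBLK_IO_COMMIT_AND_FETCH_REQ */
static void commit_and_fetch(struct toy_req *req)
{
	put_ref(req);
}
```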

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 85 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 76 insertions(+), 9 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 9cd331d12fa6..1fd20e481a60 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -133,6 +133,14 @@ struct ublk_uring_cmd_pdu {
  */
 #define UBLK_IO_FLAG_NEED_GET_DATA 0x08
 
+/*
+ * The request buffer is registered automatically, so we have to
+ * unregister it before completing this request.
+ *
+ * io_uring unregisters the buffer automatically when the ring exits.
+ */
+#define UBLK_IO_FLAG_AUTO_BUF_REG	0x10
+
 /* atomic RW with ubq->cancel_lock */
 #define UBLK_IO_FLAG_CANCELED	0x80000000
 
@@ -205,6 +213,7 @@ struct ublk_params_header {
 	__u32	types;
 };
 
+static void ublk_io_release(void *priv);
 static void ublk_stop_dev_unlocked(struct ublk_device *ub);
 static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
 static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
@@ -615,6 +624,11 @@ static inline bool ublk_support_zero_copy(const struct ublk_queue *ubq)
 	return ubq->flags & UBLK_F_SUPPORT_ZERO_COPY;
 }
 
+static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq)
+{
+	return false;
+}
+
 static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
 {
 	return ubq->flags & UBLK_F_USER_COPY;
@@ -622,7 +636,8 @@ static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
 
 static inline bool ublk_need_map_io(const struct ublk_queue *ubq)
 {
-	return !ublk_support_user_copy(ubq) && !ublk_support_zero_copy(ubq);
+	return !ublk_support_user_copy(ubq) && !ublk_support_zero_copy(ubq) &&
+		!ublk_support_auto_buf_reg(ubq);
 }
 
 static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
@@ -633,17 +648,22 @@ static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
 	 *
 	 * for zero copy, request buffer need to be registered to io_uring
 	 * buffer table, so reference is needed
+	 *
+	 * For auto buffer register, the ublk server may still issue
+	 * UBLK_IO_COMMIT_AND_FETCH_REQ before the registered buffer is
+	 * used up, so a reference is required too.
 	 */
-	return ublk_support_user_copy(ubq) || ublk_support_zero_copy(ubq);
+	return ublk_support_user_copy(ubq) || ublk_support_zero_copy(ubq) ||
+		ublk_support_auto_buf_reg(ubq);
 }
 
 static inline void ublk_init_req_ref(const struct ublk_queue *ubq,
-		struct request *req)
+		struct request *req, int init_ref)
 {
 	if (ublk_need_req_ref(ubq)) {
 		struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
 
-		refcount_set(&data->ref, 1);
+		refcount_set(&data->ref, init_ref);
 	}
 }
 
@@ -1157,6 +1177,37 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq,
 		blk_mq_end_request(rq, BLK_STS_IOERR);
 }
 
+static bool ublk_auto_buf_reg(struct ublk_queue *ubq, struct request *req,
+			      struct ublk_io *io, unsigned int issue_flags)
+{
+	struct io_buf_data data = {
+		.rq    = req,
+		.index = req->tag,
+		.release = ublk_io_release,
+	};
+	int ret;
+
+	/* one extra reference is dropped by ublk_io_release */
+	ublk_init_req_ref(ubq, req, 2);
+	ret = io_buffer_register_bvec(io->cmd, &data, issue_flags);
+	if (ret) {
+		blk_mq_end_request(req, BLK_STS_IOERR);
+		return false;
+	}
+	io->flags |= UBLK_IO_FLAG_AUTO_BUF_REG;
+	return true;
+}
+
+static bool ublk_prep_buf_reg(struct ublk_queue *ubq, struct request *req,
+			      struct ublk_io *io, unsigned int issue_flags)
+{
+	if (ublk_support_auto_buf_reg(ubq) && ublk_rq_has_data(req))
+		return ublk_auto_buf_reg(ubq, req, io, issue_flags);
+
+	ublk_init_req_ref(ubq, req, 1);
+	return true;
+}
+
 static void ublk_start_io(struct ublk_queue *ubq, struct request *req,
 			  struct ublk_io *io)
 {
@@ -1181,8 +1232,6 @@ static void ublk_start_io(struct ublk_queue *ubq, struct request *req,
 		ublk_get_iod(ubq, req->tag)->nr_sectors =
 			mapped_bytes >> 9;
 	}
-
-	ublk_init_req_ref(ubq, req);
 }
 
 static void ublk_dispatch_req(struct ublk_queue *ubq,
@@ -1225,7 +1274,9 @@ static void ublk_dispatch_req(struct ublk_queue *ubq,
 	}
 
 	ublk_start_io(ubq, req, io);
-	ublk_complete_io_cmd(io, req, UBLK_IO_RES_OK, issue_flags);
+
+	if (ublk_prep_buf_reg(ubq, req, io, issue_flags))
+		ublk_complete_io_cmd(io, req, UBLK_IO_RES_OK, issue_flags);
 }
 
 static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
@@ -2007,9 +2058,21 @@ static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq,
 	return ret;
 }
 
+static void ublk_auto_buf_unreg(struct ublk_io *io, struct io_uring_cmd *cmd,
+				struct request *req, unsigned int issue_flags)
+{
+	struct io_buf_data data = {
+		.index = req->tag,
+	};
+
+	WARN_ON_ONCE(io_buffer_unregister_bvec(cmd, &data, issue_flags));
+	io->flags &= ~UBLK_IO_FLAG_AUTO_BUF_REG;
+}
+
 static int ublk_commit_and_fetch(const struct ublk_queue *ubq,
 				 struct ublk_io *io, struct io_uring_cmd *cmd,
-				 const struct ublksrv_io_cmd *ub_cmd)
+				 const struct ublksrv_io_cmd *ub_cmd,
+				 unsigned int issue_flags)
 {
 	struct request *req;
 
@@ -2033,6 +2096,9 @@ static int ublk_commit_and_fetch(const struct ublk_queue *ubq,
 		return -EINVAL;
 	}
 
+	if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG)
+		ublk_auto_buf_unreg(io, cmd, req, issue_flags);
+
 	ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
 
 	/* now this cmd slot is owned by ublk driver */
@@ -2065,6 +2131,7 @@ static void ublk_get_data(struct ublk_queue *ubq, struct ublk_io *io)
 			ublk_get_iod(ubq, req->tag)->addr);
 
 	ublk_start_io(ubq, req, io);
+	ublk_init_req_ref(ubq, req, 1);
 }
 
 static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
@@ -2124,7 +2191,7 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 			goto out;
 		break;
 	case UBLK_IO_COMMIT_AND_FETCH_REQ:
-		ret = ublk_commit_and_fetch(ubq, io, cmd, ub_cmd);
+		ret = ublk_commit_and_fetch(ubq, io, cmd, ub_cmd, issue_flags);
 		if (ret)
 			goto out;
 		break;
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 6/7] ublk: register buffer to specified io_uring & buf index via UBLK_F_AUTO_BUF_REG
  2025-04-28  9:44 [RFC PATCH 0/7] ublk: support to register bvec buffer automatically Ming Lei
                   ` (4 preceding siblings ...)
  2025-04-28  9:44 ` [RFC PATCH 5/7] ublk: prepare for supporting to register request buffer automatically Ming Lei
@ 2025-04-28  9:44 ` Ming Lei
  2025-04-29  0:52   ` Caleb Sander Mateos
  2025-04-28  9:44 ` [RFC PATCH 7/7] selftests: ublk: support UBLK_F_AUTO_BUF_REG Ming Lei
  6 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2025-04-28  9:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring, Pavel Begunkov, linux-block
  Cc: Uday Shankar, Caleb Sander Mateos, Keith Busch, Ming Lei

Add UBLK_F_AUTO_BUF_REG to support registering the buffer automatically to
the specified io_uring context and buffer index.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c      | 56 ++++++++++++++++++++++++++++-------
 include/uapi/linux/ublk_cmd.h | 38 ++++++++++++++++++++++++
 2 files changed, 84 insertions(+), 10 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 1fd20e481a60..e82618442749 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -66,7 +66,8 @@
 		| UBLK_F_USER_COPY \
 		| UBLK_F_ZONED \
 		| UBLK_F_USER_RECOVERY_FAIL_IO \
-		| UBLK_F_UPDATE_SIZE)
+		| UBLK_F_UPDATE_SIZE \
+		| UBLK_F_AUTO_BUF_REG)
 
 #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
 		| UBLK_F_USER_RECOVERY_REISSUE \
@@ -146,7 +147,10 @@ struct ublk_uring_cmd_pdu {
 
 struct ublk_io {
 	/* userspace buffer address from io cmd */
-	__u64	addr;
+	union {
+		__u64	addr;
+		struct ublk_auto_buf_reg buf;
+	};
 	unsigned int flags;
 	int res;
 
@@ -626,7 +630,7 @@ static inline bool ublk_support_zero_copy(const struct ublk_queue *ubq)
 
 static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq)
 {
-	return false;
+	return ubq->flags & UBLK_F_AUTO_BUF_REG;
 }
 
 static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
@@ -1177,6 +1181,15 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq,
 		blk_mq_end_request(rq, BLK_STS_IOERR);
 }
 
+static inline void ublk_init_auto_buf_reg(const struct ublk_io *io,
+					  struct io_buf_data *data)
+{
+	data->index = io->buf.index;
+	data->ring_fd = io->buf.ring_fd;
+	data->has_fd = true;
+	data->registered_fd = io->buf.flags & UBLK_AUTO_BUF_REGISTERED_RING;
+}
+
 static bool ublk_auto_buf_reg(struct ublk_queue *ubq, struct request *req,
 			      struct ublk_io *io, unsigned int issue_flags)
 {
@@ -1187,6 +1201,9 @@ static bool ublk_auto_buf_reg(struct ublk_queue *ubq, struct request *req,
 	};
 	int ret;
 
+	if (ublk_support_auto_buf_reg(ubq))
+		ublk_init_auto_buf_reg(io, &data);
+
 	/* one extra reference is dropped by ublk_io_release */
 	ublk_init_req_ref(ubq, req, 2);
 	ret = io_buffer_register_bvec(io->cmd, &data, issue_flags);
@@ -2045,7 +2062,7 @@ static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq,
 		 */
 		if (!buf_addr && !ublk_need_get_data(ubq))
 			goto out;
-	} else if (buf_addr) {
+	} else if (buf_addr && !ublk_support_auto_buf_reg(ubq)) {
 		/* User copy requires addr to be unset */
 		ret = -EINVAL;
 		goto out;
@@ -2058,13 +2075,17 @@ static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq,
 	return ret;
 }
 
-static void ublk_auto_buf_unreg(struct ublk_io *io, struct io_uring_cmd *cmd,
+static void ublk_auto_buf_unreg(const struct ublk_queue *ubq,
+				struct ublk_io *io, struct io_uring_cmd *cmd,
 				struct request *req, unsigned int issue_flags)
 {
 	struct io_buf_data data = {
 		.index = req->tag,
 	};
 
+	if (ublk_support_auto_buf_reg(ubq))
+		ublk_init_auto_buf_reg(io, &data);
+
 	WARN_ON_ONCE(io_buffer_unregister_bvec(cmd, &data, issue_flags));
 	io->flags &= ~UBLK_IO_FLAG_AUTO_BUF_REG;
 }
@@ -2088,7 +2109,8 @@ static int ublk_commit_and_fetch(const struct ublk_queue *ubq,
 		if (!ub_cmd->addr && (!ublk_need_get_data(ubq) ||
 					req_op(req) == REQ_OP_READ))
 			return -EINVAL;
-	} else if (req_op(req) != REQ_OP_ZONE_APPEND && ub_cmd->addr) {
+	} else if ((req_op(req) != REQ_OP_ZONE_APPEND &&
+				!ublk_support_auto_buf_reg(ubq)) && ub_cmd->addr) {
 		/*
 		 * User copy requires addr to be unset when command is
 		 * not zone append
@@ -2097,7 +2119,7 @@ static int ublk_commit_and_fetch(const struct ublk_queue *ubq,
 	}
 
 	if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG)
-		ublk_auto_buf_unreg(io, cmd, req, issue_flags);
+		ublk_auto_buf_unreg(ubq, io, cmd, req, issue_flags);
 
 	ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
 
@@ -2788,6 +2810,11 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
 	else if (!(info.flags & UBLK_F_UNPRIVILEGED_DEV))
 		return -EPERM;
 
+	/* F_AUTO_BUF_REG and F_SUPPORT_ZERO_COPY can't co-exist */
+	if ((info.flags & UBLK_F_AUTO_BUF_REG) &&
+			(info.flags & UBLK_F_SUPPORT_ZERO_COPY))
+		return -EINVAL;
+
 	/* forbid nonsense combinations of recovery flags */
 	switch (info.flags & UBLK_F_ALL_RECOVERY_FLAGS) {
 	case 0:
@@ -2817,8 +2844,11 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
 		 * For USER_COPY, we depends on userspace to fill request
 		 * buffer by pwrite() to ublk char device, which can't be
 		 * used for unprivileged device
+		 *
+		 * The same applies to zero copy and auto buffer registration.
 		 */
-		if (info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY))
+		if (info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY |
+					UBLK_F_AUTO_BUF_REG))
 			return -EINVAL;
 	}
 
@@ -2876,17 +2906,22 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
 		UBLK_F_URING_CMD_COMP_IN_TASK;
 
 	/* GET_DATA isn't needed any more with USER_COPY or ZERO COPY */
-	if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY))
+	if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY |
+				UBLK_F_AUTO_BUF_REG))
 		ub->dev_info.flags &= ~UBLK_F_NEED_GET_DATA;
 
 	/*
 	 * Zoned storage support requires reuse `ublksrv_io_cmd->addr` for
 	 * returning write_append_lba, which is only allowed in case of
 	 * user copy or zero copy
+	 *
+	 * UBLK_F_AUTO_BUF_REG can't be enabled for zoned because it need
+	 * the space for getting ring_fd and buffer index.
 	 */
 	if (ublk_dev_is_zoned(ub) &&
 	    (!IS_ENABLED(CONFIG_BLK_DEV_ZONED) || !(ub->dev_info.flags &
-	     (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)))) {
+	     (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)) ||
+	     (ub->dev_info.flags & UBLK_F_AUTO_BUF_REG))) {
 		ret = -EINVAL;
 		goto out_free_dev_number;
 	}
@@ -3403,6 +3438,7 @@ static int __init ublk_init(void)
 
 	BUILD_BUG_ON((u64)UBLKSRV_IO_BUF_OFFSET +
 			UBLKSRV_IO_BUF_TOTAL_SIZE < UBLKSRV_IO_BUF_OFFSET);
+	BUILD_BUG_ON(sizeof(struct ublk_auto_buf_reg) != sizeof(__u64));
 
 	init_waitqueue_head(&ublk_idr_wq);
 
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index be5c6c6b16e0..3d7c8c69cf06 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -219,6 +219,30 @@
  */
 #define UBLK_F_UPDATE_SIZE		 (1ULL << 10)
 
+/*
+ * The request buffer is registered automatically to the io_uring context
+ * specified by the ublk server before this io command is delivered to the
+ * ublk server, and it is unregistered automatically when the io command
+ * is completed.
+ *
+ * To use this feature:
+ *
+ * - the ublk server has to create a sparse buffer table
+ *
+ * - pass the io_uring context FD via `ublksrv_io_cmd.auto_buf.ring_fd`; the
+ *   FD can be a registered io_uring FD if `UBLK_AUTO_BUF_REGISTERED_RING` is
+ *   set in `ublksrv_io_cmd.auto_buf.flags`, or a plain FD
+ *
+ * - pass the buffer index via `ublksrv_io_cmd.auto_buf.index`
+ *
+ * This avoids the extra cost of the two uring_cmds, and also simplifies the
+ * backend implementation: the dependency on IO_REGISTER_IO_BUF and
+ * IO_UNREGISTER_IO_BUF is no longer necessary.
+ *
+ * This feature isn't available for UBLK_F_ZONED devices.
+ */
+#define UBLK_F_AUTO_BUF_REG	(1ULL << 11)
+
 /* device state */
 #define UBLK_S_DEV_DEAD	0
 #define UBLK_S_DEV_LIVE	1
@@ -339,6 +363,14 @@ static inline __u32 ublksrv_get_flags(const struct ublksrv_io_desc *iod)
 	return iod->op_flags >> 8;
 }
 
+struct ublk_auto_buf_reg {
+	__s32  ring_fd;
+	__u16  index;
+#define UBLK_AUTO_BUF_REGISTERED_RING            (1 << 0)
+	__u8   flags;
+	__u8   _pad;
+};
+
 /* issued to ublk driver via /dev/ublkcN */
 struct ublksrv_io_cmd {
 	__u16	q_id;
@@ -363,6 +395,12 @@ struct ublksrv_io_cmd {
 		 */
 		__u64	addr;
 		__u64	zone_append_lba;
+
+		/*
+		 * for the AUTO_BUF_REG feature, F_ZONED can't be supported,
+		 * and ->addr isn't used since zero copy is implied
+		 */
+		struct ublk_auto_buf_reg auto_buf;
 	};
 };
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 7/7] selftests: ublk: support UBLK_F_AUTO_BUF_REG
  2025-04-28  9:44 [RFC PATCH 0/7] ublk: support to register bvec buffer automatically Ming Lei
                   ` (5 preceding siblings ...)
  2025-04-28  9:44 ` [RFC PATCH 6/7] ublk: register buffer to specified io_uring & buf index via UBLK_F_AUTO_BUF_REG Ming Lei
@ 2025-04-28  9:44 ` Ming Lei
  6 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2025-04-28  9:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring, Pavel Begunkov, linux-block
  Cc: Uday Shankar, Caleb Sander Mateos, Keith Busch, Ming Lei

Enable UBLK_F_AUTO_BUF_REG support in the ublk utility via the `--auto_zc`
argument, and support this feature in the null, loop and stripe target code.

Add function test generic_08 to cover the basic UBLK_F_AUTO_BUF_REG feature.

Also cover UBLK_F_AUTO_BUF_REG in the stress_03, stress_04 and stress_05 tests.

'fio/t/io_uring -p0 /dev/ublkb0' shows that F_AUTO_BUF_REG can improve
IOPS by 50% compared with F_SUPPORT_ZERO_COPY in my test VM.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/Makefile         |  2 +
 tools/testing/selftests/ublk/file_backed.c    | 12 ++++--
 tools/testing/selftests/ublk/kublk.c          | 24 ++++++++---
 tools/testing/selftests/ublk/kublk.h          |  6 +++
 tools/testing/selftests/ublk/null.c           | 43 ++++++++++++++-----
 tools/testing/selftests/ublk/stripe.c         | 21 ++++-----
 .../testing/selftests/ublk/test_generic_08.sh | 32 ++++++++++++++
 .../testing/selftests/ublk/test_stress_03.sh  |  6 +++
 .../testing/selftests/ublk/test_stress_04.sh  |  6 +++
 .../testing/selftests/ublk/test_stress_05.sh  |  8 ++++
 10 files changed, 130 insertions(+), 30 deletions(-)
 create mode 100755 tools/testing/selftests/ublk/test_generic_08.sh

diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index f34ac0bac696..8475716f407b 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -11,6 +11,8 @@ TEST_PROGS += test_generic_05.sh
 TEST_PROGS += test_generic_06.sh
 TEST_PROGS += test_generic_07.sh
 
+TEST_PROGS += test_generic_08.sh
+
 TEST_PROGS += test_null_01.sh
 TEST_PROGS += test_null_02.sh
 TEST_PROGS += test_loop_01.sh
diff --git a/tools/testing/selftests/ublk/file_backed.c b/tools/testing/selftests/ublk/file_backed.c
index 6f34eabfae97..9dc00b217a66 100644
--- a/tools/testing/selftests/ublk/file_backed.c
+++ b/tools/testing/selftests/ublk/file_backed.c
@@ -29,19 +29,23 @@ static int loop_queue_flush_io(struct ublk_queue *q, const struct ublksrv_io_des
 static int loop_queue_tgt_rw_io(struct ublk_queue *q, const struct ublksrv_io_desc *iod, int tag)
 {
 	unsigned ublk_op = ublksrv_get_op(iod);
-	int zc = ublk_queue_use_zc(q);
-	enum io_uring_op op = ublk_to_uring_op(iod, zc);
+	unsigned zc = ublk_queue_use_zc(q);
+	unsigned auto_zc = ublk_queue_use_auto_zc(q);
+	enum io_uring_op op = ublk_to_uring_op(iod, zc | auto_zc);
 	struct io_uring_sqe *sqe[3];
+	void *addr = (zc | auto_zc) ? NULL : (void *)iod->addr;
 
-	if (!zc) {
+	if (!zc || auto_zc) {
 		ublk_queue_alloc_sqes(q, sqe, 1);
 		if (!sqe[0])
 			return -ENOMEM;
 
 		io_uring_prep_rw(op, sqe[0], 1 /*fds[1]*/,
-				(void *)iod->addr,
+				addr,
 				iod->nr_sectors << 9,
 				iod->start_sector << 9);
+		if (auto_zc)
+			sqe[0]->buf_index = tag;
 		io_uring_sqe_set_flags(sqe[0], IOSQE_FIXED_FILE);
 		/* bit63 marks us as tgt io */
 		sqe[0]->user_data = build_user_data(tag, ublk_op, 0, 1);
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index 3afd45d7f989..0b8d84db4fb7 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -420,9 +420,12 @@ static int ublk_queue_init(struct ublk_queue *q)
 	q->cmd_inflight = 0;
 	q->tid = gettid();
 
-	if (dev->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY) {
+	if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_AUTO_BUF_REG)) {
 		q->state |= UBLKSRV_NO_BUF;
-		q->state |= UBLKSRV_ZC;
+		if (dev->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY)
+			q->state |= UBLKSRV_ZC;
+		if (dev->dev_info.flags & UBLK_F_AUTO_BUF_REG)
+			q->state |= UBLKSRV_AUTO_BUF_REG;
 	}
 
 	cmd_buf_size = ublk_queue_cmd_buf_sz(q);
@@ -461,7 +464,7 @@ static int ublk_queue_init(struct ublk_queue *q)
 		goto fail;
 	}
 
-	if (dev->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY) {
+	if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_AUTO_BUF_REG)) {
 		ret = io_uring_register_buffers_sparse(&q->ring, q->q_depth);
 		if (ret) {
 			ublk_err("ublk dev %d queue %d register spare buffers failed %d",
@@ -579,6 +582,12 @@ int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag)
 	else
 		cmd->addr	= 0;
 
+	if (q->state & UBLKSRV_AUTO_BUF_REG) {
+		cmd->auto_buf.index = tag;
+		cmd->auto_buf.ring_fd = q->ring.enter_ring_fd;
+		cmd->auto_buf.flags = UBLK_AUTO_BUF_REGISTERED_RING;
+	}
+
 	user_data = build_user_data(tag, _IOC_NR(cmd_op), 0, 0);
 	io_uring_sqe_set_data64(sqe[0], user_data);
 
@@ -586,8 +595,9 @@ int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag)
 
 	q->cmd_inflight += 1;
 
-	ublk_dbg(UBLK_DBG_IO_CMD, "%s: (qid %d tag %u cmd_op %u) iof %x stopping %d\n",
+	ublk_dbg(UBLK_DBG_IO_CMD, "%s: (qid %d tag %u cmd_op %u) ring_fd %d iof %x stopping %d\n",
 			__func__, q->q_id, tag, cmd_op,
+			cmd->auto_buf.ring_fd,
 			io->flags, !!(q->state & UBLKSRV_QUEUE_STOPPING));
 	return 1;
 }
@@ -1206,6 +1216,7 @@ static int cmd_dev_get_features(void)
 		[const_ilog2(UBLK_F_USER_COPY)] = "USER_COPY",
 		[const_ilog2(UBLK_F_ZONED)] = "ZONED",
 		[const_ilog2(UBLK_F_USER_RECOVERY_FAIL_IO)] = "RECOVERY_FAIL_IO",
+		[const_ilog2(UBLK_F_AUTO_BUF_REG)] = "AUTO_BUF_REG",
 	};
 	struct ublk_dev *dev;
 	__u64 features = 0;
@@ -1245,7 +1256,7 @@ static void __cmd_create_help(char *exe, bool recovery)
 
 	printf("%s %s -t [null|loop|stripe|fault_inject] [-q nr_queues] [-d depth] [-n dev_id]\n",
 			exe, recovery ? "recover" : "add");
-	printf("\t[--foreground] [--quiet] [-z] [--debug_mask mask] [-r 0|1 ] [-g 0|1]\n");
+	printf("\t[--foreground] [--quiet] [-z] [--auto_zc] [--debug_mask mask] [-r 0|1 ] [-g 0|1]\n");
 	printf("\t[-e 0|1 ] [-i 0|1]\n");
 	printf("\t[target options] [backfile1] [backfile2] ...\n");
 	printf("\tdefault: nr_queues=2(max 32), depth=128(max 1024), dev_id=-1(auto allocation)\n");
@@ -1300,6 +1311,7 @@ int main(int argc, char *argv[])
 		{ "recovery_fail_io",	1,	NULL, 'e'},
 		{ "recovery_reissue",	1,	NULL, 'i'},
 		{ "get_data",		1,	NULL, 'g'},
+		{ "auto_zc",		0,	NULL,  0},
 		{ 0, 0, 0, 0 }
 	};
 	const struct ublk_tgt_ops *ops = NULL;
@@ -1368,6 +1380,8 @@ int main(int argc, char *argv[])
 				ublk_dbg_mask = 0;
 			if (!strcmp(longopts[option_idx].name, "foreground"))
 				ctx.fg = 1;
+			if (!strcmp(longopts[option_idx].name, "auto_zc"))
+				ctx.flags |= UBLK_F_AUTO_BUF_REG;
 			break;
 		case '?':
 			/*
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 44ee1e4ac55b..f07677555526 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -169,6 +169,7 @@ struct ublk_queue {
 #define UBLKSRV_QUEUE_IDLE	(1U << 1)
 #define UBLKSRV_NO_BUF		(1U << 2)
 #define UBLKSRV_ZC		(1U << 3)
+#define UBLKSRV_AUTO_BUF_REG		(1U << 4)
 	unsigned state;
 	pid_t tid;
 	pthread_t thread;
@@ -388,6 +389,11 @@ static inline int ublk_queue_use_zc(const struct ublk_queue *q)
 	return q->state & UBLKSRV_ZC;
 }
 
+static inline int ublk_queue_use_auto_zc(const struct ublk_queue *q)
+{
+	return q->state & UBLKSRV_AUTO_BUF_REG;
+}
+
 extern const struct ublk_tgt_ops null_tgt_ops;
 extern const struct ublk_tgt_ops loop_tgt_ops;
 extern const struct ublk_tgt_ops stripe_tgt_ops;
diff --git a/tools/testing/selftests/ublk/null.c b/tools/testing/selftests/ublk/null.c
index 91fec3690d4b..1362dd422c6e 100644
--- a/tools/testing/selftests/ublk/null.c
+++ b/tools/testing/selftests/ublk/null.c
@@ -42,10 +42,22 @@ static int ublk_null_tgt_init(const struct dev_ctx *ctx, struct ublk_dev *dev)
 	return 0;
 }
 
+static void __setup_nop_io(int tag, const struct ublksrv_io_desc *iod,
+		struct io_uring_sqe *sqe)
+{
+	unsigned ublk_op = ublksrv_get_op(iod);
+
+	io_uring_prep_nop(sqe);
+	sqe->buf_index = tag;
+	sqe->flags |= IOSQE_FIXED_FILE;
+	sqe->rw_flags = IORING_NOP_FIXED_BUFFER | IORING_NOP_INJECT_RESULT;
+	sqe->len = iod->nr_sectors << 9; 	/* injected result */
+	sqe->user_data = build_user_data(tag, ublk_op, 0, 1);
+}
+
 static int null_queue_zc_io(struct ublk_queue *q, int tag)
 {
 	const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
-	unsigned ublk_op = ublksrv_get_op(iod);
 	struct io_uring_sqe *sqe[3];
 
 	ublk_queue_alloc_sqes(q, sqe, 3);
@@ -55,12 +67,8 @@ static int null_queue_zc_io(struct ublk_queue *q, int tag)
 			ublk_cmd_op_nr(sqe[0]->cmd_op), 0, 1);
 	sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
 
-	io_uring_prep_nop(sqe[1]);
-	sqe[1]->buf_index = tag;
-	sqe[1]->flags |= IOSQE_FIXED_FILE | IOSQE_IO_HARDLINK;
-	sqe[1]->rw_flags = IORING_NOP_FIXED_BUFFER | IORING_NOP_INJECT_RESULT;
-	sqe[1]->len = iod->nr_sectors << 9; 	/* injected result */
-	sqe[1]->user_data = build_user_data(tag, ublk_op, 0, 1);
+	__setup_nop_io(tag, iod, sqe[1]);
+	sqe[1]->flags |= IOSQE_IO_HARDLINK;
 
 	io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, tag);
 	sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, 1);
@@ -69,6 +77,16 @@ static int null_queue_zc_io(struct ublk_queue *q, int tag)
 	return 2;
 }
 
+static int null_queue_auto_zc_io(struct ublk_queue *q, int tag)
+{
+	const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
+	struct io_uring_sqe *sqe[1];
+
+	ublk_queue_alloc_sqes(q, sqe, 1);
+	__setup_nop_io(tag, iod, sqe[0]);
+	return 1;
+}
+
 static void ublk_null_io_done(struct ublk_queue *q, int tag,
 		const struct io_uring_cqe *cqe)
 {
@@ -94,15 +112,18 @@ static void ublk_null_io_done(struct ublk_queue *q, int tag,
 static int ublk_null_queue_io(struct ublk_queue *q, int tag)
 {
 	const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
-	int zc = ublk_queue_use_zc(q);
+	unsigned auto_zc = ublk_queue_use_auto_zc(q);
+	unsigned zc = ublk_queue_use_zc(q);
 	int queued;
 
-	if (!zc) {
+	if (auto_zc)
+		queued = null_queue_auto_zc_io(q, tag);
+	else if (zc)
+		queued = null_queue_zc_io(q, tag);
+	else {
 		ublk_complete_io(q, tag, iod->nr_sectors << 9);
 		return 0;
 	}
-
-	queued = null_queue_zc_io(q, tag);
 	ublk_queued_tgt_io(q, tag, queued);
 	return 0;
 }
diff --git a/tools/testing/selftests/ublk/stripe.c b/tools/testing/selftests/ublk/stripe.c
index 5dbd6392d83d..8fd8faeb5e76 100644
--- a/tools/testing/selftests/ublk/stripe.c
+++ b/tools/testing/selftests/ublk/stripe.c
@@ -70,7 +70,7 @@ static void free_stripe_array(struct stripe_array *s)
 }
 
 static void calculate_stripe_array(const struct stripe_conf *conf,
-		const struct ublksrv_io_desc *iod, struct stripe_array *s)
+		const struct ublksrv_io_desc *iod, struct stripe_array *s, void *base)
 {
 	const unsigned shift = conf->shift - 9;
 	const unsigned chunk_sects = 1 << shift;
@@ -102,7 +102,7 @@ static void calculate_stripe_array(const struct stripe_conf *conf,
 		}
 
 		assert(this->nr_vec < this->cap);
-		this->vec[this->nr_vec].iov_base = (void *)(iod->addr + done);
+		this->vec[this->nr_vec].iov_base = (void *)(base + done);
 		this->vec[this->nr_vec++].iov_len = nr_sects << 9;
 
 		start += nr_sects;
@@ -126,15 +126,17 @@ static inline enum io_uring_op stripe_to_uring_op(
 static int stripe_queue_tgt_rw_io(struct ublk_queue *q, const struct ublksrv_io_desc *iod, int tag)
 {
 	const struct stripe_conf *conf = get_chunk_shift(q);
-	int zc = !!(ublk_queue_use_zc(q) != 0);
-	enum io_uring_op op = stripe_to_uring_op(iod, zc);
+	unsigned auto_zc = (ublk_queue_use_auto_zc(q) != 0);
+	unsigned zc = (ublk_queue_use_zc(q) != 0);
+	enum io_uring_op op = stripe_to_uring_op(iod, zc | auto_zc);
 	struct io_uring_sqe *sqe[NR_STRIPE];
 	struct stripe_array *s = alloc_stripe_array(conf, iod);
 	struct ublk_io *io = ublk_get_io(q, tag);
 	int i, extra = zc ? 2 : 0;
+	void *base = (zc | auto_zc) ? NULL : (void *)iod->addr;
 
 	io->private_data = s;
-	calculate_stripe_array(conf, iod, s);
+	calculate_stripe_array(conf, iod, s, base);
 
 	ublk_queue_alloc_sqes(q, sqe, s->nr + extra);
 
@@ -153,12 +155,11 @@ static int stripe_queue_tgt_rw_io(struct ublk_queue *q, const struct ublksrv_io_
 				(void *)t->vec,
 				t->nr_vec,
 				t->start << 9);
-		if (zc) {
+		io_uring_sqe_set_flags(sqe[i], IOSQE_FIXED_FILE);
+		if (auto_zc || zc) {
 			sqe[i]->buf_index = tag;
-			io_uring_sqe_set_flags(sqe[i],
-					IOSQE_FIXED_FILE | IOSQE_IO_HARDLINK);
-		} else {
-			io_uring_sqe_set_flags(sqe[i], IOSQE_FIXED_FILE);
+			if (zc)
+				sqe[i]->flags |= IOSQE_IO_HARDLINK;
 		}
 		/* bit63 marks us as tgt io */
 		sqe[i]->user_data = build_user_data(tag, ublksrv_get_op(iod), i - zc, 1);
diff --git a/tools/testing/selftests/ublk/test_generic_08.sh b/tools/testing/selftests/ublk/test_generic_08.sh
new file mode 100755
index 000000000000..b222f3a77e12
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_generic_08.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_08"
+ERR_CODE=0
+
+if ! _have_feature "AUTO_BUF_REG"; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "generic" "test UBLK_F_AUTO_BUF_REG"
+
+_create_backfile 0 256M
+_create_backfile 1 256M
+
+dev_id=$(_add_ublk_dev -t loop -q 2 --auto_zc "${UBLK_BACKFILES[0]}")
+_check_add_dev $TID $?
+
+if ! _mkfs_mount_test /dev/ublkb"${dev_id}"; then
+	_cleanup_test "generic"
+	_show_result $TID 255
+fi
+
+dev_id=$(_add_ublk_dev -t stripe --auto_zc "${UBLK_BACKFILES[0]}" "${UBLK_BACKFILES[1]}")
+_check_add_dev $TID $?
+_mkfs_mount_test /dev/ublkb"${dev_id}"
+ERR_CODE=$?
+
+_cleanup_test "generic"
+_show_result $TID $ERR_CODE
diff --git a/tools/testing/selftests/ublk/test_stress_03.sh b/tools/testing/selftests/ublk/test_stress_03.sh
index e0854f71d35b..b5a5520dcae6 100755
--- a/tools/testing/selftests/ublk/test_stress_03.sh
+++ b/tools/testing/selftests/ublk/test_stress_03.sh
@@ -32,6 +32,12 @@ _create_backfile 2 128M
 ublk_io_and_remove 8G -t null -q 4 -z &
 ublk_io_and_remove 256M -t loop -q 4 -z "${UBLK_BACKFILES[0]}" &
 ublk_io_and_remove 256M -t stripe -q 4 -z "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
+
+if _have_feature "AUTO_BUF_REG"; then
+	ublk_io_and_remove 8G -t null -q 4 --auto_zc &
+	ublk_io_and_remove 256M -t loop -q 4 --auto_zc "${UBLK_BACKFILES[0]}" &
+	ublk_io_and_remove 256M -t stripe -q 4 --auto_zc "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
+fi
 wait
 
 _cleanup_test "stress"
diff --git a/tools/testing/selftests/ublk/test_stress_04.sh b/tools/testing/selftests/ublk/test_stress_04.sh
index 1798a98387e8..5b49a8025002 100755
--- a/tools/testing/selftests/ublk/test_stress_04.sh
+++ b/tools/testing/selftests/ublk/test_stress_04.sh
@@ -31,6 +31,12 @@ _create_backfile 2 128M
 ublk_io_and_kill_daemon 8G -t null -q 4 -z &
 ublk_io_and_kill_daemon 256M -t loop -q 4 -z "${UBLK_BACKFILES[0]}" &
 ublk_io_and_kill_daemon 256M -t stripe -q 4 -z "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
+
+if _have_feature "AUTO_BUF_REG"; then
+	ublk_io_and_kill_daemon 8G -t null -q 4 --auto_zc &
+	ublk_io_and_kill_daemon 256M -t loop -q 4 --auto_zc "${UBLK_BACKFILES[0]}" &
+	ublk_io_and_kill_daemon 256M -t stripe -q 4 --auto_zc "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
+fi
 wait
 
 _cleanup_test "stress"
diff --git a/tools/testing/selftests/ublk/test_stress_05.sh b/tools/testing/selftests/ublk/test_stress_05.sh
index 88601b48f1cd..6f758f6070a5 100755
--- a/tools/testing/selftests/ublk/test_stress_05.sh
+++ b/tools/testing/selftests/ublk/test_stress_05.sh
@@ -60,5 +60,13 @@ if _have_feature "ZERO_COPY"; then
 	done
 fi
 
+if _have_feature "AUTO_BUF_REG"; then
+	for reissue in $(seq 0 1); do
+		ublk_io_and_remove 8G -t null -q 4 -g --auto_zc -r 1 -i "$reissue" &
+		ublk_io_and_remove 256M -t loop -q 4 -g --auto_zc -r 1 -i "$reissue" "${UBLK_BACKFILES[1]}" &
+		wait
+	done
+fi
+
 _cleanup_test "stress"
 _show_result $TID $ERR_CODE
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-04-28  9:44 ` [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring Ming Lei
@ 2025-04-28 10:28   ` Pavel Begunkov
  2025-04-29  0:46     ` Caleb Sander Mateos
  2025-04-29  0:47     ` Ming Lei
  2025-04-29  0:43   ` Caleb Sander Mateos
  1 sibling, 2 replies; 29+ messages in thread
From: Pavel Begunkov @ 2025-04-28 10:28 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, io-uring, linux-block
  Cc: Uday Shankar, Caleb Sander Mateos, Keith Busch

On 4/28/25 10:44, Ming Lei wrote:
> Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() to
> support registering/unregistering a bvec buffer to a specified io_uring,
> whose FD is usually passed from userspace.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   include/linux/io_uring/cmd.h |  4 ++
>   io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
>   2 files changed, 67 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> index 78fa336a284b..7516fe5cd606 100644
> --- a/include/linux/io_uring/cmd.h
> +++ b/include/linux/io_uring/cmd.h
> @@ -25,6 +25,10 @@ struct io_uring_cmd_data {
...
>   	io_ring_submit_lock(ctx, issue_flags);
> -	ret = __io_buffer_unregister_bvec(ctx, buf);
> +	if (reg)
> +		ret = __io_buffer_register_bvec(ctx, buf);
> +	else
> +		ret = __io_buffer_unregister_bvec(ctx, buf);
>   	io_ring_submit_unlock(ctx, issue_flags);
>   
>   	return ret;
>   }
> +
> +static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
> +				    struct io_buf_data *buf,
> +				    unsigned int issue_flags,
> +				    bool reg)
> +{
> +	struct io_ring_ctx *remote_ctx = ctx;
> +	struct file *file = NULL;
> +	int ret;
> +
> +	if (buf->has_fd) {
> +		file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);

io_uring_register_get_file() accesses task private data and the request
doesn't control from which task it's executed. IOW, you can't use the
helper here. It can be iowq or sqpoll, but either way nothing is
promised.

> +		if (IS_ERR(file))
> +			return PTR_ERR(file);
> +		remote_ctx = file->private_data;
> +		if (!remote_ctx)
> +			return -EINVAL;

nit: this check is not needed.

> +	}
> +
> +	if (remote_ctx == ctx) {
> +		do_reg_unreg_bvec(ctx, buf, issue_flags, reg);
> +	} else {
> +		if (!(issue_flags & IO_URING_F_UNLOCKED))
> +			mutex_unlock(&ctx->uring_lock);

We shouldn't be dropping the lock in random helpers, for example
it'd be pretty nasty suspending a submission loop with a submission
from another task.

You can try lock first, if fails it'll need a fresh context via
iowq to be task-work'ed into the ring. see msg_ring.c for how
it's done for files.

> +
> +		do_reg_unreg_bvec(remote_ctx, buf, IO_URING_F_UNLOCKED, reg);
> +
> +		if (!(issue_flags & IO_URING_F_UNLOCKED))
> +			mutex_lock(&ctx->uring_lock);
> +	}
> +
> +	if (file)
> +		fput(file);
> +
> +	return ret;
> +}
-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 4/7] ublk: convert to refcount_t
  2025-04-28  9:44 ` [RFC PATCH 4/7] ublk: convert to refcount_t Ming Lei
@ 2025-04-28 17:13   ` Caleb Sander Mateos
  0 siblings, 0 replies; 29+ messages in thread
From: Caleb Sander Mateos @ 2025-04-28 17:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Mon, Apr 28, 2025 at 2:45 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Convert to refcount_t in preparation for registering the bvec buffer
> automatically, which needs to initialize the reference counter to 2,
> an interface kref doesn't provide.
>
> Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c | 19 +++++--------------
>  1 file changed, 5 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index ac56482b55f5..9cd331d12fa6 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -79,7 +79,7 @@
>          UBLK_PARAM_TYPE_DMA_ALIGN | UBLK_PARAM_TYPE_SEGMENT)
>
>  struct ublk_rq_data {
> -       struct kref ref;
> +       refcount_t ref;
>  };
>
>  struct ublk_uring_cmd_pdu {
> @@ -484,7 +484,6 @@ static blk_status_t ublk_setup_iod_zoned(struct ublk_queue *ubq,
>  #endif
>
>  static inline void __ublk_complete_rq(struct request *req);
> -static void ublk_complete_rq(struct kref *ref);
>
>  static dev_t ublk_chr_devt;
>  static const struct class ublk_chr_class = {
> @@ -644,7 +643,7 @@ static inline void ublk_init_req_ref(const struct ublk_queue *ubq,
>         if (ublk_need_req_ref(ubq)) {
>                 struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
>
> -               kref_init(&data->ref);
> +               refcount_set(&data->ref, 1);
>         }
>  }
>
> @@ -654,7 +653,7 @@ static inline bool ublk_get_req_ref(const struct ublk_queue *ubq,
>         if (ublk_need_req_ref(ubq)) {
>                 struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
>
> -               return kref_get_unless_zero(&data->ref);
> +               return refcount_inc_not_zero(&data->ref);
>         }
>
>         return true;
> @@ -666,7 +665,8 @@ static inline void ublk_put_req_ref(const struct ublk_queue *ubq,
>         if (ublk_need_req_ref(ubq)) {
>                 struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
>
> -               kref_put(&data->ref, ublk_complete_rq);
> +               if(refcount_dec_and_test(&data->ref))

nit: missing space after if

Other than that,

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

> +                       __ublk_complete_rq(req);
>         } else {
>                 __ublk_complete_rq(req);
>         }
> @@ -1124,15 +1124,6 @@ static inline void __ublk_complete_rq(struct request *req)
>         blk_mq_end_request(req, res);
>  }
>
> -static void ublk_complete_rq(struct kref *ref)
> -{
> -       struct ublk_rq_data *data = container_of(ref, struct ublk_rq_data,
> -                       ref);
> -       struct request *req = blk_mq_rq_from_pdu(data);
> -
> -       __ublk_complete_rq(req);
> -}
> -
>  static void ublk_complete_io_cmd(struct ublk_io *io, struct request *req,
>                                  int res, unsigned issue_flags)
>  {
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/7] io_uring: add 'struct io_buf_data' for register/unregister bvec buffer
  2025-04-28  9:44 ` [RFC PATCH 1/7] io_uring: add 'struct io_buf_data' for register/unregister bvec buffer Ming Lei
@ 2025-04-29  0:35   ` Caleb Sander Mateos
  0 siblings, 0 replies; 29+ messages in thread
From: Caleb Sander Mateos @ 2025-04-29  0:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Mon, Apr 28, 2025 at 2:44 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add 'struct io_buf_data' for registering/unregistering a bvec buffer,
> and prepare for supporting registration of a buffer into a specified
> io_uring context by its FD.
>
> No functional change.

I'm not a big fan of this change. Passing the arguments by reference
instead of in registers makes io_buffer_register_bvec() more expensive
to call. And it's a strange API for io_buffer_unregister_bvec() to
take a full struct io_buf_data but only use the index field.

Best,
Caleb


>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c     | 13 ++++++++++---
>  include/linux/io_uring/cmd.h | 11 ++++++++---
>  io_uring/rsrc.c              | 12 ++++++++----
>  3 files changed, 26 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 0d82014679f8..ac56482b55f5 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1925,6 +1925,10 @@ static int ublk_register_io_buf(struct io_uring_cmd *cmd,
>  {
>         struct ublk_device *ub = cmd->file->private_data;
>         const struct ublk_io *io = &ubq->ios[tag];
> +       struct io_buf_data data = {
> +               .index = index,
> +               .release = ublk_io_release,
> +       };
>         struct request *req;
>         int ret;
>
> @@ -1938,8 +1942,8 @@ static int ublk_register_io_buf(struct io_uring_cmd *cmd,
>         if (!req)
>                 return -EINVAL;
>
> -       ret = io_buffer_register_bvec(cmd, req, ublk_io_release, index,
> -                                     issue_flags);
> +       data.rq = req;
> +       ret = io_buffer_register_bvec(cmd, &data, issue_flags);
>         if (ret) {
>                 ublk_put_req_ref(ubq, req);
>                 return ret;
> @@ -1953,6 +1957,9 @@ static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
>                                   unsigned int index, unsigned int issue_flags)
>  {
>         const struct ublk_io *io = &ubq->ios[tag];
> +       struct io_buf_data data = {
> +               .index = index,
> +       };
>
>         if (!ublk_support_zero_copy(ubq))
>                 return -EINVAL;
> @@ -1960,7 +1967,7 @@ static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
>         if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
>                 return -EINVAL;
>
> -       return io_buffer_unregister_bvec(cmd, index, issue_flags);
> +       return io_buffer_unregister_bvec(cmd, &data, issue_flags);
>  }
>
>  static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq,
> diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> index 0634a3de1782..78fa336a284b 100644
> --- a/include/linux/io_uring/cmd.h
> +++ b/include/linux/io_uring/cmd.h
> @@ -23,6 +23,12 @@ struct io_uring_cmd_data {
>         void                    *op_data;
>  };
>
> +struct io_buf_data {
> +       unsigned short index;
> +       struct request *rq;
> +       void (*release)(void *);
> +};
> +
>  static inline const void *io_uring_sqe_cmd(const struct io_uring_sqe *sqe)
>  {
>         return sqe->cmd;
> @@ -140,10 +146,9 @@ static inline struct io_uring_cmd_data *io_uring_cmd_get_async_data(struct io_ur
>         return cmd_to_io_kiocb(cmd)->async_data;
>  }
>
> -int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
> -                           void (*release)(void *), unsigned int index,
> +int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct io_buf_data *data,
>                             unsigned int issue_flags);
> -int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
> +int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, struct io_buf_data *data,
>                               unsigned int issue_flags);
>
>  #endif /* _LINUX_IO_URING_CMD_H */
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index b4c5f3ee8855..66d2c11e2f46 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -918,12 +918,14 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
>         return ret;
>  }
>
> -int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
> -                           void (*release)(void *), unsigned int index,
> +int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> +                           struct io_buf_data *buf,
>                             unsigned int issue_flags)
>  {
>         struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
>         struct io_rsrc_data *data = &ctx->buf_table;
> +       unsigned int index = buf->index;
> +       struct request *rq = buf->rq;
>         struct req_iterator rq_iter;
>         struct io_mapped_ubuf *imu;
>         struct io_rsrc_node *node;
> @@ -963,7 +965,7 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
>         imu->folio_shift = PAGE_SHIFT;
>         imu->nr_bvecs = nr_bvecs;
>         refcount_set(&imu->refs, 1);
> -       imu->release = release;
> +       imu->release = buf->release;
>         imu->priv = rq;
>         imu->is_kbuf = true;
>         imu->dir = 1 << rq_data_dir(rq);
> @@ -980,11 +982,13 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
>  }
>  EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
>
> -int io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
> +int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
> +                             struct io_buf_data *buf,
>                               unsigned int issue_flags)
>  {
>         struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
>         struct io_rsrc_data *data = &ctx->buf_table;
> +       unsigned index = buf->index;
>         struct io_rsrc_node *node;
>         int ret = 0;
>
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/7] io_uring: add helper __io_buffer_[un]register_bvec
  2025-04-28  9:44 ` [RFC PATCH 2/7] io_uring: add helper __io_buffer_[un]register_bvec Ming Lei
@ 2025-04-29  0:36   ` Caleb Sander Mateos
  0 siblings, 0 replies; 29+ messages in thread
From: Caleb Sander Mateos @ 2025-04-29  0:36 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Mon, Apr 28, 2025 at 2:44 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add helpers __io_buffer_[un]register_bvec() in preparation for
> supporting registration of a bvec buffer into a specified io_uring and
> buffer index.
>
> No functional change.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  io_uring/rsrc.c | 88 ++++++++++++++++++++++++++-----------------------
>  1 file changed, 46 insertions(+), 42 deletions(-)
>
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 66d2c11e2f46..5f8ab130a573 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -918,11 +918,9 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
>         return ret;
>  }
>
> -int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> -                           struct io_buf_data *buf,
> -                           unsigned int issue_flags)
> +static int __io_buffer_register_bvec(struct io_ring_ctx *ctx,
> +                                    struct io_buf_data *buf)

__must_hold(&ctx->uring_lock) ? Same comment about
__io_buffer_unregister_bvec().

Best,
Caleb

>  {
> -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
>         struct io_rsrc_data *data = &ctx->buf_table;
>         unsigned int index = buf->index;
>         struct request *rq = buf->rq;
> @@ -931,32 +929,23 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd,
>         struct io_rsrc_node *node;
>         struct bio_vec bv, *bvec;
>         u16 nr_bvecs;
> -       int ret = 0;
>
> -       io_ring_submit_lock(ctx, issue_flags);
> -       if (index >= data->nr) {
> -               ret = -EINVAL;
> -               goto unlock;
> -       }
> -       index = array_index_nospec(index, data->nr);
> +       if (index >= data->nr)
> +               return -EINVAL;
>
> -       if (data->nodes[index]) {
> -               ret = -EBUSY;
> -               goto unlock;
> -       }
> +       index = array_index_nospec(index, data->nr);
> +       if (data->nodes[index])
> +               return -EBUSY;
>
>         node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
> -       if (!node) {
> -               ret = -ENOMEM;
> -               goto unlock;
> -       }
> +       if (!node)
> +               return -ENOMEM;
>
>         nr_bvecs = blk_rq_nr_phys_segments(rq);
>         imu = io_alloc_imu(ctx, nr_bvecs);
>         if (!imu) {
>                 kfree(node);
> -               ret = -ENOMEM;
> -               goto unlock;
> +               return -ENOMEM;
>         }
>
>         imu->ubuf = 0;
> @@ -976,43 +965,58 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd,
>
>         node->buf = imu;
>         data->nodes[index] = node;
> -unlock:
> +
> +       return 0;
> +}
> +
> +int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> +                           struct io_buf_data *buf,
> +                           unsigned int issue_flags)
> +{
> +       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> +       int ret;
> +
> +       io_ring_submit_lock(ctx, issue_flags);
> +       ret = __io_buffer_register_bvec(ctx, buf);
>         io_ring_submit_unlock(ctx, issue_flags);
> +
>         return ret;
>  }
>  EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
>
> -int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
> -                             struct io_buf_data *buf,
> -                             unsigned int issue_flags)
> +static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> +                                      struct io_buf_data *buf)
>  {
> -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
>         struct io_rsrc_data *data = &ctx->buf_table;
>         unsigned index = buf->index;
>         struct io_rsrc_node *node;
> -       int ret = 0;
>
> -       io_ring_submit_lock(ctx, issue_flags);
> -       if (index >= data->nr) {
> -               ret = -EINVAL;
> -               goto unlock;
> -       }
> -       index = array_index_nospec(index, data->nr);
> +       if (index >= data->nr)
> +               return -EINVAL;
>
> +       index = array_index_nospec(index, data->nr);
>         node = data->nodes[index];
> -       if (!node) {
> -               ret = -EINVAL;
> -               goto unlock;
> -       }
> -       if (!node->buf->is_kbuf) {
> -               ret = -EBUSY;
> -               goto unlock;
> -       }
> +       if (!node)
> +               return -EINVAL;
> +       if (!node->buf->is_kbuf)
> +               return -EBUSY;
>
>         io_put_rsrc_node(ctx, node);
>         data->nodes[index] = NULL;
> -unlock:
> +       return 0;
> +}
> +
> +int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
> +                             struct io_buf_data *buf,
> +                             unsigned int issue_flags)
> +{
> +       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> +       int ret;
> +
> +       io_ring_submit_lock(ctx, issue_flags);
> +       ret = __io_buffer_unregister_bvec(ctx, buf);
>         io_ring_submit_unlock(ctx, issue_flags);
> +
>         return ret;
>  }
>  EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
> --
> 2.47.0
>


* Re: [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-04-28  9:44 ` [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring Ming Lei
  2025-04-28 10:28   ` Pavel Begunkov
@ 2025-04-29  0:43   ` Caleb Sander Mateos
  2025-04-30 15:34     ` Ming Lei
  1 sibling, 1 reply; 29+ messages in thread
From: Caleb Sander Mateos @ 2025-04-29  0:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Mon, Apr 28, 2025 at 2:44 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() to
> support registering/unregistering a bvec buffer with a specified
> io_uring, whose FD is usually passed from userspace.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  include/linux/io_uring/cmd.h |  4 ++
>  io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
>  2 files changed, 67 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> index 78fa336a284b..7516fe5cd606 100644
> --- a/include/linux/io_uring/cmd.h
> +++ b/include/linux/io_uring/cmd.h
> @@ -25,6 +25,10 @@ struct io_uring_cmd_data {
>
>  struct io_buf_data {
>         unsigned short index;
> +       bool has_fd;
> +       bool registered_fd;
> +
> +       int ring_fd;
>         struct request *rq;
>         void (*release)(void *);
>  };
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 5f8ab130a573..701dd33fecf7 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -969,21 +969,6 @@ static int __io_buffer_register_bvec(struct io_ring_ctx *ctx,
>         return 0;
>  }
>
> -int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> -                           struct io_buf_data *buf,
> -                           unsigned int issue_flags)
> -{
> -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> -       int ret;
> -
> -       io_ring_submit_lock(ctx, issue_flags);
> -       ret = __io_buffer_register_bvec(ctx, buf);
> -       io_ring_submit_unlock(ctx, issue_flags);
> -
> -       return ret;
> -}
> -EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> -
>  static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
>                                        struct io_buf_data *buf)
>  {
> @@ -1006,19 +991,77 @@ static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
>         return 0;
>  }
>
> -int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
> -                             struct io_buf_data *buf,
> -                             unsigned int issue_flags)
> +static inline int do_reg_unreg_bvec(struct io_ring_ctx *ctx,
> +                                   struct io_buf_data *buf,
> +                                   unsigned int issue_flags,
> +                                   bool reg)
>  {
> -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
>         int ret;
>
>         io_ring_submit_lock(ctx, issue_flags);
> -       ret = __io_buffer_unregister_bvec(ctx, buf);
> +       if (reg)
> +               ret = __io_buffer_register_bvec(ctx, buf);
> +       else
> +               ret = __io_buffer_unregister_bvec(ctx, buf);

It feels like unifying __io_buffer_register_bvec() and
__io_buffer_unregister_bvec() would belong better in the prior patch
that changes their signatures.

>         io_ring_submit_unlock(ctx, issue_flags);
>
>         return ret;
>  }
> +
> +static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
> +                                   struct io_buf_data *buf,
> +                                   unsigned int issue_flags,
> +                                   bool reg)
> +{
> +       struct io_ring_ctx *remote_ctx = ctx;
> +       struct file *file = NULL;
> +       int ret;
> +
> +       if (buf->has_fd) {
> +               file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);
> +               if (IS_ERR(file))
> +                       return PTR_ERR(file);

It would be good to avoid the overhead of this lookup and
reference-counting in the I/O path. Would it be possible to move this
lookup to when UBLK_IO_FETCH_REQ (and UBLK_IO_COMMIT_AND_FETCH_REQ, if
it specifies a different ring_fd) is submitted? I guess that might
require storing an extra io_ring_ctx pointer in struct ublk_io.

> +               remote_ctx = file->private_data;
> +               if (!remote_ctx)
> +                       return -EINVAL;
> +       }
> +
> +       if (remote_ctx == ctx) {
> +               ret = do_reg_unreg_bvec(ctx, buf, issue_flags, reg);
> +       } else {
> +               if (!(issue_flags & IO_URING_F_UNLOCKED))
> +                       mutex_unlock(&ctx->uring_lock);
> +
> +               ret = do_reg_unreg_bvec(remote_ctx, buf, IO_URING_F_UNLOCKED, reg);
> +
> +               if (!(issue_flags & IO_URING_F_UNLOCKED))
> +                       mutex_lock(&ctx->uring_lock);
> +       }
> +
> +       if (file)
> +               fput(file);
> +
> +       return ret;
> +}
> +
> +int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> +                           struct io_buf_data *buf,
> +                           unsigned int issue_flags)

If buf->has_fd is set, this struct io_uring_cmd *cmd is unused. Could
you define separate functions that take a struct io_uring_cmd * vs. a
ring_fd?

Best,
Caleb


> +{
> +       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> +
> +       return io_buffer_reg_unreg_bvec(ctx, buf, issue_flags, true);
> +}
> +EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> +
> +int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
> +                             struct io_buf_data *buf,
> +                             unsigned int issue_flags)
> +{
> +       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> +
> +       return io_buffer_reg_unreg_bvec(ctx, buf, issue_flags, false);
> +}
>  EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
>
>  static int validate_fixed_range(u64 buf_addr, size_t len,
> --
> 2.47.0
>


* Re: [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-04-28 10:28   ` Pavel Begunkov
@ 2025-04-29  0:46     ` Caleb Sander Mateos
  2025-04-29  0:47     ` Ming Lei
  1 sibling, 0 replies; 29+ messages in thread
From: Caleb Sander Mateos @ 2025-04-29  0:46 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Ming Lei, Jens Axboe, io-uring, linux-block, Uday Shankar,
	Keith Busch

On Mon, Apr 28, 2025 at 3:27 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 4/28/25 10:44, Ming Lei wrote:
> > Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() to
> > support registering/unregistering a bvec buffer with a specified
> > io_uring, whose FD is usually passed from userspace.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   include/linux/io_uring/cmd.h |  4 ++
> >   io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
> >   2 files changed, 67 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> > index 78fa336a284b..7516fe5cd606 100644
> > --- a/include/linux/io_uring/cmd.h
> > +++ b/include/linux/io_uring/cmd.h
> > @@ -25,6 +25,10 @@ struct io_uring_cmd_data {
> ...
> >       io_ring_submit_lock(ctx, issue_flags);
> > -     ret = __io_buffer_unregister_bvec(ctx, buf);
> > +     if (reg)
> > +             ret = __io_buffer_register_bvec(ctx, buf);
> > +     else
> > +             ret = __io_buffer_unregister_bvec(ctx, buf);
> >       io_ring_submit_unlock(ctx, issue_flags);
> >
> >       return ret;
> >   }
> > +
> > +static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > +                                 struct io_buf_data *buf,
> > +                                 unsigned int issue_flags,
> > +                                 bool reg)
> > +{
> > +     struct io_ring_ctx *remote_ctx = ctx;
> > +     struct file *file = NULL;
> > +     int ret;
> > +
> > +     if (buf->has_fd) {
> > +             file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);
>
> io_uring_register_get_file() accesses task private data and the request
> doesn't control from which task it's executed. IOW, you can't use the
> helper here. It can be iowq or sqpoll, but either way nothing is
> promised.

The later patches only call it from task_work on the task that
submitted the ublk fetch request, so it should work. But I agree it's
a very subtle requirement that may be difficult to ensure for other
callers added in the future.

Best,
Caleb

>
> > +             if (IS_ERR(file))
> > +                     return PTR_ERR(file);
> > +             remote_ctx = file->private_data;
> > +             if (!remote_ctx)
> > +                     return -EINVAL;
>
> nit: this check is not needed.
>
> > +     }
> > +
> > +     if (remote_ctx == ctx) {
> > +             do_reg_unreg_bvec(ctx, buf, issue_flags, reg);
> > +     } else {
> > +             if (!(issue_flags & IO_URING_F_UNLOCKED))
> > +                     mutex_unlock(&ctx->uring_lock);
>
> We shouldn't be dropping the lock in random helpers, for example
> it'd be pretty nasty suspending a submission loop with a submission
> from another task.
>
> You can try lock first, if fails it'll need a fresh context via
> iowq to be task-work'ed into the ring. see msg_ring.c for how
> it's done for files.
>
> > +
> > +             do_reg_unreg_bvec(remote_ctx, buf, IO_URING_F_UNLOCKED, reg);
> > +
> > +             if (!(issue_flags & IO_URING_F_UNLOCKED))
> > +                     mutex_lock(&ctx->uring_lock);
> > +     }
> > +
> > +     if (file)
> > +             fput(file);
> > +
> > +     return ret;
> > +}
> --
> Pavel Begunkov
>


* Re: [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-04-28 10:28   ` Pavel Begunkov
  2025-04-29  0:46     ` Caleb Sander Mateos
@ 2025-04-29  0:47     ` Ming Lei
  2025-04-30  8:25       ` Pavel Begunkov
  1 sibling, 1 reply; 29+ messages in thread
From: Ming Lei @ 2025-04-29  0:47 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jens Axboe, io-uring, linux-block, Uday Shankar,
	Caleb Sander Mateos, Keith Busch

On Mon, Apr 28, 2025 at 11:28:30AM +0100, Pavel Begunkov wrote:
> On 4/28/25 10:44, Ming Lei wrote:
> > Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() to
> > support registering/unregistering a bvec buffer with a specified
> > io_uring, whose FD is usually passed from userspace.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   include/linux/io_uring/cmd.h |  4 ++
> >   io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
> >   2 files changed, 67 insertions(+), 20 deletions(-)
> > 
> > diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> > index 78fa336a284b..7516fe5cd606 100644
> > --- a/include/linux/io_uring/cmd.h
> > +++ b/include/linux/io_uring/cmd.h
> > @@ -25,6 +25,10 @@ struct io_uring_cmd_data {
> ...
> >   	io_ring_submit_lock(ctx, issue_flags);
> > -	ret = __io_buffer_unregister_bvec(ctx, buf);
> > +	if (reg)
> > +		ret = __io_buffer_register_bvec(ctx, buf);
> > +	else
> > +		ret = __io_buffer_unregister_bvec(ctx, buf);
> >   	io_ring_submit_unlock(ctx, issue_flags);
> >   	return ret;
> >   }
> > +
> > +static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > +				    struct io_buf_data *buf,
> > +				    unsigned int issue_flags,
> > +				    bool reg)
> > +{
> > +	struct io_ring_ctx *remote_ctx = ctx;
> > +	struct file *file = NULL;
> > +	int ret;
> > +
> > +	if (buf->has_fd) {
> > +		file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);
> 
> io_uring_register_get_file() accesses task private data and the request
> doesn't control from which task it's executed. IOW, you can't use the
> helper here. It can be iowq or sqpoll, but either way nothing is
> promised.

Good catch!

Actually ublk uring_cmd is guaranteed to be issued from user context.

We can enhance it by failing buffer register:

	if ((current->flags & PF_KTHREAD) || (issue_flags & IO_URING_F_IOWQ))
		return -EACCES;

> 
> > +		if (IS_ERR(file))
> > +			return PTR_ERR(file);
> > +		remote_ctx = file->private_data;
> > +		if (!remote_ctx)
> > +			return -EINVAL;
> 
> nit: this check is not needed.

OK.

> 
> > +	}
> > +
> > +	if (remote_ctx == ctx) {
> > +		do_reg_unreg_bvec(ctx, buf, issue_flags, reg);
> > +	} else {
> > +		if (!(issue_flags & IO_URING_F_UNLOCKED))
> > +			mutex_unlock(&ctx->uring_lock);
> 
> We shouldn't be dropping the lock in random helpers, for example
> it'd be pretty nasty suspending a submission loop with a submission
> from another task.
> 
> You can try lock first, if fails it'll need a fresh context via
> iowq to be task-work'ed into the ring. see msg_ring.c for how
> it's done for files.

Looks like trylock is better; I will take this approach by returning
-EAGAIN and letting the ublk driver retry.


Thanks, 
Ming



* Re: [RFC PATCH 5/7] ublk: prepare for supporting to register request buffer automatically
  2025-04-28  9:44 ` [RFC PATCH 5/7] ublk: prepare for supporting to register request buffer automatically Ming Lei
@ 2025-04-29  0:50   ` Caleb Sander Mateos
  0 siblings, 0 replies; 29+ messages in thread
From: Caleb Sander Mateos @ 2025-04-29  0:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Mon, Apr 28, 2025 at 2:45 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> UBLK_F_SUPPORT_ZERO_COPY requires the ublk server to issue an explicit
> buffer register/unregister uring_cmd for each IO. This is not only
> inefficient, but also introduces a dependency between the buffer consumer
> and the buffer register/unregister uring_cmds; see
> tools/testing/selftests/ublk/stripe.c, in which backing-file IO has to be
> issued one by one via IOSQE_IO_LINK.
>
> Prepare for adding the UBLK_F_AUTO_BUF_REG feature to address the
> existing zero-copy limitations:
>
> - register request buffer automatically to ublk uring_cmd's io_uring
>   context before delivering io command to ublk server
>
> - unregister request buffer automatically from the ublk uring_cmd's
>   io_uring context when completing the request
>
> - io_uring will unregister the buffer automatically when uring is
>   exiting, so we needn't worry about accident exit
>
> To use this feature, the ublk server has to create a sparse buffer table.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c | 85 +++++++++++++++++++++++++++++++++++-----
>  1 file changed, 76 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 9cd331d12fa6..1fd20e481a60 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -133,6 +133,14 @@ struct ublk_uring_cmd_pdu {
>   */
>  #define UBLK_IO_FLAG_NEED_GET_DATA 0x08
>
> +/*
> + * request buffer is registered automatically, so we have to unregister it
> + * before completing this request.
> + *
> + * io_uring will unregister buffer automatically for us during exiting.
> + */
> +#define UBLK_IO_FLAG_AUTO_BUF_REG      0x10
> +
>  /* atomic RW with ubq->cancel_lock */
>  #define UBLK_IO_FLAG_CANCELED  0x80000000
>
> @@ -205,6 +213,7 @@ struct ublk_params_header {
>         __u32   types;
>  };
>
> +static void ublk_io_release(void *priv);
>  static void ublk_stop_dev_unlocked(struct ublk_device *ub);
>  static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
>  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> @@ -615,6 +624,11 @@ static inline bool ublk_support_zero_copy(const struct ublk_queue *ubq)
>         return ubq->flags & UBLK_F_SUPPORT_ZERO_COPY;
>  }
>
> +static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq)
> +{
> +       return false;
> +}
> +
>  static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
>  {
>         return ubq->flags & UBLK_F_USER_COPY;
> @@ -622,7 +636,8 @@ static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
>
>  static inline bool ublk_need_map_io(const struct ublk_queue *ubq)
>  {
> -       return !ublk_support_user_copy(ubq) && !ublk_support_zero_copy(ubq);
> +       return !ublk_support_user_copy(ubq) && !ublk_support_zero_copy(ubq) &&
> +               !ublk_support_auto_buf_reg(ubq);
>  }
>
>  static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
> @@ -633,17 +648,22 @@ static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
>          *
>          * for zero copy, request buffer need to be registered to io_uring
>          * buffer table, so reference is needed
> +        *
> +        * For auto buffer register, ublk server still may issue
> +        * UBLK_IO_COMMIT_AND_FETCH_REQ before one registered buffer is used up,
> +        * so reference is required too.
>          */
> -       return ublk_support_user_copy(ubq) || ublk_support_zero_copy(ubq);
> +       return ublk_support_user_copy(ubq) || ublk_support_zero_copy(ubq) ||
> +               ublk_support_auto_buf_reg(ubq);
>  }
>
>  static inline void ublk_init_req_ref(const struct ublk_queue *ubq,
> -               struct request *req)
> +               struct request *req, int init_ref)
>  {
>         if (ublk_need_req_ref(ubq)) {
>                 struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
>
> -               refcount_set(&data->ref, 1);
> +               refcount_set(&data->ref, init_ref);
>         }
>  }
>
> @@ -1157,6 +1177,37 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq,
>                 blk_mq_end_request(rq, BLK_STS_IOERR);
>  }
>
> +static bool ublk_auto_buf_reg(struct ublk_queue *ubq, struct request *req,
> +                             struct ublk_io *io, unsigned int issue_flags)
> +{
> +       struct io_buf_data data = {
> +               .rq    = req,
> +               .index = req->tag,

It feels a bit misleading to specify a value here that is always
overwritten by ublk_init_auto_buf_reg() in the next patch. Can you
just omit the field from the initialization here? Same comment for
ublk_auto_buf_unreg().

> +               .release = ublk_io_release,
> +       };
> +       int ret;
> +
> +       /* one extra reference is dropped by ublk_io_release */
> +       ublk_init_req_ref(ubq, req, 2);

I think the ublk_need_req_ref(ubq) check in ublk_init_req_ref() is not
needed here, since this is only called when
ublk_support_auto_buf_reg(ubq) is true. Maybe just inline the
refcount_set() here? Then you could drop the ubq argument to
ublk_auto_buf_reg(), and ublk_init_req_ref() wouldn't need to be
modified.

Best,
Caleb

> +       ret = io_buffer_register_bvec(io->cmd, &data, issue_flags);
> +       if (ret) {
> +               blk_mq_end_request(req, BLK_STS_IOERR);
> +               return false;
> +       }
> +       io->flags |= UBLK_IO_FLAG_AUTO_BUF_REG;
> +       return true;
> +}
> +
> +static bool ublk_prep_buf_reg(struct ublk_queue *ubq, struct request *req,
> +                             struct ublk_io *io, unsigned int issue_flags)
> +{
> +       if (ublk_support_auto_buf_reg(ubq) && ublk_rq_has_data(req))
> +               return ublk_auto_buf_reg(ubq, req, io, issue_flags);
> +
> +       ublk_init_req_ref(ubq, req, 1);
> +       return true;
> +}
> +
>  static void ublk_start_io(struct ublk_queue *ubq, struct request *req,
>                           struct ublk_io *io)
>  {
> @@ -1181,8 +1232,6 @@ static void ublk_start_io(struct ublk_queue *ubq, struct request *req,
>                 ublk_get_iod(ubq, req->tag)->nr_sectors =
>                         mapped_bytes >> 9;
>         }
> -
> -       ublk_init_req_ref(ubq, req);
>  }
>
>  static void ublk_dispatch_req(struct ublk_queue *ubq,
> @@ -1225,7 +1274,9 @@ static void ublk_dispatch_req(struct ublk_queue *ubq,
>         }
>
>         ublk_start_io(ubq, req, io);
> -       ublk_complete_io_cmd(io, req, UBLK_IO_RES_OK, issue_flags);
> +
> +       if (ublk_prep_buf_reg(ubq, req, io, issue_flags))
> +               ublk_complete_io_cmd(io, req, UBLK_IO_RES_OK, issue_flags);
>  }
>
>  static void ublk_cmd_tw_cb(struct io_uring_cmd *cmd,
> @@ -2007,9 +2058,21 @@ static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq,
>         return ret;
>  }
>
> +static void ublk_auto_buf_unreg(struct ublk_io *io, struct io_uring_cmd *cmd,
> +                               struct request *req, unsigned int issue_flags)
> +{
> +       struct io_buf_data data = {
> +               .index = req->tag,
> +       };
> +
> +       WARN_ON_ONCE(io_buffer_unregister_bvec(cmd, &data, issue_flags));
> +       io->flags &= ~UBLK_IO_FLAG_AUTO_BUF_REG;
> +}
> +
>  static int ublk_commit_and_fetch(const struct ublk_queue *ubq,
>                                  struct ublk_io *io, struct io_uring_cmd *cmd,
> -                                const struct ublksrv_io_cmd *ub_cmd)
> +                                const struct ublksrv_io_cmd *ub_cmd,
> +                                unsigned int issue_flags)
>  {
>         struct request *req;
>
> @@ -2033,6 +2096,9 @@ static int ublk_commit_and_fetch(const struct ublk_queue *ubq,
>                 return -EINVAL;
>         }
>
> +       if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG)
> +               ublk_auto_buf_unreg(io, cmd, req, issue_flags);
> +
>         ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
>
>         /* now this cmd slot is owned by ublk driver */
> @@ -2065,6 +2131,7 @@ static void ublk_get_data(struct ublk_queue *ubq, struct ublk_io *io)
>                         ublk_get_iod(ubq, req->tag)->addr);
>
>         ublk_start_io(ubq, req, io);
> +       ublk_init_req_ref(ubq, req, 1);
>  }
>
>  static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
> @@ -2124,7 +2191,7 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
>                         goto out;
>                 break;
>         case UBLK_IO_COMMIT_AND_FETCH_REQ:
> -               ret = ublk_commit_and_fetch(ubq, io, cmd, ub_cmd);
> +               ret = ublk_commit_and_fetch(ubq, io, cmd, ub_cmd, issue_flags);
>                 if (ret)
>                         goto out;
>                 break;
> --
> 2.47.0
>


* Re: [RFC PATCH 6/7] ublk: register buffer to specified io_uring & buf index via UBLK_F_AUTO_BUF_REG
  2025-04-28  9:44 ` [RFC PATCH 6/7] ublk: register buffer to specified io_uring & buf index via UBLK_F_AUTO_BUF_REG Ming Lei
@ 2025-04-29  0:52   ` Caleb Sander Mateos
  2025-04-30 15:45     ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: Caleb Sander Mateos @ 2025-04-29  0:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Mon, Apr 28, 2025 at 2:45 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add UBLK_F_AUTO_BUF_REG to support registering the request buffer
> automatically to a specified io_uring context and buffer index.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c      | 56 ++++++++++++++++++++++++++++-------
>  include/uapi/linux/ublk_cmd.h | 38 ++++++++++++++++++++++++
>  2 files changed, 84 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 1fd20e481a60..e82618442749 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -66,7 +66,8 @@
>                 | UBLK_F_USER_COPY \
>                 | UBLK_F_ZONED \
>                 | UBLK_F_USER_RECOVERY_FAIL_IO \
> -               | UBLK_F_UPDATE_SIZE)
> +               | UBLK_F_UPDATE_SIZE \
> +               | UBLK_F_AUTO_BUF_REG)
>
>  #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
>                 | UBLK_F_USER_RECOVERY_REISSUE \
> @@ -146,7 +147,10 @@ struct ublk_uring_cmd_pdu {
>
>  struct ublk_io {
>         /* userspace buffer address from io cmd */
> -       __u64   addr;
> +       union {
> +               __u64   addr;
> +               struct ublk_auto_buf_reg buf;

Maybe add a comment justifying why these fields can overlap? From my
understanding, buf is valid iff UBLK_F_AUTO_BUF_REG is set on the
ublk_queue and addr is valid iff neither UBLK_F_USER_COPY,
UBLK_F_SUPPORT_ZERO_COPY, nor UBLK_F_AUTO_BUF_REG is set.

> +       };
>         unsigned int flags;
>         int res;
>
> @@ -626,7 +630,7 @@ static inline bool ublk_support_zero_copy(const struct ublk_queue *ubq)
>
>  static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq)
>  {
> -       return false;
> +       return ubq->flags & UBLK_F_AUTO_BUF_REG;
>  }
>
>  static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
> @@ -1177,6 +1181,16 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq,
>                 blk_mq_end_request(rq, BLK_STS_IOERR);
>  }
>
> +
> +static inline void ublk_init_auto_buf_reg(const struct ublk_io *io,
> +                                         struct io_buf_data *data)
> +{
> +       data->index = io->buf.index;
> +       data->ring_fd = io->buf.ring_fd;
> +       data->has_fd = true;
> +       data->registered_fd = io->buf.flags & UBLK_AUTO_BUF_REGISTERED_RING;
> +}
> +
>  static bool ublk_auto_buf_reg(struct ublk_queue *ubq, struct request *req,
>                               struct ublk_io *io, unsigned int issue_flags)
>  {
> @@ -1187,6 +1201,9 @@ static bool ublk_auto_buf_reg(struct ublk_queue *ubq, struct request *req,
>         };
>         int ret;
>
> +       if (ublk_support_auto_buf_reg(ubq))

This check seems redundant with the check in the caller? Same comment
about ublk_auto_buf_unreg(). That would allow you to avoid adding the
ubq argument to ublk_auto_buf_unreg().

> +               ublk_init_auto_buf_reg(io, &data);
> +
>         /* one extra reference is dropped by ublk_io_release */
>         ublk_init_req_ref(ubq, req, 2);
>         ret = io_buffer_register_bvec(io->cmd, &data, issue_flags);
> @@ -2045,7 +2062,7 @@ static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq,
>                  */
>                 if (!buf_addr && !ublk_need_get_data(ubq))
>                         goto out;
> -       } else if (buf_addr) {
> +       } else if (buf_addr && !ublk_support_auto_buf_reg(ubq)) {
>                 /* User copy requires addr to be unset */
>                 ret = -EINVAL;
>                 goto out;
> @@ -2058,13 +2075,17 @@ static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq,
>         return ret;
>  }
>
> -static void ublk_auto_buf_unreg(struct ublk_io *io, struct io_uring_cmd *cmd,
> +static void ublk_auto_buf_unreg(const struct ublk_queue *ubq,
> +                               struct ublk_io *io, struct io_uring_cmd *cmd,
>                                 struct request *req, unsigned int issue_flags)
>  {
>         struct io_buf_data data = {
>                 .index = req->tag,
>         };
>
> +       if (ublk_support_auto_buf_reg(ubq))
> +               ublk_init_auto_buf_reg(io, &data);
> +
>         WARN_ON_ONCE(io_buffer_unregister_bvec(cmd, &data, issue_flags));
>         io->flags &= ~UBLK_IO_FLAG_AUTO_BUF_REG;
>  }
> @@ -2088,7 +2109,8 @@ static int ublk_commit_and_fetch(const struct ublk_queue *ubq,
>                 if (!ub_cmd->addr && (!ublk_need_get_data(ubq) ||
>                                         req_op(req) == REQ_OP_READ))
>                         return -EINVAL;
> -       } else if (req_op(req) != REQ_OP_ZONE_APPEND && ub_cmd->addr) {
> +       } else if ((req_op(req) != REQ_OP_ZONE_APPEND &&
> +                               !ublk_support_auto_buf_reg(ubq)) && ub_cmd->addr) {
>                 /*
>                  * User copy requires addr to be unset when command is
>                  * not zone append
> @@ -2097,7 +2119,7 @@ static int ublk_commit_and_fetch(const struct ublk_queue *ubq,
>         }
>
>         if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG)
> -               ublk_auto_buf_unreg(io, cmd, req, issue_flags);
> +               ublk_auto_buf_unreg(ubq, io, cmd, req, issue_flags);
>
>         ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
>
> @@ -2788,6 +2810,11 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
>         else if (!(info.flags & UBLK_F_UNPRIVILEGED_DEV))
>                 return -EPERM;
>
> +       /* F_AUTO_BUF_REG and F_SUPPORT_ZERO_COPY can't co-exist */
> +       if ((info.flags & UBLK_F_AUTO_BUF_REG) &&
> +                       (info.flags & UBLK_F_SUPPORT_ZERO_COPY))
> +               return -EINVAL;
> +
>         /* forbid nonsense combinations of recovery flags */
>         switch (info.flags & UBLK_F_ALL_RECOVERY_FLAGS) {
>         case 0:
> @@ -2817,8 +2844,11 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
>                  * For USER_COPY, we depends on userspace to fill request
>                  * buffer by pwrite() to ublk char device, which can't be
>                  * used for unprivileged device
> +                *
> +                * Same with zero copy or auto buffer register.
>                  */
> -               if (info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY))
> +               if (info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY |
> +                                       UBLK_F_AUTO_BUF_REG))
>                         return -EINVAL;
>         }
>
> @@ -2876,17 +2906,22 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
>                 UBLK_F_URING_CMD_COMP_IN_TASK;
>
>         /* GET_DATA isn't needed any more with USER_COPY or ZERO COPY */
> -       if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY))
> +       if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY |
> +                               UBLK_F_AUTO_BUF_REG))
>                 ub->dev_info.flags &= ~UBLK_F_NEED_GET_DATA;
>
>         /*
>          * Zoned storage support requires reuse `ublksrv_io_cmd->addr` for
>          * returning write_append_lba, which is only allowed in case of
>          * user copy or zero copy
> +        *
> +        * UBLK_F_AUTO_BUF_REG can't be enabled for zoned because it need
> +        * the space for getting ring_fd and buffer index.
>          */
>         if (ublk_dev_is_zoned(ub) &&
>             (!IS_ENABLED(CONFIG_BLK_DEV_ZONED) || !(ub->dev_info.flags &
> -            (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)))) {
> +            (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)) ||
> +            (ub->dev_info.flags & UBLK_F_AUTO_BUF_REG))) {
>                 ret = -EINVAL;
>                 goto out_free_dev_number;
>         }
> @@ -3403,6 +3438,7 @@ static int __init ublk_init(void)
>
>         BUILD_BUG_ON((u64)UBLKSRV_IO_BUF_OFFSET +
>                         UBLKSRV_IO_BUF_TOTAL_SIZE < UBLKSRV_IO_BUF_OFFSET);
> +       BUILD_BUG_ON(sizeof(struct ublk_auto_buf_reg) != sizeof(__u64));
>
>         init_waitqueue_head(&ublk_idr_wq);
>
> diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> index be5c6c6b16e0..3d7c8c69cf06 100644
> --- a/include/uapi/linux/ublk_cmd.h
> +++ b/include/uapi/linux/ublk_cmd.h
> @@ -219,6 +219,30 @@
>   */
>  #define UBLK_F_UPDATE_SIZE              (1ULL << 10)
>
> +/*
> + * request buffer is registered automatically to ublk server specified
> + * io_uring context before delivering this io command to ublk server,
> + * meantime it is un-registered automatically when completing this io
> + * command.
> + *
> + * For using this feature:
> + *
> + * - ublk server has to create sparse buffer table
> + *
> + * - pass io_ring context FD from `ublksrv_io_cmd.buf.ring_fd`, and the FD
> + *   can be registered io_ring FD if `UBLK_AUTO_BUF_REGISTERED_RING` is set
> + *   in `ublksrv_io_cmd.flags`, or plain FD
> + *
> + * - pass buffer index from `ublksrv_io_cmd.buf.index`
> + *
> + * This way avoids extra cost from two uring_cmd, but also simplifies backend
> + * implementation, such as, the dependency on IO_REGISTER_IO_BUF and
> + * IO_UNREGISTER_IO_BUF becomes not necessary.
> + *
> + * This feature isn't available for UBLK_F_ZONED
> + */
> +#define UBLK_F_AUTO_BUF_REG    (1ULL << 11)
> +
>  /* device state */
>  #define UBLK_S_DEV_DEAD        0
>  #define UBLK_S_DEV_LIVE        1
> @@ -339,6 +363,14 @@ static inline __u32 ublksrv_get_flags(const struct ublksrv_io_desc *iod)
>         return iod->op_flags >> 8;
>  }
>
> +struct ublk_auto_buf_reg {
> +       __s32  ring_fd;
> +       __u16  index;
> +#define UBLK_AUTO_BUF_REGISTERED_RING            (1 << 0)
> +       __u8   flags;

The flag could potentially be stored in ublk_io's flags field instead
to avoid taking up this byte.

Best,
Caleb


> +       __u8   _pad;
> +};
> +
>  /* issued to ublk driver via /dev/ublkcN */
>  struct ublksrv_io_cmd {
>         __u16   q_id;
> @@ -363,6 +395,12 @@ struct ublksrv_io_cmd {
>                  */
>                 __u64   addr;
>                 __u64   zone_append_lba;
> +
> +               /*
> +                * for AUTO_BUF_REG feature, F_ZONED can't be supported,
> +                * and ->addr isn't used for zero copy
> +                */
> +               struct ublk_auto_buf_reg auto_buf;
>         };
>  };
>
> --
> 2.47.0
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-04-29  0:47     ` Ming Lei
@ 2025-04-30  8:25       ` Pavel Begunkov
  2025-04-30 14:44         ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: Pavel Begunkov @ 2025-04-30  8:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, linux-block, Uday Shankar,
	Caleb Sander Mateos, Keith Busch

On 4/29/25 01:47, Ming Lei wrote:
> On Mon, Apr 28, 2025 at 11:28:30AM +0100, Pavel Begunkov wrote:
>> On 4/28/25 10:44, Ming Lei wrote:
>>> Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() for
>>> supporting to register/unregister bvec buffer to specified io_uring,
>>> which FD is usually passed from userspace.
>>>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>    include/linux/io_uring/cmd.h |  4 ++
>>>    io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
>>>    2 files changed, 67 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
>>> index 78fa336a284b..7516fe5cd606 100644
>>> --- a/include/linux/io_uring/cmd.h
>>> +++ b/include/linux/io_uring/cmd.h
>>> @@ -25,6 +25,10 @@ struct io_uring_cmd_data {
>> ...
>>>    	io_ring_submit_lock(ctx, issue_flags);
>>> -	ret = __io_buffer_unregister_bvec(ctx, buf);
>>> +	if (reg)
>>> +		ret = __io_buffer_register_bvec(ctx, buf);
>>> +	else
>>> +		ret = __io_buffer_unregister_bvec(ctx, buf);
>>>    	io_ring_submit_unlock(ctx, issue_flags);
>>>    	return ret;
>>>    }
>>> +
>>> +static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
>>> +				    struct io_buf_data *buf,
>>> +				    unsigned int issue_flags,
>>> +				    bool reg)
>>> +{
>>> +	struct io_ring_ctx *remote_ctx = ctx;
>>> +	struct file *file = NULL;
>>> +	int ret;
>>> +
>>> +	if (buf->has_fd) {
>>> +		file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);
>>
>> io_uring_register_get_file() accesses task private data and the request
>> doesn't control from which task it's executed. IOW, you can't use the
>> helper here. It can be iowq or sqpoll, but either way nothing is
>> promised.
> 
> Good catch!
> 
> Actually ublk uring_cmd is guaranteed to be issued from user context.
> 
> We can enhance it by failing buffer register:
> 
> 	if ((current->flags & PF_KTHREAD) || (issue_flags & IO_URING_F_IOWQ))
> 		return -EACCES;

Can it somehow check that current matches the desired task? That's exactly
the condition where it can go wrong, and that's much better than listing all
corner cases that might change.

Just to avoid confusion, io_uring doesn't guarantee it'll be run from
the "right" task. If that changes in the future, either the ublk uapi should
mandate that the user fall back to something else like regular fds, or
ublk will need to handle it somehow.

>>> +		if (IS_ERR(file))
>>> +			return PTR_ERR(file);
>>> +		remote_ctx = file->private_data;
>>> +		if (!remote_ctx)
>>> +			return -EINVAL;
>>
>> nit: this check is not needed.
> 
> OK.
> 
>>
>>> +	}
>>> +
>>> +	if (remote_ctx == ctx) {
>>> +		do_reg_unreg_bvec(ctx, buf, issue_flags, reg);
>>> +	} else {
>>> +		if (!(issue_flags & IO_URING_F_UNLOCKED))
>>> +			mutex_unlock(&ctx->uring_lock);
>>
>> We shouldn't be dropping the lock in random helpers, for example
>> it'd be pretty nasty suspending a submission loop with a submission
>> from another task.
>>
>> You can try lock first, if fails it'll need a fresh context via
>> iowq to be task-work'ed into the ring. see msg_ring.c for how
>> it's done for files.
> 
> Looks like trylock is better; I will take this approach by returning -EAGAIN
> and letting the ublk driver retry.

Is there a reliable fallback path for the userspace? Hammering the
same thing until it succeeds is never a good option.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-04-30  8:25       ` Pavel Begunkov
@ 2025-04-30 14:44         ` Ming Lei
  0 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2025-04-30 14:44 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jens Axboe, io-uring, linux-block, Uday Shankar,
	Caleb Sander Mateos, Keith Busch

On Wed, Apr 30, 2025 at 09:25:33AM +0100, Pavel Begunkov wrote:
> On 4/29/25 01:47, Ming Lei wrote:
> > On Mon, Apr 28, 2025 at 11:28:30AM +0100, Pavel Begunkov wrote:
> > > On 4/28/25 10:44, Ming Lei wrote:
> > > > Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() for
> > > > supporting to register/unregister bvec buffer to specified io_uring,
> > > > which FD is usually passed from userspace.
> > > > 
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >    include/linux/io_uring/cmd.h |  4 ++
> > > >    io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
> > > >    2 files changed, 67 insertions(+), 20 deletions(-)
> > > > 
> > > > diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> > > > index 78fa336a284b..7516fe5cd606 100644
> > > > --- a/include/linux/io_uring/cmd.h
> > > > +++ b/include/linux/io_uring/cmd.h
> > > > @@ -25,6 +25,10 @@ struct io_uring_cmd_data {
> > > ...
> > > >    	io_ring_submit_lock(ctx, issue_flags);
> > > > -	ret = __io_buffer_unregister_bvec(ctx, buf);
> > > > +	if (reg)
> > > > +		ret = __io_buffer_register_bvec(ctx, buf);
> > > > +	else
> > > > +		ret = __io_buffer_unregister_bvec(ctx, buf);
> > > >    	io_ring_submit_unlock(ctx, issue_flags);
> > > >    	return ret;
> > > >    }
> > > > +
> > > > +static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > > > +				    struct io_buf_data *buf,
> > > > +				    unsigned int issue_flags,
> > > > +				    bool reg)
> > > > +{
> > > > +	struct io_ring_ctx *remote_ctx = ctx;
> > > > +	struct file *file = NULL;
> > > > +	int ret;
> > > > +
> > > > +	if (buf->has_fd) {
> > > > +		file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);
> > > 
> > > io_uring_register_get_file() accesses task private data and the request
> > > doesn't control from which task it's executed. IOW, you can't use the
> > > helper here. It can be iowq or sqpoll, but either way nothing is
> > > promised.
> > 
> > Good catch!
> > 
> > Actually ublk uring_cmd is guaranteed to be issued from user context.
> > 
> > We can enhance it by failing buffer register:
> > 
> > 	if ((current->flags & PF_KTHREAD) || (issue_flags & IO_URING_F_IOWQ))
> > 		return -EACCES;
> 
> Can it somehow check that current matches the desired task? That's exactly
> the condition where it can go wrong, and that's much better than listing all
> corner cases that might change.
> 
> Just to avoid confusion, it's not guaranteed by io_uring it'll be run from
> the "right" task. If that changes in the future, either the ublk uapi should
> be mandating the user to fall back to something else like regular fds, or
> ublk will need to handle it somehow.

ublk does track the task context. I will enhance the check for the
registered ring FD on the ublk driver side and make sure that it won't be
used if the current task isn't the ubq_daemon context.

If the task context doesn't match, the uring command can be completed and
the ublk server notified to handle the failure, for example by sending a
register io buffer command.

> 
> > > > +		if (IS_ERR(file))
> > > > +			return PTR_ERR(file);
> > > > +		remote_ctx = file->private_data;
> > > > +		if (!remote_ctx)
> > > > +			return -EINVAL;
> > > 
> > > nit: this check is not needed.
> > 
> > OK.
> > 
> > > 
> > > > +	}
> > > > +
> > > > +	if (remote_ctx == ctx) {
> > > > +		do_reg_unreg_bvec(ctx, buf, issue_flags, reg);
> > > > +	} else {
> > > > +		if (!(issue_flags & IO_URING_F_UNLOCKED))
> > > > +			mutex_unlock(&ctx->uring_lock);
> > > 
> > > We shouldn't be dropping the lock in random helpers, for example
> > > it'd be pretty nasty suspending a submission loop with a submission
> > > from another task.
> > > 
> > > You can try lock first, if fails it'll need a fresh context via
> > > iowq to be task-work'ed into the ring. see msg_ring.c for how
> > > it's done for files.
> > 
> > Looks trylock is better, will take this approach by returning -EAGAIN,
> > and let ublk driver retry.
> 
> Is there a reliable fallback path for the userspace? Hammering the
> same thing until it succeeds is never a good option.

Retrying until it succeeds is the simplest way.

But we can improve it by retrying several times; if it still can't succeed,
complete the uring command and let the ublk server send a register buffer
command to register the buffer manually.


thanks,
Ming


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-04-29  0:43   ` Caleb Sander Mateos
@ 2025-04-30 15:34     ` Ming Lei
  2025-05-02  1:31       ` Caleb Sander Mateos
  0 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2025-04-30 15:34 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Mon, Apr 28, 2025 at 05:43:12PM -0700, Caleb Sander Mateos wrote:
> On Mon, Apr 28, 2025 at 2:44 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() for
> > supporting to register/unregister bvec buffer to specified io_uring,
> > which FD is usually passed from userspace.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  include/linux/io_uring/cmd.h |  4 ++
> >  io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
> >  2 files changed, 67 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> > index 78fa336a284b..7516fe5cd606 100644
> > --- a/include/linux/io_uring/cmd.h
> > +++ b/include/linux/io_uring/cmd.h
> > @@ -25,6 +25,10 @@ struct io_uring_cmd_data {
> >
> >  struct io_buf_data {
> >         unsigned short index;
> > +       bool has_fd;
> > +       bool registered_fd;
> > +
> > +       int ring_fd;
> >         struct request *rq;
> >         void (*release)(void *);
> >  };
> > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > index 5f8ab130a573..701dd33fecf7 100644
> > --- a/io_uring/rsrc.c
> > +++ b/io_uring/rsrc.c
> > @@ -969,21 +969,6 @@ static int __io_buffer_register_bvec(struct io_ring_ctx *ctx,
> >         return 0;
> >  }
> >
> > -int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> > -                           struct io_buf_data *buf,
> > -                           unsigned int issue_flags)
> > -{
> > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > -       int ret;
> > -
> > -       io_ring_submit_lock(ctx, issue_flags);
> > -       ret = __io_buffer_register_bvec(ctx, buf);
> > -       io_ring_submit_unlock(ctx, issue_flags);
> > -
> > -       return ret;
> > -}
> > -EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> > -
> >  static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> >                                        struct io_buf_data *buf)
> >  {
> > @@ -1006,19 +991,77 @@ static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> >         return 0;
> >  }
> >
> > -int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
> > -                             struct io_buf_data *buf,
> > -                             unsigned int issue_flags)
> > +static inline int do_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > +                                   struct io_buf_data *buf,
> > +                                   unsigned int issue_flags,
> > +                                   bool reg)
> >  {
> > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> >         int ret;
> >
> >         io_ring_submit_lock(ctx, issue_flags);
> > -       ret = __io_buffer_unregister_bvec(ctx, buf);
> > +       if (reg)
> > +               ret = __io_buffer_register_bvec(ctx, buf);
> > +       else
> > +               ret = __io_buffer_unregister_bvec(ctx, buf);
> 
> It feels like unifying __io_buffer_register_bvec() and
> __io_buffer_unregister_bvec() would belong better in the prior patch
> that changes their signatures.

Can you share how to do the above in the previous patch?

> 
> >         io_ring_submit_unlock(ctx, issue_flags);
> >
> >         return ret;
> >  }
> > +
> > +static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > +                                   struct io_buf_data *buf,
> > +                                   unsigned int issue_flags,
> > +                                   bool reg)
> > +{
> > +       struct io_ring_ctx *remote_ctx = ctx;
> > +       struct file *file = NULL;
> > +       int ret;
> > +
> > +       if (buf->has_fd) {
> > +               file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);
> > +               if (IS_ERR(file))
> > +                       return PTR_ERR(file);
> 
> It would be good to avoid the overhead of this lookup and
> reference-counting in the I/O path. Would it be possible to move this
> lookup to when UBLK_IO_FETCH_REQ (and UBLK_IO_COMMIT_AND_FETCH_REQ, if
> it specifies a different ring_fd) is submitted? I guess that might
> require storing an extra io_ring_ctx pointer in struct ublk_io.

Let's start with the flexible way & a simple implementation.

Any optimization & improvement can be done as a follow-up.

Each command may register its buffer to a different io_uring context,
so the lookup can't be done at the UBLK_IO_FETCH_REQ stage: a new IO with
the same tag may register its buffer to a new io_uring context.

But it can be optimized in the future for a specific use case behind a
feature flag.

> 
> > +               remote_ctx = file->private_data;
> > +               if (!remote_ctx)
> > +                       return -EINVAL;
> > +       }
> > +
> > +       if (remote_ctx == ctx) {
> > +               do_reg_unreg_bvec(ctx, buf, issue_flags, reg);
> > +       } else {
> > +               if (!(issue_flags & IO_URING_F_UNLOCKED))
> > +                       mutex_unlock(&ctx->uring_lock);
> > +
> > +               do_reg_unreg_bvec(remote_ctx, buf, IO_URING_F_UNLOCKED, reg);
> > +
> > +               if (!(issue_flags & IO_URING_F_UNLOCKED))
> > +                       mutex_lock(&ctx->uring_lock);
> > +       }
> > +
> > +       if (file)
> > +               fput(file);
> > +
> > +       return ret;
> > +}
> > +
> > +int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> > +                           struct io_buf_data *buf,
> > +                           unsigned int issue_flags)
> 
> If buf->has_fd is set, this struct io_uring_cmd *cmd is unused. Could
> you define separate functions that take a struct io_uring_cmd * vs. a
> ring_fd?

The ring_fd may point to the same io_uring context as 'io_uring_cmd';
we need this information for dealing with the io_uring context lock.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 6/7] ublk: register buffer to specified io_uring & buf index via UBLK_F_AUTO_BUF_REG
  2025-04-29  0:52   ` Caleb Sander Mateos
@ 2025-04-30 15:45     ` Ming Lei
  2025-04-30 16:30       ` Caleb Sander Mateos
  0 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2025-04-30 15:45 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Mon, Apr 28, 2025 at 05:52:28PM -0700, Caleb Sander Mateos wrote:
> On Mon, Apr 28, 2025 at 2:45 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Add UBLK_F_AUTO_BUF_REG for supporting to register buffer automatically
> > to specified io_uring context and buffer index.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/ublk_drv.c      | 56 ++++++++++++++++++++++++++++-------
> >  include/uapi/linux/ublk_cmd.h | 38 ++++++++++++++++++++++++
> >  2 files changed, 84 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 1fd20e481a60..e82618442749 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -66,7 +66,8 @@
> >                 | UBLK_F_USER_COPY \
> >                 | UBLK_F_ZONED \
> >                 | UBLK_F_USER_RECOVERY_FAIL_IO \
> > -               | UBLK_F_UPDATE_SIZE)
> > +               | UBLK_F_UPDATE_SIZE \
> > +               | UBLK_F_AUTO_BUF_REG)
> >
> >  #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
> >                 | UBLK_F_USER_RECOVERY_REISSUE \
> > @@ -146,7 +147,10 @@ struct ublk_uring_cmd_pdu {
> >
> >  struct ublk_io {
> >         /* userspace buffer address from io cmd */
> > -       __u64   addr;
> > +       union {
> > +               __u64   addr;
> > +               struct ublk_auto_buf_reg buf;
> 
> Maybe add a comment justifying why these fields can overlap? From my
> understanding, buf is valid iff UBLK_F_AUTO_BUF_REG is set on the
> ublk_queue and addr is valid iff neither UBLK_F_USER_COPY,
> UBLK_F_SUPPORT_ZERO_COPY, nor UBLK_F_AUTO_BUF_REG is set.

->addr is for storing the userspace buffer address, which is only used in
the non-zero-copy cases (not zc or auto_buf_reg) and not in the user copy case.

> 
> > +       };
> >         unsigned int flags;
> >         int res;
> >
> > @@ -626,7 +630,7 @@ static inline bool ublk_support_zero_copy(const struct ublk_queue *ubq)
> >
> >  static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq)
> >  {
> > -       return false;
> > +       return ubq->flags & UBLK_F_AUTO_BUF_REG;
> >  }
> >
> >  static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
> > @@ -1177,6 +1181,16 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq,
> >                 blk_mq_end_request(rq, BLK_STS_IOERR);
> >  }
> >
> > +
> > +static inline void ublk_init_auto_buf_reg(const struct ublk_io *io,
> > +                                         struct io_buf_data *data)
> > +{
> > +       data->index = io->buf.index;
> > +       data->ring_fd = io->buf.ring_fd;
> > +       data->has_fd = true;
> > +       data->registered_fd = io->buf.flags & UBLK_AUTO_BUF_REGISTERED_RING;
> > +}
> > +
> >  static bool ublk_auto_buf_reg(struct ublk_queue *ubq, struct request *req,
> >                               struct ublk_io *io, unsigned int issue_flags)
> >  {
> > @@ -1187,6 +1201,9 @@ static bool ublk_auto_buf_reg(struct ublk_queue *ubq, struct request *req,
> >         };
> >         int ret;
> >
> > +       if (ublk_support_auto_buf_reg(ubq))
> 
> This check seems redundant with the check in the caller? Same comment
> about ublk_auto_buf_unreg(). That would allow you to avoid adding the
> ubq argument to ublk_auto_buf_unreg().

Yeah, actually I removed one feature which just registers the buffer to
the uring command context, then forgot to update the check.

> 
> > +               ublk_init_auto_buf_reg(io, &data);
> > +
> >         /* one extra reference is dropped by ublk_io_release */
> >         ublk_init_req_ref(ubq, req, 2);
> >         ret = io_buffer_register_bvec(io->cmd, &data, issue_flags);
> > @@ -2045,7 +2062,7 @@ static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq,
> >                  */
> >                 if (!buf_addr && !ublk_need_get_data(ubq))
> >                         goto out;
> > -       } else if (buf_addr) {
> > +       } else if (buf_addr && !ublk_support_auto_buf_reg(ubq)) {
> >                 /* User copy requires addr to be unset */
> >                 ret = -EINVAL;
> >                 goto out;
> > @@ -2058,13 +2075,17 @@ static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq,
> >         return ret;
> >  }
> >
> > -static void ublk_auto_buf_unreg(struct ublk_io *io, struct io_uring_cmd *cmd,
> > +static void ublk_auto_buf_unreg(const struct ublk_queue *ubq,
> > +                               struct ublk_io *io, struct io_uring_cmd *cmd,
> >                                 struct request *req, unsigned int issue_flags)
> >  {
> >         struct io_buf_data data = {
> >                 .index = req->tag,
> >         };
> >
> > +       if (ublk_support_auto_buf_reg(ubq))
> > +               ublk_init_auto_buf_reg(io, &data);
> > +
> >         WARN_ON_ONCE(io_buffer_unregister_bvec(cmd, &data, issue_flags));
> >         io->flags &= ~UBLK_IO_FLAG_AUTO_BUF_REG;
> >  }
> > @@ -2088,7 +2109,8 @@ static int ublk_commit_and_fetch(const struct ublk_queue *ubq,
> >                 if (!ub_cmd->addr && (!ublk_need_get_data(ubq) ||
> >                                         req_op(req) == REQ_OP_READ))
> >                         return -EINVAL;
> > -       } else if (req_op(req) != REQ_OP_ZONE_APPEND && ub_cmd->addr) {
> > +       } else if ((req_op(req) != REQ_OP_ZONE_APPEND &&
> > +                               !ublk_support_auto_buf_reg(ubq)) && ub_cmd->addr) {
> >                 /*
> >                  * User copy requires addr to be unset when command is
> >                  * not zone append
> > @@ -2097,7 +2119,7 @@ static int ublk_commit_and_fetch(const struct ublk_queue *ubq,
> >         }
> >
> >         if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG)
> > -               ublk_auto_buf_unreg(io, cmd, req, issue_flags);
> > +               ublk_auto_buf_unreg(ubq, io, cmd, req, issue_flags);
> >
> >         ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
> >
> > @@ -2788,6 +2810,11 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
> >         else if (!(info.flags & UBLK_F_UNPRIVILEGED_DEV))
> >                 return -EPERM;
> >
> > +       /* F_AUTO_BUF_REG and F_SUPPORT_ZERO_COPY can't co-exist */
> > +       if ((info.flags & UBLK_F_AUTO_BUF_REG) &&
> > +                       (info.flags & UBLK_F_SUPPORT_ZERO_COPY))
> > +               return -EINVAL;
> > +
> >         /* forbid nonsense combinations of recovery flags */
> >         switch (info.flags & UBLK_F_ALL_RECOVERY_FLAGS) {
> >         case 0:
> > @@ -2817,8 +2844,11 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
> >                  * For USER_COPY, we depends on userspace to fill request
> >                  * buffer by pwrite() to ublk char device, which can't be
> >                  * used for unprivileged device
> > +                *
> > +                * Same with zero copy or auto buffer register.
> >                  */
> > -               if (info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY))
> > +               if (info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY |
> > +                                       UBLK_F_AUTO_BUF_REG))
> >                         return -EINVAL;
> >         }
> >
> > @@ -2876,17 +2906,22 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
> >                 UBLK_F_URING_CMD_COMP_IN_TASK;
> >
> >         /* GET_DATA isn't needed any more with USER_COPY or ZERO COPY */
> > -       if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY))
> > +       if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY |
> > +                               UBLK_F_AUTO_BUF_REG))
> >                 ub->dev_info.flags &= ~UBLK_F_NEED_GET_DATA;
> >
> >         /*
> >          * Zoned storage support requires reuse `ublksrv_io_cmd->addr` for
> >          * returning write_append_lba, which is only allowed in case of
> >          * user copy or zero copy
> > +        *
> > +        * UBLK_F_AUTO_BUF_REG can't be enabled for zoned because it need
> > +        * the space for getting ring_fd and buffer index.
> >          */
> >         if (ublk_dev_is_zoned(ub) &&
> >             (!IS_ENABLED(CONFIG_BLK_DEV_ZONED) || !(ub->dev_info.flags &
> > -            (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)))) {
> > +            (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)) ||
> > +            (ub->dev_info.flags & UBLK_F_AUTO_BUF_REG))) {
> >                 ret = -EINVAL;
> >                 goto out_free_dev_number;
> >         }
> > @@ -3403,6 +3438,7 @@ static int __init ublk_init(void)
> >
> >         BUILD_BUG_ON((u64)UBLKSRV_IO_BUF_OFFSET +
> >                         UBLKSRV_IO_BUF_TOTAL_SIZE < UBLKSRV_IO_BUF_OFFSET);
> > +       BUILD_BUG_ON(sizeof(struct ublk_auto_buf_reg) != sizeof(__u64));
> >
> >         init_waitqueue_head(&ublk_idr_wq);
> >
> > diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> > index be5c6c6b16e0..3d7c8c69cf06 100644
> > --- a/include/uapi/linux/ublk_cmd.h
> > +++ b/include/uapi/linux/ublk_cmd.h
> > @@ -219,6 +219,30 @@
> >   */
> >  #define UBLK_F_UPDATE_SIZE              (1ULL << 10)
> >
> > +/*
> > + * request buffer is registered automatically to ublk server specified
> > + * io_uring context before delivering this io command to ublk server,
> > + * meantime it is un-registered automatically when completing this io
> > + * command.
> > + *
> > + * For using this feature:
> > + *
> > + * - ublk server has to create sparse buffer table
> > + *
> > + * - pass io_ring context FD from `ublksrv_io_cmd.buf.ring_fd`, and the FD
> > + *   can be registered io_ring FD if `UBLK_AUTO_BUF_REGISTERED_RING` is set
> > + *   in `ublksrv_io_cmd.flags`, or plain FD
> > + *
> > + * - pass buffer index from `ublksrv_io_cmd.buf.index`
> > + *
> > + * This way avoids extra cost from two uring_cmd, but also simplifies backend
> > + * implementation, such as, the dependency on IO_REGISTER_IO_BUF and
> > + * IO_UNREGISTER_IO_BUF becomes not necessary.
> > + *
> > + * This feature isn't available for UBLK_F_ZONED
> > + */
> > +#define UBLK_F_AUTO_BUF_REG    (1ULL << 11)
> > +
> >  /* device state */
> >  #define UBLK_S_DEV_DEAD        0
> >  #define UBLK_S_DEV_LIVE        1
> > @@ -339,6 +363,14 @@ static inline __u32 ublksrv_get_flags(const struct ublksrv_io_desc *iod)
> >         return iod->op_flags >> 8;
> >  }
> >
> > +struct ublk_auto_buf_reg {
> > +       __s32  ring_fd;
> > +       __u16  index;
> > +#define UBLK_AUTO_BUF_REGISTERED_RING            (1 << 0)
> > +       __u8   flags;
> 
> The flag could potentially be stored in ublk_io's flags field instead
> to avoid taking up this byte.

`ublk_auto_buf_reg` occupies exactly the ->addr space in both 'struct ublk_io'
and 'struct ublksrv_io_cmd', so this way doesn't take an extra byte, and it
keeps the code simple by reusing the same layout for dealing with auto_buf_reg
from both 'struct ublk_io' and 'struct ublksrv_io_cmd'.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 6/7] ublk: register buffer to specified io_uring & buf index via UBLK_F_AUTO_BUF_REG
  2025-04-30 15:45     ` Ming Lei
@ 2025-04-30 16:30       ` Caleb Sander Mateos
  2025-05-02 14:09         ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: Caleb Sander Mateos @ 2025-04-30 16:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Wed, Apr 30, 2025 at 8:45 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Apr 28, 2025 at 05:52:28PM -0700, Caleb Sander Mateos wrote:
> > On Mon, Apr 28, 2025 at 2:45 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > Add UBLK_F_AUTO_BUF_REG for supporting to register buffer automatically
> > > to specified io_uring context and buffer index.
> > >
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  drivers/block/ublk_drv.c      | 56 ++++++++++++++++++++++++++++-------
> > >  include/uapi/linux/ublk_cmd.h | 38 ++++++++++++++++++++++++
> > >  2 files changed, 84 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > index 1fd20e481a60..e82618442749 100644
> > > --- a/drivers/block/ublk_drv.c
> > > +++ b/drivers/block/ublk_drv.c
> > > @@ -66,7 +66,8 @@
> > >                 | UBLK_F_USER_COPY \
> > >                 | UBLK_F_ZONED \
> > >                 | UBLK_F_USER_RECOVERY_FAIL_IO \
> > > -               | UBLK_F_UPDATE_SIZE)
> > > +               | UBLK_F_UPDATE_SIZE \
> > > +               | UBLK_F_AUTO_BUF_REG)
> > >
> > >  #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
> > >                 | UBLK_F_USER_RECOVERY_REISSUE \
> > > @@ -146,7 +147,10 @@ struct ublk_uring_cmd_pdu {
> > >
> > >  struct ublk_io {
> > >         /* userspace buffer address from io cmd */
> > > -       __u64   addr;
> > > +       union {
> > > +               __u64   addr;
> > > +               struct ublk_auto_buf_reg buf;
> >
> > Maybe add a comment justifying why these fields can overlap? From my
> > understanding, buf is valid iff UBLK_F_AUTO_BUF_REG is set on the
> > ublk_queue and addr is valid iff neither UBLK_F_USER_COPY,
> > UBLK_F_SUPPORT_ZERO_COPY, nor UBLK_F_AUTO_BUF_REG is set.
>
> ->addr is for storing the userspace buffer, which is only used in
> non-zc cases(zc, auto_buf_reg) or user copy case.

Right, could you add a comment to that effect? I think using
overlapping fields is subtle and has the potential to break in the
future if the usage of the fields changes. Documenting the assumptions
clearly would go a long way.

>
> >
> > > +       };
> > >         unsigned int flags;
> > >         int res;
> > >
> > > @@ -626,7 +630,7 @@ static inline bool ublk_support_zero_copy(const struct ublk_queue *ubq)
> > >
> > >  static inline bool ublk_support_auto_buf_reg(const struct ublk_queue *ubq)
> > >  {
> > > -       return false;
> > > +       return ubq->flags & UBLK_F_AUTO_BUF_REG;
> > >  }
> > >
> > >  static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
> > > @@ -1177,6 +1181,16 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq,
> > >                 blk_mq_end_request(rq, BLK_STS_IOERR);
> > >  }
> > >
> > > +
> > > +static inline void ublk_init_auto_buf_reg(const struct ublk_io *io,
> > > +                                         struct io_buf_data *data)
> > > +{
> > > +       data->index = io->buf.index;
> > > +       data->ring_fd = io->buf.ring_fd;
> > > +       data->has_fd = true;
> > > +       data->registered_fd = io->buf.flags & UBLK_AUTO_BUF_REGISTERED_RING;
> > > +}
> > > +
> > >  static bool ublk_auto_buf_reg(struct ublk_queue *ubq, struct request *req,
> > >                               struct ublk_io *io, unsigned int issue_flags)
> > >  {
> > > @@ -1187,6 +1201,9 @@ static bool ublk_auto_buf_reg(struct ublk_queue *ubq, struct request *req,
> > >         };
> > >         int ret;
> > >
> > > +       if (ublk_support_auto_buf_reg(ubq))
> >
> > This check seems redundant with the check in the caller? Same comment
> > about ublk_auto_buf_unreg(). That would allow you to avoid adding the
> > ubq argument to ublk_auto_buf_unreg().
>
> Yeah, actually I removed one feature which just registers buffer to
> the uring command context, then forget to update the check.
>
> >
> > > +               ublk_init_auto_buf_reg(io, &data);
> > > +
> > >         /* one extra reference is dropped by ublk_io_release */
> > >         ublk_init_req_ref(ubq, req, 2);
> > >         ret = io_buffer_register_bvec(io->cmd, &data, issue_flags);
> > > @@ -2045,7 +2062,7 @@ static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq,
> > >                  */
> > >                 if (!buf_addr && !ublk_need_get_data(ubq))
> > >                         goto out;
> > > -       } else if (buf_addr) {
> > > +       } else if (buf_addr && !ublk_support_auto_buf_reg(ubq)) {
> > >                 /* User copy requires addr to be unset */
> > >                 ret = -EINVAL;
> > >                 goto out;
> > > @@ -2058,13 +2075,17 @@ static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq,
> > >         return ret;
> > >  }
> > >
> > > -static void ublk_auto_buf_unreg(struct ublk_io *io, struct io_uring_cmd *cmd,
> > > +static void ublk_auto_buf_unreg(const struct ublk_queue *ubq,
> > > +                               struct ublk_io *io, struct io_uring_cmd *cmd,
> > >                                 struct request *req, unsigned int issue_flags)
> > >  {
> > >         struct io_buf_data data = {
> > >                 .index = req->tag,
> > >         };
> > >
> > > +       if (ublk_support_auto_buf_reg(ubq))
> > > +               ublk_init_auto_buf_reg(io, &data);
> > > +
> > >         WARN_ON_ONCE(io_buffer_unregister_bvec(cmd, &data, issue_flags));
> > >         io->flags &= ~UBLK_IO_FLAG_AUTO_BUF_REG;
> > >  }
> > > @@ -2088,7 +2109,8 @@ static int ublk_commit_and_fetch(const struct ublk_queue *ubq,
> > >                 if (!ub_cmd->addr && (!ublk_need_get_data(ubq) ||
> > >                                         req_op(req) == REQ_OP_READ))
> > >                         return -EINVAL;
> > > -       } else if (req_op(req) != REQ_OP_ZONE_APPEND && ub_cmd->addr) {
> > > +       } else if ((req_op(req) != REQ_OP_ZONE_APPEND &&
> > > +                               !ublk_support_auto_buf_reg(ubq)) && ub_cmd->addr) {
> > >                 /*
> > >                  * User copy requires addr to be unset when command is
> > >                  * not zone append
> > > @@ -2097,7 +2119,7 @@ static int ublk_commit_and_fetch(const struct ublk_queue *ubq,
> > >         }
> > >
> > >         if (io->flags & UBLK_IO_FLAG_AUTO_BUF_REG)
> > > -               ublk_auto_buf_unreg(io, cmd, req, issue_flags);
> > > +               ublk_auto_buf_unreg(ubq, io, cmd, req, issue_flags);
> > >
> > >         ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
> > >
> > > @@ -2788,6 +2810,11 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
> > >         else if (!(info.flags & UBLK_F_UNPRIVILEGED_DEV))
> > >                 return -EPERM;
> > >
> > > +       /* F_AUTO_BUF_REG and F_SUPPORT_ZERO_COPY can't co-exist */
> > > +       if ((info.flags & UBLK_F_AUTO_BUF_REG) &&
> > > +                       (info.flags & UBLK_F_SUPPORT_ZERO_COPY))
> > > +               return -EINVAL;
> > > +
> > >         /* forbid nonsense combinations of recovery flags */
> > >         switch (info.flags & UBLK_F_ALL_RECOVERY_FLAGS) {
> > >         case 0:
> > > @@ -2817,8 +2844,11 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
> > >                  * For USER_COPY, we depends on userspace to fill request
> > >                  * buffer by pwrite() to ublk char device, which can't be
> > >                  * used for unprivileged device
> > > +                *
> > > +                * Same with zero copy or auto buffer register.
> > >                  */
> > > -               if (info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY))
> > > +               if (info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY |
> > > +                                       UBLK_F_AUTO_BUF_REG))
> > >                         return -EINVAL;
> > >         }
> > >
> > > @@ -2876,17 +2906,22 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
> > >                 UBLK_F_URING_CMD_COMP_IN_TASK;
> > >
> > >         /* GET_DATA isn't needed any more with USER_COPY or ZERO COPY */
> > > -       if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY))
> > > +       if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY |
> > > +                               UBLK_F_AUTO_BUF_REG))
> > >                 ub->dev_info.flags &= ~UBLK_F_NEED_GET_DATA;
> > >
> > >         /*
> > >          * Zoned storage support requires reuse `ublksrv_io_cmd->addr` for
> > >          * returning write_append_lba, which is only allowed in case of
> > >          * user copy or zero copy
> > > +        *
> > > +        * UBLK_F_AUTO_BUF_REG can't be enabled for zoned because it need
> > > +        * the space for getting ring_fd and buffer index.
> > >          */
> > >         if (ublk_dev_is_zoned(ub) &&
> > >             (!IS_ENABLED(CONFIG_BLK_DEV_ZONED) || !(ub->dev_info.flags &
> > > -            (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)))) {
> > > +            (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)) ||
> > > +            (ub->dev_info.flags & UBLK_F_AUTO_BUF_REG))) {
> > >                 ret = -EINVAL;
> > >                 goto out_free_dev_number;
> > >         }
> > > @@ -3403,6 +3438,7 @@ static int __init ublk_init(void)
> > >
> > >         BUILD_BUG_ON((u64)UBLKSRV_IO_BUF_OFFSET +
> > >                         UBLKSRV_IO_BUF_TOTAL_SIZE < UBLKSRV_IO_BUF_OFFSET);
> > > +       BUILD_BUG_ON(sizeof(struct ublk_auto_buf_reg) != sizeof(__u64));
> > >
> > >         init_waitqueue_head(&ublk_idr_wq);
> > >
> > > diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> > > index be5c6c6b16e0..3d7c8c69cf06 100644
> > > --- a/include/uapi/linux/ublk_cmd.h
> > > +++ b/include/uapi/linux/ublk_cmd.h
> > > @@ -219,6 +219,30 @@
> > >   */
> > >  #define UBLK_F_UPDATE_SIZE              (1ULL << 10)
> > >
> > > +/*
> > > + * request buffer is registered automatically to ublk server specified
> > > + * io_uring context before delivering this io command to ublk server,
> > > + * meantime it is un-registered automatically when completing this io
> > > + * command.
> > > + *
> > > + * For using this feature:
> > > + *
> > > + * - ublk server has to create sparse buffer table
> > > + *
> > > + * - pass io_ring context FD from `ublksrv_io_cmd.buf.ring_fd`, and the FD
> > > + *   can be registered io_ring FD if `UBLK_AUTO_BUF_REGISTERED_RING` is set
> > > + *   in `ublksrv_io_cmd.flags`, or plain FD
> > > + *
> > > + * - pass buffer index from `ublksrv_io_cmd.buf.index`
> > > + *
> > > + * This way avoids extra cost from two uring_cmd, but also simplifies backend
> > > + * implementation, such as, the dependency on IO_REGISTER_IO_BUF and
> > > + * IO_UNREGISTER_IO_BUF becomes not necessary.
> > > + *
> > > + * This feature isn't available for UBLK_F_ZONED
> > > + */
> > > +#define UBLK_F_AUTO_BUF_REG    (1ULL << 11)
> > > +
> > >  /* device state */
> > >  #define UBLK_S_DEV_DEAD        0
> > >  #define UBLK_S_DEV_LIVE        1
> > > @@ -339,6 +363,14 @@ static inline __u32 ublksrv_get_flags(const struct ublksrv_io_desc *iod)
> > >         return iod->op_flags >> 8;
> > >  }
> > >
> > > +struct ublk_auto_buf_reg {
> > > +       __s32  ring_fd;
> > > +       __u16  index;
> > > +#define UBLK_AUTO_BUF_REGISTERED_RING            (1 << 0)
> > > +       __u8   flags;
> >
> > The flag could potentially be stored in ublk_io's flags field instead
> > to avoid taking up this byte.
>
> `ublk_auto_buf_reg` takes the exact ->addr space in both 'struct ublk_io'
> and 'struct ublksrv_io_cmd', this way won't take extra byte, but keep code simple
> and reuse for dealing with auto_buf_reg from both 'struct ublk_io' and
> 'struct ublksrv_io_cmd'.

Sure, either way is fine by me.

Best,
Caleb


* Re: [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-04-30 15:34     ` Ming Lei
@ 2025-05-02  1:31       ` Caleb Sander Mateos
  2025-05-02 15:59         ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: Caleb Sander Mateos @ 2025-05-02  1:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Wed, Apr 30, 2025 at 8:34 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Apr 28, 2025 at 05:43:12PM -0700, Caleb Sander Mateos wrote:
> > On Mon, Apr 28, 2025 at 2:44 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() for
> > > supporting to register/unregister bvec buffer to specified io_uring,
> > > which FD is usually passed from userspace.
> > >
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  include/linux/io_uring/cmd.h |  4 ++
> > >  io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
> > >  2 files changed, 67 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> > > index 78fa336a284b..7516fe5cd606 100644
> > > --- a/include/linux/io_uring/cmd.h
> > > +++ b/include/linux/io_uring/cmd.h
> > > @@ -25,6 +25,10 @@ struct io_uring_cmd_data {
> > >
> > >  struct io_buf_data {
> > >         unsigned short index;
> > > +       bool has_fd;
> > > +       bool registered_fd;
> > > +
> > > +       int ring_fd;
> > >         struct request *rq;
> > >         void (*release)(void *);
> > >  };
> > > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > > index 5f8ab130a573..701dd33fecf7 100644
> > > --- a/io_uring/rsrc.c
> > > +++ b/io_uring/rsrc.c
> > > @@ -969,21 +969,6 @@ static int __io_buffer_register_bvec(struct io_ring_ctx *ctx,
> > >         return 0;
> > >  }
> > >
> > > -int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> > > -                           struct io_buf_data *buf,
> > > -                           unsigned int issue_flags)
> > > -{
> > > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > > -       int ret;
> > > -
> > > -       io_ring_submit_lock(ctx, issue_flags);
> > > -       ret = __io_buffer_register_bvec(ctx, buf);
> > > -       io_ring_submit_unlock(ctx, issue_flags);
> > > -
> > > -       return ret;
> > > -}
> > > -EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> > > -
> > >  static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> > >                                        struct io_buf_data *buf)
> > >  {
> > > @@ -1006,19 +991,77 @@ static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> > >         return 0;
> > >  }
> > >
> > > -int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
> > > -                             struct io_buf_data *buf,
> > > -                             unsigned int issue_flags)
> > > +static inline int do_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > > +                                   struct io_buf_data *buf,
> > > +                                   unsigned int issue_flags,
> > > +                                   bool reg)
> > >  {
> > > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > >         int ret;
> > >
> > >         io_ring_submit_lock(ctx, issue_flags);
> > > -       ret = __io_buffer_unregister_bvec(ctx, buf);
> > > +       if (reg)
> > > +               ret = __io_buffer_register_bvec(ctx, buf);
> > > +       else
> > > +               ret = __io_buffer_unregister_bvec(ctx, buf);
> >
> > It feels like unifying __io_buffer_register_bvec() and
> > __io_buffer_unregister_bvec() would belong better in the prior patch
> > that changes their signatures.
>
> Can you share how to do above in previous patch?

I was thinking you could define do_reg_unreg_bvec() in the previous
patch. It's a logical step once you've extracted out all the
differences between io_buffer_register_bvec() and
io_buffer_unregister_bvec() into the helpers
__io_buffer_register_bvec() and __io_buffer_unregister_bvec(). But
either way is fine.

>
> >
> > >         io_ring_submit_unlock(ctx, issue_flags);
> > >
> > >         return ret;
> > >  }
> > > +
> > > +static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > > +                                   struct io_buf_data *buf,
> > > +                                   unsigned int issue_flags,
> > > +                                   bool reg)
> > > +{
> > > +       struct io_ring_ctx *remote_ctx = ctx;
> > > +       struct file *file = NULL;
> > > +       int ret;
> > > +
> > > +       if (buf->has_fd) {
> > > +               file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);
> > > +               if (IS_ERR(file))
> > > +                       return PTR_ERR(file);
> >
> > It would be good to avoid the overhead of this lookup and
> > reference-counting in the I/O path. Would it be possible to move this
> > lookup to when UBLK_IO_FETCH_REQ (and UBLK_IO_COMMIT_AND_FETCH_REQ, if
> > it specifies a different ring_fd) is submitted? I guess that might
> > require storing an extra io_ring_ctx pointer in struct ublk_io.
>
> Let's start from the flexible way & simple implementation.
>
> Any optimization & improvement can be done as follow-up.

Sure, we can start with this as-is. But I suspect the extra
reference counting here will significantly decrease the benefit of the
auto buffer register feature.

>
> Each command may register buffer to different io_uring context,
> it can't be done in UBLK_IO_FETCH_REQ stage, because new IO with same
> tag may register buffer to new io_uring context.

Right, if UBLK_IO_COMMIT_AND_FETCH_REQ specifies a different io_uring
fd, then we'd have to look it up anew. But it seems likely that all
UBLK_IO_COMMIT_AND_FETCH_REQs for a single tag will specify the same
io_uring (certainly that's how our ublk server works). And in that
case, the I/O could just reuse the io_uring context that was looked up
for the prior UBLK_IO_(COMMIT_AND_)FETCH_REQ.

>
> But it can be optimized in future for one specific use case with feature
> flag.
>
> >
> > > +               remote_ctx = file->private_data;
> > > +               if (!remote_ctx)
> > > +                       return -EINVAL;
> > > +       }
> > > +
> > > +       if (remote_ctx == ctx) {
> > > +               do_reg_unreg_bvec(ctx, buf, issue_flags, reg);
> > > +       } else {
> > > +               if (!(issue_flags & IO_URING_F_UNLOCKED))
> > > +                       mutex_unlock(&ctx->uring_lock);
> > > +
> > > +               do_reg_unreg_bvec(remote_ctx, buf, IO_URING_F_UNLOCKED, reg);
> > > +
> > > +               if (!(issue_flags & IO_URING_F_UNLOCKED))
> > > +                       mutex_lock(&ctx->uring_lock);
> > > +       }
> > > +
> > > +       if (file)
> > > +               fput(file);
> > > +
> > > +       return ret;
> > > +}
> > > +
> > > +int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> > > +                           struct io_buf_data *buf,
> > > +                           unsigned int issue_flags)
> >
> > If buf->has_fd is set, this struct io_uring_cmd *cmd is unused. Could
> > you define separate functions that take a struct io_uring_cmd * vs. a
> > ring_fd?
>
> The ring_fd may point to the same io_uring context with 'io_uring_cmd',
> we need this information for dealing with io_uring context lock.

Good point.

Best,
Caleb


* Re: [RFC PATCH 6/7] ublk: register buffer to specified io_uring & buf index via UBLK_F_AUTO_BUF_REG
  2025-04-30 16:30       ` Caleb Sander Mateos
@ 2025-05-02 14:09         ` Ming Lei
  0 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2025-05-02 14:09 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Wed, Apr 30, 2025 at 09:30:16AM -0700, Caleb Sander Mateos wrote:
> On Wed, Apr 30, 2025 at 8:45 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Mon, Apr 28, 2025 at 05:52:28PM -0700, Caleb Sander Mateos wrote:
> > > On Mon, Apr 28, 2025 at 2:45 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > Add UBLK_F_AUTO_BUF_REG for supporting to register buffer automatically
> > > > to specified io_uring context and buffer index.
> > > >
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >  drivers/block/ublk_drv.c      | 56 ++++++++++++++++++++++++++++-------
> > > >  include/uapi/linux/ublk_cmd.h | 38 ++++++++++++++++++++++++
> > > >  2 files changed, 84 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > index 1fd20e481a60..e82618442749 100644
> > > > --- a/drivers/block/ublk_drv.c
> > > > +++ b/drivers/block/ublk_drv.c
> > > > @@ -66,7 +66,8 @@
> > > >                 | UBLK_F_USER_COPY \
> > > >                 | UBLK_F_ZONED \
> > > >                 | UBLK_F_USER_RECOVERY_FAIL_IO \
> > > > -               | UBLK_F_UPDATE_SIZE)
> > > > +               | UBLK_F_UPDATE_SIZE \
> > > > +               | UBLK_F_AUTO_BUF_REG)
> > > >
> > > >  #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \
> > > >                 | UBLK_F_USER_RECOVERY_REISSUE \
> > > > @@ -146,7 +147,10 @@ struct ublk_uring_cmd_pdu {
> > > >
> > > >  struct ublk_io {
> > > >         /* userspace buffer address from io cmd */
> > > > -       __u64   addr;
> > > > +       union {
> > > > +               __u64   addr;
> > > > +               struct ublk_auto_buf_reg buf;
> > >
> > > Maybe add a comment justifying why these fields can overlap? From my
> > > understanding, buf is valid iff UBLK_F_AUTO_BUF_REG is set on the
> > > ublk_queue and addr is valid iff neither UBLK_F_USER_COPY,
> > > UBLK_F_SUPPORT_ZERO_COPY, nor UBLK_F_AUTO_BUF_REG is set.
> >
> > ->addr is for storing the userspace buffer, which is only used in
> > non-zc cases(zc, auto_buf_reg) or user copy case.
> 
> Right, could you add a comment to that effect? I think using

Sure.

> overlapping fields is subtle and has the potential to break in the
> future if the usage of the fields changes. Documenting the assumptions
> clearly would go a long way.

The usage is actually reliable and can be well documented:

- ->addr is used if data copy is required

- ->zone_append_lba is for zoned, which requires UBLK_F_USER_COPY or
UBLK_F_SUPPORT_ZERO_COPY

- 'ublk_auto_buf_reg' is for UBLK_F_AUTO_BUF_REG, which is effectively a
special (automatic) form of zero copy; meanwhile zoned can't be supported
with this feature



Thanks, 
Ming



* Re: [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-05-02  1:31       ` Caleb Sander Mateos
@ 2025-05-02 15:59         ` Ming Lei
  2025-05-02 21:21           ` Caleb Sander Mateos
  0 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2025-05-02 15:59 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Thu, May 01, 2025 at 06:31:03PM -0700, Caleb Sander Mateos wrote:
> On Wed, Apr 30, 2025 at 8:34 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Mon, Apr 28, 2025 at 05:43:12PM -0700, Caleb Sander Mateos wrote:
> > > On Mon, Apr 28, 2025 at 2:44 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() for
> > > > supporting to register/unregister bvec buffer to specified io_uring,
> > > > which FD is usually passed from userspace.
> > > >
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >  include/linux/io_uring/cmd.h |  4 ++
> > > >  io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
> > > >  2 files changed, 67 insertions(+), 20 deletions(-)
> > > >
> > > > diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> > > > index 78fa336a284b..7516fe5cd606 100644
> > > > --- a/include/linux/io_uring/cmd.h
> > > > +++ b/include/linux/io_uring/cmd.h
> > > > @@ -25,6 +25,10 @@ struct io_uring_cmd_data {
> > > >
> > > >  struct io_buf_data {
> > > >         unsigned short index;
> > > > +       bool has_fd;
> > > > +       bool registered_fd;
> > > > +
> > > > +       int ring_fd;
> > > >         struct request *rq;
> > > >         void (*release)(void *);
> > > >  };
> > > > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > > > index 5f8ab130a573..701dd33fecf7 100644
> > > > --- a/io_uring/rsrc.c
> > > > +++ b/io_uring/rsrc.c
> > > > @@ -969,21 +969,6 @@ static int __io_buffer_register_bvec(struct io_ring_ctx *ctx,
> > > >         return 0;
> > > >  }
> > > >
> > > > -int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> > > > -                           struct io_buf_data *buf,
> > > > -                           unsigned int issue_flags)
> > > > -{
> > > > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > > > -       int ret;
> > > > -
> > > > -       io_ring_submit_lock(ctx, issue_flags);
> > > > -       ret = __io_buffer_register_bvec(ctx, buf);
> > > > -       io_ring_submit_unlock(ctx, issue_flags);
> > > > -
> > > > -       return ret;
> > > > -}
> > > > -EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> > > > -
> > > >  static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> > > >                                        struct io_buf_data *buf)
> > > >  {
> > > > @@ -1006,19 +991,77 @@ static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> > > >         return 0;
> > > >  }
> > > >
> > > > -int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
> > > > -                             struct io_buf_data *buf,
> > > > -                             unsigned int issue_flags)
> > > > +static inline int do_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > > > +                                   struct io_buf_data *buf,
> > > > +                                   unsigned int issue_flags,
> > > > +                                   bool reg)
> > > >  {
> > > > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > > >         int ret;
> > > >
> > > >         io_ring_submit_lock(ctx, issue_flags);
> > > > -       ret = __io_buffer_unregister_bvec(ctx, buf);
> > > > +       if (reg)
> > > > +               ret = __io_buffer_register_bvec(ctx, buf);
> > > > +       else
> > > > +               ret = __io_buffer_unregister_bvec(ctx, buf);
> > >
> > > It feels like unifying __io_buffer_register_bvec() and
> > > __io_buffer_unregister_bvec() would belong better in the prior patch
> > > that changes their signatures.
> >
> > Can you share how to do above in previous patch?
> 
> I was thinking you could define do_reg_unreg_bvec() in the previous
> patch. It's a logical step once you've extracted out all the
> differences between io_buffer_register_bvec() and
> io_buffer_unregister_bvec() into the helpers
> __io_buffer_register_bvec() and __io_buffer_unregister_bvec(). But
> either way is fine.

The 'has_fd' and 'ring_fd' fields aren't added yet at that point, so the
do_reg_unreg_bvec() defined there would be quite simple; it doesn't look
like a big difference, but I can do that...

> 
> >
> > >
> > > >         io_ring_submit_unlock(ctx, issue_flags);
> > > >
> > > >         return ret;
> > > >  }
> > > > +
> > > > +static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > > > +                                   struct io_buf_data *buf,
> > > > +                                   unsigned int issue_flags,
> > > > +                                   bool reg)
> > > > +{
> > > > +       struct io_ring_ctx *remote_ctx = ctx;
> > > > +       struct file *file = NULL;
> > > > +       int ret;
> > > > +
> > > > +       if (buf->has_fd) {
> > > > +               file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);
> > > > +               if (IS_ERR(file))
> > > > +                       return PTR_ERR(file);
> > >
> > > It would be good to avoid the overhead of this lookup and
> > > reference-counting in the I/O path. Would it be possible to move this
> > > lookup to when UBLK_IO_FETCH_REQ (and UBLK_IO_COMMIT_AND_FETCH_REQ, if
> > > it specifies a different ring_fd) is submitted? I guess that might
> > > require storing an extra io_ring_ctx pointer in struct ublk_io.
> >
> > Let's start from the flexible way & simple implementation.
> >
> > Any optimization & improvement can be done as follow-up.
> 
> Sure, we can start with this as-is. But I suspect the extra
> reference-counting here will significantly decrease the benefit of the
> auto-register feature.

The reference counting should only be needed when registering the buffer
to an external ring, which is likely slow anyway because of the
cross-ring access...

Maybe we can start with automatic buffer registration for the ubq_daemon
context only, and meantime allow registering the buffer from an external
io_uring by adding a per-io spinlock, which may help the per-io-task work
Uday is working on too.

And the interface still allows supporting automatic buffer registration
to an external io_uring later: since `ublk_auto_buf_reg` includes a
'flags' field, we can enable it in the future once an efficient
implementation is figured out.

What do you think of this approach?

> 
> >
> > Each command may register buffer to different io_uring context,
> > it can't be done in UBLK_IO_FETCH_REQ stage, because new IO with same
> > tag may register buffer to new io_uring context.
> 
> Right, if UBLK_IO_COMMIT_AND_FETCH_REQ specifies a different io_uring
> fd, then we'd have to look it up anew. But it seems likely that all
> UBLK_IO_COMMIT_AND_FETCH_REQs for a single tag will specify the same
> io_uring (certainly that's how our ublk server works). And in that
> case, the I/O could just reuse the io_uring context that was looked up
> for the prior UBLK_IO_(COMMIT_AND_)FETCH_REQ.

That is a special case; a follow-up feature flag can be added to
optimize just that case.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-05-02 15:59         ` Ming Lei
@ 2025-05-02 21:21           ` Caleb Sander Mateos
  2025-05-03  1:00             ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: Caleb Sander Mateos @ 2025-05-02 21:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Fri, May 2, 2025 at 8:59 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Thu, May 01, 2025 at 06:31:03PM -0700, Caleb Sander Mateos wrote:
> > On Wed, Apr 30, 2025 at 8:34 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Mon, Apr 28, 2025 at 05:43:12PM -0700, Caleb Sander Mateos wrote:
> > > > On Mon, Apr 28, 2025 at 2:44 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > >
> > > > > Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() for
> > > > > supporting to register/unregister bvec buffer to specified io_uring,
> > > > > which FD is usually passed from userspace.
> > > > >
> > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > ---
> > > > >  include/linux/io_uring/cmd.h |  4 ++
> > > > >  io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
> > > > >  2 files changed, 67 insertions(+), 20 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> > > > > index 78fa336a284b..7516fe5cd606 100644
> > > > > --- a/include/linux/io_uring/cmd.h
> > > > > +++ b/include/linux/io_uring/cmd.h
> > > > > @@ -25,6 +25,10 @@ struct io_uring_cmd_data {
> > > > >
> > > > >  struct io_buf_data {
> > > > >         unsigned short index;
> > > > > +       bool has_fd;
> > > > > +       bool registered_fd;
> > > > > +
> > > > > +       int ring_fd;
> > > > >         struct request *rq;
> > > > >         void (*release)(void *);
> > > > >  };
> > > > > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > > > > index 5f8ab130a573..701dd33fecf7 100644
> > > > > --- a/io_uring/rsrc.c
> > > > > +++ b/io_uring/rsrc.c
> > > > > @@ -969,21 +969,6 @@ static int __io_buffer_register_bvec(struct io_ring_ctx *ctx,
> > > > >         return 0;
> > > > >  }
> > > > >
> > > > > -int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> > > > > -                           struct io_buf_data *buf,
> > > > > -                           unsigned int issue_flags)
> > > > > -{
> > > > > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > > > > -       int ret;
> > > > > -
> > > > > -       io_ring_submit_lock(ctx, issue_flags);
> > > > > -       ret = __io_buffer_register_bvec(ctx, buf);
> > > > > -       io_ring_submit_unlock(ctx, issue_flags);
> > > > > -
> > > > > -       return ret;
> > > > > -}
> > > > > -EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> > > > > -
> > > > >  static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> > > > >                                        struct io_buf_data *buf)
> > > > >  {
> > > > > @@ -1006,19 +991,77 @@ static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> > > > >         return 0;
> > > > >  }
> > > > >
> > > > > -int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
> > > > > -                             struct io_buf_data *buf,
> > > > > -                             unsigned int issue_flags)
> > > > > +static inline int do_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > > > > +                                   struct io_buf_data *buf,
> > > > > +                                   unsigned int issue_flags,
> > > > > +                                   bool reg)
> > > > >  {
> > > > > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > > > >         int ret;
> > > > >
> > > > >         io_ring_submit_lock(ctx, issue_flags);
> > > > > -       ret = __io_buffer_unregister_bvec(ctx, buf);
> > > > > +       if (reg)
> > > > > +               ret = __io_buffer_register_bvec(ctx, buf);
> > > > > +       else
> > > > > +               ret = __io_buffer_unregister_bvec(ctx, buf);
> > > >
> > > > It feels like unifying __io_buffer_register_bvec() and
> > > > __io_buffer_unregister_bvec() would belong better in the prior patch
> > > > that changes their signatures.
> > >
> > > Can you share how to do above in previous patch?
> >
> > I was thinking you could define do_reg_unreg_bvec() in the previous
> > patch. It's a logical step once you've extracted out all the
> > differences between io_buffer_register_bvec() and
> > io_buffer_unregister_bvec() into the helpers
> > __io_buffer_register_bvec() and __io_buffer_unregister_bvec(). But
> > either way is fine.
>
> The 'has_fd' and 'ring_fd' fields aren't added yet at that point, so the
> defined do_reg_unreg_bvec() would be quite simple and make little
> difference, but I can do that...
>
> >
> > >
> > > >
> > > > >         io_ring_submit_unlock(ctx, issue_flags);
> > > > >
> > > > >         return ret;
> > > > >  }
> > > > > +
> > > > > +static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > > > > +                                   struct io_buf_data *buf,
> > > > > +                                   unsigned int issue_flags,
> > > > > +                                   bool reg)
> > > > > +{
> > > > > +       struct io_ring_ctx *remote_ctx = ctx;
> > > > > +       struct file *file = NULL;
> > > > > +       int ret;
> > > > > +
> > > > > +       if (buf->has_fd) {
> > > > > +               file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);
> > > > > +               if (IS_ERR(file))
> > > > > +                       return PTR_ERR(file);
> > > >
> > > > It would be good to avoid the overhead of this lookup and
> > > > reference-counting in the I/O path. Would it be possible to move this
> > > > lookup to when UBLK_IO_FETCH_REQ (and UBLK_IO_COMMIT_AND_FETCH_REQ, if
> > > > it specifies a different ring_fd) is submitted? I guess that might
> > > > require storing an extra io_ring_ctx pointer in struct ublk_io.
> > >
> > > Let's start from the flexible way & simple implementation.
> > >
> > > Any optimization & improvement can be done as follow-up.
> >
> > Sure, we can start with this as-is. But I suspect the extra
> > reference-counting here will significantly decrease the benefit of the
> > auto-register feature.
>
> The reference counting should only be needed when registering the buffer
> to an external ring, which is likely slow anyway because of the
> cross-ring access...

The current code is incrementing and decrementing the io_uring file
reference count even when remote_ctx == ctx, right? I agree it
should definitely be possible to skip the reference count in that
case, as this code is already running in task work context for a
command on the io_uring. It should also be possible to avoid atomic
reference-counting in the UBLK_AUTO_BUF_REGISTERED_RING case too.

>
> Maybe we can start with automatic buffer registration for the ubq_daemon
> context only, and meantime allow registering the buffer from an external
> io_uring by adding a per-io spinlock, which may help the per-io-task work
> Uday is working on too.

I'm not sure I understand why a spinlock would be required? In Uday's
patch set, each ublk_io still belongs to a single task. So no
additional locking should be required.

>
> And the interface still allows supporting automatic buffer registration
> to an external io_uring later: since `ublk_auto_buf_reg` includes a
> 'flags' field, we can enable it in the future once an efficient
> implementation is figured out.

Sure, we can definitely start with support only for auto-registering
the buffer with the ublk command's own io_uring. Implementing a flag
in the future to specify a different io_uring seems like a good
approach.

Best,
Caleb


* Re: [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-05-02 21:21           ` Caleb Sander Mateos
@ 2025-05-03  1:00             ` Ming Lei
  2025-05-03 18:55               ` Caleb Sander Mateos
  0 siblings, 1 reply; 29+ messages in thread
From: Ming Lei @ 2025-05-03  1:00 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Fri, May 02, 2025 at 02:21:05PM -0700, Caleb Sander Mateos wrote:
> On Fri, May 2, 2025 at 8:59 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Thu, May 01, 2025 at 06:31:03PM -0700, Caleb Sander Mateos wrote:
> > > On Wed, Apr 30, 2025 at 8:34 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Mon, Apr 28, 2025 at 05:43:12PM -0700, Caleb Sander Mateos wrote:
> > > > > On Mon, Apr 28, 2025 at 2:44 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > >
> > > > > > Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() for
> > > > > > supporting to register/unregister bvec buffer to specified io_uring,
> > > > > > which FD is usually passed from userspace.
> > > > > >
> > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > > ---
> > > > > >  include/linux/io_uring/cmd.h |  4 ++
> > > > > >  io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
> > > > > >  2 files changed, 67 insertions(+), 20 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> > > > > > index 78fa336a284b..7516fe5cd606 100644
> > > > > > --- a/include/linux/io_uring/cmd.h
> > > > > > +++ b/include/linux/io_uring/cmd.h
> > > > > > @@ -25,6 +25,10 @@ struct io_uring_cmd_data {
> > > > > >
> > > > > >  struct io_buf_data {
> > > > > >         unsigned short index;
> > > > > > +       bool has_fd;
> > > > > > +       bool registered_fd;
> > > > > > +
> > > > > > +       int ring_fd;
> > > > > >         struct request *rq;
> > > > > >         void (*release)(void *);
> > > > > >  };
> > > > > > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > > > > > index 5f8ab130a573..701dd33fecf7 100644
> > > > > > --- a/io_uring/rsrc.c
> > > > > > +++ b/io_uring/rsrc.c
> > > > > > @@ -969,21 +969,6 @@ static int __io_buffer_register_bvec(struct io_ring_ctx *ctx,
> > > > > >         return 0;
> > > > > >  }
> > > > > >
> > > > > > -int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> > > > > > -                           struct io_buf_data *buf,
> > > > > > -                           unsigned int issue_flags)
> > > > > > -{
> > > > > > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > > > > > -       int ret;
> > > > > > -
> > > > > > -       io_ring_submit_lock(ctx, issue_flags);
> > > > > > -       ret = __io_buffer_register_bvec(ctx, buf);
> > > > > > -       io_ring_submit_unlock(ctx, issue_flags);
> > > > > > -
> > > > > > -       return ret;
> > > > > > -}
> > > > > > -EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> > > > > > -
> > > > > >  static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> > > > > >                                        struct io_buf_data *buf)
> > > > > >  {
> > > > > > @@ -1006,19 +991,77 @@ static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> > > > > >         return 0;
> > > > > >  }
> > > > > >
> > > > > > -int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
> > > > > > -                             struct io_buf_data *buf,
> > > > > > -                             unsigned int issue_flags)
> > > > > > +static inline int do_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > > > > > +                                   struct io_buf_data *buf,
> > > > > > +                                   unsigned int issue_flags,
> > > > > > +                                   bool reg)
> > > > > >  {
> > > > > > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > > > > >         int ret;
> > > > > >
> > > > > >         io_ring_submit_lock(ctx, issue_flags);
> > > > > > -       ret = __io_buffer_unregister_bvec(ctx, buf);
> > > > > > +       if (reg)
> > > > > > +               ret = __io_buffer_register_bvec(ctx, buf);
> > > > > > +       else
> > > > > > +               ret = __io_buffer_unregister_bvec(ctx, buf);
> > > > >
> > > > > It feels like unifying __io_buffer_register_bvec() and
> > > > > __io_buffer_unregister_bvec() would belong better in the prior patch
> > > > > that changes their signatures.
> > > >
> > > > Can you share how to do above in previous patch?
> > >
> > > I was thinking you could define do_reg_unreg_bvec() in the previous
> > > patch. It's a logical step once you've extracted out all the
> > > differences between io_buffer_register_bvec() and
> > > io_buffer_unregister_bvec() into the helpers
> > > __io_buffer_register_bvec() and __io_buffer_unregister_bvec(). But
> > > either way is fine.
> >
> > The 'has_fd' and 'ring_fd' fields aren't added yet at that point, so the
> > defined do_reg_unreg_bvec() would be quite simple and make little
> > difference, but I can do that...
> >
> > >
> > > >
> > > > >
> > > > > >         io_ring_submit_unlock(ctx, issue_flags);
> > > > > >
> > > > > >         return ret;
> > > > > >  }
> > > > > > +
> > > > > > +static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > > > > > +                                   struct io_buf_data *buf,
> > > > > > +                                   unsigned int issue_flags,
> > > > > > +                                   bool reg)
> > > > > > +{
> > > > > > +       struct io_ring_ctx *remote_ctx = ctx;
> > > > > > +       struct file *file = NULL;
> > > > > > +       int ret;
> > > > > > +
> > > > > > +       if (buf->has_fd) {
> > > > > > +               file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);
> > > > > > +               if (IS_ERR(file))
> > > > > > +                       return PTR_ERR(file);
> > > > >
> > > > > It would be good to avoid the overhead of this lookup and
> > > > > reference-counting in the I/O path. Would it be possible to move this
> > > > > lookup to when UBLK_IO_FETCH_REQ (and UBLK_IO_COMMIT_AND_FETCH_REQ, if
> > > > > it specifies a different ring_fd) is submitted? I guess that might
> > > > > require storing an extra io_ring_ctx pointer in struct ublk_io.
> > > >
> > > > Let's start from the flexible way & simple implementation.
> > > >
> > > > Any optimization & improvement can be done as follow-up.
> > >
> > > Sure, we can start with this as-is. But I suspect the extra
> > > reference-counting here will significantly decrease the benefit of the
> > > auto-register feature.
> >
> > The reference counting should only be needed when registering the buffer
> > to an external ring, which is likely slow anyway because of the
> > cross-ring access...
> 
> The current code is incrementing and decrementing the io_uring file
> reference count even if the remote_ctx == ctx, right? I agree it

Yes, but it can easily be changed to drop the inc/dec of the file
reference, since we have a flags field.

> should definitely be possible to skip the reference count in that
> case, as this code is already running in task work context for a
> command on the io_uring.

The current 'uring_cmd' instance holds one reference on the
io_ring_ctx instance.

> It should also be possible to avoid atomic
> reference-counting in the UBLK_AUTO_BUF_REGISTERED_RING case too.

For registering the buffer to an external io_uring, it is hard to avoid
grabbing the io_ring_ctx reference when it is specified via its FD.

> 
> >
> > Maybe we can start with automatic buffer registration for the ubq_daemon
> > context only, and meantime allow registering the buffer from an external
> > io_uring by adding a per-io spinlock, which may help the per-io-task work
> > Uday is working on too.
> 
> I'm not sure I understand why a spinlock would be required? In Uday's
> patch set, each ublk_io still belongs to a single task. So no
> additional locking should be required.

I think it is very useful to allow registering the io buffer in another
(non-ubq_daemon) io_uring context in the offload style.

Especially since the register/unregister io buffer uring_cmd is for
handling target IO, it should be issued in the same context as the
target io handling.

Without a per-io spinlock, it is hard to avoid the race you mentioned:

https://lore.kernel.org/linux-block/aA2pNRkBhgKsofRP@fedora/#t

in the case of a bad ublk server implementation.

> 
> >
> > And the interface still allows supporting automatic buffer registration
> > to an external io_uring later: since `ublk_auto_buf_reg` includes a
> > 'flags' field, we can enable it in the future once an efficient
> > implementation is figured out.
> 
> Sure, we can definitely start with support only for auto-registering
> the buffer with the ublk command's own io_uring. Implementing a flag
> in the future to specify a different io_uring seems like a good
> approach.

OK, I will send V2 starting with auto-registering the buffer to the ublk
uring_cmd's own io_uring first.


Thanks,
Ming



* Re: [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-05-03  1:00             ` Ming Lei
@ 2025-05-03 18:55               ` Caleb Sander Mateos
  2025-05-06  2:45                 ` Ming Lei
  0 siblings, 1 reply; 29+ messages in thread
From: Caleb Sander Mateos @ 2025-05-03 18:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Fri, May 2, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Fri, May 02, 2025 at 02:21:05PM -0700, Caleb Sander Mateos wrote:
> > On Fri, May 2, 2025 at 8:59 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Thu, May 01, 2025 at 06:31:03PM -0700, Caleb Sander Mateos wrote:
> > > > On Wed, Apr 30, 2025 at 8:34 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > >
> > > > > On Mon, Apr 28, 2025 at 05:43:12PM -0700, Caleb Sander Mateos wrote:
> > > > > > On Mon, Apr 28, 2025 at 2:44 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > > >
> > > > > > > Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() for
> > > > > > > supporting to register/unregister bvec buffer to specified io_uring,
> > > > > > > which FD is usually passed from userspace.
> > > > > > >
> > > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > > > ---
> > > > > > >  include/linux/io_uring/cmd.h |  4 ++
> > > > > > >  io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
> > > > > > >  2 files changed, 67 insertions(+), 20 deletions(-)
> > > > > > >
> > > > > > > diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> > > > > > > index 78fa336a284b..7516fe5cd606 100644
> > > > > > > --- a/include/linux/io_uring/cmd.h
> > > > > > > +++ b/include/linux/io_uring/cmd.h
> > > > > > > @@ -25,6 +25,10 @@ struct io_uring_cmd_data {
> > > > > > >
> > > > > > >  struct io_buf_data {
> > > > > > >         unsigned short index;
> > > > > > > +       bool has_fd;
> > > > > > > +       bool registered_fd;
> > > > > > > +
> > > > > > > +       int ring_fd;
> > > > > > >         struct request *rq;
> > > > > > >         void (*release)(void *);
> > > > > > >  };
> > > > > > > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > > > > > > index 5f8ab130a573..701dd33fecf7 100644
> > > > > > > --- a/io_uring/rsrc.c
> > > > > > > +++ b/io_uring/rsrc.c
> > > > > > > @@ -969,21 +969,6 @@ static int __io_buffer_register_bvec(struct io_ring_ctx *ctx,
> > > > > > >         return 0;
> > > > > > >  }
> > > > > > >
> > > > > > > -int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> > > > > > > -                           struct io_buf_data *buf,
> > > > > > > -                           unsigned int issue_flags)
> > > > > > > -{
> > > > > > > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > > > > > > -       int ret;
> > > > > > > -
> > > > > > > -       io_ring_submit_lock(ctx, issue_flags);
> > > > > > > -       ret = __io_buffer_register_bvec(ctx, buf);
> > > > > > > -       io_ring_submit_unlock(ctx, issue_flags);
> > > > > > > -
> > > > > > > -       return ret;
> > > > > > > -}
> > > > > > > -EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> > > > > > > -
> > > > > > >  static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> > > > > > >                                        struct io_buf_data *buf)
> > > > > > >  {
> > > > > > > @@ -1006,19 +991,77 @@ static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> > > > > > >         return 0;
> > > > > > >  }
> > > > > > >
> > > > > > > -int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
> > > > > > > -                             struct io_buf_data *buf,
> > > > > > > -                             unsigned int issue_flags)
> > > > > > > +static inline int do_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > > > > > > +                                   struct io_buf_data *buf,
> > > > > > > +                                   unsigned int issue_flags,
> > > > > > > +                                   bool reg)
> > > > > > >  {
> > > > > > > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > > > > > >         int ret;
> > > > > > >
> > > > > > >         io_ring_submit_lock(ctx, issue_flags);
> > > > > > > -       ret = __io_buffer_unregister_bvec(ctx, buf);
> > > > > > > +       if (reg)
> > > > > > > +               ret = __io_buffer_register_bvec(ctx, buf);
> > > > > > > +       else
> > > > > > > +               ret = __io_buffer_unregister_bvec(ctx, buf);
> > > > > >
> > > > > > It feels like unifying __io_buffer_register_bvec() and
> > > > > > __io_buffer_unregister_bvec() would belong better in the prior patch
> > > > > > that changes their signatures.
> > > > >
> > > > > Can you share how to do above in previous patch?
> > > >
> > > > I was thinking you could define do_reg_unreg_bvec() in the previous
> > > > patch. It's a logical step once you've extracted out all the
> > > > differences between io_buffer_register_bvec() and
> > > > io_buffer_unregister_bvec() into the helpers
> > > > __io_buffer_register_bvec() and __io_buffer_unregister_bvec(). But
> > > > either way is fine.
> > >
> > > The 'has_fd' and 'ring_fd' fields aren't added yet at that point, so the
> > > defined do_reg_unreg_bvec() would be quite simple and make little
> > > difference, but I can do that...
> > >
> > > >
> > > > >
> > > > > >
> > > > > > >         io_ring_submit_unlock(ctx, issue_flags);
> > > > > > >
> > > > > > >         return ret;
> > > > > > >  }
> > > > > > > +
> > > > > > > +static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > > > > > > +                                   struct io_buf_data *buf,
> > > > > > > +                                   unsigned int issue_flags,
> > > > > > > +                                   bool reg)
> > > > > > > +{
> > > > > > > +       struct io_ring_ctx *remote_ctx = ctx;
> > > > > > > +       struct file *file = NULL;
> > > > > > > +       int ret;
> > > > > > > +
> > > > > > > +       if (buf->has_fd) {
> > > > > > > +               file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);
> > > > > > > +               if (IS_ERR(file))
> > > > > > > +                       return PTR_ERR(file);
> > > > > >
> > > > > > It would be good to avoid the overhead of this lookup and
> > > > > > reference-counting in the I/O path. Would it be possible to move this
> > > > > > lookup to when UBLK_IO_FETCH_REQ (and UBLK_IO_COMMIT_AND_FETCH_REQ, if
> > > > > > it specifies a different ring_fd) is submitted? I guess that might
> > > > > > require storing an extra io_ring_ctx pointer in struct ublk_io.
> > > > >
> > > > > Let's start from the flexible way & simple implementation.
> > > > >
> > > > > Any optimization & improvement can be done as follow-up.
> > > >
> > > > Sure, we can start with this as-is. But I suspect the extra
> > > > reference-counting here will significantly decrease the benefit of the
> > > > auto-register feature.
> > >
> > > The reference counting should only be needed when registering the buffer
> > > to an external ring, which is likely slow anyway because of the
> > > cross-ring access...
> >
> > The current code is incrementing and decrementing the io_uring file
> > reference count even if the remote_ctx == ctx, right? I agree it
>
> Yes, but it can easily be changed to drop the inc/dec of the file
> reference, since we have a flags field.
>
> > should definitely be possible to skip the reference count in that
> > case, as this code is already running in task work context for a
> > command on the io_uring.
>
> The current 'uring_cmd' instance holds one reference on the
> io_ring_ctx instance.
>
> > It should also be possible to avoid atomic
> > reference-counting in the UBLK_AUTO_BUF_REGISTERED_RING case too.
>
> For registering the buffer to an external io_uring, it is hard to avoid
> grabbing the io_ring_ctx reference when it is specified via its FD.

If the io_uring is specified by a file descriptor (not using
UBLK_AUTO_BUF_REGISTERED_RING), I agree reference counting is
necessary.
But the whole point of registering ring fds is to avoid reference
counting of the io_uring file. See how IORING_ENTER_REGISTERED_RING is
handled in io_uring_enter(). It simply indexes
current->io_uring->registered_rings to get the file, skipping the
fget() and fput(). Since the auto register is running in task work
context, it should also be able to access the task-local
registered_rings without reference counting.

>
> >
> > >
> > > Maybe we can start with automatic buffer registration for the ubq_daemon
> > > context only, and meantime allow registering the buffer from an external
> > > io_uring by adding a per-io spinlock, which may help the per-io-task work
> > > Uday is working on too.
> >
> > I'm not sure I understand why a spinlock would be required? In Uday's
> > patch set, each ublk_io still belongs to a single task. So no
> > additional locking should be required.
>
> I think it is very useful to allow registering the io buffer in another
> (non-ubq_daemon) io_uring context in the offload style.
>
> Especially since the register/unregister io buffer uring_cmd is for
> handling target IO, it should be issued in the same context as the
> target io handling.
>
> Without a per-io spinlock, it is hard to avoid the race you mentioned:

I don't believe a spinlock is necessary. It should be possible to
avoid accessing the ublk_io at all when registering the request
buffer. __ublk_check_and_get_req() calls kref_get_unless_zero() on the
request, which already ensures the request is owned by the ublk server
and prevents it from completing while its buffer is registered. This
is analogous to how UBLK_F_USER_COPY works;
ublk_ch_read_iter()/ublk_ch_write_iter() can be safely called from any
thread.

Best,
Caleb

>
> https://lore.kernel.org/linux-block/aA2pNRkBhgKsofRP@fedora/#t
>
> in the case of a bad ublk server implementation.
>
> >
> > >
> > > And the interface still allows supporting automatic buffer registration
> > > to an external io_uring later: since `ublk_auto_buf_reg` includes a
> > > 'flags' field, we can enable it in the future once an efficient
> > > implementation is figured out.
> >
> > Sure, we can definitely start with support only for auto-registering
> > the buffer with the ublk command's own io_uring. Implementing a flag
> > in the future to specify a different io_uring seems like a good
> > approach.
>
> OK, I will send V2 by starting with auto-registering buffer to the ublk
> uring_cmd io_uring first.
>
>
> Thanks,
> Ming
>


* Re: [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring
  2025-05-03 18:55               ` Caleb Sander Mateos
@ 2025-05-06  2:45                 ` Ming Lei
  0 siblings, 0 replies; 29+ messages in thread
From: Ming Lei @ 2025-05-06  2:45 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block, Uday Shankar,
	Keith Busch

On Sat, May 03, 2025 at 11:55:05AM -0700, Caleb Sander Mateos wrote:
> On Fri, May 2, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Fri, May 02, 2025 at 02:21:05PM -0700, Caleb Sander Mateos wrote:
> > > On Fri, May 2, 2025 at 8:59 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Thu, May 01, 2025 at 06:31:03PM -0700, Caleb Sander Mateos wrote:
> > > > > On Wed, Apr 30, 2025 at 8:34 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > >
> > > > > > On Mon, Apr 28, 2025 at 05:43:12PM -0700, Caleb Sander Mateos wrote:
> > > > > > > On Mon, Apr 28, 2025 at 2:44 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > > > >
> > > > > > > > Extend io_buffer_register_bvec() and io_buffer_unregister_bvec() for
> > > > > > > > supporting to register/unregister bvec buffer to specified io_uring,
> > > > > > > > which FD is usually passed from userspace.
> > > > > > > >
> > > > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > > > > ---
> > > > > > > >  include/linux/io_uring/cmd.h |  4 ++
> > > > > > > >  io_uring/rsrc.c              | 83 +++++++++++++++++++++++++++---------
> > > > > > > >  2 files changed, 67 insertions(+), 20 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> > > > > > > > index 78fa336a284b..7516fe5cd606 100644
> > > > > > > > --- a/include/linux/io_uring/cmd.h
> > > > > > > > +++ b/include/linux/io_uring/cmd.h
> > > > > > > > @@ -25,6 +25,10 @@ struct io_uring_cmd_data {
> > > > > > > >
> > > > > > > >  struct io_buf_data {
> > > > > > > >         unsigned short index;
> > > > > > > > +       bool has_fd;
> > > > > > > > +       bool registered_fd;
> > > > > > > > +
> > > > > > > > +       int ring_fd;
> > > > > > > >         struct request *rq;
> > > > > > > >         void (*release)(void *);
> > > > > > > >  };
> > > > > > > > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > > > > > > > index 5f8ab130a573..701dd33fecf7 100644
> > > > > > > > --- a/io_uring/rsrc.c
> > > > > > > > +++ b/io_uring/rsrc.c
> > > > > > > > @@ -969,21 +969,6 @@ static int __io_buffer_register_bvec(struct io_ring_ctx *ctx,
> > > > > > > >         return 0;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > -int io_buffer_register_bvec(struct io_uring_cmd *cmd,
> > > > > > > > -                           struct io_buf_data *buf,
> > > > > > > > -                           unsigned int issue_flags)
> > > > > > > > -{
> > > > > > > > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > > > > > > > -       int ret;
> > > > > > > > -
> > > > > > > > -       io_ring_submit_lock(ctx, issue_flags);
> > > > > > > > -       ret = __io_buffer_register_bvec(ctx, buf);
> > > > > > > > -       io_ring_submit_unlock(ctx, issue_flags);
> > > > > > > > -
> > > > > > > > -       return ret;
> > > > > > > > -}
> > > > > > > > -EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> > > > > > > > -
> > > > > > > >  static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> > > > > > > >                                        struct io_buf_data *buf)
> > > > > > > >  {
> > > > > > > > @@ -1006,19 +991,77 @@ static int __io_buffer_unregister_bvec(struct io_ring_ctx *ctx,
> > > > > > > >         return 0;
> > > > > > > >  }
> > > > > > > >
> > > > > > > > -int io_buffer_unregister_bvec(struct io_uring_cmd *cmd,
> > > > > > > > -                             struct io_buf_data *buf,
> > > > > > > > -                             unsigned int issue_flags)
> > > > > > > > +static inline int do_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > > > > > > > +                                   struct io_buf_data *buf,
> > > > > > > > +                                   unsigned int issue_flags,
> > > > > > > > +                                   bool reg)
> > > > > > > >  {
> > > > > > > > -       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > > > > > > >         int ret;
> > > > > > > >
> > > > > > > >         io_ring_submit_lock(ctx, issue_flags);
> > > > > > > > -       ret = __io_buffer_unregister_bvec(ctx, buf);
> > > > > > > > +       if (reg)
> > > > > > > > +               ret = __io_buffer_register_bvec(ctx, buf);
> > > > > > > > +       else
> > > > > > > > +               ret = __io_buffer_unregister_bvec(ctx, buf);
> > > > > > >
> > > > > > > It feels like unifying __io_buffer_register_bvec() and
> > > > > > > __io_buffer_unregister_bvec() would belong better in the prior patch
> > > > > > > that changes their signatures.
> > > > > >
> > > > > > Can you share how to do the above in the previous patch?
> > > > >
> > > > > I was thinking you could define do_reg_unreg_bvec() in the previous
> > > > > patch. It's a logical step once you've extracted out all the
> > > > > differences between io_buffer_register_bvec() and
> > > > > io_buffer_unregister_bvec() into the helpers
> > > > > __io_buffer_register_bvec() and __io_buffer_unregister_bvec(). But
> > > > > either way is fine.
> > > >
> > > > The 'has_fd' and 'ring_fd' fields aren't added yet, so the defined
> > > > do_reg_unreg_bvec() would be quite simple; it doesn't look like a big
> > > > difference, but I can do that...
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >         io_ring_submit_unlock(ctx, issue_flags);
> > > > > > > >
> > > > > > > >         return ret;
> > > > > > > >  }
> > > > > > > > +
> > > > > > > > +static int io_buffer_reg_unreg_bvec(struct io_ring_ctx *ctx,
> > > > > > > > +                                   struct io_buf_data *buf,
> > > > > > > > +                                   unsigned int issue_flags,
> > > > > > > > +                                   bool reg)
> > > > > > > > +{
> > > > > > > > +       struct io_ring_ctx *remote_ctx = ctx;
> > > > > > > > +       struct file *file = NULL;
> > > > > > > > +       int ret;
> > > > > > > > +
> > > > > > > > +       if (buf->has_fd) {
> > > > > > > > +               file = io_uring_register_get_file(buf->ring_fd, buf->registered_fd);
> > > > > > > > +               if (IS_ERR(file))
> > > > > > > > +                       return PTR_ERR(file);
> > > > > > >
> > > > > > > It would be good to avoid the overhead of this lookup and
> > > > > > > reference-counting in the I/O path. Would it be possible to move this
> > > > > > > lookup to when UBLK_IO_FETCH_REQ (and UBLK_IO_COMMIT_AND_FETCH_REQ, if
> > > > > > > it specifies a different ring_fd) is submitted? I guess that might
> > > > > > > require storing an extra io_ring_ctx pointer in struct ublk_io.
> > > > > >
> > > > > > Let's start from the flexible way & simple implementation.
> > > > > >
> > > > > > Any optimization & improvement can be done as follow-up.
> > > > >
> > > > > Sure, we can start with this as-is. But I suspect the extra
> > > > > reference-counting here will significantly decrease the benefit of the
> > > > > auto-register feature.
> > > >
> > > > The reference-counting should only be needed when registering the buffer
> > > > to an external ring, which may be slow anyway because of the cross-ring
> > > > handling...
> > >
> > > The current code is incrementing and decrementing the io_uring file
> > > reference count even if the remote_ctx == ctx, right? I agree it
> >
> > Yes, but it can be changed to drop the inc/dec file reference easily since we
> > have a flag field.
> >
> > > should definitely be possible to skip the reference count in that
> > > case, as this code is already running in task work context for a
> > > command on the io_uring.
> >
> > The current 'uring_cmd' instance holds one reference to the
> > io_ring_ctx instance.
> >
> > > It should also be possible to avoid atomic
> > > reference-counting in the UBLK_AUTO_BUF_REGISTERED_RING case too.
> >
> > For registering the buffer to an external io_uring, it is hard to avoid
> > grabbing the io_uring_ctx reference when the io_uring_ctx is specified
> > via its FD.
> 
> If the io_uring is specified by a file descriptor (not using
> UBLK_AUTO_BUF_REGISTERED_RING), I agree reference counting is
> necessary.
> But the whole point of registering ring fds is to avoid reference
> counting of the io_uring file. See how IORING_ENTER_REGISTERED_RING is
> handled in io_uring_enter(). It simply indexes
> current->io_uring->registered_rings to get the file, skipping the
> fget() and fput(). Since the auto register is running in task work
> context, it should also be able to access the task-local
> registered_rings without reference counting.

A registered ring requires the io_uring to be registered & used in the
local pthread, and that usage is still very limited.
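[The registered-ring fast path described above can be illustrated with a small userspace analogy. All names and types here (`demo_fget`, `get_ring_file`, the tables) are hypothetical stand-ins, not the kernel's API: a table that is private to the current task can be indexed directly with no reference count, while an ordinary fd must pin the file with an atomic reference on every lookup.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for 'struct file' and the lookup tables. */
struct file { atomic_int refcnt; };

#define MAX_FDS 16
static struct file *fd_table[MAX_FDS];          /* shared: needs refcounting */
static struct file *registered_rings[MAX_FDS];  /* task-local: no refcounting */

/* Like fget(): look the file up by fd and pin it with an atomic reference. */
static struct file *demo_fget(int fd)
{
	struct file *f;

	if (fd < 0 || fd >= MAX_FDS || !(f = fd_table[fd]))
		return NULL;
	atomic_fetch_add(&f->refcnt, 1);
	return f;
}

static struct file *get_ring_file(int fd, int registered)
{
	if (registered) {
		/* Fast path: plain index into the task-local table; no
		 * atomic reference is taken, mirroring how
		 * IORING_ENTER_REGISTERED_RING skips fget()/fput(). */
		if (fd < 0 || fd >= MAX_FDS)
			return NULL;
		return registered_rings[fd];
	}
	/* Slow path: unregistered fd, must take a reference. */
	return demo_fget(fd);
}
```

[The design point is the same as in io_uring_enter(): the registered path is safe without a reference only because the table belongs to the current task, which is why it cannot cover an external ring identified by a plain FD.]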

> 
> >
> > >
> > > >
> > > > Maybe we can start with automatic buffer registration in the ubq_daemon
> > > > context only, and meanwhile allow the buffer to be registered from an
> > > > external io_uring by adding a per-io spin_lock, which may help the
> > > > per-io task work Uday is working on too.
> > >
> > > I'm not sure I understand why a spinlock would be required? In Uday's
> > > patch set, each ublk_io still belongs to a single task. So no
> > > additional locking should be required.
> >
> > I think it is very useful to allow the io buffer to be registered in
> > another (non-ubq_daemon) io_uring context in the offload style.
> >
> > Especially since the register/unregister io buffer uring_cmds are for
> > handling target IO, they should be issued in the same context as the
> > target io handling.
> >
> > Without a per-io spinlock, it is hard to avoid the race you mentioned:
> 
> I don't believe a spinlock is necessary. It should be possible to
> avoid accessing the ublk_io at all when registering the request
> buffer. __ublk_check_and_get_req() calls kref_get_unless_zero() on the
> request, which already ensures the request is owned by the ublk server

I thought the request may still be completed & recycled before
__ublk_check_and_get_req() is called. But that can be treated as a ublk
server logic bug, since no actual use-after-free exists.

> and prevents it from completing while its buffer is registered. This
> is analogous to how UBLK_F_USER_COPY works;
> ublk_ch_read_iter()/ublk_ch_write_iter() can be safely called from any
> thread.

OK, spinlock isn't needed.
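[The kref_get_unless_zero() pattern discussed above can be sketched in userspace C. This is an illustrative analogy, not the actual ublk code; `struct request` and the helper names are hypothetical. A reference is taken only if the count is still non-zero, so a request that has already completed and been recycled is rejected atomically, with no per-io lock needed to settle the race.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request with a kref-style reference count;
 * ref == 0 means the request has completed and been recycled. */
struct request { atomic_int ref; int tag; };

/* Like kref_get_unless_zero(): take a reference only while the
 * object is still live, otherwise report failure. */
static bool req_get_unless_zero(struct request *rq)
{
	int old = atomic_load(&rq->ref);

	while (old > 0) {
		/* On CAS failure, 'old' is reloaded and the liveness
		 * check is repeated before retrying. */
		if (atomic_compare_exchange_weak(&rq->ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Buffer registration proceeds only if the request could be pinned;
 * the CAS loop resolves the completion race without a spinlock. */
static bool register_req_buffer(struct request *rq)
{
	return req_get_unless_zero(rq);
}
```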


Thanks,
Ming




Thread overview: 29+ messages
2025-04-28  9:44 [RFC PATCH 0/7] ublk: support to register bvec buffer automatically Ming Lei
2025-04-28  9:44 ` [RFC PATCH 1/7] io_uring: add 'struct io_buf_data' for register/unregister bvec buffer Ming Lei
2025-04-29  0:35   ` Caleb Sander Mateos
2025-04-28  9:44 ` [RFC PATCH 2/7] io_uring: add helper __io_buffer_[un]register_bvec Ming Lei
2025-04-29  0:36   ` Caleb Sander Mateos
2025-04-28  9:44 ` [RFC PATCH 3/7] io_uring: support to register bvec buffer to specified io_uring Ming Lei
2025-04-28 10:28   ` Pavel Begunkov
2025-04-29  0:46     ` Caleb Sander Mateos
2025-04-29  0:47     ` Ming Lei
2025-04-30  8:25       ` Pavel Begunkov
2025-04-30 14:44         ` Ming Lei
2025-04-29  0:43   ` Caleb Sander Mateos
2025-04-30 15:34     ` Ming Lei
2025-05-02  1:31       ` Caleb Sander Mateos
2025-05-02 15:59         ` Ming Lei
2025-05-02 21:21           ` Caleb Sander Mateos
2025-05-03  1:00             ` Ming Lei
2025-05-03 18:55               ` Caleb Sander Mateos
2025-05-06  2:45                 ` Ming Lei
2025-04-28  9:44 ` [RFC PATCH 4/7] ublk: convert to refcount_t Ming Lei
2025-04-28 17:13   ` Caleb Sander Mateos
2025-04-28  9:44 ` [RFC PATCH 5/7] ublk: prepare for supporting to register request buffer automatically Ming Lei
2025-04-29  0:50   ` Caleb Sander Mateos
2025-04-28  9:44 ` [RFC PATCH 6/7] ublk: register buffer to specified io_uring & buf index via UBLK_F_AUTO_BUF_REG Ming Lei
2025-04-29  0:52   ` Caleb Sander Mateos
2025-04-30 15:45     ` Ming Lei
2025-04-30 16:30       ` Caleb Sander Mateos
2025-05-02 14:09         ` Ming Lei
2025-04-28  9:44 ` [RFC PATCH 7/7] selftests: ublk: support UBLK_F_AUTO_BUF_REG Ming Lei
