public inbox for [email protected]
* [PATCHv4 0/5] ublk zero-copy support
@ 2025-02-18 22:42 Keith Busch
  2025-02-18 22:42 ` [PATCHv4 1/5] io_uring: move fixed buffer import to issue path Keith Busch
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Keith Busch @ 2025-02-18 22:42 UTC (permalink / raw)
  To: ming.lei, asml.silence, axboe, linux-block, io-uring
  Cc: bernd, csander, Keith Busch

From: Keith Busch <[email protected]>

Changes from v3:

  Fixed imu freeing to put it back in the cache instead of releasing it to
  the system (Caleb)

  Fixed the build bisect breakage (Caleb)

  Use an appropriate error value if cache initialization fails (Caleb)

  Check data direction when importing the buffer (Ming)

  Use the array_index_nospec() accessor when indexing with a user-provided
  index (Pavel)

  Various cleanups

Keith Busch (5):
  io_uring: move fixed buffer import to issue path
  io_uring: add support for kernel registered bvecs
  ublk: zc register/unregister bvec
  io_uring: add abstraction for buf_table rsrc data
  io_uring: cache nodes and mapped buffers

 drivers/block/ublk_drv.c       | 137 ++++++++++++-----
 include/linux/io_uring.h       |   1 +
 include/linux/io_uring_types.h |  33 ++--
 include/uapi/linux/ublk_cmd.h  |   4 +
 io_uring/fdinfo.c              |   8 +-
 io_uring/filetable.c           |   2 +-
 io_uring/io_uring.c            |  19 +++
 io_uring/net.c                 |  25 +---
 io_uring/nop.c                 |  22 +--
 io_uring/register.c            |   2 +-
 io_uring/rsrc.c                | 266 ++++++++++++++++++++++++++-------
 io_uring/rsrc.h                |   6 +-
 io_uring/rw.c                  |  45 ++++--
 io_uring/uring_cmd.c           |  16 +-
 14 files changed, 419 insertions(+), 167 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCHv4 1/5] io_uring: move fixed buffer import to issue path
  2025-02-18 22:42 [PATCHv4 0/5] ublk zero-copy support Keith Busch
@ 2025-02-18 22:42 ` Keith Busch
  2025-02-19  1:27   ` Caleb Sander Mateos
                     ` (2 more replies)
  2025-02-18 22:42 ` [PATCHv4 2/5] io_uring: add support for kernel registered bvecs Keith Busch
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 28+ messages in thread
From: Keith Busch @ 2025-02-18 22:42 UTC (permalink / raw)
  To: ming.lei, asml.silence, axboe, linux-block, io-uring
  Cc: bernd, csander, Keith Busch

From: Keith Busch <[email protected]>

Similar to the fixed file path, requests may depend on a previous one
to set up an index, so we need to allow linking them. The prep callback
happens too soon for linked commands, so the lookup needs to be deferred
to the issue path. Change the prep callbacks to just set the buf_index
and let generic io_uring code handle the fixed buffer node setup, just
like it already does for fixed files.
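
For illustration, a minimal user-space sketch of the submission pattern this
enables. prep_register_buf_cmd() is a hypothetical, driver-specific helper
(along the lines of the ublk command added later in this series) that installs
a kernel buffer at fixed-buffer index 0; the liburing calls are standard:

#include <liburing.h>

static int submit_linked_pair(struct io_uring *ring, int reg_fd, int io_fd)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	prep_register_buf_cmd(sqe, reg_fd, 0);	/* hypothetical helper */
	sqe->flags |= IOSQE_IO_LINK;		/* read below runs after this */

	/* Fixed-buffer read from index 0: the slot is still empty at prep
	 * time, so the node lookup has to happen at issue. */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read_fixed(sqe, io_fd, NULL /* offset 0 */, 4096,
				 0, 0 /* buf_index */);

	return io_uring_submit(ring);
}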

Signed-off-by: Keith Busch <[email protected]>
---
 include/linux/io_uring_types.h |  3 +++
 io_uring/io_uring.c            | 19 ++++++++++++++
 io_uring/net.c                 | 25 ++++++-------------
 io_uring/nop.c                 | 22 +++--------------
 io_uring/rw.c                  | 45 ++++++++++++++++++++++++----------
 io_uring/uring_cmd.c           | 16 ++----------
 6 files changed, 67 insertions(+), 63 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index c0fe8a00fe53a..0bcaefc4ffe02 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -482,6 +482,7 @@ enum {
 	REQ_F_DOUBLE_POLL_BIT,
 	REQ_F_APOLL_MULTISHOT_BIT,
 	REQ_F_CLEAR_POLLIN_BIT,
+	REQ_F_FIXED_BUFFER_BIT,
 	/* keep async read/write and isreg together and in order */
 	REQ_F_SUPPORT_NOWAIT_BIT,
 	REQ_F_ISREG_BIT,
@@ -574,6 +575,8 @@ enum {
 	REQ_F_BUF_NODE		= IO_REQ_FLAG(REQ_F_BUF_NODE_BIT),
 	/* request has read/write metadata assigned */
 	REQ_F_HAS_METADATA	= IO_REQ_FLAG(REQ_F_HAS_METADATA_BIT),
+	/* request has a fixed buffer at buf_index */
+	REQ_F_FIXED_BUFFER	= IO_REQ_FLAG(REQ_F_FIXED_BUFFER_BIT),
 };
 
 typedef void (*io_req_tw_func_t)(struct io_kiocb *req, io_tw_token_t tw);
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 58528bf61638e..7800edbc57279 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1721,6 +1721,23 @@ static bool io_assign_file(struct io_kiocb *req, const struct io_issue_def *def,
 	return !!req->file;
 }
 
+static bool io_assign_buffer(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_rsrc_node *node;
+
+	if (req->buf_node || !(req->flags & REQ_F_FIXED_BUFFER))
+		return true;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
+	if (node)
+		io_req_assign_buf_node(req, node);
+	io_ring_submit_unlock(ctx, issue_flags);
+
+	return !!node;
+}
+
 static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 {
 	const struct io_issue_def *def = &io_issue_defs[req->opcode];
@@ -1729,6 +1746,8 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 
 	if (unlikely(!io_assign_file(req, def, issue_flags)))
 		return -EBADF;
+	if (unlikely(!io_assign_buffer(req, issue_flags)))
+		return -EFAULT;
 
 	if (unlikely((req->flags & REQ_F_CREDS) && req->creds != current_cred()))
 		creds = override_creds(req->creds);
diff --git a/io_uring/net.c b/io_uring/net.c
index 000dc70d08d0d..39838e8575b53 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1373,6 +1373,10 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 #endif
 	if (unlikely(!io_msg_alloc_async(req)))
 		return -ENOMEM;
+	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
+		req->buf_index = zc->buf_index;
+		req->flags |= REQ_F_FIXED_BUFFER;
+	}
 	if (req->opcode != IORING_OP_SENDMSG_ZC)
 		return io_send_setup(req, sqe);
 	return io_sendmsg_setup(req, sqe);
@@ -1434,25 +1438,10 @@ static int io_send_zc_import(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_async_msghdr *kmsg = req->async_data;
 	int ret;
 
-	if (sr->flags & IORING_RECVSEND_FIXED_BUF) {
-		struct io_ring_ctx *ctx = req->ctx;
-		struct io_rsrc_node *node;
-
-		ret = -EFAULT;
-		io_ring_submit_lock(ctx, issue_flags);
-		node = io_rsrc_node_lookup(&ctx->buf_table, sr->buf_index);
-		if (node) {
-			io_req_assign_buf_node(sr->notif, node);
-			ret = 0;
-		}
-		io_ring_submit_unlock(ctx, issue_flags);
-
-		if (unlikely(ret))
-			return ret;
-
+	if (req->flags & REQ_F_FIXED_BUFFER) {
 		ret = io_import_fixed(ITER_SOURCE, &kmsg->msg.msg_iter,
-					node->buf, (u64)(uintptr_t)sr->buf,
-					sr->len);
+					req->buf_node->buf,
+					(u64)(uintptr_t)sr->buf, sr->len);
 		if (unlikely(ret))
 			return ret;
 		kmsg->msg.sg_from_iter = io_sg_from_iter;
diff --git a/io_uring/nop.c b/io_uring/nop.c
index 5e5196df650a1..989908603112f 100644
--- a/io_uring/nop.c
+++ b/io_uring/nop.c
@@ -16,7 +16,6 @@ struct io_nop {
 	struct file     *file;
 	int             result;
 	int		fd;
-	int		buffer;
 	unsigned int	flags;
 };
 
@@ -39,10 +38,10 @@ int io_nop_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		nop->fd = READ_ONCE(sqe->fd);
 	else
 		nop->fd = -1;
-	if (nop->flags & IORING_NOP_FIXED_BUFFER)
-		nop->buffer = READ_ONCE(sqe->buf_index);
-	else
-		nop->buffer = -1;
+	if (nop->flags & IORING_NOP_FIXED_BUFFER) {
+		req->buf_index = READ_ONCE(sqe->buf_index);
+		req->flags |= REQ_F_FIXED_BUFFER;
+	}
 	return 0;
 }
 
@@ -63,19 +62,6 @@ int io_nop(struct io_kiocb *req, unsigned int issue_flags)
 			goto done;
 		}
 	}
-	if (nop->flags & IORING_NOP_FIXED_BUFFER) {
-		struct io_ring_ctx *ctx = req->ctx;
-		struct io_rsrc_node *node;
-
-		ret = -EFAULT;
-		io_ring_submit_lock(ctx, issue_flags);
-		node = io_rsrc_node_lookup(&ctx->buf_table, nop->buffer);
-		if (node) {
-			io_req_assign_buf_node(req, node);
-			ret = 0;
-		}
-		io_ring_submit_unlock(ctx, issue_flags);
-	}
 done:
 	if (ret < 0)
 		req_set_fail(req);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 16f12f94943f7..2d8910d9197a0 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -353,25 +353,14 @@ int io_prep_writev(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 			    int ddir)
 {
-	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
-	struct io_ring_ctx *ctx = req->ctx;
-	struct io_rsrc_node *node;
-	struct io_async_rw *io;
 	int ret;
 
 	ret = io_prep_rw(req, sqe, ddir, false);
 	if (unlikely(ret))
 		return ret;
 
-	node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
-	if (!node)
-		return -EFAULT;
-	io_req_assign_buf_node(req, node);
-
-	io = req->async_data;
-	ret = io_import_fixed(ddir, &io->iter, node->buf, rw->addr, rw->len);
-	iov_iter_save_state(&io->iter, &io->iter_state);
-	return ret;
+	req->flags |= REQ_F_FIXED_BUFFER;
+	return 0;
 }
 
 int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -954,10 +943,36 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
 	return ret;
 }
 
+static int io_import_fixed_buffer(struct io_kiocb *req, int ddir)
+{
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+	struct io_async_rw *io;
+	int ret;
+
+	if (!(req->flags & REQ_F_FIXED_BUFFER))
+		return 0;
+
+	io = req->async_data;
+	if (io->bytes_done)
+		return 0;
+
+	ret = io_import_fixed(ddir, &io->iter, req->buf_node->buf, rw->addr,
+			      rw->len);
+	if (ret)
+		return ret;
+
+	iov_iter_save_state(&io->iter, &io->iter_state);
+	return 0;
+}
+
 int io_read(struct io_kiocb *req, unsigned int issue_flags)
 {
 	int ret;
 
+	ret = io_import_fixed_buffer(req, READ);
+	if (unlikely(ret))
+		return ret;
+
 	ret = __io_read(req, issue_flags);
 	if (ret >= 0)
 		return kiocb_done(req, ret, issue_flags);
@@ -1062,6 +1077,10 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	ssize_t ret, ret2;
 	loff_t *ppos;
 
+	ret = io_import_fixed_buffer(req, WRITE);
+	if (unlikely(ret))
+		return ret;
+
 	ret = io_rw_init_file(req, FMODE_WRITE, WRITE);
 	if (unlikely(ret))
 		return ret;
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 8bdf2c9b3fef9..112b49fde23e5 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -200,19 +200,8 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EINVAL;
 
 	if (ioucmd->flags & IORING_URING_CMD_FIXED) {
-		struct io_ring_ctx *ctx = req->ctx;
-		struct io_rsrc_node *node;
-		u16 index = READ_ONCE(sqe->buf_index);
-
-		node = io_rsrc_node_lookup(&ctx->buf_table, index);
-		if (unlikely(!node))
-			return -EFAULT;
-		/*
-		 * Pi node upfront, prior to io_uring_cmd_import_fixed()
-		 * being called. This prevents destruction of the mapped buffer
-		 * we'll need at actual import time.
-		 */
-		io_req_assign_buf_node(req, node);
+		req->buf_index = READ_ONCE(sqe->buf_index);
+		req->flags |= REQ_F_FIXED_BUFFER;
 	}
 	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
 
@@ -262,7 +251,6 @@ int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 	struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
 	struct io_rsrc_node *node = req->buf_node;
 
-	/* Must have had rsrc_node assigned at prep time */
 	if (node)
 		return io_import_fixed(rw, iter, node->buf, ubuf, len);
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCHv4 2/5] io_uring: add support for kernel registered bvecs
  2025-02-18 22:42 [PATCHv4 0/5] ublk zero-copy support Keith Busch
  2025-02-18 22:42 ` [PATCHv4 1/5] io_uring: move fixed buffer import to issue path Keith Busch
@ 2025-02-18 22:42 ` Keith Busch
  2025-02-19  1:54   ` Caleb Sander Mateos
  2025-02-20 10:38   ` Pavel Begunkov
  2025-02-18 22:42 ` [PATCHv4 3/5] ublk: zc register/unregister bvec Keith Busch
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 28+ messages in thread
From: Keith Busch @ 2025-02-18 22:42 UTC (permalink / raw)
  To: ming.lei, asml.silence, axboe, linux-block, io-uring
  Cc: bernd, csander, Keith Busch

From: Keith Busch <[email protected]>

Provide an interface for the kernel to leverage the existing
pre-registered buffers that io_uring provides. User space can reference
these later to achieve zero-copy IO.

User space must register an empty fixed buffer table with io_uring in
order for the kernel to make use of it.
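
As a rough user-space sketch (assuming a liburing version that provides
io_uring_register_buffers_sparse()), reserving the empty table looks like:

#include <liburing.h>

/*
 * Reserve 128 empty fixed-buffer slots; no user pages are pinned here.
 * The kernel can later install request bvecs into these slots via
 * io_buffer_register_bvec().
 */
static int setup_sparse_buf_table(struct io_uring *ring)
{
	return io_uring_register_buffers_sparse(ring, 128);
}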

Signed-off-by: Keith Busch <[email protected]>
---
 include/linux/io_uring.h       |   1 +
 include/linux/io_uring_types.h |   6 ++
 io_uring/rsrc.c                | 115 ++++++++++++++++++++++++++++++---
 io_uring/rsrc.h                |   4 ++
 4 files changed, 118 insertions(+), 8 deletions(-)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 85fe4e6b275c7..b5637a2aae340 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -5,6 +5,7 @@
 #include <linux/sched.h>
 #include <linux/xarray.h>
 #include <uapi/linux/io_uring.h>
+#include <linux/blk-mq.h>
 
 #if defined(CONFIG_IO_URING)
 void __io_uring_cancel(bool cancel_all);
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 0bcaefc4ffe02..2aed51e8c79ee 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -709,4 +709,10 @@ static inline bool io_ctx_cqe32(struct io_ring_ctx *ctx)
 	return ctx->flags & IORING_SETUP_CQE32;
 }
 
+int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
+			    void (*release)(void *), unsigned int index,
+			    unsigned int issue_flags);
+void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index,
+			       unsigned int issue_flags);
+
 #endif
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 20b884c84e55f..88bcacc77b72e 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -103,19 +103,23 @@ int io_buffer_validate(struct iovec *iov)
 
 static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 {
-	unsigned int i;
+	struct io_mapped_ubuf *imu = node->buf;
 
-	if (node->buf) {
-		struct io_mapped_ubuf *imu = node->buf;
+	if (!refcount_dec_and_test(&imu->refs))
+		return;
+
+	if (imu->release) {
+		imu->release(imu->priv);
+	} else {
+		unsigned int i;
 
-		if (!refcount_dec_and_test(&imu->refs))
-			return;
 		for (i = 0; i < imu->nr_bvecs; i++)
 			unpin_user_page(imu->bvec[i].bv_page);
 		if (imu->acct_pages)
 			io_unaccount_mem(ctx, imu->acct_pages);
-		kvfree(imu);
 	}
+
+	kvfree(imu);
 }
 
 struct io_rsrc_node *io_rsrc_node_alloc(int type)
@@ -764,6 +768,10 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 	imu->len = iov->iov_len;
 	imu->nr_bvecs = nr_pages;
 	imu->folio_shift = PAGE_SHIFT;
+	imu->release = NULL;
+	imu->priv = NULL;
+	imu->readable = true;
+	imu->writeable = true;
 	if (coalesced)
 		imu->folio_shift = data.folio_shift;
 	refcount_set(&imu->refs, 1);
@@ -860,6 +868,87 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
+int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
+			    void (*release)(void *), unsigned int index,
+			    unsigned int issue_flags)
+{
+	struct io_rsrc_data *data = &ctx->buf_table;
+	struct req_iterator rq_iter;
+	struct io_mapped_ubuf *imu;
+	struct io_rsrc_node *node;
+	struct bio_vec bv, *bvec;
+	int ret = 0;
+	u16 nr_bvecs;
+
+	io_ring_submit_lock(ctx, issue_flags);
+
+	if (io_rsrc_node_lookup(data, index)) {
+		ret = -EBUSY;
+		goto unlock;
+	}
+
+	node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+	if (!node) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+
+	nr_bvecs = blk_rq_nr_phys_segments(rq);
+	imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
+	if (!imu) {
+		kfree(node);
+		ret = -ENOMEM;
+		goto unlock;
+	}
+
+	imu->ubuf = 0;
+	imu->len = blk_rq_bytes(rq);
+	imu->acct_pages = 0;
+	imu->folio_shift = PAGE_SHIFT;
+	imu->nr_bvecs = nr_bvecs;
+	refcount_set(&imu->refs, 1);
+	imu->release = release;
+	imu->priv = rq;
+
+	if (rq_data_dir(rq))
+		imu->writeable = true;
+	else
+		imu->readable = true;
+
+	bvec = imu->bvec;
+	rq_for_each_bvec(bv, rq, rq_iter)
+		*bvec++ = bv;
+
+	node->buf = imu;
+	data->nodes[index] = node;
+unlock:
+	io_ring_submit_unlock(ctx, issue_flags);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
+
+void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index,
+			       unsigned int issue_flags)
+{
+	struct io_rsrc_data *data = &ctx->buf_table;
+	struct io_rsrc_node *node;
+
+	io_ring_submit_lock(ctx, issue_flags);
+
+	if (!data->nr)
+		goto unlock;
+
+	node = io_rsrc_node_lookup(data, index);
+	if (!node || !node->buf->release)
+		goto unlock;
+
+	io_put_rsrc_node(ctx, node);
+	data->nodes[index] = NULL;
+unlock:
+	io_ring_submit_unlock(ctx, issue_flags);
+}
+EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
+
 int io_import_fixed(int ddir, struct iov_iter *iter,
 			   struct io_mapped_ubuf *imu,
 			   u64 buf_addr, size_t len)
@@ -874,6 +963,9 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
 	/* not inside the mapped region */
 	if (unlikely(buf_addr < imu->ubuf || buf_end > (imu->ubuf + imu->len)))
 		return -EFAULT;
+	if ((ddir == READ && !imu->readable) ||
+	    (ddir == WRITE && !imu->writeable))
+		return -EFAULT;
 
 	/*
 	 * Might not be a start of buffer, set size appropriately
@@ -886,8 +978,8 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
 		/*
 		 * Don't use iov_iter_advance() here, as it's really slow for
 		 * using the latter parts of a big fixed buffer - it iterates
-		 * over each segment manually. We can cheat a bit here, because
-		 * we know that:
+		 * over each segment manually. We can cheat a bit here for user
+		 * registered nodes, because we know that:
 		 *
 		 * 1) it's a BVEC iter, we set it up
 		 * 2) all bvecs are the same in size, except potentially the
@@ -901,8 +993,15 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
 		 */
 		const struct bio_vec *bvec = imu->bvec;
 
+		/*
+		 * Kernel buffer bvecs, on the other hand, don't necessarily
+		 * have the size property of user registered ones, so we have
+		 * to use the slow iter advance.
+		 */
 		if (offset < bvec->bv_len) {
 			iter->iov_offset = offset;
+		} else if (imu->release) {
+			iov_iter_advance(iter, offset);
 		} else {
 			unsigned long seg_skip;
 
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index abf86b5b86140..81c31b93d4f7b 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -33,6 +33,10 @@ struct io_mapped_ubuf {
 	unsigned int    folio_shift;
 	refcount_t	refs;
 	unsigned long	acct_pages;
+	void		(*release)(void *);
+	void		*priv;
+	bool		readable;
+	bool		writeable;
 	struct bio_vec	bvec[] __counted_by(nr_bvecs);
 };
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCHv4 3/5] ublk: zc register/unregister bvec
  2025-02-18 22:42 [PATCHv4 0/5] ublk zero-copy support Keith Busch
  2025-02-18 22:42 ` [PATCHv4 1/5] io_uring: move fixed buffer import to issue path Keith Busch
  2025-02-18 22:42 ` [PATCHv4 2/5] io_uring: add support for kernel registered bvecs Keith Busch
@ 2025-02-18 22:42 ` Keith Busch
  2025-02-19  2:36   ` Caleb Sander Mateos
  2025-02-20 11:11   ` Pavel Begunkov
  2025-02-18 22:42 ` [PATCHv4 4/5] io_uring: add abstraction for buf_table rsrc data Keith Busch
  2025-02-18 22:42 ` [PATCHv4 5/5] io_uring: cache nodes and mapped buffers Keith Busch
  4 siblings, 2 replies; 28+ messages in thread
From: Keith Busch @ 2025-02-18 22:42 UTC (permalink / raw)
  To: ming.lei, asml.silence, axboe, linux-block, io-uring
  Cc: bernd, csander, Keith Busch

From: Keith Busch <[email protected]>

Provide new operations for the user to request mapping an active request
to an io_uring instance's buf_table. The user has to provide the index at
which it wants to install the buffer.

A reference count is taken on the request to ensure it can't be
completed while it is active in a ring's buf_table.
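
For illustration, a rough sketch of the register command as a ublk server
might build it (raw SQE setup; exact liburing conveniences vary by version,
and per this patch ub_cmd->addr carries the target buffer index):

#include <string.h>
#include <liburing.h>
#include <linux/ublk_cmd.h>

static void prep_ublk_register_buf(struct io_uring_sqe *sqe, int ublk_ch_fd,
				   __u16 q_id, __u16 tag, __u64 buf_index)
{
	struct ublksrv_io_cmd *cmd;

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = ublk_ch_fd;			/* /dev/ublkcN char device */
	sqe->cmd_op = UBLK_U_IO_REGISTER_IO_BUF;

	cmd = (struct ublksrv_io_cmd *)sqe->cmd;
	cmd->q_id = q_id;
	cmd->tag = tag;
	cmd->addr = buf_index;		/* slot in this ring's buf_table */
}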

Signed-off-by: Keith Busch <[email protected]>
---
 drivers/block/ublk_drv.c      | 137 +++++++++++++++++++++++++---------
 include/uapi/linux/ublk_cmd.h |   4 +
 2 files changed, 105 insertions(+), 36 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 529085181f355..0c753176b14e9 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -51,6 +51,9 @@
 /* private ioctl command mirror */
 #define UBLK_CMD_DEL_DEV_ASYNC	_IOC_NR(UBLK_U_CMD_DEL_DEV_ASYNC)
 
+#define UBLK_IO_REGISTER_IO_BUF		_IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
+#define UBLK_IO_UNREGISTER_IO_BUF	_IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)
+
 /* All UBLK_F_* have to be included into UBLK_F_ALL */
 #define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \
 		| UBLK_F_URING_CMD_COMP_IN_TASK \
@@ -201,7 +204,7 @@ static inline struct ublksrv_io_desc *ublk_get_iod(struct ublk_queue *ubq,
 						   int tag);
 static inline bool ublk_dev_is_user_copy(const struct ublk_device *ub)
 {
-	return ub->dev_info.flags & UBLK_F_USER_COPY;
+	return ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
 }
 
 static inline bool ublk_dev_is_zoned(const struct ublk_device *ub)
@@ -581,7 +584,7 @@ static void ublk_apply_params(struct ublk_device *ub)
 
 static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
 {
-	return ubq->flags & UBLK_F_USER_COPY;
+	return ubq->flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
 }
 
 static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
@@ -1747,6 +1750,96 @@ static inline void ublk_prep_cancel(struct io_uring_cmd *cmd,
 	io_uring_cmd_mark_cancelable(cmd, issue_flags);
 }
 
+static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
+		struct ublk_queue *ubq, int tag, size_t offset)
+{
+	struct request *req;
+
+	if (!ublk_need_req_ref(ubq))
+		return NULL;
+
+	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+	if (!req)
+		return NULL;
+
+	if (!ublk_get_req_ref(ubq, req))
+		return NULL;
+
+	if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
+		goto fail_put;
+
+	if (!ublk_rq_has_data(req))
+		goto fail_put;
+
+	if (offset > blk_rq_bytes(req))
+		goto fail_put;
+
+	return req;
+fail_put:
+	ublk_put_req_ref(ubq, req);
+	return NULL;
+}
+
+static void ublk_io_release(void *priv)
+{
+	struct request *rq = priv;
+	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
+
+	ublk_put_req_ref(ubq, rq);
+}
+
+static int ublk_register_io_buf(struct io_uring_cmd *cmd,
+				struct ublk_queue *ubq, int tag,
+				const struct ublksrv_io_cmd *ub_cmd,
+				unsigned int issue_flags)
+{
+	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+	struct ublk_device *ub = cmd->file->private_data;
+	int index = (int)ub_cmd->addr, ret;
+	struct ublk_rq_data *data;
+	struct request *req;
+
+	if (!ub)
+		return -EPERM;
+
+	req = __ublk_check_and_get_req(ub, ubq, tag, 0);
+	if (!req)
+		return -EINVAL;
+
+	data = blk_mq_rq_to_pdu(req);
+	ret = io_buffer_register_bvec(ctx, req, ublk_io_release, index,
+				      issue_flags);
+	if (ret) {
+		ublk_put_req_ref(ubq, req);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
+				  struct ublk_queue *ubq, int tag,
+				  const struct ublksrv_io_cmd *ub_cmd,
+				  unsigned int issue_flags)
+{
+	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+	struct ublk_device *ub = cmd->file->private_data;
+	int index = (int)ub_cmd->addr;
+	struct ublk_rq_data *data;
+	struct request *req;
+
+	if (!ub)
+		return -EPERM;
+
+	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+	if (!req)
+		return -EINVAL;
+
+	data = blk_mq_rq_to_pdu(req);
+	io_buffer_unregister_bvec(ctx, index, issue_flags);
+	return 0;
+}
+
 static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 			       unsigned int issue_flags,
 			       const struct ublksrv_io_cmd *ub_cmd)
@@ -1798,6 +1891,11 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 
 	ret = -EINVAL;
 	switch (_IOC_NR(cmd_op)) {
+	case UBLK_IO_REGISTER_IO_BUF:
+		return ublk_register_io_buf(cmd, ubq, tag, ub_cmd, issue_flags);
+	case UBLK_IO_UNREGISTER_IO_BUF:
+		return ublk_unregister_io_buf(cmd, ubq, tag, ub_cmd,
+					      issue_flags);
 	case UBLK_IO_FETCH_REQ:
 		/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
 		if (ublk_queue_ready(ubq)) {
@@ -1872,36 +1970,6 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 	return -EIOCBQUEUED;
 }
 
-static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
-		struct ublk_queue *ubq, int tag, size_t offset)
-{
-	struct request *req;
-
-	if (!ublk_need_req_ref(ubq))
-		return NULL;
-
-	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
-	if (!req)
-		return NULL;
-
-	if (!ublk_get_req_ref(ubq, req))
-		return NULL;
-
-	if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
-		goto fail_put;
-
-	if (!ublk_rq_has_data(req))
-		goto fail_put;
-
-	if (offset > blk_rq_bytes(req))
-		goto fail_put;
-
-	return req;
-fail_put:
-	ublk_put_req_ref(ubq, req);
-	return NULL;
-}
-
 static inline int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd,
 		unsigned int issue_flags)
 {
@@ -2527,9 +2595,6 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 		goto out_free_dev_number;
 	}
 
-	/* We are not ready to support zero copy */
-	ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
-
 	ub->dev_info.nr_hw_queues = min_t(unsigned int,
 			ub->dev_info.nr_hw_queues, nr_cpu_ids);
 	ublk_align_max_io_size(ub);
@@ -2860,7 +2925,7 @@ static int ublk_ctrl_get_features(struct io_uring_cmd *cmd)
 {
 	const struct ublksrv_ctrl_cmd *header = io_uring_sqe_cmd(cmd->sqe);
 	void __user *argp = (void __user *)(unsigned long)header->addr;
-	u64 features = UBLK_F_ALL & ~UBLK_F_SUPPORT_ZERO_COPY;
+	u64 features = UBLK_F_ALL;
 
 	if (header->len != UBLK_FEATURES_LEN || !header->addr)
 		return -EINVAL;
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index a8bc98bb69fce..74246c926b55f 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -94,6 +94,10 @@
 	_IOWR('u', UBLK_IO_COMMIT_AND_FETCH_REQ, struct ublksrv_io_cmd)
 #define	UBLK_U_IO_NEED_GET_DATA		\
 	_IOWR('u', UBLK_IO_NEED_GET_DATA, struct ublksrv_io_cmd)
+#define	UBLK_U_IO_REGISTER_IO_BUF	\
+	_IOWR('u', 0x23, struct ublksrv_io_cmd)
+#define	UBLK_U_IO_UNREGISTER_IO_BUF	\
+	_IOWR('u', 0x24, struct ublksrv_io_cmd)
 
 /* only ABORT means that no re-fetch */
 #define UBLK_IO_RES_OK			0
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCHv4 4/5] io_uring: add abstraction for buf_table rsrc data
  2025-02-18 22:42 [PATCHv4 0/5] ublk zero-copy support Keith Busch
                   ` (2 preceding siblings ...)
  2025-02-18 22:42 ` [PATCHv4 3/5] ublk: zc register/unregister bvec Keith Busch
@ 2025-02-18 22:42 ` Keith Busch
  2025-02-19  3:04   ` Caleb Sander Mateos
  2025-02-18 22:42 ` [PATCHv4 5/5] io_uring: cache nodes and mapped buffers Keith Busch
  4 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2025-02-18 22:42 UTC (permalink / raw)
  To: ming.lei, asml.silence, axboe, linux-block, io-uring
  Cc: bernd, csander, Keith Busch

From: Keith Busch <[email protected]>

We'll need to add more fields specific to the registered buffers, so
make a layer for it now. No functional change in this patch.

Signed-off-by: Keith Busch <[email protected]>
---
 include/linux/io_uring_types.h |  6 ++++-
 io_uring/fdinfo.c              |  8 +++---
 io_uring/io_uring.c            |  2 +-
 io_uring/register.c            |  2 +-
 io_uring/rsrc.c                | 49 +++++++++++++++++-----------------
 5 files changed, 35 insertions(+), 32 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 2aed51e8c79ee..810d1dccd27b1 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -69,6 +69,10 @@ struct io_file_table {
 	unsigned int alloc_hint;
 };
 
+struct io_buf_table {
+	struct io_rsrc_data	data;
+};
+
 struct io_hash_bucket {
 	struct hlist_head	list;
 } ____cacheline_aligned_in_smp;
@@ -293,7 +297,7 @@ struct io_ring_ctx {
 		struct io_wq_work_list	iopoll_list;
 
 		struct io_file_table	file_table;
-		struct io_rsrc_data	buf_table;
+		struct io_buf_table	buf_table;
 
 		struct io_submit_state	submit_state;
 
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index f60d0a9d505e2..d389c06cbce10 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -217,12 +217,12 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *file)
 			seq_puts(m, "\n");
 		}
 	}
-	seq_printf(m, "UserBufs:\t%u\n", ctx->buf_table.nr);
-	for (i = 0; has_lock && i < ctx->buf_table.nr; i++) {
+	seq_printf(m, "UserBufs:\t%u\n", ctx->buf_table.data.nr);
+	for (i = 0; has_lock && i < ctx->buf_table.data.nr; i++) {
 		struct io_mapped_ubuf *buf = NULL;
 
-		if (ctx->buf_table.nodes[i])
-			buf = ctx->buf_table.nodes[i]->buf;
+		if (ctx->buf_table.data.nodes[i])
+			buf = ctx->buf_table.data.nodes[i]->buf;
 		if (buf)
 			seq_printf(m, "%5u: 0x%llx/%u\n", i, buf->ubuf, buf->len);
 		else
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7800edbc57279..bcc8ee31cc97c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1730,7 +1730,7 @@ static bool io_assign_buffer(struct io_kiocb *req, unsigned int issue_flags)
 		return true;
 
 	io_ring_submit_lock(ctx, issue_flags);
-	node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
+	node = io_rsrc_node_lookup(&ctx->buf_table.data, req->buf_index);
 	if (node)
 		io_req_assign_buf_node(req, node);
 	io_ring_submit_unlock(ctx, issue_flags);
diff --git a/io_uring/register.c b/io_uring/register.c
index cc23a4c205cd4..f15a8d52ad30f 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -926,7 +926,7 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
 	ret = __io_uring_register(ctx, opcode, arg, nr_args);
 
 	trace_io_uring_register(ctx, opcode, ctx->file_table.data.nr,
-				ctx->buf_table.nr, ret);
+				ctx->buf_table.data.nr, ret);
 	mutex_unlock(&ctx->uring_lock);
 
 	fput(file);
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 88bcacc77b72e..261b5535f46c6 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -235,9 +235,9 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 	__u32 done;
 	int i, err;
 
-	if (!ctx->buf_table.nr)
+	if (!ctx->buf_table.data.nr)
 		return -ENXIO;
-	if (up->offset + nr_args > ctx->buf_table.nr)
+	if (up->offset + nr_args > ctx->buf_table.data.nr)
 		return -EINVAL;
 
 	for (done = 0; done < nr_args; done++) {
@@ -269,9 +269,9 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 			}
 			node->tag = tag;
 		}
-		i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
-		io_reset_rsrc_node(ctx, &ctx->buf_table, i);
-		ctx->buf_table.nodes[i] = node;
+		i = array_index_nospec(up->offset + done, ctx->buf_table.data.nr);
+		io_reset_rsrc_node(ctx, &ctx->buf_table.data, i);
+		ctx->buf_table.data.nodes[i] = node;
 		if (ctx->compat)
 			user_data += sizeof(struct compat_iovec);
 		else
@@ -549,9 +549,9 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 
 int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 {
-	if (!ctx->buf_table.nr)
+	if (!ctx->buf_table.data.nr)
 		return -ENXIO;
-	io_rsrc_data_free(ctx, &ctx->buf_table);
+	io_rsrc_data_free(ctx, &ctx->buf_table.data);
 	return 0;
 }
 
@@ -578,8 +578,8 @@ static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
 	}
 
 	/* check previously registered pages */
-	for (i = 0; i < ctx->buf_table.nr; i++) {
-		struct io_rsrc_node *node = ctx->buf_table.nodes[i];
+	for (i = 0; i < ctx->buf_table.data.nr; i++) {
+		struct io_rsrc_node *node = ctx->buf_table.data.nodes[i];
 		struct io_mapped_ubuf *imu;
 
 		if (!node)
@@ -809,7 +809,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 
 	BUILD_BUG_ON(IORING_MAX_REG_BUFFERS >= (1u << 16));
 
-	if (ctx->buf_table.nr)
+	if (ctx->buf_table.data.nr)
 		return -EBUSY;
 	if (!nr_args || nr_args > IORING_MAX_REG_BUFFERS)
 		return -EINVAL;
@@ -862,7 +862,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 		data.nodes[i] = node;
 	}
 
-	ctx->buf_table = data;
+	ctx->buf_table.data = data;
 	if (ret)
 		io_sqe_buffers_unregister(ctx);
 	return ret;
@@ -872,7 +872,7 @@ int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
 			    void (*release)(void *), unsigned int index,
 			    unsigned int issue_flags)
 {
-	struct io_rsrc_data *data = &ctx->buf_table;
+	struct io_rsrc_data *data = &ctx->buf_table.data;
 	struct req_iterator rq_iter;
 	struct io_mapped_ubuf *imu;
 	struct io_rsrc_node *node;
@@ -930,7 +930,7 @@ EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
 void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index,
 			       unsigned int issue_flags)
 {
-	struct io_rsrc_data *data = &ctx->buf_table;
+	struct io_rsrc_data *data = &ctx->buf_table.data;
 	struct io_rsrc_node *node;
 
 	io_ring_submit_lock(ctx, issue_flags);
@@ -1049,10 +1049,10 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	if (!arg->nr && (arg->dst_off || arg->src_off))
 		return -EINVAL;
 	/* not allowed unless REPLACE is set */
-	if (ctx->buf_table.nr && !(arg->flags & IORING_REGISTER_DST_REPLACE))
+	if (ctx->buf_table.data.nr && !(arg->flags & IORING_REGISTER_DST_REPLACE))
 		return -EBUSY;
 
-	nbufs = src_ctx->buf_table.nr;
+	nbufs = src_ctx->buf_table.data.nr;
 	if (!arg->nr)
 		arg->nr = nbufs;
 	else if (arg->nr > nbufs)
@@ -1062,13 +1062,13 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	if (check_add_overflow(arg->nr, arg->dst_off, &nbufs))
 		return -EOVERFLOW;
 
-	ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.nr));
+	ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.data.nr));
 	if (ret)
 		return ret;
 
 	/* Fill entries in data from dst that won't overlap with src */
-	for (i = 0; i < min(arg->dst_off, ctx->buf_table.nr); i++) {
-		struct io_rsrc_node *src_node = ctx->buf_table.nodes[i];
+	for (i = 0; i < min(arg->dst_off, ctx->buf_table.data.nr); i++) {
+		struct io_rsrc_node *src_node = ctx->buf_table.data.nodes[i];
 
 		if (src_node) {
 			data.nodes[i] = src_node;
@@ -1077,7 +1077,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	}
 
 	ret = -ENXIO;
-	nbufs = src_ctx->buf_table.nr;
+	nbufs = src_ctx->buf_table.data.nr;
 	if (!nbufs)
 		goto out_free;
 	ret = -EINVAL;
@@ -1097,7 +1097,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	while (nr--) {
 		struct io_rsrc_node *dst_node, *src_node;
 
-		src_node = io_rsrc_node_lookup(&src_ctx->buf_table, i);
+		src_node = io_rsrc_node_lookup(&src_ctx->buf_table.data, i);
 		if (!src_node) {
 			dst_node = NULL;
 		} else {
@@ -1119,7 +1119,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	 * old and new nodes at this point.
 	 */
 	if (arg->flags & IORING_REGISTER_DST_REPLACE)
-		io_rsrc_data_free(ctx, &ctx->buf_table);
+		io_sqe_buffers_unregister(ctx);
 
 	/*
 	 * ctx->buf_table must be empty now - either the contents are being
@@ -1127,10 +1127,9 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	 * copied to a ring that does not have buffers yet (checked at function
 	 * entry).
 	 */
-	WARN_ON_ONCE(ctx->buf_table.nr);
-	ctx->buf_table = data;
+	WARN_ON_ONCE(ctx->buf_table.data.nr);
+	ctx->buf_table.data = data;
 	return 0;
-
 out_free:
 	io_rsrc_data_free(ctx, &data);
 	return ret;
@@ -1155,7 +1154,7 @@ int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg)
 		return -EFAULT;
 	if (buf.flags & ~(IORING_REGISTER_SRC_REGISTERED|IORING_REGISTER_DST_REPLACE))
 		return -EINVAL;
-	if (!(buf.flags & IORING_REGISTER_DST_REPLACE) && ctx->buf_table.nr)
+	if (!(buf.flags & IORING_REGISTER_DST_REPLACE) && ctx->buf_table.data.nr)
 		return -EBUSY;
 	if (memchr_inv(buf.pad, 0, sizeof(buf.pad)))
 		return -EINVAL;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCHv4 5/5] io_uring: cache nodes and mapped buffers
  2025-02-18 22:42 [PATCHv4 0/5] ublk zero-copy support Keith Busch
                   ` (3 preceding siblings ...)
  2025-02-18 22:42 ` [PATCHv4 4/5] io_uring: add abstraction for buf_table rsrc data Keith Busch
@ 2025-02-18 22:42 ` Keith Busch
  2025-02-19  4:22   ` Caleb Sander Mateos
  2025-02-20 11:08   ` Pavel Begunkov
  4 siblings, 2 replies; 28+ messages in thread
From: Keith Busch @ 2025-02-18 22:42 UTC (permalink / raw)
  To: ming.lei, asml.silence, axboe, linux-block, io-uring
  Cc: bernd, csander, Keith Busch

From: Keith Busch <[email protected]>

Frequent alloc/free cycles on these structures are fairly costly. Use an
alloc cache to reuse these buffers more efficiently.
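
The pattern is the usual alloc-cache one: allocation prefers a small per-ring
stash of recycled objects and falls back to the allocator, and free tries to
return the object to the stash. A generic user-space illustration of that
pattern (not the kernel's io_alloc_cache itself):

#include <stdlib.h>

struct obj_cache {
	void		**entries;
	unsigned int	nr_cached;
	unsigned int	max_cached;
	size_t		elem_size;
};

static void *cache_alloc(struct obj_cache *c)
{
	if (c->nr_cached)
		return c->entries[--c->nr_cached];	/* reuse */
	return malloc(c->elem_size);			/* fall back */
}

/* Returns 1 if the object was recycled; otherwise the caller frees it. */
static int cache_put(struct obj_cache *c, void *obj)
{
	if (c->nr_cached < c->max_cached) {
		c->entries[c->nr_cached++] = obj;
		return 1;
	}
	return 0;
}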

Signed-off-by: Keith Busch <[email protected]>
---
 include/linux/io_uring_types.h |  18 ++---
 io_uring/filetable.c           |   2 +-
 io_uring/rsrc.c                | 120 +++++++++++++++++++++++++--------
 io_uring/rsrc.h                |   2 +-
 4 files changed, 104 insertions(+), 38 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 810d1dccd27b1..bbfb651506522 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -69,8 +69,18 @@ struct io_file_table {
 	unsigned int alloc_hint;
 };
 
+struct io_alloc_cache {
+	void			**entries;
+	unsigned int		nr_cached;
+	unsigned int		max_cached;
+	size_t			elem_size;
+	unsigned int		init_clear;
+};
+
 struct io_buf_table {
 	struct io_rsrc_data	data;
+	struct io_alloc_cache	node_cache;
+	struct io_alloc_cache	imu_cache;
 };
 
 struct io_hash_bucket {
@@ -224,14 +234,6 @@ struct io_submit_state {
 	struct blk_plug		plug;
 };
 
-struct io_alloc_cache {
-	void			**entries;
-	unsigned int		nr_cached;
-	unsigned int		max_cached;
-	unsigned int		elem_size;
-	unsigned int		init_clear;
-};
-
 struct io_ring_ctx {
 	/* const or read-mostly hot data */
 	struct {
diff --git a/io_uring/filetable.c b/io_uring/filetable.c
index dd8eeec97acf6..a21660e3145ab 100644
--- a/io_uring/filetable.c
+++ b/io_uring/filetable.c
@@ -68,7 +68,7 @@ static int io_install_fixed_file(struct io_ring_ctx *ctx, struct file *file,
 	if (slot_index >= ctx->file_table.data.nr)
 		return -EINVAL;
 
-	node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+	node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
 	if (!node)
 		return -ENOMEM;
 
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 261b5535f46c6..d5cac3a234316 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -32,6 +32,8 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 #define IORING_MAX_FIXED_FILES	(1U << 20)
 #define IORING_MAX_REG_BUFFERS	(1U << 14)
 
+#define IO_CACHED_BVECS_SEGS	32
+
 int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
 {
 	unsigned long page_limit, cur_pages, new_pages;
@@ -101,6 +103,22 @@ int io_buffer_validate(struct iovec *iov)
 	return 0;
 }
 
+static struct io_mapped_ubuf *io_alloc_imu(struct io_ring_ctx *ctx,
+					   int nr_bvecs)
+{
+	if (nr_bvecs <= IO_CACHED_BVECS_SEGS)
+		return io_cache_alloc(&ctx->buf_table.imu_cache, GFP_KERNEL);
+	return kvmalloc(struct_size_t(struct io_mapped_ubuf, bvec, nr_bvecs),
+			GFP_KERNEL);
+}
+
+static void io_free_imu(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
+{
+	if (imu->nr_bvecs > IO_CACHED_BVECS_SEGS ||
+	    !io_alloc_cache_put(&ctx->buf_table.imu_cache, imu))
+		kvfree(imu);
+}
+
 static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 {
 	struct io_mapped_ubuf *imu = node->buf;
@@ -119,22 +137,35 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 			io_unaccount_mem(ctx, imu->acct_pages);
 	}
 
-	kvfree(imu);
+	io_free_imu(ctx, imu);
 }
 
-struct io_rsrc_node *io_rsrc_node_alloc(int type)
+struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type)
 {
 	struct io_rsrc_node *node;
 
-	node = kzalloc(sizeof(*node), GFP_KERNEL);
+	if (type == IORING_RSRC_FILE)
+		node = kmalloc(sizeof(*node), GFP_KERNEL);
+	else
+		node = io_cache_alloc(&ctx->buf_table.node_cache, GFP_KERNEL);
 	if (node) {
 		node->type = type;
 		node->refs = 1;
+		node->tag = 0;
+		node->file_ptr = 0;
 	}
 	return node;
 }
 
-__cold void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data)
+static __cold void __io_rsrc_data_free(struct io_rsrc_data *data)
+{
+	kvfree(data->nodes);
+	data->nodes = NULL;
+	data->nr = 0;
+}
+
+__cold void io_rsrc_data_free(struct io_ring_ctx *ctx,
+			      struct io_rsrc_data *data)
 {
 	if (!data->nr)
 		return;
@@ -142,9 +173,7 @@ __cold void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data
 		if (data->nodes[data->nr])
 			io_put_rsrc_node(ctx, data->nodes[data->nr]);
 	}
-	kvfree(data->nodes);
-	data->nodes = NULL;
-	data->nr = 0;
+	__io_rsrc_data_free(data);
 }
 
 __cold int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr)
@@ -158,6 +187,31 @@ __cold int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr)
 	return -ENOMEM;
 }
 
+static __cold int io_rsrc_buffer_alloc(struct io_buf_table *table, unsigned nr)
+{
+	const int imu_cache_size = struct_size_t(struct io_mapped_ubuf, bvec,
+						 IO_CACHED_BVECS_SEGS);
+	const int node_size = sizeof(struct io_rsrc_node);
+	int ret;
+
+	ret = io_rsrc_data_alloc(&table->data, nr);
+	if (ret)
+		return ret;
+
+	if (io_alloc_cache_init(&table->node_cache, nr, node_size, 0))
+		goto free_data;
+
+	if (io_alloc_cache_init(&table->imu_cache, nr, imu_cache_size, 0))
+		goto free_cache;
+
+	return 0;
+free_cache:
+	io_alloc_cache_free(&table->node_cache, kfree);
+free_data:
+	__io_rsrc_data_free(&table->data);
+	return -ENOMEM;
+}
+
 static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 				 struct io_uring_rsrc_update2 *up,
 				 unsigned nr_args)
@@ -207,7 +261,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
 				err = -EBADF;
 				break;
 			}
-			node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+			node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
 			if (!node) {
 				err = -ENOMEM;
 				fput(file);
@@ -459,6 +513,8 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 	case IORING_RSRC_BUFFER:
 		if (node->buf)
 			io_buffer_unmap(ctx, node);
+		if (io_alloc_cache_put(&ctx->buf_table.node_cache, node))
+			return;
 		break;
 	default:
 		WARN_ON_ONCE(1);
@@ -527,7 +583,7 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 			goto fail;
 		}
 		ret = -ENOMEM;
-		node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+		node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
 		if (!node) {
 			fput(file);
 			goto fail;
@@ -547,11 +603,19 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
+static void io_rsrc_buffer_free(struct io_ring_ctx *ctx,
+				struct io_buf_table *table)
+{
+	io_rsrc_data_free(ctx, &table->data);
+	io_alloc_cache_free(&table->node_cache, kfree);
+	io_alloc_cache_free(&table->imu_cache, kfree);
+}
+
 int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
 {
 	if (!ctx->buf_table.data.nr)
 		return -ENXIO;
-	io_rsrc_data_free(ctx, &ctx->buf_table.data);
+	io_rsrc_buffer_free(ctx, &ctx->buf_table);
 	return 0;
 }
 
@@ -732,7 +796,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 	if (!iov->iov_base)
 		return NULL;
 
-	node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+	node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
 	if (!node)
 		return ERR_PTR(-ENOMEM);
 	node->buf = NULL;
@@ -752,7 +816,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 			coalesced = io_coalesce_buffer(&pages, &nr_pages, &data);
 	}
 
-	imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
+	imu = io_alloc_imu(ctx, nr_pages);
 	if (!imu)
 		goto done;
 
@@ -789,7 +853,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
 	}
 done:
 	if (ret) {
-		kvfree(imu);
+		io_free_imu(ctx, imu);
 		if (node)
 			io_put_rsrc_node(ctx, node);
 		node = ERR_PTR(ret);
@@ -802,9 +866,9 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			    unsigned int nr_args, u64 __user *tags)
 {
 	struct page *last_hpage = NULL;
-	struct io_rsrc_data data;
 	struct iovec fast_iov, *iov = &fast_iov;
 	const struct iovec __user *uvec;
+	struct io_buf_table table;
 	int i, ret;
 
 	BUILD_BUG_ON(IORING_MAX_REG_BUFFERS >= (1u << 16));
@@ -813,13 +877,14 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 		return -EBUSY;
 	if (!nr_args || nr_args > IORING_MAX_REG_BUFFERS)
 		return -EINVAL;
-	ret = io_rsrc_data_alloc(&data, nr_args);
+	ret = io_rsrc_buffer_alloc(&table, nr_args);
 	if (ret)
 		return ret;
 
 	if (!arg)
 		memset(iov, 0, sizeof(*iov));
 
+	ctx->buf_table = table;
 	for (i = 0; i < nr_args; i++) {
 		struct io_rsrc_node *node;
 		u64 tag = 0;
@@ -859,10 +924,8 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			}
 			node->tag = tag;
 		}
-		data.nodes[i] = node;
+		table.data.nodes[i] = node;
 	}
-
-	ctx->buf_table.data = data;
 	if (ret)
 		io_sqe_buffers_unregister(ctx);
 	return ret;
@@ -887,14 +950,15 @@ int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
 		goto unlock;
 	}
 
-	node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+	node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
 	if (!node) {
 		ret = -ENOMEM;
 		goto unlock;
 	}
 
 	nr_bvecs = blk_rq_nr_phys_segments(rq);
-	imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
+
+	imu = io_alloc_imu(ctx, nr_bvecs);
 	if (!imu) {
 		kfree(node);
 		ret = -ENOMEM;
@@ -1031,7 +1095,7 @@ static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *ctx2)
 static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx,
 			    struct io_uring_clone_buffers *arg)
 {
-	struct io_rsrc_data data;
+	struct io_buf_table table;
 	int i, ret, off, nr;
 	unsigned int nbufs;
 
@@ -1062,7 +1126,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	if (check_add_overflow(arg->nr, arg->dst_off, &nbufs))
 		return -EOVERFLOW;
 
-	ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.data.nr));
+	ret = io_rsrc_buffer_alloc(&table, max(nbufs, ctx->buf_table.data.nr));
 	if (ret)
 		return ret;
 
@@ -1071,7 +1135,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 		struct io_rsrc_node *src_node = ctx->buf_table.data.nodes[i];
 
 		if (src_node) {
-			data.nodes[i] = src_node;
+			table.data.nodes[i] = src_node;
 			src_node->refs++;
 		}
 	}
@@ -1101,7 +1165,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 		if (!src_node) {
 			dst_node = NULL;
 		} else {
-			dst_node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+			dst_node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
 			if (!dst_node) {
 				ret = -ENOMEM;
 				goto out_free;
@@ -1110,12 +1174,12 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 			refcount_inc(&src_node->buf->refs);
 			dst_node->buf = src_node->buf;
 		}
-		data.nodes[off++] = dst_node;
+		table.data.nodes[off++] = dst_node;
 		i++;
 	}
 
 	/*
-	 * If asked for replace, put the old table. data->nodes[] holds both
+	 * If asked for replace, put the old table. table.data->nodes[] holds both
 	 * old and new nodes at this point.
 	 */
 	if (arg->flags & IORING_REGISTER_DST_REPLACE)
@@ -1128,10 +1192,10 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
 	 * entry).
 	 */
 	WARN_ON_ONCE(ctx->buf_table.data.nr);
-	ctx->buf_table.data = data;
+	ctx->buf_table = table;
 	return 0;
 out_free:
-	io_rsrc_data_free(ctx, &data);
+	io_rsrc_buffer_free(ctx, &table);
 	return ret;
 }
 
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 81c31b93d4f7b..69cc212ad0c7c 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -49,7 +49,7 @@ struct io_imu_folio_data {
 	unsigned int	nr_folios;
 };
 
-struct io_rsrc_node *io_rsrc_node_alloc(int type);
+struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type);
 void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node);
 void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data);
 int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 1/5] io_uring: move fixed buffer import to issue path
  2025-02-18 22:42 ` [PATCHv4 1/5] io_uring: move fixed buffer import to issue path Keith Busch
@ 2025-02-19  1:27   ` Caleb Sander Mateos
  2025-02-19  4:23   ` Ming Lei
  2025-02-19 16:48   ` Pavel Begunkov
  2 siblings, 0 replies; 28+ messages in thread
From: Caleb Sander Mateos @ 2025-02-19  1:27 UTC (permalink / raw)
  To: Keith Busch
  Cc: ming.lei, asml.silence, axboe, linux-block, io-uring, bernd,
	Keith Busch

On Tue, Feb 18, 2025 at 2:43 PM Keith Busch <[email protected]> wrote:
>
> From: Keith Busch <[email protected]>
>
> Similar to the fixed file path, requests may depend on a previous one
> to set up an index, so we need to allow linking them. The prep callback
> happens too soon for linked commands, so the lookup needs to be deferred
> to the issue path. Change the prep callbacks to just set the buf_index
> and let generic io_uring code handle the fixed buffer node setup, just
> like it already does for fixed files.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
>  include/linux/io_uring_types.h |  3 +++
>  io_uring/io_uring.c            | 19 ++++++++++++++
>  io_uring/net.c                 | 25 ++++++-------------
>  io_uring/nop.c                 | 22 +++--------------
>  io_uring/rw.c                  | 45 ++++++++++++++++++++++++----------
>  io_uring/uring_cmd.c           | 16 ++----------
>  6 files changed, 67 insertions(+), 63 deletions(-)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index c0fe8a00fe53a..0bcaefc4ffe02 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -482,6 +482,7 @@ enum {
>         REQ_F_DOUBLE_POLL_BIT,
>         REQ_F_APOLL_MULTISHOT_BIT,
>         REQ_F_CLEAR_POLLIN_BIT,
> +       REQ_F_FIXED_BUFFER_BIT,

Move this to the end of the REQ_F_*_BIT definitions (before __REQ_F_LAST_BIT)?

>         /* keep async read/write and isreg together and in order */
>         REQ_F_SUPPORT_NOWAIT_BIT,
>         REQ_F_ISREG_BIT,
> @@ -574,6 +575,8 @@ enum {
>         REQ_F_BUF_NODE          = IO_REQ_FLAG(REQ_F_BUF_NODE_BIT),
>         /* request has read/write metadata assigned */
>         REQ_F_HAS_METADATA      = IO_REQ_FLAG(REQ_F_HAS_METADATA_BIT),
> +       /* request has a fixed buffer at buf_index */
> +       REQ_F_FIXED_BUFFER      = IO_REQ_FLAG(REQ_F_FIXED_BUFFER_BIT),
>  };
>
>  typedef void (*io_req_tw_func_t)(struct io_kiocb *req, io_tw_token_t tw);
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 58528bf61638e..7800edbc57279 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -1721,6 +1721,23 @@ static bool io_assign_file(struct io_kiocb *req, const struct io_issue_def *def,
>         return !!req->file;
>  }
>
> +static bool io_assign_buffer(struct io_kiocb *req, unsigned int issue_flags)
> +{
> +       struct io_ring_ctx *ctx = req->ctx;
> +       struct io_rsrc_node *node;
> +
> +       if (req->buf_node || !(req->flags & REQ_F_FIXED_BUFFER))
> +               return true;
> +
> +       io_ring_submit_lock(ctx, issue_flags);
> +       node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
> +       if (node)
> +               io_req_assign_buf_node(req, node);
> +       io_ring_submit_unlock(ctx, issue_flags);
> +
> +       return !!node;
> +}
> +
>  static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>  {
>         const struct io_issue_def *def = &io_issue_defs[req->opcode];
> @@ -1729,6 +1746,8 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>
>         if (unlikely(!io_assign_file(req, def, issue_flags)))
>                 return -EBADF;
> +       if (unlikely(!io_assign_buffer(req, issue_flags)))
> +               return -EFAULT;
>
>         if (unlikely((req->flags & REQ_F_CREDS) && req->creds != current_cred()))
>                 creds = override_creds(req->creds);
> diff --git a/io_uring/net.c b/io_uring/net.c
> index 000dc70d08d0d..39838e8575b53 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -1373,6 +1373,10 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  #endif
>         if (unlikely(!io_msg_alloc_async(req)))
>                 return -ENOMEM;
> +       if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
> +               req->buf_index = zc->buf_index;

Can the buf_index field of io_sr_msg be removed now that it's only
used within io_send_zc_prep()?

> +               req->flags |= REQ_F_FIXED_BUFFER;
> +       }
>         if (req->opcode != IORING_OP_SENDMSG_ZC)
>                 return io_send_setup(req, sqe);
>         return io_sendmsg_setup(req, sqe);
> @@ -1434,25 +1438,10 @@ static int io_send_zc_import(struct io_kiocb *req, unsigned int issue_flags)
>         struct io_async_msghdr *kmsg = req->async_data;
>         int ret;
>
> -       if (sr->flags & IORING_RECVSEND_FIXED_BUF) {
> -               struct io_ring_ctx *ctx = req->ctx;
> -               struct io_rsrc_node *node;
> -
> -               ret = -EFAULT;
> -               io_ring_submit_lock(ctx, issue_flags);
> -               node = io_rsrc_node_lookup(&ctx->buf_table, sr->buf_index);
> -               if (node) {
> -                       io_req_assign_buf_node(sr->notif, node);
> -                       ret = 0;
> -               }
> -               io_ring_submit_unlock(ctx, issue_flags);
> -
> -               if (unlikely(ret))
> -                       return ret;
> -
> +       if (req->flags & REQ_F_FIXED_BUFFER) {
>                 ret = io_import_fixed(ITER_SOURCE, &kmsg->msg.msg_iter,
> -                                       node->buf, (u64)(uintptr_t)sr->buf,
> -                                       sr->len);
> +                                       req->buf_node->buf,
> +                                       (u64)(uintptr_t)sr->buf, sr->len);
>                 if (unlikely(ret))
>                         return ret;
>                 kmsg->msg.sg_from_iter = io_sg_from_iter;
> diff --git a/io_uring/nop.c b/io_uring/nop.c
> index 5e5196df650a1..989908603112f 100644
> --- a/io_uring/nop.c
> +++ b/io_uring/nop.c
> @@ -16,7 +16,6 @@ struct io_nop {
>         struct file     *file;
>         int             result;
>         int             fd;
> -       int             buffer;
>         unsigned int    flags;
>  };
>
> @@ -39,10 +38,10 @@ int io_nop_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>                 nop->fd = READ_ONCE(sqe->fd);
>         else
>                 nop->fd = -1;
> -       if (nop->flags & IORING_NOP_FIXED_BUFFER)
> -               nop->buffer = READ_ONCE(sqe->buf_index);
> -       else
> -               nop->buffer = -1;
> +       if (nop->flags & IORING_NOP_FIXED_BUFFER) {
> +               req->buf_index = READ_ONCE(sqe->buf_index);
> +               req->flags |= REQ_F_FIXED_BUFFER;
> +       }
>         return 0;
>  }
>
> @@ -63,19 +62,6 @@ int io_nop(struct io_kiocb *req, unsigned int issue_flags)
>                         goto done;
>                 }
>         }
> -       if (nop->flags & IORING_NOP_FIXED_BUFFER) {
> -               struct io_ring_ctx *ctx = req->ctx;
> -               struct io_rsrc_node *node;
> -
> -               ret = -EFAULT;
> -               io_ring_submit_lock(ctx, issue_flags);
> -               node = io_rsrc_node_lookup(&ctx->buf_table, nop->buffer);
> -               if (node) {
> -                       io_req_assign_buf_node(req, node);
> -                       ret = 0;
> -               }
> -               io_ring_submit_unlock(ctx, issue_flags);
> -       }
>  done:
>         if (ret < 0)
>                 req_set_fail(req);
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 16f12f94943f7..2d8910d9197a0 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -353,25 +353,14 @@ int io_prep_writev(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>                             int ddir)
>  {
> -       struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> -       struct io_ring_ctx *ctx = req->ctx;
> -       struct io_rsrc_node *node;
> -       struct io_async_rw *io;
>         int ret;
>
>         ret = io_prep_rw(req, sqe, ddir, false);
>         if (unlikely(ret))
>                 return ret;
>
> -       node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
> -       if (!node)
> -               return -EFAULT;
> -       io_req_assign_buf_node(req, node);
> -
> -       io = req->async_data;
> -       ret = io_import_fixed(ddir, &io->iter, node->buf, rw->addr, rw->len);
> -       iov_iter_save_state(&io->iter, &io->iter_state);
> -       return ret;
> +       req->flags |= REQ_F_FIXED_BUFFER;
> +       return 0;
>  }
>
>  int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> @@ -954,10 +943,36 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
>         return ret;
>  }
>
> +static int io_import_fixed_buffer(struct io_kiocb *req, int ddir)
> +{
> +       struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> +       struct io_async_rw *io;
> +       int ret;
> +
> +       if (!(req->flags & REQ_F_FIXED_BUFFER))
> +               return 0;
> +
> +       io = req->async_data;
> +       if (io->bytes_done)
> +               return 0;
> +
> +       ret = io_import_fixed(ddir, &io->iter, req->buf_node->buf, rw->addr,
> +                             rw->len);
> +       if (ret)
> +               return ret;
> +
> +       iov_iter_save_state(&io->iter, &io->iter_state);
> +       return 0;
> +}
> +
>  int io_read(struct io_kiocb *req, unsigned int issue_flags)
>  {
>         int ret;
>
> +       ret = io_import_fixed_buffer(req, READ);
> +       if (unlikely(ret))
> +               return ret;
> +
>         ret = __io_read(req, issue_flags);
>         if (ret >= 0)
>                 return kiocb_done(req, ret, issue_flags);
> @@ -1062,6 +1077,10 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
>         ssize_t ret, ret2;
>         loff_t *ppos;
>
> +       ret = io_import_fixed_buffer(req, WRITE);
> +       if (unlikely(ret))
> +               return ret;
> +
>         ret = io_rw_init_file(req, FMODE_WRITE, WRITE);
>         if (unlikely(ret))
>                 return ret;
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 8bdf2c9b3fef9..112b49fde23e5 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -200,19 +200,8 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>                 return -EINVAL;
>
>         if (ioucmd->flags & IORING_URING_CMD_FIXED) {
> -               struct io_ring_ctx *ctx = req->ctx;
> -               struct io_rsrc_node *node;
> -               u16 index = READ_ONCE(sqe->buf_index);
> -
> -               node = io_rsrc_node_lookup(&ctx->buf_table, index);
> -               if (unlikely(!node))
> -                       return -EFAULT;
> -               /*
> -                * Pi node upfront, prior to io_uring_cmd_import_fixed()
> -                * being called. This prevents destruction of the mapped buffer
> -                * we'll need at actual import time.
> -                */
> -               io_req_assign_buf_node(req, node);
> +               req->buf_index = READ_ONCE(sqe->buf_index);
> +               req->flags |= REQ_F_FIXED_BUFFER;
>         }
>         ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
>
> @@ -262,7 +251,6 @@ int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
>         struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
>         struct io_rsrc_node *node = req->buf_node;
>
> -       /* Must have had rsrc_node assigned at prep time */
>         if (node)
>                 return io_import_fixed(rw, iter, node->buf, ubuf, len);

Is it possible for node to be NULL? If the buffer lookup failed,
io_issue_sqe() would have returned early and not called ->issue(), so
this function wouldn't have been called.

Best,
Caleb

>
> --
> 2.43.5
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 2/5] io_uring: add support for kernel registered bvecs
  2025-02-18 22:42 ` [PATCHv4 2/5] io_uring: add support for kernel registered bvecs Keith Busch
@ 2025-02-19  1:54   ` Caleb Sander Mateos
  2025-02-19 17:23     ` Pavel Begunkov
  2025-02-20 10:31     ` Pavel Begunkov
  2025-02-20 10:38   ` Pavel Begunkov
  1 sibling, 2 replies; 28+ messages in thread
From: Caleb Sander Mateos @ 2025-02-19  1:54 UTC (permalink / raw)
  To: Keith Busch
  Cc: ming.lei, asml.silence, axboe, linux-block, io-uring, bernd,
	Keith Busch

On Tue, Feb 18, 2025 at 2:42 PM Keith Busch <[email protected]> wrote:
>
> From: Keith Busch <[email protected]>
>
> Provide an interface for the kernel to leverage the existing
> pre-registered buffers that io_uring provides. User space can reference
> these later to achieve zero-copy IO.
>
> User space must register an empty fixed buffer table with io_uring in
> order for the kernel to make use of it.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
>  include/linux/io_uring.h       |   1 +
>  include/linux/io_uring_types.h |   6 ++
>  io_uring/rsrc.c                | 115 ++++++++++++++++++++++++++++++---
>  io_uring/rsrc.h                |   4 ++
>  4 files changed, 118 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> index 85fe4e6b275c7..b5637a2aae340 100644
> --- a/include/linux/io_uring.h
> +++ b/include/linux/io_uring.h
> @@ -5,6 +5,7 @@
>  #include <linux/sched.h>
>  #include <linux/xarray.h>
>  #include <uapi/linux/io_uring.h>
> +#include <linux/blk-mq.h>

Should this be added to the header file that declares
io_buffer_register_bvec() instead?

>
>  #if defined(CONFIG_IO_URING)
>  void __io_uring_cancel(bool cancel_all);
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 0bcaefc4ffe02..2aed51e8c79ee 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -709,4 +709,10 @@ static inline bool io_ctx_cqe32(struct io_ring_ctx *ctx)
>         return ctx->flags & IORING_SETUP_CQE32;
>  }
>
> +int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
> +                           void (*release)(void *), unsigned int index,
> +                           unsigned int issue_flags);
> +void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index,
> +                              unsigned int issue_flags);

Hmm, io_uring_types.h seems like a strange place for these. The rest
of this file consists of types and inline functions. Maybe
io_uring/rsrc.h would make more sense?

> +
>  #endif
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 20b884c84e55f..88bcacc77b72e 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -103,19 +103,23 @@ int io_buffer_validate(struct iovec *iov)
>
>  static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
>  {
> -       unsigned int i;
> +       struct io_mapped_ubuf *imu = node->buf;
>
> -       if (node->buf) {
> -               struct io_mapped_ubuf *imu = node->buf;
> +       if (!refcount_dec_and_test(&imu->refs))
> +               return;
> +
> +       if (imu->release) {
> +               imu->release(imu->priv);
> +       } else {
> +               unsigned int i;
>
> -               if (!refcount_dec_and_test(&imu->refs))
> -                       return;
>                 for (i = 0; i < imu->nr_bvecs; i++)
>                         unpin_user_page(imu->bvec[i].bv_page);
>                 if (imu->acct_pages)
>                         io_unaccount_mem(ctx, imu->acct_pages);
> -               kvfree(imu);
>         }
> +
> +       kvfree(imu);
>  }
>
>  struct io_rsrc_node *io_rsrc_node_alloc(int type)
> @@ -764,6 +768,10 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
>         imu->len = iov->iov_len;
>         imu->nr_bvecs = nr_pages;
>         imu->folio_shift = PAGE_SHIFT;
> +       imu->release = NULL;
> +       imu->priv = NULL;
> +       imu->readable = true;
> +       imu->writeable = true;
>         if (coalesced)
>                 imu->folio_shift = data.folio_shift;
>         refcount_set(&imu->refs, 1);
> @@ -860,6 +868,87 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
>         return ret;
>  }
>
> +int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
> +                           void (*release)(void *), unsigned int index,
> +                           unsigned int issue_flags)
> +{
> +       struct io_rsrc_data *data = &ctx->buf_table;
> +       struct req_iterator rq_iter;
> +       struct io_mapped_ubuf *imu;
> +       struct io_rsrc_node *node;
> +       struct bio_vec bv, *bvec;
> +       int ret = 0;
> +       u16 nr_bvecs;
> +
> +       io_ring_submit_lock(ctx, issue_flags);
> +
> +       if (io_rsrc_node_lookup(data, index)) {
> +               ret = -EBUSY;
> +               goto unlock;
> +       }
> +
> +       node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
> +       if (!node) {
> +               ret = -ENOMEM;
> +               goto unlock;
> +       }
> +
> +       nr_bvecs = blk_rq_nr_phys_segments(rq);
> +       imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
> +       if (!imu) {
> +               kfree(node);
> +               ret = -ENOMEM;
> +               goto unlock;
> +       }
> +
> +       imu->ubuf = 0;
> +       imu->len = blk_rq_bytes(rq);
> +       imu->acct_pages = 0;
> +       imu->folio_shift = PAGE_SHIFT;
> +       imu->nr_bvecs = nr_bvecs;
> +       refcount_set(&imu->refs, 1);
> +       imu->release = release;
> +       imu->priv = rq;
> +
> +       if (rq_data_dir(rq))

Check rq_data_dir(rq) == WRITE instead to avoid depending on the value
of READ/WRITE?

> +               imu->writeable = true;
> +       else
> +               imu->readable = true;
> +
> +       bvec = imu->bvec;
> +       rq_for_each_bvec(bv, rq, rq_iter)
> +               *bvec++ = bv;
> +
> +       node->buf = imu;
> +       data->nodes[index] = node;
> +unlock:
> +       io_ring_submit_unlock(ctx, issue_flags);
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> +
> +void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index,
> +                              unsigned int issue_flags)
> +{
> +       struct io_rsrc_data *data = &ctx->buf_table;
> +       struct io_rsrc_node *node;
> +
> +       io_ring_submit_lock(ctx, issue_flags);
> +
> +       if (!data->nr)
> +               goto unlock;
> +
> +       node = io_rsrc_node_lookup(data, index);
> +       if (!node || !node->buf->release)
> +               goto unlock;

Still think the data->nr check is unnecessary, since
io_rsrc_node_lookup() would return NULL in that case.

> +
> +       io_put_rsrc_node(ctx, node);
> +       data->nodes[index] = NULL;
> +unlock:
> +       io_ring_submit_unlock(ctx, issue_flags);
> +}
> +EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
> +
>  int io_import_fixed(int ddir, struct iov_iter *iter,
>                            struct io_mapped_ubuf *imu,
>                            u64 buf_addr, size_t len)
> @@ -874,6 +963,9 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
>         /* not inside the mapped region */
>         if (unlikely(buf_addr < imu->ubuf || buf_end > (imu->ubuf + imu->len)))
>                 return -EFAULT;
> +       if ((ddir == READ && !imu->readable) ||
> +           (ddir == WRITE && !imu->writeable))
> +               return -EFAULT;

This could be made less branchy by storing a bitmask of allowed data
transfer directions instead of 2 bool fields. Then this could just be:
if (!(imu->ddirs >> ddir & 1))
        return -EFAULT;
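
For example, a rough sketch of that approach, assuming a hypothetical
dir_mask field in io_mapped_ubuf in place of the two bools (the field
name is illustrative, not from the patch):

/* in io_buffer_register_bvec(): only the request's direction is valid */
imu->dir_mask = 1U << rq_data_dir(rq);

/* in io_sqe_buffer_register(): user buffers stay valid for both */
imu->dir_mask = (1U << READ) | (1U << WRITE);

/* in io_import_fixed(): one branch on the mask */
if (!(imu->dir_mask & (1U << ddir)))
        return -EFAULT;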

Best,
Caleb

>
>         /*
>          * Might not be a start of buffer, set size appropriately
> @@ -886,8 +978,8 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
>                 /*
>                  * Don't use iov_iter_advance() here, as it's really slow for
>                  * using the latter parts of a big fixed buffer - it iterates
> -                * over each segment manually. We can cheat a bit here, because
> -                * we know that:
> +                * over each segment manually. We can cheat a bit here for user
> +                * registered nodes, because we know that:
>                  *
>                  * 1) it's a BVEC iter, we set it up
>                  * 2) all bvecs are the same in size, except potentially the
> @@ -901,8 +993,15 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
>                  */
>                 const struct bio_vec *bvec = imu->bvec;
>
> +               /*
> +                * Kernel buffer bvecs, on the other hand, don't necessarily
> +                * have the size property of user registered ones, so we have
> +                * to use the slow iter advance.
> +                */
>                 if (offset < bvec->bv_len) {
>                         iter->iov_offset = offset;
> +               } else if (imu->release) {
> +                       iov_iter_advance(iter, offset);
>                 } else {
>                         unsigned long seg_skip;
>
> diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
> index abf86b5b86140..81c31b93d4f7b 100644
> --- a/io_uring/rsrc.h
> +++ b/io_uring/rsrc.h
> @@ -33,6 +33,10 @@ struct io_mapped_ubuf {
>         unsigned int    folio_shift;
>         refcount_t      refs;
>         unsigned long   acct_pages;
> +       void            (*release)(void *);
> +       void            *priv;
> +       bool            readable;
> +       bool            writeable;
>         struct bio_vec  bvec[] __counted_by(nr_bvecs);
>  };
>
> --
> 2.43.5
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 3/5] ublk: zc register/unregister bvec
  2025-02-18 22:42 ` [PATCHv4 3/5] ublk: zc register/unregister bvec Keith Busch
@ 2025-02-19  2:36   ` Caleb Sander Mateos
  2025-02-20 11:11   ` Pavel Begunkov
  1 sibling, 0 replies; 28+ messages in thread
From: Caleb Sander Mateos @ 2025-02-19  2:36 UTC (permalink / raw)
  To: Keith Busch
  Cc: ming.lei, asml.silence, axboe, linux-block, io-uring, bernd,
	Keith Busch

Sorry, I sent these comments on v3 to you directly and forgot to CC
the list. Copying them here.

On Tue, Feb 18, 2025 at 2:43 PM Keith Busch <[email protected]> wrote:
>
> From: Keith Busch <[email protected]>
>
> Provide new operations for the user to request mapping an active request
> to an io uring instance's buf_table. The user has to provide the index
> it wants to install the buffer.
>
> A reference count is taken on the request to ensure it can't be
> completed while it is active in a ring's buf_table.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
>  drivers/block/ublk_drv.c      | 137 +++++++++++++++++++++++++---------
>  include/uapi/linux/ublk_cmd.h |   4 +
>  2 files changed, 105 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 529085181f355..0c753176b14e9 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -51,6 +51,9 @@
>  /* private ioctl command mirror */
>  #define UBLK_CMD_DEL_DEV_ASYNC _IOC_NR(UBLK_U_CMD_DEL_DEV_ASYNC)
>
> +#define UBLK_IO_REGISTER_IO_BUF                _IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
> +#define UBLK_IO_UNREGISTER_IO_BUF      _IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)
> +
>  /* All UBLK_F_* have to be included into UBLK_F_ALL */
>  #define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \
>                 | UBLK_F_URING_CMD_COMP_IN_TASK \
> @@ -201,7 +204,7 @@ static inline struct ublksrv_io_desc *ublk_get_iod(struct ublk_queue *ubq,
>                                                    int tag);
>  static inline bool ublk_dev_is_user_copy(const struct ublk_device *ub)
>  {
> -       return ub->dev_info.flags & UBLK_F_USER_COPY;
> +       return ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
>  }
>
>  static inline bool ublk_dev_is_zoned(const struct ublk_device *ub)
> @@ -581,7 +584,7 @@ static void ublk_apply_params(struct ublk_device *ub)
>
>  static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
>  {
> -       return ubq->flags & UBLK_F_USER_COPY;
> +       return ubq->flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
>  }
>
>  static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
> @@ -1747,6 +1750,96 @@ static inline void ublk_prep_cancel(struct io_uring_cmd *cmd,
>         io_uring_cmd_mark_cancelable(cmd, issue_flags);
>  }
>
> +static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> +               struct ublk_queue *ubq, int tag, size_t offset)
> +{
> +       struct request *req;
> +
> +       if (!ublk_need_req_ref(ubq))
> +               return NULL;
> +
> +       req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> +       if (!req)
> +               return NULL;
> +
> +       if (!ublk_get_req_ref(ubq, req))
> +               return NULL;
> +
> +       if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
> +               goto fail_put;
> +
> +       if (!ublk_rq_has_data(req))
> +               goto fail_put;
> +
> +       if (offset > blk_rq_bytes(req))
> +               goto fail_put;
> +
> +       return req;
> +fail_put:
> +       ublk_put_req_ref(ubq, req);
> +       return NULL;
> +}
> +
> +static void ublk_io_release(void *priv)
> +{
> +       struct request *rq = priv;
> +       struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> +
> +       ublk_put_req_ref(ubq, rq);
> +}
> +
> +static int ublk_register_io_buf(struct io_uring_cmd *cmd,
> +                               struct ublk_queue *ubq, int tag,
> +                               const struct ublksrv_io_cmd *ub_cmd,
> +                               unsigned int issue_flags)
> +{
> +       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> +       struct ublk_device *ub = cmd->file->private_data;
> +       int index = (int)ub_cmd->addr, ret;

Make index an unsigned to match io_buffer_register_bvec()? Same
comment for ublk_unregister_io_buf().

> +       struct ublk_rq_data *data;
> +       struct request *req;
> +
> +       if (!ub)
> +               return -EPERM;

__ublk_ch_uring_cmd() has already dereferenced ub =
cmd->file->private_data, so how is it possible to hit this? Same
comment for ublk_unregister_io_buf().

> +
> +       req = __ublk_check_and_get_req(ub, ubq, tag, 0);

Consider moving the offset > blk_rq_bytes(req) check from
__ublk_check_and_get_req() to ublk_check_and_get_req() so we don't
need to pass an unused offset here.

> +       if (!req)
> +               return -EINVAL;
> +
> +       data = blk_mq_rq_to_pdu(req);

data appears unused in this function. Same comment for ublk_unregister_io_buf().

> +       ret = io_buffer_register_bvec(ctx, req, ublk_io_release, index,
> +                                     issue_flags);
> +       if (ret) {
> +               ublk_put_req_ref(ubq, req);
> +               return ret;
> +       }
> +
> +       return 0;
> +}
> +
> +static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
> +                                 struct ublk_queue *ubq, int tag,
> +                                 const struct ublksrv_io_cmd *ub_cmd,
> +                                 unsigned int issue_flags)

Make tag an unsigned to match __ublk_ch_uring_cmd() and
blk_mq_tag_to_rq()? Same comment for ublk_register_io_buf().

> +{
> +       struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> +       struct ublk_device *ub = cmd->file->private_data;
> +       int index = (int)ub_cmd->addr;
> +       struct ublk_rq_data *data;
> +       struct request *req;
> +
> +       if (!ub)
> +               return -EPERM;
> +
> +       req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> +       if (!req)
> +               return -EINVAL;
> +
> +       data = blk_mq_rq_to_pdu(req);
> +       io_buffer_unregister_bvec(ctx, index, issue_flags);

Should we check that the registered bvec actually corresponds to this
ublk request? Otherwise, I don't see a reason for the unregister
command to involve the ublk request at all. Perhaps a generic io_uring
"unregister buffer index" operation similar to IORING_OP_FILES_UPDATE
would make more sense?
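
For the first option, a minimal sketch of what the ownership check could
look like, as a hypothetical variant of io_buffer_unregister_bvec() that
takes the expected owner as an extra argument (the signature change is
illustrative only):

void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index,
                               void *priv, unsigned int issue_flags)
{
        struct io_rsrc_data *data = &ctx->buf_table;
        struct io_rsrc_node *node;

        io_ring_submit_lock(ctx, issue_flags);
        node = io_rsrc_node_lookup(data, index);
        /* only drop the node if it was registered for this caller's request */
        if (node && node->buf->release && node->buf->priv == priv) {
                io_put_rsrc_node(ctx, node);
                data->nodes[index] = NULL;
        }
        io_ring_submit_unlock(ctx, issue_flags);
}

ublk would then pass in the request it resolved from the tag, so an index
registered by a different request is left untouched.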

Best,
Caleb

> +       return 0;
> +}
> +
>  static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
>                                unsigned int issue_flags,
>                                const struct ublksrv_io_cmd *ub_cmd)
> @@ -1798,6 +1891,11 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
>
>         ret = -EINVAL;
>         switch (_IOC_NR(cmd_op)) {
> +       case UBLK_IO_REGISTER_IO_BUF:
> +               return ublk_register_io_buf(cmd, ubq, tag, ub_cmd, issue_flags);
> +       case UBLK_IO_UNREGISTER_IO_BUF:
> +               return ublk_unregister_io_buf(cmd, ubq, tag, ub_cmd,
> +                                             issue_flags);
>         case UBLK_IO_FETCH_REQ:
>                 /* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
>                 if (ublk_queue_ready(ubq)) {
> @@ -1872,36 +1970,6 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
>         return -EIOCBQUEUED;
>  }
>
> -static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> -               struct ublk_queue *ubq, int tag, size_t offset)
> -{
> -       struct request *req;
> -
> -       if (!ublk_need_req_ref(ubq))
> -               return NULL;
> -
> -       req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> -       if (!req)
> -               return NULL;
> -
> -       if (!ublk_get_req_ref(ubq, req))
> -               return NULL;
> -
> -       if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
> -               goto fail_put;
> -
> -       if (!ublk_rq_has_data(req))
> -               goto fail_put;
> -
> -       if (offset > blk_rq_bytes(req))
> -               goto fail_put;
> -
> -       return req;
> -fail_put:
> -       ublk_put_req_ref(ubq, req);
> -       return NULL;
> -}
> -
>  static inline int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd,
>                 unsigned int issue_flags)
>  {
> @@ -2527,9 +2595,6 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
>                 goto out_free_dev_number;
>         }
>
> -       /* We are not ready to support zero copy */
> -       ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
> -
>         ub->dev_info.nr_hw_queues = min_t(unsigned int,
>                         ub->dev_info.nr_hw_queues, nr_cpu_ids);
>         ublk_align_max_io_size(ub);
> @@ -2860,7 +2925,7 @@ static int ublk_ctrl_get_features(struct io_uring_cmd *cmd)
>  {
>         const struct ublksrv_ctrl_cmd *header = io_uring_sqe_cmd(cmd->sqe);
>         void __user *argp = (void __user *)(unsigned long)header->addr;
> -       u64 features = UBLK_F_ALL & ~UBLK_F_SUPPORT_ZERO_COPY;
> +       u64 features = UBLK_F_ALL;
>
>         if (header->len != UBLK_FEATURES_LEN || !header->addr)
>                 return -EINVAL;
> diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> index a8bc98bb69fce..74246c926b55f 100644
> --- a/include/uapi/linux/ublk_cmd.h
> +++ b/include/uapi/linux/ublk_cmd.h
> @@ -94,6 +94,10 @@
>         _IOWR('u', UBLK_IO_COMMIT_AND_FETCH_REQ, struct ublksrv_io_cmd)
>  #define        UBLK_U_IO_NEED_GET_DATA         \
>         _IOWR('u', UBLK_IO_NEED_GET_DATA, struct ublksrv_io_cmd)
> +#define        UBLK_U_IO_REGISTER_IO_BUF       \
> +       _IOWR('u', 0x23, struct ublksrv_io_cmd)
> +#define        UBLK_U_IO_UNREGISTER_IO_BUF     \
> +       _IOWR('u', 0x24, struct ublksrv_io_cmd)
>
>  /* only ABORT means that no re-fetch */
>  #define UBLK_IO_RES_OK                 0
> --
> 2.43.5
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 4/5] io_uring: add abstraction for buf_table rsrc data
  2025-02-18 22:42 ` [PATCHv4 4/5] io_uring: add abstraction for buf_table rsrc data Keith Busch
@ 2025-02-19  3:04   ` Caleb Sander Mateos
  0 siblings, 0 replies; 28+ messages in thread
From: Caleb Sander Mateos @ 2025-02-19  3:04 UTC (permalink / raw)
  To: Keith Busch
  Cc: ming.lei, asml.silence, axboe, linux-block, io-uring, bernd,
	Keith Busch

On Tue, Feb 18, 2025 at 2:43 PM Keith Busch <[email protected]> wrote:
>
> From: Keith Busch <[email protected]>
>
> We'll need to add more fields specific to the registered buffers, so
> make a layer for it now. No functional change in this patch.
>
> Signed-off-by: Keith Busch <[email protected]>

Reviewed-by: Caleb Sander Mateos <[email protected]>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 5/5] io_uring: cache nodes and mapped buffers
  2025-02-18 22:42 ` [PATCHv4 5/5] io_uring: cache nodes and mapped buffers Keith Busch
@ 2025-02-19  4:22   ` Caleb Sander Mateos
  2025-02-24 21:01     ` Keith Busch
  2025-02-20 11:08   ` Pavel Begunkov
  1 sibling, 1 reply; 28+ messages in thread
From: Caleb Sander Mateos @ 2025-02-19  4:22 UTC (permalink / raw)
  To: Keith Busch
  Cc: ming.lei, asml.silence, axboe, linux-block, io-uring, bernd,
	Keith Busch

On Tue, Feb 18, 2025 at 2:43 PM Keith Busch <[email protected]> wrote:
>
> From: Keith Busch <[email protected]>
>
> Frequent alloc/free cycles on these is pretty costly. Use an io cache to
> more efficiently reuse these buffers.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
>  include/linux/io_uring_types.h |  18 ++---
>  io_uring/filetable.c           |   2 +-
>  io_uring/rsrc.c                | 120 +++++++++++++++++++++++++--------
>  io_uring/rsrc.h                |   2 +-
>  4 files changed, 104 insertions(+), 38 deletions(-)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 810d1dccd27b1..bbfb651506522 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -69,8 +69,18 @@ struct io_file_table {
>         unsigned int alloc_hint;
>  };
>
> +struct io_alloc_cache {
> +       void                    **entries;
> +       unsigned int            nr_cached;
> +       unsigned int            max_cached;
> +       size_t                  elem_size;

Is growing this field from unsigned to size_t really necessary? It
probably doesn't make sense to be caching allocations > 4 GB.

> +       unsigned int            init_clear;
> +};
> +
>  struct io_buf_table {
>         struct io_rsrc_data     data;
> +       struct io_alloc_cache   node_cache;
> +       struct io_alloc_cache   imu_cache;
>  };
>
>  struct io_hash_bucket {
> @@ -224,14 +234,6 @@ struct io_submit_state {
>         struct blk_plug         plug;
>  };
>
> -struct io_alloc_cache {
> -       void                    **entries;
> -       unsigned int            nr_cached;
> -       unsigned int            max_cached;
> -       unsigned int            elem_size;
> -       unsigned int            init_clear;
> -};
> -
>  struct io_ring_ctx {
>         /* const or read-mostly hot data */
>         struct {
> diff --git a/io_uring/filetable.c b/io_uring/filetable.c
> index dd8eeec97acf6..a21660e3145ab 100644
> --- a/io_uring/filetable.c
> +++ b/io_uring/filetable.c
> @@ -68,7 +68,7 @@ static int io_install_fixed_file(struct io_ring_ctx *ctx, struct file *file,
>         if (slot_index >= ctx->file_table.data.nr)
>                 return -EINVAL;
>
> -       node = io_rsrc_node_alloc(IORING_RSRC_FILE);
> +       node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
>         if (!node)
>                 return -ENOMEM;
>
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 261b5535f46c6..d5cac3a234316 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -32,6 +32,8 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
>  #define IORING_MAX_FIXED_FILES (1U << 20)
>  #define IORING_MAX_REG_BUFFERS (1U << 14)
>
> +#define IO_CACHED_BVECS_SEGS   32
> +
>  int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
>  {
>         unsigned long page_limit, cur_pages, new_pages;
> @@ -101,6 +103,22 @@ int io_buffer_validate(struct iovec *iov)
>         return 0;
>  }
>
> +static struct io_mapped_ubuf *io_alloc_imu(struct io_ring_ctx *ctx,
> +                                          int nr_bvecs)

Pass nr_bvecs as unsigned to avoid sign-extension in struct_size_t()?

> +{
> +       if (nr_bvecs <= IO_CACHED_BVECS_SEGS)
> +               return io_cache_alloc(&ctx->buf_table.imu_cache, GFP_KERNEL);
> +       return kvmalloc(struct_size_t(struct io_mapped_ubuf, bvec, nr_bvecs),
> +                       GFP_KERNEL);
> +}
> +
> +static void io_free_imu(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
> +{
> +       if (imu->nr_bvecs > IO_CACHED_BVECS_SEGS ||
> +           !io_alloc_cache_put(&ctx->buf_table.imu_cache, imu))
> +               kvfree(imu);
> +}
> +
>  static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
>  {
>         struct io_mapped_ubuf *imu = node->buf;
> @@ -119,22 +137,35 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
>                         io_unaccount_mem(ctx, imu->acct_pages);
>         }
>
> -       kvfree(imu);
> +       io_free_imu(ctx, imu);
>  }
>
> -struct io_rsrc_node *io_rsrc_node_alloc(int type)
> +struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type)
>  {
>         struct io_rsrc_node *node;
>
> -       node = kzalloc(sizeof(*node), GFP_KERNEL);
> +       if (type == IORING_RSRC_FILE)
> +               node = kmalloc(sizeof(*node), GFP_KERNEL);
> +       else
> +               node = io_cache_alloc(&ctx->buf_table.node_cache, GFP_KERNEL);
>         if (node) {
>                 node->type = type;
>                 node->refs = 1;
> +               node->tag = 0;
> +               node->file_ptr = 0;
>         }
>         return node;
>  }
>
> -__cold void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data)
> +static __cold void __io_rsrc_data_free(struct io_rsrc_data *data)
> +{
> +       kvfree(data->nodes);
> +       data->nodes = NULL;
> +       data->nr = 0;
> +}
> +
> +__cold void io_rsrc_data_free(struct io_ring_ctx *ctx,
> +                             struct io_rsrc_data *data)
>  {
>         if (!data->nr)
>                 return;
> @@ -142,9 +173,7 @@ __cold void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data
>                 if (data->nodes[data->nr])
>                         io_put_rsrc_node(ctx, data->nodes[data->nr]);
>         }
> -       kvfree(data->nodes);
> -       data->nodes = NULL;
> -       data->nr = 0;
> +       __io_rsrc_data_free(data);
>  }
>
>  __cold int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr)
> @@ -158,6 +187,31 @@ __cold int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr)
>         return -ENOMEM;
>  }
>
> +static __cold int io_rsrc_buffer_alloc(struct io_buf_table *table, unsigned nr)
> +{
> +       const int imu_cache_size = struct_size_t(struct io_mapped_ubuf, bvec,
> +                                                IO_CACHED_BVECS_SEGS);
> +       const int node_size = sizeof(struct io_rsrc_node);
> +       int ret;
> +
> +       ret = io_rsrc_data_alloc(&table->data, nr);
> +       if (ret)
> +               return ret;
> +
> +       if (io_alloc_cache_init(&table->node_cache, nr, node_size, 0))
> +               goto free_data;
> +
> +       if (io_alloc_cache_init(&table->imu_cache, nr, imu_cache_size, 0))
> +               goto free_cache;
> +
> +       return 0;
> +free_cache:
> +       io_alloc_cache_free(&table->node_cache, kfree);
> +free_data:
> +       __io_rsrc_data_free(&table->data);
> +       return -ENOMEM;
> +}
> +
>  static int __io_sqe_files_update(struct io_ring_ctx *ctx,
>                                  struct io_uring_rsrc_update2 *up,
>                                  unsigned nr_args)
> @@ -207,7 +261,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
>                                 err = -EBADF;
>                                 break;
>                         }
> -                       node = io_rsrc_node_alloc(IORING_RSRC_FILE);
> +                       node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
>                         if (!node) {
>                                 err = -ENOMEM;
>                                 fput(file);
> @@ -459,6 +513,8 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
>         case IORING_RSRC_BUFFER:
>                 if (node->buf)
>                         io_buffer_unmap(ctx, node);
> +               if (io_alloc_cache_put(&ctx->buf_table.node_cache, node))
> +                       return;
>                 break;
>         default:
>                 WARN_ON_ONCE(1);
> @@ -527,7 +583,7 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
>                         goto fail;
>                 }
>                 ret = -ENOMEM;
> -               node = io_rsrc_node_alloc(IORING_RSRC_FILE);
> +               node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
>                 if (!node) {
>                         fput(file);
>                         goto fail;
> @@ -547,11 +603,19 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
>         return ret;
>  }
>
> +static void io_rsrc_buffer_free(struct io_ring_ctx *ctx,
> +                               struct io_buf_table *table)
> +{
> +       io_rsrc_data_free(ctx, &table->data);
> +       io_alloc_cache_free(&table->node_cache, kfree);
> +       io_alloc_cache_free(&table->imu_cache, kfree);
> +}
> +
>  int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
>  {
>         if (!ctx->buf_table.data.nr)
>                 return -ENXIO;
> -       io_rsrc_data_free(ctx, &ctx->buf_table.data);
> +       io_rsrc_buffer_free(ctx, &ctx->buf_table);
>         return 0;
>  }
>
> @@ -732,7 +796,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
>         if (!iov->iov_base)
>                 return NULL;
>
> -       node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
> +       node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
>         if (!node)
>                 return ERR_PTR(-ENOMEM);
>         node->buf = NULL;
> @@ -752,7 +816,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
>                         coalesced = io_coalesce_buffer(&pages, &nr_pages, &data);
>         }
>
> -       imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
> +       imu = io_alloc_imu(ctx, nr_pages);
>         if (!imu)
>                 goto done;
>
> @@ -789,7 +853,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
>         }
>  done:
>         if (ret) {
> -               kvfree(imu);
> +               io_free_imu(ctx, imu);
>                 if (node)
>                         io_put_rsrc_node(ctx, node);
>                 node = ERR_PTR(ret);
> @@ -802,9 +866,9 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
>                             unsigned int nr_args, u64 __user *tags)
>  {
>         struct page *last_hpage = NULL;
> -       struct io_rsrc_data data;
>         struct iovec fast_iov, *iov = &fast_iov;
>         const struct iovec __user *uvec;
> +       struct io_buf_table table;
>         int i, ret;
>
>         BUILD_BUG_ON(IORING_MAX_REG_BUFFERS >= (1u << 16));
> @@ -813,13 +877,14 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
>                 return -EBUSY;
>         if (!nr_args || nr_args > IORING_MAX_REG_BUFFERS)
>                 return -EINVAL;
> -       ret = io_rsrc_data_alloc(&data, nr_args);
> +       ret = io_rsrc_buffer_alloc(&table, nr_args);
>         if (ret)
>                 return ret;
>
>         if (!arg)
>                 memset(iov, 0, sizeof(*iov));
>
> +       ctx->buf_table = table;
>         for (i = 0; i < nr_args; i++) {
>                 struct io_rsrc_node *node;
>                 u64 tag = 0;
> @@ -859,10 +924,8 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
>                         }
>                         node->tag = tag;
>                 }
> -               data.nodes[i] = node;
> +               table.data.nodes[i] = node;
>         }
> -
> -       ctx->buf_table.data = data;

Still don't see the need to move this assignment. Is there a reason
you prefer setting ctx->buf_table before initializing its nodes? I
find the existing code easier to follow, where the table is moved to
ctx->buf_table after filling it in. It's also consistent with
io_clone_buffers().

>         if (ret)
>                 io_sqe_buffers_unregister(ctx);
>         return ret;
> @@ -887,14 +950,15 @@ int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
>                 goto unlock;
>         }
>
> -       node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
> +       node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
>         if (!node) {
>                 ret = -ENOMEM;
>                 goto unlock;
>         }
>
>         nr_bvecs = blk_rq_nr_phys_segments(rq);
> -       imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
> +
> +       imu = io_alloc_imu(ctx, nr_bvecs);

nit: probably don't need to add a blank line here

Best,
Caleb

>         if (!imu) {
>                 kfree(node);
>                 ret = -ENOMEM;
> @@ -1031,7 +1095,7 @@ static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *ctx2)
>  static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx,
>                             struct io_uring_clone_buffers *arg)
>  {
> -       struct io_rsrc_data data;
> +       struct io_buf_table table;
>         int i, ret, off, nr;
>         unsigned int nbufs;
>
> @@ -1062,7 +1126,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
>         if (check_add_overflow(arg->nr, arg->dst_off, &nbufs))
>                 return -EOVERFLOW;
>
> -       ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.data.nr));
> +       ret = io_rsrc_buffer_alloc(&table, max(nbufs, ctx->buf_table.data.nr));
>         if (ret)
>                 return ret;
>
> @@ -1071,7 +1135,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
>                 struct io_rsrc_node *src_node = ctx->buf_table.data.nodes[i];
>
>                 if (src_node) {
> -                       data.nodes[i] = src_node;
> +                       table.data.nodes[i] = src_node;
>                         src_node->refs++;
>                 }
>         }
> @@ -1101,7 +1165,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
>                 if (!src_node) {
>                         dst_node = NULL;
>                 } else {
> -                       dst_node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
> +                       dst_node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
>                         if (!dst_node) {
>                                 ret = -ENOMEM;
>                                 goto out_free;
> @@ -1110,12 +1174,12 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
>                         refcount_inc(&src_node->buf->refs);
>                         dst_node->buf = src_node->buf;
>                 }
> -               data.nodes[off++] = dst_node;
> +               table.data.nodes[off++] = dst_node;
>                 i++;
>         }
>
>         /*
> -        * If asked for replace, put the old table. data->nodes[] holds both
> +        * If asked for replace, put the old table. table.data->nodes[] holds both
>          * old and new nodes at this point.
>          */
>         if (arg->flags & IORING_REGISTER_DST_REPLACE)
> @@ -1128,10 +1192,10 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
>          * entry).
>          */
>         WARN_ON_ONCE(ctx->buf_table.data.nr);
> -       ctx->buf_table.data = data;
> +       ctx->buf_table = table;
>         return 0;
>  out_free:
> -       io_rsrc_data_free(ctx, &data);
> +       io_rsrc_buffer_free(ctx, &table);
>         return ret;
>  }
>
> diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
> index 81c31b93d4f7b..69cc212ad0c7c 100644
> --- a/io_uring/rsrc.h
> +++ b/io_uring/rsrc.h
> @@ -49,7 +49,7 @@ struct io_imu_folio_data {
>         unsigned int    nr_folios;
>  };
>
> -struct io_rsrc_node *io_rsrc_node_alloc(int type);
> +struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type);
>  void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node);
>  void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data);
>  int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr);
> --
> 2.43.5
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 1/5] io_uring: move fixed buffer import to issue path
  2025-02-18 22:42 ` [PATCHv4 1/5] io_uring: move fixed buffer import to issue path Keith Busch
  2025-02-19  1:27   ` Caleb Sander Mateos
@ 2025-02-19  4:23   ` Ming Lei
  2025-02-19 16:48   ` Pavel Begunkov
  2 siblings, 0 replies; 28+ messages in thread
From: Ming Lei @ 2025-02-19  4:23 UTC (permalink / raw)
  To: Keith Busch
  Cc: asml.silence, axboe, linux-block, io-uring, bernd, csander,
	Keith Busch

On Tue, Feb 18, 2025 at 02:42:25PM -0800, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> Similar to the fixed file path, requests may depend on a previous one
> to set up an index, so we need to allow linking them. The prep callback
> happens too soon for linked commands, so the lookup needs to be deferred
> to the issue path. Change the prep callbacks to just set the buf_index
> and let generic io_uring code handle the fixed buffer node setup, just
> like it already does for fixed files.
> 
> Signed-off-by: Keith Busch <[email protected]>
> ---

...

> diff --git a/io_uring/net.c b/io_uring/net.c
> index 000dc70d08d0d..39838e8575b53 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -1373,6 +1373,10 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  #endif
>  	if (unlikely(!io_msg_alloc_async(req)))
>  		return -ENOMEM;
> +	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
> +		req->buf_index = zc->buf_index;
> +		req->flags |= REQ_F_FIXED_BUFFER;
> +	}
>  	if (req->opcode != IORING_OP_SENDMSG_ZC)
>  		return io_send_setup(req, sqe);
>  	return io_sendmsg_setup(req, sqe);
> @@ -1434,25 +1438,10 @@ static int io_send_zc_import(struct io_kiocb *req, unsigned int issue_flags)
>  	struct io_async_msghdr *kmsg = req->async_data;
>  	int ret;
>  
> -	if (sr->flags & IORING_RECVSEND_FIXED_BUF) {
> -		struct io_ring_ctx *ctx = req->ctx;
> -		struct io_rsrc_node *node;
> -
> -		ret = -EFAULT;
> -		io_ring_submit_lock(ctx, issue_flags);
> -		node = io_rsrc_node_lookup(&ctx->buf_table, sr->buf_index);
> -		if (node) {
> -			io_req_assign_buf_node(sr->notif, node);

Here the node buffer is assigned to ->notif req, instead of the current
request, and you may have to deal with this case here.
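
One way to deal with it in this scheme might be to also hang the node off
the notif at import time, since the notif is what outlives the send and
has to keep the buffer alive (sketch only, not from the patch):

        if (req->flags & REQ_F_FIXED_BUFFER) {
                /* the notif completes after the send, so it needs its own ref */
                io_req_assign_buf_node(sr->notif, req->buf_node);
        }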

Otherwise, this patch looks fine:

Reviewed-by: Ming Lei <[email protected]>



Thanks,
Ming


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 1/5] io_uring: move fixed buffer import to issue path
  2025-02-18 22:42 ` [PATCHv4 1/5] io_uring: move fixed buffer import to issue path Keith Busch
  2025-02-19  1:27   ` Caleb Sander Mateos
  2025-02-19  4:23   ` Ming Lei
@ 2025-02-19 16:48   ` Pavel Begunkov
  2025-02-19 17:15     ` Pavel Begunkov
  2025-02-20  1:25     ` Keith Busch
  2 siblings, 2 replies; 28+ messages in thread
From: Pavel Begunkov @ 2025-02-19 16:48 UTC (permalink / raw)
  To: Keith Busch, ming.lei, axboe, linux-block, io-uring
  Cc: bernd, csander, Keith Busch

On 2/18/25 22:42, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> Similar to the fixed file path, requests may depend on a previous one
> to set up an index, so we need to allow linking them. The prep callback
> happens too soon for linked commands, so the lookup needs to be deferred
> to the issue path. Change the prep callbacks to just set the buf_index
> and let generic io_uring code handle the fixed buffer node setup, just
> like it already does for fixed files.
> 
> Signed-off-by: Keith Busch <[email protected]>

It wasn't great before, and it'd be harder to follow if we shove it
into the issue path like that. It also adds overhead in the common
path and isn't very flexible; see the notification problem and what
we need out of it for other features.

We're better off removing the lookup vs import split, as below.
Here is a branch; let's do it on top.

https://github.com/isilence/linux.git regbuf-import


diff --git a/io_uring/net.c b/io_uring/net.c
index ce0a39972cce..322cf023233a 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1360,24 +1360,10 @@ static int io_send_zc_import(struct io_kiocb *req, unsigned int issue_flags)
  	int ret;
  
  	if (sr->flags & IORING_RECVSEND_FIXED_BUF) {
-		struct io_ring_ctx *ctx = req->ctx;
-		struct io_rsrc_node *node;
-
-		ret = -EFAULT;
-		io_ring_submit_lock(ctx, issue_flags);
-		node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
-		if (node) {
-			io_req_assign_buf_node(sr->notif, node);
-			ret = 0;
-		}
-		io_ring_submit_unlock(ctx, issue_flags);
-
-		if (unlikely(ret))
-			return ret;
-
-		ret = io_import_fixed(ITER_SOURCE, &kmsg->msg.msg_iter,
-					node->buf, (u64)(uintptr_t)sr->buf,
-					sr->len);
+		sr->notif->buf_index = req->buf_index;
+		ret = io_import_reg_buf(sr->notif, &kmsg->msg.msg_iter,
+					(u64)(uintptr_t)sr->buf, sr->len,
+					ITER_SOURCE, issue_flags);
  		if (unlikely(ret))
  			return ret;
  		kmsg->msg.sg_from_iter = io_sg_from_iter;
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index af39b69eb4fd..a4557ed14bea 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -860,9 +860,9 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
  	return ret;
  }
  
-int io_import_fixed(int ddir, struct iov_iter *iter,
-			   struct io_mapped_ubuf *imu,
-			   u64 buf_addr, size_t len)
+static int io_import_fixed_imu(int ddir, struct iov_iter *iter,
+				struct io_mapped_ubuf *imu,
+				u64 buf_addr, size_t len)
  {
  	u64 buf_end;
  	size_t offset;
@@ -919,6 +919,35 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
  	return 0;
  }
  
+static inline struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
+						    unsigned issue_flags)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_rsrc_node *node;
+
+	if (req->buf_node)
+		return req->buf_node;
+
+	io_ring_submit_lock(ctx, issue_flags);
+	node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
+	if (node)
+		io_req_assign_buf_node(req, node);
+	io_ring_submit_unlock(ctx, issue_flags);
+	return node;
+}
+
+int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
+			u64 buf_addr, size_t len, int ddir,
+			unsigned issue_flags)
+{
+	struct io_rsrc_node *node;
+
+	node = io_find_buf_node(req, issue_flags);
+	if (!node)
+		return -EFAULT;
+	return io_import_fixed_imu(ddir, iter, node->buf, buf_addr, len);
+}
+
  /* Lock two rings at once. The rings must be different! */
  static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *ctx2)
  {
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index a6d883c62b22..ce199eb0ac9f 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -50,9 +50,9 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node);
  void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data);
  int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr);
  
-int io_import_fixed(int ddir, struct iov_iter *iter,
-			   struct io_mapped_ubuf *imu,
-			   u64 buf_addr, size_t len);
+int io_import_reg_buf(struct io_kiocb *req, struct iov_iter *iter,
+			u64 buf_addr, size_t len, int ddir,
+			unsigned issue_flags);
  
  int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg);
  int io_sqe_buffers_unregister(struct io_ring_ctx *ctx);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 16f12f94943f..31a1f15f5f01 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -354,8 +354,6 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
  			    int ddir)
  {
  	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
-	struct io_ring_ctx *ctx = req->ctx;
-	struct io_rsrc_node *node;
  	struct io_async_rw *io;
  	int ret;
  
@@ -363,13 +361,8 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
  	if (unlikely(ret))
  		return ret;
  
-	node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
-	if (!node)
-		return -EFAULT;
-	io_req_assign_buf_node(req, node);
-
  	io = req->async_data;
-	ret = io_import_fixed(ddir, &io->iter, node->buf, rw->addr, rw->len);
+	ret = io_import_reg_buf(req, &io->iter, rw->addr, rw->len, ddir, 0);
  	iov_iter_save_state(&io->iter, &io->iter_state);
  	return ret;
  }
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index cf5e9822e49a..80046e755211 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -199,21 +199,9 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  	if (ioucmd->flags & ~IORING_URING_CMD_MASK)
  		return -EINVAL;
  
-	if (ioucmd->flags & IORING_URING_CMD_FIXED) {
-		struct io_ring_ctx *ctx = req->ctx;
-		struct io_rsrc_node *node;
-		u16 index = READ_ONCE(sqe->buf_index);
-
-		node = io_rsrc_node_lookup(&ctx->buf_table, index);
-		if (unlikely(!node))
-			return -EFAULT;
-		/*
-		 * Pi node upfront, prior to io_uring_cmd_import_fixed()
-		 * being called. This prevents destruction of the mapped buffer
-		 * we'll need at actual import time.
-		 */
-		io_req_assign_buf_node(req, node);
-	}
+	if (ioucmd->flags & IORING_URING_CMD_FIXED)
+		req->buf_index = READ_ONCE(sqe->buf_index);
+
  	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
  
  	return io_uring_cmd_prep_setup(req, sqe);
@@ -261,13 +249,8 @@ int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
  			      unsigned int issue_flags)
  {
  	struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
-	struct io_rsrc_node *node = req->buf_node;
-
-	/* Must have had rsrc_node assigned at prep time */
-	if (node)
-		return io_import_fixed(rw, iter, node->buf, ubuf, len);
  
-	return -EFAULT;
+	return io_import_reg_buf(req, iter, ubuf, len, rw, issue_flags);
  }
  EXPORT_SYMBOL_GPL(io_uring_cmd_import_fixed);
  


-- 
Pavel Begunkov


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 1/5] io_uring: move fixed buffer import to issue path
  2025-02-19 16:48   ` Pavel Begunkov
@ 2025-02-19 17:15     ` Pavel Begunkov
  2025-02-20  1:25     ` Keith Busch
  1 sibling, 0 replies; 28+ messages in thread
From: Pavel Begunkov @ 2025-02-19 17:15 UTC (permalink / raw)
  To: Keith Busch, ming.lei, axboe, linux-block, io-uring
  Cc: bernd, csander, Keith Busch

On 2/19/25 16:48, Pavel Begunkov wrote:
> On 2/18/25 22:42, Keith Busch wrote:
>> From: Keith Busch <[email protected]>
>>
>> Similar to the fixed file path, requests may depend on a previous one
>> to set up an index, so we need to allow linking them. The prep callback
>> happens too soon for linked commands, so the lookup needs to be deferred
>> to the issue path. Change the prep callbacks to just set the buf_index
>> and let generic io_uring code handle the fixed buffer node setup, just
>> like it already does for fixed files.
>>
>> Signed-off-by: Keith Busch <[email protected]>
> 
> It wasn't great before, and it'd be harder to follow if we shove it
> into the issue path like that. It also adds overhead in the common
> path and isn't very flexible; see the notification problem and what
> we need out of it for other features.
> 
> We're better to remove the lookup vs import split like below.
> Here is a branch, let's do it on top.
> 
> https://github.com/isilence/linux.git regbuf-import
> 
> 
> diff --git a/io_uring/net.c b/io_uring/net.c
> index ce0a39972cce..322cf023233a 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -1360,24 +1360,10 @@ static int io_send_zc_import(struct io_kiocb *req, unsigned int issue_flags)
...
> -int io_import_fixed(int ddir, struct iov_iter *iter,
> -               struct io_mapped_ubuf *imu,
> -               u64 buf_addr, size_t len)
> +static int io_import_fixed_imu(int ddir, struct iov_iter *iter,
> +                struct io_mapped_ubuf *imu,
> +                u64 buf_addr, size_t len)
>   {
>       u64 buf_end;
>       size_t offset;
> @@ -919,6 +919,35 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
>       return 0;
>   }
> 
> +static inline struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
> +                            unsigned issue_flags)
> +{
> +    struct io_ring_ctx *ctx = req->ctx;
> +    struct io_rsrc_node *node;
> +
> +    if (req->buf_node)

Seems it should be checking for REQ_F_BUF_NODE instead

> +        return req->buf_node;
> +
> +    io_ring_submit_lock(ctx, issue_flags);
> +    node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
> +    if (node)
> +        io_req_assign_buf_node(req, node);
> +    io_ring_submit_unlock(ctx, issue_flags);
> +    return node;
> +}

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 2/5] io_uring: add support for kernel registered bvecs
  2025-02-19  1:54   ` Caleb Sander Mateos
@ 2025-02-19 17:23     ` Pavel Begunkov
  2025-02-20 10:31     ` Pavel Begunkov
  1 sibling, 0 replies; 28+ messages in thread
From: Pavel Begunkov @ 2025-02-19 17:23 UTC (permalink / raw)
  To: Caleb Sander Mateos, Keith Busch
  Cc: ming.lei, axboe, linux-block, io-uring, bernd, Keith Busch

On 2/19/25 01:54, Caleb Sander Mateos wrote:
> On Tue, Feb 18, 2025 at 2:42 PM Keith Busch <[email protected]> wrote:
...
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index 0bcaefc4ffe02..2aed51e8c79ee 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -709,4 +709,10 @@ static inline bool io_ctx_cqe32(struct io_ring_ctx *ctx)
>>          return ctx->flags & IORING_SETUP_CQE32;
>>   }
>>
>> +int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
>> +                           void (*release)(void *), unsigned int index,
>> +                           unsigned int issue_flags);
>> +void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index,
>> +                              unsigned int issue_flags);
> 
> Hmm, io_uring_types.h seems like a strange place for these. The rest
> of this file consists of types and inline functions. Maybe
> io_uring/rsrc.h would make more sense?

There is linux/io_uring/cmd.h for that

And it should be taking struct io_uring_cmd. ublk shouldn't
know anything about ctx, and this line from the following
patch is not allowed.

struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
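
e.g. a sketch of prototypes that keep the ctx inside io_uring (passing the
cmd is the suggestion here; the exact signatures are only illustrative):

int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
                            void (*release)(void *), unsigned int index,
                            unsigned int issue_flags);
void io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
                               unsigned int issue_flags);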

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 1/5] io_uring: move fixed buffer import to issue path
  2025-02-19 16:48   ` Pavel Begunkov
  2025-02-19 17:15     ` Pavel Begunkov
@ 2025-02-20  1:25     ` Keith Busch
  2025-02-20 10:12       ` Pavel Begunkov
  1 sibling, 1 reply; 28+ messages in thread
From: Keith Busch @ 2025-02-20  1:25 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Keith Busch, ming.lei, axboe, linux-block, io-uring, bernd,
	csander

On Wed, Feb 19, 2025 at 04:48:30PM +0000, Pavel Begunkov wrote:
> We're better to remove the lookup vs import split like below.
> Here is a branch, let's do it on top.
> 
> https://github.com/isilence/linux.git regbuf-import

Your first patch adds a 10th parameter to an nvme function, most of
which are unused in half the branches. I think we've done something
wrong here, so I want to take a shot at cleaning that up. Otherwise I
think what you're proposing is an improvement.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 1/5] io_uring: move fixed buffer import to issue path
  2025-02-20  1:25     ` Keith Busch
@ 2025-02-20 10:12       ` Pavel Begunkov
  0 siblings, 0 replies; 28+ messages in thread
From: Pavel Begunkov @ 2025-02-20 10:12 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, ming.lei, axboe, linux-block, io-uring, bernd,
	csander

On 2/20/25 01:25, Keith Busch wrote:
> On Wed, Feb 19, 2025 at 04:48:30PM +0000, Pavel Begunkov wrote:
>> We're better to remove the lookup vs import split like below.
>> Here is a branch, let's do it on top.
>>
>> https://github.com/isilence/linux.git regbuf-import
> 
> Your first patch adds a 10th parameter to an nvme function, most of
> which are unused in half the branches. I think we've done something
> wrong here, so I want to take a shot at cleaning that up. Otherwise I

That would be great. I didn't feel great about it either; even
the existing ioucmd arg doesn't seem like a perfect fit.


> think what you're proposing is an improvement.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 2/5] io_uring: add support for kernel registered bvecs
  2025-02-19  1:54   ` Caleb Sander Mateos
  2025-02-19 17:23     ` Pavel Begunkov
@ 2025-02-20 10:31     ` Pavel Begunkov
  1 sibling, 0 replies; 28+ messages in thread
From: Pavel Begunkov @ 2025-02-20 10:31 UTC (permalink / raw)
  To: Caleb Sander Mateos, Keith Busch
  Cc: ming.lei, axboe, linux-block, io-uring, bernd, Keith Busch

On 2/19/25 01:54, Caleb Sander Mateos wrote:
> On Tue, Feb 18, 2025 at 2:42 PM Keith Busch <[email protected]> wrote:
...
>>   int io_import_fixed(int ddir, struct iov_iter *iter,
>>                             struct io_mapped_ubuf *imu,
>>                             u64 buf_addr, size_t len)
>> @@ -874,6 +963,9 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
>>          /* not inside the mapped region */
>>          if (unlikely(buf_addr < imu->ubuf || buf_end > (imu->ubuf + imu->len)))
>>                  return -EFAULT;
>> +       if ((ddir == READ && !imu->readable) ||
>> +           (ddir == WRITE && !imu->writeable))
>> +               return -EFAULT;
> 
> This could be made less branchy by storing a bitmask of allowed data
> transfer directions instead of 2 bool fields. Then this could just be:
> if (!(imu->ddirs >> ddir & 1))
>          return -EFAULT;

I'd prefer a direction mask indeed.
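
A minimal sketch of the mask idea, for illustration only; the field
name dir_mask and the exact encoding are assumptions, not the posted
patch:

	struct io_mapped_ubuf {
		/* ... existing fields ... */
		unsigned int	dir_mask;	/* one bit per allowed data direction */
	};

	/* import side: a single branch replaces the readable/writeable pair */
	if (!(imu->dir_mask & (1 << ddir)))
		return -EFAULT;

With READ == 0 and WRITE == 1 only two bits are needed, and the
registration side sets exactly the directions the buffer may be used
for.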

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 2/5] io_uring: add support for kernel registered bvecs
  2025-02-18 22:42 ` [PATCHv4 2/5] io_uring: add support for kernel registered bvecs Keith Busch
  2025-02-19  1:54   ` Caleb Sander Mateos
@ 2025-02-20 10:38   ` Pavel Begunkov
  1 sibling, 0 replies; 28+ messages in thread
From: Pavel Begunkov @ 2025-02-20 10:38 UTC (permalink / raw)
  To: Keith Busch, ming.lei, axboe, linux-block, io-uring
  Cc: bernd, csander, Keith Busch

On 2/18/25 22:42, Keith Busch wrote:
> From: Keith Busch <[email protected]>
...
> +int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
> +			    void (*release)(void *), unsigned int index,
> +			    unsigned int issue_flags)
> +{
> +	struct io_rsrc_data *data = &ctx->buf_table;
> +	struct req_iterator rq_iter;
> +	struct io_mapped_ubuf *imu;
> +	struct io_rsrc_node *node;
> +	struct bio_vec bv, *bvec;
> +	int ret = 0;
> +	u16 nr_bvecs;
> +
> +	io_ring_submit_lock(ctx, issue_flags);
> +
> +	if (io_rsrc_node_lookup(data, index)) {

I don't think the array nospec inside the lookup is enough, as you
use the old unsanitised index below to assign the node, and it seems
like it can still speculate there.

Same in io_buffer_unregister_bvec().

...
> +	node->buf = imu;
> +	data->nodes[index] = node;
> +unlock:
> +	io_ring_submit_unlock(ctx, issue_flags);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> +
> +void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index,
> +			       unsigned int issue_flags)
> +{
> +	struct io_rsrc_data *data = &ctx->buf_table;
> +	struct io_rsrc_node *node;
> +
> +	io_ring_submit_lock(ctx, issue_flags);
> +
> +	if (!data->nr)
> +		goto unlock;
> +
> +	node = io_rsrc_node_lookup(data, index);
> +	if (!node || !node->buf->release)
> +		goto unlock;
> +
> +	io_put_rsrc_node(ctx, node);
> +	data->nodes[index] = NULL;
> +unlock:
> +	io_ring_submit_unlock(ctx, issue_flags);
> +}
> +EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
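
For illustration, the kind of sanitisation being asked for, assuming
the register path clamps the caller-supplied index once with
array_index_nospec() (linux/nospec.h) and reuses the clamped value for
the final store:

	if (index >= data->nr)
		goto unlock;
	index = array_index_nospec(index, data->nr);

	/* ... existing occupancy/error checks and imu setup unchanged ... */

	node->buf = imu;
	data->nodes[index] = node;	/* store uses the sanitised index */

The unregister path would clamp its index the same way before touching
data->nodes[index].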


-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 5/5] io_uring: cache nodes and mapped buffers
  2025-02-18 22:42 ` [PATCHv4 5/5] io_uring: cache nodes and mapped buffers Keith Busch
  2025-02-19  4:22   ` Caleb Sander Mateos
@ 2025-02-20 11:08   ` Pavel Begunkov
  2025-02-20 15:24     ` Keith Busch
  1 sibling, 1 reply; 28+ messages in thread
From: Pavel Begunkov @ 2025-02-20 11:08 UTC (permalink / raw)
  To: Keith Busch, ming.lei, axboe, linux-block, io-uring
  Cc: bernd, csander, Keith Busch

On 2/18/25 22:42, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> Frequent alloc/free cycles on these is pretty costly. Use an io cache to
> more efficiently reuse these buffers.
> 
> Signed-off-by: Keith Busch <[email protected]>
> ---
>   include/linux/io_uring_types.h |  18 ++---
>   io_uring/filetable.c           |   2 +-
>   io_uring/rsrc.c                | 120 +++++++++++++++++++++++++--------
>   io_uring/rsrc.h                |   2 +-
>   4 files changed, 104 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 810d1dccd27b1..bbfb651506522 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -69,8 +69,18 @@ struct io_file_table {
>   	unsigned int alloc_hint;
...
> -struct io_rsrc_node *io_rsrc_node_alloc(int type)
> +struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type)
>   {
>   	struct io_rsrc_node *node;
>   
> -	node = kzalloc(sizeof(*node), GFP_KERNEL);
> +	if (type == IORING_RSRC_FILE)
> +		node = kmalloc(sizeof(*node), GFP_KERNEL);
> +	else
> +		node = io_cache_alloc(&ctx->buf_table.node_cache, GFP_KERNEL);

That's why node allocators shouldn't be a part of the buffer table.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 3/5] ublk: zc register/unregister bvec
  2025-02-18 22:42 ` [PATCHv4 3/5] ublk: zc register/unregister bvec Keith Busch
  2025-02-19  2:36   ` Caleb Sander Mateos
@ 2025-02-20 11:11   ` Pavel Begunkov
  2025-02-24 21:02     ` Keith Busch
  1 sibling, 1 reply; 28+ messages in thread
From: Pavel Begunkov @ 2025-02-20 11:11 UTC (permalink / raw)
  To: Keith Busch, ming.lei, axboe, linux-block, io-uring
  Cc: bernd, csander, Keith Busch

On 2/18/25 22:42, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> Provide new operations for the user to request mapping an active request
> to an io uring instance's buf_table. The user has to provide the index
> it wants to install the buffer.
> 
> A reference count is taken on the request to ensure it can't be
> completed while it is active in a ring's buf_table.
> 
> Signed-off-by: Keith Busch <[email protected]>
> ---
>   drivers/block/ublk_drv.c      | 137 +++++++++++++++++++++++++---------
>   include/uapi/linux/ublk_cmd.h |   4 +
>   2 files changed, 105 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 529085181f355..0c753176b14e9 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -51,6 +51,9 @@
...
> +static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
> +				  struct ublk_queue *ubq, int tag,
> +				  const struct ublksrv_io_cmd *ub_cmd,
> +				  unsigned int issue_flags)
> +{
> +	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> +	struct ublk_device *ub = cmd->file->private_data;
> +	int index = (int)ub_cmd->addr;
> +	struct ublk_rq_data *data;
> +	struct request *req;
> +
> +	if (!ub)
> +		return -EPERM;
> +
> +	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);

Shouldn't there be some speculation sanitisation for the tag as well?
It looks like a user-passed value directly indexing an array.

> +	if (!req)
> +		return -EINVAL;
> +
> +	data = blk_mq_rq_to_pdu(req);
> +	io_buffer_unregister_bvec(ctx, index, issue_flags);
> +	return 0;
> +}
> +
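
For illustration, one way to address it, assuming the tag is bounded by
the queue depth and using array_index_nospec() as elsewhere:

	if (tag >= ubq->q_depth)
		return -EINVAL;
	tag = array_index_nospec(tag, ubq->q_depth);
	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);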


-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 5/5] io_uring: cache nodes and mapped buffers
  2025-02-20 11:08   ` Pavel Begunkov
@ 2025-02-20 15:24     ` Keith Busch
  2025-02-20 16:06       ` Pavel Begunkov
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2025-02-20 15:24 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Keith Busch, ming.lei, axboe, linux-block, io-uring, bernd,
	csander

On Thu, Feb 20, 2025 at 11:08:25AM +0000, Pavel Begunkov wrote:
> On 2/18/25 22:42, Keith Busch wrote:
> > From: Keith Busch <[email protected]>
> > 
> > Frequent alloc/free cycles on these is pretty costly. Use an io cache to
> > more efficiently reuse these buffers.
> > 
> > Signed-off-by: Keith Busch <[email protected]>
> > ---
> >   include/linux/io_uring_types.h |  18 ++---
> >   io_uring/filetable.c           |   2 +-
> >   io_uring/rsrc.c                | 120 +++++++++++++++++++++++++--------
> >   io_uring/rsrc.h                |   2 +-
> >   4 files changed, 104 insertions(+), 38 deletions(-)
> > 
> > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > index 810d1dccd27b1..bbfb651506522 100644
> > --- a/include/linux/io_uring_types.h
> > +++ b/include/linux/io_uring_types.h
> > @@ -69,8 +69,18 @@ struct io_file_table {
> >   	unsigned int alloc_hint;
> ...
> > -struct io_rsrc_node *io_rsrc_node_alloc(int type)
> > +struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type)
> >   {
> >   	struct io_rsrc_node *node;
> > -	node = kzalloc(sizeof(*node), GFP_KERNEL);
> > +	if (type == IORING_RSRC_FILE)
> > +		node = kmalloc(sizeof(*node), GFP_KERNEL);
> > +	else
> > +		node = io_cache_alloc(&ctx->buf_table.node_cache, GFP_KERNEL);
> 
> That's why node allocators shouldn't be a part of the buffer table.

Are you saying you want file nodes to also subscribe to the cache? The
two tables can be resized independently of each other, we don't know how
many elements the cache needs to hold.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 5/5] io_uring: cache nodes and mapped buffers
  2025-02-20 15:24     ` Keith Busch
@ 2025-02-20 16:06       ` Pavel Begunkov
  2025-02-24 21:04         ` Keith Busch
  0 siblings, 1 reply; 28+ messages in thread
From: Pavel Begunkov @ 2025-02-20 16:06 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, ming.lei, axboe, linux-block, io-uring, bernd,
	csander

On 2/20/25 15:24, Keith Busch wrote:
> On Thu, Feb 20, 2025 at 11:08:25AM +0000, Pavel Begunkov wrote:
>> On 2/18/25 22:42, Keith Busch wrote:
>>> From: Keith Busch <[email protected]>
>>>
>>> Frequent alloc/free cycles on these is pretty costly. Use an io cache to
>>> more efficiently reuse these buffers.
>>>
>>> Signed-off-by: Keith Busch <[email protected]>
>>> ---
>>>    include/linux/io_uring_types.h |  18 ++---
>>>    io_uring/filetable.c           |   2 +-
>>>    io_uring/rsrc.c                | 120 +++++++++++++++++++++++++--------
>>>    io_uring/rsrc.h                |   2 +-
>>>    4 files changed, 104 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>> index 810d1dccd27b1..bbfb651506522 100644
>>> --- a/include/linux/io_uring_types.h
>>> +++ b/include/linux/io_uring_types.h
>>> @@ -69,8 +69,18 @@ struct io_file_table {
>>>    	unsigned int alloc_hint;
>> ...
>>> -struct io_rsrc_node *io_rsrc_node_alloc(int type)
>>> +struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type)
>>>    {
>>>    	struct io_rsrc_node *node;
>>> -	node = kzalloc(sizeof(*node), GFP_KERNEL);
>>> +	if (type == IORING_RSRC_FILE)
>>> +		node = kmalloc(sizeof(*node), GFP_KERNEL);
>>> +	else
>>> +		node = io_cache_alloc(&ctx->buf_table.node_cache, GFP_KERNEL);
>>
>> That's why node allocators shouldn't be a part of the buffer table.
> 
> Are you saying you want file nodes to also subscribe to the cache? The

Yes, but it might be easier for you to focus on finalising the essential
parts, and then we can improve later.

> two tables can be resized independently of each other, we don't know how
> many elements the cache needs to hold.

I wouldn't try to correlate table sizes with desired cache sizes;
users can have quite different patterns, like allocating a barely used
huge table. And what you care about is the speed of changing nodes,
which at the extremes is limited by CPU performance rather than by the
size of the table. And you can reallocate it as well.
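
A minimal sketch of the direction this points at: a single node cache
hanging off the ctx rather than off the buffer table, sized
independently of either rsrc table (the ctx field name node_cache is an
assumption):

	struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type)
	{
		struct io_rsrc_node *node;

		/* one cache serves both file and buffer nodes */
		node = io_cache_alloc(&ctx->node_cache, GFP_KERNEL);
		if (node)
			node->type = type;
		return node;
	}

Any remaining field initialisation is elided here.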

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 5/5] io_uring: cache nodes and mapped buffers
  2025-02-19  4:22   ` Caleb Sander Mateos
@ 2025-02-24 21:01     ` Keith Busch
  2025-02-24 21:39       ` Caleb Sander Mateos
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2025-02-24 21:01 UTC (permalink / raw)
  To: Caleb Sander Mateos
  Cc: Keith Busch, ming.lei, asml.silence, axboe, linux-block, io-uring,
	bernd

On Tue, Feb 18, 2025 at 08:22:36PM -0800, Caleb Sander Mateos wrote:
> > +struct io_alloc_cache {
> > +       void                    **entries;
> > +       unsigned int            nr_cached;
> > +       unsigned int            max_cached;
> > +       size_t                  elem_size;
> 
> Is growing this field from unsigned to size_t really necessary? It
> probably doesn't make sense to be caching allocations > 4 GB.

It used to be a size_t when I initially moved the struct to here, but
it's not anymore, so I'm out of sync. I'll fix it up.
 
> > @@ -859,10 +924,8 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
> >                         }
> >                         node->tag = tag;
> >                 }
> > -               data.nodes[i] = node;
> > +               table.data.nodes[i] = node;
> >         }
> > -
> > -       ctx->buf_table.data = data;
> 
> Still don't see the need to move this assignment. Is there a reason
> you prefer setting ctx->buf_table before initializing its nodes? I
> find the existing code easier to follow, where the table is moved to
> ctx->buf_table after filling it in. It's also consistent with
> io_clone_buffers().

Yes, it needs to move earlier. The ctx buf_table needs to be set
before any allocations from io_rsrc_node_alloc() can happen.
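
Sketched as pseudo-code, the ordering constraint being described (the
function names follow the posted patches where visible, the rest is
only an illustration):

	/*
	 * roughly, in io_sqe_buffers_register():
	 *
	 *	allocate the table and its node cache
	 *	install it as ctx->buf_table                    <- must happen first
	 *	for each buffer:
	 *		node = io_rsrc_node_alloc(ctx, ...);    <- pulls from that cache
	 *		fill in the table entry with node
	 */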

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 3/5] ublk: zc register/unregister bvec
  2025-02-20 11:11   ` Pavel Begunkov
@ 2025-02-24 21:02     ` Keith Busch
  0 siblings, 0 replies; 28+ messages in thread
From: Keith Busch @ 2025-02-24 21:02 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Keith Busch, ming.lei, axboe, linux-block, io-uring, bernd,
	csander

On Thu, Feb 20, 2025 at 11:11:59AM +0000, Pavel Begunkov wrote:
> On 2/18/25 22:42, Keith Busch wrote:
> > +static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
> > +				  struct ublk_queue *ubq, int tag,
> > +				  const struct ublksrv_io_cmd *ub_cmd,
> > +				  unsigned int issue_flags)
> > +{
> > +	struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
> > +	struct ublk_device *ub = cmd->file->private_data;
> > +	int index = (int)ub_cmd->addr;
> > +	struct ublk_rq_data *data;
> > +	struct request *req;
> > +
> > +	if (!ub)
> > +		return -EPERM;
> > +
> > +	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> 
> Shouldn't there be some speculation sanitisation for the tag as well?
> It looks like a user-passed value directly indexing an array.

There are no other array speculation defenses here, so it looks like a
pre-existing issue. I'll send something to address that separately from
this series.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 5/5] io_uring: cache nodes and mapped buffers
  2025-02-20 16:06       ` Pavel Begunkov
@ 2025-02-24 21:04         ` Keith Busch
  2025-02-25 13:06           ` Pavel Begunkov
  0 siblings, 1 reply; 28+ messages in thread
From: Keith Busch @ 2025-02-24 21:04 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Keith Busch, ming.lei, axboe, linux-block, io-uring, bernd,
	csander

On Thu, Feb 20, 2025 at 04:06:21PM +0000, Pavel Begunkov wrote:
> On 2/20/25 15:24, Keith Busch wrote:
> > > > +		node = io_cache_alloc(&ctx->buf_table.node_cache, GFP_KERNEL);
> > > 
> > > That's why node allocators shouldn't be a part of the buffer table.
> > 
> > Are you saying you want file nodes to also subscribe to the cache? The
> 
> Yes, but it might be easier for you to focus on finalising the essential
> parts, and then we can improve later.
> 
> > two tables can be resized independently of each other, we don't know how
> > many elements the cache needs to hold.
> 
> I wouldn't try to correlate table sizes with desired cache sizes,
> users can have quite different patterns like allocating a barely used
> huge table. And you care about the speed of node change, which at
> extremes is rather limited by CPU and performance and not spatiality
> of the table. And you can also reallocate it as well.

Having the cache size and lifetime match a table that it's providing
seems as simple as I can make this. This is still an optimization at the
end of the day, so it's not strictly necessary to take the last two
patches from this series to make zero copy work if you don't want to
include it from the beginning.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 5/5] io_uring: cache nodes and mapped buffers
  2025-02-24 21:01     ` Keith Busch
@ 2025-02-24 21:39       ` Caleb Sander Mateos
  0 siblings, 0 replies; 28+ messages in thread
From: Caleb Sander Mateos @ 2025-02-24 21:39 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, ming.lei, asml.silence, axboe, linux-block, io-uring,
	bernd

On Mon, Feb 24, 2025 at 1:01 PM Keith Busch <[email protected]> wrote:
>
> On Tue, Feb 18, 2025 at 08:22:36PM -0800, Caleb Sander Mateos wrote:
> > > +struct io_alloc_cache {
> > > +       void                    **entries;
> > > +       unsigned int            nr_cached;
> > > +       unsigned int            max_cached;
> > > +       size_t                  elem_size;
> >
> > Is growing this field from unsigned to size_t really necessary? It
> > probably doesn't make sense to be caching allocations > 4 GB.
>
> It used to be a size_t when I initially moved the struct to here, but
> it's not anymore, so I'm out of sync. I'll fix it up.
>
> > > @@ -859,10 +924,8 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
> > >                         }
> > >                         node->tag = tag;
> > >                 }
> > > -               data.nodes[i] = node;
> > > +               table.data.nodes[i] = node;
> > >         }
> > > -
> > > -       ctx->buf_table.data = data;
> >
> > Still don't see the need to move this assignment. Is there a reason
> > you prefer setting ctx->buf_table before initializing its nodes? I
> > find the existing code easier to follow, where the table is moved to
> > ctx->buf_table after filling it in. It's also consistent with
> > io_clone_buffers().
>
> Yes, it needs to move to earlier. The ctx buf_table needs to be set
> before any allocations from io_rsrc_node_alloc() can happen.

Got it, thank you for explaining.

Best,
Caleb

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHv4 5/5] io_uring: cache nodes and mapped buffers
  2025-02-24 21:04         ` Keith Busch
@ 2025-02-25 13:06           ` Pavel Begunkov
  0 siblings, 0 replies; 28+ messages in thread
From: Pavel Begunkov @ 2025-02-25 13:06 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, ming.lei, axboe, linux-block, io-uring, bernd,
	csander

On 2/24/25 21:04, Keith Busch wrote:
> On Thu, Feb 20, 2025 at 04:06:21PM +0000, Pavel Begunkov wrote:
>> On 2/20/25 15:24, Keith Busch wrote:
>>>>> +		node = io_cache_alloc(&ctx->buf_table.node_cache, GFP_KERNEL);
>>>>
>>>> That's why node allocators shouldn't be a part of the buffer table.
>>>
>>> Are you saying you want file nodes to also subscribe to the cache? The
>>
>> Yes, but it might be easier for you to focus on finalising the essential
>> parts, and then we can improve later.
>>
>>> two tables can be resized independently of each other, we don't know how
>>> many elements the cache needs to hold.
>>
>> I wouldn't try to correlate table sizes with desired cache sizes;
>> users can have quite different patterns, like allocating a barely used
>> huge table. And what you care about is the speed of changing nodes,
>> which at the extremes is limited by CPU performance rather than by the
>> size of the table. And you can reallocate it as well.
> 
> Having the cache size and lifetime match a table that it's providing

You don't need to align it with the table size. I actually
overlooked that; I'll comment on v5.

> seems as simple as I can make this. This is still an optimization at the
> end of the day, so it's not strictly necessary to take the last two
> patches from this series to make zero copy work if you don't want to
> include it from the beginning.

Ok. As I mentioned above, I do think it's wrong for nodes, but
we can merge the patch and rework on top.

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2025-02-25 13:05 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-18 22:42 [PATCHv4 0/5] ublk zero-copy support Keith Busch
2025-02-18 22:42 ` [PATCHv4 1/5] io_uring: move fixed buffer import to issue path Keith Busch
2025-02-19  1:27   ` Caleb Sander Mateos
2025-02-19  4:23   ` Ming Lei
2025-02-19 16:48   ` Pavel Begunkov
2025-02-19 17:15     ` Pavel Begunkov
2025-02-20  1:25     ` Keith Busch
2025-02-20 10:12       ` Pavel Begunkov
2025-02-18 22:42 ` [PATCHv4 2/5] io_uring: add support for kernel registered bvecs Keith Busch
2025-02-19  1:54   ` Caleb Sander Mateos
2025-02-19 17:23     ` Pavel Begunkov
2025-02-20 10:31     ` Pavel Begunkov
2025-02-20 10:38   ` Pavel Begunkov
2025-02-18 22:42 ` [PATCHv4 3/5] ublk: zc register/unregister bvec Keith Busch
2025-02-19  2:36   ` Caleb Sander Mateos
2025-02-20 11:11   ` Pavel Begunkov
2025-02-24 21:02     ` Keith Busch
2025-02-18 22:42 ` [PATCHv4 4/5] io_uring: add abstraction for buf_table rsrc data Keith Busch
2025-02-19  3:04   ` Caleb Sander Mateos
2025-02-18 22:42 ` [PATCHv4 5/5] io_uring: cache nodes and mapped buffers Keith Busch
2025-02-19  4:22   ` Caleb Sander Mateos
2025-02-24 21:01     ` Keith Busch
2025-02-24 21:39       ` Caleb Sander Mateos
2025-02-20 11:08   ` Pavel Begunkov
2025-02-20 15:24     ` Keith Busch
2025-02-20 16:06       ` Pavel Begunkov
2025-02-24 21:04         ` Keith Busch
2025-02-25 13:06           ` Pavel Begunkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox