* [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD
@ 2023-03-01 14:05 Ming Lei
  2023-03-01 14:06 ` [RFC PATCH 01/12] io_uring: increase io_kiocb->flags into 64bit Ming Lei
                   ` (12 more replies)
  0 siblings, 13 replies; 17+ messages in thread
From: Ming Lei @ 2023-03-01 14:05 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Ming Lei

Hello,

Add IORING_OP_FUSED_CMD, a special URING_CMD that requires SQE128.
The 1st SQE (master) is a 64-byte URING_CMD, and the 2nd 64-byte SQE
(slave) is another normal 64-byte OP. Any OP that is to be used as a
slave OP needs io_issue_defs[op].fused_slave set to 1, and its ->issue()
can retrieve/import the buffer from the master request's fused_cmd_kbuf.
The slave OP is actually submitted from the kernel. Part of this idea
comes from Xiaoguang's ublk ebpf patchset, but this patchset submits the
slave OP just like a normal OP issued from userspace; that is, SQE order
is kept and batched handling is done too.

Please see the detailed design in the commit log of the 7th patch; one
big point is how buffer ownership is handled.

This approach makes it easy to support zero copy for ublk/fuse devices.

Basically, userspace can specify any sub-buffer of the ublk block request
buffer from the fused command just by setting 'offset/len' in the slave
SQE for running the slave OP. This is flexible enough to implement io
mappings such as mirror, striped, ...
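
As a rough illustration of the 'offset/len' idea, the sketch below splits
one request into per-stripe slices; each slice's (buf_offset, len) pair is
what a slave SQE could carry. The struct and function names are
hypothetical and are not taken from this patchset or from ubdsrv.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one slice of the request buffer. */
struct slice {
	unsigned int backend;	/* which backing device handles this slice */
	uint64_t buf_offset;	/* offset into the request buffer */
	uint64_t dev_offset;	/* offset on the backing device */
	uint32_t len;		/* slice length in bytes */
};

/*
 * Split the request [start, start + len) of a striped device into slices.
 * Returns the number of slices written, or -1 if 'max' is too small.
 */
static int stripe_split(uint64_t start, uint32_t len, uint32_t chunk,
			unsigned int nr_backends, struct slice *out, int max)
{
	uint32_t done = 0;
	int n = 0;

	assert(chunk > 0 && nr_backends > 0);

	while (done < len) {
		uint64_t pos = start + done;
		uint64_t chunk_idx = pos / chunk;
		uint32_t in_chunk = (uint32_t)(pos % chunk);
		uint32_t this_len = chunk - in_chunk;

		/* the last slice may end before the chunk boundary */
		if (this_len > len - done)
			this_len = len - done;
		if (n == max)
			return -1;

		out[n].backend = (unsigned int)(chunk_idx % nr_backends);
		out[n].buf_offset = done;
		out[n].dev_offset = (chunk_idx / nr_backends) * chunk + in_chunk;
		out[n].len = this_len;
		n++;
		done += this_len;
	}
	return n;
}
```

A mirror target would be simpler still under the same assumptions: every
backend gets the whole buffer, i.e. buf_offset 0 and the full length.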

The 8th & 9th patches enable fused slave support for the following OPs:

	OP_READ/OP_WRITE
	OP_SEND/OP_RECV/OP_SEND_ZC

The last 3 patches implement fused command support for the ublk driver.

The userspace code is here:

https://github.com/ming1/ubdsrv/tree/fused-cmd-zc

Both the loop and nbd ublk targets support zero copy when passing:

	ublk add -t [loop|nbd] -z .... 

Basic fs mount, kernel build and builtin tests have been run.

The performance improvement is obvious on memory-bandwidth-bound
workloads, for example a 1~2X improvement in 64K/512K BS IO tests
on loop with a ramfs backing file.

Any comments are welcome!


Ming Lei (12):
  io_uring: increase io_kiocb->flags into 64bit
  io_uring: define io_mapped_ubuf->acct_pages as unsigned integer
  io_uring: extend io_mapped_ubuf to cover external bvec table
  io_uring: rename io_mapped_ubuf as io_mapped_buf
  io_uring: export 'struct io_mapped_buf' for fused cmd buffer
  io_uring: add IO_URING_F_FUSED and prepare for supporting OP_FUSED_CMD
  io_uring: add IORING_OP_FUSED_CMD
  io_uring: support OP_READ/OP_WRITE for fused slave request
  io_uring: support OP_SEND_ZC/OP_RECV for fused slave request
  block: ublk_drv: mark device as LIVE before adding disk
  block: ublk_drv: add common exit handling
  block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy

 drivers/block/ublk_drv.c       | 189 ++++++++++++++++++++++++--
 drivers/char/mem.c             |   4 +
 drivers/nvme/host/ioctl.c      |   9 ++
 include/linux/io_uring.h       |  65 ++++++++-
 include/linux/io_uring_types.h |  26 +++-
 include/uapi/linux/io_uring.h  |   1 +
 include/uapi/linux/ublk_cmd.h  |   1 +
 io_uring/Makefile              |   2 +-
 io_uring/fdinfo.c              |   6 +-
 io_uring/fused_cmd.c           | 233 +++++++++++++++++++++++++++++++++
 io_uring/fused_cmd.h           |  11 ++
 io_uring/io_uring.c            |  24 +++-
 io_uring/io_uring.h            |   3 +
 io_uring/net.c                 |  23 +++-
 io_uring/opdef.c               |  17 +++
 io_uring/opdef.h               |   2 +
 io_uring/rsrc.c                |  31 ++---
 io_uring/rsrc.h                |  12 +-
 io_uring/rw.c                  |  20 +++
 19 files changed, 623 insertions(+), 56 deletions(-)
 create mode 100644 io_uring/fused_cmd.c
 create mode 100644 io_uring/fused_cmd.h

-- 
2.31.1



* [RFC PATCH 01/12] io_uring: increase io_kiocb->flags into 64bit
  2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
@ 2023-03-01 14:06 ` Ming Lei
  2023-03-01 14:06 ` [RFC PATCH 02/12] io_uring: define io_mapped_ubuf->acct_pages as unsigned integer Ming Lei
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2023-03-01 14:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Ming Lei

The 32bit io_kiocb->flags has been used up, so extend it to 64bit.
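
To illustrate why the wider type is needed, here is a small userspace
sketch. The enum and helper names are hypothetical stand-ins, not the
kernel's REQ_F_* definitions; the static_assert only mimics the spirit of
the BUILD_BUG_ON() updated below.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a request flag enum that grew past 32 bits. */
enum {
	DEMO_F_FIRST_BIT = 0,
	DEMO_F_NEW_BIT = 32,	/* first bit that no longer fits in 32 bits */
	DEMO_F_LAST_BIT,	/* not a real bit, only marks the end */
};

#define DEMO_BIT_ULL(nr)	(1ULL << (nr))

/* Compile-time guard: all flag bits must fit in the 64-bit flags word. */
static_assert(DEMO_F_LAST_BIT <= 8 * sizeof(uint64_t),
	      "flag bits overflow the 64-bit flags word");

/* Returns nonzero if 'bit' survives being stored in a 32-bit flags word. */
static int demo_fits_in_u32(unsigned int bit)
{
	uint32_t narrow = (uint32_t)DEMO_BIT_ULL(bit);

	return narrow != 0;
}
```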

Signed-off-by: Ming Lei <[email protected]>
---
 include/linux/io_uring_types.h | 2 +-
 io_uring/io_uring.c            | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 0efe4d784358..87342649d2c3 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -530,7 +530,7 @@ struct io_kiocb {
 	 * and after selection it points to the buffer ID itself.
 	 */
 	u16				buf_index;
-	unsigned int			flags;
+	u64				flags;
 
 	struct io_cqe			cqe;
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1df68da89f99..09cc5eaec4ab 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -4418,7 +4418,7 @@ static int __init io_uring_init(void)
 	BUILD_BUG_ON(SQE_COMMON_FLAGS >= (1 << 8));
 	BUILD_BUG_ON((SQE_VALID_FLAGS | SQE_COMMON_FLAGS) != SQE_VALID_FLAGS);
 
-	BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(int));
+	BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(u64));
 
 	BUILD_BUG_ON(sizeof(atomic_t) != sizeof(u32));
 
-- 
2.31.1



* [RFC PATCH 02/12] io_uring: define io_mapped_ubuf->acct_pages as unsigned integer
  2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
  2023-03-01 14:06 ` [RFC PATCH 01/12] io_uring: increase io_kiocb->flags into 64bit Ming Lei
@ 2023-03-01 14:06 ` Ming Lei
  2023-03-01 14:06 ` [RFC PATCH 03/12] io_uring: extend io_mapped_ubuf to cover external bvec table Ming Lei
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2023-03-01 14:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Ming Lei

An unsigned integer is enough (4G pages * 4k = 16TB) to hold nr_pages
in one io_mapped_ubuf.

This saves one word in io_mapped_ubuf.
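
The arithmetic and the size effect can be checked with a small userspace
sketch. It assumes 4 KiB pages as in the message above; the demo_* structs
are simplified copies with the flexible bvec array left out, and their
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT	12	/* assumes 4 KiB pages */

/* Bytes covered by 'nr_pages' pages. */
static uint64_t demo_pages_to_bytes(uint64_t nr_pages)
{
	return nr_pages << DEMO_PAGE_SHIFT;
}

/* Layout before the patch (flexible bvec array omitted). */
struct demo_imu_old {
	uint64_t	ubuf;
	uint64_t	ubuf_end;
	unsigned int	nr_bvecs;
	unsigned long	acct_pages;
};

/* Layout after the patch. */
struct demo_imu_new {
	uint64_t	ubuf;
	uint64_t	ubuf_end;
	unsigned int	nr_bvecs;
	unsigned int	acct_pages;
};
```

On a typical LP64 build the new layout is 24 bytes instead of 32, since
the padding after nr_bvecs disappears; exact sizes depend on the ABI.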

Signed-off-by: Ming Lei <[email protected]>
---
 io_uring/rsrc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 2b8743645efc..774aca20326c 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -49,7 +49,7 @@ struct io_mapped_ubuf {
 	u64		ubuf;
 	u64		ubuf_end;
 	unsigned int	nr_bvecs;
-	unsigned long	acct_pages;
+	unsigned int	acct_pages;
 	struct bio_vec	bvec[];
 };
 
-- 
2.31.1



* [RFC PATCH 03/12] io_uring: extend io_mapped_ubuf to cover external bvec table
  2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
  2023-03-01 14:06 ` [RFC PATCH 01/12] io_uring: increase io_kiocb->flags into 64bit Ming Lei
  2023-03-01 14:06 ` [RFC PATCH 02/12] io_uring: define io_mapped_ubuf->acct_pages as unsigned integer Ming Lei
@ 2023-03-01 14:06 ` Ming Lei
  2023-03-01 14:06 ` [RFC PATCH 04/12] io_uring: rename io_mapped_ubuf as io_mapped_buf Ming Lei
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2023-03-01 14:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Ming Lei

Extend io_mapped_ubuf to cover an external bvec table in order to
support the fused command kbuf, whose bvec table may come from an IO
request.
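
The pointer-plus-inline-table shape can be sketched in userspace as
follows. All demo_* names are hypothetical; struct demo_vec only stands in
for struct bio_vec, and the two helpers are illustrative rather than
functions from this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct bio_vec. */
struct demo_vec {
	void *page;
	unsigned int len;
	unsigned int offset;
};

/* Mirrors the shape in the patch: a pointer plus an inline table. */
struct demo_mapped_buf {
	unsigned int nr_vecs;
	struct demo_vec *vec;		/* table actually used by readers */
	struct demo_vec inline_vec[];	/* storage when the table is owned */
};

/* Allocate a buffer that owns its table; 'vec' points at inline storage. */
static struct demo_mapped_buf *demo_alloc_owned(unsigned int nr)
{
	struct demo_mapped_buf *b;

	b = calloc(1, sizeof(*b) + nr * sizeof(b->inline_vec[0]));
	if (!b)
		return NULL;
	b->nr_vecs = nr;
	b->vec = b->inline_vec;
	return b;
}

/* Describe a table owned by someone else; no inline storage is needed. */
static void demo_init_external(struct demo_mapped_buf *b,
			       struct demo_vec *table, unsigned int nr)
{
	b->nr_vecs = nr;
	b->vec = table;
}
```

Readers only ever go through 'vec', so they do not need to know which of
the two cases they are looking at.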

Signed-off-by: Ming Lei <[email protected]>
---
 io_uring/rsrc.c | 5 +++--
 io_uring/rsrc.h | 3 ++-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index a59fc02de598..c41edd197b0a 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1221,7 +1221,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 		goto done;
 	}
 
-	imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
+	imu = kvmalloc(struct_size(imu, __bvec, nr_pages), GFP_KERNEL);
 	if (!imu)
 		goto done;
 
@@ -1237,7 +1237,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 		size_t vec_len;
 
 		vec_len = min_t(size_t, size, PAGE_SIZE - off);
-		bvec_set_page(&imu->bvec[i], pages[i], vec_len, off);
+		bvec_set_page(&imu->__bvec[i], pages[i], vec_len, off);
 		off = 0;
 		size -= vec_len;
 	}
@@ -1245,6 +1245,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 	imu->ubuf = (unsigned long) iov->iov_base;
 	imu->ubuf_end = imu->ubuf + iov->iov_len;
 	imu->nr_bvecs = nr_pages;
+	imu->bvec = imu->__bvec;
 	*pimu = imu;
 	ret = 0;
 done:
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 774aca20326c..24329eca49ef 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -50,7 +50,8 @@ struct io_mapped_ubuf {
 	u64		ubuf_end;
 	unsigned int	nr_bvecs;
 	unsigned int	acct_pages;
-	struct bio_vec	bvec[];
+	struct bio_vec	*bvec;
+	struct bio_vec	__bvec[];
 };
 
 void io_rsrc_put_tw(struct callback_head *cb);
-- 
2.31.1



* [RFC PATCH 04/12] io_uring: rename io_mapped_ubuf as io_mapped_buf
  2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
                   ` (2 preceding siblings ...)
  2023-03-01 14:06 ` [RFC PATCH 03/12] io_uring: extend io_mapped_ubuf to cover external bvec table Ming Lei
@ 2023-03-01 14:06 ` Ming Lei
  2023-03-01 14:06 ` [RFC PATCH 05/12] io_uring: export 'struct io_mapped_buf' for fused cmd buffer Ming Lei
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2023-03-01 14:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Ming Lei

Prepare to reuse io_mapped_ubuf for feeding the fused command
kbuf (a bvec-based buffer) to io_uring OPs.

Meanwhile, rename ->ubuf as ->buf and ->ubuf_end as ->buf_end;
both are only used for figuring out the buffer offset & length.
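
A minimal sketch of how the two fields are used, modelled loosely on the
range check in io_import_fixed(). The struct and function names are
hypothetical, and the bvec iteration itself is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced view of the mapped buffer: only the range fields. */
struct demo_range {
	uint64_t buf;
	uint64_t buf_end;
};

/*
 * Validate [addr, addr + len) against the mapped range and return the
 * offset into the bvec table via 'offset'. Returns 0 or -EFAULT.
 */
static int demo_range_to_offset(const struct demo_range *r, uint64_t addr,
				size_t len, uint64_t *offset)
{
	uint64_t end = addr + (uint64_t)len;

	if (end < addr)			/* address arithmetic wrapped */
		return -EFAULT;
	if (addr < r->buf || end > r->buf_end)	/* outside the mapped region */
		return -EFAULT;

	*offset = addr - r->buf;
	return 0;
}
```

Because only the difference addr - buf matters, the same check works for a
range that starts at 0, with no userspace virtual address behind it.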

Signed-off-by: Ming Lei <[email protected]>
---
 include/linux/io_uring_types.h |  6 +++---
 io_uring/fdinfo.c              |  6 +++---
 io_uring/io_uring.c            |  2 +-
 io_uring/rsrc.c                | 26 +++++++++++++-------------
 io_uring/rsrc.h                | 10 +++++-----
 5 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 87342649d2c3..7a27b1d3e2ea 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -244,7 +244,7 @@ struct io_ring_ctx {
 		struct io_file_table	file_table;
 		unsigned		nr_user_files;
 		unsigned		nr_user_bufs;
-		struct io_mapped_ubuf	**user_bufs;
+		struct io_mapped_buf	**user_bufs;
 
 		struct io_submit_state	submit_state;
 
@@ -326,7 +326,7 @@ struct io_ring_ctx {
 
 	/* slow path rsrc auxilary data, used by update/register */
 	struct io_rsrc_node		*rsrc_backup_node;
-	struct io_mapped_ubuf		*dummy_ubuf;
+	struct io_mapped_buf		*dummy_ubuf;
 	struct io_rsrc_data		*file_data;
 	struct io_rsrc_data		*buf_data;
 
@@ -541,7 +541,7 @@ struct io_kiocb {
 
 	union {
 		/* store used ubuf, so we can prevent reloading */
-		struct io_mapped_ubuf	*imu;
+		struct io_mapped_buf	*imu;
 
 		/* stores selected buf, valid IFF REQ_F_BUFFER_SELECTED is set */
 		struct io_buffer	*kbuf;
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index 882bd56b01ed..2f663a795411 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -157,10 +157,10 @@ static __cold void __io_uring_show_fdinfo(struct io_ring_ctx *ctx,
 	}
 	seq_printf(m, "UserBufs:\t%u\n", ctx->nr_user_bufs);
 	for (i = 0; has_lock && i < ctx->nr_user_bufs; i++) {
-		struct io_mapped_ubuf *buf = ctx->user_bufs[i];
-		unsigned int len = buf->ubuf_end - buf->ubuf;
+		struct io_mapped_buf *buf = ctx->user_bufs[i];
+		unsigned int len = buf->buf_end - buf->buf;
 
-		seq_printf(m, "%5u: 0x%llx/%u\n", i, buf->ubuf, len);
+		seq_printf(m, "%5u: 0x%llx/%u\n", i, buf->buf, len);
 	}
 	if (has_lock && !xa_empty(&ctx->personalities)) {
 		unsigned long index;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 09cc5eaec4ab..3df66fddda5a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -298,7 +298,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	if (!ctx->dummy_ubuf)
 		goto err;
 	/* set invalid range, so io_import_fixed() fails meeting it */
-	ctx->dummy_ubuf->ubuf = -1UL;
+	ctx->dummy_ubuf->buf = -1UL;
 
 	if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free,
 			    0, GFP_KERNEL))
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index c41edd197b0a..26c07b28e8bb 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -24,7 +24,7 @@ struct io_rsrc_update {
 };
 
 static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
-				  struct io_mapped_ubuf **pimu,
+				  struct io_mapped_buf **pimu,
 				  struct page **last_hpage);
 
 #define IO_RSRC_REF_BATCH	100
@@ -136,9 +136,9 @@ static int io_buffer_validate(struct iovec *iov)
 	return 0;
 }
 
-static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf **slot)
+static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_buf **slot)
 {
-	struct io_mapped_ubuf *imu = *slot;
+	struct io_mapped_buf *imu = *slot;
 	unsigned int i;
 
 	if (imu != ctx->dummy_ubuf) {
@@ -542,7 +542,7 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 		return -EINVAL;
 
 	for (done = 0; done < nr_args; done++) {
-		struct io_mapped_ubuf *imu;
+		struct io_mapped_buf *imu;
 		int offset = up->offset + done;
 		u64 tag = 0;
 
@@ -1092,7 +1092,7 @@ static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
 
 	/* check previously registered pages */
 	for (i = 0; i < ctx->nr_user_bufs; i++) {
-		struct io_mapped_ubuf *imu = ctx->user_bufs[i];
+		struct io_mapped_buf *imu = ctx->user_bufs[i];
 
 		for (j = 0; j < imu->nr_bvecs; j++) {
 			if (!PageCompound(imu->bvec[j].bv_page))
@@ -1106,7 +1106,7 @@ static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
 }
 
 static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
-				 int nr_pages, struct io_mapped_ubuf *imu,
+				 int nr_pages, struct io_mapped_buf *imu,
 				 struct page **last_hpage)
 {
 	int i, ret;
@@ -1199,10 +1199,10 @@ struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages)
 }
 
 static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
-				  struct io_mapped_ubuf **pimu,
+				  struct io_mapped_buf **pimu,
 				  struct page **last_hpage)
 {
-	struct io_mapped_ubuf *imu = NULL;
+	struct io_mapped_buf *imu = NULL;
 	struct page **pages = NULL;
 	unsigned long off;
 	size_t size;
@@ -1242,8 +1242,8 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 		size -= vec_len;
 	}
 	/* store original address for later verification */
-	imu->ubuf = (unsigned long) iov->iov_base;
-	imu->ubuf_end = imu->ubuf + iov->iov_len;
+	imu->buf = (unsigned long) iov->iov_base;
+	imu->buf_end = imu->buf + iov->iov_len;
 	imu->nr_bvecs = nr_pages;
 	imu->bvec = imu->__bvec;
 	*pimu = imu;
@@ -1321,7 +1321,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 }
 
 int io_import_fixed(int ddir, struct iov_iter *iter,
-			   struct io_mapped_ubuf *imu,
+			   struct io_mapped_buf *imu,
 			   u64 buf_addr, size_t len)
 {
 	u64 buf_end;
@@ -1332,14 +1332,14 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
 	if (unlikely(check_add_overflow(buf_addr, (u64)len, &buf_end)))
 		return -EFAULT;
 	/* not inside the mapped region */
-	if (unlikely(buf_addr < imu->ubuf || buf_end > imu->ubuf_end))
+	if (unlikely(buf_addr < imu->buf || buf_end > imu->buf_end))
 		return -EFAULT;
 
 	/*
 	 * May not be a start of buffer, set size appropriately
 	 * and advance us to the beginning.
 	 */
-	offset = buf_addr - imu->ubuf;
+	offset = buf_addr - imu->buf;
 	iov_iter_bvec(iter, ddir, imu->bvec, imu->nr_bvecs, offset + len);
 
 	if (offset) {
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 24329eca49ef..5da54702cad1 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -19,7 +19,7 @@ struct io_rsrc_put {
 	union {
 		void *rsrc;
 		struct file *file;
-		struct io_mapped_ubuf *buf;
+		struct io_mapped_buf *buf;
 	};
 };
 
@@ -45,9 +45,9 @@ struct io_rsrc_node {
 	bool				done;
 };
 
-struct io_mapped_ubuf {
-	u64		ubuf;
-	u64		ubuf_end;
+struct io_mapped_buf {
+	u64		buf;
+	u64		buf_end;
 	unsigned int	nr_bvecs;
 	unsigned int	acct_pages;
 	struct bio_vec	*bvec;
@@ -67,7 +67,7 @@ void io_rsrc_node_switch(struct io_ring_ctx *ctx,
 			 struct io_rsrc_data *data_to_kill);
 
 int io_import_fixed(int ddir, struct iov_iter *iter,
-			   struct io_mapped_ubuf *imu,
+			   struct io_mapped_buf *imu,
 			   u64 buf_addr, size_t len);
 
 void __io_sqe_buffers_unregister(struct io_ring_ctx *ctx);
-- 
2.31.1



* [RFC PATCH 05/12] io_uring: export 'struct io_mapped_buf' for fused cmd buffer
  2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
                   ` (3 preceding siblings ...)
  2023-03-01 14:06 ` [RFC PATCH 04/12] io_uring: rename io_mapped_ubuf as io_mapped_buf Ming Lei
@ 2023-03-01 14:06 ` Ming Lei
  2023-03-01 14:06 ` [RFC PATCH 06/12] io_uring: add IO_URING_F_FUSED and prepare for supporting OP_FUSED_CMD Ming Lei
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2023-03-01 14:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Ming Lei

Export 'struct io_mapped_buf' for the coming fused cmd buffer,
which is bvec-based too.

An instance of it is supposed to be immutable for its whole lifetime.

Signed-off-by: Ming Lei <[email protected]>
---
 include/linux/io_uring.h | 19 +++++++++++++++++++
 io_uring/rsrc.h          |  9 ---------
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 934e5dd4ccc0..88205ea566d3 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -4,6 +4,7 @@
 
 #include <linux/sched.h>
 #include <linux/xarray.h>
+#include <linux/bvec.h>
 #include <uapi/linux/io_uring.h>
 
 enum io_uring_cmd_flags {
@@ -36,6 +37,24 @@ struct io_uring_cmd {
 	u8		pdu[32]; /* available inline for free use */
 };
 
+/* The mapper buffer is supposed to be immutable */
+struct io_mapped_buf {
+	u64		buf;
+	u64		buf_end;
+	unsigned int	nr_bvecs;
+	union {
+		unsigned int	acct_pages;
+
+		/*
+		 * offset into the bvecs, use for external user; with
+		 * 'offset', immutable bvecs can be provided for io_uring
+		 */
+		unsigned int	offset;
+	};
+	struct bio_vec	*bvec;
+	struct bio_vec	__bvec[];
+};
+
 #if defined(CONFIG_IO_URING)
 int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd);
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 5da54702cad1..4bd17877d53a 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -45,15 +45,6 @@ struct io_rsrc_node {
 	bool				done;
 };
 
-struct io_mapped_buf {
-	u64		buf;
-	u64		buf_end;
-	unsigned int	nr_bvecs;
-	unsigned int	acct_pages;
-	struct bio_vec	*bvec;
-	struct bio_vec	__bvec[];
-};
-
 void io_rsrc_put_tw(struct callback_head *cb);
 void io_rsrc_put_work(struct work_struct *work);
 void io_rsrc_refs_refill(struct io_ring_ctx *ctx);
-- 
2.31.1



* [RFC PATCH 06/12] io_uring: add IO_URING_F_FUSED and prepare for supporting OP_FUSED_CMD
  2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
                   ` (4 preceding siblings ...)
  2023-03-01 14:06 ` [RFC PATCH 05/12] io_uring: export 'struct io_mapped_buf' for fused cmd buffer Ming Lei
@ 2023-03-01 14:06 ` Ming Lei
  2023-03-01 14:06 ` [RFC PATCH 07/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2023-03-01 14:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Ming Lei

Add the flag IO_URING_F_FUSED and prepare for supporting
IORING_OP_FUSED_CMD, which is still one type of IORING_OP_URING_CMD, so
it is reasonable to reuse ->uring_cmd() for handling IORING_OP_FUSED_CMD.

Only IORING_OP_FUSED_CMD carries a 64-byte SQE as payload, which will be
handled by a slave request. The master uring command provides a kernel
buffer to the slave request via 'struct io_mapped_buf'.

Mark all existing drivers as not supporting IORING_OP_FUSED_CMD, given
that support depends on whether the driver is capable of handling the
slave request.
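
The opt-out pattern applied to each driver can be sketched in userspace as
below. The bit positions are copied from the enum added by this patch; the
DEMO_* names and the handler are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Bit positions copied from the enum added by this patch. */
#define DEMO_F_FUSED_WRITE	(1u << 11)	/* slave writes to buffer */
#define DEMO_F_FUSED_READ	(1u << 12)	/* slave reads from buffer */
#define DEMO_F_FUSED		(DEMO_F_FUSED_WRITE | DEMO_F_FUSED_READ)

/* Hypothetical ->uring_cmd() handler of a driver without fused support. */
static int demo_uring_cmd(unsigned int issue_flags)
{
	/* either direction bit marks the command as fused */
	if (issue_flags & DEMO_F_FUSED)
		return -EOPNOTSUPP;

	return 0;	/* normal uring_cmd handling would go here */
}
```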

Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c  | 6 ++++++
 drivers/char/mem.c        | 4 ++++
 drivers/nvme/host/ioctl.c | 9 +++++++++
 include/linux/io_uring.h  | 7 +++++++
 4 files changed, 26 insertions(+)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index b9c759cef00e..c89ede1c9b22 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1274,6 +1274,9 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	if (!(issue_flags & IO_URING_F_SQE128))
 		goto out;
 
+	if (issue_flags & IO_URING_F_FUSED)
+		return -EOPNOTSUPP;
+
 	if (ub_cmd->q_id >= ub->dev_info.nr_hw_queues)
 		goto out;
 
@@ -2172,6 +2175,9 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
 	struct ublk_device *ub = NULL;
 	int ret = -EINVAL;
 
+	if (issue_flags & IO_URING_F_FUSED)
+		return -EOPNOTSUPP;
+
 	if (issue_flags & IO_URING_F_NONBLOCK)
 		return -EAGAIN;
 
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index ffb101d349f0..134ba6665194 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -30,6 +30,7 @@
 #include <linux/uio.h>
 #include <linux/uaccess.h>
 #include <linux/security.h>
+#include <linux/io_uring.h>
 
 #ifdef CONFIG_IA64
 # include <linux/efi.h>
@@ -482,6 +483,9 @@ static ssize_t splice_write_null(struct pipe_inode_info *pipe, struct file *out,
 
 static int uring_cmd_null(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
 {
+	if (issue_flags & IO_URING_F_FUSED)
+		return -EOPNOTSUPP;
+
 	return 0;
 }
 
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 723e7d5b778f..44a171bcaa90 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -773,6 +773,9 @@ int nvme_ns_chr_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
 	struct nvme_ns *ns = container_of(file_inode(ioucmd->file)->i_cdev,
 			struct nvme_ns, cdev);
 
+	if (issue_flags & IO_URING_F_FUSED)
+		return -EOPNOTSUPP;
+
 	return nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
 }
 
@@ -878,6 +881,9 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
 	struct nvme_ns *ns = nvme_find_path(head);
 	int ret = -EINVAL;
 
+	if (issue_flags & IO_URING_F_FUSED)
+		return -EOPNOTSUPP;
+
 	if (ns)
 		ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
 	srcu_read_unlock(&head->srcu, srcu_idx);
@@ -915,6 +921,9 @@ int nvme_dev_uring_cmd(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
 	struct nvme_ctrl *ctrl = ioucmd->file->private_data;
 	int ret;
 
+	if (issue_flags & IO_URING_F_FUSED)
+		return -EOPNOTSUPP;
+
 	/* IOPOLL not supported yet */
 	if (issue_flags & IO_URING_F_IOPOLL)
 		return -EOPNOTSUPP;
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 88205ea566d3..2ccf91146c13 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -21,6 +21,13 @@ enum io_uring_cmd_flags {
 	IO_URING_F_SQE128		= (1 << 8),
 	IO_URING_F_CQE32		= (1 << 9),
 	IO_URING_F_IOPOLL		= (1 << 10),
+
+	/* for FUSED_CMD only */
+	IO_URING_F_FUSED_WRITE		= (1 << 11), /* slave writes to buffer */
+	IO_URING_F_FUSED_READ		= (1 << 12), /* slave reads from buffer */
+	/* driver incapable of FUSED_CMD should fail cmd when seeing F_FUSED */
+	IO_URING_F_FUSED		= IO_URING_F_FUSED_WRITE |
+		IO_URING_F_FUSED_READ,
 };
 
 struct io_uring_cmd {
-- 
2.31.1



* [RFC PATCH 07/12] io_uring: add IORING_OP_FUSED_CMD
  2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
                   ` (5 preceding siblings ...)
  2023-03-01 14:06 ` [RFC PATCH 06/12] io_uring: add IO_URING_F_FUSED and prepare for supporting OP_FUSED_CMD Ming Lei
@ 2023-03-01 14:06 ` Ming Lei
  2023-03-01 14:06 ` [RFC PATCH 08/12] io_uring: support OP_READ/OP_WRITE for fused slave request Ming Lei
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2023-03-01 14:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Ming Lei

Add IORING_OP_FUSED_CMD, a special URING_CMD that requires SQE128.
The 1st SQE (master) is a 64-byte URING_CMD, and the 2nd 64-byte SQE
(slave) is another normal 64-byte OP. Any OP that is to be used as a
slave OP must have io_issue_defs[op].fused_slave set to 1, and its
->issue() needs to retrieve the buffer from the master request's
fused_cmd_kbuf.
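
The SQE128 layout described above can be pictured with the following
userspace sketch. struct demo_sqe is a hypothetical 64-byte stand-in that
models only the opcode field; it is not struct io_uring_sqe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-byte SQE stand-in; only the opcode field is modelled. */
struct demo_sqe {
	uint8_t opcode;
	uint8_t pad[63];
};

/* With SQE128, each ring slot holds two 64-byte SQEs back to back. */
struct demo_sqe128_slot {
	struct demo_sqe sqe[2];	/* [0] = master (fused cmd), [1] = slave OP */
};

static_assert(sizeof(struct demo_sqe) == 64, "SQE must be 64 bytes");
static_assert(sizeof(struct demo_sqe128_slot) == 128, "slot must be 128 bytes");

/* The slave SQE sits right after the master, like 'sqe + 1' in the prep code. */
static const struct demo_sqe *demo_slave_sqe(const struct demo_sqe *master)
{
	return master + 1;
}
```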

The key points of the design/implementation are:

1) The master uring command produces an immutable command buffer
(struct io_mapped_buf) and provides it to the slave request, and the
slave OP can retrieve any part of this buffer via sqe->addr and sqe->len.

2) The master command is always completed after the slave request is
completed.

- Before the slave request is submitted, buffer ownership is
transferred to the slave request. After the slave request is completed,
buffer ownership is returned to the master request.

- This also guarantees correct SQE order, since the master request
uses the slave request's LINK flag.

3) The master request is always completed by the driver, so the driver
knows when the slave request is done with the buffer.

The motivation is to support zero copy for fuse/ublk, in which the
device holds the IO request buffer and IO handling is often a normal
IO OP (fs, net, ..). With IORING_OP_FUSED_CMD, this kind of zero copy
can be implemented easily & reliably.
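
The ownership and completion ordering in points 2) and 3) can be modelled
as a tiny state machine. This is only a userspace sketch of the rules as
stated in this message; all names are hypothetical, and the real code
tracks this through the requests themselves rather than an explicit enum.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum demo_owner { DEMO_OWNER_MASTER, DEMO_OWNER_SLAVE };

struct demo_fused_cmd {
	enum demo_owner buf_owner;	/* who currently owns the buffer */
	bool slave_done;
	bool master_done;
};

/* Point 2: ownership moves to the slave before it is submitted. */
static void demo_submit_slave(struct demo_fused_cmd *f)
{
	assert(f->buf_owner == DEMO_OWNER_MASTER && !f->slave_done);
	f->buf_owner = DEMO_OWNER_SLAVE;
}

/* Point 2: on slave completion, ownership returns to the master. */
static void demo_complete_slave(struct demo_fused_cmd *f)
{
	assert(f->buf_owner == DEMO_OWNER_SLAVE);
	f->slave_done = true;
	f->buf_owner = DEMO_OWNER_MASTER;
}

/* Point 3: the driver completes the master only once the buffer is back. */
static int demo_complete_master(struct demo_fused_cmd *f)
{
	if (f->buf_owner != DEMO_OWNER_MASTER || !f->slave_done)
		return -EBUSY;
	f->master_done = true;
	return 0;
}
```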

Signed-off-by: Ming Lei <[email protected]>
---
 include/linux/io_uring.h       |  39 +++++-
 include/linux/io_uring_types.h |  18 +++
 include/uapi/linux/io_uring.h  |   1 +
 io_uring/Makefile              |   2 +-
 io_uring/fused_cmd.c           | 233 +++++++++++++++++++++++++++++++++
 io_uring/fused_cmd.h           |  11 ++
 io_uring/io_uring.c            |  20 ++-
 io_uring/io_uring.h            |   3 +
 io_uring/opdef.c               |  12 ++
 io_uring/opdef.h               |   2 +
 10 files changed, 335 insertions(+), 6 deletions(-)
 create mode 100644 io_uring/fused_cmd.c
 create mode 100644 io_uring/fused_cmd.h

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 2ccf91146c13..64552da503c0 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -30,6 +30,19 @@ enum io_uring_cmd_flags {
 		IO_URING_F_FUSED_READ,
 };
 
+union io_uring_fused_cmd_data {
+	/*
+	 * In case of slave request IOSQE_CQE_SKIP_SUCCESS, return slave
+	 * result via master command; otherwise we simply return success
+	 * if buffer is provided, and slave request will return its result
+	 * via its CQE
+	 */
+	s32 slave_res;
+
+	/* fused cmd private, driver do not touch it */
+	struct io_kiocb *__slave;
+};
+
 struct io_uring_cmd {
 	struct file	*file;
 	const void	*cmd;
@@ -41,11 +54,27 @@ struct io_uring_cmd {
 	};
 	u32		cmd_op;
 	u32		flags;
-	u8		pdu[32]; /* available inline for free use */
+
+	/* for fused command, the available pdu is a bit less */
+	union {
+		u8		pdu[32]; /* available inline for free use */
+		struct {
+			u8	pdu[24]; /* available inline for free use */
+			union io_uring_fused_cmd_data data;
+		} fused;
+	};
 };
 
 /* The mapper buffer is supposed to be immutable */
 struct io_mapped_buf {
+	/*
+	 * For kernel buffer without virtual address, buf is set as zero,
+	 * which is just fine given both buf/buf_end are just for
+	 * calculating iov iter offset/len and validating buffer.
+	 *
+	 * So slave OP has to fail request in case that the OP doesn't
+	 * support iov iter.
+	 */
 	u64		buf;
 	u64		buf_end;
 	unsigned int	nr_bvecs;
@@ -63,6 +92,9 @@ struct io_mapped_buf {
 };
 
 #if defined(CONFIG_IO_URING)
+void io_fused_cmd_provide_kbuf(struct io_uring_cmd *ioucmd, bool locked,
+		const struct io_mapped_buf *imu,
+		void (*complete_tw_cb)(struct io_uring_cmd *));
 int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd);
 void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2);
@@ -92,6 +124,11 @@ static inline void io_uring_free(struct task_struct *tsk)
 		__io_uring_free(tsk);
 }
 #else
+static inline void io_fused_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
+		bool locked, const struct io_mapped_buf *fused_cmd_kbuf,
+		void (*complete_tw_cb)(struct io_uring_cmd *))
+{
+}
 static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd)
 {
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 7a27b1d3e2ea..7d358fae65f5 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -401,6 +401,8 @@ enum {
 	/* keep async read/write and isreg together and in order */
 	REQ_F_SUPPORT_NOWAIT_BIT,
 	REQ_F_ISREG_BIT,
+	REQ_F_FUSED_MASTER_BIT,
+	REQ_F_FUSED_SLAVE_BIT,
 
 	/* not a real bit, just to check we're not overflowing the space */
 	__REQ_F_LAST_BIT,
@@ -470,6 +472,10 @@ enum {
 	REQ_F_CLEAR_POLLIN	= BIT(REQ_F_CLEAR_POLLIN_BIT),
 	/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
 	REQ_F_HASH_LOCKED	= BIT(REQ_F_HASH_LOCKED_BIT),
+	/* master request(uring cmd) in fused cmd */
+	REQ_F_FUSED_MASTER	= BIT(REQ_F_FUSED_MASTER_BIT),
+	/* slave request in fused cmd, won't be one uring cmd */
+	REQ_F_FUSED_SLAVE	= BIT(REQ_F_FUSED_SLAVE_BIT),
 };
 
 typedef void (*io_req_tw_func_t)(struct io_kiocb *req, bool *locked);
@@ -551,6 +557,18 @@ struct io_kiocb {
 		 * REQ_F_BUFFER_RING is set.
 		 */
 		struct io_buffer_list	*buf_list;
+
+		/*
+		 * store kernel (sub)buffer of fused master request which OP
+		 * is IORING_OP_FUSED_CMD
+		 */
+		const struct io_mapped_buf *fused_cmd_kbuf;
+
+		/*
+		 * store fused command master request for fuse slave request,
+		 * which uses fuse master's kernel buffer for handling this OP
+		 */
+		struct io_kiocb *fused_master_req;
 	};
 
 	union {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 709de6d4feb2..f07d005ee898 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -223,6 +223,7 @@ enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_FUSED_CMD,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 8cc8e5387a75..5301077e61c5 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -7,5 +7,5 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o xattr.o nop.o fs.o splice.o \
 					openclose.o uring_cmd.o epoll.o \
 					statx.o net.o msg_ring.o timeout.o \
 					sqpoll.o fdinfo.o tctx.o poll.o \
-					cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o
+					cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o fused_cmd.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
diff --git a/io_uring/fused_cmd.c b/io_uring/fused_cmd.c
new file mode 100644
index 000000000000..9c380b3275f8
--- /dev/null
+++ b/io_uring/fused_cmd.c
@@ -0,0 +1,233 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/io_uring.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "io_uring.h"
+#include "opdef.h"
+#include "rsrc.h"
+#include "uring_cmd.h"
+
+static bool io_fused_slave_valid(const struct io_uring_sqe *sqe, u8 op)
+{
+	unsigned int sqe_flags = READ_ONCE(sqe->flags);
+
+	if (op == IORING_OP_FUSED_CMD || op == IORING_OP_URING_CMD)
+		return false;
+
+	if (sqe_flags & REQ_F_BUFFER_SELECT)
+		return false;
+
+	if (!io_issue_defs[op].fused_slave)
+		return false;
+
+	return true;
+}
+
+static inline void io_fused_cmd_update_link_flags(struct io_kiocb *req,
+		const struct io_kiocb *slave)
+{
+	/*
+	 * We have to keep the slave SQE in order, so propagate the slave
+	 * request's link flags to the master: the master command isn't
+	 * completed until the slave request is done
+	 */
+	if (slave->flags & (REQ_F_LINK | REQ_F_HARDLINK))
+		req->flags |= REQ_F_LINK;
+}
+
+int io_fused_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+	__must_hold(&req->ctx->uring_lock)
+{
+	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	const struct io_uring_sqe *slave_sqe = sqe + 1;
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_kiocb *slave;
+	u8 slave_op;
+	int ret;
+
+	if (unlikely(!(ctx->flags & IORING_SETUP_SQE128)))
+		return -EINVAL;
+
+	if (unlikely(sqe->__pad1))
+		return -EINVAL;
+
+	ioucmd->flags = READ_ONCE(sqe->uring_cmd_flags);
+	if (unlikely(ioucmd->flags))
+		return -EINVAL;
+
+	slave_op = READ_ONCE(slave_sqe->opcode);
+	if (unlikely(!io_fused_slave_valid(slave_sqe, slave_op)))
+		return -EINVAL;
+
+	ioucmd->cmd = sqe->cmd;
+	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
+	req->fused_cmd_kbuf = NULL;
+
+	/* take one extra reference for the slave request */
+	io_get_task_refs(1);
+
+	ret = -ENOMEM;
+	if (unlikely(!io_alloc_req(ctx, &slave)))
+		goto fail;
+
+	ret = io_init_req(ctx, slave, slave_sqe, true);
+	if (unlikely(ret))
+		goto fail_free_req;
+
+	io_fused_cmd_update_link_flags(req, slave);
+
+	ioucmd->fused.data.__slave = slave;
+	req->flags |= REQ_F_FUSED_MASTER;
+
+	return 0;
+
+fail_free_req:
+	io_free_req(slave);
+fail:
+	current->io_uring->cached_refs += 1;
+	return ret;
+}
+
+static inline bool io_fused_slave_write_to_buf(u8 op)
+{
+	switch (op) {
+	case IORING_OP_READ:
+	case IORING_OP_READV:
+	case IORING_OP_READ_FIXED:
+	case IORING_OP_RECVMSG:
+	case IORING_OP_RECV:
+		return true;
+	default:
+		return false;
+	}
+}
+
+int io_fused_cmd(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	const struct io_kiocb *slave = ioucmd->fused.data.__slave;
+	int ret = -EINVAL;
+
+	/*
+	 * Pass buffer direction for driver to validate if the read/write
+	 * is legal
+	 */
+	if (io_fused_slave_write_to_buf(slave->opcode))
+		issue_flags |= IO_URING_F_FUSED_WRITE;
+	else
+		issue_flags |= IO_URING_F_FUSED_READ;
+
+	ret = io_uring_cmd(req, issue_flags);
+	if (ret != IOU_ISSUE_SKIP_COMPLETE)
+		io_free_req(ioucmd->fused.data.__slave);
+
+	return ret;
+}
+
+int io_import_kbuf_for_slave(u64 buf, unsigned int len, int rw,
+		struct iov_iter *iter, struct io_kiocb *slave)
+{
+	struct io_kiocb *req = slave->fused_master_req;
+	const struct io_mapped_buf *kbuf;
+	unsigned int offset;
+
+	if (unlikely(!(slave->flags & REQ_F_FUSED_SLAVE) || !req))
+		return -EINVAL;
+
+	if (unlikely(!req->fused_cmd_kbuf))
+		return -EINVAL;
+
+	/* req->fused_cmd_kbuf is immutable */
+	kbuf = req->fused_cmd_kbuf;
+	offset = kbuf->offset;
+
+	if (!kbuf->bvec)
+		return -EINVAL;
+
+	/* not inside the mapped region */
+	if (unlikely(buf < kbuf->buf || buf > kbuf->buf_end))
+		return -EFAULT;
+
+	if (unlikely(len > kbuf->buf_end - buf))
+		return -EFAULT;
+
+	/* don't use io_import_fixed which doesn't support multipage bvec */
+	offset += buf - kbuf->buf;
+	iov_iter_bvec(iter, rw, kbuf->bvec, kbuf->nr_bvecs, offset + len);
+
+	if (offset)
+		iov_iter_advance(iter, offset);
+
+	return 0;
+}
+
+/*
+ * Called when slave request is completed,
+ *
+ * Return back ownership of the fused_cmd kbuf to master request, and
+ * notify master request.
+ */
+void io_fused_cmd_return_kbuf(struct io_kiocb *slave)
+{
+	struct io_kiocb *req = slave->fused_master_req;
+	struct io_uring_cmd *ioucmd;
+
+	if (unlikely(!req || !(slave->flags & REQ_F_FUSED_SLAVE)))
+		return;
+
+	/* return back the buffer */
+	slave->fused_master_req = NULL;
+	ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	ioucmd->fused.data.__slave = NULL;
+
+	/* If slave OP skips CQE, return the result via master command */
+	if (slave->flags & REQ_F_CQE_SKIP)
+		ioucmd->fused.data.slave_res = slave->cqe.res;
+	else
+		ioucmd->fused.data.slave_res = 0;
+	io_uring_cmd_complete_in_task(ioucmd, ioucmd->task_work_cb);
+}
+
+/*
+ * This API needs to be called when master command has prepared
+ * FUSED_CMD buffer, and offset/len in ->fused.data is for retrieving
+ * sub-buffer in the command buffer, which is often figured out by
+ * command payload data.
+ *
+ * Master command is always completed after the slave request
+ * is completed, so driver has to set completion callback for
+ * getting notification.
+ *
+ * Ownership of the fused_cmd kbuf is transferred to slave request.
+ */
+void io_fused_cmd_provide_kbuf(struct io_uring_cmd *ioucmd, bool locked,
+		const struct io_mapped_buf *fused_cmd_kbuf,
+		void (*complete_tw_cb)(struct io_uring_cmd *))
+{
+	struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
+	struct io_kiocb *slave = ioucmd->fused.data.__slave;
+
+	/*
+	 * Once the fused slave request is completed, the driver will
+	 * be notified by callback of complete_tw_cb
+	 */
+	ioucmd->task_work_cb = complete_tw_cb;
+
+	/* now we get the buffer */
+	req->fused_cmd_kbuf = fused_cmd_kbuf;
+	slave->fused_master_req = req;
+
+	trace_io_uring_submit_sqe(slave, true);
+	if (locked)
+		io_req_task_submit(slave, &locked);
+	else
+		io_req_task_queue(slave);
+}
+EXPORT_SYMBOL_GPL(io_fused_cmd_provide_kbuf);
diff --git a/io_uring/fused_cmd.h b/io_uring/fused_cmd.h
new file mode 100644
index 000000000000..c11d9d8989a1
--- /dev/null
+++ b/io_uring/fused_cmd.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef IOU_FUSED_CMD_H
+#define IOU_FUSED_CMD_H
+
+int io_fused_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_fused_cmd(struct io_kiocb *req, unsigned int issue_flags);
+void io_fused_cmd_return_kbuf(struct io_kiocb *slave);
+int io_import_kbuf_for_slave(u64 buf, unsigned int len, int rw,
+		struct iov_iter *iter, struct io_kiocb *slave);
+
+#endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3df66fddda5a..d34ce82a4cc6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -91,6 +91,7 @@
 #include "cancel.h"
 #include "net.h"
 #include "notif.h"
+#include "fused_cmd.h"
 
 #include "timeout.h"
 #include "poll.h"
@@ -110,7 +111,7 @@
 
 #define IO_REQ_CLEAN_FLAGS (REQ_F_BUFFER_SELECTED | REQ_F_NEED_CLEANUP | \
 				REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS | \
-				REQ_F_ASYNC_DATA)
+				REQ_F_ASYNC_DATA | REQ_F_FUSED_SLAVE)
 
 #define IO_REQ_CLEAN_SLOW_FLAGS (REQ_F_REFCOUNT | REQ_F_LINK | REQ_F_HARDLINK |\
 				 IO_REQ_CLEAN_FLAGS)
@@ -964,6 +965,9 @@ static void __io_req_complete_post(struct io_kiocb *req)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
+	if (req->flags & REQ_F_FUSED_SLAVE)
+		io_fused_cmd_return_kbuf(req);
+
 	io_cq_lock(ctx);
 	if (!(req->flags & REQ_F_CQE_SKIP))
 		io_fill_cqe_req(ctx, req);
@@ -1848,6 +1852,8 @@ static void io_clean_op(struct io_kiocb *req)
 		spin_lock(&req->ctx->completion_lock);
 		io_put_kbuf_comp(req);
 		spin_unlock(&req->ctx->completion_lock);
+	} else if (req->flags & REQ_F_FUSED_SLAVE) {
+		io_fused_cmd_return_kbuf(req);
 	}
 
 	if (req->flags & REQ_F_NEED_CLEANUP) {
@@ -2156,8 +2162,8 @@ static void io_init_req_drain(struct io_kiocb *req)
 	}
 }
 
-static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
-		       const struct io_uring_sqe *sqe)
+int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
+		const struct io_uring_sqe *sqe, bool slave)
 	__must_hold(&ctx->uring_lock)
 {
 	const struct io_issue_def *def;
@@ -2210,6 +2216,12 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		}
 	}
 
+	if (slave) {
+		if (!def->fused_slave)
+			return -EINVAL;
+		req->flags |= REQ_F_FUSED_SLAVE;
+	}
+
 	if (!def->ioprio && sqe->ioprio)
 		return -EINVAL;
 	if (!def->iopoll && (ctx->flags & IORING_SETUP_IOPOLL))
@@ -2294,7 +2306,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	struct io_submit_link *link = &ctx->submit_state.link;
 	int ret;
 
-	ret = io_init_req(ctx, req, sqe);
+	ret = io_init_req(ctx, req, sqe, false);
 	if (unlikely(ret))
 		return io_submit_fail_init(sqe, req, ret);
 
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 2711865f1e19..a50c7e1f6e81 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -78,6 +78,9 @@ bool __io_alloc_req_refill(struct io_ring_ctx *ctx);
 bool io_match_task_safe(struct io_kiocb *head, struct task_struct *task,
 			bool cancel_all);
 
+int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
+		const struct io_uring_sqe *sqe, bool slave);
+
 #define io_lockdep_assert_cq_locked(ctx)				\
 	do {								\
 		if (ctx->flags & IORING_SETUP_IOPOLL) {			\
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index cca7c5b55208..63b90e8e65f8 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -33,6 +33,7 @@
 #include "poll.h"
 #include "cancel.h"
 #include "rw.h"
+#include "fused_cmd.h"
 
 static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
 {
@@ -428,6 +429,12 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_FUSED_CMD] = {
+		.needs_file		= 1,
+		.plug			= 1,
+		.prep			= io_fused_cmd_prep,
+		.issue			= io_fused_cmd,
+	},
 };
 
 
@@ -648,6 +655,11 @@ const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_FUSED_CMD] = {
+		.name			= "FUSED_CMD",
+		.async_size		= uring_cmd_pdu_size(1),
+		.prep_async		= io_uring_cmd_prep_async,
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/opdef.h b/io_uring/opdef.h
index c22c8696e749..306f6fc48ed4 100644
--- a/io_uring/opdef.h
+++ b/io_uring/opdef.h
@@ -29,6 +29,8 @@ struct io_issue_def {
 	unsigned		iopoll_queue : 1;
 	/* opcode specific path will handle ->async_data allocation if needed */
 	unsigned		manual_alloc : 1;
+	/* can be slave op of fused command */
+	unsigned		fused_slave : 1;
 
 	int (*issue)(struct io_kiocb *, unsigned int);
 	int (*prep)(struct io_kiocb *, const struct io_uring_sqe *);
-- 
2.31.1
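
For illustration, the buffer ownership handoff done by
io_fused_cmd_provide_kbuf() and io_fused_cmd_return_kbuf() can be modeled
in userspace roughly as below. This is only a minimal sketch under stated
assumptions: struct master, struct slave, provide_kbuf() and return_kbuf()
are hypothetical stand-ins, and task work, locking and refcounting are
left out. It mainly shows that returning the buffer is a no-op the second
time, which matters because the patch calls the return helper from both
__io_req_complete_post() and io_clean_op().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct master;

/* Hypothetical model of the slave request state used by the handoff. */
struct slave {
	struct master *master;	/* non-NULL while the slave owns the buffer */
	bool cqe_skip;		/* stand-in for REQ_F_CQE_SKIP */
	int res;		/* stand-in for slave->cqe.res */
};

/* Hypothetical model of the master (fused) command. */
struct master {
	const void *kbuf;	/* stand-in for fused_cmd_kbuf */
	struct slave *slave;	/* stand-in for fused.data.__slave */
	int slave_res;
	bool notified;		/* set where the completion callback would run */
};

/* The driver hands the buffer to the slave request. */
static void provide_kbuf(struct master *m, struct slave *s, const void *kbuf)
{
	m->kbuf = kbuf;
	s->master = m;
}

/* The slave gives the buffer back; a second call does nothing. */
static void return_kbuf(struct slave *s)
{
	struct master *m = s->master;

	if (!m)
		return;
	s->master = NULL;
	m->slave = NULL;
	/* the result travels via the master only if the slave posts no CQE */
	m->slave_res = s->cqe_skip ? s->res : 0;
	m->notified = true;
}
```

Clearing the back pointer before notifying the master is what makes the
second call harmless in this model.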

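The range validation in io_import_kbuf_for_slave() can be sketched as a
standalone helper, shown below. It is a hedged illustration rather than
the kernel code: struct kbuf_range and slave_range_to_offset() are
hypothetical names, only the fields needed for the check are modeled, and
the skip value is kept in 64 bits here, whereas the patch accumulates it
in an unsigned int.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the io_mapped_buf fields used by the check. */
struct kbuf_range {
	uint64_t buf;		/* mapped start address */
	uint64_t buf_end;	/* mapped end address (exclusive) */
	unsigned int offset;	/* byte offset into the first bvec */
};

/*
 * Validate [buf, buf + len) against the mapped range and compute how many
 * bytes to skip in the bvec array before the sub-buffer starts.
 * Returns 0 on success or -EFAULT if the range is outside the mapping.
 */
static int slave_range_to_offset(const struct kbuf_range *k, uint64_t buf,
				 unsigned int len, uint64_t *skip)
{
	/* start must lie inside the mapped region */
	if (buf < k->buf || buf > k->buf_end)
		return -EFAULT;

	/* length must not run past the end; written to avoid overflow */
	if (len > k->buf_end - buf)
		return -EFAULT;

	*skip = k->offset + (buf - k->buf);
	return 0;
}
```

Comparing len against buf_end - buf, instead of computing buf + len, keeps
the check safe when buf is close to the top of the address range.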


* [RFC PATCH 08/12] io_uring: support OP_READ/OP_WRITE for fused slave request
  2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
                   ` (6 preceding siblings ...)
  2023-03-01 14:06 ` [RFC PATCH 07/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
@ 2023-03-01 14:06 ` Ming Lei
  2023-03-01 14:06 ` [RFC PATCH 09/12] io_uring: support OP_SEND_ZC/OP_RECV " Ming Lei
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2023-03-01 14:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Ming Lei

Allow OP_READ/OP_WRITE to be used as fused slave requests; the buffer
is retrieved from the master request.

Once the slave request completes, the buffer is returned to the master
request.

Signed-off-by: Ming Lei <[email protected]>
---
 io_uring/opdef.c |  2 ++
 io_uring/rw.c    | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 63b90e8e65f8..f044629e5475 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -235,6 +235,7 @@ const struct io_issue_def io_issue_defs[] = {
 		.ioprio			= 1,
 		.iopoll			= 1,
 		.iopoll_queue		= 1,
+		.fused_slave		= 1,
 		.prep			= io_prep_rw,
 		.issue			= io_read,
 	},
@@ -248,6 +249,7 @@ const struct io_issue_def io_issue_defs[] = {
 		.ioprio			= 1,
 		.iopoll			= 1,
 		.iopoll_queue		= 1,
+		.fused_slave		= 1,
 		.prep			= io_prep_rw,
 		.issue			= io_write,
 	},
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 4c233910e200..36d31a943317 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -19,6 +19,7 @@
 #include "kbuf.h"
 #include "rsrc.h"
 #include "rw.h"
+#include "fused_cmd.h"
 
 struct io_rw {
 	/* NOTE: kiocb has the file as the first member, so don't do it here */
@@ -371,6 +372,17 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
 	size_t sqe_len;
 	ssize_t ret;
 
+	/*
+	 * A slave OP actually passes the buffer offset in sqe->addr, since
+	 * the fused cmd kbuf's mapped start address is zero.
+	 */
+	if (req->flags & REQ_F_FUSED_SLAVE) {
+		ret = io_import_kbuf_for_slave(rw->addr, rw->len, ddir, iter, req);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
 	if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
 		ret = io_import_fixed(ddir, iter, req->imu, rw->addr, rw->len);
 		if (ret)
@@ -428,11 +440,19 @@ static inline loff_t *io_kiocb_ppos(struct kiocb *kiocb)
  */
 static ssize_t loop_rw_iter(int ddir, struct io_rw *rw, struct iov_iter *iter)
 {
+	struct io_kiocb *req = cmd_to_io_kiocb(rw);
 	struct kiocb *kiocb = &rw->kiocb;
 	struct file *file = kiocb->ki_filp;
 	ssize_t ret = 0;
 	loff_t *ppos;
 
+	/*
+	 * A fused slave request has no user buffer, so ->read/->write
+	 * can't be supported
+	 */
+	if (req->flags & REQ_F_FUSED_SLAVE)
+		return -EOPNOTSUPP;
+
 	/*
 	 * Don't support polled IO through this interface, and we can't
 	 * support non-blocking either. For the latter, this just causes
-- 
2.31.1



* [RFC PATCH 09/12] io_uring: support OP_SEND_ZC/OP_RECV for fused slave request
  2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
                   ` (7 preceding siblings ...)
  2023-03-01 14:06 ` [RFC PATCH 08/12] io_uring: support OP_READ/OP_WRITE for fused slave request Ming Lei
@ 2023-03-01 14:06 ` Ming Lei
  2023-03-01 14:06 ` [RFC PATCH 10/12] block: ublk_drv: mark device as LIVE before adding disk Ming Lei
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2023-03-01 14:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Ming Lei

Allow OP_SEND/OP_SEND_ZC/OP_RECV to be used as fused slave requests; the
buffer is retrieved from the master request.

Once the slave request completes, the buffer is returned to the master
request.

Signed-off-by: Ming Lei <[email protected]>
---
 io_uring/net.c   | 23 +++++++++++++++++++++--
 io_uring/opdef.c |  3 +++
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index cbd4b725f58c..be5ae5ca823d 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -16,6 +16,7 @@
 #include "net.h"
 #include "notif.h"
 #include "rsrc.h"
+#include "fused_cmd.h"
 
 #if defined(CONFIG_NET)
 struct io_shutdown {
@@ -378,7 +379,11 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(!sock))
 		return -ENOTSOCK;
 
-	ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len, &msg.msg_iter);
+	if (!(req->flags & REQ_F_FUSED_SLAVE))
+		ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len, &msg.msg_iter);
+	else
+		ret = io_import_kbuf_for_slave((u64)sr->buf, sr->len,
+				ITER_SOURCE, &msg.msg_iter, req);
 	if (unlikely(ret))
 		return ret;
 
@@ -869,7 +874,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 		sr->buf = buf;
 	}
 
-	ret = import_ubuf(ITER_DEST, sr->buf, len, &msg.msg_iter);
+	if (!(req->flags & REQ_F_FUSED_SLAVE))
+		ret = import_ubuf(ITER_DEST, sr->buf, len, &msg.msg_iter);
+	else
+		ret = io_import_kbuf_for_slave((u64)sr->buf, sr->len, ITER_DEST,
+				&msg.msg_iter, req);
 	if (unlikely(ret))
 		goto out_free;
 
@@ -983,6 +992,9 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
 		unsigned idx = READ_ONCE(sqe->buf_index);
 
+		if (req->flags & REQ_F_FUSED_SLAVE)
+			return -EINVAL;
+
 		if (unlikely(idx >= ctx->nr_user_bufs))
 			return -EFAULT;
 		idx = array_index_nospec(idx, ctx->nr_user_bufs);
@@ -1119,8 +1131,15 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 		if (unlikely(ret))
 			return ret;
 		msg.sg_from_iter = io_sg_from_iter;
+	} else if (req->flags & REQ_F_FUSED_SLAVE) {
+		ret = io_import_kbuf_for_slave((u64)zc->buf, zc->len,
+				ITER_SOURCE, &msg.msg_iter, req);
+		if (unlikely(ret))
+			return ret;
+		msg.sg_from_iter = io_sg_from_iter;
 	} else {
 		io_notif_set_extended(zc->notif);
+
 		ret = import_ubuf(ITER_SOURCE, zc->buf, zc->len, &msg.msg_iter);
 		if (unlikely(ret))
 			return ret;
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index f044629e5475..0a9d39a9db16 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -271,6 +271,7 @@ const struct io_issue_def io_issue_defs[] = {
 		.audit_skip		= 1,
 		.ioprio			= 1,
 		.manual_alloc		= 1,
+		.fused_slave		= 1,
 #if defined(CONFIG_NET)
 		.prep			= io_sendmsg_prep,
 		.issue			= io_send,
@@ -285,6 +286,7 @@ const struct io_issue_def io_issue_defs[] = {
 		.buffer_select		= 1,
 		.audit_skip		= 1,
 		.ioprio			= 1,
+		.fused_slave		= 1,
 #if defined(CONFIG_NET)
 		.prep			= io_recvmsg_prep,
 		.issue			= io_recv,
@@ -411,6 +413,7 @@ const struct io_issue_def io_issue_defs[] = {
 		.audit_skip		= 1,
 		.ioprio			= 1,
 		.manual_alloc		= 1,
+		.fused_slave		= 1,
 #if defined(CONFIG_NET)
 		.prep			= io_send_zc_prep,
 		.issue			= io_send_zc,
-- 
2.31.1



* [RFC PATCH 10/12] block: ublk_drv: mark device as LIVE before adding disk
  2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
                   ` (8 preceding siblings ...)
  2023-03-01 14:06 ` [RFC PATCH 09/12] io_uring: support OP_SEND_ZC/OP_RECV " Ming Lei
@ 2023-03-01 14:06 ` Ming Lei
  2023-03-03  2:25   ` Ziyang Zhang
  2023-03-01 14:06 ` [RFC PATCH 11/12] block: ublk_drv: add common exit handling Ming Lei
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2023-03-01 14:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Ming Lei

IO can be started before add_disk() returns, e.g. to read the partition
table, so the monitor work has to be functional at that point to
guarantee forward progress.

So mark the device as LIVE before adding the disk, and change it to
DEAD if add_disk() fails.

Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index c89ede1c9b22..2497b91b48ba 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1608,17 +1608,18 @@ static int ublk_ctrl_start_dev(struct ublk_device *ub, struct io_uring_cmd *cmd)
 		set_bit(GD_SUPPRESS_PART_SCAN, &disk->state);
 
 	get_device(&ub->cdev_dev);
+	ub->dev_info.state = UBLK_S_DEV_LIVE;
 	ret = add_disk(disk);
 	if (ret) {
 		/*
 		 * Has to drop the reference since ->free_disk won't be
 		 * called in case of add_disk failure.
 		 */
+		ub->dev_info.state = UBLK_S_DEV_DEAD;
 		ublk_put_device(ub);
 		goto out_put_disk;
 	}
 	set_bit(UB_STATE_USED, &ub->state);
-	ub->dev_info.state = UBLK_S_DEV_LIVE;
 out_put_disk:
 	if (ret)
 		put_disk(disk);
-- 
2.31.1
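
The reordering can be pictured with a small userspace model, given below
as a sketch only. The enum, struct dev, start_dev() and the two
add_disk_*() stubs are hypothetical; the stubs record the state they
observe, standing in for IO (such as a partition scan) that may be issued
before add_disk() returns.

```c
#include <assert.h>

enum dev_state { DEV_DEAD, DEV_LIVE };	/* stand-ins for UBLK_S_DEV_* */

struct dev {
	enum dev_state state;
};

/* Hypothetical add_disk() signature: returns 0 or a negative error. */
typedef int (*add_disk_fn)(struct dev *d);

/* Mark the device LIVE first, and roll back to DEAD if adding fails. */
static int start_dev(struct dev *d, add_disk_fn add_disk)
{
	int ret;

	d->state = DEV_LIVE;	/* visible to IO issued inside add_disk() */
	ret = add_disk(d);
	if (ret)
		d->state = DEV_DEAD;
	return ret;
}

/* State observed from "inside" add_disk(), recorded by the stubs below. */
static enum dev_state seen_state;

static int add_disk_ok(struct dev *d)
{
	seen_state = d->state;
	return 0;
}

static int add_disk_fail(struct dev *d)
{
	seen_state = d->state;
	return -5;	/* arbitrary negative error code for the model */
}
```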



* [RFC PATCH 11/12] block: ublk_drv: add common exit handling
  2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
                   ` (9 preceding siblings ...)
  2023-03-01 14:06 ` [RFC PATCH 10/12] block: ublk_drv: mark device as LIVE before adding disk Ming Lei
@ 2023-03-01 14:06 ` Ming Lei
  2023-03-03  2:15   ` Ziyang Zhang
  2023-03-01 14:06 ` [RFC PATCH 12/12] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy Ming Lei
  2023-03-03  2:52 ` [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ziyang Zhang
  12 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2023-03-01 14:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Ming Lei

Simplify exit handling a bit, and prepare for supporting fused command.

Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 2497b91b48ba..b9e38ebabca7 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -655,14 +655,15 @@ static void ublk_complete_rq(struct request *req)
 	struct ublk_queue *ubq = req->mq_hctx->driver_data;
 	struct ublk_io *io = &ubq->ios[req->tag];
 	unsigned int unmapped_bytes;
+	int res = BLK_STS_OK;
 
 	/* failed read IO if nothing is read */
 	if (!io->res && req_op(req) == REQ_OP_READ)
 		io->res = -EIO;
 
 	if (io->res < 0) {
-		blk_mq_end_request(req, errno_to_blk_status(io->res));
-		return;
+		res = errno_to_blk_status(io->res);
+		goto exit;
 	}
 
 	/*
@@ -671,10 +672,8 @@ static void ublk_complete_rq(struct request *req)
 	 *
 	 * Both the two needn't unmap.
 	 */
-	if (req_op(req) != REQ_OP_READ && req_op(req) != REQ_OP_WRITE) {
-		blk_mq_end_request(req, BLK_STS_OK);
-		return;
-	}
+	if (req_op(req) != REQ_OP_READ && req_op(req) != REQ_OP_WRITE)
+		goto exit;
 
 	/* for READ request, writing data in iod->addr to rq buffers */
 	unmapped_bytes = ublk_unmap_io(ubq, req, io);
@@ -691,6 +690,10 @@ static void ublk_complete_rq(struct request *req)
 		blk_mq_requeue_request(req, true);
 	else
 		__blk_mq_end_request(req, BLK_STS_OK);
+
+	return;
+exit:
+	blk_mq_end_request(req, res);
 }
 
 /*
-- 
2.31.1



* [RFC PATCH 12/12] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy
  2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
                   ` (10 preceding siblings ...)
  2023-03-01 14:06 ` [RFC PATCH 11/12] block: ublk_drv: add common exit handling Ming Lei
@ 2023-03-01 14:06 ` Ming Lei
  2023-03-03  2:52 ` [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ziyang Zhang
  12 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2023-03-01 14:06 UTC (permalink / raw)
  To: Jens Axboe, io-uring
  Cc: linux-block, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Ming Lei

Apply the io_uring fused command to support zero copy:

1) init the fused cmd buffer (io_mapped_buf) in ublk_map_io() and
deinit it in ublk_unmap_io(). This buffer is immutable, so it is
fine to retrieve it from concurrent fused commands.

2) add the sub-command opcode UBLK_IO_FUSED_SUBMIT_IO for retrieving
this fused cmd (zero copy) buffer

3) call io_fused_cmd_provide_kbuf() to provide the buffer to the slave
request and, through the same API, set up the completion callback; once
the slave request completes, the callback is called to free the buffer
and complete the uring fused command

Todo: don't complete the ublk block request until all in-flight fused
commands targeting this request are completed. This change requires
cleaning up the current ublk driver a bit, so it is deferred to a future
post; it doesn't affect review of the overall approach.

Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c      | 167 ++++++++++++++++++++++++++++++++--
 include/uapi/linux/ublk_cmd.h |   1 +
 2 files changed, 160 insertions(+), 8 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index b9e38ebabca7..56a362798aa7 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -62,6 +62,8 @@
 struct ublk_rq_data {
 	struct llist_node node;
 	struct callback_head work;
+	bool allocated_bvec;
+	struct io_mapped_buf buf[0];
 };
 
 struct ublk_uring_cmd_pdu {
@@ -525,10 +527,87 @@ static inline int ublk_copy_user_pages(struct ublk_map_data *data,
 	return done;
 }
 
-static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
+/*
+ * The built command buffer is immutable, so it is fine to feed it to
+ * concurrent io_uring fused commands
+ */
+static int ublk_init_zero_copy_buffer(struct request *rq)
+{
+	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+	struct io_mapped_buf *imu = data->buf;
+	struct req_iterator rq_iter;
+	unsigned int nr_bvecs = 0;
+	struct bio_vec *bvec;
+	unsigned int offset;
+	struct bio_vec bv;
+
+	if (!ublk_rq_has_data(rq))
+		goto exit;
+
+	rq_for_each_bvec(bv, rq, rq_iter)
+		nr_bvecs++;
+
+	if (!nr_bvecs)
+		goto exit;
+
+	if (rq->bio != rq->biotail) {
+		int idx = 0;
+
+		bvec = kvmalloc_array(nr_bvecs, sizeof(struct bio_vec),
+				GFP_NOIO);
+		if (!bvec)
+			return -ENOMEM;
+
+		offset = 0;
+		rq_for_each_bvec(bv, rq, rq_iter)
+			bvec[idx++] = bv;
+		data->allocated_bvec = true;
+	} else {
+		struct bio *bio = rq->bio;
+
+		offset = bio->bi_iter.bi_bvec_done;
+		bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+	}
+	imu->bvec = bvec;
+	imu->nr_bvecs = nr_bvecs;
+	imu->offset = offset;
+	imu->buf = 0;
+	imu->buf_end = blk_rq_bytes(rq);
+
+	return 0;
+exit:
+	imu->bvec = NULL;
+	return 0;
+}
+
+static void ublk_deinit_zero_copy_buffer(struct request *rq)
+{
+	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+	struct io_mapped_buf *imu = data->buf;
+
+	if (data->allocated_bvec) {
+		kvfree(imu->bvec);
+		data->allocated_bvec = false;
+	}
+}
+
+static int ublk_map_io(const struct ublk_queue *ubq, struct request *req,
 		struct ublk_io *io)
 {
 	const unsigned int rq_bytes = blk_rq_bytes(req);
+
+	if (ubq->flags & UBLK_F_SUPPORT_ZERO_COPY) {
+		int ret = ublk_init_zero_copy_buffer(req);
+
+		/*
+		 * The only failure is -ENOMEM for allocating fused cmd
+		 * buffer, return zero so that we can requeue this req.
+		 */
+		if (unlikely(ret))
+			return 0;
+		return rq_bytes;
+	}
+
 	/*
 	 * no zero copy, we delay copy WRITE request data into ublksrv
 	 * context and the big benefit is that pinning pages in current
@@ -553,11 +632,17 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
 }
 
 static int ublk_unmap_io(const struct ublk_queue *ubq,
-		const struct request *req,
+		struct request *req,
 		struct ublk_io *io)
 {
 	const unsigned int rq_bytes = blk_rq_bytes(req);
 
+	if (ubq->flags & UBLK_F_SUPPORT_ZERO_COPY) {
+		ublk_deinit_zero_copy_buffer(req);
+
+		return rq_bytes;
+	}
+
 	if (req_op(req) == REQ_OP_READ && ublk_rq_has_data(req)) {
 		struct ublk_map_data data = {
 			.ubq	=	ubq,
@@ -693,6 +778,7 @@ static void ublk_complete_rq(struct request *req)
 
 	return;
 exit:
+	ublk_deinit_zero_copy_buffer(req);
 	blk_mq_end_request(req, res);
 }
 
@@ -1259,6 +1345,66 @@ static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
 	ublk_queue_cmd(ubq, req);
 }
 
+static inline bool ublk_check_fused_buf_dir(const struct request *req,
+		unsigned int flags)
+{
+	flags &= IO_URING_F_FUSED;
+
+	if (req_op(req) == REQ_OP_READ && flags == IO_URING_F_FUSED_WRITE)
+		return true;
+
+	if (req_op(req) == REQ_OP_WRITE && flags == IO_URING_F_FUSED_READ)
+		return true;
+
+	return false;
+}
+
+static void ublk_fused_cmd_done_cb(struct io_uring_cmd *cmd)
+{
+	io_uring_cmd_done(cmd, cmd->fused.data.slave_res, 0);
+}
+
+static int ublk_handle_fused_cmd(struct io_uring_cmd *cmd,
+		struct ublk_queue *ubq, int tag, unsigned int issue_flags)
+{
+	struct ublk_device *ub = cmd->file->private_data;
+	struct ublk_rq_data *data;
+	struct request *req;
+
+	if (!ub)
+		return -EPERM;
+
+	if (!(issue_flags & IO_URING_F_FUSED))
+		goto exit;
+
+	if (ub->dev_info.state == UBLK_S_DEV_DEAD)
+		goto exit;
+
+	if (!(ubq->flags & UBLK_F_SUPPORT_ZERO_COPY))
+		goto exit;
+
+	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+	if (!req || !blk_mq_request_started(req))
+		goto exit;
+
+	pr_devel("%s: qid %d tag %u request bytes %u, issue flags %x\n",
+			__func__, ubq->q_id, tag, blk_rq_bytes(req),
+			issue_flags);
+
+	if (!ublk_check_fused_buf_dir(req, issue_flags))
+		goto exit;
+
+	if (!ublk_rq_has_data(req))
+		goto exit;
+
+	data = blk_mq_rq_to_pdu(req);
+	io_fused_cmd_provide_kbuf(cmd, !(issue_flags & IO_URING_F_UNLOCKED),
+			data->buf, ublk_fused_cmd_done_cb);
+	return -EIOCBQUEUED;
+exit:
+	return -EINVAL;
+}
+
 static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 {
 	struct ublksrv_io_cmd *ub_cmd = (struct ublksrv_io_cmd *)cmd->cmd;
@@ -1277,7 +1423,8 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	if (!(issue_flags & IO_URING_F_SQE128))
 		goto out;
 
-	if (issue_flags & IO_URING_F_FUSED)
+	if ((issue_flags & IO_URING_F_FUSED) &&
+			cmd_op != UBLK_IO_FUSED_SUBMIT_IO)
 		return -EOPNOTSUPP;
 
 	if (ub_cmd->q_id >= ub->dev_info.nr_hw_queues)
@@ -1287,7 +1434,8 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	if (!ubq || ub_cmd->q_id != ubq->q_id)
 		goto out;
 
-	if (ubq->ubq_daemon && ubq->ubq_daemon != current)
+	if ((ubq->ubq_daemon && ubq->ubq_daemon != current) &&
+			(cmd_op != UBLK_IO_FUSED_SUBMIT_IO))
 		goto out;
 
 	if (tag >= ubq->q_depth)
@@ -1310,6 +1458,9 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 		goto out;
 
 	switch (cmd_op) {
+	case UBLK_IO_FUSED_SUBMIT_IO:
+		return ublk_handle_fused_cmd(cmd, ubq, tag, issue_flags);
+
 	case UBLK_IO_FETCH_REQ:
 		/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
 		if (ublk_queue_ready(ubq)) {
@@ -1533,11 +1684,14 @@ static void ublk_align_max_io_size(struct ublk_device *ub)
 
 static int ublk_add_tag_set(struct ublk_device *ub)
 {
+	int zc = !!(ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY);
+	struct ublk_rq_data *data;
+
 	ub->tag_set.ops = &ublk_mq_ops;
 	ub->tag_set.nr_hw_queues = ub->dev_info.nr_hw_queues;
 	ub->tag_set.queue_depth = ub->dev_info.queue_depth;
 	ub->tag_set.numa_node = NUMA_NO_NODE;
-	ub->tag_set.cmd_size = sizeof(struct ublk_rq_data);
+	ub->tag_set.cmd_size = struct_size(data, buf, zc);
 	ub->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	ub->tag_set.driver_data = ub;
 	return blk_mq_alloc_tag_set(&ub->tag_set);
@@ -1756,9 +1910,6 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 	if (!IS_BUILTIN(CONFIG_BLK_DEV_UBLK))
 		ub->dev_info.flags |= UBLK_F_URING_CMD_COMP_IN_TASK;
 
-	/* We are not ready to support zero copy */
-	ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
-
 	ub->dev_info.nr_hw_queues = min_t(unsigned int,
 			ub->dev_info.nr_hw_queues, nr_cpu_ids);
 	ublk_align_max_io_size(ub);
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index f6238ccc7800..027e60e49cc8 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -44,6 +44,7 @@
 #define	UBLK_IO_FETCH_REQ		0x20
 #define	UBLK_IO_COMMIT_AND_FETCH_REQ	0x21
 #define	UBLK_IO_NEED_GET_DATA	0x22
+#define	UBLK_IO_FUSED_SUBMIT_IO	0x23
 
 /* only ABORT means that no re-fetch */
 #define UBLK_IO_RES_OK			0
-- 
2.31.1
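
The direction rule enforced by ublk_check_fused_buf_dir() can be restated
as the small model below. It is a sketch under assumptions: enum blk_op
and enum fused_dir are hypothetical stand-ins for REQ_OP_* and the
IO_URING_F_FUSED_READ/WRITE issue flags, whose real values are defined
elsewhere in the series. The point is that a block READ must be served by
a slave OP that writes into the request buffer, and a block WRITE by one
that reads from it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the block request operations. */
enum blk_op { OP_READ, OP_WRITE, OP_FLUSH };

/* Hypothetical stand-ins for the fused buffer direction flags. */
enum fused_dir { FUSED_NONE, FUSED_WRITE, FUSED_READ };

/*
 * A READ request needs its buffer filled, so the slave OP must write to
 * it; a WRITE request carries data, so the slave OP must read from it.
 * Any other combination is rejected.
 */
static bool fused_dir_ok(enum blk_op op, enum fused_dir dir)
{
	if (op == OP_READ)
		return dir == FUSED_WRITE;
	if (op == OP_WRITE)
		return dir == FUSED_READ;
	return false;
}
```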


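The per-request data sizing in ublk_add_tag_set() relies on a trailing
array that holds either zero or one io_mapped_buf. A rough userspace
equivalent is sketched below; struct mapped_buf, struct rq_data and
rq_data_size() are hypothetical, and the size is computed by hand instead
of with the kernel's struct_size() helper (which additionally guards
against overflow).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct io_mapped_buf. */
struct mapped_buf {
	unsigned long buf;
	unsigned long buf_end;
};

/* Per-request data with an optional trailing buffer descriptor. */
struct rq_data {
	bool allocated_bvec;
	struct mapped_buf buf[];	/* zero or one element */
};

/* Bytes to reserve per request: header plus zc (0 or 1) descriptors. */
static size_t rq_data_size(int zc)
{
	return sizeof(struct rq_data) + (size_t)zc * sizeof(struct mapped_buf);
}
```

With zero copy disabled, no space is reserved for the descriptor, so
devices that don't use the feature pay nothing for it in this model.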

* Re: [RFC PATCH 11/12] block: ublk_drv: add common exit handling
  2023-03-01 14:06 ` [RFC PATCH 11/12] block: ublk_drv: add common exit handling Ming Lei
@ 2023-03-03  2:15   ` Ziyang Zhang
  0 siblings, 0 replies; 17+ messages in thread
From: Ziyang Zhang @ 2023-03-03  2:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Miklos Szeredi, io-uring, Xiaoguang Wang,
	Bernd Schubert, Jens Axboe

On 2023/3/1 22:06, Ming Lei wrote:
> Simplify exit handling a bit, and prepare for supporting fused command.
> 
> Signed-off-by: Ming Lei <[email protected]>
> ---

Reviewed-by: Ziyang Zhang <[email protected]>


* Re: [RFC PATCH 10/12] block: ublk_drv: mark device as LIVE before adding disk
  2023-03-01 14:06 ` [RFC PATCH 10/12] block: ublk_drv: mark device as LIVE before adding disk Ming Lei
@ 2023-03-03  2:25   ` Ziyang Zhang
  0 siblings, 0 replies; 17+ messages in thread
From: Ziyang Zhang @ 2023-03-03  2:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, Miklos Szeredi, io-uring, Jens Axboe, Xiaoguang Wang,
	Bernd Schubert

On 2023/3/1 22:06, Ming Lei wrote:
> IO can be started before add_disk() returns, such as reading the partition
> table, so the monitor work must already be running to make forward progress.
> 
> So mark the device as LIVE before adding the disk, and change it back to
> DEAD if add_disk() fails.
> 
> Signed-off-by: Ming Lei <[email protected]>
> ---

Reviewed-by: Ziyang Zhang <[email protected]>
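
The ordering described in the quoted commit message can be sketched roughly as follows. This is a simplified userspace model under stated assumptions, not the driver code: `dev_state`, `dev`, `fake_add_disk` and `start_dev` are hypothetical names, and `fake_add_disk` only stands in for add_disk() by refusing to make progress unless the device is already LIVE.

```c
#include <stdio.h>

/* Hypothetical device states modelled on the DEAD/LIVE wording above. */
enum dev_state { DEV_DEAD, DEV_LIVE };

struct dev {
	enum dev_state state;
};

/*
 * Hypothetical stand-in for add_disk(). IO (e.g. a partition table read)
 * may be issued from inside this call, so it needs the device to be LIVE.
 * 'fail' forces an error to exercise the rollback path.
 */
static int fake_add_disk(struct dev *d, int fail)
{
	if (d->state != DEV_LIVE)
		return -1;	/* IO could not make forward progress */
	return fail ? -5 : 0;
}

/* Mark LIVE first, then add the disk; fall back to DEAD on failure. */
static int start_dev(struct dev *d, int fail)
{
	int ret;

	d->state = DEV_LIVE;
	ret = fake_add_disk(d, fail);
	if (ret)
		d->state = DEV_DEAD;
	return ret;
}
```

Setting the state only after the call returned would leave a window in which early IO sees a device that is not yet LIVE.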


* Re: [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD
  2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
                   ` (11 preceding siblings ...)
  2023-03-01 14:06 ` [RFC PATCH 12/12] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy Ming Lei
@ 2023-03-03  2:52 ` Ziyang Zhang
  2023-03-03  3:01   ` Ming Lei
  12 siblings, 1 reply; 17+ messages in thread
From: Ziyang Zhang @ 2023-03-03  2:52 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, io-uring

Hi Ming,

I tried this patchset, but there are some conflicts when applying it.
Could you please tell me the base branch? I have tried both the io_uring
and block trees.

Regards,
Zhang



* Re: [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD
  2023-03-03  2:52 ` [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ziyang Zhang
@ 2023-03-03  3:01   ` Ming Lei
  0 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2023-03-03  3:01 UTC (permalink / raw)
  To: Ziyang Zhang; +Cc: linux-block, io-uring

On Fri, Mar 03, 2023 at 10:52:00AM +0800, Ziyang Zhang wrote:
> Hi Ming,
> 
> I tried this patchset, but there are some conflicts when applying it.
> Could you please tell me the base branch? I have tried both the io_uring
> and block trees.

The patchset is against the following commit:

489fa31ea873 (master) Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

I guess there might be new io_uring/block commits merged recently.

I will send V2 out after 6.3-rc1 or -rc2 is released. 


Thanks,
Ming



Thread overview: 17+ messages
2023-03-01 14:05 [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
2023-03-01 14:06 ` [RFC PATCH 01/12] io_uring: increase io_kiocb->flags into 64bit Ming Lei
2023-03-01 14:06 ` [RFC PATCH 02/12] io_uring: define io_mapped_ubuf->acct_pages as unsigned integer Ming Lei
2023-03-01 14:06 ` [RFC PATCH 03/12] io_uring: extend io_mapped_ubuf to cover external bvec table Ming Lei
2023-03-01 14:06 ` [RFC PATCH 04/12] io_uring: rename io_mapped_ubuf as io_mapped_buf Ming Lei
2023-03-01 14:06 ` [RFC PATCH 05/12] io_uring: export 'struct io_mapped_buf' for fused cmd buffer Ming Lei
2023-03-01 14:06 ` [RFC PATCH 06/12] io_uring: add IO_URING_F_FUSED and prepare for supporting OP_FUSED_CMD Ming Lei
2023-03-01 14:06 ` [RFC PATCH 07/12] io_uring: add IORING_OP_FUSED_CMD Ming Lei
2023-03-01 14:06 ` [RFC PATCH 08/12] io_uring: support OP_READ/OP_WRITE for fused slave request Ming Lei
2023-03-01 14:06 ` [RFC PATCH 09/12] io_uring: support OP_SEND_ZC/OP_RECV " Ming Lei
2023-03-01 14:06 ` [RFC PATCH 10/12] block: ublk_drv: mark device as LIVE before adding disk Ming Lei
2023-03-03  2:25   ` Ziyang Zhang
2023-03-01 14:06 ` [RFC PATCH 11/12] block: ublk_drv: add common exit handling Ming Lei
2023-03-03  2:15   ` Ziyang Zhang
2023-03-01 14:06 ` [RFC PATCH 12/12] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy Ming Lei
2023-03-03  2:52 ` [RFC PATCH 00/12] io_uring: add IORING_OP_FUSED_CMD Ziyang Zhang
2023-03-03  3:01   ` Ming Lei
