public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
@ 2023-03-30 11:36 Ming Lei
  2023-03-30 11:36 ` [PATCH V6 01/17] io_uring: increase io_kiocb->flags into 64bit Ming Lei
                   ` (19 more replies)
  0 siblings, 20 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

Hello Jens and Guys,

Add generic fused command, which can include one primary command and multiple
secondary requests. This command provides one safe way to share resource between
primary command and secondary requests, and primary command is always
completed after all secondary requests are done, and resource lifetime
is bound with primary command.

With this way, it is easy to support zero copy for ublk/fuse device, and
there could be more potential use cases, such as offloading complicated logic
into userspace, or decouple kernel subsystems.

Follows ublksrv code, which implements zero copy for loop, nbd and
qcow2 targets with fused command:

https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6

All three(loop, nbd and qcow2) ublk targets have supported zero copy by passing:

	ublk add -t [loop|nbd|qcow2] -z .... 

Also add liburing test case for covering fused command based on miniublk
of blktest.

https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6

Performance improvement is obvious on memory bandwidth related workloads,
such as, 1~2X improvement on 64K/512K BS IO test on loop with ramfs backing file.
ublk-null shows 5X IOPS improvement on big BS test when the copy is avoided.

Please review and consider for v6.4.

V6:
	- re-design fused command, and make it more generic, moving sharing buffer
	as one plugin of fused command, so in future we can implement more plugins
	- document potential other use cases of fused command
	- drop support for builtin secondary sqe in SQE128, so all secondary
	  requests has standalone SQE
	- make fused command as one feature
	- cleanup & improve naming

V5:
	- rebase on for-6.4/io_uring
	- rename to primary/secondary as suggested by Jens
	- reserve interface for extending to support multiple secondary OPs in future,
	which isn't a must, because it can be done by submitting multiple fused
	commands with same primary request
	- rename to primary/secondary in ublksrv and liburing test code

V4:
	- improve APIs naming(patch 1 ~ 4)
	- improve documents and commit log(patch 2)
	- add buffer direction bit to opdef, suggested by Jens(patch 2)
	- add ublk zero copy document for cover: technical requirements(most related with
	buffer lifetime), and explains why splice isn't good and how fused command solves it(patch 17)
	- fix sparse warning(patch 7)
	- supports 64byte SQE fused command(patch 3)

V3:
	- fix build warning reported by kernel test robot
	- drop patch for checking fused flags on existed drivers with
	  ->uring_command(), which isn't necessary, since we do not do that
      when adding new ioctl or uring command
    - inline io_init_rq() for core code, so just export io_init_secondary_req
	- return result of failed secondary request unconditionally since REQ_F_CQE_SKIP
	will be cleared
	- pass xfstest over ublk-loop

V2:
	- don't resue io_mapped_ubuf (io_uring)
	- remove REQ_F_FUSED_MASTER_BIT (io_uring)
	- fix compile warning (io_uring)
	- rebase on v6.3-rc1 (io_uring)
	- grabbing io request reference when handling fused command 
	- simplify ublk_copy_user_pages() by iov iterator
	- add read()/write() for userspace to read/write ublk io buffer, so
	that some corner cases(read zero, passthrough request(report zones)) can
	be handled easily in case of zero copy; this way also helps to switch to
	zero copy completely
	- misc cleanup


Ming Lei (17):
  io_uring: increase io_kiocb->flags into 64bit
  io_uring: use ctx->cached_sq_head to calculate left sqes
  io_uring: add generic IORING_OP_FUSED_CMD
  io_uring: support providing buffer by IORING_OP_FUSED_CMD
  io_uring: support OP_READ/OP_WRITE for fused secondary request
  io_uring: support OP_SEND_ZC/OP_RECV for fused secondary request
  block: ublk_drv: add common exit handling
  block: ublk_drv: don't consider flush request in map/unmap io
  block: ublk_drv: add two helpers to clean up map/unmap request
  block: ublk_drv: clean up several helpers
  block: ublk_drv: cleanup 'struct ublk_map_data'
  block: ublk_drv: cleanup ublk_copy_user_pages
  block: ublk_drv: grab request reference when the request is handled by
    userspace
  block: ublk_drv: support to copy any part of request pages
  block: ublk_drv: add read()/write() support for ublk char device
  block: ublk_drv: don't check buffer in case of zero copy
  block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy

 Documentation/block/ublk.rst   | 126 ++++++-
 drivers/block/ublk_drv.c       | 603 ++++++++++++++++++++++++++-------
 include/linux/io_uring.h       |  41 ++-
 include/linux/io_uring_types.h |  76 +++--
 include/uapi/linux/io_uring.h  |  22 +-
 include/uapi/linux/ublk_cmd.h  |  37 +-
 io_uring/Makefile              |   2 +-
 io_uring/fused_cmd.c           | 239 +++++++++++++
 io_uring/fused_cmd.h           |  16 +
 io_uring/io_uring.c            |  57 +++-
 io_uring/io_uring.h            |   5 +
 io_uring/net.c                 |  30 +-
 io_uring/opdef.c               |  22 ++
 io_uring/opdef.h               |   7 +
 io_uring/rw.c                  |  21 ++
 15 files changed, 1124 insertions(+), 180 deletions(-)
 create mode 100644 io_uring/fused_cmd.c
 create mode 100644 io_uring/fused_cmd.h

-- 
2.39.2


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH V6 01/17] io_uring: increase io_kiocb->flags into 64bit
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 02/17] io_uring: use ctx->cached_sq_head to calculate left sqes Ming Lei
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

The 32bit io_kiocb->flags has been used up, so extend it to 64bit.

Signed-off-by: Ming Lei <[email protected]>
---
 include/linux/io_uring_types.h | 65 +++++++++++++++++-----------------
 io_uring/io_uring.c            |  2 +-
 2 files changed, 34 insertions(+), 33 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 561fa421c453..dd8ef886730b 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -414,68 +414,68 @@ enum {
 
 enum {
 	/* ctx owns file */
-	REQ_F_FIXED_FILE	= BIT(REQ_F_FIXED_FILE_BIT),
+	REQ_F_FIXED_FILE	= BIT_ULL(REQ_F_FIXED_FILE_BIT),
 	/* drain existing IO first */
-	REQ_F_IO_DRAIN		= BIT(REQ_F_IO_DRAIN_BIT),
+	REQ_F_IO_DRAIN		= BIT_ULL(REQ_F_IO_DRAIN_BIT),
 	/* linked sqes */
-	REQ_F_LINK		= BIT(REQ_F_LINK_BIT),
+	REQ_F_LINK		= BIT_ULL(REQ_F_LINK_BIT),
 	/* doesn't sever on completion < 0 */
-	REQ_F_HARDLINK		= BIT(REQ_F_HARDLINK_BIT),
+	REQ_F_HARDLINK		= BIT_ULL(REQ_F_HARDLINK_BIT),
 	/* IOSQE_ASYNC */
-	REQ_F_FORCE_ASYNC	= BIT(REQ_F_FORCE_ASYNC_BIT),
+	REQ_F_FORCE_ASYNC	= BIT_ULL(REQ_F_FORCE_ASYNC_BIT),
 	/* IOSQE_BUFFER_SELECT */
-	REQ_F_BUFFER_SELECT	= BIT(REQ_F_BUFFER_SELECT_BIT),
+	REQ_F_BUFFER_SELECT	= BIT_ULL(REQ_F_BUFFER_SELECT_BIT),
 	/* IOSQE_CQE_SKIP_SUCCESS */
-	REQ_F_CQE_SKIP		= BIT(REQ_F_CQE_SKIP_BIT),
+	REQ_F_CQE_SKIP		= BIT_ULL(REQ_F_CQE_SKIP_BIT),
 
 	/* fail rest of links */
-	REQ_F_FAIL		= BIT(REQ_F_FAIL_BIT),
+	REQ_F_FAIL		= BIT_ULL(REQ_F_FAIL_BIT),
 	/* on inflight list, should be cancelled and waited on exit reliably */
-	REQ_F_INFLIGHT		= BIT(REQ_F_INFLIGHT_BIT),
+	REQ_F_INFLIGHT		= BIT_ULL(REQ_F_INFLIGHT_BIT),
 	/* read/write uses file position */
-	REQ_F_CUR_POS		= BIT(REQ_F_CUR_POS_BIT),
+	REQ_F_CUR_POS		= BIT_ULL(REQ_F_CUR_POS_BIT),
 	/* must not punt to workers */
-	REQ_F_NOWAIT		= BIT(REQ_F_NOWAIT_BIT),
+	REQ_F_NOWAIT		= BIT_ULL(REQ_F_NOWAIT_BIT),
 	/* has or had linked timeout */
-	REQ_F_LINK_TIMEOUT	= BIT(REQ_F_LINK_TIMEOUT_BIT),
+	REQ_F_LINK_TIMEOUT	= BIT_ULL(REQ_F_LINK_TIMEOUT_BIT),
 	/* needs cleanup */
-	REQ_F_NEED_CLEANUP	= BIT(REQ_F_NEED_CLEANUP_BIT),
+	REQ_F_NEED_CLEANUP	= BIT_ULL(REQ_F_NEED_CLEANUP_BIT),
 	/* already went through poll handler */
-	REQ_F_POLLED		= BIT(REQ_F_POLLED_BIT),
+	REQ_F_POLLED		= BIT_ULL(REQ_F_POLLED_BIT),
 	/* buffer already selected */
-	REQ_F_BUFFER_SELECTED	= BIT(REQ_F_BUFFER_SELECTED_BIT),
+	REQ_F_BUFFER_SELECTED	= BIT_ULL(REQ_F_BUFFER_SELECTED_BIT),
 	/* buffer selected from ring, needs commit */
-	REQ_F_BUFFER_RING	= BIT(REQ_F_BUFFER_RING_BIT),
+	REQ_F_BUFFER_RING	= BIT_ULL(REQ_F_BUFFER_RING_BIT),
 	/* caller should reissue async */
-	REQ_F_REISSUE		= BIT(REQ_F_REISSUE_BIT),
+	REQ_F_REISSUE		= BIT_ULL(REQ_F_REISSUE_BIT),
 	/* supports async reads/writes */
-	REQ_F_SUPPORT_NOWAIT	= BIT(REQ_F_SUPPORT_NOWAIT_BIT),
+	REQ_F_SUPPORT_NOWAIT	= BIT_ULL(REQ_F_SUPPORT_NOWAIT_BIT),
 	/* regular file */
-	REQ_F_ISREG		= BIT(REQ_F_ISREG_BIT),
+	REQ_F_ISREG		= BIT_ULL(REQ_F_ISREG_BIT),
 	/* has creds assigned */
-	REQ_F_CREDS		= BIT(REQ_F_CREDS_BIT),
+	REQ_F_CREDS		= BIT_ULL(REQ_F_CREDS_BIT),
 	/* skip refcounting if not set */
-	REQ_F_REFCOUNT		= BIT(REQ_F_REFCOUNT_BIT),
+	REQ_F_REFCOUNT		= BIT_ULL(REQ_F_REFCOUNT_BIT),
 	/* there is a linked timeout that has to be armed */
-	REQ_F_ARM_LTIMEOUT	= BIT(REQ_F_ARM_LTIMEOUT_BIT),
+	REQ_F_ARM_LTIMEOUT	= BIT_ULL(REQ_F_ARM_LTIMEOUT_BIT),
 	/* ->async_data allocated */
-	REQ_F_ASYNC_DATA	= BIT(REQ_F_ASYNC_DATA_BIT),
+	REQ_F_ASYNC_DATA	= BIT_ULL(REQ_F_ASYNC_DATA_BIT),
 	/* don't post CQEs while failing linked requests */
-	REQ_F_SKIP_LINK_CQES	= BIT(REQ_F_SKIP_LINK_CQES_BIT),
+	REQ_F_SKIP_LINK_CQES	= BIT_ULL(REQ_F_SKIP_LINK_CQES_BIT),
 	/* single poll may be active */
-	REQ_F_SINGLE_POLL	= BIT(REQ_F_SINGLE_POLL_BIT),
+	REQ_F_SINGLE_POLL	= BIT_ULL(REQ_F_SINGLE_POLL_BIT),
 	/* double poll may active */
-	REQ_F_DOUBLE_POLL	= BIT(REQ_F_DOUBLE_POLL_BIT),
+	REQ_F_DOUBLE_POLL	= BIT_ULL(REQ_F_DOUBLE_POLL_BIT),
 	/* request has already done partial IO */
-	REQ_F_PARTIAL_IO	= BIT(REQ_F_PARTIAL_IO_BIT),
+	REQ_F_PARTIAL_IO	= BIT_ULL(REQ_F_PARTIAL_IO_BIT),
 	/* fast poll multishot mode */
-	REQ_F_APOLL_MULTISHOT	= BIT(REQ_F_APOLL_MULTISHOT_BIT),
+	REQ_F_APOLL_MULTISHOT	= BIT_ULL(REQ_F_APOLL_MULTISHOT_BIT),
 	/* ->extra1 and ->extra2 are initialised */
-	REQ_F_CQE32_INIT	= BIT(REQ_F_CQE32_INIT_BIT),
+	REQ_F_CQE32_INIT	= BIT_ULL(REQ_F_CQE32_INIT_BIT),
 	/* recvmsg special flag, clear EPOLLIN */
-	REQ_F_CLEAR_POLLIN	= BIT(REQ_F_CLEAR_POLLIN_BIT),
+	REQ_F_CLEAR_POLLIN	= BIT_ULL(REQ_F_CLEAR_POLLIN_BIT),
 	/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
-	REQ_F_HASH_LOCKED	= BIT(REQ_F_HASH_LOCKED_BIT),
+	REQ_F_HASH_LOCKED	= BIT_ULL(REQ_F_HASH_LOCKED_BIT),
 };
 
 typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
@@ -536,7 +536,8 @@ struct io_kiocb {
 	 * and after selection it points to the buffer ID itself.
 	 */
 	u16				buf_index;
-	unsigned int			flags;
+	u32				__pad;
+	u64				flags;
 
 	struct io_cqe			cqe;
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 536940675c67..693558c4b10b 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -4486,7 +4486,7 @@ static int __init io_uring_init(void)
 	BUILD_BUG_ON(SQE_COMMON_FLAGS >= (1 << 8));
 	BUILD_BUG_ON((SQE_VALID_FLAGS | SQE_COMMON_FLAGS) != SQE_VALID_FLAGS);
 
-	BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(int));
+	BUILD_BUG_ON(__REQ_F_LAST_BIT > 8 * sizeof(u64));
 
 	BUILD_BUG_ON(sizeof(atomic_t) != sizeof(u32));
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 02/17] io_uring: use ctx->cached_sq_head to calculate left sqes
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
  2023-03-30 11:36 ` [PATCH V6 01/17] io_uring: increase io_kiocb->flags into 64bit Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 03/17] io_uring: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

Use ctx->cached_sq_head to calculate 'left' sqes, and prepare for supporting
fused requests, which need to get req/sqe in its own ->prep() callback.

ctx->cached_sq_head should always be cached or to be fetched, so this change
is just fine.

Signed-off-by: Ming Lei <[email protected]>
---
 io_uring/io_uring.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 693558c4b10b..25a940f0ab68 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2429,15 +2429,16 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 	__must_hold(&ctx->uring_lock)
 {
 	unsigned int entries = io_sqring_entries(ctx);
-	unsigned int left;
+	unsigned old_head = ctx->cached_sq_head;
+	unsigned int left = 0;
 	int ret;
 
 	if (unlikely(!entries))
 		return 0;
 	/* make sure SQ entry isn't read before tail */
-	ret = left = min3(nr, ctx->sq_entries, entries);
-	io_get_task_refs(left);
-	io_submit_state_start(&ctx->submit_state, left);
+	ret = min3(nr, ctx->sq_entries, entries);
+	io_get_task_refs(ret);
+	io_submit_state_start(&ctx->submit_state, ret);
 
 	do {
 		const struct io_uring_sqe *sqe;
@@ -2456,11 +2457,12 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 		 */
 		if (unlikely(io_submit_sqe(ctx, req, sqe)) &&
 		    !(ctx->flags & IORING_SETUP_SUBMIT_ALL)) {
-			left--;
+			left = 1;
 			break;
 		}
-	} while (--left);
+	} while ((ctx->cached_sq_head - old_head) < ret);
 
+	left = ret - (ctx->cached_sq_head - old_head) - left;
 	if (unlikely(left)) {
 		ret -= left;
 		/* try again if it submitted nothing and can't allocate a req */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 03/17] io_uring: add generic IORING_OP_FUSED_CMD
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
  2023-03-30 11:36 ` [PATCH V6 01/17] io_uring: increase io_kiocb->flags into 64bit Ming Lei
  2023-03-30 11:36 ` [PATCH V6 02/17] io_uring: use ctx->cached_sq_head to calculate left sqes Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-04-01 14:35   ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 04/17] io_uring: support providing buffer by IORING_OP_FUSED_CMD Ming Lei
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

Multiple requests submitted as one whole request logically, and the 1st one
is primary command(IORING_OP_FUSED_CMD), and the others are secondary
requests, which number can be retrieved from primary SQE.

Primary command is responsible for providing resources and submitting
secondary requests, which depends on primary command's resources, and
primary command won't be completed until all secondary requests are done.

The provided resource has same lifetime with primary command, and it
lifetime won't cross multiple OPs, and this way provides safe way for
secondary OPs to use the resource.

Add generic IORING_OP_FUSED_CMD for modeling this primary/secondary
relationship among requests.

So far, the motivation is for supporting ublk zero copy, but it is one
generic framework which could be used for other use cases.

This way actually provides one generic interface for doing something
efficiently:

1) offload complicated logic to userspace, such as one generic lvm

- one read IO request is coming to this lvm block device
- the kernel driver gets notified, and start to handle the read request
- the driver finds that mapping for this IO isn't cached, so wakeup userspace
daemon, which calls KV store or generic btree mapping API to get the mapped
offset/len, then call IORING_OP_FUSED_CMD to handle the read request in
zero copy style
- when IORING_OP_FUSED_CMD is completed, the read IO request is completed

2) decouple kernel subsystems
- subsystem A wants to use subsystem B's service
- subsystem B doesn't provide kernel internal APIs, since it may be
  hard to support, but anytime generic syscalls are provided for userspace
- subsystem A can notify userspace proxy to call IORING_OP_FUSED_CMD for
  using subsystem B's service in resource-sharing way, such as zero copy

Signed-off-by: Ming Lei <[email protected]>
---
 include/linux/io_uring.h       |  15 ++++-
 include/linux/io_uring_types.h |   6 ++
 include/uapi/linux/io_uring.h  |  14 ++++
 io_uring/Makefile              |   2 +-
 io_uring/fused_cmd.c           | 119 +++++++++++++++++++++++++++++++++
 io_uring/fused_cmd.h           |   9 +++
 io_uring/io_uring.c            |  41 ++++++++++--
 io_uring/io_uring.h            |   5 ++
 io_uring/opdef.c               |  12 ++++
 io_uring/opdef.h               |   2 +
 10 files changed, 216 insertions(+), 9 deletions(-)
 create mode 100644 io_uring/fused_cmd.c
 create mode 100644 io_uring/fused_cmd.h

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 35b9328ca335..1a7e93b20fbf 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -22,6 +22,11 @@ enum io_uring_cmd_flags {
 	IO_URING_F_IOPOLL		= (1 << 10),
 };
 
+union io_uring_fused_cmd_data {
+	/* secondary list, fused cmd private, driver do not touch it */
+	struct io_kiocb *__secondary;
+};
+
 struct io_uring_cmd {
 	struct file	*file;
 	const void	*cmd;
@@ -33,7 +38,15 @@ struct io_uring_cmd {
 	};
 	u32		cmd_op;
 	u32		flags;
-	u8		pdu[32]; /* available inline for free use */
+
+	/* for fused command, the available pdu is a bit less */
+	union {
+		struct {
+			union io_uring_fused_cmd_data data;
+			u8	pdu[24]; /* available inline for free use */
+		} fused;
+		u8		pdu[32]; /* available inline for free use */
+	};
 };
 
 #if defined(CONFIG_IO_URING)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index dd8ef886730b..e7449d2d8f3c 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -407,6 +407,7 @@ enum {
 	/* keep async read/write and isreg together and in order */
 	REQ_F_SUPPORT_NOWAIT_BIT,
 	REQ_F_ISREG_BIT,
+	REQ_F_FUSED_SECONDARY_BIT,
 
 	/* not a real bit, just to check we're not overflowing the space */
 	__REQ_F_LAST_BIT,
@@ -476,6 +477,8 @@ enum {
 	REQ_F_CLEAR_POLLIN	= BIT_ULL(REQ_F_CLEAR_POLLIN_BIT),
 	/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
 	REQ_F_HASH_LOCKED	= BIT_ULL(REQ_F_HASH_LOCKED_BIT),
+	/* secondary request in fused cmd, won't be one uring cmd */
+	REQ_F_FUSED_SECONDARY	= BIT_ULL(REQ_F_FUSED_SECONDARY_BIT),
 };
 
 typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
@@ -558,6 +561,9 @@ struct io_kiocb {
 		 * REQ_F_BUFFER_RING is set.
 		 */
 		struct io_buffer_list	*buf_list;
+
+		/* store fused command's primary request for the secondary */
+		struct io_kiocb *fused_primary_req;
 	};
 
 	union {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index f8d14d1c58d3..ee7d413e43fc 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -73,6 +73,8 @@ struct io_uring_sqe {
 		__u16	buf_index;
 		/* for grouped buffer selection */
 		__u16	buf_group;
+		/* how many secondary normal SQEs following this fused SQE */
+		__u16	nr_secondary;
 	} __attribute__((packed));
 	/* personality to use, if used */
 	__u16	personality;
@@ -173,6 +175,16 @@ enum {
  */
 #define IORING_SETUP_DEFER_TASKRUN	(1U << 13)
 
+/*
+ * Multiple requests submitted as one whole request logically, and the 1st
+ * one is primary request, and the others are secondary requests, which number
+ * can be retrieved from primary SQE.
+ *
+ * Primary request is responsible for submitting secondary requests, and it
+ * won't be completed until all secondary requests are done.
+ */
+#define IORING_SETUP_FUSED_REQ		(1U << 14)
+
 enum io_uring_op {
 	IORING_OP_NOP,
 	IORING_OP_READV,
@@ -223,6 +235,7 @@ enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_FUSED_CMD,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -476,6 +489,7 @@ struct io_uring_params {
 #define IORING_FEAT_CQE_SKIP		(1U << 11)
 #define IORING_FEAT_LINKED_FILE		(1U << 12)
 #define IORING_FEAT_REG_REG_RING	(1U << 13)
+#define IORING_FEAT_FUSED_REQ		(1U << 14)
 
 /*
  * io_uring_register(2) opcodes and arguments
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 8cc8e5387a75..5301077e61c5 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -7,5 +7,5 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o xattr.o nop.o fs.o splice.o \
 					openclose.o uring_cmd.o epoll.o \
 					statx.o net.o msg_ring.o timeout.o \
 					sqpoll.o fdinfo.o tctx.o poll.o \
-					cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o
+					cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o fused_cmd.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
diff --git a/io_uring/fused_cmd.c b/io_uring/fused_cmd.c
new file mode 100644
index 000000000000..f964e69fa4aa
--- /dev/null
+++ b/io_uring/fused_cmd.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/io_uring.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "io_uring.h"
+#include "opdef.h"
+#include "uring_cmd.h"
+#include "fused_cmd.h"
+
+static bool io_fused_secondary_valid(u8 op)
+{
+	if (op == IORING_OP_FUSED_CMD)
+		return false;
+
+	if (!io_issue_defs[op].fused_secondary)
+		return false;
+
+	return true;
+}
+
+int io_fused_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+	__must_hold(&req->ctx->uring_lock)
+{
+	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	const struct io_uring_sqe *secondary_sqe = NULL;
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_kiocb *secondary;
+	u8 secondary_op;
+	int ret;
+
+	if (!(ctx->flags & IORING_SETUP_FUSED_REQ))
+		return -EINVAL;
+
+	if (unlikely(sqe->__pad1))
+		return -EINVAL;
+
+	/*
+	 * Only support single secondary request, in future we may extend to
+	 * support multiple secondary requests, which can be covered by
+	 * multiple fused command now.
+	 */
+	if (unlikely(sqe->nr_secondary != 1))
+		return -EINVAL;
+
+	ioucmd->flags = READ_ONCE(sqe->uring_cmd_flags);
+	if (unlikely(ioucmd->flags))
+		return -EINVAL;
+
+	if (unlikely(!io_get_secondary_sqe(ctx, &secondary_sqe)))
+		return -EINVAL;
+
+	if (unlikely(!secondary_sqe))
+		return -EINVAL;
+
+	secondary_op = READ_ONCE(secondary_sqe->opcode);
+	if (unlikely(!io_fused_secondary_valid(secondary_op)))
+		return -EINVAL;
+
+	ioucmd->cmd = sqe->cmd;
+	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
+
+	ret = -ENOMEM;
+	if (unlikely(!io_alloc_req(ctx, &secondary)))
+		goto fail;
+
+	ret = io_init_secondary_req(ctx, secondary, secondary_sqe,
+			REQ_F_FUSED_SECONDARY);
+	if (unlikely(ret))
+		goto fail_free_req;
+
+	ioucmd->fused.data.__secondary = secondary;
+
+	return 0;
+
+fail_free_req:
+	io_free_req(secondary);
+fail:
+	return ret;
+}
+
+int io_fused_cmd(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	int ret = -EINVAL;
+
+	ret = io_uring_cmd(req, issue_flags);
+	if (ret != IOU_ISSUE_SKIP_COMPLETE)
+		io_free_req(ioucmd->fused.data.__secondary);
+
+	return ret;
+}
+
+/*
+ * Called after secondary request is completed,
+ *
+ * Notify primary request by the saved callback that we are done
+ */
+void io_fused_cmd_complete_secondary(struct io_kiocb *secondary)
+{
+	struct io_kiocb *req = secondary->fused_primary_req;
+	struct io_uring_cmd *ioucmd;
+
+	if (unlikely(!req || !(secondary->flags & REQ_F_FUSED_SECONDARY)))
+		return;
+
+	/* notify primary command that we are done */
+	secondary->fused_primary_req = NULL;
+	ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	ioucmd->fused.data.__secondary = NULL;
+
+	io_uring_cmd_complete_in_task(ioucmd, ioucmd->task_work_cb);
+}
diff --git a/io_uring/fused_cmd.h b/io_uring/fused_cmd.h
new file mode 100644
index 000000000000..162a4d70b12e
--- /dev/null
+++ b/io_uring/fused_cmd.h
@@ -0,0 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_FUSED_CMD_H
+#define IOU_FUSED_CMD_H
+
+int io_fused_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_fused_cmd(struct io_kiocb *req, unsigned int issue_flags);
+void io_fused_cmd_complete_secondary(struct io_kiocb *secondary);
+
+#endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 25a940f0ab68..0aaf26bdc72a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -92,6 +92,7 @@
 #include "cancel.h"
 #include "net.h"
 #include "notif.h"
+#include "fused_cmd.h"
 
 #include "timeout.h"
 #include "poll.h"
@@ -111,7 +112,7 @@
 
 #define IO_REQ_CLEAN_FLAGS (REQ_F_BUFFER_SELECTED | REQ_F_NEED_CLEANUP | \
 				REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS | \
-				REQ_F_ASYNC_DATA)
+				REQ_F_ASYNC_DATA | REQ_F_FUSED_SECONDARY)
 
 #define IO_REQ_CLEAN_SLOW_FLAGS (REQ_F_REFCOUNT | REQ_F_LINK | REQ_F_HARDLINK |\
 				 IO_REQ_CLEAN_FLAGS)
@@ -971,6 +972,9 @@ static void __io_req_complete_post(struct io_kiocb *req)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
+	if (req->flags & REQ_F_FUSED_SECONDARY)
+		io_fused_cmd_complete_secondary(req);
+
 	io_cq_lock(ctx);
 	if (!(req->flags & REQ_F_CQE_SKIP))
 		io_fill_cqe_req(ctx, req);
@@ -1855,6 +1859,8 @@ static void io_clean_op(struct io_kiocb *req)
 		spin_lock(&req->ctx->completion_lock);
 		io_put_kbuf_comp(req);
 		spin_unlock(&req->ctx->completion_lock);
+	} else if (req->flags & REQ_F_FUSED_SECONDARY) {
+		io_fused_cmd_complete_secondary(req);
 	}
 
 	if (req->flags & REQ_F_NEED_CLEANUP) {
@@ -2163,8 +2169,8 @@ static void io_init_req_drain(struct io_kiocb *req)
 	}
 }
 
-static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
-		       const struct io_uring_sqe *sqe)
+static inline int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
+		const struct io_uring_sqe *sqe, u64 secondary_flags)
 	__must_hold(&ctx->uring_lock)
 {
 	const struct io_issue_def *def;
@@ -2217,6 +2223,12 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		}
 	}
 
+	if (secondary_flags) {
+		if (!def->fused_secondary)
+			return -EINVAL;
+		req->flags |= secondary_flags;
+	}
+
 	if (!def->ioprio && sqe->ioprio)
 		return -EINVAL;
 	if (!def->iopoll && (ctx->flags & IORING_SETUP_IOPOLL))
@@ -2257,6 +2269,12 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	return def->prep(req, sqe);
 }
 
+int io_init_secondary_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
+		const struct io_uring_sqe *sqe, u64 flags)
+{
+	return io_init_req(ctx, req, sqe, flags);
+}
+
 static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
 				      struct io_kiocb *req, int ret)
 {
@@ -2301,7 +2319,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	struct io_submit_link *link = &ctx->submit_state.link;
 	int ret;
 
-	ret = io_init_req(ctx, req, sqe);
+	ret = io_init_req(ctx, req, sqe, 0);
 	if (unlikely(ret))
 		return io_submit_fail_init(sqe, req, ret);
 
@@ -2396,7 +2414,8 @@ static void io_commit_sqring(struct io_ring_ctx *ctx)
  * used, it's important that those reads are done through READ_ONCE() to
  * prevent a re-load down the line.
  */
-static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe)
+static inline bool io_get_sqe(struct io_ring_ctx *ctx,
+		const struct io_uring_sqe **sqe)
 {
 	unsigned head, mask = ctx->sq_entries - 1;
 	unsigned sq_idx = ctx->cached_sq_head++ & mask;
@@ -2425,6 +2444,12 @@ static bool io_get_sqe(struct io_ring_ctx *ctx, const struct io_uring_sqe **sqe)
 	return false;
 }
 
+bool io_get_secondary_sqe(struct io_ring_ctx *ctx,
+		const struct io_uring_sqe **sqe)
+{
+	return io_get_sqe(ctx, sqe);
+}
+
 int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 	__must_hold(&ctx->uring_lock)
 {
@@ -3855,7 +3880,8 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 			IORING_FEAT_POLL_32BITS | IORING_FEAT_SQPOLL_NONFIXED |
 			IORING_FEAT_EXT_ARG | IORING_FEAT_NATIVE_WORKERS |
 			IORING_FEAT_RSRC_TAGS | IORING_FEAT_CQE_SKIP |
-			IORING_FEAT_LINKED_FILE | IORING_FEAT_REG_REG_RING;
+			IORING_FEAT_LINKED_FILE | IORING_FEAT_REG_REG_RING |
+			IORING_FEAT_FUSED_REQ;
 
 	if (copy_to_user(params, p, sizeof(*p))) {
 		ret = -EFAULT;
@@ -3913,7 +3939,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL |
 			IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG |
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
-			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN))
+			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
+			IORING_SETUP_FUSED_REQ))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index c33f719731ac..0a6fb37489b0 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -78,6 +78,11 @@ bool __io_alloc_req_refill(struct io_ring_ctx *ctx);
 bool io_match_task_safe(struct io_kiocb *head, struct task_struct *task,
 			bool cancel_all);
 
+bool io_get_secondary_sqe(struct io_ring_ctx *ctx,
+		const struct io_uring_sqe **sqe);
+int io_init_secondary_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
+		const struct io_uring_sqe *sqe, u64 flags);
+
 #define io_lockdep_assert_cq_locked(ctx)				\
 	do {								\
 		if (ctx->flags & IORING_SETUP_IOPOLL) {			\
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index cca7c5b55208..63b90e8e65f8 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -33,6 +33,7 @@
 #include "poll.h"
 #include "cancel.h"
 #include "rw.h"
+#include "fused_cmd.h"
 
 static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
 {
@@ -428,6 +429,12 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_FUSED_CMD] = {
+		.needs_file		= 1,
+		.plug			= 1,
+		.prep			= io_fused_cmd_prep,
+		.issue			= io_fused_cmd,
+	},
 };
 
 
@@ -648,6 +655,11 @@ const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_FUSED_CMD] = {
+		.name			= "FUSED_CMD",
+		.async_size		= uring_cmd_pdu_size(1),
+		.prep_async		= io_uring_cmd_prep_async,
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/opdef.h b/io_uring/opdef.h
index c22c8696e749..feaf0ff90c5d 100644
--- a/io_uring/opdef.h
+++ b/io_uring/opdef.h
@@ -29,6 +29,8 @@ struct io_issue_def {
 	unsigned		iopoll_queue : 1;
 	/* opcode specific path will handle ->async_data allocation if needed */
 	unsigned		manual_alloc : 1;
+	/* can be secondary op of fused command */
+	unsigned		fused_secondary : 1;
 
 	int (*issue)(struct io_kiocb *, unsigned int);
 	int (*prep)(struct io_kiocb *, const struct io_uring_sqe *);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 04/17] io_uring: support providing buffer by IORING_OP_FUSED_CMD
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (2 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 03/17] io_uring: add generic IORING_OP_FUSED_CMD Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 05/17] io_uring: support OP_READ/OP_WRITE for fused secondary request Ming Lei
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

Add UAPI flag of IORING_FUSED_CMD_BUF so that primary command can provide
buffer to secondary OPs. And this is just one of fused command use cases,
and more use cases can be supported by passing different uring_command flags.

This way is like one plugin of IORING_OP_FUSED_CMD, and in future we
could add more plugins.

The primary command provides buffer to secondary OPs, and secondary OPs
can use the buffer in safe way because:

- the primary command is always completed after all secondary OPs
are completed
- the provided buffer has same lifetime with primary command
- buffer lifetime won't cross multiple OPs

The motivation is for supporting zero copy for fuse/ublk, in which
the device holds IO request buffer, and IO handling is often normal
IO OP(fs, net, ..). IORING_OP_FUSED_CMD/IORING_FUSED_CMD_BUF may help
to support zero copy for any userspace driver/device.

Signed-off-by: Ming Lei <[email protected]>
---
 include/linux/io_uring.h       |  26 +++++++
 include/linux/io_uring_types.h |   5 ++
 include/uapi/linux/io_uring.h  |   8 ++-
 io_uring/fused_cmd.c           | 124 ++++++++++++++++++++++++++++++++-
 io_uring/fused_cmd.h           |   7 ++
 io_uring/opdef.h               |   5 ++
 6 files changed, 172 insertions(+), 3 deletions(-)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 1a7e93b20fbf..be5e00be5201 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -4,6 +4,7 @@
 
 #include <linux/sched.h>
 #include <linux/xarray.h>
+#include <linux/bvec.h>
 #include <uapi/linux/io_uring.h>
 
 enum io_uring_cmd_flags {
@@ -20,6 +21,23 @@ enum io_uring_cmd_flags {
 	IO_URING_F_SQE128		= (1 << 8),
 	IO_URING_F_CQE32		= (1 << 9),
 	IO_URING_F_IOPOLL		= (1 << 10),
+
+	/* for FUSED_CMD with IORING_FUSED_CMD_BUF only */
+	IO_URING_F_FUSED_BUF_DEST		= (1 << 11), /* secondary write to buffer */
+	IO_URING_F_FUSED_BUF_SRC		= (1 << 12), /* secondary read from buffer */
+	/* driver incapable of FUSED_CMD should fail cmd when seeing F_FUSED */
+	IO_URING_F_FUSED_BUF		= IO_URING_F_FUSED_BUF_DEST |
+		IO_URING_F_FUSED_BUF_SRC,
+};
+
+struct io_uring_bvec_buf {
+	unsigned long	len;
+	unsigned int	nr_bvecs;
+
+	/* offset in the 1st bvec */
+	unsigned int		offset;
+	const struct bio_vec	*bvec;
+	struct bio_vec		__bvec[];
 };
 
 union io_uring_fused_cmd_data {
@@ -50,6 +68,9 @@ struct io_uring_cmd {
 };
 
 #if defined(CONFIG_IO_URING)
+void io_fused_provide_buf_and_start(struct io_uring_cmd *cmd,
+		unsigned issue_flags, const struct io_uring_bvec_buf *buf,
+		void (*complete_tw_cb)(struct io_uring_cmd *, unsigned));
 int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd);
 void io_uring_cmd_done(struct io_uring_cmd *cmd, ssize_t ret, ssize_t res2,
@@ -80,6 +101,11 @@ static inline void io_uring_free(struct task_struct *tsk)
 		__io_uring_free(tsk);
 }
 #else
+static inline void io_fused_provide_buf_and_start(struct io_uring_cmd *cmd,
+		unsigned issue_flags, const struct io_uring_bvec_buf *buf,
+		void (*complete_tw_cb)(struct io_uring_cmd *, unsigned))
+{
+}
 static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
 			      struct iov_iter *iter, void *ioucmd)
 {
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index e7449d2d8f3c..9cff455bfe6f 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -562,6 +562,11 @@ struct io_kiocb {
 		 */
 		struct io_buffer_list	*buf_list;
 
+		/*
+		 * store kernel buffer provided by fused primary request
+		 */
+		const struct io_uring_bvec_buf *fused_cmd_kbuf;
+
 		/* store fused command's primary request for the secondary */
 		struct io_kiocb *fused_primary_req;
 	};
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ee7d413e43fc..d200906c1a21 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -245,8 +245,14 @@ enum io_uring_op {
  * sqe->uring_cmd_flags
  * IORING_URING_CMD_FIXED	use registered buffer; pass this flag
  *				along with setting sqe->buf_index.
+ *
+ * IORING_FUSED_CMD_BUF		fused primary command provides buffer
+ *				for secondary requests which can retrieve
+ *				any sub-buffer with offset(sqe->addr) and
+ *				len(sqe->len)
  */
-#define IORING_URING_CMD_FIXED	(1U << 0)
+#define IORING_URING_CMD_FIXED		(1U << 0)
+#define IORING_FUSED_CMD_BUF		(1U << 1)
 
 
 /*
diff --git a/io_uring/fused_cmd.c b/io_uring/fused_cmd.c
index f964e69fa4aa..46e2e8640e30 100644
--- a/io_uring/fused_cmd.c
+++ b/io_uring/fused_cmd.c
@@ -25,6 +25,23 @@ static bool io_fused_secondary_valid(u8 op)
 	return true;
 }
 
+static int io_fused_prep_provide_buf(struct io_uring_cmd *ioucmd,
+		const struct io_uring_sqe *sqe)
+{
+	struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
+	unsigned int sqe_flags = READ_ONCE(sqe->flags);
+
+	/*
+	 * Primary command is for providing buffer, non-sense to
+	 * set buffer select any more
+	 */
+	if (sqe_flags & REQ_F_BUFFER_SELECT)
+		return -EINVAL;
+
+	req->fused_cmd_kbuf = NULL;
+	return 0;
+}
+
 int io_fused_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	__must_hold(&req->ctx->uring_lock)
 {
@@ -50,8 +67,14 @@ int io_fused_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return -EINVAL;
 
 	ioucmd->flags = READ_ONCE(sqe->uring_cmd_flags);
-	if (unlikely(ioucmd->flags))
-		return -EINVAL;
+
+	/* so far, only support plugin of providing buffer */
+	if (ioucmd->flags & IORING_FUSED_CMD_BUF)
+		ret = io_fused_prep_provide_buf(ioucmd, sqe);
+	else
+		ret = -EINVAL;
+	if (ret)
+		return ret;
 
 	if (unlikely(!io_get_secondary_sqe(ctx, &secondary_sqe)))
 		return -EINVAL;
@@ -88,8 +111,20 @@ int io_fused_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 int io_fused_cmd(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	const struct io_kiocb *secondary = ioucmd->fused.data.__secondary;
 	int ret = -EINVAL;
 
+	if (ioucmd->flags & IORING_FUSED_CMD_BUF) {
+		/*
+		 * Pass buffer direction for driver for validating if the
+		 * requested buffer direction is legal
+		 */
+		if (io_issue_defs[secondary->opcode].buf_dir)
+			issue_flags |= IO_URING_F_FUSED_BUF_DEST;
+		else
+			issue_flags |= IO_URING_F_FUSED_BUF_SRC;
+	}
+
 	ret = io_uring_cmd(req, issue_flags);
 	if (ret != IOU_ISSUE_SKIP_COMPLETE)
 		io_free_req(ioucmd->fused.data.__secondary);
@@ -117,3 +152,88 @@ void io_fused_cmd_complete_secondary(struct io_kiocb *secondary)
 
 	io_uring_cmd_complete_in_task(ioucmd, ioucmd->task_work_cb);
 }
+
+/* only for IORING_FUSED_CMD_BUF */
+int io_import_buf_from_fused(unsigned long buf_off, unsigned int len,
+		int dir, struct iov_iter *iter, struct io_kiocb *secondary)
+{
+	struct io_kiocb *req = secondary->fused_primary_req;
+	const struct io_uring_bvec_buf *kbuf;
+	struct io_uring_cmd *primary;
+	unsigned long offset;
+
+	if (unlikely(!(secondary->flags & REQ_F_FUSED_SECONDARY) || !req))
+		return -EINVAL;
+
+	if (unlikely(!req->fused_cmd_kbuf))
+		return -EINVAL;
+
+	primary = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	if (unlikely(!(primary->flags & IORING_FUSED_CMD_BUF)))
+		return -EINVAL;
+
+	/* req->fused_cmd_kbuf is immutable */
+	kbuf = req->fused_cmd_kbuf;
+	offset = kbuf->offset;
+
+	if (!kbuf->bvec)
+		return -EINVAL;
+
+	if (unlikely(buf_off > kbuf->len))
+		return -EFAULT;
+
+	if (unlikely(len > kbuf->len - buf_off))
+		return -EFAULT;
+
+	/* don't use io_import_fixed which doesn't support multipage bvec */
+	offset += buf_off;
+	iov_iter_bvec(iter, dir, kbuf->bvec, kbuf->nr_bvecs, offset + len);
+
+	if (offset)
+		iov_iter_advance(iter, offset);
+
+	return 0;
+}
+
+/*
+ * Called for starting secondary request after primary command prepared io
+ * buffer, only for IORING_FUSED_CMD_BUF
+ *
+ * Secondary request borrows primary's io buffer for handling the secondary
+ * operation, and the buffer is returned back via io_fused_complete_secondary
+ * after the secondary request is completed. Meantime the primary command is
+ * completed. And driver gets completion notification by the passed callback
+ * of @complete_tw_cb.
+ */
+void io_fused_provide_buf_and_start(struct io_uring_cmd *ioucmd,
+		unsigned issue_flags,
+		const struct io_uring_bvec_buf *fused_cmd_kbuf,
+		void (*complete_tw_cb)(struct io_uring_cmd *, unsigned))
+{
+	struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
+	struct io_kiocb *secondary = ioucmd->fused.data.__secondary;
+	struct io_tw_state ts = {
+		.locked = !(issue_flags & IO_URING_F_UNLOCKED),
+	};
+
+	if (unlikely(!(ioucmd->flags & IORING_FUSED_CMD_BUF)))
+		return;
+
+	if (WARN_ON_ONCE(unlikely(!secondary || !(secondary->flags &
+						REQ_F_FUSED_SECONDARY))))
+		return;
+
+	/*
+	 * Once the fused secondary request is completed and the buffer isn't be
+	 * used, the driver will be notified by callback of complete_tw_cb
+	 */
+	ioucmd->task_work_cb = complete_tw_cb;
+
+	/* now we get the buffer */
+	req->fused_cmd_kbuf = fused_cmd_kbuf;
+	secondary->fused_primary_req = req;
+
+	trace_io_uring_submit_sqe(secondary, true);
+	io_req_task_submit(secondary, &ts);
+}
+EXPORT_SYMBOL_GPL(io_fused_provide_buf_and_start);
diff --git a/io_uring/fused_cmd.h b/io_uring/fused_cmd.h
index 162a4d70b12e..fef491757356 100644
--- a/io_uring/fused_cmd.h
+++ b/io_uring/fused_cmd.h
@@ -5,5 +5,12 @@
 int io_fused_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_fused_cmd(struct io_kiocb *req, unsigned int issue_flags);
 void io_fused_cmd_complete_secondary(struct io_kiocb *secondary);
+int io_import_buf_from_fused(unsigned long buf_off, unsigned int len,
+		int dir, struct iov_iter *iter, struct io_kiocb *secondary);
+
+static inline bool io_req_use_fused_buf(struct io_kiocb *req)
+{
+	return (req->flags & REQ_F_FUSED_SECONDARY) && req->fused_primary_req;
+}
 
 #endif
diff --git a/io_uring/opdef.h b/io_uring/opdef.h
index feaf0ff90c5d..bded61ebcbfc 100644
--- a/io_uring/opdef.h
+++ b/io_uring/opdef.h
@@ -31,6 +31,11 @@ struct io_issue_def {
 	unsigned		manual_alloc : 1;
 	/* can be secondary op of fused command */
 	unsigned		fused_secondary : 1;
+	/*
+	 * buffer direction, 0 : read from buffer, 1: write to buffer, used
+	 * for fused_secondary only
+	 */
+	unsigned		buf_dir : 1;
 
 	int (*issue)(struct io_kiocb *, unsigned int);
 	int (*prep)(struct io_kiocb *, const struct io_uring_sqe *);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 05/17] io_uring: support OP_READ/OP_WRITE for fused secondary request
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (3 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 04/17] io_uring: support providing buffer by IORING_OP_FUSED_CMD Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 06/17] io_uring: support OP_SEND_ZC/OP_RECV " Ming Lei
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

Start to allow fused secondary request to support OP_READ/OP_WRITE, and
the buffer can be retrieved from the primary request.

Once the secondary request is completed, the primary request buffer will
be returned back.

Signed-off-by: Ming Lei <[email protected]>
---
 io_uring/opdef.c |  4 ++++
 io_uring/rw.c    | 21 +++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 63b90e8e65f8..d81c9afd65ed 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -235,6 +235,8 @@ const struct io_issue_def io_issue_defs[] = {
 		.ioprio			= 1,
 		.iopoll			= 1,
 		.iopoll_queue		= 1,
+		.fused_secondary	= 1,
+		.buf_dir		= WRITE,
 		.prep			= io_prep_rw,
 		.issue			= io_read,
 	},
@@ -248,6 +250,8 @@ const struct io_issue_def io_issue_defs[] = {
 		.ioprio			= 1,
 		.iopoll			= 1,
 		.iopoll_queue		= 1,
+		.fused_secondary	= 1,
+		.buf_dir		= READ,
 		.prep			= io_prep_rw,
 		.issue			= io_write,
 	},
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 5431caf1e331..5ce7c8a2f74d 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -19,6 +19,7 @@
 #include "kbuf.h"
 #include "rsrc.h"
 #include "rw.h"
+#include "fused_cmd.h"
 
 struct io_rw {
 	/* NOTE: kiocb has the file as the first member, so don't do it here */
@@ -371,6 +372,18 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
 	size_t sqe_len;
 	ssize_t ret;
 
+	/*
+	 * fused_secondary OP passes buffer offset from sqe->addr actually, since
+	 * the fused cmd buf's mapped start address is zero.
+	 */
+	if (io_req_use_fused_buf(req)) {
+		ret = io_import_buf_from_fused(rw->addr, rw->len, ddir,
+				iter, req);
+		if (ret)
+			return ERR_PTR(ret);
+		return NULL;
+	}
+
 	if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
 		ret = io_import_fixed(ddir, iter, req->imu, rw->addr, rw->len);
 		if (ret)
@@ -443,11 +456,19 @@ static inline loff_t *io_kiocb_ppos(struct kiocb *kiocb)
  */
 static ssize_t loop_rw_iter(int ddir, struct io_rw *rw, struct iov_iter *iter)
 {
+	struct io_kiocb *req = cmd_to_io_kiocb(rw);
 	struct kiocb *kiocb = &rw->kiocb;
 	struct file *file = kiocb->ki_filp;
 	ssize_t ret = 0;
 	loff_t *ppos;
 
+	/*
+	 * Fused secondary req hasn't user buffer, so ->read/->write can't
+	 * be supported
+	 */
+	if (io_req_use_fused_buf(req))
+		return -EOPNOTSUPP;
+
 	/*
 	 * Don't support polled IO through this interface, and we can't
 	 * support non-blocking either. For the latter, this just causes
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 06/17] io_uring: support OP_SEND_ZC/OP_RECV for fused secondary request
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (4 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 05/17] io_uring: support OP_READ/OP_WRITE for fused secondary request Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 07/17] block: ublk_drv: add common exit handling Ming Lei
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

Start to allow fused secondary request to support OP_SEND_ZC/OP_RECV, and
the buffer can be retrieved from primary request.

Once secondary request is completed, the primary command will be done and
the associated buffer is returned back.

Signed-off-by: Ming Lei <[email protected]>
---
 io_uring/net.c   | 30 ++++++++++++++++++++++++++++--
 io_uring/opdef.c |  6 ++++++
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index 4040cf093318..f0fce1db7596 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -16,6 +16,7 @@
 #include "net.h"
 #include "notif.h"
 #include "rsrc.h"
+#include "fused_cmd.h"
 
 #if defined(CONFIG_NET)
 struct io_shutdown {
@@ -69,6 +70,13 @@ struct io_sr_msg {
 	struct io_kiocb 		*notif;
 };
 
+#define user_ptr_to_u64(x) (		\
+{					\
+	typecheck(void __user *, (x));		\
+	(u64)(unsigned long)(x);	\
+}					\
+)
+
 static inline bool io_check_multishot(struct io_kiocb *req,
 				      unsigned int issue_flags)
 {
@@ -379,7 +387,11 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(!sock))
 		return -ENOTSOCK;
 
-	ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len, &msg.msg_iter);
+	if (!io_req_use_fused_buf(req))
+		ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len, &msg.msg_iter);
+	else
+		ret = io_import_buf_from_fused(user_ptr_to_u64(sr->buf),
+				sr->len, ITER_SOURCE, &msg.msg_iter, req);
 	if (unlikely(ret))
 		return ret;
 
@@ -870,7 +882,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 		sr->buf = buf;
 	}
 
-	ret = import_ubuf(ITER_DEST, sr->buf, len, &msg.msg_iter);
+	if (!io_req_use_fused_buf(req))
+		ret = import_ubuf(ITER_DEST, sr->buf, len, &msg.msg_iter);
+	else
+		ret = io_import_buf_from_fused(user_ptr_to_u64(sr->buf),
+				sr->len, ITER_DEST, &msg.msg_iter, req);
 	if (unlikely(ret))
 		goto out_free;
 
@@ -984,6 +1000,9 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
 		unsigned idx = READ_ONCE(sqe->buf_index);
 
+		if (io_req_use_fused_buf(req))
+			return -EINVAL;
+
 		if (unlikely(idx >= ctx->nr_user_bufs))
 			return -EFAULT;
 		idx = array_index_nospec(idx, ctx->nr_user_bufs);
@@ -1120,8 +1139,15 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 		if (unlikely(ret))
 			return ret;
 		msg.sg_from_iter = io_sg_from_iter;
+	} else if (io_req_use_fused_buf(req)) {
+		ret = io_import_buf_from_fused(user_ptr_to_u64(zc->buf),
+				zc->len, ITER_SOURCE, &msg.msg_iter, req);
+		if (unlikely(ret))
+			return ret;
+		msg.sg_from_iter = io_sg_from_iter;
 	} else {
 		io_notif_set_extended(zc->notif);
+
 		ret = import_ubuf(ITER_SOURCE, zc->buf, zc->len, &msg.msg_iter);
 		if (unlikely(ret))
 			return ret;
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index d81c9afd65ed..c31badf4fe45 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -273,6 +273,8 @@ const struct io_issue_def io_issue_defs[] = {
 		.audit_skip		= 1,
 		.ioprio			= 1,
 		.manual_alloc		= 1,
+		.fused_secondary	= 1,
+		.buf_dir		= READ,
 #if defined(CONFIG_NET)
 		.prep			= io_sendmsg_prep,
 		.issue			= io_send,
@@ -287,6 +289,8 @@ const struct io_issue_def io_issue_defs[] = {
 		.buffer_select		= 1,
 		.audit_skip		= 1,
 		.ioprio			= 1,
+		.fused_secondary	= 1,
+		.buf_dir		= WRITE,
 #if defined(CONFIG_NET)
 		.prep			= io_recvmsg_prep,
 		.issue			= io_recv,
@@ -413,6 +417,8 @@ const struct io_issue_def io_issue_defs[] = {
 		.audit_skip		= 1,
 		.ioprio			= 1,
 		.manual_alloc		= 1,
+		.fused_secondary	= 1,
+		.buf_dir		= READ,
 #if defined(CONFIG_NET)
 		.prep			= io_send_zc_prep,
 		.issue			= io_send_zc,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 07/17] block: ublk_drv: add common exit handling
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (5 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 06/17] io_uring: support OP_SEND_ZC/OP_RECV " Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 08/17] block: ublk_drv: don't consider flush request in map/unmap io Ming Lei
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

Simplify exit handling a bit, and prepare for supporting fused command.

Reviewed-by: Ziyang Zhang <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index c73cc57ec547..bc46616710d4 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -655,14 +655,15 @@ static void ublk_complete_rq(struct request *req)
 	struct ublk_queue *ubq = req->mq_hctx->driver_data;
 	struct ublk_io *io = &ubq->ios[req->tag];
 	unsigned int unmapped_bytes;
+	blk_status_t res = BLK_STS_OK;
 
 	/* failed read IO if nothing is read */
 	if (!io->res && req_op(req) == REQ_OP_READ)
 		io->res = -EIO;
 
 	if (io->res < 0) {
-		blk_mq_end_request(req, errno_to_blk_status(io->res));
-		return;
+		res = errno_to_blk_status(io->res);
+		goto exit;
 	}
 
 	/*
@@ -671,10 +672,8 @@ static void ublk_complete_rq(struct request *req)
 	 *
 	 * Both the two needn't unmap.
 	 */
-	if (req_op(req) != REQ_OP_READ && req_op(req) != REQ_OP_WRITE) {
-		blk_mq_end_request(req, BLK_STS_OK);
-		return;
-	}
+	if (req_op(req) != REQ_OP_READ && req_op(req) != REQ_OP_WRITE)
+		goto exit;
 
 	/* for READ request, writing data in iod->addr to rq buffers */
 	unmapped_bytes = ublk_unmap_io(ubq, req, io);
@@ -691,6 +690,10 @@ static void ublk_complete_rq(struct request *req)
 		blk_mq_requeue_request(req, true);
 	else
 		__blk_mq_end_request(req, BLK_STS_OK);
+
+	return;
+exit:
+	blk_mq_end_request(req, res);
 }
 
 /*
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 08/17] block: ublk_drv: don't consider flush request in map/unmap io
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (6 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 07/17] block: ublk_drv: add common exit handling Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 09/17] block: ublk_drv: add two helpers to clean up map/unmap request Ming Lei
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

There isn't data in request of REQ_OP_FLUSH always, so don't consider
it in both ublk_map_io() and ublk_unmap_io().

Reviewed-by: Ziyang Zhang <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index bc46616710d4..c73b2dba25ce 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -529,15 +529,13 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
 		struct ublk_io *io)
 {
 	const unsigned int rq_bytes = blk_rq_bytes(req);
+
 	/*
 	 * no zero copy, we delay copy WRITE request data into ublksrv
 	 * context and the big benefit is that pinning pages in current
 	 * context is pretty fast, see ublk_pin_user_pages
 	 */
-	if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH)
-		return rq_bytes;
-
-	if (ublk_rq_has_data(req)) {
+	if (ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE) {
 		struct ublk_map_data data = {
 			.ubq	=	ubq,
 			.rq	=	req,
@@ -774,9 +772,7 @@ static inline void __ublk_rq_task_work(struct request *req,
 		return;
 	}
 
-	if (ublk_need_get_data(ubq) &&
-			(req_op(req) == REQ_OP_WRITE ||
-			req_op(req) == REQ_OP_FLUSH)) {
+	if (ublk_need_get_data(ubq) && (req_op(req) == REQ_OP_WRITE)) {
 		/*
 		 * We have not handled UBLK_IO_NEED_GET_DATA command yet,
 		 * so immepdately pass UBLK_IO_RES_NEED_GET_DATA to ublksrv
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 09/17] block: ublk_drv: add two helpers to clean up map/unmap request
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (7 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 08/17] block: ublk_drv: don't consider flush request in map/unmap io Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 10/17] block: ublk_drv: clean up several helpers Ming Lei
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

Add two helpers for checking if map/unmap is needed, since we may have
passthrough request which needs map or unmap in future, such as for
supporting report zones.

Meantime don't mark ublk_copy_user_pages as inline since this function
is a bit fat now.

Reviewed-by: Ziyang Zhang <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index c73b2dba25ce..f87597a7d679 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -488,8 +488,7 @@ static inline unsigned ublk_copy_io_pages(struct ublk_io_iter *data,
 	return done;
 }
 
-static inline int ublk_copy_user_pages(struct ublk_map_data *data,
-		bool to_vm)
+static int ublk_copy_user_pages(struct ublk_map_data *data, bool to_vm)
 {
 	const unsigned int gup_flags = to_vm ? FOLL_WRITE : 0;
 	const unsigned long start_vm = data->io->addr;
@@ -525,6 +524,16 @@ static inline int ublk_copy_user_pages(struct ublk_map_data *data,
 	return done;
 }
 
+static inline bool ublk_need_map_req(const struct request *req)
+{
+	return ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE;
+}
+
+static inline bool ublk_need_unmap_req(const struct request *req)
+{
+	return ublk_rq_has_data(req) && req_op(req) == REQ_OP_READ;
+}
+
 static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
 		struct ublk_io *io)
 {
@@ -535,7 +544,7 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
 	 * context and the big benefit is that pinning pages in current
 	 * context is pretty fast, see ublk_pin_user_pages
 	 */
-	if (ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE) {
+	if (ublk_need_map_req(req)) {
 		struct ublk_map_data data = {
 			.ubq	=	ubq,
 			.rq	=	req,
@@ -556,7 +565,7 @@ static int ublk_unmap_io(const struct ublk_queue *ubq,
 {
 	const unsigned int rq_bytes = blk_rq_bytes(req);
 
-	if (req_op(req) == REQ_OP_READ && ublk_rq_has_data(req)) {
+	if (ublk_need_unmap_req(req)) {
 		struct ublk_map_data data = {
 			.ubq	=	ubq,
 			.rq	=	req,
@@ -772,7 +781,7 @@ static inline void __ublk_rq_task_work(struct request *req,
 		return;
 	}
 
-	if (ublk_need_get_data(ubq) && (req_op(req) == REQ_OP_WRITE)) {
+	if (ublk_need_get_data(ubq) && ublk_need_map_req(req)) {
 		/*
 		 * We have not handled UBLK_IO_NEED_GET_DATA command yet,
 		 * so immepdately pass UBLK_IO_RES_NEED_GET_DATA to ublksrv
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 10/17] block: ublk_drv: clean up several helpers
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (8 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 09/17] block: ublk_drv: add two helpers to clean up map/unmap request Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 11/17] block: ublk_drv: cleanup 'struct ublk_map_data' Ming Lei
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

Convert the following pattern in several helpers
	if (Z)
		return true
	return false
into:
	return Z;

Reviewed-by: Ziyang Zhang <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index f87597a7d679..1c057003a40a 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -298,9 +298,7 @@ static inline bool ublk_can_use_task_work(const struct ublk_queue *ubq)
 
 static inline bool ublk_need_get_data(const struct ublk_queue *ubq)
 {
-	if (ubq->flags & UBLK_F_NEED_GET_DATA)
-		return true;
-	return false;
+	return ubq->flags & UBLK_F_NEED_GET_DATA;
 }
 
 static struct ublk_device *ublk_get_device(struct ublk_device *ub)
@@ -349,25 +347,19 @@ static inline int ublk_queue_cmd_buf_size(struct ublk_device *ub, int q_id)
 static inline bool ublk_queue_can_use_recovery_reissue(
 		struct ublk_queue *ubq)
 {
-	if ((ubq->flags & UBLK_F_USER_RECOVERY) &&
-			(ubq->flags & UBLK_F_USER_RECOVERY_REISSUE))
-		return true;
-	return false;
+	return (ubq->flags & UBLK_F_USER_RECOVERY) &&
+			(ubq->flags & UBLK_F_USER_RECOVERY_REISSUE);
 }
 
 static inline bool ublk_queue_can_use_recovery(
 		struct ublk_queue *ubq)
 {
-	if (ubq->flags & UBLK_F_USER_RECOVERY)
-		return true;
-	return false;
+	return ubq->flags & UBLK_F_USER_RECOVERY;
 }
 
 static inline bool ublk_can_use_recovery(struct ublk_device *ub)
 {
-	if (ub->dev_info.flags & UBLK_F_USER_RECOVERY)
-		return true;
-	return false;
+	return ub->dev_info.flags & UBLK_F_USER_RECOVERY;
 }
 
 static void ublk_free_disk(struct gendisk *disk)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 11/17] block: ublk_drv: cleanup 'struct ublk_map_data'
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (9 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 10/17] block: ublk_drv: clean up several helpers Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 12/17] block: ublk_drv: cleanup ublk_copy_user_pages Ming Lei
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

'struct ublk_map_data' is passed to ublk_copy_user_pages()
for copying data between userspace buffer and request pages.

Here what matters is userspace buffer address/len and 'struct request',
so replace ->io field with user buffer address, and rename max_bytes
as len.

Meantime remove 'ubq' field from ublk_map_data, since it isn't used
any more.

Then code becomes more readable.

Reviewed-by: Ziyang Zhang <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c | 27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 1c057003a40a..fdccbf5fdaa1 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -420,10 +420,9 @@ static const struct block_device_operations ub_fops = {
 #define UBLK_MAX_PIN_PAGES	32
 
 struct ublk_map_data {
-	const struct ublk_queue *ubq;
 	const struct request *rq;
-	const struct ublk_io *io;
-	unsigned max_bytes;
+	unsigned long	ubuf;
+	unsigned int	len;
 };
 
 struct ublk_io_iter {
@@ -483,14 +482,14 @@ static inline unsigned ublk_copy_io_pages(struct ublk_io_iter *data,
 static int ublk_copy_user_pages(struct ublk_map_data *data, bool to_vm)
 {
 	const unsigned int gup_flags = to_vm ? FOLL_WRITE : 0;
-	const unsigned long start_vm = data->io->addr;
+	const unsigned long start_vm = data->ubuf;
 	unsigned int done = 0;
 	struct ublk_io_iter iter = {
 		.pg_off	= start_vm & (PAGE_SIZE - 1),
 		.bio	= data->rq->bio,
 		.iter	= data->rq->bio->bi_iter,
 	};
-	const unsigned int nr_pages = round_up(data->max_bytes +
+	const unsigned int nr_pages = round_up(data->len +
 			(start_vm & (PAGE_SIZE - 1)), PAGE_SIZE) >> PAGE_SHIFT;
 
 	while (done < nr_pages) {
@@ -503,13 +502,13 @@ static int ublk_copy_user_pages(struct ublk_map_data *data, bool to_vm)
 				iter.pages);
 		if (iter.nr_pages <= 0)
 			return done == 0 ? iter.nr_pages : done;
-		len = ublk_copy_io_pages(&iter, data->max_bytes, to_vm);
+		len = ublk_copy_io_pages(&iter, data->len, to_vm);
 		for (i = 0; i < iter.nr_pages; i++) {
 			if (to_vm)
 				set_page_dirty(iter.pages[i]);
 			put_page(iter.pages[i]);
 		}
-		data->max_bytes -= len;
+		data->len -= len;
 		done += iter.nr_pages;
 	}
 
@@ -538,15 +537,14 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
 	 */
 	if (ublk_need_map_req(req)) {
 		struct ublk_map_data data = {
-			.ubq	=	ubq,
 			.rq	=	req,
-			.io	=	io,
-			.max_bytes =	rq_bytes,
+			.ubuf	=	io->addr,
+			.len	=	rq_bytes,
 		};
 
 		ublk_copy_user_pages(&data, true);
 
-		return rq_bytes - data.max_bytes;
+		return rq_bytes - data.len;
 	}
 	return rq_bytes;
 }
@@ -559,17 +557,16 @@ static int ublk_unmap_io(const struct ublk_queue *ubq,
 
 	if (ublk_need_unmap_req(req)) {
 		struct ublk_map_data data = {
-			.ubq	=	ubq,
 			.rq	=	req,
-			.io	=	io,
-			.max_bytes =	io->res,
+			.ubuf	=	io->addr,
+			.len	=	io->res,
 		};
 
 		WARN_ON_ONCE(io->res > rq_bytes);
 
 		ublk_copy_user_pages(&data, false);
 
-		return io->res - data.max_bytes;
+		return io->res - data.len;
 	}
 	return rq_bytes;
 }
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 12/17] block: ublk_drv: cleanup ublk_copy_user_pages
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (10 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 11/17] block: ublk_drv: cleanup 'struct ublk_map_data' Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-31 16:22   ` Bernd Schubert
  2023-03-30 11:36 ` [PATCH V6 13/17] block: ublk_drv: grab request reference when the request is handled by userspace Ming Lei
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

Clean up ublk_copy_user_pages() by using iov iter, and code
gets simplified a lot and becomes much more readable than before.

Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c | 112 +++++++++++++++++----------------------
 1 file changed, 49 insertions(+), 63 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index fdccbf5fdaa1..cca0e95a89d8 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -419,49 +419,39 @@ static const struct block_device_operations ub_fops = {
 
 #define UBLK_MAX_PIN_PAGES	32
 
-struct ublk_map_data {
-	const struct request *rq;
-	unsigned long	ubuf;
-	unsigned int	len;
-};
-
 struct ublk_io_iter {
 	struct page *pages[UBLK_MAX_PIN_PAGES];
-	unsigned pg_off;	/* offset in the 1st page in pages */
-	int nr_pages;		/* how many page pointers in pages */
 	struct bio *bio;
 	struct bvec_iter iter;
 };
 
-static inline unsigned ublk_copy_io_pages(struct ublk_io_iter *data,
-		unsigned max_bytes, bool to_vm)
+/* return how many pages are copied */
+static void ublk_copy_io_pages(struct ublk_io_iter *data,
+		size_t total, size_t pg_off, int dir)
 {
-	const unsigned total = min_t(unsigned, max_bytes,
-			PAGE_SIZE - data->pg_off +
-			((data->nr_pages - 1) << PAGE_SHIFT));
 	unsigned done = 0;
 	unsigned pg_idx = 0;
 
 	while (done < total) {
 		struct bio_vec bv = bio_iter_iovec(data->bio, data->iter);
-		const unsigned int bytes = min3(bv.bv_len, total - done,
-				(unsigned)(PAGE_SIZE - data->pg_off));
+		unsigned int bytes = min3(bv.bv_len, (unsigned)total - done,
+				(unsigned)(PAGE_SIZE - pg_off));
 		void *bv_buf = bvec_kmap_local(&bv);
 		void *pg_buf = kmap_local_page(data->pages[pg_idx]);
 
-		if (to_vm)
-			memcpy(pg_buf + data->pg_off, bv_buf, bytes);
+		if (dir == ITER_DEST)
+			memcpy(pg_buf + pg_off, bv_buf, bytes);
 		else
-			memcpy(bv_buf, pg_buf + data->pg_off, bytes);
+			memcpy(bv_buf, pg_buf + pg_off, bytes);
 
 		kunmap_local(pg_buf);
 		kunmap_local(bv_buf);
 
 		/* advance page array */
-		data->pg_off += bytes;
-		if (data->pg_off == PAGE_SIZE) {
+		pg_off += bytes;
+		if (pg_off == PAGE_SIZE) {
 			pg_idx += 1;
-			data->pg_off = 0;
+			pg_off = 0;
 		}
 
 		done += bytes;
@@ -475,41 +465,40 @@ static inline unsigned ublk_copy_io_pages(struct ublk_io_iter *data,
 			data->iter = data->bio->bi_iter;
 		}
 	}
-
-	return done;
 }
 
-static int ublk_copy_user_pages(struct ublk_map_data *data, bool to_vm)
+/*
+ * Copy data between request pages and io_iter, and 'offset'
+ * is the start point of linear offset of request.
+ */
+static size_t ublk_copy_user_pages(const struct request *req,
+		struct iov_iter *uiter, int dir)
 {
-	const unsigned int gup_flags = to_vm ? FOLL_WRITE : 0;
-	const unsigned long start_vm = data->ubuf;
-	unsigned int done = 0;
 	struct ublk_io_iter iter = {
-		.pg_off	= start_vm & (PAGE_SIZE - 1),
-		.bio	= data->rq->bio,
-		.iter	= data->rq->bio->bi_iter,
+		.bio	= req->bio,
+		.iter	= req->bio->bi_iter,
 	};
-	const unsigned int nr_pages = round_up(data->len +
-			(start_vm & (PAGE_SIZE - 1)), PAGE_SIZE) >> PAGE_SHIFT;
-
-	while (done < nr_pages) {
-		const unsigned to_pin = min_t(unsigned, UBLK_MAX_PIN_PAGES,
-				nr_pages - done);
-		unsigned i, len;
-
-		iter.nr_pages = get_user_pages_fast(start_vm +
-				(done << PAGE_SHIFT), to_pin, gup_flags,
-				iter.pages);
-		if (iter.nr_pages <= 0)
-			return done == 0 ? iter.nr_pages : done;
-		len = ublk_copy_io_pages(&iter, data->len, to_vm);
-		for (i = 0; i < iter.nr_pages; i++) {
-			if (to_vm)
+	size_t done = 0;
+
+	while (iov_iter_count(uiter) && iter.bio) {
+		unsigned nr_pages;
+		size_t len, off;
+		int i;
+
+		len = iov_iter_get_pages2(uiter, iter.pages,
+				iov_iter_count(uiter),
+				UBLK_MAX_PIN_PAGES, &off);
+		if (len <= 0)
+			return done;
+
+		ublk_copy_io_pages(&iter, len, off, dir);
+		nr_pages = DIV_ROUND_UP(len + off, PAGE_SIZE);
+		for (i = 0; i < nr_pages; i++) {
+			if (dir == ITER_DEST)
 				set_page_dirty(iter.pages[i]);
 			put_page(iter.pages[i]);
 		}
-		data->len -= len;
-		done += iter.nr_pages;
+		done += len;
 	}
 
 	return done;
@@ -536,15 +525,14 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
 	 * context is pretty fast, see ublk_pin_user_pages
 	 */
 	if (ublk_need_map_req(req)) {
-		struct ublk_map_data data = {
-			.rq	=	req,
-			.ubuf	=	io->addr,
-			.len	=	rq_bytes,
-		};
+		struct iov_iter iter;
+		struct iovec iov;
+		const int dir = ITER_DEST;
 
-		ublk_copy_user_pages(&data, true);
+		import_single_range(dir, u64_to_user_ptr(io->addr), rq_bytes,
+				&iov, &iter);
 
-		return rq_bytes - data.len;
+		return ublk_copy_user_pages(req, &iter, dir);
 	}
 	return rq_bytes;
 }
@@ -556,17 +544,15 @@ static int ublk_unmap_io(const struct ublk_queue *ubq,
 	const unsigned int rq_bytes = blk_rq_bytes(req);
 
 	if (ublk_need_unmap_req(req)) {
-		struct ublk_map_data data = {
-			.rq	=	req,
-			.ubuf	=	io->addr,
-			.len	=	io->res,
-		};
+		struct iov_iter iter;
+		struct iovec iov;
+		const int dir = ITER_SOURCE;
 
 		WARN_ON_ONCE(io->res > rq_bytes);
 
-		ublk_copy_user_pages(&data, false);
-
-		return io->res - data.len;
+		import_single_range(dir, u64_to_user_ptr(io->addr), io->res,
+				&iov, &iter);
+		return ublk_copy_user_pages(req, &iter, dir);
 	}
 	return rq_bytes;
 }
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 13/17] block: ublk_drv: grab request reference when the request is handled by userspace
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (11 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 12/17] block: ublk_drv: cleanup ublk_copy_user_pages Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 14/17] block: ublk_drv: support to copy any part of request pages Ming Lei
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

Add one reference counter into request pdu data, and hold this reference
in the request's lifetime. This way is always safe. In theory, the ublk
request won't be completed until fused commands are done. However, it
is userspace, and application can submit fused command at will.

Prepare for supporting zero copy, which needs to retrieve request buffer
by fused command, so we have to guarantee:

- the fused command can't succeed unless the request isn't queued

- when any fused command is successful, this request can't be freed
until all fused commands on this request are done.

Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c | 67 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 64 insertions(+), 3 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index cca0e95a89d8..0dc8eb04b9a5 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -43,6 +43,7 @@
 #include <asm/page.h>
 #include <linux/task_work.h>
 #include <linux/namei.h>
+#include <linux/kref.h>
 #include <uapi/linux/ublk_cmd.h>
 
 #define UBLK_MINORS		(1U << MINORBITS)
@@ -62,6 +63,17 @@
 struct ublk_rq_data {
 	struct llist_node node;
 	struct callback_head work;
+
+	/*
+	 * Only for applying fused command to support zero copy:
+	 *
+	 * - if there is any fused command aiming at this request, not complete
+	 *   request until all fused commands are done
+	 *
+	 * - fused command has to fail unless this reference is grabbed
+	 *   successfully
+	 */
+	struct kref ref;
 };
 
 struct ublk_uring_cmd_pdu {
@@ -180,6 +192,9 @@ struct ublk_params_header {
 	__u32	types;
 };
 
+static inline void __ublk_complete_rq(struct request *req);
+static void ublk_complete_rq(struct kref *ref);
+
 static dev_t ublk_chr_devt;
 static struct class *ublk_chr_class;
 
@@ -288,6 +303,35 @@ static int ublk_apply_params(struct ublk_device *ub)
 	return 0;
 }
 
+static inline bool ublk_support_zc(const struct ublk_queue *ubq)
+{
+	return ubq->flags & UBLK_F_SUPPORT_ZERO_COPY;
+}
+
+static inline bool ublk_get_req_ref(const struct ublk_queue *ubq,
+		struct request *req)
+{
+	if (ublk_support_zc(ubq)) {
+		struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
+
+		return kref_get_unless_zero(&data->ref);
+	}
+
+	return true;
+}
+
+static inline void ublk_put_req_ref(const struct ublk_queue *ubq,
+		struct request *req)
+{
+	if (ublk_support_zc(ubq)) {
+		struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
+
+		kref_put(&data->ref, ublk_complete_rq);
+	} else {
+		__ublk_complete_rq(req);
+	}
+}
+
 static inline bool ublk_can_use_task_work(const struct ublk_queue *ubq)
 {
 	if (IS_BUILTIN(CONFIG_BLK_DEV_UBLK) &&
@@ -632,13 +676,19 @@ static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
 }
 
 /* todo: handle partial completion */
-static void ublk_complete_rq(struct request *req)
+static inline void __ublk_complete_rq(struct request *req)
 {
 	struct ublk_queue *ubq = req->mq_hctx->driver_data;
 	struct ublk_io *io = &ubq->ios[req->tag];
 	unsigned int unmapped_bytes;
 	blk_status_t res = BLK_STS_OK;
 
+	/* called from ublk_abort_queue() code path */
+	if (io->flags & UBLK_IO_FLAG_ABORTED) {
+		res = BLK_STS_IOERR;
+		goto exit;
+	}
+
 	/* failed read IO if nothing is read */
 	if (!io->res && req_op(req) == REQ_OP_READ)
 		io->res = -EIO;
@@ -678,6 +728,15 @@ static void ublk_complete_rq(struct request *req)
 	blk_mq_end_request(req, res);
 }
 
+static void ublk_complete_rq(struct kref *ref)
+{
+	struct ublk_rq_data *data = container_of(ref, struct ublk_rq_data,
+			ref);
+	struct request *req = blk_mq_rq_from_pdu(data);
+
+	__ublk_complete_rq(req);
+}
+
 /*
  * Since __ublk_rq_task_work always fails requests immediately during
  * exiting, __ublk_fail_req() is only called from abort context during
@@ -696,7 +755,7 @@ static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io,
 		if (ublk_queue_can_use_recovery_reissue(ubq))
 			blk_mq_requeue_request(req, false);
 		else
-			blk_mq_end_request(req, BLK_STS_IOERR);
+			ublk_put_req_ref(ubq, req);
 	}
 }
 
@@ -734,6 +793,7 @@ static inline void __ublk_rq_task_work(struct request *req,
 				       unsigned issue_flags)
 {
 	struct ublk_queue *ubq = req->mq_hctx->driver_data;
+	struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
 	int tag = req->tag;
 	struct ublk_io *io = &ubq->ios[tag];
 	unsigned int mapped_bytes;
@@ -805,6 +865,7 @@ static inline void __ublk_rq_task_work(struct request *req,
 			mapped_bytes >> 9;
 	}
 
+	kref_init(&data->ref);
 	ubq_complete_io_cmd(io, UBLK_IO_RES_OK, issue_flags);
 }
 
@@ -1017,7 +1078,7 @@ static void ublk_commit_completion(struct ublk_device *ub,
 	req = blk_mq_tag_to_rq(ub->tag_set.tags[qid], tag);
 
 	if (req && likely(!blk_should_fake_timeout(req->q)))
-		ublk_complete_rq(req);
+		ublk_put_req_ref(ubq, req);
 }
 
 /*
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 14/17] block: ublk_drv: support to copy any part of request pages
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (12 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 13/17] block: ublk_drv: grab request reference when the request is handled by userspace Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 15/17] block: ublk_drv: add read()/write() support for ublk char device Ming Lei
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

Add 'offset' to 'struct ublk_map_data', so that ublk_copy_user_pages()
can be used to copy any sub-buffer(linear mapped) of the request.

Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 0dc8eb04b9a5..32304942ab87 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -511,19 +511,36 @@ static void ublk_copy_io_pages(struct ublk_io_iter *data,
 	}
 }
 
+static bool ublk_advance_io_iter(const struct request *req,
+		struct ublk_io_iter *iter, unsigned int offset)
+{
+	struct bio *bio = req->bio;
+
+	for_each_bio(bio) {
+		if (bio->bi_iter.bi_size > offset) {
+			iter->bio = bio;
+			iter->iter = bio->bi_iter;
+			bio_advance_iter(iter->bio, &iter->iter, offset);
+			return true;
+		}
+		offset -= bio->bi_iter.bi_size;
+	}
+	return false;
+}
+
 /*
  * Copy data between request pages and io_iter, and 'offset'
  * is the start point of linear offset of request.
  */
 static size_t ublk_copy_user_pages(const struct request *req,
-		struct iov_iter *uiter, int dir)
+		unsigned offset, struct iov_iter *uiter, int dir)
 {
-	struct ublk_io_iter iter = {
-		.bio	= req->bio,
-		.iter	= req->bio->bi_iter,
-	};
+	struct ublk_io_iter iter;
 	size_t done = 0;
 
+	if (!ublk_advance_io_iter(req, &iter, offset))
+		return 0;
+
 	while (iov_iter_count(uiter) && iter.bio) {
 		unsigned nr_pages;
 		size_t len, off;
@@ -576,7 +593,7 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
 		import_single_range(dir, u64_to_user_ptr(io->addr), rq_bytes,
 				&iov, &iter);
 
-		return ublk_copy_user_pages(req, &iter, dir);
+		return ublk_copy_user_pages(req, 0, &iter, dir);
 	}
 	return rq_bytes;
 }
@@ -596,7 +613,7 @@ static int ublk_unmap_io(const struct ublk_queue *ubq,
 
 		import_single_range(dir, u64_to_user_ptr(io->addr), io->res,
 				&iov, &iter);
-		return ublk_copy_user_pages(req, &iter, dir);
+		return ublk_copy_user_pages(req, 0, &iter, dir);
 	}
 	return rq_bytes;
 }
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 15/17] block: ublk_drv: add read()/write() support for ublk char device
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (13 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 14/17] block: ublk_drv: support to copy any part of request pages Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 16/17] block: ublk_drv: don't check buffer in case of zero copy Ming Lei
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

We are going to support zero copy by fused uring command, the userspace
can't read from or write to the io buffer any more, it becomes not
flexible for applications:

1) some targets need to zero buffer explicitly, such as when reading
unmapped qcow2 cluster

2) some targets need to support passthrough command, such as zoned
report zones, and still need to read/write the io buffer

Support pread()/pwrite() on ublk char device for reading/writing request
io buffer, so ublk server can handle the above cases easily.

This also can help to make zero copy becoming the primary option, and
non-zero-copy will become legacy code path since the added read()/write()
can cover non-zero-copy feature.

Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c      | 131 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/ublk_cmd.h |  31 +++++++-
 2 files changed, 161 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 32304942ab87..0ff70cda4b91 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1322,6 +1322,36 @@ static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
 	ublk_queue_cmd(ubq, req);
 }
 
+static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
+		struct ublk_queue *ubq, int tag, size_t offset)
+{
+	struct request *req;
+
+	if (!ublk_support_zc(ubq))
+		return NULL;
+
+	req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+	if (!req)
+		return NULL;
+
+	if (!ublk_get_req_ref(ubq, req))
+		return NULL;
+
+	if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
+		goto fail_put;
+
+	if (!ublk_rq_has_data(req))
+		goto fail_put;
+
+	if (offset > blk_rq_bytes(req))
+		goto fail_put;
+
+	return req;
+fail_put:
+	ublk_put_req_ref(ubq, req);
+	return NULL;
+}
+
 static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 {
 	struct ublksrv_io_cmd *ub_cmd = (struct ublksrv_io_cmd *)cmd->cmd;
@@ -1423,11 +1453,112 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	return -EIOCBQUEUED;
 }
 
+static inline bool ublk_check_ubuf_dir(const struct request *req,
+		int ubuf_dir)
+{
+	/* copy ubuf to request pages */
+	if (req_op(req) == REQ_OP_READ && ubuf_dir == ITER_SOURCE)
+		return true;
+
+	/* copy request pages to ubuf */
+	if (req_op(req) == REQ_OP_WRITE && ubuf_dir == ITER_DEST)
+		return true;
+
+	return false;
+}
+
+static struct request *ublk_check_and_get_req(struct kiocb *iocb,
+		struct iov_iter *iter, size_t *off, int dir)
+{
+	struct ublk_device *ub = iocb->ki_filp->private_data;
+	struct ublk_queue *ubq;
+	struct request *req;
+	size_t buf_off;
+	u16 tag, q_id;
+
+	if (!ub)
+		return ERR_PTR(-EACCES);
+
+	if (!user_backed_iter(iter))
+		return ERR_PTR(-EACCES);
+
+	if (ub->dev_info.state == UBLK_S_DEV_DEAD)
+		return ERR_PTR(-EACCES);
+
+	tag = ublk_pos_to_tag(iocb->ki_pos);
+	q_id = ublk_pos_to_hwq(iocb->ki_pos);
+	buf_off = ublk_pos_to_buf_offset(iocb->ki_pos);
+
+	if (q_id >= ub->dev_info.nr_hw_queues)
+		return ERR_PTR(-EINVAL);
+
+	ubq = ublk_get_queue(ub, q_id);
+	if (!ubq)
+		return ERR_PTR(-EINVAL);
+
+	if (tag >= ubq->q_depth)
+		return ERR_PTR(-EINVAL);
+
+	req = __ublk_check_and_get_req(ub, ubq, tag, buf_off);
+	if (!req)
+		return ERR_PTR(-EINVAL);
+
+	if (!req->mq_hctx || !req->mq_hctx->driver_data)
+		goto fail;
+
+	if (!ublk_check_ubuf_dir(req, dir))
+		goto fail;
+
+	*off = buf_off;
+	return req;
+fail:
+	ublk_put_req_ref(ubq, req);
+	return ERR_PTR(-EACCES);
+}
+
+static ssize_t ublk_ch_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+	struct ublk_queue *ubq;
+	struct request *req;
+	size_t buf_off;
+	size_t ret;
+
+	req = ublk_check_and_get_req(iocb, to, &buf_off, ITER_DEST);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	ret = ublk_copy_user_pages(req, buf_off, to, ITER_DEST);
+	ubq = req->mq_hctx->driver_data;
+	ublk_put_req_ref(ubq, req);
+
+	return ret;
+}
+
+static ssize_t ublk_ch_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct ublk_queue *ubq;
+	struct request *req;
+	size_t buf_off;
+	size_t ret;
+
+	req = ublk_check_and_get_req(iocb, from, &buf_off, ITER_SOURCE);
+	if (IS_ERR(req))
+		return PTR_ERR(req);
+
+	ret = ublk_copy_user_pages(req, buf_off, from, ITER_SOURCE);
+	ubq = req->mq_hctx->driver_data;
+	ublk_put_req_ref(ubq, req);
+
+	return ret;
+}
+
 static const struct file_operations ublk_ch_fops = {
 	.owner = THIS_MODULE,
 	.open = ublk_ch_open,
 	.release = ublk_ch_release,
 	.llseek = no_llseek,
+	.read_iter = ublk_ch_read_iter,
+	.write_iter = ublk_ch_write_iter,
 	.uring_cmd = ublk_ch_uring_cmd,
 	.mmap = ublk_ch_mmap,
 };
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index f6238ccc7800..d1a6b3dc0327 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -54,7 +54,36 @@
 #define UBLKSRV_IO_BUF_OFFSET	0x80000000
 
 /* tag bit is 12bit, so at most 4096 IOs for each queue */
-#define UBLK_MAX_QUEUE_DEPTH	4096
+#define UBLK_TAG_BITS		12
+#define UBLK_MAX_QUEUE_DEPTH	(1U << UBLK_TAG_BITS)
+
+/* used for locating each io buffer for pread()/pwrite() on char device */
+#define UBLK_BUFS_SIZE_BITS	42
+#define UBLK_BUFS_SIZE_MASK    ((1ULL << UBLK_BUFS_SIZE_BITS) - 1)
+#define UBLK_BUF_SIZE_BITS     (UBLK_BUFS_SIZE_BITS - UBLK_TAG_BITS)
+#define UBLK_BUF_MAX_SIZE      (1ULL << UBLK_BUF_SIZE_BITS)
+
+static inline __u16 ublk_pos_to_hwq(__u64 pos)
+{
+	return pos >> UBLK_BUFS_SIZE_BITS;
+}
+
+static inline __u32 ublk_pos_to_buf_offset(__u64 pos)
+{
+	return (pos & UBLK_BUFS_SIZE_MASK) & (UBLK_BUF_MAX_SIZE - 1);
+}
+
+static inline __u16 ublk_pos_to_tag(__u64 pos)
+{
+	return (pos & UBLK_BUFS_SIZE_MASK) >> UBLK_BUF_SIZE_BITS;
+}
+
+/* offset of single buffer, which has to be < UBLK_BUX_MAX_SIZE */
+static inline __u64 ublk_pos(__u16 q_id, __u16 tag, __u32 offset)
+{
+	return (((__u64)q_id) << UBLK_BUFS_SIZE_BITS) |
+		((((__u64)tag) << UBLK_BUF_SIZE_BITS) + offset);
+}
 
 /*
  * zero copy requires 4k block size, and can remap ublk driver's io
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 16/17] block: ublk_drv: don't check buffer in case of zero copy
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (14 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 15/17] block: ublk_drv: add read()/write() support for ublk char device Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-30 11:36 ` [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting " Ming Lei
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

In case of zero copy, ublk server needn't to pre-allocate IO buffer
and provide it to driver more.

Meantime not set the buffer in case of zero copy any more, and the
userspace can use pread()/pwrite() to read from/write to the io request
buffer, which is easier & simpler from userspace viewpoint.

Signed-off-by: Ming Lei <[email protected]>
---
 drivers/block/ublk_drv.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 0ff70cda4b91..a744d3b5da91 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1410,25 +1410,30 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 		if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)
 			goto out;
 		/* FETCH_RQ has to provide IO buffer if NEED GET DATA is not enabled */
-		if (!ub_cmd->addr && !ublk_need_get_data(ubq))
-			goto out;
+		if (!ublk_support_zc(ubq)) {
+			if (!ub_cmd->addr && !ublk_need_get_data(ubq))
+				goto out;
+			io->addr = ub_cmd->addr;
+		}
 		io->cmd = cmd;
 		io->flags |= UBLK_IO_FLAG_ACTIVE;
-		io->addr = ub_cmd->addr;
-
 		ublk_mark_io_ready(ub, ubq);
 		break;
 	case UBLK_IO_COMMIT_AND_FETCH_REQ:
 		req = blk_mq_tag_to_rq(ub->tag_set.tags[ub_cmd->q_id], tag);
+
+		if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
+			goto out;
 		/*
 		 * COMMIT_AND_FETCH_REQ has to provide IO buffer if NEED GET DATA is
 		 * not enabled or it is Read IO.
 		 */
-		if (!ub_cmd->addr && (!ublk_need_get_data(ubq) || req_op(req) == REQ_OP_READ))
-			goto out;
-		if (!(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
-			goto out;
-		io->addr = ub_cmd->addr;
+		if (!ublk_support_zc(ubq)) {
+			if (!ub_cmd->addr && (!ublk_need_get_data(ubq) ||
+						req_op(req) == REQ_OP_READ))
+				goto out;
+			io->addr = ub_cmd->addr;
+		}
 		io->flags |= UBLK_IO_FLAG_ACTIVE;
 		io->cmd = cmd;
 		ublk_commit_completion(ub, ub_cmd);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (15 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 16/17] block: ublk_drv: don't check buffer in case of zero copy Ming Lei
@ 2023-03-30 11:36 ` Ming Lei
  2023-03-31 19:13   ` Bernd Schubert
  2023-03-31 19:55   ` Bernd Schubert
  2023-04-03  1:11 ` [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (2 subsequent siblings)
  19 siblings, 2 replies; 37+ messages in thread
From: Ming Lei @ 2023-03-30 11:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Ming Lei

Apply io_uring fused command for supporting zero copy:

1) init the fused cmd buffer(io_mapped_buf) in ublk_map_io(), and deinit it
in ublk_unmap_io(), and this buffer is immutable, so it is just fine to
retrieve it from concurrent fused command.

1) add sub-command opcode of UBLK_IO_FUSED_SUBMIT_IO for retrieving this
fused cmd(zero copy) buffer

2) call io_fused_cmd_start_secondary_req() to provide buffer to secondary
request and submit secondary request; meantime setup complete callback via
this API, once secondary request is completed, the complete callback is
called for freeing the buffer and completing the fused command

Also request reference is held during fused command lifetime, and this way
guarantees that request buffer won't be freed until all inflight fused
commands are completed.

userspace(only implement sqe128 fused command):

	https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6

liburing test(only implement normal sqe fused command: two 64byte SQEs)

	https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6

Signed-off-by: Ming Lei <[email protected]>
---
 Documentation/block/ublk.rst  | 126 ++++++++++++++++++++--
 drivers/block/ublk_drv.c      | 192 ++++++++++++++++++++++++++++++++--
 include/uapi/linux/ublk_cmd.h |   6 +-
 3 files changed, 303 insertions(+), 21 deletions(-)

diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index 1713b2890abb..7b7aa24e9729 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -297,18 +297,126 @@ with specified IO tag in the command data:
   ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
   the server buffer (pages) read to the IO request pages.
 
-Future development
-==================
+- ``UBLK_IO_FUSED_SUBMIT_IO``
+
+  Used for implementing zero copy feature.
+
+  It has to been the primary command of io_uring fused command. This command
+  submits the generic secondary IO request with io buffer provided by our primary
+  command, and won't be completed until the secondary request is done.
+
+  The provided buffer is represented as ``io_uring_bvec_buf``, which is
+  actually ublk request buffer's reference, and the reference is shared &
+  read-only, so the generic secondary request can retrieve any part of the buffer
+  by passing buffer offset & length.
 
 Zero copy
----------
+=========
+
+What is zero copy?
+------------------
+
+When application submits IO to ``/dev/ublkb*``, userspace buffer(direct io)
+or page cache buffer(buffered io) or kernel buffer(meta io often) is used
+for submitting data to ublk driver, and all kinds of these buffers are
+represented by bio/bvecs(ublk request buffer) finally. Before supporting
+zero copy, data in these buffers has to be copied to ublk server userspace
+buffer before handling WRITE IO, or after handing READ IO, so that ublk
+server can handle IO for ``/dev/ublkb*`` with the copied data.
+
+The extra copy between ublk request buffer and ublk server userspace buffer
+not only increases CPU utilization(such as pinning pages, copy data), but
+also consumes memory bandwidth, and the cost could be very big when IO size
+is big. It is observed that ublk-null IOPS may be increased to ~5X if the
+extra copy can be avoided.
+
+So zero copy is very important for supporting high performance block device
+in userspace.
+
+Technical requirements
+----------------------
+
+- ublk request buffer use
+
+ublk request buffer is represented by bio/bvec, which is immutable, so do
+not try to change bvec via buffer reference; data can be read from or
+written to the buffer according to buffer direction, but bvec can't be
+changed
+
+- buffer lifetime
+
+Ublk server borrows ublk request buffer for handling ublk IO, ublk request
+buffer reference is used. Reference can't outlive the referent buffer. That
+means all request buffer references have to be released by ublk server
+before ublk driver completes this request, when request buffer ownership
+is transferred to upper layer(FS, application, ...).
+
+Also after ublk request is completed, any page belonging to this ublk
+request can not be written or read any more from ublk server since it is
+one block device from kernel viewpoint.
+
+- buffer direction
+
+For ublk WRITE request, ublk request buffer should only be accessed as data
+source, and the buffer can't be written by ublk server
+
+For ublk READ request, ublk request buffer should only be accessed as data
+destination, and the buffer can't be read by ublk server, otherwise kernel
+data is leaked to ublk server, which can be unprivileged application.
+
+- arbitrary size sub-buffer needs to be retrieved from ublk server
+
+ublk is one generic framework for implementing block device in userspace,
+and typical requirements include logical volume manager(mirror, stripped, ...),
+distributed network storage, compressed target, ...
+
+ublk server needs to retrieve arbitrary size sub-buffer of ublk request, and
+ublk server needs to submit IOs with these sub-buffer(s). That also means
+arbitrary size sub-buffer(s) can be used to submit IO multiple times.
+
+Any sub-buffer is actually one reference of ublk request buffer, which
+ownership can't be transferred to upper layer if any reference is held
+by ublk server.
+
+Why slice isn't good for ublk zero copy
+---------------------------------------
+
+- spliced page from ->splice_read() can't be written
+
+ublk READ request can't be handled because spliced page can't be written to, and
+extending splice for ublk zero copy isn't one good solution [#splice_extend]_
+
+- it is very hard to meet above requirements  wrt. request buffer lifetime
+
+splice/pipe focuses on page reference lifetime, but ublk zero copy pays more
+attention to ublk request buffer lifetime. If is very inefficient to respect
+request buffer lifetime by using all pipe buffer's ->release() which requires
+all pipe buffers and pipe to be kept when ublk server handles IO. That means
+one single dedicated ``pipe_inode_info`` has to be allocated runtime for each
+provided buffer, and the pipe needs to be populated with pages in ublk request
+buffer.
+
+
+io_uring fused command based zero copy
+--------------------------------------
+
+io_uring fused command includes one primary command(uring command) and one
+generic secondary request. The primary command is responsible for submitting
+secondary request with provided buffer from ublk request, and primary command
+won't be completed until the secondary request is completed.
+
+Typical ublk IO handling includes network and FS IO, so it is usual enough
+for io_uring net & fs to support IO with provided buffer from primary command.
 
-Zero copy is a generic requirement for nbd, fuse or similar drivers. A
-problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
-can't be remapped any more in kernel with existing mm interfaces. This can
-occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported that
-big requests (IO size >= 256 KB) may benefit a lot from zero copy.
+Once primary command is submitted successfully, ublk driver guarantees that
+the ublk request buffer won't be gone away since secondary request actually
+grabs the buffer's reference. This way also guarantees that multiple
+concurrent fused commands associated with same request buffer works fine,
+as the provided buffer reference is shared & read-only.
 
+Also buffer usage direction flag is passed to primary command from userspace,
+so ublk driver can validate if it is legal to use buffer with requested
+direction.
 
 References
 ==========
@@ -323,4 +431,4 @@ References
 
 .. [#stefan] https://lore.kernel.org/linux-block/[email protected]/
 
-.. [#xiaoguang] https://lore.kernel.org/linux-block/[email protected]/
+.. [#splice_extend] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@mail.gmail.com/
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index a744d3b5da91..64b0408873f6 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -74,10 +74,15 @@ struct ublk_rq_data {
 	 *   successfully
 	 */
 	struct kref ref;
+	bool allocated_bvec;
+	struct io_uring_bvec_buf buf[0];
 };
 
 struct ublk_uring_cmd_pdu {
-	struct ublk_queue *ubq;
+	union {
+		struct ublk_queue *ubq;
+		struct request *req;
+	};
 };
 
 /*
@@ -565,6 +570,69 @@ static size_t ublk_copy_user_pages(const struct request *req,
 	return done;
 }
 
+/*
+ * The built command buffer is immutable, so it is fine to feed it to
+ * concurrent io_uring fused commands
+ */
+static int ublk_init_zero_copy_buffer(struct request *rq)
+{
+	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+	struct io_uring_bvec_buf *imu = data->buf;
+	struct req_iterator rq_iter;
+	unsigned int nr_bvecs = 0;
+	struct bio_vec *bvec;
+	unsigned int offset;
+	struct bio_vec bv;
+
+	if (!ublk_rq_has_data(rq))
+		goto exit;
+
+	rq_for_each_bvec(bv, rq, rq_iter)
+		nr_bvecs++;
+
+	if (!nr_bvecs)
+		goto exit;
+
+	if (rq->bio != rq->biotail) {
+		int idx = 0;
+
+		bvec = kvmalloc_array(nr_bvecs, sizeof(struct bio_vec),
+				GFP_NOIO);
+		if (!bvec)
+			return -ENOMEM;
+
+		offset = 0;
+		rq_for_each_bvec(bv, rq, rq_iter)
+			bvec[idx++] = bv;
+		data->allocated_bvec = true;
+	} else {
+		struct bio *bio = rq->bio;
+
+		offset = bio->bi_iter.bi_bvec_done;
+		bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+	}
+	imu->bvec = bvec;
+	imu->nr_bvecs = nr_bvecs;
+	imu->offset = offset;
+	imu->len = blk_rq_bytes(rq);
+
+	return 0;
+exit:
+	imu->bvec = NULL;
+	return 0;
+}
+
+static void ublk_deinit_zero_copy_buffer(struct request *rq)
+{
+	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+	struct io_uring_bvec_buf *imu = data->buf;
+
+	if (data->allocated_bvec) {
+		kvfree(imu->bvec);
+		data->allocated_bvec = false;
+	}
+}
+
 static inline bool ublk_need_map_req(const struct request *req)
 {
 	return ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE;
@@ -575,11 +643,23 @@ static inline bool ublk_need_unmap_req(const struct request *req)
 	return ublk_rq_has_data(req) && req_op(req) == REQ_OP_READ;
 }
 
-static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
+static int ublk_map_io(const struct ublk_queue *ubq, struct request *req,
 		struct ublk_io *io)
 {
 	const unsigned int rq_bytes = blk_rq_bytes(req);
 
+	if (ublk_support_zc(ubq)) {
+		int ret = ublk_init_zero_copy_buffer(req);
+
+		/*
+		 * The only failure is -ENOMEM for allocating fused cmd
+		 * buffer, return zero so that we can requeue this req.
+		 */
+		if (unlikely(ret))
+			return 0;
+		return rq_bytes;
+	}
+
 	/*
 	 * no zero copy, we delay copy WRITE request data into ublksrv
 	 * context and the big benefit is that pinning pages in current
@@ -599,11 +679,17 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
 }
 
 static int ublk_unmap_io(const struct ublk_queue *ubq,
-		const struct request *req,
+		struct request *req,
 		struct ublk_io *io)
 {
 	const unsigned int rq_bytes = blk_rq_bytes(req);
 
+	if (ublk_support_zc(ubq)) {
+		ublk_deinit_zero_copy_buffer(req);
+
+		return rq_bytes;
+	}
+
 	if (ublk_need_unmap_req(req)) {
 		struct iov_iter iter;
 		struct iovec iov;
@@ -687,6 +773,12 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
 	return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu;
 }
 
+static inline struct ublk_uring_cmd_pdu *ublk_get_uring_fused_cmd_pdu(
+		struct io_uring_cmd *ioucmd)
+{
+	return (struct ublk_uring_cmd_pdu *)&ioucmd->fused.pdu;
+}
+
 static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
 {
 	return ubq->ubq_daemon->flags & PF_EXITING;
@@ -742,6 +834,7 @@ static inline void __ublk_complete_rq(struct request *req)
 
 	return;
 exit:
+	ublk_deinit_zero_copy_buffer(req);
 	blk_mq_end_request(req, res);
 }
 
@@ -1352,6 +1445,68 @@ static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
 	return NULL;
 }
 
+static void ublk_fused_cmd_done_cb(struct io_uring_cmd *cmd,
+		unsigned issue_flags)
+{
+	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_fused_cmd_pdu(cmd);
+	struct request *req = pdu->req;
+	struct ublk_queue *ubq = req->mq_hctx->driver_data;
+
+	ublk_put_req_ref(ubq, req);
+	io_uring_cmd_done(cmd, 0, 0, issue_flags);
+}
+
+static inline bool ublk_check_fused_buf_dir(const struct request *req,
+		unsigned int flags)
+{
+	flags &= IO_URING_F_FUSED_BUF;
+
+	if (req_op(req) == REQ_OP_READ && flags == IO_URING_F_FUSED_BUF_DEST)
+		return true;
+
+	if (req_op(req) == REQ_OP_WRITE && flags == IO_URING_F_FUSED_BUF_SRC)
+		return true;
+
+	return false;
+}
+
+static int ublk_handle_fused_cmd(struct io_uring_cmd *cmd,
+		struct ublk_queue *ubq, int tag, unsigned int issue_flags)
+{
+	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_fused_cmd_pdu(cmd);
+	struct ublk_device *ub = cmd->file->private_data;
+	struct ublk_rq_data *data;
+	struct request *req;
+
+	if (!ub)
+		return -EPERM;
+
+	if (!(issue_flags & IO_URING_F_FUSED_BUF))
+		goto exit;
+
+	req = __ublk_check_and_get_req(ub, ubq, tag, 0);
+	if (!req)
+		goto exit;
+
+	pr_devel("%s: qid %d tag %u request bytes %u, issue flags %x\n",
+			__func__, tag, ubq->q_id, blk_rq_bytes(req),
+			issue_flags);
+
+	if (!ublk_check_fused_buf_dir(req, issue_flags))
+		goto exit_put_ref;
+
+	pdu->req = req;
+	data = blk_mq_rq_to_pdu(req);
+	io_fused_provide_buf_and_start(cmd, issue_flags, data->buf,
+			ublk_fused_cmd_done_cb);
+	return -EIOCBQUEUED;
+
+exit_put_ref:
+	ublk_put_req_ref(ubq, req);
+exit:
+	return -EINVAL;
+}
+
 static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 {
 	struct ublksrv_io_cmd *ub_cmd = (struct ublksrv_io_cmd *)cmd->cmd;
@@ -1367,6 +1522,10 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 			__func__, cmd->cmd_op, ub_cmd->q_id, tag,
 			ub_cmd->result);
 
+	if ((issue_flags & IO_URING_F_FUSED_BUF) &&
+			cmd_op != UBLK_IO_FUSED_SUBMIT_IO)
+		return -EOPNOTSUPP;
+
 	if (ub_cmd->q_id >= ub->dev_info.nr_hw_queues)
 		goto out;
 
@@ -1374,7 +1533,12 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	if (!ubq || ub_cmd->q_id != ubq->q_id)
 		goto out;
 
-	if (ubq->ubq_daemon && ubq->ubq_daemon != current)
+	/*
+	 * The fused command reads the io buffer data structure only, so it
+	 * is fine to be issued from other context.
+	 */
+	if ((ubq->ubq_daemon && ubq->ubq_daemon != current) &&
+			(cmd_op != UBLK_IO_FUSED_SUBMIT_IO))
 		goto out;
 
 	if (tag >= ubq->q_depth)
@@ -1397,6 +1561,9 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 		goto out;
 
 	switch (cmd_op) {
+	case UBLK_IO_FUSED_SUBMIT_IO:
+		return ublk_handle_fused_cmd(cmd, ubq, tag, issue_flags);
+
 	case UBLK_IO_FETCH_REQ:
 		/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
 		if (ublk_queue_ready(ubq)) {
@@ -1726,11 +1893,14 @@ static void ublk_align_max_io_size(struct ublk_device *ub)
 
 static int ublk_add_tag_set(struct ublk_device *ub)
 {
+	int zc = !!(ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY);
+	struct ublk_rq_data *data;
+
 	ub->tag_set.ops = &ublk_mq_ops;
 	ub->tag_set.nr_hw_queues = ub->dev_info.nr_hw_queues;
 	ub->tag_set.queue_depth = ub->dev_info.queue_depth;
 	ub->tag_set.numa_node = NUMA_NO_NODE;
-	ub->tag_set.cmd_size = sizeof(struct ublk_rq_data);
+	ub->tag_set.cmd_size = struct_size(data, buf, zc);
 	ub->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	ub->tag_set.driver_data = ub;
 	return blk_mq_alloc_tag_set(&ub->tag_set);
@@ -1946,12 +2116,18 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 	 */
 	ub->dev_info.flags &= UBLK_F_ALL;
 
+	/*
+	 * NEED_GET_DATA doesn't make sense any more in case that
+	 * ZERO_COPY is requested. Another reason is that userspace
+	 * can read/write io request buffer by pread()/pwrite() with
+	 * each io buffer's position.
+	 */
+	if (ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY)
+		ub->dev_info.flags &= ~UBLK_F_NEED_GET_DATA;
+
 	if (!IS_BUILTIN(CONFIG_BLK_DEV_UBLK))
 		ub->dev_info.flags |= UBLK_F_URING_CMD_COMP_IN_TASK;
 
-	/* We are not ready to support zero copy */
-	ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
-
 	ub->dev_info.nr_hw_queues = min_t(unsigned int,
 			ub->dev_info.nr_hw_queues, nr_cpu_ids);
 	ublk_align_max_io_size(ub);
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index d1a6b3dc0327..c4f3465399cf 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -44,6 +44,7 @@
 #define	UBLK_IO_FETCH_REQ		0x20
 #define	UBLK_IO_COMMIT_AND_FETCH_REQ	0x21
 #define	UBLK_IO_NEED_GET_DATA	0x22
+#define	UBLK_IO_FUSED_SUBMIT_IO	0x23
 
 /* only ABORT means that no re-fetch */
 #define UBLK_IO_RES_OK			0
@@ -85,10 +86,7 @@ static inline __u64 ublk_pos(__u16 q_id, __u16 tag, __u32 offset)
 		((((__u64)tag) << UBLK_BUF_SIZE_BITS) + offset);
 }
 
-/*
- * zero copy requires 4k block size, and can remap ublk driver's io
- * request into ublksrv's vm space
- */
+/* io_uring fused command based zero copy */
 #define UBLK_F_SUPPORT_ZERO_COPY	(1ULL << 0)
 
 /*
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 12/17] block: ublk_drv: cleanup ublk_copy_user_pages
  2023-03-30 11:36 ` [PATCH V6 12/17] block: ublk_drv: cleanup ublk_copy_user_pages Ming Lei
@ 2023-03-31 16:22   ` Bernd Schubert
  0 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2023-03-31 16:22 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, [email protected],
	[email protected]
  Cc: [email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Pavel Begunkov, Stefan Hajnoczi, Dan Williams

On 3/30/23 13:36, Ming Lei wrote:
> Clean up ublk_copy_user_pages() by using iov iter, and code
> gets simplified a lot and becomes much more readable than before.
> 
> Signed-off-by: Ming Lei <[email protected]>
> ---
>   drivers/block/ublk_drv.c | 112 +++++++++++++++++----------------------
>   1 file changed, 49 insertions(+), 63 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index fdccbf5fdaa1..cca0e95a89d8 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -419,49 +419,39 @@ static const struct block_device_operations ub_fops = {
>   
>   #define UBLK_MAX_PIN_PAGES	32
>   
> -struct ublk_map_data {
> -	const struct request *rq;
> -	unsigned long	ubuf;
> -	unsigned int	len;
> -};
> -
>   struct ublk_io_iter {
>   	struct page *pages[UBLK_MAX_PIN_PAGES];
> -	unsigned pg_off;	/* offset in the 1st page in pages */
> -	int nr_pages;		/* how many page pointers in pages */
>   	struct bio *bio;
>   	struct bvec_iter iter;
>   };
>   
> -static inline unsigned ublk_copy_io_pages(struct ublk_io_iter *data,
> -		unsigned max_bytes, bool to_vm)
> +/* return how many pages are copied */
> +static void ublk_copy_io_pages(struct ublk_io_iter *data,
> +		size_t total, size_t pg_off, int dir)
>   {
> -	const unsigned total = min_t(unsigned, max_bytes,
> -			PAGE_SIZE - data->pg_off +
> -			((data->nr_pages - 1) << PAGE_SHIFT));
>   	unsigned done = 0;
>   	unsigned pg_idx = 0;
>   
>   	while (done < total) {
>   		struct bio_vec bv = bio_iter_iovec(data->bio, data->iter);
> -		const unsigned int bytes = min3(bv.bv_len, total - done,
> -				(unsigned)(PAGE_SIZE - data->pg_off));
> +		unsigned int bytes = min3(bv.bv_len, (unsigned)total - done,
> +				(unsigned)(PAGE_SIZE - pg_off));
>   		void *bv_buf = bvec_kmap_local(&bv);
>   		void *pg_buf = kmap_local_page(data->pages[pg_idx]);
>   
> -		if (to_vm)
> -			memcpy(pg_buf + data->pg_off, bv_buf, bytes);
> +		if (dir == ITER_DEST)
> +			memcpy(pg_buf + pg_off, bv_buf, bytes);
>   		else
> -			memcpy(bv_buf, pg_buf + data->pg_off, bytes);
> +			memcpy(bv_buf, pg_buf + pg_off, bytes);
>   
>   		kunmap_local(pg_buf);
>   		kunmap_local(bv_buf);
>   
>   		/* advance page array */
> -		data->pg_off += bytes;
> -		if (data->pg_off == PAGE_SIZE) {
> +		pg_off += bytes;
> +		if (pg_off == PAGE_SIZE) {
>   			pg_idx += 1;
> -			data->pg_off = 0;
> +			pg_off = 0;
>   		}
>   
>   		done += bytes;
> @@ -475,41 +465,40 @@ static inline unsigned ublk_copy_io_pages(struct ublk_io_iter *data,
>   			data->iter = data->bio->bi_iter;
>   		}
>   	}
> -
> -	return done;
>   }
>   
> -static int ublk_copy_user_pages(struct ublk_map_data *data, bool to_vm)
> +/*
> + * Copy data between request pages and io_iter, and 'offset'
> + * is the start point of linear offset of request.
> + */
> +static size_t ublk_copy_user_pages(const struct request *req,
> +		struct iov_iter *uiter, int dir)
>   {
> -	const unsigned int gup_flags = to_vm ? FOLL_WRITE : 0;
> -	const unsigned long start_vm = data->ubuf;
> -	unsigned int done = 0;
>   	struct ublk_io_iter iter = {
> -		.pg_off	= start_vm & (PAGE_SIZE - 1),
> -		.bio	= data->rq->bio,
> -		.iter	= data->rq->bio->bi_iter,
> +		.bio	= req->bio,
> +		.iter	= req->bio->bi_iter,
>   	};
> -	const unsigned int nr_pages = round_up(data->len +
> -			(start_vm & (PAGE_SIZE - 1)), PAGE_SIZE) >> PAGE_SHIFT;
> -
> -	while (done < nr_pages) {
> -		const unsigned to_pin = min_t(unsigned, UBLK_MAX_PIN_PAGES,
> -				nr_pages - done);
> -		unsigned i, len;
> -
> -		iter.nr_pages = get_user_pages_fast(start_vm +
> -				(done << PAGE_SHIFT), to_pin, gup_flags,
> -				iter.pages);
> -		if (iter.nr_pages <= 0)
> -			return done == 0 ? iter.nr_pages : done;
> -		len = ublk_copy_io_pages(&iter, data->len, to_vm);
> -		for (i = 0; i < iter.nr_pages; i++) {
> -			if (to_vm)
> +	size_t done = 0;
> +
> +	while (iov_iter_count(uiter) && iter.bio) {
> +		unsigned nr_pages;
> +		size_t len, off;
> +		int i;
> +
> +		len = iov_iter_get_pages2(uiter, iter.pages,
> +				iov_iter_count(uiter),
> +				UBLK_MAX_PIN_PAGES, &off);
> +		if (len <= 0)
> +			return done;
> +
> +		ublk_copy_io_pages(&iter, len, off, dir);
> +		nr_pages = DIV_ROUND_UP(len + off, PAGE_SIZE);
> +		for (i = 0; i < nr_pages; i++) {
> +			if (dir == ITER_DEST)
>   				set_page_dirty(iter.pages[i]);
>   			put_page(iter.pages[i]);
>   		}
> -		data->len -= len;
> -		done += iter.nr_pages;
> +		done += len;
>   	}
>   
>   	return done;
> @@ -536,15 +525,14 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
>   	 * context is pretty fast, see ublk_pin_user_pages
>   	 */
>   	if (ublk_need_map_req(req)) {
> -		struct ublk_map_data data = {
> -			.rq	=	req,
> -			.ubuf	=	io->addr,
> -			.len	=	rq_bytes,
> -		};
> +		struct iov_iter iter;
> +		struct iovec iov;
> +		const int dir = ITER_DEST;

Maybe a comment here that this means "copy to daemon"?

>   
> -		ublk_copy_user_pages(&data, true);
> +		import_single_range(dir, u64_to_user_ptr(io->addr), rq_bytes,
> +				&iov, &iter);
>   
> -		return rq_bytes - data.len;
> +		return ublk_copy_user_pages(req, &iter, dir);
>   	}
>   	return rq_bytes;
>   }
> @@ -556,17 +544,15 @@ static int ublk_unmap_io(const struct ublk_queue *ubq,
>   	const unsigned int rq_bytes = blk_rq_bytes(req);
>   
>   	if (ublk_need_unmap_req(req)) {
> -		struct ublk_map_data data = {
> -			.rq	=	req,
> -			.ubuf	=	io->addr,
> -			.len	=	io->res,
> -		};
> +		struct iov_iter iter;
> +		struct iovec iov;
> +		const int dir = ITER_SOURCE;

And here "from daemon"?
>   
>   		WARN_ON_ONCE(io->res > rq_bytes);
>   
> -		ublk_copy_user_pages(&data, false);
> -
> -		return io->res - data.len;
> +		import_single_range(dir, u64_to_user_ptr(io->addr), io->res,
> +				&iov, &iter);
> +		return ublk_copy_user_pages(req, &iter, dir);
>   	}
>   	return rq_bytes;
>   }


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy
  2023-03-30 11:36 ` [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting " Ming Lei
@ 2023-03-31 19:13   ` Bernd Schubert
  2023-04-01 13:19     ` Ming Lei
  2023-03-31 19:55   ` Bernd Schubert
  1 sibling, 1 reply; 37+ messages in thread
From: Bernd Schubert @ 2023-03-31 19:13 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, [email protected],
	[email protected]
  Cc: [email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Pavel Begunkov, Stefan Hajnoczi, Dan Williams

On 3/30/23 13:36, Ming Lei wrote:
> Apply io_uring fused command for supporting zero copy:
> 
> 1) init the fused cmd buffer(io_mapped_buf) in ublk_map_io(), and deinit it
> in ublk_unmap_io(), and this buffer is immutable, so it is just fine to
> retrieve it from concurrent fused command.
> 
> 1) add sub-command opcode of UBLK_IO_FUSED_SUBMIT_IO for retrieving this
> fused cmd(zero copy) buffer
> 
> 2) call io_fused_cmd_start_secondary_req() to provide buffer to secondary
> request and submit secondary request; meantime setup complete callback via
> this API, once secondary request is completed, the complete callback is
> called for freeing the buffer and completing the fused command
> 
> Also request reference is held during fused command lifetime, and this way
> guarantees that request buffer won't be freed until all inflight fused
> commands are completed.
> 
> userspace(only implement sqe128 fused command):
> 
> 	https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
> 
> liburing test(only implement normal sqe fused command: two 64byte SQEs)
> 
> 	https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
> 
> Signed-off-by: Ming Lei <[email protected]>
> ---
>   Documentation/block/ublk.rst  | 126 ++++++++++++++++++++--
>   drivers/block/ublk_drv.c      | 192 ++++++++++++++++++++++++++++++++--
>   include/uapi/linux/ublk_cmd.h |   6 +-
>   3 files changed, 303 insertions(+), 21 deletions(-)
> 
> diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> index 1713b2890abb..7b7aa24e9729 100644
> --- a/Documentation/block/ublk.rst
> +++ b/Documentation/block/ublk.rst
> @@ -297,18 +297,126 @@ with specified IO tag in the command data:
>     ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
>     the server buffer (pages) read to the IO request pages.
>   
> -Future development
> -==================
> +- ``UBLK_IO_FUSED_SUBMIT_IO``
> +
> +  Used for implementing zero copy feature.
> +
> +  It has to been the primary command of io_uring fused command. This command

s/been/be/ ?

> +  submits the generic secondary IO request with io buffer provided by our primary
> +  command, and won't be completed until the secondary request is done.
> +
> +  The provided buffer is represented as ``io_uring_bvec_buf``, which is
> +  actually ublk request buffer's reference, and the reference is shared &
> +  read-only, so the generic secondary request can retrieve any part of the buffer
> +  by passing buffer offset & length.
>   
>   Zero copy
> ----------
> +=========
> +
> +What is zero copy?
> +------------------
> +
> +When application submits IO to ``/dev/ublkb*``, userspace buffer(direct io)
> +or page cache buffer(buffered io) or kernel buffer(meta io often) is used
> +for submitting data to ublk driver, and all kinds of these buffers are
> +represented by bio/bvecs(ublk request buffer) finally. Before supporting
> +zero copy, data in these buffers has to be copied to ublk server userspace
> +buffer before handling WRITE IO, or after handing READ IO, so that ublk
> +server can handle IO for ``/dev/ublkb*`` with the copied data.
> +
> +The extra copy between ublk request buffer and ublk server userspace buffer
> +not only increases CPU utilization(such as pinning pages, copy data), but
> +also consumes memory bandwidth, and the cost could be very big when IO size
> +is big. It is observed that ublk-null IOPS may be increased to ~5X if the
> +extra copy can be avoided.
> +
> +So zero copy is very important for supporting high performance block device
> +in userspace.
> +
> +Technical requirements
> +----------------------
> +
> +- ublk request buffer use
> +
> +ublk request buffer is represented by bio/bvec, which is immutable, so do
> +not try to change bvec via buffer reference; data can be read from or
> +written to the buffer according to buffer direction, but bvec can't be
> +changed
> +
> +- buffer lifetime
> +
> +Ublk server borrows ublk request buffer for handling ublk IO, ublk request
> +buffer reference is used. Reference can't outlive the referent buffer. That
> +means all request buffer references have to be released by ublk server
> +before ublk driver completes this request, when request buffer ownership
> +is transferred to upper layer(FS, application, ...).
> +
> +Also after ublk request is completed, any page belonging to this ublk
> +request can not be written or read any more from ublk server since it is
> +one block device from kernel viewpoint.
> +
> +- buffer direction
> +
> +For ublk WRITE request, ublk request buffer should only be accessed as data
> +source, and the buffer can't be written by ublk server
> +
> +For ublk READ request, ublk request buffer should only be accessed as data
> +destination, and the buffer can't be read by ublk server, otherwise kernel
> +data is leaked to ublk server, which can be unprivileged application.
> +
> +- arbitrary size sub-buffer needs to be retrieved from ublk server
> +
> +ublk is one generic framework for implementing block device in userspace,
> +and typical requirements include logical volume manager(mirror, stripped, ...),
> +distributed network storage, compressed target, ...
> +
> +ublk server needs to retrieve arbitrary size sub-buffer of ublk request, and
> +ublk server needs to submit IOs with these sub-buffer(s). That also means
> +arbitrary size sub-buffer(s) can be used to submit IO multiple times.
> +
> +Any sub-buffer is actually one reference of ublk request buffer, which
> +ownership can't be transferred to upper layer if any reference is held
> +by ublk server.
> +
> +Why slice isn't good for ublk zero copy

slice

> +---------------------------------------
> +
> +- spliced page from ->splice_read() can't be written
> +
> +ublk READ request can't be handled because spliced page can't be written to, and
> +extending splice for ublk zero copy isn't one good solution [#splice_extend]_
> +
> +- it is very hard to meet above requirements  wrt. request buffer lifetime
> +
> +splice/pipe focuses on page reference lifetime, but ublk zero copy pays more
> +attention to ublk request buffer lifetime. If is very inefficient to respect
> +request buffer lifetime by using all pipe buffer's ->release() which requires
> +all pipe buffers and pipe to be kept when ublk server handles IO. That means
> +one single dedicated ``pipe_inode_info`` has to be allocated runtime for each
> +provided buffer, and the pipe needs to be populated with pages in ublk request
> +buffer.
> +
> +
> +io_uring fused command based zero copy
> +--------------------------------------
> +
> +io_uring fused command includes one primary command(uring command) and one
> +generic secondary request. The primary command is responsible for submitting
> +secondary request with provided buffer from ublk request, and primary command
> +won't be completed until the secondary request is completed.
> +
> +Typical ublk IO handling includes network and FS IO, so it is usual enough
> +for io_uring net & fs to support IO with provided buffer from primary command.
>   
> -Zero copy is a generic requirement for nbd, fuse or similar drivers. A

Maybe replace 'requirement' with "very useful"? It works without zc, 
just with limited performance.

> -problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
> -can't be remapped any more in kernel with existing mm interfaces. This can
> -occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported that
> -big requests (IO size >= 256 KB) may benefit a lot from zero copy.
> +Once primary command is submitted successfully, ublk driver guarantees that
> +the ublk request buffer won't be gone away since secondary request actually
> +grabs the buffer's reference. This way also guarantees that multiple
> +concurrent fused commands associated with same request buffer works fine,
> +as the provided buffer reference is shared & read-only.
>   
> +Also buffer usage direction flag is passed to primary command from userspace,
> +so ublk driver can validate if it is legal to use buffer with requested
> +direction.
>   
>   References
>   ==========
> @@ -323,4 +431,4 @@ References
>   
>   .. [#stefan] https://lore.kernel.org/linux-block/[email protected]/
>   
> -.. [#xiaoguang] https://lore.kernel.org/linux-block/[email protected]/
> +.. [#splice_extend] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@mail.gmail.com/


Any chance you could add a few code path into this document? For example 
as in Documentation/filesystems/fuse.rst,
section "Kernel - userspace interface"



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy
  2023-03-30 11:36 ` [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting " Ming Lei
  2023-03-31 19:13   ` Bernd Schubert
@ 2023-03-31 19:55   ` Bernd Schubert
  2023-04-01 13:22     ` Ming Lei
  2023-04-03  9:25     ` Bernd Schubert
  1 sibling, 2 replies; 37+ messages in thread
From: Bernd Schubert @ 2023-03-31 19:55 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, [email protected],
	[email protected]
  Cc: [email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Pavel Begunkov, Stefan Hajnoczi, Dan Williams

On 3/30/23 13:36, Ming Lei wrote:
> Apply io_uring fused command for supporting zero copy:
> 
> 1) init the fused cmd buffer(io_mapped_buf) in ublk_map_io(), and deinit it
> in ublk_unmap_io(), and this buffer is immutable, so it is just fine to
> retrieve it from concurrent fused command.
> 
> 1) add sub-command opcode of UBLK_IO_FUSED_SUBMIT_IO for retrieving this
> fused cmd(zero copy) buffer
> 
> 2) call io_fused_cmd_start_secondary_req() to provide buffer to secondary
> request and submit secondary request; meantime setup complete callback via
> this API, once secondary request is completed, the complete callback is
> called for freeing the buffer and completing the fused command
> 
> Also request reference is held during fused command lifetime, and this way
> guarantees that request buffer won't be freed until all inflight fused
> commands are completed.
> 
> userspace(only implement sqe128 fused command):
> 
> 	https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
> 
> liburing test(only implement normal sqe fused command: two 64byte SQEs)
> 
> 	https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
> 
> Signed-off-by: Ming Lei <[email protected]>
> ---
>   Documentation/block/ublk.rst  | 126 ++++++++++++++++++++--
>   drivers/block/ublk_drv.c      | 192 ++++++++++++++++++++++++++++++++--
>   include/uapi/linux/ublk_cmd.h |   6 +-
>   3 files changed, 303 insertions(+), 21 deletions(-)
> 
> diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> index 1713b2890abb..7b7aa24e9729 100644
> --- a/Documentation/block/ublk.rst
> +++ b/Documentation/block/ublk.rst
> @@ -297,18 +297,126 @@ with specified IO tag in the command data:
>     ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
>     the server buffer (pages) read to the IO request pages.
>   
> -Future development
> -==================
> +- ``UBLK_IO_FUSED_SUBMIT_IO``
> +
> +  Used for implementing zero copy feature.
> +
> +  It has to been the primary command of io_uring fused command. This command
> +  submits the generic secondary IO request with io buffer provided by our primary
> +  command, and won't be completed until the secondary request is done.
> +
> +  The provided buffer is represented as ``io_uring_bvec_buf``, which is
> +  actually ublk request buffer's reference, and the reference is shared &
> +  read-only, so the generic secondary request can retrieve any part of the buffer
> +  by passing buffer offset & length.
>   
>   Zero copy
> ----------
> +=========
> +
> +What is zero copy?
> +------------------
> +
> +When application submits IO to ``/dev/ublkb*``, userspace buffer(direct io)
> +or page cache buffer(buffered io) or kernel buffer(meta io often) is used
> +for submitting data to ublk driver, and all kinds of these buffers are
> +represented by bio/bvecs(ublk request buffer) finally. Before supporting
> +zero copy, data in these buffers has to be copied to ublk server userspace
> +buffer before handling WRITE IO, or after handing READ IO, so that ublk
> +server can handle IO for ``/dev/ublkb*`` with the copied data.
> +
> +The extra copy between ublk request buffer and ublk server userspace buffer
> +not only increases CPU utilization(such as pinning pages, copy data), but
> +also consumes memory bandwidth, and the cost could be very big when IO size
> +is big. It is observed that ublk-null IOPS may be increased to ~5X if the
> +extra copy can be avoided.
> +
> +So zero copy is very important for supporting high performance block device
> +in userspace.
> +
> +Technical requirements
> +----------------------
> +
> +- ublk request buffer use
> +
> +ublk request buffer is represented by bio/bvec, which is immutable, so do
> +not try to change bvec via buffer reference; data can be read from or
> +written to the buffer according to buffer direction, but bvec can't be
> +changed
> +
> +- buffer lifetime
> +
> +Ublk server borrows ublk request buffer for handling ublk IO, ublk request
> +buffer reference is used. Reference can't outlive the referent buffer. That
> +means all request buffer references have to be released by ublk server
> +before ublk driver completes this request, when request buffer ownership
> +is transferred to upper layer(FS, application, ...).
> +
> +Also after ublk request is completed, any page belonging to this ublk
> +request can not be written or read any more from ublk server since it is
> +one block device from kernel viewpoint.
> +
> +- buffer direction
> +
> +For ublk WRITE request, ublk request buffer should only be accessed as data
> +source, and the buffer can't be written by ublk server
> +
> +For ublk READ request, ublk request buffer should only be accessed as data
> +destination, and the buffer can't be read by ublk server, otherwise kernel
> +data is leaked to ublk server, which can be unprivileged application.
> +
> +- arbitrary size sub-buffer needs to be retrieved from ublk server
> +
> +ublk is one generic framework for implementing block device in userspace,
> +and typical requirements include logical volume manager(mirror, stripped, ...),
> +distributed network storage, compressed target, ...
> +
> +ublk server needs to retrieve arbitrary size sub-buffer of ublk request, and
> +ublk server needs to submit IOs with these sub-buffer(s). That also means
> +arbitrary size sub-buffer(s) can be used to submit IO multiple times.
> +
> +Any sub-buffer is actually one reference of ublk request buffer, which
> +ownership can't be transferred to upper layer if any reference is held
> +by ublk server.
> +
> +Why slice isn't good for ublk zero copy
> +---------------------------------------
> +
> +- spliced page from ->splice_read() can't be written
> +
> +ublk READ request can't be handled because spliced page can't be written to, and
> +extending splice for ublk zero copy isn't one good solution [#splice_extend]_
> +
> +- it is very hard to meet above requirements  wrt. request buffer lifetime
> +
> +splice/pipe focuses on page reference lifetime, but ublk zero copy pays more
> +attention to ublk request buffer lifetime. If is very inefficient to respect
> +request buffer lifetime by using all pipe buffer's ->release() which requires
> +all pipe buffers and pipe to be kept when ublk server handles IO. That means
> +one single dedicated ``pipe_inode_info`` has to be allocated runtime for each
> +provided buffer, and the pipe needs to be populated with pages in ublk request
> +buffer.
> +
> +
> +io_uring fused command based zero copy
> +--------------------------------------
> +
> +io_uring fused command includes one primary command(uring command) and one
> +generic secondary request. The primary command is responsible for submitting
> +secondary request with provided buffer from ublk request, and primary command
> +won't be completed until the secondary request is completed.
> +
> +Typical ublk IO handling includes network and FS IO, so it is usual enough
> +for io_uring net & fs to support IO with provided buffer from primary command.
>   
> -Zero copy is a generic requirement for nbd, fuse or similar drivers. A
> -problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
> -can't be remapped any more in kernel with existing mm interfaces. This can
> -occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported that
> -big requests (IO size >= 256 KB) may benefit a lot from zero copy.
> +Once primary command is submitted successfully, ublk driver guarantees that
> +the ublk request buffer won't be gone away since secondary request actually
> +grabs the buffer's reference. This way also guarantees that multiple
> +concurrent fused commands associated with same request buffer works fine,
> +as the provided buffer reference is shared & read-only.
>   
> +Also buffer usage direction flag is passed to primary command from userspace,
> +so ublk driver can validate if it is legal to use buffer with requested
> +direction.
>   
>   References
>   ==========
> @@ -323,4 +431,4 @@ References
>   
>   .. [#stefan] https://lore.kernel.org/linux-block/[email protected]/
>   
> -.. [#xiaoguang] https://lore.kernel.org/linux-block/[email protected]/
> +.. [#splice_extend] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@mail.gmail.com/
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index a744d3b5da91..64b0408873f6 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -74,10 +74,15 @@ struct ublk_rq_data {
>   	 *   successfully
>   	 */
>   	struct kref ref;
> +	bool allocated_bvec;
> +	struct io_uring_bvec_buf buf[0];
>   };
>   
>   struct ublk_uring_cmd_pdu {
> -	struct ublk_queue *ubq;
> +	union {
> +		struct ublk_queue *ubq;
> +		struct request *req;
> +	};
>   };
>   
>   /*
> @@ -565,6 +570,69 @@ static size_t ublk_copy_user_pages(const struct request *req,
>   	return done;
>   }
>   
> +/*
> + * The built command buffer is immutable, so it is fine to feed it to
> + * concurrent io_uring fused commands
> + */
> +static int ublk_init_zero_copy_buffer(struct request *rq)
> +{
> +	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> +	struct io_uring_bvec_buf *imu = data->buf;
> +	struct req_iterator rq_iter;
> +	unsigned int nr_bvecs = 0;
> +	struct bio_vec *bvec;
> +	unsigned int offset;
> +	struct bio_vec bv;
> +
> +	if (!ublk_rq_has_data(rq))
> +		goto exit;
> +
> +	rq_for_each_bvec(bv, rq, rq_iter)
> +		nr_bvecs++;
> +
> +	if (!nr_bvecs)
> +		goto exit;
> +
> +	if (rq->bio != rq->biotail) {
> +		int idx = 0;
> +
> +		bvec = kvmalloc_array(nr_bvecs, sizeof(struct bio_vec),
> +				GFP_NOIO);
> +		if (!bvec)
> +			return -ENOMEM;
> +
> +		offset = 0;
> +		rq_for_each_bvec(bv, rq, rq_iter)
> +			bvec[idx++] = bv;
> +		data->allocated_bvec = true;
> +	} else {
> +		struct bio *bio = rq->bio;
> +
> +		offset = bio->bi_iter.bi_bvec_done;
> +		bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
> +	}
> +	imu->bvec = bvec;
> +	imu->nr_bvecs = nr_bvecs;
> +	imu->offset = offset;
> +	imu->len = blk_rq_bytes(rq);
> +
> +	return 0;
> +exit:
> +	imu->bvec = NULL;
> +	return 0;
> +}
> +
> +static void ublk_deinit_zero_copy_buffer(struct request *rq)
> +{
> +	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> +	struct io_uring_bvec_buf *imu = data->buf;
> +
> +	if (data->allocated_bvec) {
> +		kvfree(imu->bvec);
> +		data->allocated_bvec = false;
> +	}
> +}
> +
>   static inline bool ublk_need_map_req(const struct request *req)
>   {
>   	return ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE;
> @@ -575,11 +643,23 @@ static inline bool ublk_need_unmap_req(const struct request *req)
>   	return ublk_rq_has_data(req) && req_op(req) == REQ_OP_READ;
>   }
>   
> -static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
> +static int ublk_map_io(const struct ublk_queue *ubq, struct request *req,
>   		struct ublk_io *io)
>   {
>   	const unsigned int rq_bytes = blk_rq_bytes(req);
>   
> +	if (ublk_support_zc(ubq)) {
> +		int ret = ublk_init_zero_copy_buffer(req);
> +
> +		/*
> +		 * The only failure is -ENOMEM for allocating fused cmd
> +		 * buffer, return zero so that we can requeue this req.
> +		 */
> +		if (unlikely(ret))
> +			return 0;
> +		return rq_bytes;
> +	}
> +
>   	/*
>   	 * no zero copy, we delay copy WRITE request data into ublksrv
>   	 * context and the big benefit is that pinning pages in current
> @@ -599,11 +679,17 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
>   }
>   
>   static int ublk_unmap_io(const struct ublk_queue *ubq,
> -		const struct request *req,
> +		struct request *req,
>   		struct ublk_io *io)
>   {
>   	const unsigned int rq_bytes = blk_rq_bytes(req);
>   
> +	if (ublk_support_zc(ubq)) {
> +		ublk_deinit_zero_copy_buffer(req);
> +
> +		return rq_bytes;
> +	}
> +
>   	if (ublk_need_unmap_req(req)) {
>   		struct iov_iter iter;
>   		struct iovec iov;
> @@ -687,6 +773,12 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
>   	return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu;
>   }
>   
> +static inline struct ublk_uring_cmd_pdu *ublk_get_uring_fused_cmd_pdu(
> +		struct io_uring_cmd *ioucmd)
> +{
> +	return (struct ublk_uring_cmd_pdu *)&ioucmd->fused.pdu;
> +}
> +
>   static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
>   {
>   	return ubq->ubq_daemon->flags & PF_EXITING;
> @@ -742,6 +834,7 @@ static inline void __ublk_complete_rq(struct request *req)
>   
>   	return;
>   exit:
> +	ublk_deinit_zero_copy_buffer(req);
>   	blk_mq_end_request(req, res);
>   }
>   
> @@ -1352,6 +1445,68 @@ static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
>   	return NULL;
>   }
>   
> +static void ublk_fused_cmd_done_cb(struct io_uring_cmd *cmd,
> +		unsigned issue_flags)
> +{
> +	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_fused_cmd_pdu(cmd);
> +	struct request *req = pdu->req;
> +	struct ublk_queue *ubq = req->mq_hctx->driver_data;
> +
> +	ublk_put_req_ref(ubq, req);
> +	io_uring_cmd_done(cmd, 0, 0, issue_flags);
> +}
> +
> +static inline bool ublk_check_fused_buf_dir(const struct request *req,
> +		unsigned int flags)
> +{
> +	flags &= IO_URING_F_FUSED_BUF;

~IO_URING_F_FUSED_BUF ?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy
  2023-03-31 19:13   ` Bernd Schubert
@ 2023-04-01 13:19     ` Ming Lei
  0 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-04-01 13:19 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Jens Axboe, [email protected], [email protected],
	[email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Pavel Begunkov, Stefan Hajnoczi, Dan Williams

On Fri, Mar 31, 2023 at 07:13:21PM +0000, Bernd Schubert wrote:
> On 3/30/23 13:36, Ming Lei wrote:
> > Apply io_uring fused command for supporting zero copy:
> > 
> > 1) init the fused cmd buffer(io_mapped_buf) in ublk_map_io(), and deinit it
> > in ublk_unmap_io(), and this buffer is immutable, so it is just fine to
> > retrieve it from concurrent fused command.
> > 
> > 1) add sub-command opcode of UBLK_IO_FUSED_SUBMIT_IO for retrieving this
> > fused cmd(zero copy) buffer
> > 
> > 2) call io_fused_cmd_start_secondary_req() to provide buffer to secondary
> > request and submit secondary request; meantime setup complete callback via
> > this API, once secondary request is completed, the complete callback is
> > called for freeing the buffer and completing the fused command
> > 
> > Also request reference is held during fused command lifetime, and this way
> > guarantees that request buffer won't be freed until all inflight fused
> > commands are completed.
> > 
> > userspace(only implement sqe128 fused command):
> > 
> > 	https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
> > 
> > liburing test(only implement normal sqe fused command: two 64byte SQEs)
> > 
> > 	https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
> > 
> > Signed-off-by: Ming Lei <[email protected]>
> > ---
> >   Documentation/block/ublk.rst  | 126 ++++++++++++++++++++--
> >   drivers/block/ublk_drv.c      | 192 ++++++++++++++++++++++++++++++++--
> >   include/uapi/linux/ublk_cmd.h |   6 +-
> >   3 files changed, 303 insertions(+), 21 deletions(-)
> > 
> > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > index 1713b2890abb..7b7aa24e9729 100644
> > --- a/Documentation/block/ublk.rst
> > +++ b/Documentation/block/ublk.rst
> > @@ -297,18 +297,126 @@ with specified IO tag in the command data:
> >     ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
> >     the server buffer (pages) read to the IO request pages.
> >   
> > -Future development
> > -==================
> > +- ``UBLK_IO_FUSED_SUBMIT_IO``
> > +
> > +  Used for implementing zero copy feature.
> > +
> > +  It has to been the primary command of io_uring fused command. This command
> 
> s/been/be/ ?

Yeah.

> 
> > +  submits the generic secondary IO request with io buffer provided by our primary
> > +  command, and won't be completed until the secondary request is done.
> > +
> > +  The provided buffer is represented as ``io_uring_bvec_buf``, which is
> > +  actually ublk request buffer's reference, and the reference is shared &
> > +  read-only, so the generic secondary request can retrieve any part of the buffer
> > +  by passing buffer offset & length.
> >   
> >   Zero copy
> > ----------
> > +=========
> > +
> > +What is zero copy?
> > +------------------
> > +
> > +When application submits IO to ``/dev/ublkb*``, userspace buffer(direct io)
> > +or page cache buffer(buffered io) or kernel buffer(meta io often) is used
> > +for submitting data to ublk driver, and all kinds of these buffers are
> > +represented by bio/bvecs(ublk request buffer) finally. Before supporting
> > +zero copy, data in these buffers has to be copied to ublk server userspace
> > +buffer before handling WRITE IO, or after handing READ IO, so that ublk
> > +server can handle IO for ``/dev/ublkb*`` with the copied data.
> > +
> > +The extra copy between ublk request buffer and ublk server userspace buffer
> > +not only increases CPU utilization(such as pinning pages, copy data), but
> > +also consumes memory bandwidth, and the cost could be very big when IO size
> > +is big. It is observed that ublk-null IOPS may be increased to ~5X if the
> > +extra copy can be avoided.
> > +
> > +So zero copy is very important for supporting high performance block device
> > +in userspace.
> > +
> > +Technical requirements
> > +----------------------
> > +
> > +- ublk request buffer use
> > +
> > +ublk request buffer is represented by bio/bvec, which is immutable, so do
> > +not try to change bvec via buffer reference; data can be read from or
> > +written to the buffer according to buffer direction, but bvec can't be
> > +changed
> > +
> > +- buffer lifetime
> > +
> > +Ublk server borrows ublk request buffer for handling ublk IO, ublk request
> > +buffer reference is used. Reference can't outlive the referent buffer. That
> > +means all request buffer references have to be released by ublk server
> > +before ublk driver completes this request, when request buffer ownership
> > +is transferred to upper layer(FS, application, ...).
> > +
> > +Also after ublk request is completed, any page belonging to this ublk
> > +request can not be written or read any more from ublk server since it is
> > +one block device from kernel viewpoint.
> > +
> > +- buffer direction
> > +
> > +For ublk WRITE request, ublk request buffer should only be accessed as data
> > +source, and the buffer can't be written by ublk server
> > +
> > +For ublk READ request, ublk request buffer should only be accessed as data
> > +destination, and the buffer can't be read by ublk server, otherwise kernel
> > +data is leaked to ublk server, which can be unprivileged application.
> > +
> > +- arbitrary size sub-buffer needs to be retrieved from ublk server
> > +
> > +ublk is one generic framework for implementing block device in userspace,
> > +and typical requirements include logical volume manager(mirror, stripped, ...),
> > +distributed network storage, compressed target, ...
> > +
> > +ublk server needs to retrieve arbitrary size sub-buffer of ublk request, and
> > +ublk server needs to submit IOs with these sub-buffer(s). That also means
> > +arbitrary size sub-buffer(s) can be used to submit IO multiple times.
> > +
> > +Any sub-buffer is actually one reference of ublk request buffer, which
> > +ownership can't be transferred to upper layer if any reference is held
> > +by ublk server.
> > +
> > +Why slice isn't good for ublk zero copy
> 
> slice

Good catch.

> 
> > +---------------------------------------
> > +
> > +- spliced page from ->splice_read() can't be written
> > +
> > +ublk READ request can't be handled because spliced page can't be written to, and
> > +extending splice for ublk zero copy isn't one good solution [#splice_extend]_
> > +
> > +- it is very hard to meet above requirements  wrt. request buffer lifetime
> > +
> > +splice/pipe focuses on page reference lifetime, but ublk zero copy pays more
> > +attention to ublk request buffer lifetime. If is very inefficient to respect
> > +request buffer lifetime by using all pipe buffer's ->release() which requires
> > +all pipe buffers and pipe to be kept when ublk server handles IO. That means
> > +one single dedicated ``pipe_inode_info`` has to be allocated runtime for each
> > +provided buffer, and the pipe needs to be populated with pages in ublk request
> > +buffer.
> > +
> > +
> > +io_uring fused command based zero copy
> > +--------------------------------------
> > +
> > +io_uring fused command includes one primary command(uring command) and one
> > +generic secondary request. The primary command is responsible for submitting
> > +secondary request with provided buffer from ublk request, and primary command
> > +won't be completed until the secondary request is completed.
> > +
> > +Typical ublk IO handling includes network and FS IO, so it is usual enough
> > +for io_uring net & fs to support IO with provided buffer from primary command.
> >   
> > -Zero copy is a generic requirement for nbd, fuse or similar drivers. A
> 
> Maybe replace 'requirement' with "very useful"? It works without zc, 
> just with limited performance.

OK, also it has been observed that the extra copy affects performance a log for
big IO size(64K+). 

> 
> > -problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
> > -can't be remapped any more in kernel with existing mm interfaces. This can
> > -occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported that
> > -big requests (IO size >= 256 KB) may benefit a lot from zero copy.
> > +Once primary command is submitted successfully, ublk driver guarantees that
> > +the ublk request buffer won't be gone away since secondary request actually
> > +grabs the buffer's reference. This way also guarantees that multiple
> > +concurrent fused commands associated with same request buffer works fine,
> > +as the provided buffer reference is shared & read-only.
> >   
> > +Also buffer usage direction flag is passed to primary command from userspace,
> > +so ublk driver can validate if it is legal to use buffer with requested
> > +direction.
> >   
> >   References
> >   ==========
> > @@ -323,4 +431,4 @@ References
> >   
> >   .. [#stefan] https://lore.kernel.org/linux-block/[email protected]/
> >   
> > -.. [#xiaoguang] https://lore.kernel.org/linux-block/[email protected]/
> > +.. [#splice_extend] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@mail.gmail.com/
> 
> 
> Any chance you could add a few code path into this document? For example 
> as in Documentation/filesystems/fuse.rst,
> section "Kernel - userspace interface"

OK, just it isn't handy to describe the flow by pure text.

BTW, you can also see the code path in the link:

https://github.com/ming1/ubdsrv/blob/master/doc/ublk_intro.pdf


thanks,
Ming


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy
  2023-03-31 19:55   ` Bernd Schubert
@ 2023-04-01 13:22     ` Ming Lei
  2023-04-03  9:25     ` Bernd Schubert
  1 sibling, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-04-01 13:22 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Jens Axboe, [email protected], [email protected],
	[email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	ming.lei

On Fri, Mar 31, 2023 at 07:55:30PM +0000, Bernd Schubert wrote:
> On 3/30/23 13:36, Ming Lei wrote:

...

> > +static inline bool ublk_check_fused_buf_dir(const struct request *req,
> > +		unsigned int flags)
> > +{
> > +	flags &= IO_URING_F_FUSED_BUF;
> 
> ~IO_URING_F_FUSED_BUF ?

IO_URING_F_FUSED_BUF is buffer direction mask, and we can get direction
flags only in above way, so the above is correct.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 03/17] io_uring: add generic IORING_OP_FUSED_CMD
  2023-03-30 11:36 ` [PATCH V6 03/17] io_uring: add generic IORING_OP_FUSED_CMD Ming Lei
@ 2023-04-01 14:35   ` Ming Lei
  0 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-04-01 14:35 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	ming.lei

On Thu, Mar 30, 2023 at 07:36:16PM +0800, Ming Lei wrote:
> Multiple requests submitted as one whole request logically, and the 1st one
> is primary command(IORING_OP_FUSED_CMD), and the others are secondary
> requests, which number can be retrieved from primary SQE.
> 
> Primary command is responsible for providing resources and submitting
> secondary requests, which depends on primary command's resources, and
> primary command won't be completed until all secondary requests are done.
> 
> The provided resource has same lifetime with primary command, and it
> lifetime won't cross multiple OPs, and this way provides safe way for
> secondary OPs to use the resource.
> 
> Add generic IORING_OP_FUSED_CMD for modeling this primary/secondary
> relationship among requests.

BTW, this model also solves 1:N dependency problem of io_uring.

Current io_uring can't handle this kind of dependency(1:N) efficiently,
and simply convert it into one linked list:

- N requests(1~n) depends on one request(0), and there isn't dependency among
these N requests

- current io_uring converts the dependency to linked list of (N + 1) requests
(0, 1, ...., n), in which all requests are just issued one by one from 0 to n,
inefficiently

The primary/secondary model solves it by issuing request 0 first, then issues
all other N requests concurrently.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (16 preceding siblings ...)
  2023-03-30 11:36 ` [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting " Ming Lei
@ 2023-04-03  1:11 ` Ming Lei
  2023-04-03  1:24   ` Jens Axboe
  2023-04-03  1:23 ` (subset) " Jens Axboe
  2023-04-18 19:38 ` Bernd Schubert
  19 siblings, 1 reply; 37+ messages in thread
From: Ming Lei @ 2023-04-03  1:11 UTC (permalink / raw)
  To: Jens Axboe, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	ming.lei

On Thu, Mar 30, 2023 at 07:36:13PM +0800, Ming Lei wrote:
> Hello Jens and Guys,
> 
> Add generic fused command, which can include one primary command and multiple
> secondary requests. This command provides one safe way to share resource between
> primary command and secondary requests, and primary command is always
> completed after all secondary requests are done, and resource lifetime
> is bound with primary command.
> 
> With this way, it is easy to support zero copy for ublk/fuse device, and
> there could be more potential use cases, such as offloading complicated logic
> into userspace, or decouple kernel subsystems.
> 
> Follows ublksrv code, which implements zero copy for loop, nbd and
> qcow2 targets with fused command:
> 
> https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
> 
> All three(loop, nbd and qcow2) ublk targets have supported zero copy by passing:
> 
> 	ublk add -t [loop|nbd|qcow2] -z .... 
> 
> Also add liburing test case for covering fused command based on miniublk
> of blktest.
> 
> https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
> 
> Performance improvement is obvious on memory bandwidth related workloads,
> such as, 1~2X improvement on 64K/512K BS IO test on loop with ramfs backing file.
> ublk-null shows 5X IOPS improvement on big BS test when the copy is avoided.
> 
> Please review and consider for v6.4.
> 
> V6:
> 	- re-design fused command, and make it more generic, moving sharing buffer
> 	as one plugin of fused command, so in future we can implement more plugins
> 	- document potential other use cases of fused command
> 	- drop support for builtin secondary sqe in SQE128, so all secondary
> 	  requests has standalone SQE
> 	- make fused command as one feature
> 	- cleanup & improve naming

Hi Jens,

Can you apply ublk cleanup patches 7~11 on for-6.4? For others, we may
delay to 6.5, and I am looking at other approach too.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: (subset) [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (17 preceding siblings ...)
  2023-04-03  1:11 ` [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
@ 2023-04-03  1:23 ` Jens Axboe
  2023-04-18 19:38 ` Bernd Schubert
  19 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2023-04-03  1:23 UTC (permalink / raw)
  To: io-uring, linux-block, Ming Lei
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams


On Thu, 30 Mar 2023 19:36:13 +0800, Ming Lei wrote:
> Add generic fused command, which can include one primary command and multiple
> secondary requests. This command provides one safe way to share resource between
> primary command and secondary requests, and primary command is always
> completed after all secondary requests are done, and resource lifetime
> is bound with primary command.
> 
> With this way, it is easy to support zero copy for ublk/fuse device, and
> there could be more potential use cases, such as offloading complicated logic
> into userspace, or decouple kernel subsystems.
> 
> [...]

Applied, thanks!

[07/17] block: ublk_drv: add common exit handling
        commit: 903f8aeea9fd1b97fba4ab805ddd639f57f117f8
[08/17] block: ublk_drv: don't consider flush request in map/unmap io
        commit: 23ef8220f287abe5bf741ddfc278e7359742d3b1
[09/17] block: ublk_drv: add two helpers to clean up map/unmap request
        commit: 2f3af723447c35c16f3c6a1b4b317c61dc41d6c3
[10/17] block: ublk_drv: clean up several helpers
        commit: 96cf2f5404c8bc979628a2b495852d735a56c5b5
[11/17] block: ublk_drv: cleanup 'struct ublk_map_data'
        commit: ae9f5ccea4c268a96763e51239b32d6b5172c18c

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
  2023-04-03  1:11 ` [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
@ 2023-04-03  1:24   ` Jens Axboe
  2023-04-04  7:48     ` Ming Lei
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2023-04-03  1:24 UTC (permalink / raw)
  To: Ming Lei, io-uring, linux-block
  Cc: linux-kernel, Miklos Szeredi, ZiyangZhang, Xiaoguang Wang,
	Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi, Dan Williams

On 4/2/23 7:11?PM, Ming Lei wrote:
> On Thu, Mar 30, 2023 at 07:36:13PM +0800, Ming Lei wrote:
>> Hello Jens and Guys,
>>
>> Add generic fused command, which can include one primary command and multiple
>> secondary requests. This command provides one safe way to share resource between
>> primary command and secondary requests, and primary command is always
>> completed after all secondary requests are done, and resource lifetime
>> is bound with primary command.
>>
>> With this way, it is easy to support zero copy for ublk/fuse device, and
>> there could be more potential use cases, such as offloading complicated logic
>> into userspace, or decouple kernel subsystems.
>>
>> Follows ublksrv code, which implements zero copy for loop, nbd and
>> qcow2 targets with fused command:
>>
>> https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
>>
>> All three(loop, nbd and qcow2) ublk targets have supported zero copy by passing:
>>
>> 	ublk add -t [loop|nbd|qcow2] -z .... 
>>
>> Also add liburing test case for covering fused command based on miniublk
>> of blktest.
>>
>> https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
>>
>> Performance improvement is obvious on memory bandwidth related workloads,
>> such as, 1~2X improvement on 64K/512K BS IO test on loop with ramfs backing file.
>> ublk-null shows 5X IOPS improvement on big BS test when the copy is avoided.
>>
>> Please review and consider for v6.4.
>>
>> V6:
>> 	- re-design fused command, and make it more generic, moving sharing buffer
>> 	as one plugin of fused command, so in future we can implement more plugins
>> 	- document potential other use cases of fused command
>> 	- drop support for builtin secondary sqe in SQE128, so all secondary
>> 	  requests has standalone SQE
>> 	- make fused command as one feature
>> 	- cleanup & improve naming
> 
> Hi Jens,
> 
> Can you apply ublk cleanup patches 7~11 on for-6.4? For others, we may
> delay to 6.5, and I am looking at other approach too.

Done - and yes, we're probably looking at 6.5 for the rest. But that's
fine, I'd rather end up with the right interface than try and rush one.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy
  2023-03-31 19:55   ` Bernd Schubert
  2023-04-01 13:22     ` Ming Lei
@ 2023-04-03  9:25     ` Bernd Schubert
  1 sibling, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2023-04-03  9:25 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, [email protected],
	[email protected]
  Cc: [email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Pavel Begunkov, Stefan Hajnoczi, Dan Williams

On 3/31/23 21:55, Bernd Schubert wrote:
> On 3/30/23 13:36, Ming Lei wrote:
>> Apply io_uring fused command for supporting zero copy:
>>
>> 1) init the fused cmd buffer(io_mapped_buf) in ublk_map_io(), and 
>> deinit it
>> in ublk_unmap_io(), and this buffer is immutable, so it is just fine to
>> retrieve it from concurrent fused command.
>>
>> 1) add sub-command opcode of UBLK_IO_FUSED_SUBMIT_IO for retrieving this
>> fused cmd(zero copy) buffer
>>
>> 2) call io_fused_cmd_start_secondary_req() to provide buffer to secondary
>> request and submit secondary request; meantime setup complete callback 
>> via
>> this API, once secondary request is completed, the complete callback is
>> called for freeing the buffer and completing the fused command
>>
>> Also request reference is held during fused command lifetime, and this 
>> way
>> guarantees that request buffer won't be freed until all inflight fused
>> commands are completed.
>>
>> userspace(only implement sqe128 fused command):
>>
>>     https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
>>
>> liburing test(only implement normal sqe fused command: two 64byte SQEs)
>>
>>     https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
>>
>> Signed-off-by: Ming Lei <[email protected]>
>> ---
>>   Documentation/block/ublk.rst  | 126 ++++++++++++++++++++--
>>   drivers/block/ublk_drv.c      | 192 ++++++++++++++++++++++++++++++++--
>>   include/uapi/linux/ublk_cmd.h |   6 +-
>>   3 files changed, 303 insertions(+), 21 deletions(-)
>>
>> diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
>> index 1713b2890abb..7b7aa24e9729 100644
>> --- a/Documentation/block/ublk.rst
>> +++ b/Documentation/block/ublk.rst
>> @@ -297,18 +297,126 @@ with specified IO tag in the command data:
>>     ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
>>     the server buffer (pages) read to the IO request pages.
>> -Future development
>> -==================
>> +- ``UBLK_IO_FUSED_SUBMIT_IO``
>> +
>> +  Used for implementing zero copy feature.
>> +
>> +  It has to been the primary command of io_uring fused command. This 
>> command
>> +  submits the generic secondary IO request with io buffer provided by 
>> our primary
>> +  command, and won't be completed until the secondary request is done.
>> +
>> +  The provided buffer is represented as ``io_uring_bvec_buf``, which is
>> +  actually ublk request buffer's reference, and the reference is 
>> shared &
>> +  read-only, so the generic secondary request can retrieve any part 
>> of the buffer
>> +  by passing buffer offset & length.
>>   Zero copy
>> ----------
>> +=========
>> +
>> +What is zero copy?
>> +------------------
>> +
>> +When application submits IO to ``/dev/ublkb*``, userspace 
>> buffer(direct io)
>> +or page cache buffer(buffered io) or kernel buffer(meta io often) is 
>> used
>> +for submitting data to ublk driver, and all kinds of these buffers are
>> +represented by bio/bvecs(ublk request buffer) finally. Before supporting
>> +zero copy, data in these buffers has to be copied to ublk server 
>> userspace
>> +buffer before handling WRITE IO, or after handing READ IO, so that ublk
>> +server can handle IO for ``/dev/ublkb*`` with the copied data.
>> +
>> +The extra copy between ublk request buffer and ublk server userspace 
>> buffer
>> +not only increases CPU utilization(such as pinning pages, copy data), 
>> but
>> +also consumes memory bandwidth, and the cost could be very big when 
>> IO size
>> +is big. It is observed that ublk-null IOPS may be increased to ~5X if 
>> the
>> +extra copy can be avoided.
>> +
>> +So zero copy is very important for supporting high performance block 
>> device
>> +in userspace.
>> +
>> +Technical requirements
>> +----------------------
>> +
>> +- ublk request buffer use
>> +
>> +ublk request buffer is represented by bio/bvec, which is immutable, 
>> so do
>> +not try to change bvec via buffer reference; data can be read from or
>> +written to the buffer according to buffer direction, but bvec can't be
>> +changed
>> +
>> +- buffer lifetime
>> +
>> +Ublk server borrows ublk request buffer for handling ublk IO, ublk 
>> request
>> +buffer reference is used. Reference can't outlive the referent 
>> buffer. That
>> +means all request buffer references have to be released by ublk server
>> +before ublk driver completes this request, when request buffer ownership
>> +is transferred to upper layer(FS, application, ...).
>> +
>> +Also after ublk request is completed, any page belonging to this ublk
>> +request can not be written or read any more from ublk server since it is
>> +one block device from kernel viewpoint.
>> +
>> +- buffer direction
>> +
>> +For ublk WRITE request, ublk request buffer should only be accessed 
>> as data
>> +source, and the buffer can't be written by ublk server
>> +
>> +For ublk READ request, ublk request buffer should only be accessed as 
>> data
>> +destination, and the buffer can't be read by ublk server, otherwise 
>> kernel
>> +data is leaked to ublk server, which can be unprivileged application.
>> +
>> +- arbitrary size sub-buffer needs to be retrieved from ublk server
>> +
>> +ublk is one generic framework for implementing block device in 
>> userspace,
>> +and typical requirements include logical volume manager(mirror, 
>> stripped, ...),
>> +distributed network storage, compressed target, ...
>> +
>> +ublk server needs to retrieve arbitrary size sub-buffer of ublk 
>> request, and
>> +ublk server needs to submit IOs with these sub-buffer(s). That also 
>> means
>> +arbitrary size sub-buffer(s) can be used to submit IO multiple times.
>> +
>> +Any sub-buffer is actually one reference of ublk request buffer, which
>> +ownership can't be transferred to upper layer if any reference is held
>> +by ublk server.
>> +
>> +Why slice isn't good for ublk zero copy
>> +---------------------------------------
>> +
>> +- spliced page from ->splice_read() can't be written
>> +
>> +ublk READ request can't be handled because spliced page can't be 
>> written to, and
>> +extending splice for ublk zero copy isn't one good solution 
>> [#splice_extend]_
>> +
>> +- it is very hard to meet above requirements  wrt. request buffer 
>> lifetime
>> +
>> +splice/pipe focuses on page reference lifetime, but ublk zero copy 
>> pays more
>> +attention to ublk request buffer lifetime. If is very inefficient to 
>> respect
>> +request buffer lifetime by using all pipe buffer's ->release() which 
>> requires
>> +all pipe buffers and pipe to be kept when ublk server handles IO. 
>> That means
>> +one single dedicated ``pipe_inode_info`` has to be allocated runtime 
>> for each
>> +provided buffer, and the pipe needs to be populated with pages in 
>> ublk request
>> +buffer.
>> +
>> +
>> +io_uring fused command based zero copy
>> +--------------------------------------
>> +
>> +io_uring fused command includes one primary command(uring command) 
>> and one
>> +generic secondary request. The primary command is responsible for 
>> submitting
>> +secondary request with provided buffer from ublk request, and primary 
>> command
>> +won't be completed until the secondary request is completed.
>> +
>> +Typical ublk IO handling includes network and FS IO, so it is usual 
>> enough
>> +for io_uring net & fs to support IO with provided buffer from primary 
>> command.
>> -Zero copy is a generic requirement for nbd, fuse or similar drivers. A
>> -problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to 
>> userspace
>> -can't be remapped any more in kernel with existing mm interfaces. 
>> This can
>> -occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported 
>> that
>> -big requests (IO size >= 256 KB) may benefit a lot from zero copy.
>> +Once primary command is submitted successfully, ublk driver 
>> guarantees that
>> +the ublk request buffer won't be gone away since secondary request 
>> actually
>> +grabs the buffer's reference. This way also guarantees that multiple
>> +concurrent fused commands associated with same request buffer works 
>> fine,
>> +as the provided buffer reference is shared & read-only.
>> +Also buffer usage direction flag is passed to primary command from 
>> userspace,
>> +so ublk driver can validate if it is legal to use buffer with requested
>> +direction.
>>   References
>>   ==========
>> @@ -323,4 +431,4 @@ References
>>   .. [#stefan] 
>> https://lore.kernel.org/linux-block/[email protected]/
>> -.. [#xiaoguang] 
>> https://lore.kernel.org/linux-block/[email protected]/
>> +.. [#splice_extend] 
>> https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@mail.gmail.com/
>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>> index a744d3b5da91..64b0408873f6 100644
>> --- a/drivers/block/ublk_drv.c
>> +++ b/drivers/block/ublk_drv.c
>> @@ -74,10 +74,15 @@ struct ublk_rq_data {
>>        *   successfully
>>        */
>>       struct kref ref;
>> +    bool allocated_bvec;
>> +    struct io_uring_bvec_buf buf[0];
>>   };
>>   struct ublk_uring_cmd_pdu {
>> -    struct ublk_queue *ubq;
>> +    union {
>> +        struct ublk_queue *ubq;
>> +        struct request *req;
>> +    };
>>   };
>>   /*
>> @@ -565,6 +570,69 @@ static size_t ublk_copy_user_pages(const struct 
>> request *req,
>>       return done;
>>   }
>> +/*
>> + * The built command buffer is immutable, so it is fine to feed it to
>> + * concurrent io_uring fused commands
>> + */
>> +static int ublk_init_zero_copy_buffer(struct request *rq)
>> +{
>> +    struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
>> +    struct io_uring_bvec_buf *imu = data->buf;
>> +    struct req_iterator rq_iter;
>> +    unsigned int nr_bvecs = 0;
>> +    struct bio_vec *bvec;
>> +    unsigned int offset;
>> +    struct bio_vec bv;
>> +
>> +    if (!ublk_rq_has_data(rq))
>> +        goto exit;
>> +
>> +    rq_for_each_bvec(bv, rq, rq_iter)
>> +        nr_bvecs++;
>> +
>> +    if (!nr_bvecs)
>> +        goto exit;
>> +
>> +    if (rq->bio != rq->biotail) {
>> +        int idx = 0;
>> +
>> +        bvec = kvmalloc_array(nr_bvecs, sizeof(struct bio_vec),
>> +                GFP_NOIO);
>> +        if (!bvec)
>> +            return -ENOMEM;
>> +
>> +        offset = 0;
>> +        rq_for_each_bvec(bv, rq, rq_iter)
>> +            bvec[idx++] = bv;
>> +        data->allocated_bvec = true;
>> +    } else {
>> +        struct bio *bio = rq->bio;
>> +
>> +        offset = bio->bi_iter.bi_bvec_done;
>> +        bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
>> +    }
>> +    imu->bvec = bvec;
>> +    imu->nr_bvecs = nr_bvecs;
>> +    imu->offset = offset;
>> +    imu->len = blk_rq_bytes(rq);
>> +
>> +    return 0;
>> +exit:
>> +    imu->bvec = NULL;
>> +    return 0;
>> +}
>> +
>> +static void ublk_deinit_zero_copy_buffer(struct request *rq)
>> +{
>> +    struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
>> +    struct io_uring_bvec_buf *imu = data->buf;
>> +
>> +    if (data->allocated_bvec) {
>> +        kvfree(imu->bvec);
>> +        data->allocated_bvec = false;
>> +    }
>> +}
>> +
>>   static inline bool ublk_need_map_req(const struct request *req)
>>   {
>>       return ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE;
>> @@ -575,11 +643,23 @@ static inline bool ublk_need_unmap_req(const 
>> struct request *req)
>>       return ublk_rq_has_data(req) && req_op(req) == REQ_OP_READ;
>>   }
>> -static int ublk_map_io(const struct ublk_queue *ubq, const struct 
>> request *req,
>> +static int ublk_map_io(const struct ublk_queue *ubq, struct request 
>> *req,
>>           struct ublk_io *io)
>>   {
>>       const unsigned int rq_bytes = blk_rq_bytes(req);
>> +    if (ublk_support_zc(ubq)) {
>> +        int ret = ublk_init_zero_copy_buffer(req);
>> +
>> +        /*
>> +         * The only failure is -ENOMEM for allocating fused cmd
>> +         * buffer, return zero so that we can requeue this req.
>> +         */
>> +        if (unlikely(ret))
>> +            return 0;
>> +        return rq_bytes;
>> +    }
>> +
>>       /*
>>        * no zero copy, we delay copy WRITE request data into ublksrv
>>        * context and the big benefit is that pinning pages in current
>> @@ -599,11 +679,17 @@ static int ublk_map_io(const struct ublk_queue 
>> *ubq, const struct request *req,
>>   }
>>   static int ublk_unmap_io(const struct ublk_queue *ubq,
>> -        const struct request *req,
>> +        struct request *req,
>>           struct ublk_io *io)
>>   {
>>       const unsigned int rq_bytes = blk_rq_bytes(req);
>> +    if (ublk_support_zc(ubq)) {
>> +        ublk_deinit_zero_copy_buffer(req);
>> +
>> +        return rq_bytes;
>> +    }
>> +
>>       if (ublk_need_unmap_req(req)) {
>>           struct iov_iter iter;
>>           struct iovec iov;
>> @@ -687,6 +773,12 @@ static inline struct ublk_uring_cmd_pdu 
>> *ublk_get_uring_cmd_pdu(
>>       return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu;
>>   }
>> +static inline struct ublk_uring_cmd_pdu *ublk_get_uring_fused_cmd_pdu(
>> +        struct io_uring_cmd *ioucmd)
>> +{
>> +    return (struct ublk_uring_cmd_pdu *)&ioucmd->fused.pdu;
>> +}
>> +
>>   static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
>>   {
>>       return ubq->ubq_daemon->flags & PF_EXITING;
>> @@ -742,6 +834,7 @@ static inline void __ublk_complete_rq(struct 
>> request *req)
>>       return;
>>   exit:
>> +    ublk_deinit_zero_copy_buffer(req);
>>       blk_mq_end_request(req, res);
>>   }
>> @@ -1352,6 +1445,68 @@ static inline struct request 
>> *__ublk_check_and_get_req(struct ublk_device *ub,
>>       return NULL;
>>   }
>> +static void ublk_fused_cmd_done_cb(struct io_uring_cmd *cmd,
>> +        unsigned issue_flags)
>> +{
>> +    struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_fused_cmd_pdu(cmd);
>> +    struct request *req = pdu->req;
>> +    struct ublk_queue *ubq = req->mq_hctx->driver_data;
>> +
>> +    ublk_put_req_ref(ubq, req);
>> +    io_uring_cmd_done(cmd, 0, 0, issue_flags);
>> +}
>> +
>> +static inline bool ublk_check_fused_buf_dir(const struct request *req,
>> +        unsigned int flags)
>> +{
>> +    flags &= IO_URING_F_FUSED_BUF;
> 
> ~IO_URING_F_FUSED_BUF ?
> 

Ah sorry for the noise, got confused because the caller 
ublk_handle_fused_cmd checks for the flag with '&' and then 
ublk_check_fused_buf_dir checks with '=' - I thought intend was to 
remove the flag.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
  2023-04-03  1:24   ` Jens Axboe
@ 2023-04-04  7:48     ` Ming Lei
  0 siblings, 0 replies; 37+ messages in thread
From: Ming Lei @ 2023-04-04  7:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: io-uring, linux-block, linux-kernel, Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Bernd Schubert, Pavel Begunkov, Stefan Hajnoczi,
	Dan Williams, ming.lei

Hello Jens and Everyone,

On Sun, Apr 02, 2023 at 07:24:17PM -0600, Jens Axboe wrote:
> On 4/2/23 7:11?PM, Ming Lei wrote:
> > On Thu, Mar 30, 2023 at 07:36:13PM +0800, Ming Lei wrote:
> >> Hello Jens and Guys,
> >>
> >> Add generic fused command, which can include one primary command and multiple
> >> secondary requests. This command provides one safe way to share resource between
> >> primary command and secondary requests, and primary command is always
> >> completed after all secondary requests are done, and resource lifetime
> >> is bound with primary command.
> >>
> >> With this way, it is easy to support zero copy for ublk/fuse device, and
> >> there could be more potential use cases, such as offloading complicated logic
> >> into userspace, or decouple kernel subsystems.
> >>
> >> Follows ublksrv code, which implements zero copy for loop, nbd and
> >> qcow2 targets with fused command:
> >>
> >> https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
> >>
> >> All three(loop, nbd and qcow2) ublk targets have supported zero copy by passing:
> >>
> >> 	ublk add -t [loop|nbd|qcow2] -z .... 
> >>
> >> Also add liburing test case for covering fused command based on miniublk
> >> of blktest.
> >>
> >> https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
> >>
> >> Performance improvement is obvious on memory bandwidth related workloads,
> >> such as, 1~2X improvement on 64K/512K BS IO test on loop with ramfs backing file.
> >> ublk-null shows 5X IOPS improvement on big BS test when the copy is avoided.
> >>
> >> Please review and consider for v6.4.
> >>
> >> V6:
> >> 	- re-design fused command, and make it more generic, moving sharing buffer
> >> 	as one plugin of fused command, so in future we can implement more plugins
> >> 	- document potential other use cases of fused command
> >> 	- drop support for builtin secondary sqe in SQE128, so all secondary
> >> 	  requests has standalone SQE
> >> 	- make fused command as one feature
> >> 	- cleanup & improve naming
> > 
> > Hi Jens,
> > 
> > Can you apply ublk cleanup patches 7~11 on for-6.4? For others, we may
> > delay to 6.5, and I am looking at other approach too.
> 
> Done - and yes, we're probably looking at 6.5 for the rest. But that's

Thanks!

> fine, I'd rather end up with the right interface than try and rush one.

Also I'd provide one summery about this work here so that it may help
for anyone interested in this work, follows three approaches we have
tried or proposed:

1) splice can't do this job[1][2]

2) fused command in this patchset
- it is more like sendfile() or copy_file_range(), because the internal
  buffer isn't exposed outside

- v6 becomes a bit more generic, the theory is that one SQE list is submitted
as a whole request logically; the 1st sqe is the primary command, which
provides buffer for others, and is responsible for submitting other SQEs
(secondary)in this list; the primary command isn't completed until all secondary
requests are done

- this approach solves two problems efficiently in one simple way:

	a) buffer lifetime issue, and buffer lifetime is same with primary command, so
	all secondary OPs can be submitted & completely safely

	b) request dependency issue, all secondary requests depend on primary command,
	and secondary request itself could be independent, we start to allow to submit
	secondary request in non-async style, and all secondary requests can be issued
	concurrently

- this approach is simple, because we don't expose buffer outside, and
  buffer is just shared among these secondary requests; meantime
  internal buffer saves us complicated OPs' dependency issue, avoid
  contention by registering buffer anywhere between submission and
  completion code path

- the drawback is that we add one new SQE usage/model of primary SQE and
  secondary SQEs, and the whole logical request in concept, which is
  like sendfile() or copy_file_range()

3) register transient buffers for OPs[3]
- it is more like splice(), which is flexible and could be more generic, but
internal pipe buffer is added to pipe which is visible outside, so the
implementation becomes complicated; and it should be more than splice(),
because the io buffer needs to be shared among multiple OPs

- inefficiently & complicated

	a) buffer has to be added to one global container(suppose it is
	io_uring context pipe) by ADD_BUF OP, and either buffer needs to be removed after
	consumer OPs are completed, or DEL_OP is run for removing buffer explicitly, so
	either contention on the io_uring pipe is added, or another new dependency is
	added(DEL_OP depends on all normal OPs)

	b) ADD_BUF OP is needed, and normal OPs have to depend on this new
	OP by IOSQE_IO_LINK, then all normal OPs will be submitted in async way,
	even worse, each normal OP has to be issued one by one, because io_uring
	isn't capable of handling 1:N dependency issue[5]

    c) if DEL_BUF OP is needed, then it is basically not possible
	to solve 1:N dependency any more, given DEL_BUF starts to depends on the previous
	N OPs; otherwise, contention on pipe is inevitable.

	d) solving 1:N dependency issue generically

- advantage

Follows current io_uring SQE usage, and looks more generic/flexible,
like splice().

4) others approaches or suggestions?

Any idea is welcome as usual.


Finally from problem viewpoint, if the problem domain is just ublk/fuse zero copy
or other similar problems[6], fused command might be the simpler & more efficient
approach, compared with approach 3). However, are there any other problems we
want to cover by one more generic/flexible interface? If not, would we
like to pay the complexity & inefficiency for one kind of less generic
problem?


[1] https://lore.kernel.org/linux-block/[email protected]/T/#m1bfa358524b6af94731bcd5be28056f9f4408ecf
[2] https://github.com/ming1/linux/blob/my_v6.3-io_uring_fuse_cmd_v6/Documentation/block/ublk.rst#zero-copy
[3] https://lore.kernel.org/linux-block/[email protected]/T/#mbe428dfeb0417487cd1db7e6dabca7399a3c265b
[4] https://lore.kernel.org/linux-block/[email protected]/T/#md035ffa4c6b69e85de2ab145418a9849a3b33741
[5] https://lore.kernel.org/linux-block/[email protected]/T/#m5e0c282ad26d9f3d8e519645168aeb3a19b5740b
[6] https://lore.kernel.org/linux-block/[email protected]/T/#me5cca4db606541fae452d625780635fcedcd5c6c

Thanks,
Ming


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
  2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
                   ` (18 preceding siblings ...)
  2023-04-03  1:23 ` (subset) " Jens Axboe
@ 2023-04-18 19:38 ` Bernd Schubert
  2023-04-19  1:51   ` Ming Lei
  19 siblings, 1 reply; 37+ messages in thread
From: Bernd Schubert @ 2023-04-18 19:38 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, [email protected],
	[email protected]
  Cc: [email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Amir Goldstein

On 3/30/23 13:36, Ming Lei wrote:
[...]
> V6:
> 	- re-design fused command, and make it more generic, moving sharing buffer
> 	as one plugin of fused command, so in future we can implement more plugins
> 	- document potential other use cases of fused command
> 	- drop support for builtin secondary sqe in SQE128, so all secondary
> 	  requests has standalone SQE
> 	- make fused command as one feature
> 	- cleanup & improve naming

Hi Ming, et al.,

I started to wonder if fused SQE could be extended to combine multiple 
syscalls, for example open/read/close.  Which would be another solution 
for the readfile syscall Miklos had proposed some time ago.

https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/

If fused SQEs could be extended, I think it would be quite helpful for 
many other patterns. Another similar examples would open/write/close, 
but ideal would be also to allow to have it more complex like 
"open/write/sync_file_range/close" - open/write/close might be the 
fastest and could possibly return before sync_file_range. Use case for 
the latter would be a file server that wants to give notifications to 
client when pages have been written out.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
  2023-04-18 19:38 ` Bernd Schubert
@ 2023-04-19  1:51   ` Ming Lei
  2023-04-19  9:56     ` Bernd Schubert
  0 siblings, 1 reply; 37+ messages in thread
From: Ming Lei @ 2023-04-19  1:51 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Jens Axboe, [email protected], [email protected],
	[email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Amir Goldstein, ming.lei

On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
> On 3/30/23 13:36, Ming Lei wrote:
> [...]
> > V6:
> > 	- re-design fused command, and make it more generic, moving sharing buffer
> > 	as one plugin of fused command, so in future we can implement more plugins
> > 	- document potential other use cases of fused command
> > 	- drop support for builtin secondary sqe in SQE128, so all secondary
> > 	  requests has standalone SQE
> > 	- make fused command as one feature
> > 	- cleanup & improve naming
> 
> Hi Ming, et al.,
> 
> I started to wonder if fused SQE could be extended to combine multiple 
> syscalls, for example open/read/close.  Which would be another solution 
> for the readfile syscall Miklos had proposed some time ago.
> 
> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/
> 
> If fused SQEs could be extended, I think it would be quite helpful for 
> many other patterns. Another similar examples would open/write/close, 
> but ideal would be also to allow to have it more complex like 
> "open/write/sync_file_range/close" - open/write/close might be the 
> fastest and could possibly return before sync_file_range. Use case for 
> the latter would be a file server that wants to give notifications to 
> client when pages have been written out.

The above pattern needn't fused command, and it can be done by plain
SQEs chain, follows the usage:

1) suppose you get one command from /dev/fuse, then FUSE daemon
needs to handle the command as open/write/sync/close
2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK;
3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK;
4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK;
5) get sqe4, prepare it for close syscall
6) io_uring_enter();	//for submit and get events

Then all the four OPs are done one by one by io_uring internal
machinery, and you can choose to get successful CQE for each OP.

Is the above what you want to do?

The fused command proposal is actually for zero copy(but not limited to zc).

If the above write OP need to write to file with in-kernel buffer
of /dev/fuse directly, you can get one sqe0 and prepare it for primary command
before 1), and set sqe2->addr to offet of the buffer in 3).

However, fused command is usually used in the following way, such as FUSE daemon
gets one READ request from /dev/fuse, FUSE userspace can handle the READ request
as io_uring fused command:

1) get sqe0 and prepare it for primary command, in which you need to
provide info for retrieving kernel buffer/pages of this READ request

2) suppose this READ request needs to be handled by translating it to
READs to two files/devices, considering it as one mirror:

- get sqe1, prepare it for read from file1, and set sqe->addr to offset
  of the buffer in 1), set sqe->len as length for read; this READ OP
  uses the kernel buffer in 1) directly 

- get sqe2, prepare it for read from file2, and set sqe->addr to offset
  of buffer in 1), set sqe->len as length for read;  this READ OP
  uses the kernel buffer in 1) directly 

3) submit the three sqe by io_uring_enter()

sqe1 and sqe2 can be submitted concurrently or be issued one by one
in order, fused command supports both, and depends on user requirement.
But io_uring linked OPs is usually slower.

Also file1/file2 needs to be opened beforehand in this example, and FD is
passed to sqe1/sqe2, another choice is to use fixed File; Also you can
add the open/close() OPs into above steps, which need these open/close/READ
to be linked in order, usually slower tnan non-linked OPs.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
  2023-04-19  1:51   ` Ming Lei
@ 2023-04-19  9:56     ` Bernd Schubert
  2023-04-19 11:19       ` Ming Lei
  0 siblings, 1 reply; 37+ messages in thread
From: Bernd Schubert @ 2023-04-19  9:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, [email protected], [email protected],
	[email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Amir Goldstein

On 4/19/23 03:51, Ming Lei wrote:
> On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
>> On 3/30/23 13:36, Ming Lei wrote:
>> [...]
>>> V6:
>>> 	- re-design fused command, and make it more generic, moving sharing buffer
>>> 	as one plugin of fused command, so in future we can implement more plugins
>>> 	- document potential other use cases of fused command
>>> 	- drop support for builtin secondary sqe in SQE128, so all secondary
>>> 	  requests has standalone SQE
>>> 	- make fused command as one feature
>>> 	- cleanup & improve naming
>>
>> Hi Ming, et al.,
>>
>> I started to wonder if fused SQE could be extended to combine multiple
>> syscalls, for example open/read/close.  Which would be another solution
>> for the readfile syscall Miklos had proposed some time ago.
>>
>> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/
>>
>> If fused SQEs could be extended, I think it would be quite helpful for
>> many other patterns. Another similar examples would open/write/close,
>> but ideal would be also to allow to have it more complex like
>> "open/write/sync_file_range/close" - open/write/close might be the
>> fastest and could possibly return before sync_file_range. Use case for
>> the latter would be a file server that wants to give notifications to
>> client when pages have been written out.
> 
> The above pattern needn't fused command, and it can be done by plain
> SQEs chain, follows the usage:
> 
> 1) suppose you get one command from /dev/fuse, then FUSE daemon
> needs to handle the command as open/write/sync/close
> 2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK;
> 3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK;
> 4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK;
> 5) get sqe4, prepare it for close syscall
> 6) io_uring_enter();	//for submit and get events

Oh, I was not aware that IOSQE_IO_LINK could pass the result of open 
down to the others. Hmm, the example I find for open is 
io_uring_prep_openat_direct in test_open_fixed(). It probably gets off 
topic here, but one needs to have ring prepared with 
io_uring_register_files_sparse, then manually manages available indexes 
and can then link commands? Interesting!

> 
> Then all the four OPs are done one by one by io_uring internal
> machinery, and you can choose to get successful CQE for each OP.
> 
> Is the above what you want to do?
> 
> The fused command proposal is actually for zero copy(but not limited to zc).

Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to 
support generic passing, as it kind of hands data (buffers) from one sqe 
to the other. I.e. instead of buffers it would have passed the fd, but 
if this is already possible - no need to make IORING_OP_FUSED_CMD more 
complex.man

> 
> If the above write OP need to write to file with in-kernel buffer
> of /dev/fuse directly, you can get one sqe0 and prepare it for primary command
> before 1), and set sqe2->addr to offet of the buffer in 3).
> 
> However, fused command is usually used in the following way, such as FUSE daemon
> gets one READ request from /dev/fuse, FUSE userspace can handle the READ request
> as io_uring fused command:
> 
> 1) get sqe0 and prepare it for primary command, in which you need to
> provide info for retrieving kernel buffer/pages of this READ request
> 
> 2) suppose this READ request needs to be handled by translating it to
> READs to two files/devices, considering it as one mirror:
> 
> - get sqe1, prepare it for read from file1, and set sqe->addr to offset
>    of the buffer in 1), set sqe->len as length for read; this READ OP
>    uses the kernel buffer in 1) directly
> 
> - get sqe2, prepare it for read from file2, and set sqe->addr to offset
>    of buffer in 1), set sqe->len as length for read;  this READ OP
>    uses the kernel buffer in 1) directly
> 
> 3) submit the three sqe by io_uring_enter()
> 
> sqe1 and sqe2 can be submitted concurrently or be issued one by one
> in order, fused command supports both, and depends on user requirement.
> But io_uring linked OPs is usually slower.
> 
> Also file1/file2 needs to be opened beforehand in this example, and FD is
> passed to sqe1/sqe2, another choice is to use fixed File; Also you can
> add the open/close() OPs into above steps, which need these open/close/READ
> to be linked in order, usually slower tnan non-linked OPs.


Yes thanks, I'm going to prepare this in an branch, otherwise current 
fuse-uring would have a ZC regression (although my target ddn projects 
cannot make use of it, as we need access to the buffer for checksums, etc).


Thanks,
Bernd


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
  2023-04-19  9:56     ` Bernd Schubert
@ 2023-04-19 11:19       ` Ming Lei
  2023-04-19 15:42         ` Bernd Schubert
  0 siblings, 1 reply; 37+ messages in thread
From: Ming Lei @ 2023-04-19 11:19 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Jens Axboe, [email protected], [email protected],
	[email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Amir Goldstein, ming.lei

On Wed, Apr 19, 2023 at 09:56:43AM +0000, Bernd Schubert wrote:
> On 4/19/23 03:51, Ming Lei wrote:
> > On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
> >> On 3/30/23 13:36, Ming Lei wrote:
> >> [...]
> >>> V6:
> >>> 	- re-design fused command, and make it more generic, moving sharing buffer
> >>> 	as one plugin of fused command, so in future we can implement more plugins
> >>> 	- document potential other use cases of fused command
> >>> 	- drop support for builtin secondary sqe in SQE128, so all secondary
> >>> 	  requests has standalone SQE
> >>> 	- make fused command as one feature
> >>> 	- cleanup & improve naming
> >>
> >> Hi Ming, et al.,
> >>
> >> I started to wonder if fused SQE could be extended to combine multiple
> >> syscalls, for example open/read/close.  Which would be another solution
> >> for the readfile syscall Miklos had proposed some time ago.
> >>
> >> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/
> >>
> >> If fused SQEs could be extended, I think it would be quite helpful for
> >> many other patterns. Another similar examples would open/write/close,
> >> but ideal would be also to allow to have it more complex like
> >> "open/write/sync_file_range/close" - open/write/close might be the
> >> fastest and could possibly return before sync_file_range. Use case for
> >> the latter would be a file server that wants to give notifications to
> >> client when pages have been written out.
> > 
> > The above pattern needn't fused command, and it can be done by plain
> > SQEs chain, follows the usage:
> > 
> > 1) suppose you get one command from /dev/fuse, then FUSE daemon
> > needs to handle the command as open/write/sync/close
> > 2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK;
> > 3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK;
> > 4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK;
> > 5) get sqe4, prepare it for close syscall
> > 6) io_uring_enter();	//for submit and get events
> 
> Oh, I was not aware that IOSQE_IO_LINK could pass the result of open 
> down to the others. Hmm, the example I find for open is 
> io_uring_prep_openat_direct in test_open_fixed(). It probably gets off 
> topic here, but one needs to have ring prepared with 
> io_uring_register_files_sparse, then manually manages available indexes 
> and can then link commands? Interesting!

Yeah,  see test/fixed-reuse.c of liburing

> 
> > 
> > Then all the four OPs are done one by one by io_uring internal
> > machinery, and you can choose to get successful CQE for each OP.
> > 
> > Is the above what you want to do?
> > 
> > The fused command proposal is actually for zero copy(but not limited to zc).
> 
> Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to 
> support generic passing, as it kind of hands data (buffers) from one sqe 
> to the other. I.e. instead of buffers it would have passed the fd, but 
> if this is already possible - no need to make IORING_OP_FUSED_CMD more 
> complex.man

The way of passing FD introduces other cost, read op running into async,
and adding it into global table, which introduces runtime cost.

That is the reason why fused command is designed in the following way:

- link can be avoided, so OPs needn't to be run in async
- no need to add buffer into global table

Cause it is really in fast io path.

> 
> > 
> > If the above write OP need to write to file with in-kernel buffer
> > of /dev/fuse directly, you can get one sqe0 and prepare it for primary command
> > before 1), and set sqe2->addr to offet of the buffer in 3).
> > 
> > However, fused command is usually used in the following way, such as FUSE daemon
> > gets one READ request from /dev/fuse, FUSE userspace can handle the READ request
> > as io_uring fused command:
> > 
> > 1) get sqe0 and prepare it for primary command, in which you need to
> > provide info for retrieving kernel buffer/pages of this READ request
> > 
> > 2) suppose this READ request needs to be handled by translating it to
> > READs to two files/devices, considering it as one mirror:
> > 
> > - get sqe1, prepare it for read from file1, and set sqe->addr to offset
> >    of the buffer in 1), set sqe->len as length for read; this READ OP
> >    uses the kernel buffer in 1) directly
> > 
> > - get sqe2, prepare it for read from file2, and set sqe->addr to offset
> >    of buffer in 1), set sqe->len as length for read;  this READ OP
> >    uses the kernel buffer in 1) directly
> > 
> > 3) submit the three sqe by io_uring_enter()
> > 
> > sqe1 and sqe2 can be submitted concurrently or be issued one by one
> > in order, fused command supports both, and depends on user requirement.
> > But io_uring linked OPs is usually slower.
> > 
> > Also file1/file2 needs to be opened beforehand in this example, and FD is
> > passed to sqe1/sqe2, another choice is to use fixed File; Also you can
> > add the open/close() OPs into above steps, which need these open/close/READ
> > to be linked in order, usually slower tnan non-linked OPs.
> 
> 
> Yes thanks, I'm going to prepare this in an branch, otherwise current 
> fuse-uring would have a ZC regression (although my target ddn projects 
> cannot make use of it, as we need access to the buffer for checksums, etc).

storage has similar use case too, such as encrypt, nvme tcp data digest,
..., if the checksum/encrypt approach is standard, maybe one new OP or
syscall can be added for doing that on kernel buffer directly.


Thanks
Ming


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
  2023-04-19 11:19       ` Ming Lei
@ 2023-04-19 15:42         ` Bernd Schubert
  2023-04-20  1:18           ` Pavel Begunkov
  2023-04-20  1:38           ` Ming Lei
  0 siblings, 2 replies; 37+ messages in thread
From: Bernd Schubert @ 2023-04-19 15:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, [email protected], [email protected],
	[email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Amir Goldstein

On 4/19/23 13:19, Ming Lei wrote:
> On Wed, Apr 19, 2023 at 09:56:43AM +0000, Bernd Schubert wrote:
>> On 4/19/23 03:51, Ming Lei wrote:
>>> On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
>>>> On 3/30/23 13:36, Ming Lei wrote:
>>>> [...]
>>>>> V6:
>>>>> 	- re-design fused command, and make it more generic, moving sharing buffer
>>>>> 	as one plugin of fused command, so in future we can implement more plugins
>>>>> 	- document potential other use cases of fused command
>>>>> 	- drop support for builtin secondary sqe in SQE128, so all secondary
>>>>> 	  requests has standalone SQE
>>>>> 	- make fused command as one feature
>>>>> 	- cleanup & improve naming
>>>>
>>>> Hi Ming, et al.,
>>>>
>>>> I started to wonder if fused SQE could be extended to combine multiple
>>>> syscalls, for example open/read/close.  Which would be another solution
>>>> for the readfile syscall Miklos had proposed some time ago.
>>>>
>>>> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/
>>>>
>>>> If fused SQEs could be extended, I think it would be quite helpful for
>>>> many other patterns. Another similar examples would open/write/close,
>>>> but ideal would be also to allow to have it more complex like
>>>> "open/write/sync_file_range/close" - open/write/close might be the
>>>> fastest and could possibly return before sync_file_range. Use case for
>>>> the latter would be a file server that wants to give notifications to
>>>> client when pages have been written out.
>>>
>>> The above pattern needn't fused command, and it can be done by plain
>>> SQEs chain, follows the usage:
>>>
>>> 1) suppose you get one command from /dev/fuse, then FUSE daemon
>>> needs to handle the command as open/write/sync/close
>>> 2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK;
>>> 3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK;
>>> 4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK;
>>> 5) get sqe4, prepare it for close syscall
>>> 6) io_uring_enter();	//for submit and get events
>>
>> Oh, I was not aware that IOSQE_IO_LINK could pass the result of open
>> down to the others. Hmm, the example I find for open is
>> io_uring_prep_openat_direct in test_open_fixed(). It probably gets off
>> topic here, but one needs to have ring prepared with
>> io_uring_register_files_sparse, then manually manages available indexes
>> and can then link commands? Interesting!
> 
> Yeah,  see test/fixed-reuse.c of liburing
> 
>>
>>>
>>> Then all the four OPs are done one by one by io_uring internal
>>> machinery, and you can choose to get successful CQE for each OP.
>>>
>>> Is the above what you want to do?
>>>
>>> The fused command proposal is actually for zero copy(but not limited to zc).
>>
>> Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to
>> support generic passing, as it kind of hands data (buffers) from one sqe
>> to the other. I.e. instead of buffers it would have passed the fd, but
>> if this is already possible - no need to make IORING_OP_FUSED_CMD more
>> complex.man
> 
> The way of passing FD introduces other cost, read op running into async,
> and adding it into global table, which introduces runtime cost.

Hmm, question from my side is why it needs to be in the global table, 
when it could be just passed to the linked or fused sqe?

> 
> That is the reason why fused command is designed in the following way:
> 
> - link can be avoided, so OPs needn't to be run in async
> - no need to add buffer into global table
> 
> Cause it is really in fast io path.
> 
>>
>>>
>>> If the above write OP need to write to file with in-kernel buffer
>>> of /dev/fuse directly, you can get one sqe0 and prepare it for primary command
>>> before 1), and set sqe2->addr to offet of the buffer in 3).
>>>
>>> However, fused command is usually used in the following way, such as FUSE daemon
>>> gets one READ request from /dev/fuse, FUSE userspace can handle the READ request
>>> as io_uring fused command:
>>>
>>> 1) get sqe0 and prepare it for primary command, in which you need to
>>> provide info for retrieving kernel buffer/pages of this READ request
>>>
>>> 2) suppose this READ request needs to be handled by translating it to
>>> READs to two files/devices, considering it as one mirror:
>>>
>>> - get sqe1, prepare it for read from file1, and set sqe->addr to offset
>>>     of the buffer in 1), set sqe->len as length for read; this READ OP
>>>     uses the kernel buffer in 1) directly
>>>
>>> - get sqe2, prepare it for read from file2, and set sqe->addr to offset
>>>     of buffer in 1), set sqe->len as length for read;  this READ OP
>>>     uses the kernel buffer in 1) directly
>>>
>>> 3) submit the three sqe by io_uring_enter()
>>>
>>> sqe1 and sqe2 can be submitted concurrently or be issued one by one
>>> in order, fused command supports both, and depends on user requirement.
>>> But io_uring linked OPs is usually slower.
>>>
>>> Also file1/file2 needs to be opened beforehand in this example, and FD is
>>> passed to sqe1/sqe2, another choice is to use fixed File; Also you can
>>> add the open/close() OPs into above steps, which need these open/close/READ
>>> to be linked in order, usually slower tnan non-linked OPs.
>>
>>
>> Yes thanks, I'm going to prepare this in an branch, otherwise current
>> fuse-uring would have a ZC regression (although my target ddn projects
>> cannot make use of it, as we need access to the buffer for checksums, etc).
> 
> storage has similar use case too, such as encrypt, nvme tcp data digest,
> ..., if the checksum/encrypt approach is standard, maybe one new OP or
> syscall can be added for doing that on kernel buffer directly.

I very much see the use case for FUSED_CMD for overlay or simple network 
sockets. Now in the HPC world one typically uses IB  RDMA and if that 
fails for some reasons (like connection down), tcp or other interfaces 
as fallback. And there is sending the right part of the buffer to the 
right server and erasure coding involved - it gets complex and I don't 
think there is a way for us without a buffer copy.

Thanks,
Bernd

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
  2023-04-19 15:42         ` Bernd Schubert
@ 2023-04-20  1:18           ` Pavel Begunkov
  2023-04-20  1:38           ` Ming Lei
  1 sibling, 0 replies; 37+ messages in thread
From: Pavel Begunkov @ 2023-04-20  1:18 UTC (permalink / raw)
  To: Bernd Schubert, Ming Lei
  Cc: Jens Axboe, [email protected], [email protected],
	[email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Stefan Hajnoczi, Dan Williams, Amir Goldstein

On 4/19/23 16:42, Bernd Schubert wrote:
> On 4/19/23 13:19, Ming Lei wrote:
>> On Wed, Apr 19, 2023 at 09:56:43AM +0000, Bernd Schubert wrote:
>>> On 4/19/23 03:51, Ming Lei wrote:
>>>> On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
>>>>> On 3/30/23 13:36, Ming Lei wrote:
>>>>> [...]
>>>>>> V6:
>>>>>> 	- re-design fused command, and make it more generic, moving sharing buffer
>>>>>> 	as one plugin of fused command, so in future we can implement more plugins
>>>>>> 	- document potential other use cases of fused command
>>>>>> 	- drop support for builtin secondary sqe in SQE128, so all secondary
>>>>>> 	  requests has standalone SQE
>>>>>> 	- make fused command as one feature
>>>>>> 	- cleanup & improve naming
>>>>>
>>>>> Hi Ming, et al.,
>>>>>
>>>>> I started to wonder if fused SQE could be extended to combine multiple
>>>>> syscalls, for example open/read/close.  Which would be another solution
>>>>> for the readfile syscall Miklos had proposed some time ago.
>>>>>
>>>>> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/
>>>>>
>>>>> If fused SQEs could be extended, I think it would be quite helpful for
>>>>> many other patterns. Another similar examples would open/write/close,
>>>>> but ideal would be also to allow to have it more complex like
>>>>> "open/write/sync_file_range/close" - open/write/close might be the
>>>>> fastest and could possibly return before sync_file_range. Use case for
>>>>> the latter would be a file server that wants to give notifications to
>>>>> client when pages have been written out.
>>>>
>>>> The above pattern needn't fused command, and it can be done by plain
>>>> SQEs chain, follows the usage:
>>>>
>>>> 1) suppose you get one command from /dev/fuse, then FUSE daemon
>>>> needs to handle the command as open/write/sync/close
>>>> 2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK;
>>>> 3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK;
>>>> 4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK;
>>>> 5) get sqe4, prepare it for close syscall
>>>> 6) io_uring_enter();	//for submit and get events
>>>
>>> Oh, I was not aware that IOSQE_IO_LINK could pass the result of open
>>> down to the others. Hmm, the example I find for open is
>>> io_uring_prep_openat_direct in test_open_fixed(). It probably gets off
>>> topic here, but one needs to have ring prepared with
>>> io_uring_register_files_sparse, then manually manages available indexes
>>> and can then link commands? Interesting!
>>
>> Yeah,  see test/fixed-reuse.c of liburing
>>
>>>
>>>>
>>>> Then all the four OPs are done one by one by io_uring internal
>>>> machinery, and you can choose to get successful CQE for each OP.
>>>>
>>>> Is the above what you want to do?
>>>>
>>>> The fused command proposal is actually for zero copy(but not limited to zc).
>>>
>>> Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to
>>> support generic passing, as it kind of hands data (buffers) from one sqe
>>> to the other. I.e. instead of buffers it would have passed the fd, but
>>> if this is already possible - no need to make IORING_OP_FUSED_CMD more
>>> complex.man
>>
>> The way of passing FD introduces other cost, read op running into async,
>> and adding it into global table, which introduces runtime cost.
> 
> Hmm, question from my side is why it needs to be in the global table,
> when it could be just passed to the linked or fused sqe?

Because for every such type of state you need to write custom code,
it's not scalable, not to say that it usually can't be kept to a
specific operation and leaks into generic paths / other requests.

Some may want to pass a file or a buffer, there might be a need
to pass a result in some specific way (e.g. nr = recv(); send(nr)),
and the list continues...

I tried adding BPF in the middle ~2y ago, but it was no
different in perf than returning to the userspace, and gets
worse with higher submission batching. Maybe I need to test
it again.

>> That is the reason why fused command is designed in the following way:
>>
>> - link can be avoided, so OPs needn't to be run in async
>> - no need to add buffer into global table
>>
>> Cause it is really in fast io path.
>>
-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
  2023-04-19 15:42         ` Bernd Schubert
  2023-04-20  1:18           ` Pavel Begunkov
@ 2023-04-20  1:38           ` Ming Lei
  2023-04-21 22:38             ` Bernd Schubert
  1 sibling, 1 reply; 37+ messages in thread
From: Ming Lei @ 2023-04-20  1:38 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Jens Axboe, [email protected], [email protected],
	[email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Amir Goldstein, ming.lei

On Wed, Apr 19, 2023 at 03:42:40PM +0000, Bernd Schubert wrote:
> On 4/19/23 13:19, Ming Lei wrote:
> > On Wed, Apr 19, 2023 at 09:56:43AM +0000, Bernd Schubert wrote:
> >> On 4/19/23 03:51, Ming Lei wrote:
> >>> On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
> >>>> On 3/30/23 13:36, Ming Lei wrote:
> >>>> [...]
> >>>>> V6:
> >>>>> 	- re-design fused command, and make it more generic, moving sharing buffer
> >>>>> 	as one plugin of fused command, so in future we can implement more plugins
> >>>>> 	- document potential other use cases of fused command
> >>>>> 	- drop support for builtin secondary sqe in SQE128, so all secondary
> >>>>> 	  requests has standalone SQE
> >>>>> 	- make fused command as one feature
> >>>>> 	- cleanup & improve naming
> >>>>
> >>>> Hi Ming, et al.,
> >>>>
> >>>> I started to wonder if fused SQE could be extended to combine multiple
> >>>> syscalls, for example open/read/close.  Which would be another solution
> >>>> for the readfile syscall Miklos had proposed some time ago.
> >>>>
> >>>> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@mail.gmail.com/
> >>>>
> >>>> If fused SQEs could be extended, I think it would be quite helpful for
> >>>> many other patterns. Another similar examples would open/write/close,
> >>>> but ideal would be also to allow to have it more complex like
> >>>> "open/write/sync_file_range/close" - open/write/close might be the
> >>>> fastest and could possibly return before sync_file_range. Use case for
> >>>> the latter would be a file server that wants to give notifications to
> >>>> client when pages have been written out.
> >>>
> >>> The above pattern needn't fused command, and it can be done by plain
> >>> SQEs chain, follows the usage:
> >>>
> >>> 1) suppose you get one command from /dev/fuse, then FUSE daemon
> >>> needs to handle the command as open/write/sync/close
> >>> 2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK;
> >>> 3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK;
> >>> 4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK;
> >>> 5) get sqe4, prepare it for close syscall
> >>> 6) io_uring_enter();	//for submit and get events
> >>
> >> Oh, I was not aware that IOSQE_IO_LINK could pass the result of open
> >> down to the others. Hmm, the example I find for open is
> >> io_uring_prep_openat_direct in test_open_fixed(). It probably gets off
> >> topic here, but one needs to have ring prepared with
> >> io_uring_register_files_sparse, then manually manages available indexes
> >> and can then link commands? Interesting!
> > 
> > Yeah,  see test/fixed-reuse.c of liburing
> > 
> >>
> >>>
> >>> Then all the four OPs are done one by one by io_uring internal
> >>> machinery, and you can choose to get successful CQE for each OP.
> >>>
> >>> Is the above what you want to do?
> >>>
> >>> The fused command proposal is actually for zero copy(but not limited to zc).
> >>
> >> Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to
> >> support generic passing, as it kind of hands data (buffers) from one sqe
> >> to the other. I.e. instead of buffers it would have passed the fd, but
> >> if this is already possible - no need to make IORING_OP_FUSED_CMD more
> >> complex.man
> > 
> > The way of passing FD introduces other cost, read op running into async,
> > and adding it into global table, which introduces runtime cost.
> 
> Hmm, question from my side is why it needs to be in the global table, 
> when it could be just passed to the linked or fused sqe?

Any data which crosses OPs need be registered to somewhere, such as
fixed buffer, fixed FD, here global meant context wide, and it is actually from
OP/SQE viewpoint.

Fused command actually is one whole command logically, even though it
may includes multiple SQEs. Then registration as context wide isn't
needn't(since it is known buffer sharing isn't context wide, and just
among several IOs), meantime dependency is avoided, so link isn't needed.

This way helps performance a lot, such as, in test on ublk/loop over tmpfs,
iops drops to 1/2 with registration in 4k rand io, but fused command actually
improves iops a bit, baseline is current in-tree ublk driver/ublksrv.

> 
> > 
> > That is the reason why fused command is designed in the following way:
> > 
> > - link can be avoided, so OPs needn't to be run in async
> > - no need to add buffer into global table
> > 
> > Cause it is really in fast io path.
> > 
> >>
> >>>
> >>> If the above write OP need to write to file with in-kernel buffer
> >>> of /dev/fuse directly, you can get one sqe0 and prepare it for primary command
> >>> before 1), and set sqe2->addr to offet of the buffer in 3).
> >>>
> >>> However, fused command is usually used in the following way, such as FUSE daemon
> >>> gets one READ request from /dev/fuse, FUSE userspace can handle the READ request
> >>> as io_uring fused command:
> >>>
> >>> 1) get sqe0 and prepare it for primary command, in which you need to
> >>> provide info for retrieving kernel buffer/pages of this READ request
> >>>
> >>> 2) suppose this READ request needs to be handled by translating it to
> >>> READs to two files/devices, considering it as one mirror:
> >>>
> >>> - get sqe1, prepare it for read from file1, and set sqe->addr to offset
> >>>     of the buffer in 1), set sqe->len as length for read; this READ OP
> >>>     uses the kernel buffer in 1) directly
> >>>
> >>> - get sqe2, prepare it for read from file2, and set sqe->addr to offset
> >>>     of buffer in 1), set sqe->len as length for read;  this READ OP
> >>>     uses the kernel buffer in 1) directly
> >>>
> >>> 3) submit the three sqe by io_uring_enter()
> >>>
> >>> sqe1 and sqe2 can be submitted concurrently or be issued one by one
> >>> in order, fused command supports both, and depends on user requirement.
> >>> But io_uring linked OPs is usually slower.
> >>>
> >>> Also file1/file2 needs to be opened beforehand in this example, and FD is
> >>> passed to sqe1/sqe2, another choice is to use fixed File; Also you can
> >>> add the open/close() OPs into above steps, which need these open/close/READ
> >>> to be linked in order, usually slower tnan non-linked OPs.
> >>
> >>
> >> Yes thanks, I'm going to prepare this in an branch, otherwise current
> >> fuse-uring would have a ZC regression (although my target ddn projects
> >> cannot make use of it, as we need access to the buffer for checksums, etc).
> > 
> > storage has similar use case too, such as encrypt, nvme tcp data digest,
> > ..., if the checksum/encrypt approach is standard, maybe one new OP or
> > syscall can be added for doing that on kernel buffer directly.
> 
> I very much see the use case for FUSED_CMD for overlay or simple network 
> sockets. Now in the HPC world one typically uses IB  RDMA and if that 
> fails for some reasons (like connection down), tcp or other interfaces 
> as fallback. And there is sending the right part of the buffer to the 
> right server and erasure coding involved - it gets complex and I don't 
> think there is a way for us without a buffer copy.

As I mentioned, it(checksum, encrypt, ...) becomes one generic issue if
the zero copy approach is accepted, meantime the problem itself is well-defined,
so I don't worry no solution can be figured out.

Meantime big memory copy does consume both cpu and memory bandwidth a
lot, and 64k/512k ublk io has shown this big difference wrt. copy vs.
zero copy.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD
  2023-04-20  1:38           ` Ming Lei
@ 2023-04-21 22:38             ` Bernd Schubert
  0 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2023-04-21 22:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, [email protected], [email protected],
	[email protected], Miklos Szeredi, ZiyangZhang,
	Xiaoguang Wang, Pavel Begunkov, Stefan Hajnoczi, Dan Williams,
	Amir Goldstein

On 4/20/23 03:38, Ming Lei wrote:
> On Wed, Apr 19, 2023 at 03:42:40PM +0000, Bernd Schubert wrote:
>> I very much see the use case for FUSED_CMD for overlay or simple network
>> sockets. Now in the HPC world one typically uses IB  RDMA and if that
>> fails for some reasons (like connection down), tcp or other interfaces
>> as fallback. And there is sending the right part of the buffer to the
>> right server and erasure coding involved - it gets complex and I don't
>> think there is a way for us without a buffer copy.
> 
> As I mentioned, it(checksum, encrypt, ...) becomes one generic issue if
> the zero copy approach is accepted, meantime the problem itself is well-defined,
> so I don't worry no solution can be figured out.
> 
> Meantime big memory copy does consume both cpu and memory bandwidth a
> lot, and 64k/512k ublk io has shown this big difference wrt. copy vs.
> zero copy.

I don't have any doubt about that, but I believe there is no current way 
to support it in all use cases. As example, let's consider we would like 
to extend nbd with verbs/rdma instead of plain tcp  - verbs/rdma needs 
registered memory and does not take a simple socket fd to send buffers to.


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2023-04-21 22:39 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-03-30 11:36 [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
2023-03-30 11:36 ` [PATCH V6 01/17] io_uring: increase io_kiocb->flags into 64bit Ming Lei
2023-03-30 11:36 ` [PATCH V6 02/17] io_uring: use ctx->cached_sq_head to calculate left sqes Ming Lei
2023-03-30 11:36 ` [PATCH V6 03/17] io_uring: add generic IORING_OP_FUSED_CMD Ming Lei
2023-04-01 14:35   ` Ming Lei
2023-03-30 11:36 ` [PATCH V6 04/17] io_uring: support providing buffer by IORING_OP_FUSED_CMD Ming Lei
2023-03-30 11:36 ` [PATCH V6 05/17] io_uring: support OP_READ/OP_WRITE for fused secondary request Ming Lei
2023-03-30 11:36 ` [PATCH V6 06/17] io_uring: support OP_SEND_ZC/OP_RECV " Ming Lei
2023-03-30 11:36 ` [PATCH V6 07/17] block: ublk_drv: add common exit handling Ming Lei
2023-03-30 11:36 ` [PATCH V6 08/17] block: ublk_drv: don't consider flush request in map/unmap io Ming Lei
2023-03-30 11:36 ` [PATCH V6 09/17] block: ublk_drv: add two helpers to clean up map/unmap request Ming Lei
2023-03-30 11:36 ` [PATCH V6 10/17] block: ublk_drv: clean up several helpers Ming Lei
2023-03-30 11:36 ` [PATCH V6 11/17] block: ublk_drv: cleanup 'struct ublk_map_data' Ming Lei
2023-03-30 11:36 ` [PATCH V6 12/17] block: ublk_drv: cleanup ublk_copy_user_pages Ming Lei
2023-03-31 16:22   ` Bernd Schubert
2023-03-30 11:36 ` [PATCH V6 13/17] block: ublk_drv: grab request reference when the request is handled by userspace Ming Lei
2023-03-30 11:36 ` [PATCH V6 14/17] block: ublk_drv: support to copy any part of request pages Ming Lei
2023-03-30 11:36 ` [PATCH V6 15/17] block: ublk_drv: add read()/write() support for ublk char device Ming Lei
2023-03-30 11:36 ` [PATCH V6 16/17] block: ublk_drv: don't check buffer in case of zero copy Ming Lei
2023-03-30 11:36 ` [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting " Ming Lei
2023-03-31 19:13   ` Bernd Schubert
2023-04-01 13:19     ` Ming Lei
2023-03-31 19:55   ` Bernd Schubert
2023-04-01 13:22     ` Ming Lei
2023-04-03  9:25     ` Bernd Schubert
2023-04-03  1:11 ` [PATCH V6 00/17] io_uring/ublk: add generic IORING_OP_FUSED_CMD Ming Lei
2023-04-03  1:24   ` Jens Axboe
2023-04-04  7:48     ` Ming Lei
2023-04-03  1:23 ` (subset) " Jens Axboe
2023-04-18 19:38 ` Bernd Schubert
2023-04-19  1:51   ` Ming Lei
2023-04-19  9:56     ` Bernd Schubert
2023-04-19 11:19       ` Ming Lei
2023-04-19 15:42         ` Bernd Schubert
2023-04-20  1:18           ` Pavel Begunkov
2023-04-20  1:38           ` Ming Lei
2023-04-21 22:38             ` Bernd Schubert

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox