* [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf
@ 2024-09-12 10:49 Ming Lei
2024-09-12 10:49 ` [PATCH V6 1/8] io_uring: add io_link_req() helper Ming Lei
` (8 more replies)
0 siblings, 9 replies; 47+ messages in thread
From: Ming Lei @ 2024-09-12 10:49 UTC (permalink / raw)
To: Jens Axboe, io-uring, Pavel Begunkov; +Cc: linux-block, Ming Lei
Hello,
The first 3 patches are cleanups that prepare for adding sqe group support.
The 4th patch adds generic sqe group support. A group is similar to a link
chain, but all SQEs in a group may be issued in parallel, and the whole group
shares a single IO_LINK & IO_DRAIN boundary, so N:M dependencies can be built
by combining sqe groups with io links (see the layout sketch below). sqe
groups change nothing about IOSQE_IO_LINK semantics.
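The following layout sketch is for illustration only; the exact rules are in
patch 4. Only the IOSQE_SQE_GROUP/IOSQE_IO_LINK flags come from this series,
the rest of each SQE is filled in per-op as usual:

	SQE[k]   : IOSQE_SQE_GROUP | IOSQE_IO_LINK  <- leader; link/drain apply to the whole group
	SQE[k+1] : IOSQE_SQE_GROUP                  <- member
	SQE[k+2] : (no IOSQE_SQE_GROUP)             <- last member, closes the group
	SQE[k+3] : ...                              <- runs only after the whole group completes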
The 5th patch adds one variant of sqe group: members may depend on the group
leader, so that the lifetime of a kernel resource can be aligned with the
group leader (and hence the whole group). Any kernel resource can then be
shared within the sqe group, which enables generic device zero copy.
The 6th & 7th patches support providing a kernel buffer to the group via this
sqe group variant.
The 8th patch adds ublk zero copy support on top of the io_uring-provided
sqe group buffer.
Tests:
1) pass liburing test
- make runtests
2) write and pass sqe group test cases and an sqe provide-buffer case:
https://github.com/axboe/liburing/compare/master...ming1:liburing:sqe_group_v3
https://github.com/ming1/liburing/tree/sqe_group_v3
- covers the related sqe flag combinations and linked groups, with both nop
and a multi-destination file copy.
- covers failure handling: fail leader IO or member IO in both a single
group and linked groups, for each sqe flag combination tested
- covers IORING_PROVIDE_GROUP_KBUF by adding ublk-loop-zc
3) ublksrv zero copy:
ublksrv implements zero copy in userspace via sqe group & provide group
kbuf:
git clone https://github.com/ublk-org/ublksrv.git -b group-provide-buf_v3
make test T=loop/009:nbd/061 #ublk zc tests
When running 64KB/512KB block size tests on ublk-loop ('ublk add -t loop --buffered_io -f $backing'),
throughput is observed to roughly double.
V6:
- follow Pavel's suggestion to disallow IOSQE_CQE_SKIP_SUCCESS &
LINK_TIMEOUT
- kill __io_complete_group_member() (Pavel)
- simplify link failure handling (Pavel)
- move members' queuing out of completion lock (Pavel)
- clean up the group io completion handler
- add more comments
- add ublk zc into liburing test for covering
IOSQE_SQE_GROUP & IORING_PROVIDE_GROUP_KBUF
V5:
- follow Pavel's suggestion to minimize changes to the io_uring fast code
path: sqe group code is called via a single 'if (unlikely())' from
both the issue & completion code paths
- simplify & rewrite group request completion
avoid touching io-wq code by completing the group leader via tw
directly, just like ->task_complete
rewrite group member & leader completion handling; one
simplification is to always free the leader via the last member
simplify queueing of group members; issuing leader and members
in parallel isn't supported
- fail the whole group if IO_*LINK or IO_DRAIN is set on group
members, and add test code to cover this change
- misc cleanup
V4:
- address most comments from Pavel
- fix request double free
- don't use io_req_commit_cqe() in io_req_complete_defer()
- make members' REQ_F_INFLIGHT discoverable
- use common assembling check in submission code path
- drop patch 3 and don't move REQ_F_CQE_SKIP out of io_free_req()
- don't set .accept_group_kbuf for net send zc, where members need to
be queued only after the buffer notification is received; this can be
enabled in the future
- add .grp_leader field via union, and share storage with .grp_link
- move .grp_refs into one hole of io_kiocb, so that one extra
cacheline isn't needed for io_kiocb
- cleanup & document improvement
V3:
- add IORING_FEAT_SQE_GROUP
- simplify group completion, and minimize change on io_req_complete_defer()
- simplify & cleanup io_queue_group_members()
- fix many failure handling issues
- cover failure handling code in added liburing tests
- remove RFC
V2:
- add generic sqe group, suggested by Kevin Wolf
- add REQ_F_SQE_GROUP_DEP which is based on IOSQE_SQE_GROUP, for sharing
kernel resource in group wide, suggested by Kevin Wolf
- remove the sqe ext flag and use the last bit for IOSQE_SQE_GROUP (Pavel);
in the future sqe flags can still be extended via a uring context flag
- initialize group requests via submit state pattern, suggested by Pavel
- all kinds of cleanup & bug fixes
Ming Lei (8):
io_uring: add io_link_req() helper
io_uring: add io_submit_fail_link() helper
io_uring: add helper of io_req_commit_cqe()
io_uring: support SQE group
io_uring: support sqe group with members depending on leader
io_uring: support providing sqe group buffer
io_uring/uring_cmd: support provide group kernel buffer
ublk: support provide io buffer
drivers/block/ublk_drv.c | 160 +++++++++++++-
include/linux/io_uring/cmd.h | 7 +
include/linux/io_uring_types.h | 54 +++++
include/uapi/linux/io_uring.h | 11 +-
include/uapi/linux/ublk_cmd.h | 7 +-
io_uring/io_uring.c | 370 ++++++++++++++++++++++++++++++---
io_uring/io_uring.h | 16 ++
io_uring/kbuf.c | 60 ++++++
io_uring/kbuf.h | 13 ++
io_uring/net.c | 23 +-
io_uring/opdef.c | 4 +
io_uring/opdef.h | 2 +
io_uring/rw.c | 20 +-
io_uring/timeout.c | 6 +
io_uring/uring_cmd.c | 28 +++
15 files changed, 735 insertions(+), 46 deletions(-)
--
2.42.0
* [PATCH V6 1/8] io_uring: add io_link_req() helper
2024-09-12 10:49 [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
@ 2024-09-12 10:49 ` Ming Lei
2024-09-12 10:49 ` [PATCH V6 2/8] io_uring: add io_submit_fail_link() helper Ming Lei
` (7 subsequent siblings)
8 siblings, 0 replies; 47+ messages in thread
From: Ming Lei @ 2024-09-12 10:49 UTC (permalink / raw)
To: Jens Axboe, io-uring, Pavel Begunkov; +Cc: linux-block, Ming Lei
Add io_link_req() helper, so that io_submit_sqe() can become more
readable.
Signed-off-by: Ming Lei <[email protected]>
---
io_uring/io_uring.c | 41 +++++++++++++++++++++++++++--------------
1 file changed, 27 insertions(+), 14 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1aca501efaf6..8ed4f40470e3 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2132,19 +2132,11 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
return 0;
}
-static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
- const struct io_uring_sqe *sqe)
- __must_hold(&ctx->uring_lock)
+/*
+ * Return NULL if nothing to be queued, otherwise return request for queueing */
+static struct io_kiocb *io_link_sqe(struct io_submit_link *link,
+ struct io_kiocb *req)
{
- struct io_submit_link *link = &ctx->submit_state.link;
- int ret;
-
- ret = io_init_req(ctx, req, sqe);
- if (unlikely(ret))
- return io_submit_fail_init(sqe, req, ret);
-
- trace_io_uring_submit_req(req);
-
/*
* If we already have a head request, queue this one for async
* submittal once the head completes. If we don't have a head but
@@ -2158,7 +2150,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
link->last = req;
if (req->flags & IO_REQ_LINK_FLAGS)
- return 0;
+ return NULL;
/* last request of the link, flush it */
req = link->head;
link->head = NULL;
@@ -2174,9 +2166,30 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
fallback:
io_queue_sqe_fallback(req);
}
- return 0;
+ return NULL;
}
+ return req;
+}
+
+static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
+ const struct io_uring_sqe *sqe)
+ __must_hold(&ctx->uring_lock)
+{
+ struct io_submit_link *link = &ctx->submit_state.link;
+ int ret;
+ ret = io_init_req(ctx, req, sqe);
+ if (unlikely(ret))
+ return io_submit_fail_init(sqe, req, ret);
+
+ trace_io_uring_submit_req(req);
+
+ if (unlikely(link->head || (req->flags & (IO_REQ_LINK_FLAGS |
+ REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) {
+ req = io_link_sqe(link, req);
+ if (!req)
+ return 0;
+ }
io_queue_sqe(req);
return 0;
}
--
2.42.0
* [PATCH V6 2/8] io_uring: add io_submit_fail_link() helper
2024-09-12 10:49 [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
2024-09-12 10:49 ` [PATCH V6 1/8] io_uring: add io_link_req() helper Ming Lei
@ 2024-09-12 10:49 ` Ming Lei
2024-09-12 10:49 ` [PATCH V6 3/8] io_uring: add helper of io_req_commit_cqe() Ming Lei
` (6 subsequent siblings)
8 siblings, 0 replies; 47+ messages in thread
From: Ming Lei @ 2024-09-12 10:49 UTC (permalink / raw)
To: Jens Axboe, io-uring, Pavel Begunkov; +Cc: linux-block, Ming Lei
Add an io_submit_fail_link() helper and move the link failure logic into
it.
This simplifies io_submit_fail_init() and makes it easier to add the sqe
group failure handling later.
Signed-off-by: Ming Lei <[email protected]>
---
io_uring/io_uring.c | 22 ++++++++++++++++------
1 file changed, 16 insertions(+), 6 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8ed4f40470e3..7454532d0e8e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2095,22 +2095,17 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
return def->prep(req, sqe);
}
-static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
+static __cold int io_submit_fail_link(struct io_submit_link *link,
struct io_kiocb *req, int ret)
{
- struct io_ring_ctx *ctx = req->ctx;
- struct io_submit_link *link = &ctx->submit_state.link;
struct io_kiocb *head = link->head;
- trace_io_uring_req_failed(sqe, req, ret);
-
/*
* Avoid breaking links in the middle as it renders links with SQPOLL
* unusable. Instead of failing eagerly, continue assembling the link if
* applicable and mark the head with REQ_F_FAIL. The link flushing code
* should find the flag and handle the rest.
*/
- req_fail_link_node(req, ret);
if (head && !(head->flags & REQ_F_FAIL))
req_fail_link_node(head, -ECANCELED);
@@ -2129,9 +2124,24 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
else
link->head = req;
link->last = req;
+
return 0;
}
+static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
+ struct io_kiocb *req, int ret)
+{
+ struct io_ring_ctx *ctx = req->ctx;
+ struct io_submit_link *link = &ctx->submit_state.link;
+
+ trace_io_uring_req_failed(sqe, req, ret);
+
+ req_fail_link_node(req, ret);
+
+ /* cover both linked and non-linked request */
+ return io_submit_fail_link(link, req, ret);
+}
+
/*
* Return NULL if nothing to be queued, otherwise return request for queueing */
static struct io_kiocb *io_link_sqe(struct io_submit_link *link,
--
2.42.0
* [PATCH V6 3/8] io_uring: add helper of io_req_commit_cqe()
2024-09-12 10:49 [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
2024-09-12 10:49 ` [PATCH V6 1/8] io_uring: add io_link_req() helper Ming Lei
2024-09-12 10:49 ` [PATCH V6 2/8] io_uring: add io_submit_fail_link() helper Ming Lei
@ 2024-09-12 10:49 ` Ming Lei
2024-09-12 10:49 ` [PATCH V6 4/8] io_uring: support SQE group Ming Lei
` (5 subsequent siblings)
8 siblings, 0 replies; 47+ messages in thread
From: Ming Lei @ 2024-09-12 10:49 UTC (permalink / raw)
To: Jens Axboe, io-uring, Pavel Begunkov; +Cc: linux-block, Ming Lei
Add an io_req_commit_cqe() helper to simplify
__io_submit_flush_completions() a bit.
No functional change; the helper will be reused by the sqe group code
under the same locking rules.
Signed-off-by: Ming Lei <[email protected]>
---
io_uring/io_uring.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7454532d0e8e..d277f0a6e549 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -861,6 +861,20 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
return posted;
}
+static __always_inline void io_req_commit_cqe(struct io_ring_ctx *ctx,
+ struct io_kiocb *req)
+{
+ if (unlikely(!io_fill_cqe_req(ctx, req))) {
+ if (ctx->lockless_cq) {
+ spin_lock(&ctx->completion_lock);
+ io_req_cqe_overflow(req);
+ spin_unlock(&ctx->completion_lock);
+ } else {
+ io_req_cqe_overflow(req);
+ }
+ }
+}
+
static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
@@ -1413,16 +1427,8 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
struct io_kiocb *req = container_of(node, struct io_kiocb,
comp_list);
- if (!(req->flags & REQ_F_CQE_SKIP) &&
- unlikely(!io_fill_cqe_req(ctx, req))) {
- if (ctx->lockless_cq) {
- spin_lock(&ctx->completion_lock);
- io_req_cqe_overflow(req);
- spin_unlock(&ctx->completion_lock);
- } else {
- io_req_cqe_overflow(req);
- }
- }
+ if (!(req->flags & REQ_F_CQE_SKIP))
+ io_req_commit_cqe(ctx, req);
}
__io_cq_unlock_post(ctx);
--
2.42.0
* [PATCH V6 4/8] io_uring: support SQE group
2024-09-12 10:49 [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
` (2 preceding siblings ...)
2024-09-12 10:49 ` [PATCH V6 3/8] io_uring: add helper of io_req_commit_cqe() Ming Lei
@ 2024-09-12 10:49 ` Ming Lei
2024-10-04 13:12 ` Pavel Begunkov
2024-09-12 10:49 ` [PATCH V6 5/8] io_uring: support sqe group with members depending on leader Ming Lei
` (4 subsequent siblings)
8 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-09-12 10:49 UTC (permalink / raw)
To: Jens Axboe, io-uring, Pavel Begunkov; +Cc: linux-block, Ming Lei, Kevin Wolf
An SQE group is defined as a chain of SQEs starting with the first SQE that
has IOSQE_SQE_GROUP set and ending with the first subsequent SQE that
doesn't have it set; it is similar to a chain of linked SQEs.
Unlike linked SQEs, where each sqe is issued only after the previous one
completes, all SQEs in one group can be submitted in parallel. To keep the
initial implementation simple, all members are queued only after the leader
completes; this may change in the future so that leader and members can be
issued concurrently.
The 1st SQE is the group leader, and the other SQEs are group members. The
whole group shares the single IOSQE_IO_LINK and IOSQE_IO_DRAIN of the group
leader, and the two flags can't be set for group members. For the sake of
simplicity, IORING_OP_LINK_TIMEOUT is disallowed for SQE groups for now.
When the group is part of a link chain, the group isn't submitted until the
previous SQE or group completes, and the following SQE or group can't be
started until this group completes. A failure from any group member fails
the group leader, which then lets the link chain be terminated.
When IOSQE_IO_DRAIN is set on the group leader, all requests in this group
and all previously submitted requests are drained. Given that IOSQE_IO_DRAIN
can be set on the group leader only, IO_DRAIN is respected by always
completing the group leader as the last request in the group. It is also
natural, from the application's viewpoint, to post the leader's CQE last.
Working together with IOSQE_IO_LINK, SQE groups provide a flexible way to
support N:M dependencies, for example:
- group A is chained with group B
- group A has N SQEs
- group B has M SQEs
then the M SQEs in group B depend on the N SQEs in group A.
N:M dependencies enable some interesting use cases in an efficient way:
1) read from multiple files, then write the read data into a single file
2) read from a single file, then write the read data into multiple files
3) write the same data into multiple files, then read it back from those
files and verify that the correct data was written
A rough userspace sketch of case 1) is shown below.
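The sketch is for illustration only and isn't part of the patch: the ring,
fds and buffers (fd_in0/fd_in1/fd_out, buf0/buf1, len) are assumed to be set
up elsewhere, and only the IOSQE_SQE_GROUP/IOSQE_IO_LINK flag placement comes
from this series:

	struct io_uring_sqe *sqe;
	struct iovec iov[2] = {
		{ .iov_base = buf0, .iov_len = len },
		{ .iov_base = buf1, .iov_len = len },
	};

	/* group A: two reads issued in parallel */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, fd_in0, buf0, len, 0);
	/* leader: carries the group flag plus the link to what follows the group */
	sqe->flags |= IOSQE_SQE_GROUP | IOSQE_IO_LINK;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, fd_in1, buf1, len, 0);
	/* last member: IOSQE_SQE_GROUP left clear, this SQE closes group A */

	/* linked stage: starts only after both reads of group A have completed */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_writev(sqe, fd_out, iov, 2, 0);

	io_uring_submit(ring);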
IOSQE_SQE_GROUP also takes the last bit in sqe->flags, but sqe->flags can
still be extended via an io_uring context flag, e.g. by using __pad3 for
non-uring_cmd OPs and part of uring_cmd_flags for the uring_cmd OP.
Suggested-by: Kevin Wolf <[email protected]>
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/io_uring_types.h | 18 ++
include/uapi/linux/io_uring.h | 4 +
io_uring/io_uring.c | 299 +++++++++++++++++++++++++++++++--
io_uring/io_uring.h | 6 +
io_uring/timeout.c | 6 +
5 files changed, 318 insertions(+), 15 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 4b9ba523978d..11c6726abbb9 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -202,6 +202,8 @@ struct io_submit_state {
/* batch completion logic */
struct io_wq_work_list compl_reqs;
struct io_submit_link link;
+ /* points to current group */
+ struct io_submit_link group;
bool plug_started;
bool need_plug;
@@ -438,6 +440,7 @@ enum {
REQ_F_FORCE_ASYNC_BIT = IOSQE_ASYNC_BIT,
REQ_F_BUFFER_SELECT_BIT = IOSQE_BUFFER_SELECT_BIT,
REQ_F_CQE_SKIP_BIT = IOSQE_CQE_SKIP_SUCCESS_BIT,
+ REQ_F_SQE_GROUP_BIT = IOSQE_SQE_GROUP_BIT,
/* first byte is taken by user flags, shift it to not overlap */
REQ_F_FAIL_BIT = 8,
@@ -468,6 +471,7 @@ enum {
REQ_F_BL_EMPTY_BIT,
REQ_F_BL_NO_RECYCLE_BIT,
REQ_F_BUFFERS_COMMIT_BIT,
+ REQ_F_SQE_GROUP_LEADER_BIT,
/* not a real bit, just to check we're not overflowing the space */
__REQ_F_LAST_BIT,
@@ -491,6 +495,8 @@ enum {
REQ_F_BUFFER_SELECT = IO_REQ_FLAG(REQ_F_BUFFER_SELECT_BIT),
/* IOSQE_CQE_SKIP_SUCCESS */
REQ_F_CQE_SKIP = IO_REQ_FLAG(REQ_F_CQE_SKIP_BIT),
+ /* IOSQE_SQE_GROUP */
+ REQ_F_SQE_GROUP = IO_REQ_FLAG(REQ_F_SQE_GROUP_BIT),
/* fail rest of links */
REQ_F_FAIL = IO_REQ_FLAG(REQ_F_FAIL_BIT),
@@ -546,6 +552,8 @@ enum {
REQ_F_BL_NO_RECYCLE = IO_REQ_FLAG(REQ_F_BL_NO_RECYCLE_BIT),
/* buffer ring head needs incrementing on put */
REQ_F_BUFFERS_COMMIT = IO_REQ_FLAG(REQ_F_BUFFERS_COMMIT_BIT),
+ /* sqe group lead */
+ REQ_F_SQE_GROUP_LEADER = IO_REQ_FLAG(REQ_F_SQE_GROUP_LEADER_BIT),
};
typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
@@ -651,6 +659,8 @@ struct io_kiocb {
void *async_data;
/* linked requests, IFF REQ_F_HARDLINK or REQ_F_LINK are set */
atomic_t poll_refs;
+ /* reference for group leader request */
+ int grp_refs;
struct io_kiocb *link;
/* custom credentials, valid IFF REQ_F_CREDS is set */
const struct cred *creds;
@@ -660,6 +670,14 @@ struct io_kiocb {
u64 extra1;
u64 extra2;
} big_cqe;
+
+ union {
+ /* links all group members for leader */
+ struct io_kiocb *grp_link;
+
+ /* points to group leader for member */
+ struct io_kiocb *grp_leader;
+ };
};
struct io_overflow_cqe {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 7b15216a3d7f..2af32745ebd3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -124,6 +124,7 @@ enum io_uring_sqe_flags_bit {
IOSQE_ASYNC_BIT,
IOSQE_BUFFER_SELECT_BIT,
IOSQE_CQE_SKIP_SUCCESS_BIT,
+ IOSQE_SQE_GROUP_BIT,
};
/*
@@ -143,6 +144,8 @@ enum io_uring_sqe_flags_bit {
#define IOSQE_BUFFER_SELECT (1U << IOSQE_BUFFER_SELECT_BIT)
/* don't post CQE if request succeeded */
#define IOSQE_CQE_SKIP_SUCCESS (1U << IOSQE_CQE_SKIP_SUCCESS_BIT)
+/* defines sqe group */
+#define IOSQE_SQE_GROUP (1U << IOSQE_SQE_GROUP_BIT)
/*
* io_uring_setup() flags
@@ -554,6 +557,7 @@ struct io_uring_params {
#define IORING_FEAT_REG_REG_RING (1U << 13)
#define IORING_FEAT_RECVSEND_BUNDLE (1U << 14)
#define IORING_FEAT_MIN_TIMEOUT (1U << 15)
+#define IORING_FEAT_SQE_GROUP (1U << 16)
/*
* io_uring_register(2) opcodes and arguments
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index d277f0a6e549..99b44b6babd6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -111,13 +111,15 @@
IOSQE_IO_HARDLINK | IOSQE_ASYNC)
#define SQE_VALID_FLAGS (SQE_COMMON_FLAGS | IOSQE_BUFFER_SELECT | \
- IOSQE_IO_DRAIN | IOSQE_CQE_SKIP_SUCCESS)
+ IOSQE_IO_DRAIN | IOSQE_CQE_SKIP_SUCCESS | \
+ IOSQE_SQE_GROUP)
#define IO_REQ_CLEAN_FLAGS (REQ_F_BUFFER_SELECTED | REQ_F_NEED_CLEANUP | \
REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS | \
REQ_F_ASYNC_DATA)
#define IO_REQ_CLEAN_SLOW_FLAGS (REQ_F_REFCOUNT | REQ_F_LINK | REQ_F_HARDLINK |\
+ REQ_F_SQE_GROUP | REQ_F_SQE_GROUP_LEADER | \
IO_REQ_CLEAN_FLAGS)
#define IO_TCTX_REFS_CACHE_NR (1U << 10)
@@ -875,6 +877,119 @@ static __always_inline void io_req_commit_cqe(struct io_ring_ctx *ctx,
}
}
+/* Can only be called after this request is issued */
+static inline struct io_kiocb *get_group_leader(struct io_kiocb *req)
+{
+ if (req->flags & REQ_F_SQE_GROUP) {
+ if (req_is_group_leader(req))
+ return req;
+ return req->grp_leader;
+ }
+ return NULL;
+}
+
+void io_cancel_group_members(struct io_kiocb *req, bool ignore_cqes)
+{
+ struct io_kiocb *member = req->grp_link;
+
+ while (member) {
+ struct io_kiocb *next = member->grp_link;
+
+ if (ignore_cqes)
+ member->flags |= REQ_F_CQE_SKIP;
+ if (!(member->flags & REQ_F_FAIL)) {
+ req_set_fail(member);
+ io_req_set_res(member, -ECANCELED, 0);
+ }
+ member = next;
+ }
+}
+
+static void io_queue_group_members(struct io_kiocb *req)
+{
+ struct io_kiocb *member = req->grp_link;
+
+ if (!member)
+ return;
+
+ req->grp_link = NULL;
+ while (member) {
+ struct io_kiocb *next = member->grp_link;
+
+ member->grp_leader = req;
+ if (unlikely(member->flags & REQ_F_FAIL)) {
+ io_req_task_queue_fail(member, member->cqe.res);
+ } else if (unlikely(req->flags & REQ_F_FAIL)) {
+ io_req_task_queue_fail(member, -ECANCELED);
+ } else {
+ io_req_task_queue(member);
+ }
+ member = next;
+ }
+}
+
+/* called only after the request is completed */
+static void mark_last_group_member(struct io_kiocb *req)
+{
+ /* reuse REQ_F_SQE_GROUP as flag of last member */
+ WARN_ON_ONCE(req->flags & REQ_F_SQE_GROUP);
+
+ req->flags |= REQ_F_SQE_GROUP;
+}
+
+/* called only after the request is completed */
+static bool req_is_last_group_member(struct io_kiocb *req)
+{
+ return req->flags & REQ_F_SQE_GROUP;
+}
+
+static void io_complete_group_member(struct io_kiocb *req)
+{
+ struct io_kiocb *lead = get_group_leader(req);
+
+ if (WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP) ||
+ lead->grp_refs <= 0))
+ return;
+
+ /* member CQE needs to be posted first */
+ if (!(req->flags & REQ_F_CQE_SKIP))
+ io_req_commit_cqe(req->ctx, req);
+
+ req->flags &= ~REQ_F_SQE_GROUP;
+
+ /* Set leader as failed in case of any member failed */
+ if (unlikely((req->flags & REQ_F_FAIL)))
+ req_set_fail(lead);
+
+ if (!--lead->grp_refs) {
+ mark_last_group_member(req);
+ if (!(lead->flags & REQ_F_CQE_SKIP))
+ io_req_commit_cqe(lead->ctx, lead);
+ } else if (lead->grp_refs == 1 && (lead->flags & REQ_F_SQE_GROUP)) {
+ /*
+ * The single uncompleted leader will degenerate to plain
+ * request, so group leader can be always freed via the
+ * last completed member.
+ */
+ lead->flags &= ~REQ_F_SQE_GROUP_LEADER;
+ }
+}
+
+static void io_complete_group_leader(struct io_kiocb *req)
+{
+ WARN_ON_ONCE(req->grp_refs <= 1);
+ req->flags &= ~REQ_F_SQE_GROUP;
+ req->grp_refs -= 1;
+}
+
+static void io_complete_group_req(struct io_kiocb *req)
+{
+ if (req_is_group_leader(req))
+ io_complete_group_leader(req);
+ else
+ io_complete_group_member(req);
+}
+
static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
@@ -890,7 +1005,8 @@ static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
* Handle special CQ sync cases via task_work. DEFER_TASKRUN requires
* the submitter task context, IOPOLL protects with uring_lock.
*/
- if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL)) {
+ if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL) ||
+ req_is_group_leader(req)) {
req->io_task_work.func = io_req_task_complete;
io_req_task_work_add(req);
return;
@@ -1388,11 +1504,43 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
comp_list);
if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
+ if (req_is_last_group_member(req) ||
+ req_is_group_leader(req)) {
+ struct io_kiocb *leader;
+
+ /* Leader is freed via the last member */
+ if (req_is_group_leader(req)) {
+ if (req->grp_link)
+ io_queue_group_members(req);
+ node = req->comp_list.next;
+ continue;
+ }
+
+ /*
+ * Prepare for freeing leader since we are the
+ * last group member
+ */
+ leader = get_group_leader(req);
+ leader->flags &= ~REQ_F_SQE_GROUP_LEADER;
+ req->flags &= ~REQ_F_SQE_GROUP;
+ /*
+ * Link leader to current request's next,
+ * this way works because the iterator
+ * always check the next node only.
+ *
+ * Be careful when you change the iterator
+ * in future
+ */
+ wq_stack_add_head(&leader->comp_list,
+ &req->comp_list);
+ }
+
if (req->flags & REQ_F_REFCOUNT) {
node = req->comp_list.next;
if (!req_ref_put_and_test(req))
continue;
}
+
if ((req->flags & REQ_F_POLLED) && req->apoll) {
struct async_poll *apoll = req->apoll;
@@ -1427,8 +1575,16 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
struct io_kiocb *req = container_of(node, struct io_kiocb,
comp_list);
- if (!(req->flags & REQ_F_CQE_SKIP))
- io_req_commit_cqe(ctx, req);
+ if (unlikely(req->flags & (REQ_F_CQE_SKIP | REQ_F_SQE_GROUP))) {
+ if (req->flags & REQ_F_SQE_GROUP) {
+ io_complete_group_req(req);
+ continue;
+ }
+
+ if (req->flags & REQ_F_CQE_SKIP)
+ continue;
+ }
+ io_req_commit_cqe(ctx, req);
}
__io_cq_unlock_post(ctx);
@@ -1638,8 +1794,12 @@ static u32 io_get_sequence(struct io_kiocb *req)
struct io_kiocb *cur;
/* need original cached_sq_head, but it was increased for each req */
- io_for_each_link(cur, req)
- seq--;
+ io_for_each_link(cur, req) {
+ if (req_is_group_leader(cur))
+ seq -= cur->grp_refs;
+ else
+ seq--;
+ }
return seq;
}
@@ -2101,6 +2261,65 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
return def->prep(req, sqe);
}
+static struct io_kiocb *io_group_sqe(struct io_submit_link *group,
+ struct io_kiocb *req)
+{
+ /*
+ * Group chain is similar with link chain: starts with 1st sqe with
+ * REQ_F_SQE_GROUP, and ends with the 1st sqe without REQ_F_SQE_GROUP
+ */
+ if (group->head) {
+ struct io_kiocb *lead = group->head;
+
+ /*
+ * Members can't be in link chain, can't be drained
+ *
+ * Also IOSQE_CQE_SKIP_SUCCESS can't be set for member
+ * for the sake of simplicity
+ */
+ if (req->flags & (IO_REQ_LINK_FLAGS | REQ_F_IO_DRAIN |
+ REQ_F_CQE_SKIP))
+ req_fail_link_node(lead, -EINVAL);
+
+ lead->grp_refs += 1;
+ group->last->grp_link = req;
+ group->last = req;
+
+ if (req->flags & REQ_F_SQE_GROUP)
+ return NULL;
+
+ req->grp_link = NULL;
+ req->flags |= REQ_F_SQE_GROUP;
+ group->head = NULL;
+
+ return lead;
+ } else {
+ if (WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP)))
+ return req;
+ group->head = req;
+ group->last = req;
+ req->grp_refs = 1;
+ req->flags |= REQ_F_SQE_GROUP_LEADER;
+ return NULL;
+ }
+}
+
+static __cold struct io_kiocb *io_submit_fail_group(
+ struct io_submit_link *link, struct io_kiocb *req)
+{
+ struct io_kiocb *lead = link->head;
+
+ /*
+ * Instead of failing eagerly, continue assembling the group link
+ * if applicable and mark the leader with REQ_F_FAIL. The group
+ * flushing code should find the flag and handle the rest
+ */
+ if (lead && !(lead->flags & REQ_F_FAIL))
+ req_fail_link_node(lead, -ECANCELED);
+
+ return io_group_sqe(link, req);
+}
+
static __cold int io_submit_fail_link(struct io_submit_link *link,
struct io_kiocb *req, int ret)
{
@@ -2139,11 +2358,18 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
{
struct io_ring_ctx *ctx = req->ctx;
struct io_submit_link *link = &ctx->submit_state.link;
+ struct io_submit_link *group = &ctx->submit_state.group;
trace_io_uring_req_failed(sqe, req, ret);
req_fail_link_node(req, ret);
+ if (group->head || (req->flags & REQ_F_SQE_GROUP)) {
+ req = io_submit_fail_group(group, req);
+ if (!req)
+ return 0;
+ }
+
/* cover both linked and non-linked request */
return io_submit_fail_link(link, req, ret);
}
@@ -2187,11 +2413,29 @@ static struct io_kiocb *io_link_sqe(struct io_submit_link *link,
return req;
}
+static inline bool io_group_assembling(const struct io_submit_state *state,
+ const struct io_kiocb *req)
+{
+ if (state->group.head || req->flags & REQ_F_SQE_GROUP)
+ return true;
+ return false;
+}
+
+/* Failed request is covered too */
+static inline bool io_link_assembling(const struct io_submit_state *state,
+ const struct io_kiocb *req)
+{
+ if (state->link.head || (req->flags & (IO_REQ_LINK_FLAGS |
+ REQ_F_FORCE_ASYNC | REQ_F_FAIL)))
+ return true;
+ return false;
+}
+
static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
const struct io_uring_sqe *sqe)
__must_hold(&ctx->uring_lock)
{
- struct io_submit_link *link = &ctx->submit_state.link;
+ struct io_submit_state *state = &ctx->submit_state;
int ret;
ret = io_init_req(ctx, req, sqe);
@@ -2200,11 +2444,20 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
trace_io_uring_submit_req(req);
- if (unlikely(link->head || (req->flags & (IO_REQ_LINK_FLAGS |
- REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) {
- req = io_link_sqe(link, req);
- if (!req)
- return 0;
+ if (unlikely(io_link_assembling(state, req) ||
+ io_group_assembling(state, req))) {
+ if (io_group_assembling(state, req)) {
+ req = io_group_sqe(&state->group, req);
+ if (!req)
+ return 0;
+ }
+
+ /* covers non-linked failed request too */
+ if (io_link_assembling(state, req)) {
+ req = io_link_sqe(&state->link, req);
+ if (!req)
+ return 0;
+ }
}
io_queue_sqe(req);
return 0;
@@ -2217,8 +2470,22 @@ static void io_submit_state_end(struct io_ring_ctx *ctx)
{
struct io_submit_state *state = &ctx->submit_state;
- if (unlikely(state->link.head))
- io_queue_sqe_fallback(state->link.head);
+ if (unlikely(state->group.head || state->link.head)) {
+ /* the last member must set REQ_F_SQE_GROUP */
+ if (state->group.head) {
+ struct io_kiocb *lead = state->group.head;
+
+ state->group.last->grp_link = NULL;
+ if (lead->flags & IO_REQ_LINK_FLAGS)
+ io_link_sqe(&state->link, lead);
+ else
+ io_queue_sqe_fallback(lead);
+ }
+
+ if (unlikely(state->link.head))
+ io_queue_sqe_fallback(state->link.head);
+ }
+
/* flush only after queuing links as they can generate completions */
io_submit_flush_completions(ctx);
if (state->plug_started)
@@ -2236,6 +2503,7 @@ static void io_submit_state_start(struct io_submit_state *state,
state->submit_nr = max_ios;
/* set only head, no need to init link_last in advance */
state->link.head = NULL;
+ state->group.head = NULL;
}
static void io_commit_sqring(struct io_ring_ctx *ctx)
@@ -3670,7 +3938,8 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
IORING_FEAT_EXT_ARG | IORING_FEAT_NATIVE_WORKERS |
IORING_FEAT_RSRC_TAGS | IORING_FEAT_CQE_SKIP |
IORING_FEAT_LINKED_FILE | IORING_FEAT_REG_REG_RING |
- IORING_FEAT_RECVSEND_BUNDLE | IORING_FEAT_MIN_TIMEOUT;
+ IORING_FEAT_RECVSEND_BUNDLE | IORING_FEAT_MIN_TIMEOUT |
+ IORING_FEAT_SQE_GROUP;
if (copy_to_user(params, p, sizeof(*p))) {
ret = -EFAULT;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 65078e641390..df2be7353414 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -72,6 +72,7 @@ bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags
void io_add_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags);
void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
+void io_cancel_group_members(struct io_kiocb *req, bool ignore_cqes);
struct file *io_file_get_normal(struct io_kiocb *req, int fd);
struct file *io_file_get_fixed(struct io_kiocb *req, int fd,
@@ -343,6 +344,11 @@ static inline void io_tw_lock(struct io_ring_ctx *ctx, struct io_tw_state *ts)
lockdep_assert_held(&ctx->uring_lock);
}
+static inline bool req_is_group_leader(struct io_kiocb *req)
+{
+ return req->flags & REQ_F_SQE_GROUP_LEADER;
+}
+
/*
* Don't complete immediately but use deferred completion infrastructure.
* Protected by ->uring_lock and can only be used either with
diff --git a/io_uring/timeout.c b/io_uring/timeout.c
index 9973876d91b0..dad7e9283d7e 100644
--- a/io_uring/timeout.c
+++ b/io_uring/timeout.c
@@ -168,6 +168,8 @@ static void io_fail_links(struct io_kiocb *req)
link->flags |= REQ_F_CQE_SKIP;
else
link->flags &= ~REQ_F_CQE_SKIP;
+ if (req_is_group_leader(link))
+ io_cancel_group_members(link, ignore_cqes);
trace_io_uring_fail_link(req, link);
link = link->link;
}
@@ -543,6 +545,10 @@ static int __io_timeout_prep(struct io_kiocb *req,
if (is_timeout_link) {
struct io_submit_link *link = &req->ctx->submit_state.link;
+ /* so far disallow IO group link timeout */
+ if (req->ctx->submit_state.group.head)
+ return -EINVAL;
+
if (!link->head)
return -EINVAL;
if (link->last->opcode == IORING_OP_LINK_TIMEOUT)
--
2.42.0
* [PATCH V6 5/8] io_uring: support sqe group with members depending on leader
2024-09-12 10:49 [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
` (3 preceding siblings ...)
2024-09-12 10:49 ` [PATCH V6 4/8] io_uring: support SQE group Ming Lei
@ 2024-09-12 10:49 ` Ming Lei
2024-10-04 13:18 ` Pavel Begunkov
2024-09-12 10:49 ` [PATCH V6 6/8] io_uring: support providing sqe group buffer Ming Lei
` (3 subsequent siblings)
8 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-09-12 10:49 UTC (permalink / raw)
To: Jens Axboe, io-uring, Pavel Begunkov; +Cc: linux-block, Ming Lei
IOSQE_SQE_GROUP currently queues members only after the leader completes.
This is purely an implementation simplification and is not part of the UAPI;
it may be relaxed in the future so that members are queued concurrently with
the leader.
However, some resources, such as a kernel buffer, can't cross OPs freely,
otherwise the buffer may easily be leaked if any OP fails or the application
panics.
Add the flag REQ_F_SQE_GROUP_DEP to let members explicitly depend on the
group leader, so that group members aren't queued until the leader request
completes and the kernel resource's lifetime can be aligned with the group
leader (and hence the group). One typical use case is supporting zero copy
for a device's internal buffer.
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/io_uring_types.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 11c6726abbb9..793d5a26d9b8 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -472,6 +472,7 @@ enum {
REQ_F_BL_NO_RECYCLE_BIT,
REQ_F_BUFFERS_COMMIT_BIT,
REQ_F_SQE_GROUP_LEADER_BIT,
+ REQ_F_SQE_GROUP_DEP_BIT,
/* not a real bit, just to check we're not overflowing the space */
__REQ_F_LAST_BIT,
@@ -554,6 +555,8 @@ enum {
REQ_F_BUFFERS_COMMIT = IO_REQ_FLAG(REQ_F_BUFFERS_COMMIT_BIT),
/* sqe group lead */
REQ_F_SQE_GROUP_LEADER = IO_REQ_FLAG(REQ_F_SQE_GROUP_LEADER_BIT),
+ /* sqe group with members depending on leader */
+ REQ_F_SQE_GROUP_DEP = IO_REQ_FLAG(REQ_F_SQE_GROUP_DEP_BIT),
};
typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
--
2.42.0
* [PATCH V6 6/8] io_uring: support providing sqe group buffer
2024-09-12 10:49 [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
` (4 preceding siblings ...)
2024-09-12 10:49 ` [PATCH V6 5/8] io_uring: support sqe group with members depending on leader Ming Lei
@ 2024-09-12 10:49 ` Ming Lei
2024-10-04 15:32 ` Pavel Begunkov
2024-09-12 10:49 ` [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer Ming Lei
` (2 subsequent siblings)
8 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-09-12 10:49 UTC (permalink / raw)
To: Jens Axboe, io-uring, Pavel Begunkov; +Cc: linux-block, Ming Lei
SQE group with REQ_F_SQE_GROUP_DEP introduces a new mechanism to share a
resource among a group of requests: all member requests can efficiently
consume the resource provided by the group leader in parallel.
This patch uses the added REQ_F_SQE_GROUP_DEP sqe group feature to share a
kernel buffer within an sqe group:
- the group leader provides a kernel buffer to the member requests
- member requests use the provided buffer to do FS or network IO, or more
operations in the future
- the kernel buffer is returned back after the member requests have
consumed it
This looks a bit similar to the kernel's pipe/splice, but there are some
important differences:
- splice is for transferring data between two FDs via a pipe, and fd_out can
only read data from the pipe, data can't be written into it; this feature
lends a buffer from the group leader to the members, so a member request can
also write data into the buffer if the provided buffer allows writes.
- splice implements data transfer by moving pages between a subsystem and
the pipe, which means page ownership is transferred; that is one of the most
complicated parts of splice. This patch supports scenarios in which the
buffer can't be transferred at all: the buffer is only lent to member
requests and is returned back after the member requests have consumed it, so
the buffer lifetime is aligned with the group leader lifetime and is
simplified a lot. In particular, the buffer is guaranteed to be returned.
- splice basically can't run asynchronously
This can help implement generic zero copy between a device and related
operations, e.g. for ublk, fuse, vdpa, and even network receive.
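From the userspace side, a member that consumes the group buffer does not
pass a user address; its addr field is interpreted as an offset into the
leader-provided kernel buffer and len as the number of bytes. A minimal
sketch (not part of this patch; ring/fd setup and the surrounding group
assembly are assumed):

	/* member request of a group whose leader provided a kernel buffer:
	 * write 'bytes' bytes, starting at offset 'buf_off' inside that
	 * buffer, to file offset 'file_off' of the backing fd
	 */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_write(sqe, backing_fd, NULL, bytes, file_off);
	sqe->addr = buf_off;	/* offset into the group kernel buffer */
	/* keep IOSQE_SQE_GROUP set unless this is the last member */
	sqe->flags |= IOSQE_SQE_GROUP;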
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/io_uring_types.h | 33 +++++++++++++++++++
io_uring/io_uring.c | 10 +++++-
io_uring/io_uring.h | 10 ++++++
io_uring/kbuf.c | 60 ++++++++++++++++++++++++++++++++++
io_uring/kbuf.h | 13 ++++++++
io_uring/net.c | 23 ++++++++++++-
io_uring/opdef.c | 4 +++
io_uring/opdef.h | 2 ++
io_uring/rw.c | 20 +++++++++++-
9 files changed, 172 insertions(+), 3 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 793d5a26d9b8..445e5507565a 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -6,6 +6,7 @@
#include <linux/task_work.h>
#include <linux/bitmap.h>
#include <linux/llist.h>
+#include <linux/bvec.h>
#include <uapi/linux/io_uring.h>
enum {
@@ -39,6 +40,26 @@ enum io_uring_cmd_flags {
IO_URING_F_COMPAT = (1 << 12),
};
+struct io_uring_kernel_buf;
+typedef void (io_uring_buf_giveback_t) (const struct io_uring_kernel_buf *);
+
+/* buffer provided from kernel */
+struct io_uring_kernel_buf {
+ unsigned long len;
+ unsigned short nr_bvecs;
+ unsigned char dir; /* ITER_SOURCE or ITER_DEST */
+
+ /* offset in the 1st bvec */
+ unsigned int offset;
+ const struct bio_vec *bvec;
+
+ /* called when we are done with this buffer */
+ io_uring_buf_giveback_t *grp_kbuf_ack;
+
+ /* private field, user don't touch it */
+ struct bio_vec __bvec[];
+};
+
struct io_wq_work_node {
struct io_wq_work_node *next;
};
@@ -473,6 +494,7 @@ enum {
REQ_F_BUFFERS_COMMIT_BIT,
REQ_F_SQE_GROUP_LEADER_BIT,
REQ_F_SQE_GROUP_DEP_BIT,
+ REQ_F_GROUP_KBUF_BIT,
/* not a real bit, just to check we're not overflowing the space */
__REQ_F_LAST_BIT,
@@ -557,6 +579,8 @@ enum {
REQ_F_SQE_GROUP_LEADER = IO_REQ_FLAG(REQ_F_SQE_GROUP_LEADER_BIT),
/* sqe group with members depending on leader */
REQ_F_SQE_GROUP_DEP = IO_REQ_FLAG(REQ_F_SQE_GROUP_DEP_BIT),
+ /* group lead provides kbuf for members, set for both lead and member */
+ REQ_F_GROUP_KBUF = IO_REQ_FLAG(REQ_F_GROUP_KBUF_BIT),
};
typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
@@ -640,6 +664,15 @@ struct io_kiocb {
* REQ_F_BUFFER_RING is set.
*/
struct io_buffer_list *buf_list;
+
+ /*
+ * store kernel buffer provided by sqe group lead, valid
+ * IFF REQ_F_GROUP_KBUF
+ *
+ * The buffer meta is immutable since it is shared by
+ * all member requests
+ */
+ const struct io_uring_kernel_buf *grp_kbuf;
};
union {
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 99b44b6babd6..80c4d9192657 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -116,7 +116,7 @@
#define IO_REQ_CLEAN_FLAGS (REQ_F_BUFFER_SELECTED | REQ_F_NEED_CLEANUP | \
REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS | \
- REQ_F_ASYNC_DATA)
+ REQ_F_ASYNC_DATA | REQ_F_GROUP_KBUF)
#define IO_REQ_CLEAN_SLOW_FLAGS (REQ_F_REFCOUNT | REQ_F_LINK | REQ_F_HARDLINK |\
REQ_F_SQE_GROUP | REQ_F_SQE_GROUP_LEADER | \
@@ -387,6 +387,11 @@ static bool req_need_defer(struct io_kiocb *req, u32 seq)
static void io_clean_op(struct io_kiocb *req)
{
+ /* GROUP_KBUF is only available for REQ_F_SQE_GROUP_DEP */
+ if ((req->flags & (REQ_F_GROUP_KBUF | REQ_F_SQE_GROUP_DEP)) ==
+ (REQ_F_GROUP_KBUF | REQ_F_SQE_GROUP_DEP))
+ io_group_kbuf_drop(req);
+
if (req->flags & REQ_F_BUFFER_SELECTED) {
spin_lock(&req->ctx->completion_lock);
io_kbuf_drop(req);
@@ -914,9 +919,12 @@ static void io_queue_group_members(struct io_kiocb *req)
req->grp_link = NULL;
while (member) {
+ const struct io_issue_def *def = &io_issue_defs[member->opcode];
struct io_kiocb *next = member->grp_link;
member->grp_leader = req;
+ if ((req->flags & REQ_F_GROUP_KBUF) && def->accept_group_kbuf)
+ member->flags |= REQ_F_GROUP_KBUF;
if (unlikely(member->flags & REQ_F_FAIL)) {
io_req_task_queue_fail(member, member->cqe.res);
} else if (unlikely(req->flags & REQ_F_FAIL)) {
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index df2be7353414..8e111d24c02d 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -349,6 +349,16 @@ static inline bool req_is_group_leader(struct io_kiocb *req)
return req->flags & REQ_F_SQE_GROUP_LEADER;
}
+static inline bool req_is_group_member(struct io_kiocb *req)
+{
+ return !req_is_group_leader(req) && (req->flags & REQ_F_SQE_GROUP);
+}
+
+static inline bool req_support_group_dep(struct io_kiocb *req)
+{
+ return req_is_group_leader(req) && (req->flags & REQ_F_SQE_GROUP_DEP);
+}
+
/*
* Don't complete immediately but use deferred completion infrastructure.
* Protected by ->uring_lock and can only be used either with
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 1f503bcc9c9f..ead0f85c05ac 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -838,3 +838,63 @@ int io_pbuf_mmap(struct file *file, struct vm_area_struct *vma)
io_put_bl(ctx, bl);
return ret;
}
+
+int io_provide_group_kbuf(struct io_kiocb *req,
+ const struct io_uring_kernel_buf *grp_kbuf)
+{
+ if (unlikely(!req_support_group_dep(req)))
+ return -EINVAL;
+
+ /*
+ * Borrow this buffer from one kernel subsystem, and return them
+ * by calling `grp_kbuf_ack` when the group lead is freed.
+ *
+ * Not like pipe/splice, this kernel buffer is always owned by the
+ * provider, and has to be returned back.
+ */
+ req->grp_kbuf = grp_kbuf;
+ req->flags |= REQ_F_GROUP_KBUF;
+
+ return 0;
+}
+
+int io_import_group_kbuf(struct io_kiocb *req, unsigned long buf_off,
+ unsigned int len, int dir, struct iov_iter *iter)
+{
+ struct io_kiocb *lead = req->grp_link;
+ const struct io_uring_kernel_buf *kbuf;
+ unsigned long offset;
+
+ WARN_ON_ONCE(!(req->flags & REQ_F_GROUP_KBUF));
+
+ if (!req_is_group_member(req))
+ return -EINVAL;
+
+ if (!lead || !req_support_group_dep(lead) || !lead->grp_kbuf)
+ return -EINVAL;
+
+ /* req->fused_cmd_kbuf is immutable */
+ kbuf = lead->grp_kbuf;
+ offset = kbuf->offset;
+
+ if (!kbuf->bvec)
+ return -EINVAL;
+
+ if (dir != kbuf->dir)
+ return -EINVAL;
+
+ if (unlikely(buf_off > kbuf->len))
+ return -EFAULT;
+
+ if (unlikely(len > kbuf->len - buf_off))
+ return -EFAULT;
+
+ /* don't use io_import_fixed which doesn't support multipage bvec */
+ offset += buf_off;
+ iov_iter_bvec(iter, dir, kbuf->bvec, kbuf->nr_bvecs, offset + len);
+
+ if (offset)
+ iov_iter_advance(iter, offset);
+
+ return 0;
+}
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index 36aadfe5ac00..37d18324e840 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -89,6 +89,11 @@ struct io_buffer_list *io_pbuf_get_bl(struct io_ring_ctx *ctx,
unsigned long bgid);
int io_pbuf_mmap(struct file *file, struct vm_area_struct *vma);
+int io_provide_group_kbuf(struct io_kiocb *req,
+ const struct io_uring_kernel_buf *grp_kbuf);
+int io_import_group_kbuf(struct io_kiocb *req, unsigned long buf_off,
+ unsigned int len, int dir, struct iov_iter *iter);
+
static inline bool io_kbuf_recycle_ring(struct io_kiocb *req)
{
/*
@@ -220,4 +225,12 @@ static inline unsigned int io_put_kbufs(struct io_kiocb *req, int len,
{
return __io_put_kbufs(req, len, nbufs, issue_flags);
}
+
+static inline void io_group_kbuf_drop(struct io_kiocb *req)
+{
+ const struct io_uring_kernel_buf *gbuf = req->grp_kbuf;
+
+ if (gbuf && gbuf->grp_kbuf_ack)
+ gbuf->grp_kbuf_ack(gbuf);
+}
#endif
diff --git a/io_uring/net.c b/io_uring/net.c
index f10f5a22d66a..ad24dd5924d2 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -89,6 +89,13 @@ struct io_sr_msg {
*/
#define MULTISHOT_MAX_RETRY 32
+#define user_ptr_to_u64(x) ( \
+{ \
+ typecheck(void __user *, (x)); \
+ (u64)(unsigned long)(x); \
+} \
+)
+
int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
struct io_shutdown *shutdown = io_kiocb_to_cmd(req, struct io_shutdown);
@@ -375,7 +382,7 @@ static int io_send_setup(struct io_kiocb *req)
kmsg->msg.msg_name = &kmsg->addr;
kmsg->msg.msg_namelen = sr->addr_len;
}
- if (!io_do_buffer_select(req)) {
+ if (!io_do_buffer_select(req) && !(req->flags & REQ_F_GROUP_KBUF)) {
ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len,
&kmsg->msg.msg_iter);
if (unlikely(ret < 0))
@@ -593,6 +600,15 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
if (issue_flags & IO_URING_F_NONBLOCK)
flags |= MSG_DONTWAIT;
+ if (req->flags & REQ_F_GROUP_KBUF) {
+ ret = io_import_group_kbuf(req,
+ user_ptr_to_u64(sr->buf),
+ sr->len, ITER_SOURCE,
+ &kmsg->msg.msg_iter);
+ if (unlikely(ret))
+ return ret;
+ }
+
retry_bundle:
if (io_do_buffer_select(req)) {
struct buf_sel_arg arg = {
@@ -1154,6 +1170,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
goto out_free;
}
sr->buf = NULL;
+ } else if (req->flags & REQ_F_GROUP_KBUF) {
+ ret = io_import_group_kbuf(req, user_ptr_to_u64(sr->buf),
+ sr->len, ITER_DEST, &kmsg->msg.msg_iter);
+ if (unlikely(ret))
+ goto out_free;
}
kmsg->msg.msg_flags = 0;
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index a2be3bbca5ff..c12f57619a33 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -246,6 +246,7 @@ const struct io_issue_def io_issue_defs[] = {
.ioprio = 1,
.iopoll = 1,
.iopoll_queue = 1,
+ .accept_group_kbuf = 1,
.async_size = sizeof(struct io_async_rw),
.prep = io_prep_read,
.issue = io_read,
@@ -260,6 +261,7 @@ const struct io_issue_def io_issue_defs[] = {
.ioprio = 1,
.iopoll = 1,
.iopoll_queue = 1,
+ .accept_group_kbuf = 1,
.async_size = sizeof(struct io_async_rw),
.prep = io_prep_write,
.issue = io_write,
@@ -282,6 +284,7 @@ const struct io_issue_def io_issue_defs[] = {
.audit_skip = 1,
.ioprio = 1,
.buffer_select = 1,
+ .accept_group_kbuf = 1,
#if defined(CONFIG_NET)
.async_size = sizeof(struct io_async_msghdr),
.prep = io_sendmsg_prep,
@@ -297,6 +300,7 @@ const struct io_issue_def io_issue_defs[] = {
.buffer_select = 1,
.audit_skip = 1,
.ioprio = 1,
+ .accept_group_kbuf = 1,
#if defined(CONFIG_NET)
.async_size = sizeof(struct io_async_msghdr),
.prep = io_recvmsg_prep,
diff --git a/io_uring/opdef.h b/io_uring/opdef.h
index 14456436ff74..328c8a3c4fa7 100644
--- a/io_uring/opdef.h
+++ b/io_uring/opdef.h
@@ -27,6 +27,8 @@ struct io_issue_def {
unsigned iopoll_queue : 1;
/* vectored opcode, set if 1) vectored, and 2) handler needs to know */
unsigned vectored : 1;
+ /* opcodes which accept provided group kbuf */
+ unsigned accept_group_kbuf : 1;
/* size of async data needed, if any */
unsigned short async_size;
diff --git a/io_uring/rw.c b/io_uring/rw.c
index f023ff49c688..402d80436ffd 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -235,7 +235,8 @@ static int io_prep_rw_setup(struct io_kiocb *req, int ddir, bool do_import)
if (io_rw_alloc_async(req))
return -ENOMEM;
- if (!do_import || io_do_buffer_select(req))
+ if (!do_import || io_do_buffer_select(req) ||
+ (req->flags & REQ_F_GROUP_KBUF))
return 0;
rw = req->async_data;
@@ -619,11 +620,16 @@ static inline loff_t *io_kiocb_ppos(struct kiocb *kiocb)
*/
static ssize_t loop_rw_iter(int ddir, struct io_rw *rw, struct iov_iter *iter)
{
+ struct io_kiocb *req = cmd_to_io_kiocb(rw);
struct kiocb *kiocb = &rw->kiocb;
struct file *file = kiocb->ki_filp;
ssize_t ret = 0;
loff_t *ppos;
+ /* group buffer is kernel buffer and doesn't have userspace addr */
+ if (req->flags & REQ_F_GROUP_KBUF)
+ return -EOPNOTSUPP;
+
/*
* Don't support polled IO through this interface, and we can't
* support non-blocking either. For the latter, this just causes
@@ -830,6 +836,11 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
ret = io_import_iovec(ITER_DEST, req, io, issue_flags);
if (unlikely(ret < 0))
return ret;
+ } else if (req->flags & REQ_F_GROUP_KBUF) {
+ ret = io_import_group_kbuf(req, rw->addr, rw->len, ITER_DEST,
+ &io->iter);
+ if (unlikely(ret))
+ return ret;
}
ret = io_rw_init_file(req, FMODE_READ, READ);
if (unlikely(ret))
@@ -1019,6 +1030,13 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
ssize_t ret, ret2;
loff_t *ppos;
+ if (req->flags & REQ_F_GROUP_KBUF) {
+ ret = io_import_group_kbuf(req, rw->addr, rw->len, ITER_SOURCE,
+ &io->iter);
+ if (unlikely(ret))
+ return ret;
+ }
+
ret = io_rw_init_file(req, FMODE_WRITE, WRITE);
if (unlikely(ret))
return ret;
--
2.42.0
* [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-09-12 10:49 [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
` (5 preceding siblings ...)
2024-09-12 10:49 ` [PATCH V6 6/8] io_uring: support providing sqe group buffer Ming Lei
@ 2024-09-12 10:49 ` Ming Lei
2024-10-04 15:44 ` Pavel Begunkov
2024-09-12 10:49 ` [PATCH V6 8/8] ublk: support provide io buffer Ming Lei
2024-09-26 10:27 ` [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
8 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-09-12 10:49 UTC (permalink / raw)
To: Jens Axboe, io-uring, Pavel Begunkov; +Cc: linux-block, Ming Lei
Allow a uring command to act as group leader and provide a kernel buffer,
which enables generic device zero copy over a device buffer.
The following patch will use this to support zero copy for ublk.
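For illustration, a minimal sketch of how a driver's ->uring_cmd() handler
could lend one of its in-kernel buffers to the group. The my_drv_* names and
fields are hypothetical; only struct io_uring_kernel_buf and
io_uring_cmd_provide_kbuf() come from this series, and the buffer plus the
callback must stay valid until ->grp_kbuf_ack() is invoked:

	static void my_drv_kbuf_ack(const struct io_uring_kernel_buf *kbuf)
	{
		/* all group members are done with the buffer, drop our reference */
		struct my_drv_req *r = container_of(kbuf, struct my_drv_req, kbuf);

		my_drv_put_req(r);
	}

	static int my_drv_provide_buf(struct io_uring_cmd *cmd, struct my_drv_req *r)
	{
		/* describe the in-kernel buffer backing this driver request */
		r->kbuf.bvec = r->bvec;
		r->kbuf.nr_bvecs = r->nr_bvecs;
		r->kbuf.offset = 0;
		r->kbuf.len = r->nr_bytes;
		/* ITER_DEST if members fill the buffer, ITER_SOURCE if they read from it */
		r->kbuf.dir = r->buf_is_dest ? ITER_DEST : ITER_SOURCE;
		r->kbuf.grp_kbuf_ack = my_drv_kbuf_ack;

		/* fails unless the command was submitted as group leader with
		 * IORING_PROVIDE_GROUP_KBUF set
		 */
		return io_uring_cmd_provide_kbuf(cmd, &r->kbuf);
	}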
Signed-off-by: Ming Lei <[email protected]>
---
include/linux/io_uring/cmd.h | 7 +++++++
include/uapi/linux/io_uring.h | 7 ++++++-
io_uring/uring_cmd.c | 28 ++++++++++++++++++++++++++++
3 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
index 447fbfd32215..fde3a2ec7d9a 100644
--- a/include/linux/io_uring/cmd.h
+++ b/include/linux/io_uring/cmd.h
@@ -48,6 +48,8 @@ void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
unsigned int issue_flags);
+int io_uring_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
+ const struct io_uring_kernel_buf *grp_kbuf);
#else
static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
struct iov_iter *iter, void *ioucmd)
@@ -67,6 +69,11 @@ static inline void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
unsigned int issue_flags)
{
}
+static inline int io_uring_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
+ const struct io_uring_kernel_buf *grp_kbuf)
+{
+ return -EOPNOTSUPP;
+}
#endif
/*
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 2af32745ebd3..11985eeac10e 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -271,9 +271,14 @@ enum io_uring_op {
* sqe->uring_cmd_flags top 8bits aren't available for userspace
* IORING_URING_CMD_FIXED use registered buffer; pass this flag
* along with setting sqe->buf_index.
+ * IORING_PROVIDE_GROUP_KBUF this command provides group kernel buffer
+ * for member requests which can retrieve
+ * any sub-buffer with offset(sqe->addr) and
+ * len(sqe->len)
*/
#define IORING_URING_CMD_FIXED (1U << 0)
-#define IORING_URING_CMD_MASK IORING_URING_CMD_FIXED
+#define IORING_PROVIDE_GROUP_KBUF (1U << 1)
+#define IORING_URING_CMD_MASK (IORING_URING_CMD_FIXED | IORING_PROVIDE_GROUP_KBUF)
/*
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 8391c7c7c1ec..ac92ba70de9d 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -15,6 +15,7 @@
#include "alloc_cache.h"
#include "rsrc.h"
#include "uring_cmd.h"
+#include "kbuf.h"
static struct uring_cache *io_uring_async_get(struct io_kiocb *req)
{
@@ -175,6 +176,26 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, ssize_t res2,
}
EXPORT_SYMBOL_GPL(io_uring_cmd_done);
+/*
+ * Provide kernel buffer for sqe group members to consume, and the caller
+ * has to guarantee that the provided buffer and the callback are valid
+ * until the callback is called.
+ */
+int io_uring_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
+ const struct io_uring_kernel_buf *grp_kbuf)
+{
+ struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);
+
+ if (unlikely(!(ioucmd->flags & IORING_PROVIDE_GROUP_KBUF)))
+ return -EINVAL;
+
+ if (unlikely(!req_support_group_dep(req)))
+ return -EINVAL;
+
+ return io_provide_group_kbuf(req, grp_kbuf);
+}
+EXPORT_SYMBOL_GPL(io_uring_cmd_provide_kbuf);
+
static int io_uring_cmd_prep_setup(struct io_kiocb *req,
const struct io_uring_sqe *sqe)
{
@@ -207,6 +228,13 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (ioucmd->flags & ~IORING_URING_CMD_MASK)
return -EINVAL;
+ if (ioucmd->flags & IORING_PROVIDE_GROUP_KBUF) {
+ /* LEADER flag isn't set yet, so check GROUP only */
+ if (!(req->flags & REQ_F_SQE_GROUP))
+ return -EINVAL;
+ req->flags |= REQ_F_SQE_GROUP_DEP;
+ }
+
if (ioucmd->flags & IORING_URING_CMD_FIXED) {
struct io_ring_ctx *ctx = req->ctx;
u16 index;
--
2.42.0
* [PATCH V6 8/8] ublk: support provide io buffer
2024-09-12 10:49 [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
` (6 preceding siblings ...)
2024-09-12 10:49 ` [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer Ming Lei
@ 2024-09-12 10:49 ` Ming Lei
2024-10-17 22:31 ` Uday Shankar
2024-09-26 10:27 ` [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
8 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-09-12 10:49 UTC (permalink / raw)
To: Jens Axboe, io-uring, Pavel Begunkov; +Cc: linux-block, Ming Lei
Implement the uring command IORING_PROVIDE_GROUP_KBUF, and provide the io
buffer to userspace for running io_uring operations (FS, network IO) against
it, so that ublk zero copy can be supported.
userspace code:
https://github.com/ublk-org/ublksrv/tree/group-provide-buf.v3
git clone https://github.com/ublk-org/ublksrv.git -b group-provide-buf.v3
Both loop and nbd zero copy (io_uring send and send zc) are covered.
The performance improvement is quite obvious with big block sizes; for
example, 'loop --buffered_io' throughput doubles in the 64KB block test
("loop/007" vs "loop/009").
Signed-off-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 160 ++++++++++++++++++++++++++++++++--
include/uapi/linux/ublk_cmd.h | 7 +-
2 files changed, 156 insertions(+), 11 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 890c08792ba8..d5813e20c177 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -51,6 +51,8 @@
/* private ioctl command mirror */
#define UBLK_CMD_DEL_DEV_ASYNC _IOC_NR(UBLK_U_CMD_DEL_DEV_ASYNC)
+#define UBLK_IO_PROVIDE_IO_BUF _IOC_NR(UBLK_U_IO_PROVIDE_IO_BUF)
+
/* All UBLK_F_* have to be included into UBLK_F_ALL */
#define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \
| UBLK_F_URING_CMD_COMP_IN_TASK \
@@ -74,6 +76,8 @@ struct ublk_rq_data {
__u64 sector;
__u32 operation;
__u32 nr_zones;
+ bool allocated_bvec;
+ struct io_uring_kernel_buf buf[0];
};
struct ublk_uring_cmd_pdu {
@@ -192,11 +196,15 @@ struct ublk_params_header {
__u32 types;
};
+static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
+ struct ublk_queue *ubq, int tag, size_t offset);
static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq);
static inline unsigned int ublk_req_build_flags(struct request *req);
static inline struct ublksrv_io_desc *ublk_get_iod(struct ublk_queue *ubq,
int tag);
+static void ublk_io_buf_giveback_cb(const struct io_uring_kernel_buf *buf);
+
static inline bool ublk_dev_is_user_copy(const struct ublk_device *ub)
{
return ub->dev_info.flags & UBLK_F_USER_COPY;
@@ -558,6 +566,11 @@ static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
return ublk_support_user_copy(ubq);
}
+static inline bool ublk_support_zc(const struct ublk_queue *ubq)
+{
+ return ubq->flags & UBLK_F_SUPPORT_ZERO_COPY;
+}
+
static inline void ublk_init_req_ref(const struct ublk_queue *ubq,
struct request *req)
{
@@ -821,6 +834,71 @@ static size_t ublk_copy_user_pages(const struct request *req,
return done;
}
+/*
+ * The built command buffer is immutable, so it is fine to feed it to
+ * concurrent io_uring provide buf commands
+ */
+static int ublk_init_zero_copy_buffer(struct request *req)
+{
+ struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
+ struct io_uring_kernel_buf *imu = data->buf;
+ struct req_iterator rq_iter;
+ unsigned int nr_bvecs = 0;
+ struct bio_vec *bvec;
+ unsigned int offset;
+ struct bio_vec bv;
+
+ if (!ublk_rq_has_data(req))
+ goto exit;
+
+ rq_for_each_bvec(bv, req, rq_iter)
+ nr_bvecs++;
+
+ if (!nr_bvecs)
+ goto exit;
+
+ if (req->bio != req->biotail) {
+ int idx = 0;
+
+ bvec = kvmalloc_array(nr_bvecs, sizeof(struct bio_vec),
+ GFP_NOIO);
+ if (!bvec)
+ return -ENOMEM;
+
+ offset = 0;
+ rq_for_each_bvec(bv, req, rq_iter)
+ bvec[idx++] = bv;
+ data->allocated_bvec = true;
+ } else {
+ struct bio *bio = req->bio;
+
+ offset = bio->bi_iter.bi_bvec_done;
+ bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+ }
+ imu->bvec = bvec;
+ imu->nr_bvecs = nr_bvecs;
+ imu->offset = offset;
+ imu->len = blk_rq_bytes(req);
+ imu->dir = req_op(req) == REQ_OP_READ ? ITER_DEST : ITER_SOURCE;
+ imu->grp_kbuf_ack = ublk_io_buf_giveback_cb;
+
+ return 0;
+exit:
+ imu->bvec = NULL;
+ return 0;
+}
+
+static void ublk_deinit_zero_copy_buffer(struct request *req)
+{
+ struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
+ struct io_uring_kernel_buf *imu = data->buf;
+
+ if (data->allocated_bvec) {
+ kvfree(imu->bvec);
+ data->allocated_bvec = false;
+ }
+}
+
static inline bool ublk_need_map_req(const struct request *req)
{
return ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE;
@@ -832,13 +910,25 @@ static inline bool ublk_need_unmap_req(const struct request *req)
(req_op(req) == REQ_OP_READ || req_op(req) == REQ_OP_DRV_IN);
}
-static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
+static int ublk_map_io(const struct ublk_queue *ubq, struct request *req,
struct ublk_io *io)
{
const unsigned int rq_bytes = blk_rq_bytes(req);
- if (ublk_support_user_copy(ubq))
+ if (ublk_support_user_copy(ubq)) {
+ if (ublk_support_zc(ubq)) {
+ int ret = ublk_init_zero_copy_buffer(req);
+
+ /*
+ * The only failure is -ENOMEM when allocating the provide
+ * buffer's bvec array; return zero so that we can requeue
+ * this req.
+ */
+ if (unlikely(ret))
+ return 0;
+ }
return rq_bytes;
+ }
/*
* no zero copy, we delay copy WRITE request data into ublksrv
@@ -856,13 +946,16 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
}
static int ublk_unmap_io(const struct ublk_queue *ubq,
- const struct request *req,
+ struct request *req,
struct ublk_io *io)
{
const unsigned int rq_bytes = blk_rq_bytes(req);
- if (ublk_support_user_copy(ubq))
+ if (ublk_support_user_copy(ubq)) {
+ if (ublk_support_zc(ubq))
+ ublk_deinit_zero_copy_buffer(req);
return rq_bytes;
+ }
if (ublk_need_unmap_req(req)) {
struct iov_iter iter;
@@ -1008,6 +1101,7 @@ static inline void __ublk_complete_rq(struct request *req)
return;
exit:
+ ublk_deinit_zero_copy_buffer(req);
blk_mq_end_request(req, res);
}
@@ -1650,6 +1744,45 @@ static inline void ublk_prep_cancel(struct io_uring_cmd *cmd,
io_uring_cmd_mark_cancelable(cmd, issue_flags);
}
+static void ublk_io_buf_giveback_cb(const struct io_uring_kernel_buf *buf)
+{
+ struct ublk_rq_data *data = container_of(buf, struct ublk_rq_data, buf[0]);
+ struct request *req = blk_mq_rq_from_pdu(data);
+ struct ublk_queue *ubq = req->mq_hctx->driver_data;
+
+ ublk_put_req_ref(ubq, req);
+}
+
+static int ublk_provide_io_buf(struct io_uring_cmd *cmd,
+ struct ublk_queue *ubq, int tag)
+{
+ struct ublk_device *ub = cmd->file->private_data;
+ struct ublk_rq_data *data;
+ struct request *req;
+
+ if (!ub)
+ return -EPERM;
+
+ req = __ublk_check_and_get_req(ub, ubq, tag, 0);
+ if (!req)
+ return -EINVAL;
+
+ pr_devel("%s: qid %d tag %u request bytes %u\n",
+ __func__, ubq->q_id, tag, blk_rq_bytes(req));
+
+ data = blk_mq_rq_to_pdu(req);
+
+ /*
+ * io_uring guarantees that the callback will be called after
+ * the provided buffer is consumed, and the buffer is removed
+ * automatically before this uring command is freed.
+ *
+ * This request won't be completed until the callback is called,
+ * so the ublk module can't be unloaded while the buffer is in use.
+ */
+ return io_uring_cmd_provide_kbuf(cmd, data->buf);
+}
+
static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
unsigned int issue_flags,
const struct ublksrv_io_cmd *ub_cmd)
@@ -1666,6 +1799,10 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
__func__, cmd->cmd_op, ub_cmd->q_id, tag,
ub_cmd->result);
+ if ((cmd->flags & IORING_PROVIDE_GROUP_KBUF) &&
+ cmd_op != UBLK_U_IO_PROVIDE_IO_BUF)
+ return -EOPNOTSUPP;
+
if (ub_cmd->q_id >= ub->dev_info.nr_hw_queues)
goto out;
@@ -1701,6 +1838,8 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
ret = -EINVAL;
switch (_IOC_NR(cmd_op)) {
+ case UBLK_IO_PROVIDE_IO_BUF:
+ return ublk_provide_io_buf(cmd, ubq, tag);
case UBLK_IO_FETCH_REQ:
/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
if (ublk_queue_ready(ubq)) {
@@ -2120,11 +2259,14 @@ static void ublk_align_max_io_size(struct ublk_device *ub)
static int ublk_add_tag_set(struct ublk_device *ub)
{
+ int zc = !!(ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY);
+ struct ublk_rq_data *data;
+
ub->tag_set.ops = &ublk_mq_ops;
ub->tag_set.nr_hw_queues = ub->dev_info.nr_hw_queues;
ub->tag_set.queue_depth = ub->dev_info.queue_depth;
ub->tag_set.numa_node = NUMA_NO_NODE;
- ub->tag_set.cmd_size = sizeof(struct ublk_rq_data);
+ ub->tag_set.cmd_size = struct_size(data, buf, zc);
ub->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
ub->tag_set.driver_data = ub;
return blk_mq_alloc_tag_set(&ub->tag_set);
@@ -2420,8 +2562,12 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
goto out_free_dev_number;
}
- /* We are not ready to support zero copy */
- ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
+ /* zero copy depends on user copy */
+ if ((ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY) &&
+ !ublk_dev_is_user_copy(ub)) {
+ ret = -EINVAL;
+ goto out_free_dev_number;
+ }
ub->dev_info.nr_hw_queues = min_t(unsigned int,
ub->dev_info.nr_hw_queues, nr_cpu_ids);
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index c8dc5f8ea699..897ace0794c2 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -94,6 +94,8 @@
_IOWR('u', UBLK_IO_COMMIT_AND_FETCH_REQ, struct ublksrv_io_cmd)
#define UBLK_U_IO_NEED_GET_DATA \
_IOWR('u', UBLK_IO_NEED_GET_DATA, struct ublksrv_io_cmd)
+#define UBLK_U_IO_PROVIDE_IO_BUF \
+ _IOWR('u', 0x23, struct ublksrv_io_cmd)
/* only ABORT means that no re-fetch */
#define UBLK_IO_RES_OK 0
@@ -126,10 +128,7 @@
#define UBLKSRV_IO_BUF_TOTAL_BITS (UBLK_QID_OFF + UBLK_QID_BITS)
#define UBLKSRV_IO_BUF_TOTAL_SIZE (1ULL << UBLKSRV_IO_BUF_TOTAL_BITS)
-/*
- * zero copy requires 4k block size, and can remap ublk driver's io
- * request into ublksrv's vm space
- */
+/* io_uring provide kbuf command based zero copy */
#define UBLK_F_SUPPORT_ZERO_COPY (1ULL << 0)
/*
--
2.42.0
^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf
2024-09-12 10:49 [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
` (7 preceding siblings ...)
2024-09-12 10:49 ` [PATCH V6 8/8] ublk: support provide io buffer Ming Lei
@ 2024-09-26 10:27 ` Ming Lei
2024-09-26 12:18 ` Jens Axboe
8 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-09-26 10:27 UTC (permalink / raw)
To: Jens Axboe, io-uring, Pavel Begunkov; +Cc: linux-block
Hello Pavel, Jens and Guys,
On Thu, Sep 12, 2024 at 06:49:20PM +0800, Ming Lei wrote:
> Hello,
>
> The 1st 3 patches are cleanup, and prepare for adding sqe group.
>
> The 4th patch supports generic sqe group which is like link chain, but
> allows each sqe in group to be issued in parallel and the group shares
> same IO_LINK & IO_DRAIN boundary, so N:M dependency can be supported with
> sqe group & io link together. sqe group changes nothing on
> IOSQE_IO_LINK.
>
> The 5th patch supports one variant of sqe group: allow members to depend
> on group leader, so that kernel resource lifetime can be aligned with
> group leader or group, then any kernel resource can be shared in this
> sqe group, and can be used in generic device zero copy.
>
> The 6th & 7th patches supports providing sqe group buffer via the sqe
> group variant.
>
> The 8th patch supports ublk zero copy based on io_uring providing sqe
> group buffer.
>
> Tests:
>
> 1) pass liburing test
> - make runtests
>
> 2) write/pass sqe group test case and sqe provide buffer case:
>
> https://github.com/axboe/liburing/compare/master...ming1:liburing:sqe_group_v3
>
> https://github.com/ming1/liburing/tree/sqe_group_v3
>
> - covers related sqe flags combination and linking groups, both nop and
> one multi-destination file copy.
>
> - cover failure handling test: fail leader IO or member IO in both single
> group and linked groups, which is done in each sqe flags combination
> test
>
> - covers IORING_PROVIDE_GROUP_KBUF by adding ublk-loop-zc
>
> 3) ublksrv zero copy:
>
> ublksrv userspace implements zero copy by sqe group & provide group
> kbuf:
>
> git clone https://github.com/ublk-org/ublksrv.git -b group-provide-buf_v3
> make test T=loop/009:nbd/061 #ublk zc tests
>
> When running 64KB/512KB block size test on ublk-loop('ublk add -t loop --buffered_io -f $backing'),
> it is observed that perf is doubled.
>
>
> V6:
> - follow Pavel's suggestion to disallow IOSQE_CQE_SKIP_SUCCESS &
> LINK_TIMEOUT
> - kill __io_complete_group_member() (Pavel)
> - simplify link failure handling (Pavel)
> - move members' queuing out of completion lock (Pavel)
> - cleanup group io complete handler
> - add more comment
> - add ublk zc into liburing test for covering
> IOSQE_SQE_GROUP & IORING_PROVIDE_GROUP_KBUF
Any comments on V6? So that I may address them in next version since
v6 has small conflict with mainline.
thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf
2024-09-26 10:27 ` [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
@ 2024-09-26 12:18 ` Jens Axboe
2024-09-26 19:46 ` Pavel Begunkov
0 siblings, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2024-09-26 12:18 UTC (permalink / raw)
To: Ming Lei, io-uring, Pavel Begunkov; +Cc: linux-block
On 9/26/24 4:27 AM, Ming Lei wrote:
> Hello Pavel, Jens and Guys,
>
> On Thu, Sep 12, 2024 at 06:49:20PM +0800, Ming Lei wrote:
>> Hello,
>>
>> The 1st 3 patches are cleanup, and prepare for adding sqe group.
>>
>> The 4th patch supports generic sqe group which is like link chain, but
>> allows each sqe in group to be issued in parallel and the group shares
>> same IO_LINK & IO_DRAIN boundary, so N:M dependency can be supported with
>> sqe group & io link together. sqe group changes nothing on
>> IOSQE_IO_LINK.
>>
>> The 5th patch supports one variant of sqe group: allow members to depend
>> on group leader, so that kernel resource lifetime can be aligned with
>> group leader or group, then any kernel resource can be shared in this
>> sqe group, and can be used in generic device zero copy.
>>
>> The 6th & 7th patches supports providing sqe group buffer via the sqe
>> group variant.
>>
>> The 8th patch supports ublk zero copy based on io_uring providing sqe
>> group buffer.
>>
>> Tests:
>>
>> 1) pass liburing test
>> - make runtests
>>
>> 2) write/pass sqe group test case and sqe provide buffer case:
>>
>> https://github.com/axboe/liburing/compare/master...ming1:liburing:sqe_group_v3
>>
>> https://github.com/ming1/liburing/tree/sqe_group_v3
>>
>> - covers related sqe flags combination and linking groups, both nop and
>> one multi-destination file copy.
>>
>> - cover failure handling test: fail leader IO or member IO in both single
>> group and linked groups, which is done in each sqe flags combination
>> test
>>
>> - covers IORING_PROVIDE_GROUP_KBUF by adding ublk-loop-zc
>>
>> 3) ublksrv zero copy:
>>
>> ublksrv userspace implements zero copy by sqe group & provide group
>> kbuf:
>>
>> git clone https://github.com/ublk-org/ublksrv.git -b group-provide-buf_v3
>> make test T=loop/009:nbd/061 #ublk zc tests
>>
>> When running 64KB/512KB block size test on ublk-loop('ublk add -t loop --buffered_io -f $backing'),
>> it is observed that perf is doubled.
>>
>>
>> V6:
>> - follow Pavel's suggestion to disallow IOSQE_CQE_SKIP_SUCCESS &
>> LINK_TIMEOUT
>> - kill __io_complete_group_member() (Pavel)
>> - simplify link failure handling (Pavel)
>> - move members' queuing out of completion lock (Pavel)
>> - cleanup group io complete handler
>> - add more comment
>> - add ublk zc into liburing test for covering
>> IOSQE_SQE_GROUP & IORING_PROVIDE_GROUP_KBUF
>
> Any comments on V6? So that I may address them in next version since
> v6 has small conflict with mainline.
It looks fine to me, don't know if Pavel has any comments. Maybe just
toss out a v7 so it applies cleanly? I'll kick off the 6.13 branch
pretty soon.
--
Jens Axboe
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf
2024-09-26 12:18 ` Jens Axboe
@ 2024-09-26 19:46 ` Pavel Begunkov
0 siblings, 0 replies; 47+ messages in thread
From: Pavel Begunkov @ 2024-09-26 19:46 UTC (permalink / raw)
To: Jens Axboe, Ming Lei, io-uring; +Cc: linux-block
On 9/26/24 13:18, Jens Axboe wrote:
> On 9/26/24 4:27 AM, Ming Lei wrote:
>> Hello Pavel, Jens and Guys,
>>
>> On Thu, Sep 12, 2024 at 06:49:20PM +0800, Ming Lei wrote:
>>> Hello,
>>>
>>> The 1st 3 patches are cleanup, and prepare for adding sqe group.
>>>
>>> The 4th patch supports generic sqe group which is like link chain, but
>>> allows each sqe in group to be issued in parallel and the group shares
>>> same IO_LINK & IO_DRAIN boundary, so N:M dependency can be supported with
>>> sqe group & io link together. sqe group changes nothing on
>>> IOSQE_IO_LINK.
>>>
>>> The 5th patch supports one variant of sqe group: allow members to depend
>>> on group leader, so that kernel resource lifetime can be aligned with
>>> group leader or group, then any kernel resource can be shared in this
>>> sqe group, and can be used in generic device zero copy.
>>>
>>> The 6th & 7th patches supports providing sqe group buffer via the sqe
>>> group variant.
>>>
>>> The 8th patch supports ublk zero copy based on io_uring providing sqe
>>> group buffer.
>>>
>>> Tests:
>>>
>>> 1) pass liburing test
>>> - make runtests
>>>
>>> 2) write/pass sqe group test case and sqe provide buffer case:
>>>
>>> https://github.com/axboe/liburing/compare/master...ming1:liburing:sqe_group_v3
>>>
>>> https://github.com/ming1/liburing/tree/sqe_group_v3
>>>
>>> - covers related sqe flags combination and linking groups, both nop and
>>> one multi-destination file copy.
>>>
>>> - cover failure handling test: fail leader IO or member IO in both single
>>> group and linked groups, which is done in each sqe flags combination
>>> test
>>>
>>> - covers IORING_PROVIDE_GROUP_KBUF by adding ublk-loop-zc
>>>
>>> 3) ublksrv zero copy:
>>>
>>> ublksrv userspace implements zero copy by sqe group & provide group
>>> kbuf:
>>>
>>> git clone https://github.com/ublk-org/ublksrv.git -b group-provide-buf_v3
>>> make test T=loop/009:nbd/061 #ublk zc tests
>>>
>>> When running 64KB/512KB block size test on ublk-loop('ublk add -t loop --buffered_io -f $backing'),
>>> it is observed that perf is doubled.
>>>
>>>
>>> V6:
>>> - follow Pavel's suggestion to disallow IOSQE_CQE_SKIP_SUCCESS &
>>> LINK_TIMEOUT
>>> - kill __io_complete_group_member() (Pavel)
>>> - simplify link failure handling (Pavel)
>>> - move members' queuing out of completion lock (Pavel)
>>> - cleanup group io complete handler
>>> - add more comment
>>> - add ublk zc into liburing test for covering
>>> IOSQE_SQE_GROUP & IORING_PROVIDE_GROUP_KBUF
>>
>> Any comments on V6? So that I may address them in next version since
>> v6 has small conflict with mainline.
>
> It looks fine to me, don't know if Pavel has any comments. Maybe just
> toss out a v7 so it applies cleanly? I'll kick off the 6.13 branch
> pretty soon.
The implementation is not that straightforward and warrants some prudence
in reviewing. I was visiting conferences, but I'm going to take a look
next week or earlier.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 4/8] io_uring: support SQE group
2024-09-12 10:49 ` [PATCH V6 4/8] io_uring: support SQE group Ming Lei
@ 2024-10-04 13:12 ` Pavel Begunkov
2024-10-06 3:54 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-10-04 13:12 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, io-uring; +Cc: linux-block, Kevin Wolf
On 9/12/24 11:49, Ming Lei wrote:
...
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -111,13 +111,15 @@
...
> +static void io_complete_group_member(struct io_kiocb *req)
> +{
> + struct io_kiocb *lead = get_group_leader(req);
> +
> + if (WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP) ||
> + lead->grp_refs <= 0))
> + return;
> +
> + /* member CQE needs to be posted first */
> + if (!(req->flags & REQ_F_CQE_SKIP))
> + io_req_commit_cqe(req->ctx, req);
> +
> + req->flags &= ~REQ_F_SQE_GROUP;
I can't say I like this implicit state machine too much,
but let's add a comment why we need to clear it. i.e.
it seems it wouldn't be needed if not for the
mark_last_group_member() below that puts it back to tunnel
the leader to io_free_batch_list().
> +
> + /* Set leader as failed in case of any member failed */
> + if (unlikely((req->flags & REQ_F_FAIL)))
> + req_set_fail(lead);
> +
> + if (!--lead->grp_refs) {
> + mark_last_group_member(req);
> + if (!(lead->flags & REQ_F_CQE_SKIP))
> + io_req_commit_cqe(lead->ctx, lead);
> + } else if (lead->grp_refs == 1 && (lead->flags & REQ_F_SQE_GROUP)) {
> + /*
> + * The single uncompleted leader will degenerate to plain
> + * request, so group leader can be always freed via the
> + * last completed member.
> + */
> + lead->flags &= ~REQ_F_SQE_GROUP_LEADER;
What does this try to handle? A group with a leader but no
members? If that's the case, io_group_sqe() and io_submit_state_end()
just need to fail such groups (and clear REQ_F_SQE_GROUP before
that).
> + }
> +}
> +
> +static void io_complete_group_leader(struct io_kiocb *req)
> +{
> + WARN_ON_ONCE(req->grp_refs <= 1);
> + req->flags &= ~REQ_F_SQE_GROUP;
> + req->grp_refs -= 1;
> +}
> +
> +static void io_complete_group_req(struct io_kiocb *req)
> +{
> + if (req_is_group_leader(req))
> + io_complete_group_leader(req);
> + else
> + io_complete_group_member(req);
> +}
> +
> static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
> {
> struct io_ring_ctx *ctx = req->ctx;
> @@ -890,7 +1005,8 @@ static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
> * Handle special CQ sync cases via task_work. DEFER_TASKRUN requires
> * the submitter task context, IOPOLL protects with uring_lock.
> */
> - if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL)) {
> + if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL) ||
> + req_is_group_leader(req)) {
We're better to push all group requests to io_req_task_complete(),
not just a group leader. While seems to be correct, that just
overcomplicates the request's flow, it can post a CQE here, but then
still expect to do group stuff in the CQE posting loop
(flush_completions -> io_complete_group_req), which might post another
cqe for the leader, and then do yet another post processing loop in
io_free_batch_list().
> req->io_task_work.func = io_req_task_complete;
> io_req_task_work_add(req);
> return;
> @@ -1388,11 +1504,43 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
> comp_list);
>
> if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
> + if (req_is_last_group_member(req) ||
> + req_is_group_leader(req)) {
> + struct io_kiocb *leader;
> +
> + /* Leader is freed via the last member */
> + if (req_is_group_leader(req)) {
> + if (req->grp_link)
> + io_queue_group_members(req);
> + node = req->comp_list.next;
> + continue;
> + }
> +
> + /*
> + * Prepare for freeing leader since we are the
> + * last group member
> + */
> + leader = get_group_leader(req);
> + leader->flags &= ~REQ_F_SQE_GROUP_LEADER;
> + req->flags &= ~REQ_F_SQE_GROUP;
> + /*
> + * Link leader to current request's next,
> + * this way works because the iterator
> + * always check the next node only.
> + *
> + * Be careful when you change the iterator
> + * in future
> + */
> + wq_stack_add_head(&leader->comp_list,
> + &req->comp_list);
> + }
> +
> if (req->flags & REQ_F_REFCOUNT) {
> node = req->comp_list.next;
> if (!req_ref_put_and_test(req))
> continue;
> }
> +
> if ((req->flags & REQ_F_POLLED) && req->apoll) {
> struct async_poll *apoll = req->apoll;
>
> @@ -1427,8 +1575,16 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
> struct io_kiocb *req = container_of(node, struct io_kiocb,
> comp_list);
>
> - if (!(req->flags & REQ_F_CQE_SKIP))
> - io_req_commit_cqe(ctx, req);
> + if (unlikely(req->flags & (REQ_F_CQE_SKIP | REQ_F_SQE_GROUP))) {
> + if (req->flags & REQ_F_SQE_GROUP) {
> + io_complete_group_req(req);
> + continue;
> + }
> +
> + if (req->flags & REQ_F_CQE_SKIP)
> + continue;
> + }
> + io_req_commit_cqe(ctx, req);
> }
> __io_cq_unlock_post(ctx);
>
> @@ -1638,8 +1794,12 @@ static u32 io_get_sequence(struct io_kiocb *req)
> struct io_kiocb *cur;
>
> /* need original cached_sq_head, but it was increased for each req */
> - io_for_each_link(cur, req)
> - seq--;
> + io_for_each_link(cur, req) {
> + if (req_is_group_leader(cur))
> + seq -= cur->grp_refs;
> + else
> + seq--;
> + }
> return seq;
> }
...
> @@ -2217,8 +2470,22 @@ static void io_submit_state_end(struct io_ring_ctx *ctx)
> {
> struct io_submit_state *state = &ctx->submit_state;
>
> - if (unlikely(state->link.head))
> - io_queue_sqe_fallback(state->link.head);
> + if (unlikely(state->group.head || state->link.head)) {
> + /* the last member must set REQ_F_SQE_GROUP */
> + if (state->group.head) {
> + struct io_kiocb *lead = state->group.head;
> +
> + state->group.last->grp_link = NULL;
> + if (lead->flags & IO_REQ_LINK_FLAGS)
> + io_link_sqe(&state->link, lead);
> + else
> + io_queue_sqe_fallback(lead);
req1(F_LINK), req2(F_GROUP), req3
is supposed to be turned into
req1 -> {group: req2 (lead), req3 }
but note that req2 here doesn't have F_LINK set.
I think it should be like this instead:
if (state->link.head)
io_link_sqe();
else
io_queue_sqe_fallback(lead);
> + }
> +
> + if (unlikely(state->link.head))
> + io_queue_sqe_fallback(state->link.head);
> + }
> +
> /* flush only after queuing links as they can generate completions */
> io_submit_flush_completions(ctx);
> if (state->plug_started)
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 5/8] io_uring: support sqe group with members depending on leader
2024-09-12 10:49 ` [PATCH V6 5/8] io_uring: support sqe group with members depending on leader Ming Lei
@ 2024-10-04 13:18 ` Pavel Begunkov
2024-10-06 3:54 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-10-04 13:18 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, io-uring; +Cc: linux-block
On 9/12/24 11:49, Ming Lei wrote:
> IOSQE_SQE_GROUP just starts to queue members after the leader is completed,
> which way is just for simplifying implementation, and this behavior is never
> part of UAPI, and it may be relaxed and members can be queued concurrently
> with leader in future.
>
> However, some resource can't cross OPs, such as kernel buffer, otherwise
> the buffer may be leaked easily in case that any OP failure or application
> panic.
>
> Add flag REQ_F_SQE_GROUP_DEP for allowing members to depend on group leader
> explicitly, so that group members won't be queued until the leader request is
> completed, the kernel resource lifetime can be aligned with group leader
That's the current and only behaviour, we don't need an extra flag
for that. We can add it back later when anything changes.
> or group, one typical use case is to support zero copy for device internal
> buffer.
>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> include/linux/io_uring_types.h | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 11c6726abbb9..793d5a26d9b8 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -472,6 +472,7 @@ enum {
> REQ_F_BL_NO_RECYCLE_BIT,
> REQ_F_BUFFERS_COMMIT_BIT,
> REQ_F_SQE_GROUP_LEADER_BIT,
> + REQ_F_SQE_GROUP_DEP_BIT,
>
> /* not a real bit, just to check we're not overflowing the space */
> __REQ_F_LAST_BIT,
> @@ -554,6 +555,8 @@ enum {
> REQ_F_BUFFERS_COMMIT = IO_REQ_FLAG(REQ_F_BUFFERS_COMMIT_BIT),
> /* sqe group lead */
> REQ_F_SQE_GROUP_LEADER = IO_REQ_FLAG(REQ_F_SQE_GROUP_LEADER_BIT),
> + /* sqe group with members depending on leader */
> + REQ_F_SQE_GROUP_DEP = IO_REQ_FLAG(REQ_F_SQE_GROUP_DEP_BIT),
> };
>
> typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 6/8] io_uring: support providing sqe group buffer
2024-09-12 10:49 ` [PATCH V6 6/8] io_uring: support providing sqe group buffer Ming Lei
@ 2024-10-04 15:32 ` Pavel Begunkov
2024-10-06 8:20 ` Ming Lei
2024-10-06 9:47 ` Ming Lei
0 siblings, 2 replies; 47+ messages in thread
From: Pavel Begunkov @ 2024-10-04 15:32 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, io-uring; +Cc: linux-block
On 9/12/24 11:49, Ming Lei wrote:
...
> It can help to implement generic zero copy between device and related
> operations, such as ublk, fuse, vdpa,
> even network receive or whatever.
That's exactly the thing it can't sanely work with because
of this design.
> Signed-off-by: Ming Lei <[email protected]>
> ---
> include/linux/io_uring_types.h | 33 +++++++++++++++++++
> io_uring/io_uring.c | 10 +++++-
> io_uring/io_uring.h | 10 ++++++
> io_uring/kbuf.c | 60 ++++++++++++++++++++++++++++++++++
> io_uring/kbuf.h | 13 ++++++++
> io_uring/net.c | 23 ++++++++++++-
> io_uring/opdef.c | 4 +++
> io_uring/opdef.h | 2 ++
> io_uring/rw.c | 20 +++++++++++-
> 9 files changed, 172 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 793d5a26d9b8..445e5507565a 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -6,6 +6,7 @@
> #include <linux/task_work.h>
> #include <linux/bitmap.h>
> #include <linux/llist.h>
> +#include <linux/bvec.h>
> #include <uapi/linux/io_uring.h>
>
> enum {
> @@ -39,6 +40,26 @@ enum io_uring_cmd_flags {
> IO_URING_F_COMPAT = (1 << 12),
> };
>
> +struct io_uring_kernel_buf;
> +typedef void (io_uring_buf_giveback_t) (const struct io_uring_kernel_buf *);
> +
> +/* buffer provided from kernel */
> +struct io_uring_kernel_buf {
> + unsigned long len;
> + unsigned short nr_bvecs;
> + unsigned char dir; /* ITER_SOURCE or ITER_DEST */
> +
> + /* offset in the 1st bvec */
> + unsigned int offset;
> + const struct bio_vec *bvec;
> +
> + /* called when we are done with this buffer */
> + io_uring_buf_giveback_t *grp_kbuf_ack;
> +
> + /* private field, user don't touch it */
> + struct bio_vec __bvec[];
> +};
> +
> struct io_wq_work_node {
> struct io_wq_work_node *next;
> };
> @@ -473,6 +494,7 @@ enum {
> REQ_F_BUFFERS_COMMIT_BIT,
> REQ_F_SQE_GROUP_LEADER_BIT,
> REQ_F_SQE_GROUP_DEP_BIT,
> + REQ_F_GROUP_KBUF_BIT,
>
> /* not a real bit, just to check we're not overflowing the space */
> __REQ_F_LAST_BIT,
> @@ -557,6 +579,8 @@ enum {
> REQ_F_SQE_GROUP_LEADER = IO_REQ_FLAG(REQ_F_SQE_GROUP_LEADER_BIT),
> /* sqe group with members depending on leader */
> REQ_F_SQE_GROUP_DEP = IO_REQ_FLAG(REQ_F_SQE_GROUP_DEP_BIT),
> + /* group lead provides kbuf for members, set for both lead and member */
> + REQ_F_GROUP_KBUF = IO_REQ_FLAG(REQ_F_GROUP_KBUF_BIT),
We have a huge flag problem here. It's a 4th group flag, that gives
me an idea that it's overabused. We're adding state machines based on
them "set group, clear group, but if last set it again. And clear
group lead if refs are of particular value". And it's not really
clear what these two flags are here for or what they do.
From what I see you need here just one flag to mark requests
that provide a buffer, ala REQ_F_PROVIDING_KBUF. On the import
side:
if ((req->flags & GROUP) && (req->lead->flags & REQ_F_PROVIDING_KBUF))
...
And when you kill the request:
if (req->flags & REQ_F_PROVIDING_KBUF)
io_group_kbuf_drop();
And I don't think you need opdef::accept_group_kbuf since the
request handler should already know that and, importantly, you don't
imbue any semantics based on it.
FWIW, it would be nice if during init we could verify that the leader
provides a buffer IFF there is someone consuming it, but I don't think
the semantics is flexible enough to do it sanely. i.e. there are many
members in a group, some might want to use the buffer and some might not.
> typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
> @@ -640,6 +664,15 @@ struct io_kiocb {
> * REQ_F_BUFFER_RING is set.
> */
> struct io_buffer_list *buf_list;
> +
> + /*
> + * store kernel buffer provided by sqe group lead, valid
> + * IFF REQ_F_GROUP_KBUF
> + *
> + * The buffer meta is immutable since it is shared by
> + * all member requests
> + */
> + const struct io_uring_kernel_buf *grp_kbuf;
> };
>
> union {
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 99b44b6babd6..80c4d9192657 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -116,7 +116,7 @@
>
> #define IO_REQ_CLEAN_FLAGS (REQ_F_BUFFER_SELECTED | REQ_F_NEED_CLEANUP | \
> REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS | \
> - REQ_F_ASYNC_DATA)
> + REQ_F_ASYNC_DATA | REQ_F_GROUP_KBUF)
>
> #define IO_REQ_CLEAN_SLOW_FLAGS (REQ_F_REFCOUNT | REQ_F_LINK | REQ_F_HARDLINK |\
> REQ_F_SQE_GROUP | REQ_F_SQE_GROUP_LEADER | \
> @@ -387,6 +387,11 @@ static bool req_need_defer(struct io_kiocb *req, u32 seq)
>
> static void io_clean_op(struct io_kiocb *req)
> {
> + /* GROUP_KBUF is only available for REQ_F_SQE_GROUP_DEP */
> + if ((req->flags & (REQ_F_GROUP_KBUF | REQ_F_SQE_GROUP_DEP)) ==
> + (REQ_F_GROUP_KBUF | REQ_F_SQE_GROUP_DEP))
> + io_group_kbuf_drop(req);
> +
> if (req->flags & REQ_F_BUFFER_SELECTED) {
> spin_lock(&req->ctx->completion_lock);
> io_kbuf_drop(req);
> @@ -914,9 +919,12 @@ static void io_queue_group_members(struct io_kiocb *req)
>
> req->grp_link = NULL;
> while (member) {
> + const struct io_issue_def *def = &io_issue_defs[member->opcode];
> struct io_kiocb *next = member->grp_link;
>
> member->grp_leader = req;
> + if ((req->flags & REQ_F_GROUP_KBUF) && def->accept_group_kbuf)
> + member->flags |= REQ_F_GROUP_KBUF;
As per above I suspect that is not needed.
> if (unlikely(member->flags & REQ_F_FAIL)) {
> io_req_task_queue_fail(member, member->cqe.res);
> } else if (unlikely(req->flags & REQ_F_FAIL)) {
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index df2be7353414..8e111d24c02d 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -349,6 +349,16 @@ static inline bool req_is_group_leader(struct io_kiocb *req)
> return req->flags & REQ_F_SQE_GROUP_LEADER;
> }
>
...
> +int io_import_group_kbuf(struct io_kiocb *req, unsigned long buf_off,
> + unsigned int len, int dir, struct iov_iter *iter)
> +{
> + struct io_kiocb *lead = req->grp_link;
> + const struct io_uring_kernel_buf *kbuf;
> + unsigned long offset;
> +
> + WARN_ON_ONCE(!(req->flags & REQ_F_GROUP_KBUF));
> +
> + if (!req_is_group_member(req))
> + return -EINVAL;
> +
> + if (!lead || !req_support_group_dep(lead) || !lead->grp_kbuf)
> + return -EINVAL;
> +
> + /* req->fused_cmd_kbuf is immutable */
> + kbuf = lead->grp_kbuf;
> + offset = kbuf->offset;
> +
> + if (!kbuf->bvec)
> + return -EINVAL;
How can this happen?
> + if (dir != kbuf->dir)
> + return -EINVAL;
> +
> + if (unlikely(buf_off > kbuf->len))
> + return -EFAULT;
> +
> + if (unlikely(len > kbuf->len - buf_off))
> + return -EFAULT;
check_add_overflow() would be more readable
> +
> + /* don't use io_import_fixed which doesn't support multipage bvec */
> + offset += buf_off;
> + iov_iter_bvec(iter, dir, kbuf->bvec, kbuf->nr_bvecs, offset + len);
> +
> + if (offset)
> + iov_iter_advance(iter, offset);
> +
> + return 0;
> +}
> diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
> index 36aadfe5ac00..37d18324e840 100644
> --- a/io_uring/kbuf.h
> +++ b/io_uring/kbuf.h
> @@ -89,6 +89,11 @@ struct io_buffer_list *io_pbuf_get_bl(struct io_ring_ctx *ctx,
> unsigned long bgid);
> int io_pbuf_mmap(struct file *file, struct vm_area_struct *vma);
>
> +int io_provide_group_kbuf(struct io_kiocb *req,
> + const struct io_uring_kernel_buf *grp_kbuf);
> +int io_import_group_kbuf(struct io_kiocb *req, unsigned long buf_off,
> + unsigned int len, int dir, struct iov_iter *iter);
> +
> static inline bool io_kbuf_recycle_ring(struct io_kiocb *req)
> {
> /*
> @@ -220,4 +225,12 @@ static inline unsigned int io_put_kbufs(struct io_kiocb *req, int len,
> {
> return __io_put_kbufs(req, len, nbufs, issue_flags);
> }
> +
> +static inline void io_group_kbuf_drop(struct io_kiocb *req)
> +{
> + const struct io_uring_kernel_buf *gbuf = req->grp_kbuf;
> +
> + if (gbuf && gbuf->grp_kbuf_ack)
How can ->grp_kbuf_ack be missing?
> + gbuf->grp_kbuf_ack(gbuf);
> +}
> #endif
> diff --git a/io_uring/net.c b/io_uring/net.c
> index f10f5a22d66a..ad24dd5924d2 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -89,6 +89,13 @@ struct io_sr_msg {
> */
> #define MULTISHOT_MAX_RETRY 32
>
> +#define user_ptr_to_u64(x) ( \
> +{ \
> + typecheck(void __user *, (x)); \
> + (u64)(unsigned long)(x); \
> +} \
> +)
> +
> int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> {
> struct io_shutdown *shutdown = io_kiocb_to_cmd(req, struct io_shutdown);
> @@ -375,7 +382,7 @@ static int io_send_setup(struct io_kiocb *req)
> kmsg->msg.msg_name = &kmsg->addr;
> kmsg->msg.msg_namelen = sr->addr_len;
> }
> - if (!io_do_buffer_select(req)) {
> + if (!io_do_buffer_select(req) && !(req->flags & REQ_F_GROUP_KBUF)) {
> ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len,
> &kmsg->msg.msg_iter);
> if (unlikely(ret < 0))
> @@ -593,6 +600,15 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
> if (issue_flags & IO_URING_F_NONBLOCK)
> flags |= MSG_DONTWAIT;
>
> + if (req->flags & REQ_F_GROUP_KBUF) {
Does anything prevent the request from being marked by both
GROUP_KBUF and BUFFER_SELECT? In which case we'd set up
a group kbuf and then go to the io_do_buffer_select()
overriding all of that
> + ret = io_import_group_kbuf(req,
> + user_ptr_to_u64(sr->buf),
> + sr->len, ITER_SOURCE,
> + &kmsg->msg.msg_iter);
> + if (unlikely(ret))
> + return ret;
> + }
> +
> retry_bundle:
> if (io_do_buffer_select(req)) {
> struct buf_sel_arg arg = {
> @@ -1154,6 +1170,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
> goto out_free;
> }
> sr->buf = NULL;
> + } else if (req->flags & REQ_F_GROUP_KBUF) {
What happens if we get a short read/recv?
> + ret = io_import_group_kbuf(req, user_ptr_to_u64(sr->buf),
> + sr->len, ITER_DEST, &kmsg->msg.msg_iter);
> + if (unlikely(ret))
> + goto out_free;
> }
>
> kmsg->msg.msg_flags = 0;
...
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-09-12 10:49 ` [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer Ming Lei
@ 2024-10-04 15:44 ` Pavel Begunkov
2024-10-06 8:46 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-10-04 15:44 UTC (permalink / raw)
To: Ming Lei, Jens Axboe, io-uring; +Cc: linux-block
On 9/12/24 11:49, Ming Lei wrote:
> Allow uring command to be group leader for providing kernel buffer,
> and this way can support generic device zero copy over device buffer.
>
> The following patch will use the way to support zero copy for ublk.
>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> include/linux/io_uring/cmd.h | 7 +++++++
> include/uapi/linux/io_uring.h | 7 ++++++-
> io_uring/uring_cmd.c | 28 ++++++++++++++++++++++++++++
> 3 files changed, 41 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> index 447fbfd32215..fde3a2ec7d9a 100644
> --- a/include/linux/io_uring/cmd.h
> +++ b/include/linux/io_uring/cmd.h
> @@ -48,6 +48,8 @@ void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
> void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
> unsigned int issue_flags);
>
> +int io_uring_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
> + const struct io_uring_kernel_buf *grp_kbuf);
> #else
> static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
> struct iov_iter *iter, void *ioucmd)
> @@ -67,6 +69,11 @@ static inline void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
> unsigned int issue_flags)
> {
> }
> +static inline int io_uring_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
> + const struct io_uring_kernel_buf *grp_kbuf)
> +{
> + return -EOPNOTSUPP;
> +}
> #endif
>
> /*
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 2af32745ebd3..11985eeac10e 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -271,9 +271,14 @@ enum io_uring_op {
> * sqe->uring_cmd_flags top 8bits aren't available for userspace
> * IORING_URING_CMD_FIXED use registered buffer; pass this flag
> * along with setting sqe->buf_index.
> + * IORING_PROVIDE_GROUP_KBUF this command provides group kernel buffer
> + * for member requests which can retrieve
> + * any sub-buffer with offset(sqe->addr) and
> + * len(sqe->len)
Is there a good reason it needs to be a cmd generic flag instead of
ublk specific?
1. Extra overhead for files / cmds that don't even care about the
feature.
2. As it stands with this patch, the flag is ignored by all other
cmd implementations, which might be quite confusing as an api,
especially since, if REQ_F_GROUP_KBUF is not set, member
requests will silently try to import a buffer the "normal way",
i.e. interpret sqe->addr or such as the target buffer.
3. We can't even put some nice semantics on top since it's
still cmd specific and not generic to all other io_uring
requests.
I'd even think that it'd make sense to implement it as a
new cmd opcode, but that's the business of the file implementing
it, i.e. ublk.
> */
> #define IORING_URING_CMD_FIXED (1U << 0)
> -#define IORING_URING_CMD_MASK IORING_URING_CMD_FIXED
> +#define IORING_PROVIDE_GROUP_KBUF (1U << 1)
> +#define IORING_URING_CMD_MASK (IORING_URING_CMD_FIXED | IORING_PROVIDE_GROUP_KBUF)
>
>
> /*
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 4/8] io_uring: support SQE group
2024-10-04 13:12 ` Pavel Begunkov
@ 2024-10-06 3:54 ` Ming Lei
2024-10-09 11:53 ` Pavel Begunkov
0 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-10-06 3:54 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block, Kevin Wolf, ming.lei
On Fri, Oct 04, 2024 at 02:12:28PM +0100, Pavel Begunkov wrote:
> On 9/12/24 11:49, Ming Lei wrote:
> ...
> > --- a/io_uring/io_uring.c
> > +++ b/io_uring/io_uring.c
> > @@ -111,13 +111,15 @@
> ...
> > +static void io_complete_group_member(struct io_kiocb *req)
> > +{
> > + struct io_kiocb *lead = get_group_leader(req);
> > +
> > + if (WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP) ||
> > + lead->grp_refs <= 0))
> > + return;
> > +
> > + /* member CQE needs to be posted first */
> > + if (!(req->flags & REQ_F_CQE_SKIP))
> > + io_req_commit_cqe(req->ctx, req);
> > +
> > + req->flags &= ~REQ_F_SQE_GROUP;
>
> I can't say I like this implicit state machine too much,
> but let's add a comment why we need to clear it. i.e.
> it seems it wouldn't be needed if not for the
> mark_last_group_member() below that puts it back to tunnel
> the leader to io_free_batch_list().
Yeah, the main purpose is to reuse the flag for marking the last
member; I will add a comment for this usage.
>
> > +
> > + /* Set leader as failed in case of any member failed */
> > + if (unlikely((req->flags & REQ_F_FAIL)))
> > + req_set_fail(lead);
> > +
> > + if (!--lead->grp_refs) {
> > + mark_last_group_member(req);
> > + if (!(lead->flags & REQ_F_CQE_SKIP))
> > + io_req_commit_cqe(lead->ctx, lead);
> > + } else if (lead->grp_refs == 1 && (lead->flags & REQ_F_SQE_GROUP)) {
> > + /*
> > + * The single uncompleted leader will degenerate to plain
> > + * request, so group leader can be always freed via the
> > + * last completed member.
> > + */
> > + lead->flags &= ~REQ_F_SQE_GROUP_LEADER;
>
> What does this try to handle? A group with a leader but no
> members? If that's the case, io_group_sqe() and io_submit_state_end()
> just need to fail such groups (and clear REQ_F_SQE_GROUP before
> that).
The code block allows issuing the leader and members concurrently, but
we have changed to always issue members after the leader is completed, so
the above code can be removed now.
>
> > + }
> > +}
> > +
> > +static void io_complete_group_leader(struct io_kiocb *req)
> > +{
> > + WARN_ON_ONCE(req->grp_refs <= 1);
> > + req->flags &= ~REQ_F_SQE_GROUP;
> > + req->grp_refs -= 1;
> > +}
> > +
> > +static void io_complete_group_req(struct io_kiocb *req)
> > +{
> > + if (req_is_group_leader(req))
> > + io_complete_group_leader(req);
> > + else
> > + io_complete_group_member(req);
> > +}
> > +
> > static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
> > {
> > struct io_ring_ctx *ctx = req->ctx;
> > @@ -890,7 +1005,8 @@ static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
> > * Handle special CQ sync cases via task_work. DEFER_TASKRUN requires
> > * the submitter task context, IOPOLL protects with uring_lock.
> > */
> > - if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL)) {
> > + if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL) ||
> > + req_is_group_leader(req)) {
>
> We're better to push all group requests to io_req_task_complete(),
> not just a group leader. While seems to be correct, that just
> overcomplicates the request's flow, it can post a CQE here, but then
> still expect to do group stuff in the CQE posting loop
> (flush_completions -> io_complete_group_req), which might post another
> cqe for the leader, and then do yet another post processing loop in
> io_free_batch_list().
OK, it is simpler to complete all group reqs via tw.
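Something like the following in io_req_complete_post(), i.e. checking the
plain group flag instead of just the leader (untested):

    if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL) ||
        (req->flags & REQ_F_SQE_GROUP)) {
        req->io_task_work.func = io_req_task_complete;
        io_req_task_work_add(req);
        return;
    }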
>
>
> > req->io_task_work.func = io_req_task_complete;
> > io_req_task_work_add(req);
> > return;
> > @@ -1388,11 +1504,43 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
> > comp_list);
> > if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
> > + if (req_is_last_group_member(req) ||
> > + req_is_group_leader(req)) {
> > + struct io_kiocb *leader;
> > +
> > + /* Leader is freed via the last member */
> > + if (req_is_group_leader(req)) {
> > + if (req->grp_link)
> > + io_queue_group_members(req);
> > + node = req->comp_list.next;
> > + continue;
> > + }
> > +
> > + /*
> > + * Prepare for freeing leader since we are the
> > + * last group member
> > + */
> > + leader = get_group_leader(req);
> > + leader->flags &= ~REQ_F_SQE_GROUP_LEADER;
> > + req->flags &= ~REQ_F_SQE_GROUP;
> > + /*
> > + * Link leader to current request's next,
> > + * this way works because the iterator
> > + * always check the next node only.
> > + *
> > + * Be careful when you change the iterator
> > + * in future
> > + */
> > + wq_stack_add_head(&leader->comp_list,
> > + &req->comp_list);
> > + }
> > +
> > if (req->flags & REQ_F_REFCOUNT) {
> > node = req->comp_list.next;
> > if (!req_ref_put_and_test(req))
> > continue;
> > }
> > +
> > if ((req->flags & REQ_F_POLLED) && req->apoll) {
> > struct async_poll *apoll = req->apoll;
> > @@ -1427,8 +1575,16 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
> > struct io_kiocb *req = container_of(node, struct io_kiocb,
> > comp_list);
> > - if (!(req->flags & REQ_F_CQE_SKIP))
> > - io_req_commit_cqe(ctx, req);
> > + if (unlikely(req->flags & (REQ_F_CQE_SKIP | REQ_F_SQE_GROUP))) {
> > + if (req->flags & REQ_F_SQE_GROUP) {
> > + io_complete_group_req(req);
> > + continue;
> > + }
> > +
> > + if (req->flags & REQ_F_CQE_SKIP)
> > + continue;
> > + }
> > + io_req_commit_cqe(ctx, req);
> > }
> > __io_cq_unlock_post(ctx);
> > @@ -1638,8 +1794,12 @@ static u32 io_get_sequence(struct io_kiocb *req)
> > struct io_kiocb *cur;
> > /* need original cached_sq_head, but it was increased for each req */
> > - io_for_each_link(cur, req)
> > - seq--;
> > + io_for_each_link(cur, req) {
> > + if (req_is_group_leader(cur))
> > + seq -= cur->grp_refs;
> > + else
> > + seq--;
> > + }
> > return seq;
> > }
> ...
> > @@ -2217,8 +2470,22 @@ static void io_submit_state_end(struct io_ring_ctx *ctx)
> > {
> > struct io_submit_state *state = &ctx->submit_state;
> > - if (unlikely(state->link.head))
> > - io_queue_sqe_fallback(state->link.head);
> > + if (unlikely(state->group.head || state->link.head)) {
> > + /* the last member must set REQ_F_SQE_GROUP */
> > + if (state->group.head) {
> > + struct io_kiocb *lead = state->group.head;
> > +
> > + state->group.last->grp_link = NULL;
> > + if (lead->flags & IO_REQ_LINK_FLAGS)
> > + io_link_sqe(&state->link, lead);
> > + else
> > + io_queue_sqe_fallback(lead);
>
> req1(F_LINK), req2(F_GROUP), req3
>
> is supposed to be turned into
>
> req1 -> {group: req2 (lead), req3 }
>
> but note that req2 here doesn't have F_LINK set.
> I think it should be like this instead:
>
> if (state->link.head)
> io_link_sqe();
> else
> io_queue_sqe_fallback(lead);
Indeed, the above change is correct.
Thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 5/8] io_uring: support sqe group with members depending on leader
2024-10-04 13:18 ` Pavel Begunkov
@ 2024-10-06 3:54 ` Ming Lei
0 siblings, 0 replies; 47+ messages in thread
From: Ming Lei @ 2024-10-06 3:54 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block
On Fri, Oct 04, 2024 at 02:18:13PM +0100, Pavel Begunkov wrote:
> On 9/12/24 11:49, Ming Lei wrote:
> > IOSQE_SQE_GROUP just starts to queue members after the leader is completed,
> > which way is just for simplifying implementation, and this behavior is never
> > part of UAPI, and it may be relaxed and members can be queued concurrently
> > with leader in future.
> >
> > However, some resource can't cross OPs, such as kernel buffer, otherwise
> > the buffer may be leaked easily in case that any OP failure or application
> > panic.
> >
> > Add flag REQ_F_SQE_GROUP_DEP for allowing members to depend on group leader
> > explicitly, so that group members won't be queued until the leader request is
> > completed, the kernel resource lifetime can be aligned with group leader
>
> That's the current and only behaviour, we don't need an extra flag
> for that. We can add it back later when anything changes.
OK.
Thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 6/8] io_uring: support providing sqe group buffer
2024-10-04 15:32 ` Pavel Begunkov
@ 2024-10-06 8:20 ` Ming Lei
2024-10-09 14:25 ` Pavel Begunkov
2024-10-06 9:47 ` Ming Lei
1 sibling, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-10-06 8:20 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block, ming.lei
On Fri, Oct 04, 2024 at 04:32:04PM +0100, Pavel Begunkov wrote:
> On 9/12/24 11:49, Ming Lei wrote:
> ...
> > It can help to implement generic zero copy between device and related
> > operations, such as ublk, fuse, vdpa,
> > even network receive or whatever.
>
> That's exactly the thing it can't sanely work with because
> of this design.
The provide buffer design is absolutely generic; basically:
- the group leader provides a buffer for member OPs, and member OPs can
borrow the buffer if the leader allows it by calling io_provide_group_kbuf()
- after the member OPs consume the buffer, the buffer is returned via the
callback implemented in the group leader's subsystem, so the group leader
can release the related resources;
- and it is guaranteed that the buffer is always released
The ublk implementation is pretty simple, and the same approach can be
reused in any device driver to share a buffer with other kernel subsystems.
I don't see anything insane with the design.
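For reference, all a provider has to do is fill the immutable buffer
descriptor and hand it over from its uring_cmd ->issue(); a stripped-down
sketch (driver and callback names are made up):

    static void my_drv_buf_ack(const struct io_uring_kernel_buf *buf)
    {
        /* the whole group is done with the buffer: release driver resources */
    }

    static int my_drv_provide_buf(struct io_uring_cmd *cmd,
                                  struct io_uring_kernel_buf *buf,
                                  struct bio_vec *bvec, int nr_bvecs,
                                  unsigned long len)
    {
        buf->bvec = bvec;
        buf->nr_bvecs = nr_bvecs;
        buf->offset = 0;
        buf->len = len;
        buf->dir = ITER_SOURCE;
        buf->grp_kbuf_ack = my_drv_buf_ack;

        /* the leader uring_cmd lends the buffer to its group members */
        return io_uring_cmd_provide_kbuf(cmd, buf);
    }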
>
> > Signed-off-by: Ming Lei <[email protected]>
> > ---
> > include/linux/io_uring_types.h | 33 +++++++++++++++++++
> > io_uring/io_uring.c | 10 +++++-
> > io_uring/io_uring.h | 10 ++++++
> > io_uring/kbuf.c | 60 ++++++++++++++++++++++++++++++++++
> > io_uring/kbuf.h | 13 ++++++++
> > io_uring/net.c | 23 ++++++++++++-
> > io_uring/opdef.c | 4 +++
> > io_uring/opdef.h | 2 ++
> > io_uring/rw.c | 20 +++++++++++-
> > 9 files changed, 172 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > index 793d5a26d9b8..445e5507565a 100644
> > --- a/include/linux/io_uring_types.h
> > +++ b/include/linux/io_uring_types.h
> > @@ -6,6 +6,7 @@
> > #include <linux/task_work.h>
> > #include <linux/bitmap.h>
> > #include <linux/llist.h>
> > +#include <linux/bvec.h>
> > #include <uapi/linux/io_uring.h>
> > enum {
> > @@ -39,6 +40,26 @@ enum io_uring_cmd_flags {
> > IO_URING_F_COMPAT = (1 << 12),
> > };
> > +struct io_uring_kernel_buf;
> > +typedef void (io_uring_buf_giveback_t) (const struct io_uring_kernel_buf *);
> > +
> > +/* buffer provided from kernel */
> > +struct io_uring_kernel_buf {
> > + unsigned long len;
> > + unsigned short nr_bvecs;
> > + unsigned char dir; /* ITER_SOURCE or ITER_DEST */
> > +
> > + /* offset in the 1st bvec */
> > + unsigned int offset;
> > + const struct bio_vec *bvec;
> > +
> > + /* called when we are done with this buffer */
> > + io_uring_buf_giveback_t *grp_kbuf_ack;
> > +
> > + /* private field, user don't touch it */
> > + struct bio_vec __bvec[];
> > +};
> > +
> > struct io_wq_work_node {
> > struct io_wq_work_node *next;
> > };
> > @@ -473,6 +494,7 @@ enum {
> > REQ_F_BUFFERS_COMMIT_BIT,
> > REQ_F_SQE_GROUP_LEADER_BIT,
> > REQ_F_SQE_GROUP_DEP_BIT,
> > + REQ_F_GROUP_KBUF_BIT,
> > /* not a real bit, just to check we're not overflowing the space */
> > __REQ_F_LAST_BIT,
> > @@ -557,6 +579,8 @@ enum {
> > REQ_F_SQE_GROUP_LEADER = IO_REQ_FLAG(REQ_F_SQE_GROUP_LEADER_BIT),
> > /* sqe group with members depending on leader */
> > REQ_F_SQE_GROUP_DEP = IO_REQ_FLAG(REQ_F_SQE_GROUP_DEP_BIT),
> > + /* group lead provides kbuf for members, set for both lead and member */
> > + REQ_F_GROUP_KBUF = IO_REQ_FLAG(REQ_F_GROUP_KBUF_BIT),
>
> We have a huge flag problem here. It's a 4th group flag, that gives
> me an idea that it's overabused. We're adding state machines based on
> them "set group, clear group, but if last set it again.
I have explained that it is just for reusing the SQE_GROUP flag to mark
the last member.
And now REQ_F_SQE_GROUP_DEP_BIT can be killed.
> And clear
> group lead if refs are of particular value".
It is actually dead code left over from supporting concurrent leader &
members, so it will be removed too.
> And it's not really
> clear what these two flags are here for or what they do.
>
> From what I see you need here just one flag to mark requests
> that provide a buffer, ala REQ_F_PROVIDING_KBUF. On the import
> side:
>
> if ((req->flags & GROUP) && (req->lead->flags & REQ_F_PROVIDING_KBUF))
> ...
>
> And when you kill the request:
>
> if (req->flags & REQ_F_PROVIDING_KBUF)
> io_group_kbuf_drop();
Yeah, this way works; using REQ_F_GROUP_KBUF here just removes the extra
indirect ->lead->flags check. I am fine with switching to this way by
adding one helper, io_use_group_provided_buf(), to cover the check.
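Something like the following (rough sketch, reusing your REQ_F_PROVIDING_KBUF
name and assuming ->grp_leader is valid by the time members are issued):

    static inline bool io_use_group_provided_buf(struct io_kiocb *req)
    {
        return (req->flags & REQ_F_SQE_GROUP) && !req_is_group_leader(req) &&
            (req->grp_leader->flags & REQ_F_PROVIDING_KBUF);
    }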
>
> And I don't think you need opdef::accept_group_kbuf since the
> request handler should already know that and, importantly, you don't
> imbue any semantics based on it.
Yeah, it just follows the logic of buffer_select. I guess def->buffer_select
may be removed too?
>
> FWIW, it would be nice if during init we could verify that the leader
> provides a buffer IFF there is someone consuming it, but I don't think
It isn't doable, for the same reason as IORING_OP_PROVIDE_BUFFERS: the
buffer can only be provided in ->issue().
> the semantics is flexible enough to do it sanely. i.e. there are many
> members in a group, some might want to use the buffer and some might not.
>
>
> > typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
> > @@ -640,6 +664,15 @@ struct io_kiocb {
> > * REQ_F_BUFFER_RING is set.
> > */
> > struct io_buffer_list *buf_list;
> > +
> > + /*
> > + * store kernel buffer provided by sqe group lead, valid
> > + * IFF REQ_F_GROUP_KBUF
> > + *
> > + * The buffer meta is immutable since it is shared by
> > + * all member requests
> > + */
> > + const struct io_uring_kernel_buf *grp_kbuf;
> > };
> > union {
> > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > index 99b44b6babd6..80c4d9192657 100644
> > --- a/io_uring/io_uring.c
> > +++ b/io_uring/io_uring.c
> > @@ -116,7 +116,7 @@
> > #define IO_REQ_CLEAN_FLAGS (REQ_F_BUFFER_SELECTED | REQ_F_NEED_CLEANUP | \
> > REQ_F_POLLED | REQ_F_INFLIGHT | REQ_F_CREDS | \
> > - REQ_F_ASYNC_DATA)
> > + REQ_F_ASYNC_DATA | REQ_F_GROUP_KBUF)
> > #define IO_REQ_CLEAN_SLOW_FLAGS (REQ_F_REFCOUNT | REQ_F_LINK | REQ_F_HARDLINK |\
> > REQ_F_SQE_GROUP | REQ_F_SQE_GROUP_LEADER | \
> > @@ -387,6 +387,11 @@ static bool req_need_defer(struct io_kiocb *req, u32 seq)
> > static void io_clean_op(struct io_kiocb *req)
> > {
> > + /* GROUP_KBUF is only available for REQ_F_SQE_GROUP_DEP */
> > + if ((req->flags & (REQ_F_GROUP_KBUF | REQ_F_SQE_GROUP_DEP)) ==
> > + (REQ_F_GROUP_KBUF | REQ_F_SQE_GROUP_DEP))
> > + io_group_kbuf_drop(req);
> > +
> > if (req->flags & REQ_F_BUFFER_SELECTED) {
> > spin_lock(&req->ctx->completion_lock);
> > io_kbuf_drop(req);
> > @@ -914,9 +919,12 @@ static void io_queue_group_members(struct io_kiocb *req)
> > req->grp_link = NULL;
> > while (member) {
> > + const struct io_issue_def *def = &io_issue_defs[member->opcode];
> > struct io_kiocb *next = member->grp_link;
> > member->grp_leader = req;
> > + if ((req->flags & REQ_F_GROUP_KBUF) && def->accept_group_kbuf)
> > + member->flags |= REQ_F_GROUP_KBUF;
>
> As per above I suspect that is not needed.
>
> > if (unlikely(member->flags & REQ_F_FAIL)) {
> > io_req_task_queue_fail(member, member->cqe.res);
> > } else if (unlikely(req->flags & REQ_F_FAIL)) {
> > diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> > index df2be7353414..8e111d24c02d 100644
> > --- a/io_uring/io_uring.h
> > +++ b/io_uring/io_uring.h
> > @@ -349,6 +349,16 @@ static inline bool req_is_group_leader(struct io_kiocb *req)
> > return req->flags & REQ_F_SQE_GROUP_LEADER;
> > }
> ...
> > +int io_import_group_kbuf(struct io_kiocb *req, unsigned long buf_off,
> > + unsigned int len, int dir, struct iov_iter *iter)
> > +{
> > + struct io_kiocb *lead = req->grp_link;
> > + const struct io_uring_kernel_buf *kbuf;
> > + unsigned long offset;
> > +
> > + WARN_ON_ONCE(!(req->flags & REQ_F_GROUP_KBUF));
> > +
> > + if (!req_is_group_member(req))
> > + return -EINVAL;
> > +
> > + if (!lead || !req_support_group_dep(lead) || !lead->grp_kbuf)
> > + return -EINVAL;
> > +
> > + /* req->fused_cmd_kbuf is immutable */
> > + kbuf = lead->grp_kbuf;
> > + offset = kbuf->offset;
> > +
> > + if (!kbuf->bvec)
> > + return -EINVAL;
>
> How can this happen?
OK, we can run the check in uring_cmd API.
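i.e. something like this at the start of io_uring_cmd_provide_kbuf()
(untested):

    if (unlikely(!grp_kbuf->bvec || !grp_kbuf->nr_bvecs ||
                 !grp_kbuf->grp_kbuf_ack))
        return -EINVAL;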
>
> > + if (dir != kbuf->dir)
> > + return -EINVAL;
> > +
> > + if (unlikely(buf_off > kbuf->len))
> > + return -EFAULT;
> > +
> > + if (unlikely(len > kbuf->len - buf_off))
> > + return -EFAULT;
>
> check_add_overflow() would be more readable
OK.
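e.g. (untested):

    unsigned long end;

    if (unlikely(check_add_overflow(buf_off, (unsigned long)len, &end) ||
                 end > kbuf->len))
        return -EFAULT;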
>
> > +
> > + /* don't use io_import_fixed which doesn't support multipage bvec */
> > + offset += buf_off;
> > + iov_iter_bvec(iter, dir, kbuf->bvec, kbuf->nr_bvecs, offset + len);
> > +
> > + if (offset)
> > + iov_iter_advance(iter, offset);
> > +
> > + return 0;
> > +}
> > diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
> > index 36aadfe5ac00..37d18324e840 100644
> > --- a/io_uring/kbuf.h
> > +++ b/io_uring/kbuf.h
> > @@ -89,6 +89,11 @@ struct io_buffer_list *io_pbuf_get_bl(struct io_ring_ctx *ctx,
> > unsigned long bgid);
> > int io_pbuf_mmap(struct file *file, struct vm_area_struct *vma);
> > +int io_provide_group_kbuf(struct io_kiocb *req,
> > + const struct io_uring_kernel_buf *grp_kbuf);
> > +int io_import_group_kbuf(struct io_kiocb *req, unsigned long buf_off,
> > + unsigned int len, int dir, struct iov_iter *iter);
> > +
> > static inline bool io_kbuf_recycle_ring(struct io_kiocb *req)
> > {
> > /*
> > @@ -220,4 +225,12 @@ static inline unsigned int io_put_kbufs(struct io_kiocb *req, int len,
> > {
> > return __io_put_kbufs(req, len, nbufs, issue_flags);
> > }
> > +
> > +static inline void io_group_kbuf_drop(struct io_kiocb *req)
> > +{
> > + const struct io_uring_kernel_buf *gbuf = req->grp_kbuf;
> > +
> > + if (gbuf && gbuf->grp_kbuf_ack)
>
> How can ->grp_kbuf_ack be missing?
OK, the check can be removed here.
>
> > + gbuf->grp_kbuf_ack(gbuf);
> > +}
> > #endif
> > diff --git a/io_uring/net.c b/io_uring/net.c
> > index f10f5a22d66a..ad24dd5924d2 100644
> > --- a/io_uring/net.c
> > +++ b/io_uring/net.c
> > @@ -89,6 +89,13 @@ struct io_sr_msg {
> > */
> > #define MULTISHOT_MAX_RETRY 32
> > +#define user_ptr_to_u64(x) ( \
> > +{ \
> > + typecheck(void __user *, (x)); \
> > + (u64)(unsigned long)(x); \
> > +} \
> > +)
> > +
> > int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > {
> > struct io_shutdown *shutdown = io_kiocb_to_cmd(req, struct io_shutdown);
> > @@ -375,7 +382,7 @@ static int io_send_setup(struct io_kiocb *req)
> > kmsg->msg.msg_name = &kmsg->addr;
> > kmsg->msg.msg_namelen = sr->addr_len;
> > }
> > - if (!io_do_buffer_select(req)) {
> > + if (!io_do_buffer_select(req) && !(req->flags & REQ_F_GROUP_KBUF)) {
> > ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len,
> > &kmsg->msg.msg_iter);
> > if (unlikely(ret < 0))
> > @@ -593,6 +600,15 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
> > if (issue_flags & IO_URING_F_NONBLOCK)
> > flags |= MSG_DONTWAIT;
> > + if (req->flags & REQ_F_GROUP_KBUF) {
>
> Does anything prevent the request to be marked by both
> GROUP_KBUF and BUFFER_SELECT? In which case we'd set up
> a group kbuf and then go to the io_do_buffer_select()
> overriding all of that
It could be used in this way, and we can fail the member in
io_queue_group_members().
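For example, roughly like this on top of the existing branches in
io_queue_group_members() (just a sketch):

if ((req->flags & REQ_F_GROUP_KBUF) && def->accept_group_kbuf &&
    (member->flags & REQ_F_BUFFER_SELECT)) {
        /* leader provides a kbuf but the member also asks for buffer select */
        io_req_task_queue_fail(member, -EINVAL);
} else if (unlikely(member->flags & REQ_F_FAIL)) {
        io_req_task_queue_fail(member, member->cqe.res);
} else if ...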
>
> > + ret = io_import_group_kbuf(req,
> > + user_ptr_to_u64(sr->buf),
> > + sr->len, ITER_SOURCE,
> > + &kmsg->msg.msg_iter);
> > + if (unlikely(ret))
> > + return ret;
> > + }
> > +
> > retry_bundle:
> > if (io_do_buffer_select(req)) {
> > struct buf_sel_arg arg = {
> > @@ -1154,6 +1170,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
> > goto out_free;
> > }
> > sr->buf = NULL;
> > + } else if (req->flags & REQ_F_GROUP_KBUF) {
>
> What happens if we get a short read/recv?
For a short read/recv, any progress is stored in the iterator; it has
nothing to do with the provided buffer, which is immutable.
One problem for read is reissue, but that can be handled by saving the
iter state after the group buffer is imported; I will fix it in the next version.
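Roughly like this (a sketch against the rw.c hunk; io/rw are the usual
io_async_rw/io_rw locals there):

if (req->flags & REQ_F_GROUP_KBUF) {
        ret = io_import_group_kbuf(req, rw->addr, rw->len,
                                   ITER_DEST, &io->iter);
        if (unlikely(ret))
                return ret;
        /* remember the iter start so a reissue can restore it */
        iov_iter_save_state(&io->iter, &io->iter_state);
}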
For net recv, the buffer offset/len is updated in case of a short recv,
so it works as expected.
Is there any other issue with short read/recv? Can you explain in detail?
Thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-04 15:44 ` Pavel Begunkov
@ 2024-10-06 8:46 ` Ming Lei
2024-10-09 15:14 ` Pavel Begunkov
0 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-10-06 8:46 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block, ming.lei
On Fri, Oct 04, 2024 at 04:44:54PM +0100, Pavel Begunkov wrote:
> On 9/12/24 11:49, Ming Lei wrote:
> > Allow uring command to be group leader for providing kernel buffer,
> > and this way can support generic device zero copy over device buffer.
> >
> > The following patch will use the way to support zero copy for ublk.
> >
> > Signed-off-by: Ming Lei <[email protected]>
> > ---
> > include/linux/io_uring/cmd.h | 7 +++++++
> > include/uapi/linux/io_uring.h | 7 ++++++-
> > io_uring/uring_cmd.c | 28 ++++++++++++++++++++++++++++
> > 3 files changed, 41 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> > index 447fbfd32215..fde3a2ec7d9a 100644
> > --- a/include/linux/io_uring/cmd.h
> > +++ b/include/linux/io_uring/cmd.h
> > @@ -48,6 +48,8 @@ void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
> > void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
> > unsigned int issue_flags);
> > +int io_uring_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
> > + const struct io_uring_kernel_buf *grp_kbuf);
> > #else
> > static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
> > struct iov_iter *iter, void *ioucmd)
> > @@ -67,6 +69,11 @@ static inline void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
> > unsigned int issue_flags)
> > {
> > }
> > +static inline int io_uring_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
> > + const struct io_uring_kernel_buf *grp_kbuf)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > #endif
> > /*
> > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> > index 2af32745ebd3..11985eeac10e 100644
> > --- a/include/uapi/linux/io_uring.h
> > +++ b/include/uapi/linux/io_uring.h
> > @@ -271,9 +271,14 @@ enum io_uring_op {
> > * sqe->uring_cmd_flags top 8bits aren't available for userspace
> > * IORING_URING_CMD_FIXED use registered buffer; pass this flag
> > * along with setting sqe->buf_index.
> > + * IORING_PROVIDE_GROUP_KBUF this command provides group kernel buffer
> > + * for member requests which can retrieve
> > + * any sub-buffer with offset(sqe->addr) and
> > + * len(sqe->len)
>
> Is there a good reason it needs to be a cmd generic flag instead of
> ublk specific?
io_uring requests aren't visible to drivers, so a driver can't know
whether the uring command is a group leader.
Another way is to add a new API, io_uring_cmd_may_provide_buffer(ioucmd),
so the driver can check whether a device buffer can be provided with this
uring_cmd, but I prefer the new uring_cmd flag:
- IORING_PROVIDE_GROUP_KBUF can provide a device buffer in a generic way.
- ->prep() can fail fast in case it isn't a group request
>
> 1. Extra overhead for files / cmds that don't even care about the
> feature.
It is just a check of ioucmd->flags in ->prep(), which is basically zero cost.
>
> 2. As it stands with this patch, the flag is ignored by all other
> cmd implementations, which might be quite confusing as an api,
> especially so since if we don't set that REQ_F_GROUP_KBUF memeber
> requests will silently try to import a buffer the "normal way",
The usage is the same as with buffer select or fixed buffers: the
consumer has to check the flag.
And the same is true for IORING_URING_CMD_FIXED, which is ignored by all
implementations except nvme, :-)
I can understand the concern, but this has existed since uring_cmd was born.
> i.e. interpret sqe->addr or such as the target buffer.
> 3. We can't even put some nice semantics on top since it's
> still cmd specific and not generic to all other io_uring
> requests.
>
> I'd even think that it'd make sense to implement it as a
> new cmd opcode, but that's the business of the file implementing
> it, i.e. ublk.
>
> > */
> > #define IORING_URING_CMD_FIXED (1U << 0)
> > -#define IORING_URING_CMD_MASK IORING_URING_CMD_FIXED
> > +#define IORING_PROVIDE_GROUP_KBUF (1U << 1)
> > +#define IORING_URING_CMD_MASK (IORING_URING_CMD_FIXED | IORING_PROVIDE_GROUP_KBUF)
That would need one new file operation, and we shouldn't work toward
that.
Thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 6/8] io_uring: support providing sqe group buffer
2024-10-04 15:32 ` Pavel Begunkov
2024-10-06 8:20 ` Ming Lei
@ 2024-10-06 9:47 ` Ming Lei
2024-10-09 11:57 ` Pavel Begunkov
1 sibling, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-10-06 9:47 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block
On Fri, Oct 04, 2024 at 04:32:04PM +0100, Pavel Begunkov wrote:
> On 9/12/24 11:49, Ming Lei wrote:
> ...
...
> > @@ -473,6 +494,7 @@ enum {
> > REQ_F_BUFFERS_COMMIT_BIT,
> > REQ_F_SQE_GROUP_LEADER_BIT,
> > REQ_F_SQE_GROUP_DEP_BIT,
> > + REQ_F_GROUP_KBUF_BIT,
> > /* not a real bit, just to check we're not overflowing the space */
> > __REQ_F_LAST_BIT,
> > @@ -557,6 +579,8 @@ enum {
> > REQ_F_SQE_GROUP_LEADER = IO_REQ_FLAG(REQ_F_SQE_GROUP_LEADER_BIT),
> > /* sqe group with members depending on leader */
> > REQ_F_SQE_GROUP_DEP = IO_REQ_FLAG(REQ_F_SQE_GROUP_DEP_BIT),
> > + /* group lead provides kbuf for members, set for both lead and member */
> > + REQ_F_GROUP_KBUF = IO_REQ_FLAG(REQ_F_GROUP_KBUF_BIT),
>
> We have a huge flag problem here. It's a 4th group flag, that gives
> me an idea that it's overabused. We're adding state machines based on
> them "set group, clear group, but if last set it again. And clear
> group lead if refs are of particular value". And it's not really
> clear what these two flags are here for or what they do.
>
> From what I see you need here just one flag to mark requests
> that provide a buffer, ala REQ_F_PROVIDING_KBUF. On the import
> side:
>
> if ((req->flags & GROUP) && (req->lead->flags & REQ_F_PROVIDING_KBUF))
> ...
>
> And when you kill the request:
>
> if (req->flags & REQ_F_PROVIDING_KBUF)
> io_group_kbuf_drop();
REQ_F_PROVIDING_KBUF may be killed too, and the check helper can become:
bool io_use_group_provided_buf(req)
{
return (req->flags & GROUP) && req->lead->grp_buf;
}
Thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 4/8] io_uring: support SQE group
2024-10-06 3:54 ` Ming Lei
@ 2024-10-09 11:53 ` Pavel Begunkov
2024-10-09 12:14 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-10-09 11:53 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, io-uring, linux-block, Kevin Wolf
On 10/6/24 04:54, Ming Lei wrote:
> On Fri, Oct 04, 2024 at 02:12:28PM +0100, Pavel Begunkov wrote:
>> On 9/12/24 11:49, Ming Lei wrote:
>> ...
>>> --- a/io_uring/io_uring.c
>>> +++ b/io_uring/io_uring.c
>>> @@ -111,13 +111,15 @@
>> ...
>>> +static void io_complete_group_member(struct io_kiocb *req)
>>> +{
>>> + struct io_kiocb *lead = get_group_leader(req);
>>> +
>>> + if (WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP) ||
>>> + lead->grp_refs <= 0))
>>> + return;
>>> +
>>> + /* member CQE needs to be posted first */
>>> + if (!(req->flags & REQ_F_CQE_SKIP))
>>> + io_req_commit_cqe(req->ctx, req);
>>> +
>>> + req->flags &= ~REQ_F_SQE_GROUP;
>>
>> I can't say I like this implicit state machine too much,
>> but let's add a comment why we need to clear it. i.e.
>> it seems it wouldn't be needed if not for the
>> mark_last_group_member() below that puts it back to tunnel
>> the leader to io_free_batch_list().
>
> Yeah, the main purpose is for reusing the flag for marking last
> member, will add comment for this usage.
>
>>
>>> +
>>> + /* Set leader as failed in case of any member failed */
>>> + if (unlikely((req->flags & REQ_F_FAIL)))
>>> + req_set_fail(lead);
>>> +
>>> + if (!--lead->grp_refs) {
>>> + mark_last_group_member(req);
>>> + if (!(lead->flags & REQ_F_CQE_SKIP))
>>> + io_req_commit_cqe(lead->ctx, lead);
>>> + } else if (lead->grp_refs == 1 && (lead->flags & REQ_F_SQE_GROUP)) {
>>> + /*
>>> + * The single uncompleted leader will degenerate to plain
>>> + * request, so group leader can be always freed via the
>>> + * last completed member.
>>> + */
>>> + lead->flags &= ~REQ_F_SQE_GROUP_LEADER;
>>
>> What does this try to handle? A group with a leader but no
>> members? If that's the case, io_group_sqe() and io_submit_state_end()
>> just need to fail such groups (and clear REQ_F_SQE_GROUP before
>> that).
>
> The code block allows to issue leader and members concurrently, but
> we have changed to always issue members after leader is completed, so
> the above code can be removed now.
One case to check: what if the user submits just a single request marked
as a group? The concern is that we'd otherwise create a group with a
leader but without members, and when the leader goes through
io_submit_flush_completions for the first time it drops its refs and
starts waiting for members that don't exist to "wake" it. I mentioned
above we should probably just fail it, but it would be nice to have a
test for it if there isn't one already.
Forgot to mention, with the mentioned changes I believe the patch
should be good enough.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 6/8] io_uring: support providing sqe group buffer
2024-10-06 9:47 ` Ming Lei
@ 2024-10-09 11:57 ` Pavel Begunkov
2024-10-09 12:21 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-10-09 11:57 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, io-uring, linux-block
On 10/6/24 10:47, Ming Lei wrote:
> On Fri, Oct 04, 2024 at 04:32:04PM +0100, Pavel Begunkov wrote:
>> On 9/12/24 11:49, Ming Lei wrote:
>> ...
> ...
>>> @@ -473,6 +494,7 @@ enum {
>>> REQ_F_BUFFERS_COMMIT_BIT,
>>> REQ_F_SQE_GROUP_LEADER_BIT,
>>> REQ_F_SQE_GROUP_DEP_BIT,
>>> + REQ_F_GROUP_KBUF_BIT,
>>> /* not a real bit, just to check we're not overflowing the space */
>>> __REQ_F_LAST_BIT,
>>> @@ -557,6 +579,8 @@ enum {
>>> REQ_F_SQE_GROUP_LEADER = IO_REQ_FLAG(REQ_F_SQE_GROUP_LEADER_BIT),
>>> /* sqe group with members depending on leader */
>>> REQ_F_SQE_GROUP_DEP = IO_REQ_FLAG(REQ_F_SQE_GROUP_DEP_BIT),
>>> + /* group lead provides kbuf for members, set for both lead and member */
>>> + REQ_F_GROUP_KBUF = IO_REQ_FLAG(REQ_F_GROUP_KBUF_BIT),
>>
>> We have a huge flag problem here. It's a 4th group flag, that gives
>> me an idea that it's overabused. We're adding state machines based on
>> them "set group, clear group, but if last set it again. And clear
>> group lead if refs are of particular value". And it's not really
>> clear what these two flags are here for or what they do.
>>
>> From what I see you need here just one flag to mark requests
>> that provide a buffer, ala REQ_F_PROVIDING_KBUF. On the import
>> side:
>>
>> if ((req->flags & GROUP) && (req->lead->flags & REQ_F_PROVIDING_KBUF))
>> ...
>>
>> And when you kill the request:
>>
>> if (req->flags & REQ_F_PROVIDING_KBUF)
>> io_group_kbuf_drop();
>
> REQ_F_PROVIDING_KBUF may be killed too, and the check helper can become:
>
> bool io_use_group_provided_buf(req)
> {
> return (req->flags & GROUP) && req->lead->grp_buf;
> }
->grp_kbuf is unionised, so for that to work you need to ensure that
only a buffer providing cmd / request could be a leader of a group,
which doesn't sound right.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 4/8] io_uring: support SQE group
2024-10-09 11:53 ` Pavel Begunkov
@ 2024-10-09 12:14 ` Ming Lei
0 siblings, 0 replies; 47+ messages in thread
From: Ming Lei @ 2024-10-09 12:14 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block, Kevin Wolf
On Wed, Oct 09, 2024 at 12:53:34PM +0100, Pavel Begunkov wrote:
> On 10/6/24 04:54, Ming Lei wrote:
> > On Fri, Oct 04, 2024 at 02:12:28PM +0100, Pavel Begunkov wrote:
> > > On 9/12/24 11:49, Ming Lei wrote:
> > > ...
> > > > --- a/io_uring/io_uring.c
> > > > +++ b/io_uring/io_uring.c
> > > > @@ -111,13 +111,15 @@
> > > ...
> > > > +static void io_complete_group_member(struct io_kiocb *req)
> > > > +{
> > > > + struct io_kiocb *lead = get_group_leader(req);
> > > > +
> > > > + if (WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP) ||
> > > > + lead->grp_refs <= 0))
> > > > + return;
> > > > +
> > > > + /* member CQE needs to be posted first */
> > > > + if (!(req->flags & REQ_F_CQE_SKIP))
> > > > + io_req_commit_cqe(req->ctx, req);
> > > > +
> > > > + req->flags &= ~REQ_F_SQE_GROUP;
> > >
> > > I can't say I like this implicit state machine too much,
> > > but let's add a comment why we need to clear it. i.e.
> > > it seems it wouldn't be needed if not for the
> > > mark_last_group_member() below that puts it back to tunnel
> > > the leader to io_free_batch_list().
> >
> > Yeah, the main purpose is for reusing the flag for marking last
> > member, will add comment for this usage.
> >
> > >
> > > > +
> > > > + /* Set leader as failed in case of any member failed */
> > > > + if (unlikely((req->flags & REQ_F_FAIL)))
> > > > + req_set_fail(lead);
> > > > +
> > > > + if (!--lead->grp_refs) {
> > > > + mark_last_group_member(req);
> > > > + if (!(lead->flags & REQ_F_CQE_SKIP))
> > > > + io_req_commit_cqe(lead->ctx, lead);
> > > > + } else if (lead->grp_refs == 1 && (lead->flags & REQ_F_SQE_GROUP)) {
> > > > + /*
> > > > + * The single uncompleted leader will degenerate to plain
> > > > + * request, so group leader can be always freed via the
> > > > + * last completed member.
> > > > + */
> > > > + lead->flags &= ~REQ_F_SQE_GROUP_LEADER;
> > >
> > > What does this try to handle? A group with a leader but no
> > > members? If that's the case, io_group_sqe() and io_submit_state_end()
> > > just need to fail such groups (and clear REQ_F_SQE_GROUP before
> > > that).
> >
> > The code block allows to issue leader and members concurrently, but
> > we have changed to always issue members after leader is completed, so
> > the above code can be removed now.
>
> One case to check, what if the user submits just a single request marked
> as a group? The concern is that we create a group with a leader but
> without members otherwise, and when the leader goes through
> io_submit_flush_completions for the first time it drops it refs and
> starts waiting for members that don't exist to "wake" it. I mentioned
> above we should probably just fail it, but would be nice to have a
> test for it if not already.
The corner case isn't handled yet, and we can fail it by calling
req_fail_link_node(head, -EINVAL) in io_submit_state_end().
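Something like the following, as a rough sketch only (it assumes the
half-assembled group sits in submit_state the same way a link does; the
'group' field name is illustrative):

static void io_submit_state_end(struct io_ring_ctx *ctx)
{
        struct io_submit_state *state = &ctx->submit_state;

        /* a lone group leader without any member is a misuse */
        if (unlikely(state->group.head)) {
                req_fail_link_node(state->group.head, -EINVAL);
                io_queue_sqe_fallback(state->group.head);
        }
        ...
}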
thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 6/8] io_uring: support providing sqe group buffer
2024-10-09 11:57 ` Pavel Begunkov
@ 2024-10-09 12:21 ` Ming Lei
0 siblings, 0 replies; 47+ messages in thread
From: Ming Lei @ 2024-10-09 12:21 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block
On Wed, Oct 09, 2024 at 12:57:48PM +0100, Pavel Begunkov wrote:
> On 10/6/24 10:47, Ming Lei wrote:
> > On Fri, Oct 04, 2024 at 04:32:04PM +0100, Pavel Begunkov wrote:
> > > On 9/12/24 11:49, Ming Lei wrote:
> > > ...
> > ...
> > > > @@ -473,6 +494,7 @@ enum {
> > > > REQ_F_BUFFERS_COMMIT_BIT,
> > > > REQ_F_SQE_GROUP_LEADER_BIT,
> > > > REQ_F_SQE_GROUP_DEP_BIT,
> > > > + REQ_F_GROUP_KBUF_BIT,
> > > > /* not a real bit, just to check we're not overflowing the space */
> > > > __REQ_F_LAST_BIT,
> > > > @@ -557,6 +579,8 @@ enum {
> > > > REQ_F_SQE_GROUP_LEADER = IO_REQ_FLAG(REQ_F_SQE_GROUP_LEADER_BIT),
> > > > /* sqe group with members depending on leader */
> > > > REQ_F_SQE_GROUP_DEP = IO_REQ_FLAG(REQ_F_SQE_GROUP_DEP_BIT),
> > > > + /* group lead provides kbuf for members, set for both lead and member */
> > > > + REQ_F_GROUP_KBUF = IO_REQ_FLAG(REQ_F_GROUP_KBUF_BIT),
> > >
> > > We have a huge flag problem here. It's a 4th group flag, that gives
> > > me an idea that it's overabused. We're adding state machines based on
> > > them "set group, clear group, but if last set it again. And clear
> > > group lead if refs are of particular value". And it's not really
> > > clear what these two flags are here for or what they do.
> > >
> > > From what I see you need here just one flag to mark requests
> > > that provide a buffer, ala REQ_F_PROVIDING_KBUF. On the import
> > > side:
> > >
> > > if ((req->flags & GROUP) && (req->lead->flags & REQ_F_PROVIDING_KBUF))
> > > ...
> > >
> > > And when you kill the request:
> > >
> > > if (req->flags & REQ_F_PROVIDING_KBUF)
> > > io_group_kbuf_drop();
> >
> > REQ_F_PROVIDING_KBUF may be killed too, and the check helper can become:
> >
> > bool io_use_group_provided_buf(req)
> > {
> > return (req->flags & GROUP) && req->lead->grp_buf;
> > }
>
> ->grp_kbuf is unionised, so for that to work you need to ensure that
> only a buffer providing cmd / request could be a leader of a group,
> which doesn't sound right.
Yes, both 'req->lead->flags & REQ_F_PROVIDING_KBUF' and 'req->lead->grp_buf'
may not work because the helper may be called in ->prep(), when req->lead
isn't set up yet.
Another idea is to reuse one of the three unused flags (LINK, HARDLINK and
DRAIN) of members for marking GROUP_KBUF; then it is aligned with
BUFFER_SELECT and the implementation can be cleaner. What do you think of
this approach?
Thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 6/8] io_uring: support providing sqe group buffer
2024-10-06 8:20 ` Ming Lei
@ 2024-10-09 14:25 ` Pavel Begunkov
2024-10-10 3:00 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-10-09 14:25 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, io-uring, linux-block
On 10/6/24 09:20, Ming Lei wrote:
> On Fri, Oct 04, 2024 at 04:32:04PM +0100, Pavel Begunkov wrote:
>> On 9/12/24 11:49, Ming Lei wrote:
>> ...
>>> It can help to implement generic zero copy between device and related
>>> operations, such as ublk, fuse, vdpa,
>>> even network receive or whatever.
>>
>> That's exactly the thing it can't sanely work with because
>> of this design.
>
> The provide buffer design is absolutely generic, and basically
>
> - group leader provides buffer for member OPs, and member OPs can borrow
> the buffer if leader allows by calling io_provide_group_kbuf()
>
> - after member OPs consumes the buffer, the buffer is returned back by
> the callback implemented in group leader subsystem, so group leader can
> release related sources;
>
> - and it is guaranteed that the buffer can be released always
>
> The ublk implementation is pretty simple, it can be reused in device driver
> to share buffer with other kernel subsystems.
>
> I don't see anything insane with the design.
There is nothing insane about the design, but the problem is cross
request error handling, the same thing that makes links a pain to use.
It's good that with storage, reads are reasonably idempotent and can
be retried if needed. With sockets and streams, however, you
can't sanely borrow a buffer without consuming it, so if a member
request processing the buffer fails for any reason, the user data
will be dropped on the floor. I mentioned quite a while ago that
if, for example, you stash the buffer somewhere you can access
across syscalls, like io_uring's registered buffer table, the
user would at least be able to find the error and then memcpy the
unprocessed data as a fallback.
>>> Signed-off-by: Ming Lei <[email protected]>
>>> ---
>>> include/linux/io_uring_types.h | 33 +++++++++++++++++++
>>> io_uring/io_uring.c | 10 +++++-
>>> io_uring/io_uring.h | 10 ++++++
>>> io_uring/kbuf.c | 60 ++++++++++++++++++++++++++++++++++
>>> io_uring/kbuf.h | 13 ++++++++
>>> io_uring/net.c | 23 ++++++++++++-
>>> io_uring/opdef.c | 4 +++
>>> io_uring/opdef.h | 2 ++
>>> io_uring/rw.c | 20 +++++++++++-
>>> 9 files changed, 172 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>> index 793d5a26d9b8..445e5507565a 100644
>>> --- a/include/linux/io_uring_types.h
>>> +++ b/include/linux/io_uring_types.h
...
>>
>> And I don't think you need opdef::accept_group_kbuf since the
>> request handler should already know that and, importantly, you don't
>> imbue any semantics based on it.
>
> Yeah, and it just follows logic of buffer_select. I guess def->buffer_select
> may be removed too?
->buffer_select helps to fail IOSQE_BUFFER_SELECT for requests
that don't support it, so we don't need to add the check every
time we add a new request opcode.
In your case requests just ignore ->accept_group_kbuf /
REQ_F_GROUP_KBUF if they don't expect to use the buffer, so
it's different in several aspects.
fwiw, I don't mind ->accept_group_kbuf, I just don't see
what purpose it serves. It would be nice to have a sturdier uAPI,
where the user sets a flag on each member that wants to use
these provided buffers and then the kernel checks the leader vs
that flag and fails misconfigurations, but we likely don't
have flags / sqe space for it.
>> FWIW, would be nice if during init figure we can verify that the leader
>> provides a buffer IFF there is someone consuming it, but I don't think
>
> It isn't doable, same reason with IORING_OP_PROVIDE_BUFFERS, since buffer can
> only be provided in ->issue().
In theory we could; in practice it'd be too much of a pain, I agree.
IORING_OP_PROVIDE_BUFFERS is different as you just stash the buffer
in the io_uring instance, and it's used at an unspecified time later
by some request. In this sense the API is explicit: requests that don't
support it but are marked with IOSQE_BUFFER_SELECT will be failed by the
kernel.
>> the semantics is flexible enough to do it sanely. i.e. there are many
>> members in a group, some might want to use the buffer and some might not.
>>
...
>>> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
>>> index df2be7353414..8e111d24c02d 100644
>>> --- a/io_uring/io_uring.h
>>> +++ b/io_uring/io_uring.h
>>> @@ -349,6 +349,16 @@ static inline bool req_is_group_leader(struct io_kiocb *req)
>>> return req->flags & REQ_F_SQE_GROUP_LEADER;
>>> }
>> ...
>>> +int io_import_group_kbuf(struct io_kiocb *req, unsigned long buf_off,
>>> + unsigned int len, int dir, struct iov_iter *iter)
>>> +{
>>> + struct io_kiocb *lead = req->grp_link;
>>> + const struct io_uring_kernel_buf *kbuf;
>>> + unsigned long offset;
>>> +
>>> + WARN_ON_ONCE(!(req->flags & REQ_F_GROUP_KBUF));
>>> +
>>> + if (!req_is_group_member(req))
>>> + return -EINVAL;
>>> +
>>> + if (!lead || !req_support_group_dep(lead) || !lead->grp_kbuf)
>>> + return -EINVAL;
>>> +
>>> + /* req->fused_cmd_kbuf is immutable */
>>> + kbuf = lead->grp_kbuf;
>>> + offset = kbuf->offset;
>>> +
>>> + if (!kbuf->bvec)
>>> + return -EINVAL;
>>
>> How can this happen?
>
> OK, we can run the check in uring_cmd API.
Not sure I follow; if a request providing a buffer can't set
a bvec it should just fail, without exposing a half-made
io_uring_kernel_buf to other requests.
Is it rather a WARN_ON_ONCE check?
>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>> index f10f5a22d66a..ad24dd5924d2 100644
>>> --- a/io_uring/net.c
>>> +++ b/io_uring/net.c
>>> @@ -89,6 +89,13 @@ struct io_sr_msg {
>>> */
>>> #define MULTISHOT_MAX_RETRY 32
>>> +#define user_ptr_to_u64(x) ( \
>>> +{ \
>>> + typecheck(void __user *, (x)); \
>>> + (u64)(unsigned long)(x); \
>>> +} \
>>> +)
>>> +
>>> int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>> {
>>> struct io_shutdown *shutdown = io_kiocb_to_cmd(req, struct io_shutdown);
>>> @@ -375,7 +382,7 @@ static int io_send_setup(struct io_kiocb *req)
>>> kmsg->msg.msg_name = &kmsg->addr;
>>> kmsg->msg.msg_namelen = sr->addr_len;
>>> }
>>> - if (!io_do_buffer_select(req)) {
>>> + if (!io_do_buffer_select(req) && !(req->flags & REQ_F_GROUP_KBUF)) {
>>> ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len,
>>> &kmsg->msg.msg_iter);
>>> if (unlikely(ret < 0))
>>> @@ -593,6 +600,15 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
>>> if (issue_flags & IO_URING_F_NONBLOCK)
>>> flags |= MSG_DONTWAIT;
>>> + if (req->flags & REQ_F_GROUP_KBUF) {
>>
>> Does anything prevent the request to be marked by both
>> GROUP_KBUF and BUFFER_SELECT? In which case we'd set up
>> a group kbuf and then go to the io_do_buffer_select()
>> overriding all of that
>
> It could be used in this way, and we can fail the member in
> io_queue_group_members().
That's where the opdef flag could actually be useful,
if (opdef[member]->accept_group_kbuf &&
member->flags & SELECT_BUF)
fail;
>>> + ret = io_import_group_kbuf(req,
>>> + user_ptr_to_u64(sr->buf),
>>> + sr->len, ITER_SOURCE,
>>> + &kmsg->msg.msg_iter);
>>> + if (unlikely(ret))
>>> + return ret;
>>> + }
>>> +
>>> retry_bundle:
>>> if (io_do_buffer_select(req)) {
>>> struct buf_sel_arg arg = {
>>> @@ -1154,6 +1170,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
>>> goto out_free;
>>> }
>>> sr->buf = NULL;
>>> + } else if (req->flags & REQ_F_GROUP_KBUF) {
>>
>> What happens if we get a short read/recv?
>
> For short read/recv, any progress is stored in iterator, nothing to do
> with the provide buffer, which is immutable.
>
> One problem for read is reissue, but it can be handled by saving iter
> state after the group buffer is imported, I will fix it in next version.
> For net recv, offset/len of buffer is updated in case of short recv, so
> it works as expected.
That was one of my worries.
> Or any other issue with short read/recv? Can you explain in detail?
To sum up, design-wise: when members that use the buffer as a
source, e.g. write/send, fail, the user is usually expected to reissue
both the write and the ublk cmd.
Let's say you have a ublk leader command providing a 4K buffer, and
you group it with a 4K send using the buffer. Let's assume the send
is short and only does 2K of data. Then the user would normally
reissue:
ublk(4K, GROUP), send(off=2K)
That's fine assuming short IO is rare.
I worry more about the backward flow, ublk provides an "empty" buffer
to receive/read into. ublk wants to do something with the buffer in
the callback. What happens when read/recv is short (and cannot be
retried by io_uring)?
1. ublk(provide empty 4K buffer)
2. recv, ret=2K
3. ->grp_kbuf_ack: ublk should commit back only 2K
of data and not assume it's 4K
Another option is to fail ->grp_kbuf_ack if any member fails, but
the API might be a bit too awkward and inconvenient.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-06 8:46 ` Ming Lei
@ 2024-10-09 15:14 ` Pavel Begunkov
2024-10-10 3:28 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-10-09 15:14 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, io-uring, linux-block
On 10/6/24 09:46, Ming Lei wrote:
> On Fri, Oct 04, 2024 at 04:44:54PM +0100, Pavel Begunkov wrote:
>> On 9/12/24 11:49, Ming Lei wrote:
>>> Allow uring command to be group leader for providing kernel buffer,
>>> and this way can support generic device zero copy over device buffer.
>>>
>>> The following patch will use the way to support zero copy for ublk.
>>>
>>> Signed-off-by: Ming Lei <[email protected]>
>>> ---
>>> include/linux/io_uring/cmd.h | 7 +++++++
>>> include/uapi/linux/io_uring.h | 7 ++++++-
>>> io_uring/uring_cmd.c | 28 ++++++++++++++++++++++++++++
>>> 3 files changed, 41 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
>>> index 447fbfd32215..fde3a2ec7d9a 100644
>>> --- a/include/linux/io_uring/cmd.h
>>> +++ b/include/linux/io_uring/cmd.h
>>> @@ -48,6 +48,8 @@ void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
>>> void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
>>> unsigned int issue_flags);
>>> +int io_uring_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
>>> + const struct io_uring_kernel_buf *grp_kbuf);
>>> #else
>>> static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
>>> struct iov_iter *iter, void *ioucmd)
>>> @@ -67,6 +69,11 @@ static inline void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
>>> unsigned int issue_flags)
>>> {
>>> }
>>> +static inline int io_uring_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
>>> + const struct io_uring_kernel_buf *grp_kbuf)
>>> +{
>>> + return -EOPNOTSUPP;
>>> +}
>>> #endif
>>> /*
>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>> index 2af32745ebd3..11985eeac10e 100644
>>> --- a/include/uapi/linux/io_uring.h
>>> +++ b/include/uapi/linux/io_uring.h
>>> @@ -271,9 +271,14 @@ enum io_uring_op {
>>> * sqe->uring_cmd_flags top 8bits aren't available for userspace
>>> * IORING_URING_CMD_FIXED use registered buffer; pass this flag
>>> * along with setting sqe->buf_index.
>>> + * IORING_PROVIDE_GROUP_KBUF this command provides group kernel buffer
>>> + * for member requests which can retrieve
>>> + * any sub-buffer with offset(sqe->addr) and
>>> + * len(sqe->len)
>>
>> Is there a good reason it needs to be a cmd generic flag instead of
>> ublk specific?
>
> io_uring request isn't visible for drivers, so driver can't know if the
> uring command is one group leader.
btw, does it have to be in a group at all? Sure, nobody would be
able to consume the buffer, but otherwise should be fine.
> Another way is to add new API of io_uring_cmd_may_provide_buffer(ioucmd)
The checks can be done inside of io_uring_cmd_provide_kbuf()
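e.g. a rough sketch, reusing the helpers from this series:

int io_uring_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
                              const struct io_uring_kernel_buf *grp_kbuf)
{
        struct io_kiocb *req = cmd_to_io_kiocb(ioucmd);

        if (!req_is_group_leader(req))
                return -EINVAL;
        return io_provide_group_kbuf(req, grp_kbuf);
}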
> so driver can check if device buffer can be provided with this uring_cmd,
> but I prefer to the new uring_cmd flag:
>
> - IORING_PROVIDE_GROUP_KBUF can provide device buffer in generic way.
Ok, could be.
> - ->prep() can fail fast in case that it isn't one group request
I don't believe that matters, a behaving user should never
see that kind of failure.
>> 1. Extra overhead for files / cmds that don't even care about the
>> feature.
>
> It is just checking ioucmd->flags in ->prep(), and basically zero cost.
It's not if we add extra code for each and every feature, at
which point it becomes a maze of such "ifs".
>> 2. As it stands with this patch, the flag is ignored by all other
>> cmd implementations, which might be quite confusing as an api,
>> especially so since if we don't set that REQ_F_GROUP_KBUF memeber
>> requests will silently try to import a buffer the "normal way",
>
> The usage is same with buffer select or fixed buffer, and consumer
> has to check the flag.
We fail requests when they're asked to use a feature that they
don't support, at least for non-cmd requests.
> And same with IORING_URING_CMD_FIXED which is ignored by other
> implementations except for nvme, :-)
Oh, that's bad. If you tried to implement the flag in the
future it might break the uapi. It might be worth patching it
up on the ublk side, i.e. reject the flag, + backport, and hope
nobody has tried to use them together, hmm?
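i.e. something as small as this early in ublk's ->uring_cmd() handler
(illustrative only):

if (ioucmd->flags & IORING_URING_CMD_FIXED)
        return -EOPNOTSUPP;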
> I can understand the concern, but it exits since uring cmd is born.
>
>> i.e. interpret sqe->addr or such as the target buffer.
>
>> 3. We can't even put some nice semantics on top since it's
>> still cmd specific and not generic to all other io_uring
>> requests.
>>
>> I'd even think that it'd make sense to implement it as a
>> new cmd opcode, but that's the business of the file implementing
>> it, i.e. ublk.
>>
>>> */
>>> #define IORING_URING_CMD_FIXED (1U << 0)
>>> -#define IORING_URING_CMD_MASK IORING_URING_CMD_FIXED
>>> +#define IORING_PROVIDE_GROUP_KBUF (1U << 1)
>>> +#define IORING_URING_CMD_MASK (IORING_URING_CMD_FIXED | IORING_PROVIDE_GROUP_KBUF)
>
> It needs one new file operation, and we shouldn't work toward
> this way.
Not a new io_uring request, I rather meant sqe->cmd_op,
like UBLK_U_IO_FETCH_REQ_PROVIDER_BUFFER.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 6/8] io_uring: support providing sqe group buffer
2024-10-09 14:25 ` Pavel Begunkov
@ 2024-10-10 3:00 ` Ming Lei
2024-10-10 18:51 ` Pavel Begunkov
0 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-10-10 3:00 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block
On Wed, Oct 09, 2024 at 03:25:25PM +0100, Pavel Begunkov wrote:
> On 10/6/24 09:20, Ming Lei wrote:
> > On Fri, Oct 04, 2024 at 04:32:04PM +0100, Pavel Begunkov wrote:
> > > On 9/12/24 11:49, Ming Lei wrote:
> > > ...
> > > > It can help to implement generic zero copy between device and related
> > > > operations, such as ublk, fuse, vdpa,
> > > > even network receive or whatever.
> > >
> > > That's exactly the thing it can't sanely work with because
> > > of this design.
> >
> > The provide buffer design is absolutely generic, and basically
> >
> > - group leader provides buffer for member OPs, and member OPs can borrow
> > the buffer if leader allows by calling io_provide_group_kbuf()
> >
> > - after member OPs consumes the buffer, the buffer is returned back by
> > the callback implemented in group leader subsystem, so group leader can
> > release related sources;
> >
> > - and it is guaranteed that the buffer can be released always
> >
> > The ublk implementation is pretty simple, it can be reused in device driver
> > to share buffer with other kernel subsystems.
> >
> > I don't see anything insane with the design.
>
> There is nothing insane with the design, but the problem is cross
> request error handling, same thing that makes links a pain to use.
Wrt. links, the whole group is linked into the chain, and it respects
all existing link rules; care to share the trouble in the link use case?
The only thing I can think of is that group-internal links aren't supported
yet, but they may be added in the future if a use case comes up.
> It's good that with storage reads are reasonably idempotent and you
> can be retried if needed. With sockets and streams, however, you
> can't sanely borrow a buffer without consuming it, so if a member
> request processing the buffer fails for any reason, the user data
> will be dropped on the floor. I mentioned quite a while before,
> if for example you stash the buffer somewhere you can access
> across syscalls like the io_uring's registered buffer table, the
> user at least would be able to find an error and then memcpy the
> unprocessed data as a fallback.
I guess that is the net rx case, which requires the buffer to cross
syscalls; then the buffer has to be owned by userspace, otherwise the
buffer can be leaked easily.
That may not match sqe group, which is supposed to borrow a kernel
buffer that is consumed by users.
>
> > > > Signed-off-by: Ming Lei <[email protected]>
> > > > ---
> > > > include/linux/io_uring_types.h | 33 +++++++++++++++++++
> > > > io_uring/io_uring.c | 10 +++++-
> > > > io_uring/io_uring.h | 10 ++++++
> > > > io_uring/kbuf.c | 60 ++++++++++++++++++++++++++++++++++
> > > > io_uring/kbuf.h | 13 ++++++++
> > > > io_uring/net.c | 23 ++++++++++++-
> > > > io_uring/opdef.c | 4 +++
> > > > io_uring/opdef.h | 2 ++
> > > > io_uring/rw.c | 20 +++++++++++-
> > > > 9 files changed, 172 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > > > index 793d5a26d9b8..445e5507565a 100644
> > > > --- a/include/linux/io_uring_types.h
> > > > +++ b/include/linux/io_uring_types.h
> ...
> > >
> > > And I don't think you need opdef::accept_group_kbuf since the
> > > request handler should already know that and, importantly, you don't
> > > imbue any semantics based on it.
> >
> > Yeah, and it just follows logic of buffer_select. I guess def->buffer_select
> > may be removed too?
>
> ->buffer_select helps to fail IOSQE_BUFFER_SELECT for requests
> that don't support it, so we don't need to add the check every
> time we add a new request opcode.
>
> In your case requests just ignore ->accept_group_kbuf /
> REQ_F_GROUP_KBUF if they don't expect to use the buffer, so
> it's different in several aspects.
>
> fwiw, I don't mind ->accept_group_kbuf, I just don't see
> what purpose it serves. Would be nice to have a sturdier uAPI,
> where the user sets a flag to each member that want to use
> these provided buffers and then the kernel checks leader vs
> that flag and fails misconfigurations, but likely we don't
> have flags / sqe space for it.
As I replied in the previous email, members have three flags available;
we can map one of them to REQ_F_GROUP_KBUF.
>
>
> > > FWIW, would be nice if during init figure we can verify that the leader
> > > provides a buffer IFF there is someone consuming it, but I don't think
> >
> > It isn't doable, same reason with IORING_OP_PROVIDE_BUFFERS, since buffer can
> > only be provided in ->issue().
>
> In theory we could, in practise it'd be too much of a pain, I agree.
>
> IORING_OP_PROVIDE_BUFFERS is different as you just stash the buffer
> in the io_uring instance, and it's used at an unspecified time later
> by some request. In this sense the API is explicit, requests that don't
> support it but marked with IOSQE_BUFFER_SELECT will be failed by the
> kernel.
That is also one reason why I added ->accept_group_kbuf.
>
> > > the semantics is flexible enough to do it sanely. i.e. there are many
> > > members in a group, some might want to use the buffer and some might not.
> > >
> ...
> > > > diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> > > > index df2be7353414..8e111d24c02d 100644
> > > > --- a/io_uring/io_uring.h
> > > > +++ b/io_uring/io_uring.h
> > > > @@ -349,6 +349,16 @@ static inline bool req_is_group_leader(struct io_kiocb *req)
> > > > return req->flags & REQ_F_SQE_GROUP_LEADER;
> > > > }
> > > ...
> > > > +int io_import_group_kbuf(struct io_kiocb *req, unsigned long buf_off,
> > > > + unsigned int len, int dir, struct iov_iter *iter)
> > > > +{
> > > > + struct io_kiocb *lead = req->grp_link;
> > > > + const struct io_uring_kernel_buf *kbuf;
> > > > + unsigned long offset;
> > > > +
> > > > + WARN_ON_ONCE(!(req->flags & REQ_F_GROUP_KBUF));
> > > > +
> > > > + if (!req_is_group_member(req))
> > > > + return -EINVAL;
> > > > +
> > > > + if (!lead || !req_support_group_dep(lead) || !lead->grp_kbuf)
> > > > + return -EINVAL;
> > > > +
> > > > + /* req->fused_cmd_kbuf is immutable */
> > > > + kbuf = lead->grp_kbuf;
> > > > + offset = kbuf->offset;
> > > > +
> > > > + if (!kbuf->bvec)
> > > > + return -EINVAL;
> > >
> > > How can this happen?
> >
> > OK, we can run the check in uring_cmd API.
>
> Not sure I follow, if a request providing a buffer can't set
> a bvec it should just fail, without exposing half made
> io_uring_kernel_buf to other requests.
>
> Is it rather a WARN_ON_ONCE check?
I meant we can check it in the io_provide_group_kbuf() API since the group
buffer is filled by the driver; from then on the buffer is immutable, and
we don't need any other check.
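i.e. validate once at provide time, roughly (sketch, field names as in
this series):

int io_provide_group_kbuf(struct io_kiocb *req,
                          const struct io_uring_kernel_buf *grp_kbuf)
{
        if (WARN_ON_ONCE(!grp_kbuf->bvec || !grp_kbuf->nr_bvecs ||
                         !grp_kbuf->grp_kbuf_ack))
                return -EINVAL;
        ...
}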
>
>
> > > > diff --git a/io_uring/net.c b/io_uring/net.c
> > > > index f10f5a22d66a..ad24dd5924d2 100644
> > > > --- a/io_uring/net.c
> > > > +++ b/io_uring/net.c
> > > > @@ -89,6 +89,13 @@ struct io_sr_msg {
> > > > */
> > > > #define MULTISHOT_MAX_RETRY 32
> > > > +#define user_ptr_to_u64(x) ( \
> > > > +{ \
> > > > + typecheck(void __user *, (x)); \
> > > > + (u64)(unsigned long)(x); \
> > > > +} \
> > > > +)
> > > > +
> > > > int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > > > {
> > > > struct io_shutdown *shutdown = io_kiocb_to_cmd(req, struct io_shutdown);
> > > > @@ -375,7 +382,7 @@ static int io_send_setup(struct io_kiocb *req)
> > > > kmsg->msg.msg_name = &kmsg->addr;
> > > > kmsg->msg.msg_namelen = sr->addr_len;
> > > > }
> > > > - if (!io_do_buffer_select(req)) {
> > > > + if (!io_do_buffer_select(req) && !(req->flags & REQ_F_GROUP_KBUF)) {
> > > > ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len,
> > > > &kmsg->msg.msg_iter);
> > > > if (unlikely(ret < 0))
> > > > @@ -593,6 +600,15 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
> > > > if (issue_flags & IO_URING_F_NONBLOCK)
> > > > flags |= MSG_DONTWAIT;
> > > > + if (req->flags & REQ_F_GROUP_KBUF) {
> > >
> > > Does anything prevent the request to be marked by both
> > > GROUP_KBUF and BUFFER_SELECT? In which case we'd set up
> > > a group kbuf and then go to the io_do_buffer_select()
> > > overriding all of that
> >
> > It could be used in this way, and we can fail the member in
> > io_queue_group_members().
>
> That's where the opdef flag could actually be useful,
>
> if (opdef[member]->accept_group_kbuf &&
> member->flags & SELECT_BUF)
> fail;
>
>
> > > > + ret = io_import_group_kbuf(req,
> > > > + user_ptr_to_u64(sr->buf),
> > > > + sr->len, ITER_SOURCE,
> > > > + &kmsg->msg.msg_iter);
> > > > + if (unlikely(ret))
> > > > + return ret;
> > > > + }
> > > > +
> > > > retry_bundle:
> > > > if (io_do_buffer_select(req)) {
> > > > struct buf_sel_arg arg = {
> > > > @@ -1154,6 +1170,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
> > > > goto out_free;
> > > > }
> > > > sr->buf = NULL;
> > > > + } else if (req->flags & REQ_F_GROUP_KBUF) {
> > >
> > > What happens if we get a short read/recv?
> >
> > For short read/recv, any progress is stored in iterator, nothing to do
> > with the provide buffer, which is immutable.
> >
> > One problem for read is reissue, but it can be handled by saving iter
> > state after the group buffer is imported, I will fix it in next version.
> > For net recv, offset/len of buffer is updated in case of short recv, so
> > it works as expected.
>
> That was one of my worries.
>
> > Or any other issue with short read/recv? Can you explain in detail?
>
> To sum up design wise, when members that are using the buffer as a
> source, e.g. write/send, fail, the user is expected to usually reissue
> both the write and the ublk cmd.
>
> Let's say you have a ublk leader command providing a 4K buffer, and
> you group it with a 4K send using the buffer. Let's assume the send
> is short and does't only 2K of data. Then the user would normally
> reissue:
>
> ublk(4K, GROUP), send(off=2K)
>
> That's fine assuming short IO is rare.
>
> I worry more about the backward flow, ublk provides an "empty" buffer
> to receive/read into. ublk wants to do something with the buffer in
> the callback. What happens when read/recv is short (and cannot be
> retried by io_uring)?
>
> 1. ublk(provide empty 4K buffer)
> 2. recv, ret=2K
> 3. ->grp_kbuf_ack: ublk should commit back only 2K
> of data and not assume it's 4K
->grp_kbuf_ack is only supposed to return the buffer back to the
owner; it doesn't care about the result of buffer consumption.
When ->grp_kbuf_ack() is done, it means this borrow of the buffer is
over.
When userspace figures out it is a short read, it will send a ublk
uring_cmd to notify that this io command is completed with
result(2k). The ublk driver may decide to requeue this io command to
retry the remaining bytes, and then only the remaining part of the
buffer is allowed to be borrowed in the following provide uring command
originated from userspace.
For the ublk use case, so far so good.
>
> Another option is to fail ->grp_kbuf_ack if any member fails, but
> the API might be a bit too awkward and inconvenient .
We don't need ->grp_kbuf_ack() to cover buffer consumption.
Thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-09 15:14 ` Pavel Begunkov
@ 2024-10-10 3:28 ` Ming Lei
2024-10-10 15:48 ` Pavel Begunkov
0 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-10-10 3:28 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block
On Wed, Oct 09, 2024 at 04:14:33PM +0100, Pavel Begunkov wrote:
> On 10/6/24 09:46, Ming Lei wrote:
> > On Fri, Oct 04, 2024 at 04:44:54PM +0100, Pavel Begunkov wrote:
> > > On 9/12/24 11:49, Ming Lei wrote:
> > > > Allow uring command to be group leader for providing kernel buffer,
> > > > and this way can support generic device zero copy over device buffer.
> > > >
> > > > The following patch will use the way to support zero copy for ublk.
> > > >
> > > > Signed-off-by: Ming Lei <[email protected]>
> > > > ---
> > > > include/linux/io_uring/cmd.h | 7 +++++++
> > > > include/uapi/linux/io_uring.h | 7 ++++++-
> > > > io_uring/uring_cmd.c | 28 ++++++++++++++++++++++++++++
> > > > 3 files changed, 41 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
> > > > index 447fbfd32215..fde3a2ec7d9a 100644
> > > > --- a/include/linux/io_uring/cmd.h
> > > > +++ b/include/linux/io_uring/cmd.h
> > > > @@ -48,6 +48,8 @@ void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
> > > > void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
> > > > unsigned int issue_flags);
> > > > +int io_uring_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
> > > > + const struct io_uring_kernel_buf *grp_kbuf);
> > > > #else
> > > > static inline int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
> > > > struct iov_iter *iter, void *ioucmd)
> > > > @@ -67,6 +69,11 @@ static inline void io_uring_cmd_mark_cancelable(struct io_uring_cmd *cmd,
> > > > unsigned int issue_flags)
> > > > {
> > > > }
> > > > +static inline int io_uring_cmd_provide_kbuf(struct io_uring_cmd *ioucmd,
> > > > + const struct io_uring_kernel_buf *grp_kbuf)
> > > > +{
> > > > + return -EOPNOTSUPP;
> > > > +}
> > > > #endif
> > > > /*
> > > > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> > > > index 2af32745ebd3..11985eeac10e 100644
> > > > --- a/include/uapi/linux/io_uring.h
> > > > +++ b/include/uapi/linux/io_uring.h
> > > > @@ -271,9 +271,14 @@ enum io_uring_op {
> > > > * sqe->uring_cmd_flags top 8bits aren't available for userspace
> > > > * IORING_URING_CMD_FIXED use registered buffer; pass this flag
> > > > * along with setting sqe->buf_index.
> > > > + * IORING_PROVIDE_GROUP_KBUF this command provides group kernel buffer
> > > > + * for member requests which can retrieve
> > > > + * any sub-buffer with offset(sqe->addr) and
> > > > + * len(sqe->len)
> > >
> > > Is there a good reason it needs to be a cmd generic flag instead of
> > > ublk specific?
> >
> > io_uring request isn't visible for drivers, so driver can't know if the
> > uring command is one group leader.
>
> btw, does it have to be in a group at all? Sure, nobody would be
> able to consume the buffer, but otherwise should be fine.
>
> > Another way is to add new API of io_uring_cmd_may_provide_buffer(ioucmd)
>
> The checks can be done inside of io_uring_cmd_provide_kbuf()
Yeah.
Now the difference is just in how the driver knows that this uring_cmd
is a group leader: the user states it explicitly (UAPI flag) or
implicitly (via the driver's ->cmd_op).
I am fine with either way.
>
> > so driver can check if device buffer can be provided with this uring_cmd,
> > but I prefer to the new uring_cmd flag:
> >
> > - IORING_PROVIDE_GROUP_KBUF can provide device buffer in generic way.
>
> Ok, could be.
>
> > - ->prep() can fail fast in case that it isn't one group request
>
> I don't believe that matters, a behaving user should never
> see that kind of failure.
>
>
> > > 1. Extra overhead for files / cmds that don't even care about the
> > > feature.
> >
> > It is just checking ioucmd->flags in ->prep(), and basically zero cost.
>
> It's not if we add extra code for each every feature, at
> which point it becomes a maze of such "ifs".
Yeah, I guess it can't be avoided in the current uring_cmd design, which
serves different subsystems now, and more in the future.
And the situation is similar to ioctl.
>
> > > 2. As it stands with this patch, the flag is ignored by all other
> > > cmd implementations, which might be quite confusing as an api,
> > > especially so since if we don't set that REQ_F_GROUP_KBUF memeber
> > > requests will silently try to import a buffer the "normal way",
> >
> > The usage is same with buffer select or fixed buffer, and consumer
> > has to check the flag.
>
> We fails requests when it's asked to use the feature but
> those are not supported, at least non-cmd requests.
>
> > And same with IORING_URING_CMD_FIXED which is ignored by other
> > implementations except for nvme, :-)
>
> Oh, that's bad. If you'd try to implement the flag in the
> future it might break the uapi. It might be worth to patch it
> up on the ublk side, i.e. reject the flag, + backport, and hope
> nobody tried to use them together, hmm?
>
> > I can understand the concern, but it exits since uring cmd is born.
> >
> > > i.e. interpret sqe->addr or such as the target buffer.
> >
> > > 3. We can't even put some nice semantics on top since it's
> > > still cmd specific and not generic to all other io_uring
> > > requests.
> > >
> > > I'd even think that it'd make sense to implement it as a
> > > new cmd opcode, but that's the business of the file implementing
> > > it, i.e. ublk.
> > >
> > > > */
> > > > #define IORING_URING_CMD_FIXED (1U << 0)
> > > > -#define IORING_URING_CMD_MASK IORING_URING_CMD_FIXED
> > > > +#define IORING_PROVIDE_GROUP_KBUF (1U << 1)
> > > > +#define IORING_URING_CMD_MASK (IORING_URING_CMD_FIXED | IORING_PROVIDE_GROUP_KBUF)
> >
> > It needs one new file operation, and we shouldn't work toward
> > this way.
>
> Not a new io_uring request, I rather meant sqe->cmd_op,
> like UBLK_U_IO_FETCH_REQ_PROVIDER_BUFFER.
`cmd_op` is supposed to be defined by subsystems, but maybe we can
reserve some for generic uring_cmd. Anyway, this shouldn't be a big
deal; we can do that in the future if there are more such uses.
Thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-10 3:28 ` Ming Lei
@ 2024-10-10 15:48 ` Pavel Begunkov
2024-10-10 19:31 ` Jens Axboe
0 siblings, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-10-10 15:48 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, io-uring, linux-block
On 10/10/24 04:28, Ming Lei wrote:
> On Wed, Oct 09, 2024 at 04:14:33PM +0100, Pavel Begunkov wrote:
>> On 10/6/24 09:46, Ming Lei wrote:
>>> On Fri, Oct 04, 2024 at 04:44:54PM +0100, Pavel Begunkov wrote:
>>>> On 9/12/24 11:49, Ming Lei wrote:
...
>>> so driver can check if device buffer can be provided with this uring_cmd,
>>> but I prefer to the new uring_cmd flag:
>>>
>>> - IORING_PROVIDE_GROUP_KBUF can provide device buffer in generic way.
>>
>> Ok, could be.
>>
>>> - ->prep() can fail fast in case that it isn't one group request
>>
>> I don't believe that matters, a behaving user should never
>> see that kind of failure.
>>
>>
>>>> 1. Extra overhead for files / cmds that don't even care about the
>>>> feature.
>>>
>>> It is just checking ioucmd->flags in ->prep(), and basically zero cost.
>>
>> It's not if we add extra code for each every feature, at
>> which point it becomes a maze of such "ifs".
>
> Yeah, I guess it can't be avoided in current uring_cmd design, which
It can't be avoided only if we keep putting every custom / specific
command feature into the common path. And, for example, I
just named how this one could be avoided.
The real question is whether we deem the buffer providing
feature widely applicable enough that it could be useful
to many potential command implementations and is therefore
worth partially handling generically in the common path.
> serves for different subsystems now, and more in future.
>
> And the situation is similar with ioctl.
Well, commands look too much like ioctl for my taste, but even
then I naively hope they can avoid regressing to it.
>>>> 2. As it stands with this patch, the flag is ignored by all other
>>>> cmd implementations, which might be quite confusing as an api,
>>>> especially so since if we don't set that REQ_F_GROUP_KBUF memeber
>>>> requests will silently try to import a buffer the "normal way",
>>>
>>> The usage is same with buffer select or fixed buffer, and consumer
>>> has to check the flag.
>>
>> We fails requests when it's asked to use the feature but
>> those are not supported, at least non-cmd requests.
>>
>>> And same with IORING_URING_CMD_FIXED which is ignored by other
>>> implementations except for nvme, :-)
>>
>> Oh, that's bad. If you'd try to implement the flag in the
>> future it might break the uapi. It might be worth to patch it
>> up on the ublk side, i.e. reject the flag, + backport, and hope
>> nobody tried to use them together, hmm?
>>
>>> I can understand the concern, but it exits since uring cmd is born.
>>>
>>>> i.e. interpret sqe->addr or such as the target buffer.
>>>
>>>> 3. We can't even put some nice semantics on top since it's
>>>> still cmd specific and not generic to all other io_uring
>>>> requests.
>>>>
>>>> I'd even think that it'd make sense to implement it as a
>>>> new cmd opcode, but that's the business of the file implementing
>>>> it, i.e. ublk.
>>>>
>>>>> */
>>>>> #define IORING_URING_CMD_FIXED (1U << 0)
>>>>> -#define IORING_URING_CMD_MASK IORING_URING_CMD_FIXED
>>>>> +#define IORING_PROVIDE_GROUP_KBUF (1U << 1)
>>>>> +#define IORING_URING_CMD_MASK (IORING_URING_CMD_FIXED | IORING_PROVIDE_GROUP_KBUF)
>>>
>>> It needs one new file operation, and we shouldn't work toward
>>> this way.
>>
>> Not a new io_uring request, I rather meant sqe->cmd_op,
>> like UBLK_U_IO_FETCH_REQ_PROVIDER_BUFFER.
>
> `cmd_op` is supposed to be defined by subsystems, but maybe we can
> reserve some for generic uring_cmd. Anyway this shouldn't be one big
> deal, we can do that in future if there are more such uses.
That's if the generic handling is desired, which isn't much
different from a flag; otherwise it can just be a new file-specific
cmd opcode like any other.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 6/8] io_uring: support providing sqe group buffer
2024-10-10 3:00 ` Ming Lei
@ 2024-10-10 18:51 ` Pavel Begunkov
2024-10-11 2:00 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-10-10 18:51 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, io-uring, linux-block
On 10/10/24 04:00, Ming Lei wrote:
> On Wed, Oct 09, 2024 at 03:25:25PM +0100, Pavel Begunkov wrote:
>> On 10/6/24 09:20, Ming Lei wrote:
>>> On Fri, Oct 04, 2024 at 04:32:04PM +0100, Pavel Begunkov wrote:
>>>> On 9/12/24 11:49, Ming Lei wrote:
>>>> ...
>>>>> It can help to implement generic zero copy between device and related
>>>>> operations, such as ublk, fuse, vdpa,
>>>>> even network receive or whatever.
>>>>
>>>> That's exactly the thing it can't sanely work with because
>>>> of this design.
>>>
>>> The provide buffer design is absolutely generic, and basically
>>>
>>> - group leader provides buffer for member OPs, and member OPs can borrow
>>> the buffer if leader allows by calling io_provide_group_kbuf()
>>>
>>> - after member OPs consumes the buffer, the buffer is returned back by
>>> the callback implemented in group leader subsystem, so group leader can
>>> release related sources;
>>>
>>> - and it is guaranteed that the buffer can be released always
>>>
>>> The ublk implementation is pretty simple, it can be reused in device driver
>>> to share buffer with other kernel subsystems.
>>>
>>> I don't see anything insane with the design.
>>
>> There is nothing insane with the design, but the problem is cross
>> request error handling, same thing that makes links a pain to use.
>
> Wrt. link, the whole group is linked in the chain, and it respects
> all existed link rule, care to share the trouble in link use case?
Error handling is a pain, it has been, even for pure links without
any groups. Even with a simple req1 -> req2, you need to track that if
the first request fails you should expect another failed CQE for
the second request, probably refcount (let's say non-atomically)
some structure and clean it up when you get both CQEs. It's not
prettier when the 2nd fails, especially if you consider short IO
and that you can't fully retry that partial IO, e.g. you consumed
data from the socket. And so on.
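For illustration, the userspace bookkeeping for even that trivial chain
looks roughly like this (a liburing sketch with assumed fd/sock/buf
arguments, not code from this series):

#include <liburing.h>

/* Sketch only: issue a req1 -> req2 IOSQE_IO_LINK pair and reap both CQEs.
 * 'ring' is assumed to be an already initialised io_uring. */
static int submit_link_and_reap(struct io_uring *ring, int fd, int sock,
				void *buf, unsigned int len)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int failed = 0, i;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, fd, buf, len, 0);
	sqe->flags |= IOSQE_IO_LINK;		/* req1 */
	sqe->user_data = 1;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_send(sqe, sock, buf, len, 0);
	sqe->user_data = 2;			/* req2, runs only if req1 succeeds */

	io_uring_submit(ring);

	/* Both CQEs must be collected even when req1 fails: req2 then shows
	 * up as -ECANCELED, and any state shared by the pair can only be
	 * freed once both completions have been seen. */
	for (i = 0; i < 2; i++) {
		if (io_uring_wait_cqe(ring, &cqe))
			break;
		if (cqe->res < 0)
			failed++;
		io_uring_cqe_seen(ring, cqe);
	}
	return failed;
}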
> The only thing I thought of is that group internal link isn't supported
> yet, but it may be added in future if use case is coming.
>
>> It's good that with storage reads are reasonably idempotent and you
>> can be retried if needed. With sockets and streams, however, you
>> can't sanely borrow a buffer without consuming it, so if a member
>> request processing the buffer fails for any reason, the user data
>> will be dropped on the floor. I mentioned quite a while before,
>> if for example you stash the buffer somewhere you can access
>> across syscalls like the io_uring's registered buffer table, the
>> user at least would be able to find an error and then memcpy the
>> unprocessed data as a fallback.
>
> I guess it is net rx case, which requires buffer to cross syscalls,
> then the buffer has to be owned by userspace, otherwise the buffer
> can be leaked easily.
>
> That may not match with sqe group which is supposed to borrow kernel
> buffer consumed by users.
It doesn't necessarily require keeping buffers across syscalls
per se, it just can't drop the data it got on the floor. It's
just that storage can read the data again.
...
>>>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>>>> index 793d5a26d9b8..445e5507565a 100644
>>>>> --- a/include/linux/io_uring_types.h
>>>>> +++ b/include/linux/io_uring_types.h
...
>>>> FWIW, would be nice if during init figure we can verify that the leader
>>>> provides a buffer IFF there is someone consuming it, but I don't think
>>>
>>> It isn't doable, same reason with IORING_OP_PROVIDE_BUFFERS, since buffer can
>>> only be provided in ->issue().
>>
>> In theory we could, in practise it'd be too much of a pain, I agree.
>>
>> IORING_OP_PROVIDE_BUFFERS is different as you just stash the buffer
>> in the io_uring instance, and it's used at an unspecified time later
>> by some request. In this sense the API is explicit, requests that don't
>> support it but marked with IOSQE_BUFFER_SELECT will be failed by the
>> kernel.
>
> That is also one reason why I add ->accept_group_kbuf.
I probably missed that, but I haven't seen that
>>>> the semantics is flexible enough to do it sanely. i.e. there are many
>>>> members in a group, some might want to use the buffer and some might not.
>>>>
...
>>>>> + if (!kbuf->bvec)
>>>>> + return -EINVAL;
>>>>
>>>> How can this happen?
>>>
>>> OK, we can run the check in uring_cmd API.
>>
>> Not sure I follow, if a request providing a buffer can't set
>> a bvec it should just fail, without exposing half made
>> io_uring_kernel_buf to other requests.
>>
>> Is it rather a WARN_ON_ONCE check?
>
> I meant we can check it in API of io_provide_group_kbuf() since the group
> buffer is filled by driver, since then the buffer is immutable, and we
> needn't any other check.
That'd be a buggy provider, so it sounds like WARN_ON_ONCE
...
>>>>> if (unlikely(ret < 0))
>>>>> @@ -593,6 +600,15 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
>>>>> if (issue_flags & IO_URING_F_NONBLOCK)
>>>>> flags |= MSG_DONTWAIT;
>>>>> + if (req->flags & REQ_F_GROUP_KBUF) {
>>>>
>>>> Does anything prevent the request to be marked by both
>>>> GROUP_KBUF and BUFFER_SELECT? In which case we'd set up
>>>> a group kbuf and then go to the io_do_buffer_select()
>>>> overriding all of that
>>>
>>> It could be used in this way, and we can fail the member in
>>> io_queue_group_members().
>>
>> That's where the opdef flag could actually be useful,
>>
>> if (opdef[member]->accept_group_kbuf &&
>> member->flags & SELECT_BUF)
>> fail;
>>
>>
>>>>> + ret = io_import_group_kbuf(req,
>>>>> + user_ptr_to_u64(sr->buf),
>>>>> + sr->len, ITER_SOURCE,
>>>>> + &kmsg->msg.msg_iter);
>>>>> + if (unlikely(ret))
>>>>> + return ret;
>>>>> + }
>>>>> +
>>>>> retry_bundle:
>>>>> if (io_do_buffer_select(req)) {
>>>>> struct buf_sel_arg arg = {
>>>>> @@ -1154,6 +1170,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
>>>>> goto out_free;
>>>>> }
>>>>> sr->buf = NULL;
>>>>> + } else if (req->flags & REQ_F_GROUP_KBUF) {
>>>>
>>>> What happens if we get a short read/recv?
>>>
>>> For short read/recv, any progress is stored in iterator, nothing to do
>>> with the provide buffer, which is immutable.
>>>
>>> One problem for read is reissue, but it can be handled by saving iter
>>> state after the group buffer is imported, I will fix it in next version.
>>> For net recv, offset/len of buffer is updated in case of short recv, so
>>> it works as expected.
>>
>> That was one of my worries.
>>
>>> Or any other issue with short read/recv? Can you explain in detail?
>>
>> To sum up design wise, when members that are using the buffer as a
>> source, e.g. write/send, fail, the user is expected to usually reissue
>> both the write and the ublk cmd.
>>
>> Let's say you have a ublk leader command providing a 4K buffer, and
>> you group it with a 4K send using the buffer. Let's assume the send
>> is short and does only 2K of data. Then the user would normally
>> reissue:
>>
>> ublk(4K, GROUP), send(off=2K)
>>
>> That's fine assuming short IO is rare.
>>
>> I worry more about the backward flow, ublk provides an "empty" buffer
>> to receive/read into. ublk wants to do something with the buffer in
>> the callback. What happens when read/recv is short (and cannot be
>> retried by io_uring)?
>>
>> 1. ublk(provide empty 4K buffer)
>> 2. recv, ret=2K
>> 3. ->grp_kbuf_ack: ublk should commit back only 2K
>> of data and not assume it's 4K
>
> ->grp_kbuf_ack is supposed to only return back the buffer to the
> owner, and it doesn't care result of buffer consumption.
>
> When ->grp_kbuf_ack() is done, it means this time buffer borrow is
> over.
>
> When userspace figures out it is one short read, it will send one
> ublk uring_cmd to notify that this io command is completed with
> result(2k). ublk driver may decide to requeue this io command for
> retrying the remained bytes, when only remained part of the buffer
> is allowed to borrow in following provide uring command originated
> from userspace.
My apologies, I failed to notice that moment, even though I should've
given it some thinking at the very beginning. I think that part would
be a terrible interface. Might be good enough for ublk, but we can't
be creating ublk-specific features that change the entire io_uring.
Without knowing how much data it actually got, in the generic case you
1) need to require the buffer to be fully initialised / zeroed
before handing it over. 2) Can't ever commit the data from the callback,
but would need to wait until the userspace reacts. Yes, it
works in the specific context of ublk, but I don't think it works
as a generic interface.
We need to fall back again and think if we can reuse the registered
buffer table or something else, and make it much cleaner and more
accommodating to other users. Jens, can you give a quick thought
about the API? You've done a lot of interfaces before and hopefully
have some ideas here.
> For ublk use case, so far so good.
>
>>
>> Another option is to fail ->grp_kbuf_ack if any member fails, but
>> the API might be a bit too awkward and inconvenient.
>
> We needn't ->grp_kbuf_ack() to cover buffer consumption.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-10 15:48 ` Pavel Begunkov
@ 2024-10-10 19:31 ` Jens Axboe
2024-10-11 2:30 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2024-10-10 19:31 UTC (permalink / raw)
To: Pavel Begunkov, Ming Lei; +Cc: io-uring, linux-block
Hi,
Discussed this with Pavel, and on his suggestion, I tried prototyping a
"buffer update" opcode. Basically it works like
IORING_REGISTER_BUFFERS_UPDATE in that it can update an existing buffer
registration. But it works as an sqe rather than being a sync opcode.
The idea here is that you could do that upfront, or as part of a chain,
and have it be generically available, just like any other buffer that
was registered upfront. You do need an empty table registered first,
which can just be sparse. And since you can pick the slot it goes into,
you can rely on that slot afterwards (either as a link, or just the
following sqe).
Quick'n dirty obviously, but I did write a quick test case too to verify
that:
1) It actually works (it seems to)
2) It's not too slow (it seems not to be, I can get ~2.5M updates per
second in a vm on my laptop, which isn't too bad).
Not saying this is perfect, but perhaps it's worth entertaining an idea
like that? It has the added benefit of being persistent across system
calls as well, unless you do another IORING_OP_BUF_UPDATE at the end of
your chain to re-set it.
Comments? Could it be useful for this?
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 86cb385fe0b5..02d4b66267ef 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -259,6 +259,7 @@ enum io_uring_op {
IORING_OP_FTRUNCATE,
IORING_OP_BIND,
IORING_OP_LISTEN,
+ IORING_OP_BUF_UPDATE,
/* this goes last, obviously */
IORING_OP_LAST,
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index a2be3bbca5ff..cda35d22397d 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -515,6 +515,10 @@ const struct io_issue_def io_issue_defs[] = {
.prep = io_eopnotsupp_prep,
#endif
},
+ [IORING_OP_BUF_UPDATE] = {
+ .prep = io_buf_update_prep,
+ .issue = io_buf_update,
+ },
};
const struct io_cold_def io_cold_defs[] = {
@@ -742,6 +746,9 @@ const struct io_cold_def io_cold_defs[] = {
[IORING_OP_LISTEN] = {
.name = "LISTEN",
},
+ [IORING_OP_BUF_UPDATE] = {
+ .name = "BUF_UPDATE",
+ },
};
const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 33a3d156a85b..6f0071733018 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1236,3 +1236,44 @@ int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg)
fput(file);
return ret;
}
+
+struct io_buf_update {
+ struct file *file;
+ struct io_uring_rsrc_update2 up;
+};
+
+int io_buf_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ struct io_buf_update *ibu = io_kiocb_to_cmd(req, struct io_buf_update);
+ struct io_uring_rsrc_update2 __user *uaddr;
+
+ if (!req->ctx->buf_data)
+ return -ENXIO;
+ if (sqe->ioprio || sqe->fd || sqe->addr2 || sqe->rw_flags ||
+ sqe->splice_fd_in)
+ return -EINVAL;
+ if (sqe->len != 1)
+ return -EINVAL;
+
+ uaddr = u64_to_user_ptr(READ_ONCE(sqe->addr));
+ if (copy_from_user(&ibu->up, uaddr, sizeof(*uaddr)))
+ return -EFAULT;
+
+ return 0;
+}
+
+int io_buf_update(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_buf_update *ibu = io_kiocb_to_cmd(req, struct io_buf_update);
+ struct io_ring_ctx *ctx = req->ctx;
+ int ret;
+
+ io_ring_submit_lock(ctx, issue_flags);
+ ret = __io_register_rsrc_update(ctx, IORING_RSRC_BUFFER, &ibu->up, ibu->up.nr);
+ io_ring_submit_unlock(ctx, issue_flags);
+
+ if (ret < 0)
+ req_set_fail(req);
+ io_req_set_res(req, ret, 0);
+ return 0;
+}
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 8ed588036210..d41e75c956ef 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -142,4 +142,7 @@ static inline void __io_unaccount_mem(struct user_struct *user,
atomic_long_sub(nr_pages, &user->locked_vm);
}
+int io_buf_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_buf_update(struct io_kiocb *req, unsigned int issue_flags);
+
#endif
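For reference, a rough sketch of how userspace might drive this prototype
opcode (hypothetical: the opcode only exists in the quick'n'dirty patch
above and liburing has no helper for it, so the sqe is filled by hand):

#include <string.h>
#include <liburing.h>

/* Sketch: point registered-buffer slot 'slot' at 'iov' via the prototype
 * IORING_OP_BUF_UPDATE. Assumes a (possibly sparse) buffer table was
 * registered first, e.g. with io_uring_register_buffers_sparse(). */
static int queue_buf_update(struct io_uring *ring,
			    struct io_uring_rsrc_update2 *up,
			    struct iovec *iov, unsigned int slot)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;

	memset(up, 0, sizeof(*up));
	up->offset = slot;			/* slot in the table to update */
	up->data = (unsigned long)iov;		/* iovec describing the new buffer */
	up->nr = 1;

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_BUF_UPDATE;	/* prototype opcode from this patch */
	sqe->addr = (unsigned long)up;		/* copied in io_buf_update_prep() */
	sqe->len = 1;				/* anything else is rejected by prep */
	/* fd/ioprio/addr2/rw_flags/splice_fd_in must stay zero per the prep */
	return 0;
}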
--
Jens Axboe
^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH V6 6/8] io_uring: support providing sqe group buffer
2024-10-10 18:51 ` Pavel Begunkov
@ 2024-10-11 2:00 ` Ming Lei
2024-10-11 4:06 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-10-11 2:00 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block, ming.lei
On Thu, Oct 10, 2024 at 07:51:19PM +0100, Pavel Begunkov wrote:
> On 10/10/24 04:00, Ming Lei wrote:
> > On Wed, Oct 09, 2024 at 03:25:25PM +0100, Pavel Begunkov wrote:
> > > On 10/6/24 09:20, Ming Lei wrote:
> > > > On Fri, Oct 04, 2024 at 04:32:04PM +0100, Pavel Begunkov wrote:
> > > > > On 9/12/24 11:49, Ming Lei wrote:
> > > > > ...
> > > > > > It can help to implement generic zero copy between device and related
> > > > > > operations, such as ublk, fuse, vdpa,
> > > > > > even network receive or whatever.
> > > > >
> > > > > That's exactly the thing it can't sanely work with because
> > > > > of this design.
> > > >
> > > > The provide buffer design is absolutely generic, and basically
> > > >
> > > > - group leader provides buffer for member OPs, and member OPs can borrow
> > > > the buffer if leader allows by calling io_provide_group_kbuf()
> > > >
> > > > - after member OPs consumes the buffer, the buffer is returned back by
> > > > the callback implemented in group leader subsystem, so group leader can
> > > > release related sources;
> > > >
> > > > - and it is guaranteed that the buffer can be released always
> > > >
> > > > The ublk implementation is pretty simple, it can be reused in device driver
> > > > to share buffer with other kernel subsystems.
> > > >
> > > > I don't see anything insane with the design.
> > >
> > > There is nothing insane with the design, but the problem is cross
> > > request error handling, same thing that makes links a pain to use.
> >
> > Wrt. link, the whole group is linked in the chain, and it respects
> > all existed link rule, care to share the trouble in link use case?
>
> Error handling is a pain, it has been, even for pure link without
> any groups. Even with a simple req1 -> req2, you need to track if
> the first request fails you need to expect another failed CQE for
> the second request, probably refcount (let's say non-atomically)
> some structure and clean it up when you get both CQEs. It's not
> prettier when the 2nd fails, especially if you consider short IO
> and that you can't fully retry that partial IO, e.g. you consumed
> data from the socket. And so on.
>
> > The only thing I thought of is that group internal link isn't supported
> > yet, but it may be added in future if use case is coming.
> >
> > > It's good that with storage reads are reasonably idempotent and you
> > > can be retried if needed. With sockets and streams, however, you
> > > can't sanely borrow a buffer without consuming it, so if a member
> > > request processing the buffer fails for any reason, the user data
> > > will be dropped on the floor. I mentioned quite a while before,
> > > if for example you stash the buffer somewhere you can access
> > > across syscalls like the io_uring's registered buffer table, the
> > > user at least would be able to find an error and then memcpy the
> > > unprocessed data as a fallback.
> >
> > I guess it is net rx case, which requires buffer to cross syscalls,
> > then the buffer has to be owned by userspace, otherwise the buffer
> > can be leaked easily.
> >
> > That may not match with sqe group which is supposed to borrow kernel
> > buffer consumed by users.
>
> It doesn't necessarily require to keep buffers across syscalls
> per se, it just can't drop the data it got on the floor. It's
> just storage can read data again.
In case of a short read, the data is really stored (not dropped) in the provided
buffer, and you can consume the short read data, or continue to read more into
the same buffer.
What is your real issue here?
>
> ...
> > > > > > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > > > > > index 793d5a26d9b8..445e5507565a 100644
> > > > > > --- a/include/linux/io_uring_types.h
> > > > > > +++ b/include/linux/io_uring_types.h
> ...
> > > > > FWIW, would be nice if during init figure we can verify that the leader
> > > > > provides a buffer IFF there is someone consuming it, but I don't think
> > > >
> > > > It isn't doable, same reason with IORING_OP_PROVIDE_BUFFERS, since buffer can
> > > > only be provided in ->issue().
> > >
> > > In theory we could, in practise it'd be too much of a pain, I agree.
> > >
> > > IORING_OP_PROVIDE_BUFFERS is different as you just stash the buffer
> > > in the io_uring instance, and it's used at an unspecified time later
> > > by some request. In this sense the API is explicit, requests that don't
> > > support it but marked with IOSQE_BUFFER_SELECT will be failed by the
> > > kernel.
> >
> > That is also one reason why I add ->accept_group_kbuf.
>
> I probably missed that, but I haven't seen that
For example, any OP using a fixed buffer can't set ->accept_group_kbuf.
>
> > > > > the semantics is flexible enough to do it sanely. i.e. there are many
> > > > > members in a group, some might want to use the buffer and some might not.
> > > > >
> ...
> > > > > > + if (!kbuf->bvec)
> > > > > > + return -EINVAL;
> > > > >
> > > > > How can this happen?
> > > >
> > > > OK, we can run the check in uring_cmd API.
> > >
> > > Not sure I follow, if a request providing a buffer can't set
> > > a bvec it should just fail, without exposing half made
> > > io_uring_kernel_buf to other requests.
> > >
> > > Is it rather a WARN_ON_ONCE check?
> >
> > I meant we can check it in API of io_provide_group_kbuf() since the group
> > buffer is filled by driver, since then the buffer is immutable, and we
> > needn't any other check.
>
> That's be a buggy provider, so sounds like WARN_ON_ONCE
Not at all.
If the driver provides a bad buffer, the group leader and member OPs will all
be failed, and userspace gets notified.
>
> ...
> > > > > > if (unlikely(ret < 0))
> > > > > > @@ -593,6 +600,15 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
> > > > > > if (issue_flags & IO_URING_F_NONBLOCK)
> > > > > > flags |= MSG_DONTWAIT;
> > > > > > + if (req->flags & REQ_F_GROUP_KBUF) {
> > > > >
> > > > > Does anything prevent the request to be marked by both
> > > > > GROUP_KBUF and BUFFER_SELECT? In which case we'd set up
> > > > > a group kbuf and then go to the io_do_buffer_select()
> > > > > overriding all of that
> > > >
> > > > It could be used in this way, and we can fail the member in
> > > > io_queue_group_members().
> > >
> > > That's where the opdef flag could actually be useful,
> > >
> > > if (opdef[member]->accept_group_kbuf &&
> > > member->flags & SELECT_BUF)
> > > fail;
> > >
> > >
> > > > > > + ret = io_import_group_kbuf(req,
> > > > > > + user_ptr_to_u64(sr->buf),
> > > > > > + sr->len, ITER_SOURCE,
> > > > > > + &kmsg->msg.msg_iter);
> > > > > > + if (unlikely(ret))
> > > > > > + return ret;
> > > > > > + }
> > > > > > +
> > > > > > retry_bundle:
> > > > > > if (io_do_buffer_select(req)) {
> > > > > > struct buf_sel_arg arg = {
> > > > > > @@ -1154,6 +1170,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
> > > > > > goto out_free;
> > > > > > }
> > > > > > sr->buf = NULL;
> > > > > > + } else if (req->flags & REQ_F_GROUP_KBUF) {
> > > > >
> > > > > What happens if we get a short read/recv?
> > > >
> > > > For short read/recv, any progress is stored in iterator, nothing to do
> > > > with the provide buffer, which is immutable.
> > > >
> > > > One problem for read is reissue, but it can be handled by saving iter
> > > > state after the group buffer is imported, I will fix it in next version.
> > > > For net recv, offset/len of buffer is updated in case of short recv, so
> > > > it works as expected.
> > >
> > > That was one of my worries.
> > >
> > > > Or any other issue with short read/recv? Can you explain in detail?
> > >
> > > To sum up design wise, when members that are using the buffer as a
> > > source, e.g. write/send, fail, the user is expected to usually reissue
> > > both the write and the ublk cmd.
> > >
> > > Let's say you have a ublk leader command providing a 4K buffer, and
> > > you group it with a 4K send using the buffer. Let's assume the send
> > > is short and does't only 2K of data. Then the user would normally
> > > reissue:
> > >
> > > ublk(4K, GROUP), send(off=2K)
> > >
> > > That's fine assuming short IO is rare.
> > >
> > > I worry more about the backward flow, ublk provides an "empty" buffer
> > > to receive/read into. ublk wants to do something with the buffer in
> > > the callback. What happens when read/recv is short (and cannot be
> > > retried by io_uring)?
> > >
> > > 1. ublk(provide empty 4K buffer)
> > > 2. recv, ret=2K
> > > 3. ->grp_kbuf_ack: ublk should commit back only 2K
> > > of data and not assume it's 4K
> >
> > ->grp_kbuf_ack is supposed to only return back the buffer to the
> > owner, and it doesn't care result of buffer consumption.
> >
> > When ->grp_kbuf_ack() is done, it means this time buffer borrow is
> > over.
> >
> > When userspace figures out it is one short read, it will send one
> > ublk uring_cmd to notify that this io command is completed with
> > result(2k). ublk driver may decide to requeue this io command for
> > retrying the remained bytes, when only remained part of the buffer
> > is allowed to borrow in following provide uring command originated
> > from userspace.
>
> My apologies, I failed to notice that moment, even though should've
> given it some thinking at the very beginning. I think that part would
> be a terrible interface. Might be good enough for ublk, but we can't
> be creating a ublk specific features that change the entire io_uring.
> Without knowing how much data it actually got, in generic case you
You do know how much data the member OP actually got, don't you?
> 1) need to require the buffer to be fully initialised / zeroed
> before handing it.
The buffer really is initialized before being provided via
io_provide_group_kbuf(). And it is a bvec buffer: whenever part of it
is consumed, the iterator is advanced, so the consumer OP always sees
an initialized buffer.
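Loosely, the consumer-side import then amounts to wrapping that immutable
bvec array in an iov_iter and positioning it (a kernel-side sketch; apart
from kbuf->bvec, the struct and field names are assumed from context rather
than taken from the patch):

#include <linux/uio.h>

/* Sketch: build an iterator over an immutable, provided group kbuf.
 * 'nr_bvecs' and 'len' are assumed field names for illustration. */
static int consume_group_kbuf(const struct io_uring_kernel_buf *kbuf,
			      size_t off, size_t len, unsigned int dir,
			      struct iov_iter *iter)
{
	if (off + len > kbuf->len)
		return -EFAULT;

	/* the bvec table itself never changes ... */
	iov_iter_bvec(iter, dir, kbuf->bvec, kbuf->nr_bvecs, kbuf->len);
	/* ... only the iterator position moves, e.g. for sub-ranges or
	 * after a short read/recv */
	iov_iter_advance(iter, off);
	iov_iter_truncate(iter, len);
	return 0;
}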
> 2) Can't ever commit the data from the callback,
What do you mean by `commit`?
The callback has been documented clearly from the beginning as being for
returning the buffer back to the owner.
Only member OPs consume the buffer; the group leader provides a valid buffer
for the member OPs, and the buffer lifetime is aligned with the group leader
request.
> but it would need to wait until the userspace reacts. Yes, it
> works in the specific context of ublk, but I don't think it works
> as a generic interface.
That is just how ublk uses the group buffer; it doesn't have to be exactly
this way.
Anytime the buffer is provided via io_provide_group_kbuf() successfully,
the member OPs can consume it safely, and the buffer is finally returned
back once all member OPs have completed. That is all.
Please explain why it isn't a generic interface.
Thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-10 19:31 ` Jens Axboe
@ 2024-10-11 2:30 ` Ming Lei
2024-10-11 2:39 ` Jens Axboe
0 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-10-11 2:30 UTC (permalink / raw)
To: Jens Axboe; +Cc: Pavel Begunkov, io-uring, linux-block
Hi Jens,
On Thu, Oct 10, 2024 at 01:31:21PM -0600, Jens Axboe wrote:
> Hi,
>
> Discussed this with Pavel, and on his suggestion, I tried prototyping a
> "buffer update" opcode. Basically it works like
> IORING_REGISTER_BUFFERS_UPDATE in that it can update an existing buffer
> registration. But it works as an sqe rather than being a sync opcode.
>
> The idea here is that you could do that upfront, or as part of a chain,
> and have it be generically available, just like any other buffer that
> was registered upfront. You do need an empty table registered first,
> which can just be sparse. And since you can pick the slot it goes into,
> you can rely on that slot afterwards (either as a link, or just the
> following sqe).
>
> Quick'n dirty obviously, but I did write a quick test case too to verify
> that:
>
> 1) It actually works (it seems to)
It doesn't work for ublk zc since ublk needs to provide one kernel buffer
for fs rw & net send/recv to consume, and the kernel buffer is invisible
to userspace. But __io_register_rsrc_update() can only register userspace
buffers.
Also multiple OPs may consume the buffer concurrently, which can't be
supported by buffer select.
thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-11 2:30 ` Ming Lei
@ 2024-10-11 2:39 ` Jens Axboe
2024-10-11 3:07 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2024-10-11 2:39 UTC (permalink / raw)
To: Ming Lei; +Cc: Pavel Begunkov, io-uring, linux-block
On 10/10/24 8:30 PM, Ming Lei wrote:
> Hi Jens,
>
> On Thu, Oct 10, 2024 at 01:31:21PM -0600, Jens Axboe wrote:
>> Hi,
>>
>> Discussed this with Pavel, and on his suggestion, I tried prototyping a
>> "buffer update" opcode. Basically it works like
>> IORING_REGISTER_BUFFERS_UPDATE in that it can update an existing buffer
>> registration. But it works as an sqe rather than being a sync opcode.
>>
>> The idea here is that you could do that upfront, or as part of a chain,
>> and have it be generically available, just like any other buffer that
>> was registered upfront. You do need an empty table registered first,
>> which can just be sparse. And since you can pick the slot it goes into,
>> you can rely on that slot afterwards (either as a link, or just the
>> following sqe).
>>
>> Quick'n dirty obviously, but I did write a quick test case too to verify
>> that:
>>
>> 1) It actually works (it seems to)
>
> It doesn't work for ublk zc since ublk needs to provide one kernel buffer
> for fs rw & net send/recv to consume, and the kernel buffer is invisible
> to userspace. But __io_register_rsrc_update() only can register userspace
> buffer.
I'd be surprised if this simple one was enough! In terms of user vs
kernel buffer, you could certainly use the same mechanism, and just
ensure that buffers are tagged appropriately. I need to think about that
a little bit.
There are certainly many different ways that can get propagated which
would not entail a complicated mechanism. I really like the aspect of
having the identifier being the same thing that we already use, and
hence not needing to be something new on the side.
> Also multiple OPs may consume the buffer concurrently, which can't be
> supported by buffer select.
Why not? You can certainly have multiple ops using the same registered
buffer concurrently right now.
--
Jens Axboe
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-11 2:39 ` Jens Axboe
@ 2024-10-11 3:07 ` Ming Lei
2024-10-11 13:24 ` Jens Axboe
0 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-10-11 3:07 UTC (permalink / raw)
To: Jens Axboe; +Cc: Pavel Begunkov, io-uring, linux-block, ming.lei
On Thu, Oct 10, 2024 at 08:39:12PM -0600, Jens Axboe wrote:
> On 10/10/24 8:30 PM, Ming Lei wrote:
> > Hi Jens,
> >
> > On Thu, Oct 10, 2024 at 01:31:21PM -0600, Jens Axboe wrote:
> >> Hi,
> >>
> >> Discussed this with Pavel, and on his suggestion, I tried prototyping a
> >> "buffer update" opcode. Basically it works like
> >> IORING_REGISTER_BUFFERS_UPDATE in that it can update an existing buffer
> >> registration. But it works as an sqe rather than being a sync opcode.
> >>
> >> The idea here is that you could do that upfront, or as part of a chain,
> >> and have it be generically available, just like any other buffer that
> >> was registered upfront. You do need an empty table registered first,
> >> which can just be sparse. And since you can pick the slot it goes into,
> >> you can rely on that slot afterwards (either as a link, or just the
> >> following sqe).
> >>
> >> Quick'n dirty obviously, but I did write a quick test case too to verify
> >> that:
> >>
> >> 1) It actually works (it seems to)
> >
> > It doesn't work for ublk zc since ublk needs to provide one kernel buffer
> > for fs rw & net send/recv to consume, and the kernel buffer is invisible
> > to userspace. But __io_register_rsrc_update() only can register userspace
> > buffer.
>
> I'd be surprised if this simple one was enough! In terms of user vs
> kernel buffer, you could certainly use the same mechanism, and just
> ensure that buffers are tagged appropriately. I need to think about that
> a little bit.
It is actually the same as with IORING_OP_PROVIDE_BUFFERS, so the following
consumer OPs have to wait until this OP_BUF_UPDATE is completed.
Suppose we have N consumer OPs which depend on OP_BUF_UPDATE.
1) all N OPs are linked with OP_BUF_UPDATE
Or
2) submit OP_BUF_UPDATE first, and wait its completion, then submit N
OPs concurrently.
But 1) and 2) may slow down the IO handling. In 1) all N OPs are serialized,
and 1 extra syscall is introduced in 2) (roughly the pattern sketched below).
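For illustration, option 2) looks like this from userspace (a liburing
sketch; the queue_* helpers are hypothetical placeholders):

/* Sketch of option 2): one extra submit-and-wait round trip per buffer. */
void queue_buf_update_sqe(struct io_uring *ring);		/* hypothetical */
void queue_consumer_sqe(struct io_uring *ring, unsigned int i);	/* hypothetical */

static int provide_then_consume(struct io_uring *ring, unsigned int nr_ops)
{
	struct io_uring_cqe *cqe;
	unsigned int i;

	/* 1st syscall: publish the buffer and wait for the update to land */
	queue_buf_update_sqe(ring);
	io_uring_submit_and_wait(ring, 1);
	if (!io_uring_peek_cqe(ring, &cqe))
		io_uring_cqe_seen(ring, cqe);

	/* 2nd syscall: only now can the N consumer OPs be submitted */
	for (i = 0; i < nr_ops; i++)
		queue_consumer_sqe(ring, i);
	return io_uring_submit(ring);
}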
The same thing exists for the next OP_BUF_UPDATE, which has to wait until
all the previous buffer consumers are done. So the same slowdown is
doubled. Not to mention the application becomes more complicated.
Here the provided buffer is only visible among the N OPs, so making
it global isn't necessary and only slows things down. And it has a kbuf
lifetime issue.
Also it makes error handling more complicated: io_uring has to remove
the kernel buffer when the current task exits, so a dependency or ordering
with the buffer provider is introduced.
There could be more problems; I will try to recall all the related issues
I thought about before.
>
> There are certainly many different ways that can get propagated which
> would not entail a complicated mechanism. I really like the aspect of
> having the identifier being the same thing that we already use, and
> hence not needing to be something new on the side.
>
> > Also multiple OPs may consume the buffer concurrently, which can't be
> > supported by buffer select.
>
> Why not? You can certainly have multiple ops using the same registered
> buffer concurrently right now.
Please see the above problem.
Also I remember that the selected buffer is removed from buffer list,
see io_provided_buffer_select(), but maybe I am wrong.
Thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 6/8] io_uring: support providing sqe group buffer
2024-10-11 2:00 ` Ming Lei
@ 2024-10-11 4:06 ` Ming Lei
0 siblings, 0 replies; 47+ messages in thread
From: Ming Lei @ 2024-10-11 4:06 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block, ming.lei
On Fri, Oct 11, 2024 at 10:00:06AM +0800, Ming Lei wrote:
> On Thu, Oct 10, 2024 at 07:51:19PM +0100, Pavel Begunkov wrote:
> > On 10/10/24 04:00, Ming Lei wrote:
> > > On Wed, Oct 09, 2024 at 03:25:25PM +0100, Pavel Begunkov wrote:
> > > > On 10/6/24 09:20, Ming Lei wrote:
> > > > > On Fri, Oct 04, 2024 at 04:32:04PM +0100, Pavel Begunkov wrote:
> > > > > > On 9/12/24 11:49, Ming Lei wrote:
> > > > > > ...
> > > > > > > It can help to implement generic zero copy between device and related
> > > > > > > operations, such as ublk, fuse, vdpa,
> > > > > > > even network receive or whatever.
> > > > > >
> > > > > > That's exactly the thing it can't sanely work with because
> > > > > > of this design.
> > > > >
> > > > > The provide buffer design is absolutely generic, and basically
> > > > >
> > > > > - group leader provides buffer for member OPs, and member OPs can borrow
> > > > > the buffer if leader allows by calling io_provide_group_kbuf()
> > > > >
> > > > > - after member OPs consumes the buffer, the buffer is returned back by
> > > > > the callback implemented in group leader subsystem, so group leader can
> > > > > release related sources;
> > > > >
> > > > > - and it is guaranteed that the buffer can be released always
> > > > >
> > > > > The ublk implementation is pretty simple, it can be reused in device driver
> > > > > to share buffer with other kernel subsystems.
> > > > >
> > > > > I don't see anything insane with the design.
> > > >
> > > > There is nothing insane with the design, but the problem is cross
> > > > request error handling, same thing that makes links a pain to use.
> > >
> > > Wrt. link, the whole group is linked in the chain, and it respects
> > > all existed link rule, care to share the trouble in link use case?
> >
> > Error handling is a pain, it has been, even for pure link without
> > any groups. Even with a simple req1 -> req2, you need to track if
> > the first request fails you need to expect another failed CQE for
> > the second request, probably refcount (let's say non-atomically)
> > some structure and clean it up when you get both CQEs. It's not
> > prettier when the 2nd fails, especially if you consider short IO
> > and that you can't fully retry that partial IO, e.g. you consumed
> > data from the socket. And so on.
> >
> > > The only thing I thought of is that group internal link isn't supported
> > > yet, but it may be added in future if use case is coming.
> > >
> > > > It's good that with storage reads are reasonably idempotent and you
> > > > can be retried if needed. With sockets and streams, however, you
> > > > can't sanely borrow a buffer without consuming it, so if a member
> > > > request processing the buffer fails for any reason, the user data
> > > > will be dropped on the floor. I mentioned quite a while before,
> > > > if for example you stash the buffer somewhere you can access
> > > > across syscalls like the io_uring's registered buffer table, the
> > > > user at least would be able to find an error and then memcpy the
> > > > unprocessed data as a fallback.
> > >
> > > I guess it is net rx case, which requires buffer to cross syscalls,
> > > then the buffer has to be owned by userspace, otherwise the buffer
> > > can be leaked easily.
> > >
> > > That may not match with sqe group which is supposed to borrow kernel
> > > buffer consumed by users.
> >
> > It doesn't necessarily require to keep buffers across syscalls
> > per se, it just can't drop the data it got on the floor. It's
> > just storage can read data again.
>
> In case of short read, data is really stored(not dropped) in the provided
> buffer, and you can consume the short read data, or continue to read more to
> the same buffer.
>
> What is the your real issue here?
>
> >
> > ...
> > > > > > > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > > > > > > index 793d5a26d9b8..445e5507565a 100644
> > > > > > > --- a/include/linux/io_uring_types.h
> > > > > > > +++ b/include/linux/io_uring_types.h
> > ...
> > > > > > FWIW, would be nice if during init figure we can verify that the leader
> > > > > > provides a buffer IFF there is someone consuming it, but I don't think
> > > > >
> > > > > It isn't doable, same reason with IORING_OP_PROVIDE_BUFFERS, since buffer can
> > > > > only be provided in ->issue().
> > > >
> > > > In theory we could, in practise it'd be too much of a pain, I agree.
> > > >
> > > > IORING_OP_PROVIDE_BUFFERS is different as you just stash the buffer
> > > > in the io_uring instance, and it's used at an unspecified time later
> > > > by some request. In this sense the API is explicit, requests that don't
> > > > support it but marked with IOSQE_BUFFER_SELECT will be failed by the
> > > > kernel.
> > >
> > > That is also one reason why I add ->accept_group_kbuf.
> >
> > I probably missed that, but I haven't seen that
>
> Such as, any OPs with fixed buffer can't set ->accept_group_kbuf.
>
> >
> > > > > > the semantics is flexible enough to do it sanely. i.e. there are many
> > > > > > members in a group, some might want to use the buffer and some might not.
> > > > > >
> > ...
> > > > > > > + if (!kbuf->bvec)
> > > > > > > + return -EINVAL;
> > > > > >
> > > > > > How can this happen?
> > > > >
> > > > > OK, we can run the check in uring_cmd API.
> > > >
> > > > Not sure I follow, if a request providing a buffer can't set
> > > > a bvec it should just fail, without exposing half made
> > > > io_uring_kernel_buf to other requests.
> > > >
> > > > Is it rather a WARN_ON_ONCE check?
> > >
> > > I meant we can check it in API of io_provide_group_kbuf() since the group
> > > buffer is filled by driver, since then the buffer is immutable, and we
> > > needn't any other check.
> >
> > That's be a buggy provider, so sounds like WARN_ON_ONCE
>
> Not at all.
>
> If the driver provides bad buffer, all group leader and members OP will be
> failed, and userspace can get notified.
>
> >
> > ...
> > > > > > > if (unlikely(ret < 0))
> > > > > > > @@ -593,6 +600,15 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
> > > > > > > if (issue_flags & IO_URING_F_NONBLOCK)
> > > > > > > flags |= MSG_DONTWAIT;
> > > > > > > + if (req->flags & REQ_F_GROUP_KBUF) {
> > > > > >
> > > > > > Does anything prevent the request to be marked by both
> > > > > > GROUP_KBUF and BUFFER_SELECT? In which case we'd set up
> > > > > > a group kbuf and then go to the io_do_buffer_select()
> > > > > > overriding all of that
> > > > >
> > > > > It could be used in this way, and we can fail the member in
> > > > > io_queue_group_members().
> > > >
> > > > That's where the opdef flag could actually be useful,
> > > >
> > > > if (opdef[member]->accept_group_kbuf &&
> > > > member->flags & SELECT_BUF)
> > > > fail;
> > > >
> > > >
> > > > > > > + ret = io_import_group_kbuf(req,
> > > > > > > + user_ptr_to_u64(sr->buf),
> > > > > > > + sr->len, ITER_SOURCE,
> > > > > > > + &kmsg->msg.msg_iter);
> > > > > > > + if (unlikely(ret))
> > > > > > > + return ret;
> > > > > > > + }
> > > > > > > +
> > > > > > > retry_bundle:
> > > > > > > if (io_do_buffer_select(req)) {
> > > > > > > struct buf_sel_arg arg = {
> > > > > > > @@ -1154,6 +1170,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
> > > > > > > goto out_free;
> > > > > > > }
> > > > > > > sr->buf = NULL;
> > > > > > > + } else if (req->flags & REQ_F_GROUP_KBUF) {
> > > > > >
> > > > > > What happens if we get a short read/recv?
> > > > >
> > > > > For short read/recv, any progress is stored in iterator, nothing to do
> > > > > with the provide buffer, which is immutable.
> > > > >
> > > > > One problem for read is reissue, but it can be handled by saving iter
> > > > > state after the group buffer is imported, I will fix it in next version.
> > > > > For net recv, offset/len of buffer is updated in case of short recv, so
> > > > > it works as expected.
> > > >
> > > > That was one of my worries.
> > > >
> > > > > Or any other issue with short read/recv? Can you explain in detail?
> > > >
> > > > To sum up design wise, when members that are using the buffer as a
> > > > source, e.g. write/send, fail, the user is expected to usually reissue
> > > > both the write and the ublk cmd.
> > > >
> > > > Let's say you have a ublk leader command providing a 4K buffer, and
> > > > you group it with a 4K send using the buffer. Let's assume the send
> > > > is short and does't only 2K of data. Then the user would normally
> > > > reissue:
> > > >
> > > > ublk(4K, GROUP), send(off=2K)
> > > >
> > > > That's fine assuming short IO is rare.
> > > >
> > > > I worry more about the backward flow, ublk provides an "empty" buffer
> > > > to receive/read into. ublk wants to do something with the buffer in
> > > > the callback. What happens when read/recv is short (and cannot be
> > > > retried by io_uring)?
> > > >
> > > > 1. ublk(provide empty 4K buffer)
> > > > 2. recv, ret=2K
> > > > 3. ->grp_kbuf_ack: ublk should commit back only 2K
> > > > of data and not assume it's 4K
> > >
> > > ->grp_kbuf_ack is supposed to only return back the buffer to the
> > > owner, and it doesn't care result of buffer consumption.
> > >
> > > When ->grp_kbuf_ack() is done, it means this time buffer borrow is
> > > over.
> > >
> > > When userspace figures out it is one short read, it will send one
> > > ublk uring_cmd to notify that this io command is completed with
> > > result(2k). ublk driver may decide to requeue this io command for
> > > retrying the remained bytes, when only remained part of the buffer
> > > is allowed to borrow in following provide uring command originated
> > > from userspace.
> >
> > My apologies, I failed to notice that moment, even though should've
> > given it some thinking at the very beginning. I think that part would
> > be a terrible interface. Might be good enough for ublk, but we can't
> > be creating a ublk specific features that change the entire io_uring.
> > Without knowing how much data it actually got, in generic case you
>
> You do know how much data actually got in the member OP, don't you?
>
> > 1) need to require the buffer to be fully initialised / zeroed
> > before handing it.
>
> The buffer is really initialized before being provided via
> io_provide_group_kbuf(). And it is one bvec buffer, anytime the part
> is consumed, the iterator is advanced, so always initialized buffer
> is provided to consumer OP.
>
> > 2) Can't ever commit the data from the callback,
>
> What do you mean `commit`?
>
> The callback is documented clearly from beginning that it is for
> returning back the buffer to the owner.
>
> Only member OPs consume buffer, and group leader provides valid buffer
> for member OP, and the buffer lifetime is aligned with group leader
> request.
>
> > but it would need to wait until the userspace reacts. Yes, it
> > works in the specific context of ublk, but I don't think it works
> > as a generic interface.
>
> It is just how ublk uses group buffer, but not necessary to be exactly
> this way.
>
> Anytime the buffer is provided via io_provide_group_kbuf() successfully,
> the member OPs can consume it safely, and finally the buffer is returned
> back if all member OPs are completed. That is all.
Forgot to mention:
The same buffer can be provided multiple times while it is valid, and an
offset can easily be added (not done yet in this patchset) to the provide
buffer uring command, so the buffer can be advanced on the provider side
in case of a short recv.
> Please explain why it isn't generic interface.
Thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-11 3:07 ` Ming Lei
@ 2024-10-11 13:24 ` Jens Axboe
2024-10-11 14:20 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2024-10-11 13:24 UTC (permalink / raw)
To: Ming Lei; +Cc: Pavel Begunkov, io-uring, linux-block
On 10/10/24 9:07 PM, Ming Lei wrote:
> On Thu, Oct 10, 2024 at 08:39:12PM -0600, Jens Axboe wrote:
>> On 10/10/24 8:30 PM, Ming Lei wrote:
>>> Hi Jens,
>>>
>>> On Thu, Oct 10, 2024 at 01:31:21PM -0600, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> Discussed this with Pavel, and on his suggestion, I tried prototyping a
>>>> "buffer update" opcode. Basically it works like
>>>> IORING_REGISTER_BUFFERS_UPDATE in that it can update an existing buffer
>>>> registration. But it works as an sqe rather than being a sync opcode.
>>>>
>>>> The idea here is that you could do that upfront, or as part of a chain,
>>>> and have it be generically available, just like any other buffer that
>>>> was registered upfront. You do need an empty table registered first,
>>>> which can just be sparse. And since you can pick the slot it goes into,
>>>> you can rely on that slot afterwards (either as a link, or just the
>>>> following sqe).
>>>>
>>>> Quick'n dirty obviously, but I did write a quick test case too to verify
>>>> that:
>>>>
>>>> 1) It actually works (it seems to)
>>>
>>> It doesn't work for ublk zc since ublk needs to provide one kernel buffer
>>> for fs rw & net send/recv to consume, and the kernel buffer is invisible
>>> to userspace. But __io_register_rsrc_update() only can register userspace
>>> buffer.
>>
>> I'd be surprised if this simple one was enough! In terms of user vs
>> kernel buffer, you could certainly use the same mechanism, and just
>> ensure that buffers are tagged appropriately. I need to think about that
>> a little bit.
>
> It is actually same with IORING_OP_PROVIDE_BUFFERS, so the following
> consumer OPs have to wait until this OP_BUF_UPDATE is completed.
See below for the registered vs provided buffer mix-up that seems to
be the source of confusion here.
> Suppose we have N consumers OPs which depends on OP_BUF_UPDATE.
>
> 1) all N OPs are linked with OP_BUF_UPDATE
>
> Or
>
> 2) submit OP_BUF_UPDATE first, and wait its completion, then submit N
> OPs concurrently.
Correct
> But 1) and 2) may slow the IO handing. In 1) all N OPs are serialized,
> and 1 extra syscall is introduced in 2).
Yes you don't want to do #1. But the OP_BUF_UPDATE is cheap enough that
you can just do it upfront. It's not ideal in terms of usage, and I get
where the grouping comes from. But is it possible to do the grouping in
a less intrusive fashion with OP_BUF_UPDATE? Because it won't change any
of the other ops in terms of buffer consumption, they'd just need fixed
buffer support and you'd flag the buffer index in sqe->buf_index. And
the nice thing about that is that while fixed/registered buffers aren't
really used on the networking side yet (as they don't bring any benefit
yet), adding support for them could potentially be useful down the line
anyway.
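Concretely, that is just the existing fixed-buffer consumption pattern (a
sketch; slot 0 is assumed to have been populated beforehand by a
registration or a buffer update):

#include <liburing.h>

/* Sketch: consume registered-buffer slot 0 via the normal fixed-buffer path. */
static void queue_fixed_read(struct io_uring *ring, int fd,
			     void *buf, unsigned int len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	/* buf must lie inside the buffer registered at index 0 */
	io_uring_prep_read_fixed(sqe, fd, buf, len, 0, 0 /* buf_index */);
	sqe->user_data = 42;
}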
> The same thing exists in the next OP_BUF_UPDATE which has to wait until
> all the previous buffer consumers are done. So the same slow thing are
> doubled. Not mention the application will become more complicated.
It does not, you can do an update on a buffer that's already inflight.
> Here the provided buffer is only visible among the N OPs wide, and making
> it global isn't necessary, and slow things down. And has kbuf lifetime
> issue.
I was worried about it being too slow too, but the basic testing seems
like it's fine. Yes with updates inflight it'll make it a tad bit
slower, but really should not be a concern. I'd argue that even doing
the most basic thing, which would be:
1) Submit OP_BUF_UPDATE, get completion
2) Do the rest of the ops
would be totally fine in terms of performance. OP_BUF_UPDATE will
_always_ complete immediately and inline, which means that it'll
_always_ be immediately available post submission. The only thing you'd
ever have to worry about in terms of failure is a badly formed request,
which is a programming issue, or running out of memory on the host.
> Also it makes error handling more complicated, io_uring has to remove
> the kernel buffer when the current task is exit, dependency or order with
> buffer provider is introduced.
Why would that be? They belong to the ring, so should be torn down as
part of the ring anyway? Why would they be task-private, but not
ring-private?
>> There are certainly many different ways that can get propagated which
>> would not entail a complicated mechanism. I really like the aspect of
>> having the identifier being the same thing that we already use, and
>> hence not needing to be something new on the side.
>>
>>> Also multiple OPs may consume the buffer concurrently, which can't be
>>> supported by buffer select.
>>
>> Why not? You can certainly have multiple ops using the same registered
>> buffer concurrently right now.
>
> Please see the above problem.
>
> Also I remember that the selected buffer is removed from buffer list,
> see io_provided_buffer_select(), but maybe I am wrong.
You're mixing up provided and registered buffers. Provided buffers are
ones that the application gives to the kernel, and the kernel grabs and
consumes them. Then the application replenishes, repeat.
Registered buffers are entirely different: those are registered with the
kernel and we can do things like pre-gup the pages so we don't have to
do that for every IO. They are entirely persistent, and any number of ops
can keep using them, concurrently. They don't get consumed by an IO like
provided buffers do; they remain in place until they get unregistered (or
updated, like my patch) at some point.
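In liburing terms the two look like this (a sketch, with illustrative
arguments):

#include <liburing.h>

/* Registered buffers: persistent, addressed by index, usable by many ops
 * concurrently until unregistered or updated. */
static int setup_registered(struct io_uring *ring, struct iovec *iovs, int nr)
{
	return io_uring_register_buffers(ring, iovs, nr);
	/* consumers: io_uring_prep_read_fixed(..., buf_index), any number of times */
}

/* Provided buffers: handed to the kernel, picked and consumed one per IO,
 * then replenished by the application. */
static void setup_provided(struct io_uring *ring, void *base, int buf_len,
			   int nr, int bgid)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_provide_buffers(sqe, base, buf_len, nr, bgid, 0);
	/* consumers set IOSQE_BUFFER_SELECT and sqe->buf_group = bgid */
}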
--
Jens Axboe
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-11 13:24 ` Jens Axboe
@ 2024-10-11 14:20 ` Ming Lei
2024-10-11 14:41 ` Jens Axboe
0 siblings, 1 reply; 47+ messages in thread
From: Ming Lei @ 2024-10-11 14:20 UTC (permalink / raw)
To: Jens Axboe; +Cc: Pavel Begunkov, io-uring, linux-block
On Fri, Oct 11, 2024 at 07:24:27AM -0600, Jens Axboe wrote:
> On 10/10/24 9:07 PM, Ming Lei wrote:
> > On Thu, Oct 10, 2024 at 08:39:12PM -0600, Jens Axboe wrote:
> >> On 10/10/24 8:30 PM, Ming Lei wrote:
> >>> Hi Jens,
> >>>
> >>> On Thu, Oct 10, 2024 at 01:31:21PM -0600, Jens Axboe wrote:
> >>>> Hi,
> >>>>
> >>>> Discussed this with Pavel, and on his suggestion, I tried prototyping a
> >>>> "buffer update" opcode. Basically it works like
> >>>> IORING_REGISTER_BUFFERS_UPDATE in that it can update an existing buffer
> >>>> registration. But it works as an sqe rather than being a sync opcode.
> >>>>
> >>>> The idea here is that you could do that upfront, or as part of a chain,
> >>>> and have it be generically available, just like any other buffer that
> >>>> was registered upfront. You do need an empty table registered first,
> >>>> which can just be sparse. And since you can pick the slot it goes into,
> >>>> you can rely on that slot afterwards (either as a link, or just the
> >>>> following sqe).
> >>>>
> >>>> Quick'n dirty obviously, but I did write a quick test case too to verify
> >>>> that:
> >>>>
> >>>> 1) It actually works (it seems to)
> >>>
> >>> It doesn't work for ublk zc since ublk needs to provide one kernel buffer
> >>> for fs rw & net send/recv to consume, and the kernel buffer is invisible
> >>> to userspace. But __io_register_rsrc_update() only can register userspace
> >>> buffer.
> >>
> >> I'd be surprised if this simple one was enough! In terms of user vs
> >> kernel buffer, you could certainly use the same mechanism, and just
> >> ensure that buffers are tagged appropriately. I need to think about that
> >> a little bit.
> >
> > It is actually same with IORING_OP_PROVIDE_BUFFERS, so the following
> > consumer OPs have to wait until this OP_BUF_UPDATE is completed.
>
> See below for the registered vs provided buffer confusion that seems to
> be a confusion issue here.
>
> > Suppose we have N consumers OPs which depends on OP_BUF_UPDATE.
> >
> > 1) all N OPs are linked with OP_BUF_UPDATE
> >
> > Or
> >
> > 2) submit OP_BUF_UPDATE first, and wait its completion, then submit N
> > OPs concurrently.
>
> Correct
>
> > But 1) and 2) may slow the IO handing. In 1) all N OPs are serialized,
> > and 1 extra syscall is introduced in 2).
>
> Yes you don't want do do #1. But the OP_BUF_UPDATE is cheap enough that
> you can just do it upfront. It's not ideal in terms of usage, and I get
> where the grouping comes from. But is it possible to do the grouping in
> a less intrusive fashion with OP_BUF_UPDATE? Because it won't change any
Most of the 'intrusive' change is just in patch 4, and Pavel has commented
that it is good enough:
https://lore.kernel.org/linux-block/ZwZzsPcXyazyeZnu@fedora/T/#m551e94f080b80ccbd2561e01da5ea8e17f7ee15d
> of the other ops in terms of buffer consumption, they'd just need fixed
> buffer support and you'd flag the buffer index in sqe->buf_index. And
> the nice thing about that is that while fixed/registered buffers aren't
> really used on the networking side yet (as they don't bring any benefit
> yet), adding support for them could potentially be useful down the line
> anyway.
With 2), two extra syscalls are added for each ublk IO: one to provide the
buffer, another to remove it. The two syscalls have to be synchronized with
the consumer OPs.
I can understand the concern, but if the change can't improve perf, or even
slows things down, it loses its value.
>
> > The same thing exists in the next OP_BUF_UPDATE which has to wait until
> > all the previous buffer consumers are done. So the same slow thing are
> > doubled. Not mention the application will become more complicated.
>
> It does not, you can do an update on a buffer that's already inflight.
UPDATE may not match the case; actually two OPs are needed, one to
provide the buffer and the other to remove it, and both have to deal
with the other subsystem (ublk). The remove-buffer OP needs to run
immediately after all consumer OPs are done.
I guess you mean the buffer is reference-counted, but what if the remove
buffer OP runs before any consumer OP? The ordering has to be enforced.
That is why I mention that two syscalls are added.
>
> > Here the provided buffer is only visible among the N OPs wide, and making
> > it global isn't necessary, and slow things down. And has kbuf lifetime
> > issue.
>
> I was worried about it being too slow too, but the basic testing seems
> like it's fine. Yes with updates inflight it'll make it a tad bit
> slower, but really should not be a concern. I'd argue that even doing
> the very basic of things, which would be:
>
> 1) Submit OP_BUF_UPDATE, get completion
> 2) Do the rest of the ops
The above adds one syscall for each ublk IO, and the following remove
buffer adds another syscall.
Not only does it slow things down, it also makes the application more
complicated, since two wait points are added.
>
> would be totally fine in terms of performance. OP_BUF_UPDATE will
> _always_ completely immediately and inline, which means that it'll
> _always_ be immediately available post submission. The only think you'd
> ever have to worry about in terms of failure is a badly formed request,
> which is a programming issue, or running out of memory on the host.
>
> > Also it makes error handling more complicated, io_uring has to remove
> > the kernel buffer when the current task is exit, dependency or order with
> > buffer provider is introduced.
>
> Why would that be? They belong to the ring, so should be torn down as
> part of the ring anyway? Why would they be task-private, but not
> ring-private?
It is a kernel buffer, which belongs to the provider (such as ublk) instead
of the ring; the application may panic at any time, and then io_uring has to
remove the buffer to notify the buffer owner.
Conceptually, grouping is simpler because:
- buffer lifetime is aligned with the group leader lifetime, so we needn't
worry about a buffer leak caused by an accidental application exit
- the buffer is lent to consumer OPs, and returned back after all
consumers are done, which avoids any dependency (see the sketch below)
Meanwhile OP_BUF_UPDATE (provide buffer OP, remove buffer OP) becomes more
complicated:
- buffer leak because of app panic
- buffer dependency issue: consumer OPs depend on the provide buffer OP,
and the remove buffer OP depends on the consumer OPs; two syscalls have to
be added for handling a single ublk IO.
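For comparison, the grouped flow would look roughly like this from userspace
(a sketch: the sqe group flag, IORING_PROVIDE_GROUP_KBUF and the ublk command
details all come from this series and are not a stable API; the exact flag
placement and how a member addresses the borrowed buffer follow the series'
rules and are only approximated here):

#include <string.h>
#include <liburing.h>

/* Sketch: a ublk uring_cmd leads the group and provides the kernel buffer,
 * a write member borrows it, and everything goes out in a single submit
 * with no extra wait points and no separate remove-buffer op. */
static int queue_zc_group(struct io_uring *ring, int ublk_fd, int target_fd,
			  unsigned int len)
{
	struct io_uring_sqe *lead, *member;

	lead = io_uring_get_sqe(ring);
	memset(lead, 0, sizeof(*lead));
	lead->opcode = IORING_OP_URING_CMD;
	lead->fd = ublk_fd;
	lead->cmd_op = 0;			/* driver-specific ublk command goes here */
	lead->uring_cmd_flags = IORING_PROVIDE_GROUP_KBUF;	/* patch 7 */
	lead->flags |= IOSQE_SQE_GROUP;		/* group leader flag from this series */

	member = io_uring_get_sqe(ring);
	/* the member borrows the leader's kbuf; addr/len are interpreted
	 * against that buffer per the series */
	io_uring_prep_write(member, target_fd, NULL, len, 0);

	/* the buffer is returned via the leader's ->grp_kbuf_ack once all
	 * members complete */
	return io_uring_submit(ring);
}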
>
> >> There are certainly many different ways that can get propagated which
> >> would not entail a complicated mechanism. I really like the aspect of
> >> having the identifier being the same thing that we already use, and
> >> hence not needing to be something new on the side.
> >>
> >>> Also multiple OPs may consume the buffer concurrently, which can't be
> >>> supported by buffer select.
> >>
> >> Why not? You can certainly have multiple ops using the same registered
> >> buffer concurrently right now.
> >
> > Please see the above problem.
> >
> > Also I remember that the selected buffer is removed from buffer list,
> > see io_provided_buffer_select(), but maybe I am wrong.
>
> You're mixing up provided and registered buffers. Provided buffers are
> ones that the applications gives to the kernel, and the kernel grabs and
> consumes them. Then the application replenishes, repeat.
>
> Registered buffers are entirely different, those are registered with the
> kernel and we can do things like pre-gup the pages so we don't have to
> do them for every IO. They are entirely persistent, any multiple ops can
> keep using them, concurrently. They don't get consumed by an IO like
> provided buffers, they remain in place until they get unregistered (or
> updated, like my patch) at some point.
I know the difference.
The thing is that here we can't register the kernel buffer in ->prep(),
and it has to be provided in ->issue() of the uring command. That is similar
to a provided buffer.
Thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-11 14:20 ` Ming Lei
@ 2024-10-11 14:41 ` Jens Axboe
2024-10-11 15:45 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2024-10-11 14:41 UTC (permalink / raw)
To: Ming Lei; +Cc: Pavel Begunkov, io-uring, linux-block
On 10/11/24 8:20 AM, Ming Lei wrote:
> On Fri, Oct 11, 2024 at 07:24:27AM -0600, Jens Axboe wrote:
>> On 10/10/24 9:07 PM, Ming Lei wrote:
>>> On Thu, Oct 10, 2024 at 08:39:12PM -0600, Jens Axboe wrote:
>>>> On 10/10/24 8:30 PM, Ming Lei wrote:
>>>>> Hi Jens,
>>>>>
>>>>> On Thu, Oct 10, 2024 at 01:31:21PM -0600, Jens Axboe wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Discussed this with Pavel, and on his suggestion, I tried prototyping a
>>>>>> "buffer update" opcode. Basically it works like
>>>>>> IORING_REGISTER_BUFFERS_UPDATE in that it can update an existing buffer
>>>>>> registration. But it works as an sqe rather than being a sync opcode.
>>>>>>
>>>>>> The idea here is that you could do that upfront, or as part of a chain,
>>>>>> and have it be generically available, just like any other buffer that
>>>>>> was registered upfront. You do need an empty table registered first,
>>>>>> which can just be sparse. And since you can pick the slot it goes into,
>>>>>> you can rely on that slot afterwards (either as a link, or just the
>>>>>> following sqe).
>>>>>>
>>>>>> Quick'n dirty obviously, but I did write a quick test case too to verify
>>>>>> that:
>>>>>>
>>>>>> 1) It actually works (it seems to)
>>>>>
>>>>> It doesn't work for ublk zc since ublk needs to provide one kernel buffer
>>>>> for fs rw & net send/recv to consume, and the kernel buffer is invisible
>>>>> to userspace. But __io_register_rsrc_update() only can register userspace
>>>>> buffer.
>>>>
>>>> I'd be surprised if this simple one was enough! In terms of user vs
>>>> kernel buffer, you could certainly use the same mechanism, and just
>>>> ensure that buffers are tagged appropriately. I need to think about that
>>>> a little bit.
>>>
>>> It is actually same with IORING_OP_PROVIDE_BUFFERS, so the following
>>> consumer OPs have to wait until this OP_BUF_UPDATE is completed.
>>
>> See below for the registered vs provided buffer confusion that seems to
>> be a confusion issue here.
>>
>>> Suppose we have N consumers OPs which depends on OP_BUF_UPDATE.
>>>
>>> 1) all N OPs are linked with OP_BUF_UPDATE
>>>
>>> Or
>>>
>>> 2) submit OP_BUF_UPDATE first, and wait its completion, then submit N
>>> OPs concurrently.
>>
>> Correct
>>
>>> But 1) and 2) may slow the IO handing. In 1) all N OPs are serialized,
>>> and 1 extra syscall is introduced in 2).
>>
>> Yes you don't want do do #1. But the OP_BUF_UPDATE is cheap enough that
>> you can just do it upfront. It's not ideal in terms of usage, and I get
>> where the grouping comes from. But is it possible to do the grouping in
>> a less intrusive fashion with OP_BUF_UPDATE? Because it won't change any
>
> The most of 'intrusive' change is just on patch 4, and Pavel has commented
> that it is good enough:
>
> https://lore.kernel.org/linux-block/ZwZzsPcXyazyeZnu@fedora/T/#m551e94f080b80ccbd2561e01da5ea8e17f7ee15d
At least for me, patch 4 looks fine. The problem occurs when you start
needing to support this different buffer type, which is in patch 6. I'm
not saying we can necessarily solve this with OP_BUF_UPDATE, I just want
to explore that path because if we can, then patch 6 turns into "oh
let's just add registered/fixed buffer support to these ops that don't
currently support it". And that would be much nicer indeed.
>> of the other ops in terms of buffer consumption, they'd just need fixed
>> buffer support and you'd flag the buffer index in sqe->buf_index. And
>> the nice thing about that is that while fixed/registered buffers aren't
>> really used on the networking side yet (as they don't bring any benefit
>> yet), adding support for them could potentially be useful down the line
>> anyway.
>
> With 2), two extra syscalls are added for each ublk IO, one is provide
> buffer, another is remove buffer. The two syscalls have to be sync with
> consumer OPs.
>
> I can understand the concern, but if the change can't improve perf or
> even slow things done, it loses its value.
It'd be one extra syscall, as the remove can get bundled with the next
add. But your point still stands, yes it will add extra overhead,
albeit a pretty darn minimal amount. I'm actually more concerned with the
complexity of handling it. While the OP_BUF_UPDATE will always
complete immediately, there's no guarantee it's the next cqe you pull
out when peeking post submission.
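For illustration, that just means the completion has to be matched by
user_data rather than assumed to be the first CQE reaped. A minimal sketch,
where the sqe is assumed to already carry the (prototype) buffer update and
BUF_UPDATE_TAG/handle_other_completion() are made-up names:

	/* tag the buffer-update sqe so its completion can be identified */
	io_uring_sqe_set_data64(sqe, BUF_UPDATE_TAG);
	io_uring_submit(&ring);

	struct io_uring_cqe *cqe;
	while (io_uring_peek_cqe(&ring, &cqe) == 0) {
		if (io_uring_cqe_get_data64(cqe) == BUF_UPDATE_TAG) {
			/* slot is now valid, dependent ops may be submitted */
		} else {
			/* e.g. completion of earlier inflight IO */
			handle_other_completion(cqe);
		}
		io_uring_cqe_seen(&ring, cqe);
	}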
>>> The same thing exists in the next OP_BUF_UPDATE which has to wait until
>>> all the previous buffer consumers are done. So the same slow thing are
>>> doubled. Not mention the application will become more complicated.
>>
>> It does not, you can do an update on a buffer that's already inflight.
>
> UPDATE may not match the case, actually two OPs are needed, one is
> provide buffer OP and the other is remove buffer OP, both have to deal
> with the other subsystem(ublk). Remove buffer needs to be done after all
> consumer OPs are done immediately.
You don't necessarily need the remove. If you always just use the same
slot for these, then the OP_BUF_UPDATE will just update the current
location.
> I guess you mean the buffer is reference-counted, but what if the remove
> buffer OP is run before any consumer OP? The order has to be enhanced.
>
> That is why I mention two syscalls are added.
See above, you can just update in place, and if you do want a remove, it
can get bundled with the next one. But it would be pointless to remove
only to then update right after; a single update would suffice.
>>> Here the provided buffer is only visible among the N OPs wide, and making
>>> it global isn't necessary, and slow things down. And has kbuf lifetime
>>> issue.
>>
>> I was worried about it being too slow too, but the basic testing seems
>> like it's fine. Yes with updates inflight it'll make it a tad bit
>> slower, but really should not be a concern. I'd argue that even doing
>> the very basic of things, which would be:
>>
>> 1) Submit OP_BUF_UPDATE, get completion
>> 2) Do the rest of the ops
>
> The above adds one syscall for each ublk IO, and the following Remove
> buffer adds another syscall.
>
> Not only it slows thing down, but also makes application more
> complicated, cause two wait points are added.
I don't think the extra overhead would be noticeable, but the
extra complication is the main issue here.
>> would be totally fine in terms of performance. OP_BUF_UPDATE will
>> _always_ completely immediately and inline, which means that it'll
>> _always_ be immediately available post submission. The only think you'd
>> ever have to worry about in terms of failure is a badly formed request,
>> which is a programming issue, or running out of memory on the host.
>>
>>> Also it makes error handling more complicated, io_uring has to remove
>>> the kernel buffer when the current task is exit, dependency or order with
>>> buffer provider is introduced.
>>
>> Why would that be? They belong to the ring, so should be torn down as
>> part of the ring anyway? Why would they be task-private, but not
>> ring-private?
>
> It is kernel buffer, which belongs to provider(such as ublk) instead
> of uring, application may panic any time, then io_uring has to remove
> the buffer for notifying the buffer owner.
But it could be an application buffer, no? You'd just need the
application to provide it to ublk and have it mapped, rather than have
ublk allocate it in-kernel and then use that.
> In concept grouping is simpler because:
>
> - buffer lifetime is aligned with group leader lifetime, so we needn't
> worry buffer leak because of application accidental exit
But if it was an application buffer, that would not be a concern.
> - the buffer is borrowed to consumer OPs, and returned back after all
> consumers are done, this way avoids any dependency
>
> Meantime OP_BUF_UPDATE(provide buffer OP, remove buffer OP) becomes more
> complicated:
>
> - buffer leak because of app panic
> - buffer dependency issue: consumer OPs depend on provide buffer OP,
> remove buffer OP depends on consumer OPs; two syscalls has to be
> added for handling single ublk IO.
Seems like most of this is because of the kernel buffer too, no?
I do like the concept of the ephemeral buffer, the downside is that we
need per-op support for it too. And while I'm not totally against doing
that, it would be lovely if we could utilize an existing mechanism for
that rather than add another one.
>>>> There are certainly many different ways that can get propagated which
>>>> would not entail a complicated mechanism. I really like the aspect of
>>>> having the identifier being the same thing that we already use, and
>>>> hence not needing to be something new on the side.
>>>>
>>>>> Also multiple OPs may consume the buffer concurrently, which can't be
>>>>> supported by buffer select.
>>>>
>>>> Why not? You can certainly have multiple ops using the same registered
>>>> buffer concurrently right now.
>>>
>>> Please see the above problem.
>>>
>>> Also I remember that the selected buffer is removed from buffer list,
>>> see io_provided_buffer_select(), but maybe I am wrong.
>>
>> You're mixing up provided and registered buffers. Provided buffers are
>> ones that the applications gives to the kernel, and the kernel grabs and
>> consumes them. Then the application replenishes, repeat.
>>
>> Registered buffers are entirely different, those are registered with the
>> kernel and we can do things like pre-gup the pages so we don't have to
>> do them for every IO. They are entirely persistent, any multiple ops can
>> keep using them, concurrently. They don't get consumed by an IO like
>> provided buffers, they remain in place until they get unregistered (or
>> updated, like my patch) at some point.
>
> I know the difference.
But io_provided_buffer_select() has nothing to do with registered/fixed
buffers or this use case, the above "remove from buffer list" is an
entirely different buffer concept. So there's some confusion here, just
wanted to make that clear.
> The thing is that here we can't register the kernel buffer in ->prep(),
> and it has to be provided in ->issue() of uring command. That is similar
> with provided buffer.
What's preventing it from registering it in ->prep()? It would be a bit
odd, but there would be nothing preventing it codewise, outside of the
oddity of ->prep() not being idempotent at that point. Don't follow why
that would be necessary, though, can you expand?
--
Jens Axboe
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-11 14:41 ` Jens Axboe
@ 2024-10-11 15:45 ` Ming Lei
2024-10-11 16:49 ` Jens Axboe
2024-10-14 18:40 ` Pavel Begunkov
0 siblings, 2 replies; 47+ messages in thread
From: Ming Lei @ 2024-10-11 15:45 UTC (permalink / raw)
To: Jens Axboe; +Cc: Pavel Begunkov, io-uring, linux-block, ming.lei
On Fri, Oct 11, 2024 at 08:41:03AM -0600, Jens Axboe wrote:
> On 10/11/24 8:20 AM, Ming Lei wrote:
> > On Fri, Oct 11, 2024 at 07:24:27AM -0600, Jens Axboe wrote:
> >> On 10/10/24 9:07 PM, Ming Lei wrote:
> >>> On Thu, Oct 10, 2024 at 08:39:12PM -0600, Jens Axboe wrote:
> >>>> On 10/10/24 8:30 PM, Ming Lei wrote:
> >>>>> Hi Jens,
> >>>>>
> >>>>> On Thu, Oct 10, 2024 at 01:31:21PM -0600, Jens Axboe wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Discussed this with Pavel, and on his suggestion, I tried prototyping a
> >>>>>> "buffer update" opcode. Basically it works like
> >>>>>> IORING_REGISTER_BUFFERS_UPDATE in that it can update an existing buffer
> >>>>>> registration. But it works as an sqe rather than being a sync opcode.
> >>>>>>
> >>>>>> The idea here is that you could do that upfront, or as part of a chain,
> >>>>>> and have it be generically available, just like any other buffer that
> >>>>>> was registered upfront. You do need an empty table registered first,
> >>>>>> which can just be sparse. And since you can pick the slot it goes into,
> >>>>>> you can rely on that slot afterwards (either as a link, or just the
> >>>>>> following sqe).
> >>>>>>
> >>>>>> Quick'n dirty obviously, but I did write a quick test case too to verify
> >>>>>> that:
> >>>>>>
> >>>>>> 1) It actually works (it seems to)
> >>>>>
> >>>>> It doesn't work for ublk zc since ublk needs to provide one kernel buffer
> >>>>> for fs rw & net send/recv to consume, and the kernel buffer is invisible
> >>>>> to userspace. But __io_register_rsrc_update() only can register userspace
> >>>>> buffer.
> >>>>
> >>>> I'd be surprised if this simple one was enough! In terms of user vs
> >>>> kernel buffer, you could certainly use the same mechanism, and just
> >>>> ensure that buffers are tagged appropriately. I need to think about that
> >>>> a little bit.
> >>>
> >>> It is actually same with IORING_OP_PROVIDE_BUFFERS, so the following
> >>> consumer OPs have to wait until this OP_BUF_UPDATE is completed.
> >>
> >> See below for the registered vs provided buffer confusion that seems to
> >> be a confusion issue here.
> >>
> >>> Suppose we have N consumers OPs which depends on OP_BUF_UPDATE.
> >>>
> >>> 1) all N OPs are linked with OP_BUF_UPDATE
> >>>
> >>> Or
> >>>
> >>> 2) submit OP_BUF_UPDATE first, and wait its completion, then submit N
> >>> OPs concurrently.
> >>
> >> Correct
> >>
> >>> But 1) and 2) may slow the IO handing. In 1) all N OPs are serialized,
> >>> and 1 extra syscall is introduced in 2).
> >>
> >> Yes you don't want do do #1. But the OP_BUF_UPDATE is cheap enough that
> >> you can just do it upfront. It's not ideal in terms of usage, and I get
> >> where the grouping comes from. But is it possible to do the grouping in
> >> a less intrusive fashion with OP_BUF_UPDATE? Because it won't change any
> >
> > The most of 'intrusive' change is just on patch 4, and Pavel has commented
> > that it is good enough:
> >
> > https://lore.kernel.org/linux-block/ZwZzsPcXyazyeZnu@fedora/T/#m551e94f080b80ccbd2561e01da5ea8e17f7ee15d
>
> At least for me, patch 4 looks fine. The problem occurs when you start
> needing to support this different buffer type, which is in patch 6. I'm
> not saying we can necessarily solve this with OP_BUF_UPDATE, I just want
> to explore that path because if we can, then patch 6 turns into "oh
> let's just added registered/fixed buffer support to these ops that don't
> currently support it". And that would be much nicer indeed.
OK, in my local V7 the buffer type is actually aligned with
BUFFER_SELECT from both the interface and usage viewpoints, since member
SQEs have three unused flag bits available.
I will post V7 for review.
>
>
> >> of the other ops in terms of buffer consumption, they'd just need fixed
> >> buffer support and you'd flag the buffer index in sqe->buf_index. And
> >> the nice thing about that is that while fixed/registered buffers aren't
> >> really used on the networking side yet (as they don't bring any benefit
> >> yet), adding support for them could potentially be useful down the line
> >> anyway.
> >
> > With 2), two extra syscalls are added for each ublk IO, one is provide
> > buffer, another is remove buffer. The two syscalls have to be sync with
> > consumer OPs.
> >
> > I can understand the concern, but if the change can't improve perf or
> > even slow things done, it loses its value.
>
> It'd be one extra syscall, as the remove can get bundled with the next
> add. But your point still stands, yes it will add extra overhead,
It can't be bundled.
Also, the kernel buffer consists of blk-mq's request pages, which are per tag.
For example, for a ublk target, an IO arrives on tag 0; after this IO (tag 0)
is handled, how can we know whether a new IO arrives on tag 0 immediately? :-)
> although be it pretty darn minimal. I'm actually more concerned with the
> complexity for handling it. While the OP_BUF_UPDATE will always
> complete immediately, there's no guarantee it's the next cqe you pull
> out when peeking post submission.
>
> >>> The same thing exists in the next OP_BUF_UPDATE which has to wait until
> >>> all the previous buffer consumers are done. So the same slow thing are
> >>> doubled. Not mention the application will become more complicated.
> >>
> >> It does not, you can do an update on a buffer that's already inflight.
> >
> > UPDATE may not match the case, actually two OPs are needed, one is
> > provide buffer OP and the other is remove buffer OP, both have to deal
> > with the other subsystem(ublk). Remove buffer needs to be done after all
> > consumer OPs are done immediately.
>
> You don't necessarily need the remove. If you always just use the same
> slot for these, then the OP_BUF_UPDATE will just update the current
> location.
The buffer is per tag, and it can't be guaranteed to be reused immediately;
otherwise it isn't zero copy any more.
>
> > I guess you mean the buffer is reference-counted, but what if the remove
> > buffer OP is run before any consumer OP? The order has to be enhanced.
> >
> > That is why I mention two syscalls are added.
>
> See above, you can just update in place, and if you do want remove, it
> can get bundled with the next one. But it would be pointless to remove
> only then to update right after, a single update would suffice.
>
> >>> Here the provided buffer is only visible among the N OPs wide, and making
> >>> it global isn't necessary, and slow things down. And has kbuf lifetime
> >>> issue.
> >>
> >> I was worried about it being too slow too, but the basic testing seems
> >> like it's fine. Yes with updates inflight it'll make it a tad bit
> >> slower, but really should not be a concern. I'd argue that even doing
> >> the very basic of things, which would be:
> >>
> >> 1) Submit OP_BUF_UPDATE, get completion
> >> 2) Do the rest of the ops
> >
> > The above adds one syscall for each ublk IO, and the following Remove
> > buffer adds another syscall.
> >
> > Not only it slows thing down, but also makes application more
> > complicated, cause two wait points are added.
>
> I don't think the extra overhead would be noticeable though, but the
> extra complication is the main issue here.
Can't agree more.
>
> >> would be totally fine in terms of performance. OP_BUF_UPDATE will
> >> _always_ completely immediately and inline, which means that it'll
> >> _always_ be immediately available post submission. The only think you'd
> >> ever have to worry about in terms of failure is a badly formed request,
> >> which is a programming issue, or running out of memory on the host.
> >>
> >>> Also it makes error handling more complicated, io_uring has to remove
> >>> the kernel buffer when the current task is exit, dependency or order with
> >>> buffer provider is introduced.
> >>
> >> Why would that be? They belong to the ring, so should be torn down as
> >> part of the ring anyway? Why would they be task-private, but not
> >> ring-private?
> >
> > It is kernel buffer, which belongs to provider(such as ublk) instead
> > of uring, application may panic any time, then io_uring has to remove
> > the buffer for notifying the buffer owner.
>
> But it could be an application buffer, no? You'd just need the
> application to provide it to ublk and have it mapped, rather than have
> ublk allocate it in-kernel and then use that.
The buffer actually consists of the kernel 'request/bio' pages of /dev/ublkbN,
which we forward and lend to io_uring OPs (fs rw, net send/recv), so it can't
be an application buffer; this is not the same as net rx.
>
> > In concept grouping is simpler because:
> >
> > - buffer lifetime is aligned with group leader lifetime, so we needn't
> > worry buffer leak because of application accidental exit
>
> But if it was an application buffer, that would not be a concern.
Yeah, but storage isn't the same as network; here an application buffer can't
support zero copy.
>
> > - the buffer is borrowed to consumer OPs, and returned back after all
> > consumers are done, this way avoids any dependency
> >
> > Meantime OP_BUF_UPDATE(provide buffer OP, remove buffer OP) becomes more
> > complicated:
> >
> > - buffer leak because of app panic
> > - buffer dependency issue: consumer OPs depend on provide buffer OP,
> > remove buffer OP depends on consumer OPs; two syscalls has to be
> > added for handling single ublk IO.
>
> Seems like most of this is because of the kernel buffer too, no?
Yeah.
>
> I do like the concept of the ephemeral buffer, the downside is that we
> need per-op support for it too. And while I'm not totally against doing
Can you explain per-op support a bit?
Right now the buffer is provided by a single uring command.
> that, it would be lovely if we could utilize and existing mechanism for
> that rather than add another one.
If the existing mechanisms could cover everything, Linux wouldn't progress
any more.
>
> >>>> There are certainly many different ways that can get propagated which
> >>>> would not entail a complicated mechanism. I really like the aspect of
> >>>> having the identifier being the same thing that we already use, and
> >>>> hence not needing to be something new on the side.
> >>>>
> >>>>> Also multiple OPs may consume the buffer concurrently, which can't be
> >>>>> supported by buffer select.
> >>>>
> >>>> Why not? You can certainly have multiple ops using the same registered
> >>>> buffer concurrently right now.
> >>>
> >>> Please see the above problem.
> >>>
> >>> Also I remember that the selected buffer is removed from buffer list,
> >>> see io_provided_buffer_select(), but maybe I am wrong.
> >>
> >> You're mixing up provided and registered buffers. Provided buffers are
> >> ones that the applications gives to the kernel, and the kernel grabs and
> >> consumes them. Then the application replenishes, repeat.
> >>
> >> Registered buffers are entirely different, those are registered with the
> >> kernel and we can do things like pre-gup the pages so we don't have to
> >> do them for every IO. They are entirely persistent, any multiple ops can
> >> keep using them, concurrently. They don't get consumed by an IO like
> >> provided buffers, they remain in place until they get unregistered (or
> >> updated, like my patch) at some point.
> >
> > I know the difference.
>
> But io_provided_buffer_select() has nothing to do with registered/fixed
> buffers or this use case, the above "remove from buffer list" is an
> entirely different buffer concept. So there's some confusion here, just
> wanted to make that clear.
>
> > The thing is that here we can't register the kernel buffer in ->prep(),
> > and it has to be provided in ->issue() of uring command. That is similar
> > with provided buffer.
>
> What's preventing it from registering it in ->prep()? It would be a bit
> odd, but there would be nothing preventing it codewise, outside of the
> oddity of ->prep() not being idempotent at that point. Don't follow why
> that would be necessary, though, can you expand?
->prep() isn't exported to uring_cmd, and we may not want to bother
drivers with it.
Also, the remove buffer still can't be done in ->prep().
Without digging in further, one big issue is that dependencies aren't
respected in ->prep().
thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-11 15:45 ` Ming Lei
@ 2024-10-11 16:49 ` Jens Axboe
2024-10-12 3:35 ` Ming Lei
2024-10-14 18:40 ` Pavel Begunkov
1 sibling, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2024-10-11 16:49 UTC (permalink / raw)
To: Ming Lei; +Cc: Pavel Begunkov, io-uring, linux-block
On 10/11/24 9:45 AM, Ming Lei wrote:
> On Fri, Oct 11, 2024 at 08:41:03AM -0600, Jens Axboe wrote:
>> On 10/11/24 8:20 AM, Ming Lei wrote:
>>> On Fri, Oct 11, 2024 at 07:24:27AM -0600, Jens Axboe wrote:
>>>> On 10/10/24 9:07 PM, Ming Lei wrote:
>>>>> On Thu, Oct 10, 2024 at 08:39:12PM -0600, Jens Axboe wrote:
>>>>>> On 10/10/24 8:30 PM, Ming Lei wrote:
>>>>>>> Hi Jens,
>>>>>>>
>>>>>>> On Thu, Oct 10, 2024 at 01:31:21PM -0600, Jens Axboe wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Discussed this with Pavel, and on his suggestion, I tried prototyping a
>>>>>>>> "buffer update" opcode. Basically it works like
>>>>>>>> IORING_REGISTER_BUFFERS_UPDATE in that it can update an existing buffer
>>>>>>>> registration. But it works as an sqe rather than being a sync opcode.
>>>>>>>>
>>>>>>>> The idea here is that you could do that upfront, or as part of a chain,
>>>>>>>> and have it be generically available, just like any other buffer that
>>>>>>>> was registered upfront. You do need an empty table registered first,
>>>>>>>> which can just be sparse. And since you can pick the slot it goes into,
>>>>>>>> you can rely on that slot afterwards (either as a link, or just the
>>>>>>>> following sqe).
>>>>>>>>
>>>>>>>> Quick'n dirty obviously, but I did write a quick test case too to verify
>>>>>>>> that:
>>>>>>>>
>>>>>>>> 1) It actually works (it seems to)
>>>>>>>
>>>>>>> It doesn't work for ublk zc since ublk needs to provide one kernel buffer
>>>>>>> for fs rw & net send/recv to consume, and the kernel buffer is invisible
>>>>>>> to userspace. But __io_register_rsrc_update() only can register userspace
>>>>>>> buffer.
>>>>>>
>>>>>> I'd be surprised if this simple one was enough! In terms of user vs
>>>>>> kernel buffer, you could certainly use the same mechanism, and just
>>>>>> ensure that buffers are tagged appropriately. I need to think about that
>>>>>> a little bit.
>>>>>
>>>>> It is actually same with IORING_OP_PROVIDE_BUFFERS, so the following
>>>>> consumer OPs have to wait until this OP_BUF_UPDATE is completed.
>>>>
>>>> See below for the registered vs provided buffer confusion that seems to
>>>> be a confusion issue here.
>>>>
>>>>> Suppose we have N consumers OPs which depends on OP_BUF_UPDATE.
>>>>>
>>>>> 1) all N OPs are linked with OP_BUF_UPDATE
>>>>>
>>>>> Or
>>>>>
>>>>> 2) submit OP_BUF_UPDATE first, and wait its completion, then submit N
>>>>> OPs concurrently.
>>>>
>>>> Correct
>>>>
>>>>> But 1) and 2) may slow the IO handing. In 1) all N OPs are serialized,
>>>>> and 1 extra syscall is introduced in 2).
>>>>
>>>> Yes you don't want do do #1. But the OP_BUF_UPDATE is cheap enough that
>>>> you can just do it upfront. It's not ideal in terms of usage, and I get
>>>> where the grouping comes from. But is it possible to do the grouping in
>>>> a less intrusive fashion with OP_BUF_UPDATE? Because it won't change any
>>>
>>> The most of 'intrusive' change is just on patch 4, and Pavel has commented
>>> that it is good enough:
>>>
>>> https://lore.kernel.org/linux-block/ZwZzsPcXyazyeZnu@fedora/T/#m551e94f080b80ccbd2561e01da5ea8e17f7ee15d
>>
>> At least for me, patch 4 looks fine. The problem occurs when you start
>> needing to support this different buffer type, which is in patch 6. I'm
>> not saying we can necessarily solve this with OP_BUF_UPDATE, I just want
>> to explore that path because if we can, then patch 6 turns into "oh
>> let's just added registered/fixed buffer support to these ops that don't
>> currently support it". And that would be much nicer indeed.
>
> OK, in my local V7, the buffer type is actually aligned with
> BUFFER_SELECT from both interface & use viewpoint, since member SQE
> have three empty flags available.
>
> I will post V7 for review.
OK, I'll take a look once posted.
>>>> of the other ops in terms of buffer consumption, they'd just need fixed
>>>> buffer support and you'd flag the buffer index in sqe->buf_index. And
>>>> the nice thing about that is that while fixed/registered buffers aren't
>>>> really used on the networking side yet (as they don't bring any benefit
>>>> yet), adding support for them could potentially be useful down the line
>>>> anyway.
>>>
>>> With 2), two extra syscalls are added for each ublk IO, one is provide
>>> buffer, another is remove buffer. The two syscalls have to be sync with
>>> consumer OPs.
>>>
>>> I can understand the concern, but if the change can't improve perf or
>>> even slow things done, it loses its value.
>>
>> It'd be one extra syscall, as the remove can get bundled with the next
>> add. But your point still stands, yes it will add extra overhead,
>
> It can't be bundled.
Don't see why not, but let's review v7 and see what comes up.
> And the kernel buffer is blk-mq's request pages, which is per tag.
>
> Such as, for ublk-target, IO comes to tag 0, after this IO(tag 0) is
> handled, how can we know if there is new IO comes to tag 0 immediately? :-)
Gotcha, yeah sounds like that needs to remain a kernel buffer.
>> although be it pretty darn minimal. I'm actually more concerned with the
>> complexity for handling it. While the OP_BUF_UPDATE will always
>> complete immediately, there's no guarantee it's the next cqe you pull
>> out when peeking post submission.
>>
>>>>> The same thing exists in the next OP_BUF_UPDATE which has to wait until
>>>>> all the previous buffer consumers are done. So the same slow thing are
>>>>> doubled. Not mention the application will become more complicated.
>>>>
>>>> It does not, you can do an update on a buffer that's already inflight.
>>>
>>> UPDATE may not match the case, actually two OPs are needed, one is
>>> provide buffer OP and the other is remove buffer OP, both have to deal
>>> with the other subsystem(ublk). Remove buffer needs to be done after all
>>> consumer OPs are done immediately.
>>
>> You don't necessarily need the remove. If you always just use the same
>> slot for these, then the OP_BUF_UPDATE will just update the current
>> location.
>
> The buffer is per tag, and can't guarantee to be reused immediately,
> otherwise it isn't zero copy any more.
Don't follow this one either. As long as reuse doesn't disturb IO that's
already inflight, it should be fine? I'm not talking about reusing the buffer,
just the slot it belongs to.
>>>> would be totally fine in terms of performance. OP_BUF_UPDATE will
>>>> _always_ completely immediately and inline, which means that it'll
>>>> _always_ be immediately available post submission. The only think you'd
>>>> ever have to worry about in terms of failure is a badly formed request,
>>>> which is a programming issue, or running out of memory on the host.
>>>>
>>>>> Also it makes error handling more complicated, io_uring has to remove
>>>>> the kernel buffer when the current task is exit, dependency or order with
>>>>> buffer provider is introduced.
>>>>
>>>> Why would that be? They belong to the ring, so should be torn down as
>>>> part of the ring anyway? Why would they be task-private, but not
>>>> ring-private?
>>>
>>> It is kernel buffer, which belongs to provider(such as ublk) instead
>>> of uring, application may panic any time, then io_uring has to remove
>>> the buffer for notifying the buffer owner.
>>
>> But it could be an application buffer, no? You'd just need the
>> application to provide it to ublk and have it mapped, rather than have
>> ublk allocate it in-kernel and then use that.
>
> The buffer is actually kernel 'request/bio' pages of /dev/ublkbN, and now we
> forward and borrow it to io_uring OPs(fs rw, net send/recv), so it can't be
> application buffer, not same with net rx.
So you borrow the kernel pages, but presumably these are all from
O_DIRECT and have a user mapping?
>>> In concept grouping is simpler because:
>>>
>>> - buffer lifetime is aligned with group leader lifetime, so we needn't
>>> worry buffer leak because of application accidental exit
>>
>> But if it was an application buffer, that would not be a concern.
>
> Yeah, but storage isn't same with network, here application buffer can't
> support zc.
Maybe I'm dense, but can you expand on why that's the case?
>> I do like the concept of the ephemeral buffer, the downside is that we
>> need per-op support for it too. And while I'm not totally against doing
>
> Can you explain per-op support a bit?
>
> Now the buffer has been provided by one single uring command.
I mean the need to do:
+ if (req->flags & REQ_F_GROUP_KBUF) {
+ ret = io_import_group_kbuf(req, rw->addr, rw->len, ITER_SOURCE,
+ &io->iter);
+ if (unlikely(ret))
+ return ret;
+ }
for picking such a buffer.
>> that, it would be lovely if we could utilize and existing mechanism for
>> that rather than add another one.
>
> If existing mechanism can cover everything, our linux may not progress any
> more.
That's not what I mean at all. We already have essentially three ways to
get a buffer destination for IO:
1) Just pass in an uaddr+len or an iovec
2) Set ->buf_index, the op needs to support this separately to grab a
registered buffer for IO.
3) For pollable stuff, provided buffers, either via the ring or the
legacy/classic approach.
This adds a 4th method, which shares the characteristic of 2+3 that the
op needs to support it. This is the whole motivation for poking at having a
way to use the normal registered buffer table for this, because then
this falls into method 2 above.
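For reference, a minimal liburing sketch of the three existing methods (fd,
buf, iovecs, pool and friends are assumed to be set up elsewhere):

	/* 1) plain uaddr+len (or iovec) carried in the sqe itself */
	io_uring_prep_read(sqe, fd, buf, len, offset);

	/* 2) registered buffer: table registered up front, op references a slot */
	io_uring_register_buffers(&ring, iovecs, nr_iovecs);
	io_uring_prep_read_fixed(sqe, fd, iovecs[idx].iov_base, len, offset, idx);

	/* 3) provided buffers for pollable ops: kernel picks one at completion */
	io_uring_prep_provide_buffers(sqe, pool, buf_len, nr_bufs, bgid, 0);
	/* ... and on the consuming recv sqe: */
	recv_sqe->flags |= IOSQE_BUFFER_SELECT;
	recv_sqe->buf_group = bgid;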
I'm not at all saying "oh we can't add this new feature", the only thing
I'm addressing is HOW we do that. I don't think anybody disagrees that
we need zero copy for ublk, and honestly I would love to see that sooner
rather than later!
>>>>>> There are certainly many different ways that can get propagated which
>>>>>> would not entail a complicated mechanism. I really like the aspect of
>>>>>> having the identifier being the same thing that we already use, and
>>>>>> hence not needing to be something new on the side.
>>>>>>
>>>>>>> Also multiple OPs may consume the buffer concurrently, which can't be
>>>>>>> supported by buffer select.
>>>>>>
>>>>>> Why not? You can certainly have multiple ops using the same registered
>>>>>> buffer concurrently right now.
>>>>>
>>>>> Please see the above problem.
>>>>>
>>>>> Also I remember that the selected buffer is removed from buffer list,
>>>>> see io_provided_buffer_select(), but maybe I am wrong.
>>>>
>>>> You're mixing up provided and registered buffers. Provided buffers are
>>>> ones that the applications gives to the kernel, and the kernel grabs and
>>>> consumes them. Then the application replenishes, repeat.
>>>>
>>>> Registered buffers are entirely different, those are registered with the
>>>> kernel and we can do things like pre-gup the pages so we don't have to
>>>> do them for every IO. They are entirely persistent, any multiple ops can
>>>> keep using them, concurrently. They don't get consumed by an IO like
>>>> provided buffers, they remain in place until they get unregistered (or
>>>> updated, like my patch) at some point.
>>>
>>> I know the difference.
>>
>> But io_provided_buffer_select() has nothing to do with registered/fixed
>> buffers or this use case, the above "remove from buffer list" is an
>> entirely different buffer concept. So there's some confusion here, just
>> wanted to make that clear.
>>
>>> The thing is that here we can't register the kernel buffer in ->prep(),
>>> and it has to be provided in ->issue() of uring command. That is similar
>>> with provided buffer.
>>
>> What's preventing it from registering it in ->prep()? It would be a bit
>> odd, but there would be nothing preventing it codewise, outside of the
>> oddity of ->prep() not being idempotent at that point. Don't follow why
>> that would be necessary, though, can you expand?
>
> ->prep() doesn't export to uring cmd, and we may not want to bother
> drivers.
Sure, we don't want it off ->uring_cmd() or anything like that.
> Also remove buffer still can't be done in ->prep().
I mean, technically it could... Same restrictions as add, however.
> Not dig into further, one big thing could be that dependency isn't
> respected in ->prep().
This is the main thing I was considering, because there's nothing
preventing it from happening outside of the fact that it makes ->prep()
not idempotent. Which is a big enough reason already, but...
--
Jens Axboe
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-11 16:49 ` Jens Axboe
@ 2024-10-12 3:35 ` Ming Lei
0 siblings, 0 replies; 47+ messages in thread
From: Ming Lei @ 2024-10-12 3:35 UTC (permalink / raw)
To: Jens Axboe; +Cc: Pavel Begunkov, io-uring, linux-block, ming.lei
On Fri, Oct 11, 2024 at 10:49:10AM -0600, Jens Axboe wrote:
> On 10/11/24 9:45 AM, Ming Lei wrote:
> > On Fri, Oct 11, 2024 at 08:41:03AM -0600, Jens Axboe wrote:
> >> On 10/11/24 8:20 AM, Ming Lei wrote:
> >>> On Fri, Oct 11, 2024 at 07:24:27AM -0600, Jens Axboe wrote:
> >>>> On 10/10/24 9:07 PM, Ming Lei wrote:
> >>>>> On Thu, Oct 10, 2024 at 08:39:12PM -0600, Jens Axboe wrote:
> >>>>>> On 10/10/24 8:30 PM, Ming Lei wrote:
> >>>>>>> Hi Jens,
> >>>>>>>
> >>>>>>> On Thu, Oct 10, 2024 at 01:31:21PM -0600, Jens Axboe wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Discussed this with Pavel, and on his suggestion, I tried prototyping a
> >>>>>>>> "buffer update" opcode. Basically it works like
> >>>>>>>> IORING_REGISTER_BUFFERS_UPDATE in that it can update an existing buffer
> >>>>>>>> registration. But it works as an sqe rather than being a sync opcode.
> >>>>>>>>
> >>>>>>>> The idea here is that you could do that upfront, or as part of a chain,
> >>>>>>>> and have it be generically available, just like any other buffer that
> >>>>>>>> was registered upfront. You do need an empty table registered first,
> >>>>>>>> which can just be sparse. And since you can pick the slot it goes into,
> >>>>>>>> you can rely on that slot afterwards (either as a link, or just the
> >>>>>>>> following sqe).
> >>>>>>>>
> >>>>>>>> Quick'n dirty obviously, but I did write a quick test case too to verify
> >>>>>>>> that:
> >>>>>>>>
> >>>>>>>> 1) It actually works (it seems to)
> >>>>>>>
> >>>>>>> It doesn't work for ublk zc since ublk needs to provide one kernel buffer
> >>>>>>> for fs rw & net send/recv to consume, and the kernel buffer is invisible
> >>>>>>> to userspace. But __io_register_rsrc_update() only can register userspace
> >>>>>>> buffer.
> >>>>>>
> >>>>>> I'd be surprised if this simple one was enough! In terms of user vs
> >>>>>> kernel buffer, you could certainly use the same mechanism, and just
> >>>>>> ensure that buffers are tagged appropriately. I need to think about that
> >>>>>> a little bit.
> >>>>>
> >>>>> It is actually same with IORING_OP_PROVIDE_BUFFERS, so the following
> >>>>> consumer OPs have to wait until this OP_BUF_UPDATE is completed.
> >>>>
> >>>> See below for the registered vs provided buffer confusion that seems to
> >>>> be a confusion issue here.
> >>>>
> >>>>> Suppose we have N consumers OPs which depends on OP_BUF_UPDATE.
> >>>>>
> >>>>> 1) all N OPs are linked with OP_BUF_UPDATE
> >>>>>
> >>>>> Or
> >>>>>
> >>>>> 2) submit OP_BUF_UPDATE first, and wait its completion, then submit N
> >>>>> OPs concurrently.
> >>>>
> >>>> Correct
> >>>>
> >>>>> But 1) and 2) may slow the IO handing. In 1) all N OPs are serialized,
> >>>>> and 1 extra syscall is introduced in 2).
> >>>>
> >>>> Yes you don't want do do #1. But the OP_BUF_UPDATE is cheap enough that
> >>>> you can just do it upfront. It's not ideal in terms of usage, and I get
> >>>> where the grouping comes from. But is it possible to do the grouping in
> >>>> a less intrusive fashion with OP_BUF_UPDATE? Because it won't change any
> >>>
> >>> The most of 'intrusive' change is just on patch 4, and Pavel has commented
> >>> that it is good enough:
> >>>
> >>> https://lore.kernel.org/linux-block/ZwZzsPcXyazyeZnu@fedora/T/#m551e94f080b80ccbd2561e01da5ea8e17f7ee15d
> >>
> >> At least for me, patch 4 looks fine. The problem occurs when you start
> >> needing to support this different buffer type, which is in patch 6. I'm
> >> not saying we can necessarily solve this with OP_BUF_UPDATE, I just want
> >> to explore that path because if we can, then patch 6 turns into "oh
> >> let's just added registered/fixed buffer support to these ops that don't
> >> currently support it". And that would be much nicer indeed.
> >
> > OK, in my local V7, the buffer type is actually aligned with
> > BUFFER_SELECT from both interface & use viewpoint, since member SQE
> > have three empty flags available.
> >
> > I will post V7 for review.
>
> OK, I'll take a look once posted.
>
> >>>> of the other ops in terms of buffer consumption, they'd just need fixed
> >>>> buffer support and you'd flag the buffer index in sqe->buf_index. And
> >>>> the nice thing about that is that while fixed/registered buffers aren't
> >>>> really used on the networking side yet (as they don't bring any benefit
> >>>> yet), adding support for them could potentially be useful down the line
> >>>> anyway.
> >>>
> >>> With 2), two extra syscalls are added for each ublk IO, one is provide
> >>> buffer, another is remove buffer. The two syscalls have to be sync with
> >>> consumer OPs.
> >>>
> >>> I can understand the concern, but if the change can't improve perf or
> >>> even slow things done, it loses its value.
> >>
> >> It'd be one extra syscall, as the remove can get bundled with the next
> >> add. But your point still stands, yes it will add extra overhead,
> >
> > It can't be bundled.
>
> Don't see why not, but let's review v7 and see what comes up.
>
> > And the kernel buffer is blk-mq's request pages, which is per tag.
> >
> > Such as, for ublk-target, IO comes to tag 0, after this IO(tag 0) is
> > handled, how can we know if there is new IO comes to tag 0 immediately? :-)
>
> Gotcha, yeah sounds like that needs to remain a kernel buffer.
>
> >> although be it pretty darn minimal. I'm actually more concerned with the
> >> complexity for handling it. While the OP_BUF_UPDATE will always
> >> complete immediately, there's no guarantee it's the next cqe you pull
> >> out when peeking post submission.
> >>
> >>>>> The same thing exists in the next OP_BUF_UPDATE which has to wait until
> >>>>> all the previous buffer consumers are done. So the same slow thing are
> >>>>> doubled. Not mention the application will become more complicated.
> >>>>
> >>>> It does not, you can do an update on a buffer that's already inflight.
> >>>
> >>> UPDATE may not match the case, actually two OPs are needed, one is
> >>> provide buffer OP and the other is remove buffer OP, both have to deal
> >>> with the other subsystem(ublk). Remove buffer needs to be done after all
> >>> consumer OPs are done immediately.
> >>
> >> You don't necessarily need the remove. If you always just use the same
> >> slot for these, then the OP_BUF_UPDATE will just update the current
> >> location.
> >
> > The buffer is per tag, and can't guarantee to be reused immediately,
> > otherwise it isn't zero copy any more.
>
> Don't follow this one either. As long as reuse keeps existing IO fine,
> then it should be fine? I'm not talking about reusing the buffer, just
> the slot it belongs to.
Both the provide and remove buffer OPs deal with the IO buffer for the same
unique tag, and the buffer is indexed by a key provided by the user, which
is similar to ->buf_index. It is definitely not possible to remove the old
buffer and add a new one in a single command with the same key. Another
reason is that we don't know whether any new IO (with a buffer) arrives at
that point.
Also, since at any time there is only one inflight IO per tag in the storage
world, no new IO can arrive on the current slot, because the old IO can't
complete until the kernel buffer is removed.
>
> >>>> would be totally fine in terms of performance. OP_BUF_UPDATE will
> >>>> _always_ completely immediately and inline, which means that it'll
> >>>> _always_ be immediately available post submission. The only think you'd
> >>>> ever have to worry about in terms of failure is a badly formed request,
> >>>> which is a programming issue, or running out of memory on the host.
> >>>>
> >>>>> Also it makes error handling more complicated, io_uring has to remove
> >>>>> the kernel buffer when the current task is exit, dependency or order with
> >>>>> buffer provider is introduced.
> >>>>
> >>>> Why would that be? They belong to the ring, so should be torn down as
> >>>> part of the ring anyway? Why would they be task-private, but not
> >>>> ring-private?
> >>>
> >>> It is kernel buffer, which belongs to provider(such as ublk) instead
> >>> of uring, application may panic any time, then io_uring has to remove
> >>> the buffer for notifying the buffer owner.
> >>
> >> But it could be an application buffer, no? You'd just need the
> >> application to provide it to ublk and have it mapped, rather than have
> >> ublk allocate it in-kernel and then use that.
> >
> > The buffer is actually kernel 'request/bio' pages of /dev/ublkbN, and now we
> > forward and borrow it to io_uring OPs(fs rw, net send/recv), so it can't be
> > application buffer, not same with net rx.
>
> So you borrow the kernel pages, but presumably these are all from
> O_DIRECT and have a user mapping?
Yes.
>
> >>> In concept grouping is simpler because:
> >>>
> >>> - buffer lifetime is aligned with group leader lifetime, so we needn't
> >>> worry buffer leak because of application accidental exit
> >>
> >> But if it was an application buffer, that would not be a concern.
> >
> > Yeah, but storage isn't same with network, here application buffer can't
> > support zc.
>
> Maybe I'm dense, but can you expand on why that's the case?
Network data can arrive at any time, so I guess the rx buffer has to be
provided beforehand; it is just a buffer, which can be built from either the
application or the kernel.
Storage follows a client/server model, and data can only arrive after a
request is sent to the device, so the buffer is prepared together with the
request before sending it. In current Linux that buffer is built in the
kernel, so it has to be a kernel buffer (bio->bi_bvec).
>
> >> I do like the concept of the ephemeral buffer, the downside is that we
> >> need per-op support for it too. And while I'm not totally against doing
> >
> > Can you explain per-op support a bit?
> >
> > Now the buffer has been provided by one single uring command.
>
> I mean the need to do:
>
> + if (req->flags & REQ_F_GROUP_KBUF) {
> + ret = io_import_group_kbuf(req, rw->addr, rw->len, ITER_SOURCE,
> + &io->iter);
> + if (unlikely(ret))
> + return ret;
> + }
>
> for picking such a buffer.
The above is for starting to consume the buffer; the usage is the same as in
the buffer_select case, where the buffer still needs to be imported.
And this patchset provides this buffer (REQ_F_GROUP_KBUF) via a single
uring_cmd.
The use model is basically that driver-specific commands provide & remove
the kernel buffer, and the buffer is consumed by generic io_uring OPs
through a generic interface in group style (see the sketch below).
It looks like you and Pavel would prefer that generic provide/remove kernel
buffer OPs be added from the beginning.
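A rough sketch of that use model from the submitter's point of view. The
group flags and the ublk command below are purely illustrative, not the
actual uAPI of this series; only the liburing helpers and sqe fields are real:

	/* Leader: ublk uring_cmd that lends the request's kernel buffer
	 * to the group (command/flag names are made up for illustration). */
	struct io_uring_sqe *lead = io_uring_get_sqe(&ring);
	io_uring_prep_nop(lead);			/* placeholder prep */
	lead->opcode = IORING_OP_URING_CMD;
	lead->fd     = ublk_char_fd;
	lead->cmd_op = UBLK_IO_PROVIDE_KBUF;		/* hypothetical cmd */
	lead->flags |= SQE_GROUP_LEADER;		/* hypothetical flag */

	/* Member: a generic OP (here a write to the backing file) that
	 * consumes the lent kernel buffer instead of a user address. */
	struct io_uring_sqe *m = io_uring_get_sqe(&ring);
	io_uring_prep_write(m, backing_fd, NULL, nr_bytes, off);
	m->flags |= SQE_GROUP_MEMBER_KBUF;		/* hypothetical flag */

	/* One submission; the buffer is returned to ublk when the last
	 * member in the group completes. */
	io_uring_submit(&ring);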
>
> >> that, it would be lovely if we could utilize and existing mechanism for
> >> that rather than add another one.
> >
> > If existing mechanism can cover everything, our linux may not progress any
> > more.
>
> That's not what I mean at all. We already have essentially three ways to
> get a buffer destination for IO:
>
> 1) Just pass in an uaddr+len or an iovec
> 2) Set ->buf_index, the op needs to support this separately to grab a
> registered buffer for IO.
> 3) For pollable stuff, provided buffers, either via the ring or the
> legacy/classic approach.
>
> This adds a 4th method, which shares the characteristics of 2+3 that the
> op needs to support it. This is the whole motivation to poke at having a
> way to use the normal registered buffer table for this, because then
> this falls into method 2 above.
Here the kernel buffer has a very short lifetime, aligned with the block IO
request in the ublk use case, which is another big difference from 2#; it can
even be shorter-lived than a provided buffer.
As we discussed, there are several disadvantages with the existing mechanisms
for this use case:
1) it is actually the above 3# provided buffer rather than a registered
buffer, because:
- a registered buffer is long-lived, without the dependency problem
- a registered buffer can be imported in ->prep()
here the kernel buffer has a short lifetime and can't be imported in ->prep()
because of the buffer dependency.
2) dependency between consumer OPs and the provide buffer & remove buffer OPs,
which adds two extra syscalls for handling each ublk IO and makes the
application more complicated.
3) the application may exit with an exception or panic, and io_uring has to
remove this kernel buffer from the table when that happens, but removing
the kernel buffer requires returning it to the buffer provider.
4) The existing provide & register buffer code needs big changes to support
providing/removing a kernel buffer, and the model is more complicated than
the group buffer method. The existing buffer-consuming code (import buffer)
has to change too, because it isn't a real registered buffer. But if we treat
it as a provided buffer, that is basically what this patch is doing.
It looks like you and Pavel are fine with patch 4, which adds the sqe/IO group
concept; I'm just wondering if you could take one step further and consider
the group buffer concept, which is valid only group-wide (a local buffer),
doesn't need registration, needn't be global, and is simple to implement and
use.
The cons could be that one new buffer type is added; BUFFER_SELECT could even
be reused, but that is less flexible.
>
> I'm not at all saying "oh we can't add this new feature", the only thing
> I'm addressing is HOW we do that. I don't think anybody disagrees that
> we need zero copy for ublk, and honestly I would love to see that sooner
> rather than later!
If fuse switches to uring_cmd, it may benefit from this too.
Current fuse can only support WRITE zero copy; READ zc never got
supported because of splice/pipe's limits. I discussed this with
Miklos, and it turns out supporting fuse READ zc with splice is an
impossible task.
>
> >>>>>> There are certainly many different ways that can get propagated which
> >>>>>> would not entail a complicated mechanism. I really like the aspect of
> >>>>>> having the identifier being the same thing that we already use, and
> >>>>>> hence not needing to be something new on the side.
> >>>>>>
> >>>>>>> Also multiple OPs may consume the buffer concurrently, which can't be
> >>>>>>> supported by buffer select.
> >>>>>>
> >>>>>> Why not? You can certainly have multiple ops using the same registered
> >>>>>> buffer concurrently right now.
> >>>>>
> >>>>> Please see the above problem.
> >>>>>
> >>>>> Also I remember that the selected buffer is removed from buffer list,
> >>>>> see io_provided_buffer_select(), but maybe I am wrong.
> >>>>
> >>>> You're mixing up provided and registered buffers. Provided buffers are
> >>>> ones that the applications gives to the kernel, and the kernel grabs and
> >>>> consumes them. Then the application replenishes, repeat.
> >>>>
> >>>> Registered buffers are entirely different, those are registered with the
> >>>> kernel and we can do things like pre-gup the pages so we don't have to
> >>>> do them for every IO. They are entirely persistent, any multiple ops can
> >>>> keep using them, concurrently. They don't get consumed by an IO like
> >>>> provided buffers, they remain in place until they get unregistered (or
> >>>> updated, like my patch) at some point.
> >>>
> >>> I know the difference.
> >>
> >> But io_provided_buffer_select() has nothing to do with registered/fixed
> >> buffers or this use case, the above "remove from buffer list" is an
> >> entirely different buffer concept. So there's some confusion here, just
> >> wanted to make that clear.
> >>
> >>> The thing is that here we can't register the kernel buffer in ->prep(),
> >>> and it has to be provided in ->issue() of uring command. That is similar
> >>> with provided buffer.
> >>
> >> What's preventing it from registering it in ->prep()? It would be a bit
> >> odd, but there would be nothing preventing it codewise, outside of the
> >> oddity of ->prep() not being idempotent at that point. Don't follow why
> >> that would be necessary, though, can you expand?
> >
> > ->prep() doesn't export to uring cmd, and we may not want to bother
> > drivers.
>
> Sure, we don't want it off ->uring_cmd() or anything like that.
>
> > Also remove buffer still can't be done in ->prep().
>
> I mean, technically it could... Same restrictions as add, however.
>
> > Not dig into further, one big thing could be that dependency isn't
> > respected in ->prep().
>
> This is the main thing I was considering, because there's nothing
> preventing it from happening outside of the fact that it makes ->prep()
> not idempotent. Which is a big enough reason already, but...
It depends on whether the OP needs to support IO_LINK; if yes, it can't be
done in ->prep(), otherwise the link rule is broken.
But IO_LINK is important here, because the buffer dependency really exists;
IMO, we shouldn't put that limit on this OP from the user's viewpoint.
Thanks
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-11 15:45 ` Ming Lei
2024-10-11 16:49 ` Jens Axboe
@ 2024-10-14 18:40 ` Pavel Begunkov
2024-10-15 11:05 ` Ming Lei
1 sibling, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-10-14 18:40 UTC (permalink / raw)
To: Ming Lei, Jens Axboe; +Cc: io-uring, linux-block
On 10/11/24 16:45, Ming Lei wrote:
> On Fri, Oct 11, 2024 at 08:41:03AM -0600, Jens Axboe wrote:
>> On 10/11/24 8:20 AM, Ming Lei wrote:
>>> On Fri, Oct 11, 2024 at 07:24:27AM -0600, Jens Axboe wrote:
>>>> On 10/10/24 9:07 PM, Ming Lei wrote:
>>>>> On Thu, Oct 10, 2024 at 08:39:12PM -0600, Jens Axboe wrote:
>>>>>> On 10/10/24 8:30 PM, Ming Lei wrote:
>>>>>>> Hi Jens,
...
>>>>> Suppose we have N consumers OPs which depends on OP_BUF_UPDATE.
>>>>>
>>>>> 1) all N OPs are linked with OP_BUF_UPDATE
>>>>>
>>>>> Or
>>>>>
>>>>> 2) submit OP_BUF_UPDATE first, and wait its completion, then submit N
>>>>> OPs concurrently.
>>>>
>>>> Correct
>>>>
>>>>> But 1) and 2) may slow the IO handing. In 1) all N OPs are serialized,
>>>>> and 1 extra syscall is introduced in 2).
>>>>
>>>> Yes you don't want do do #1. But the OP_BUF_UPDATE is cheap enough that
>>>> you can just do it upfront. It's not ideal in terms of usage, and I get
>>>> where the grouping comes from. But is it possible to do the grouping in
>>>> a less intrusive fashion with OP_BUF_UPDATE? Because it won't change any
>>>
>>> The most of 'intrusive' change is just on patch 4, and Pavel has commented
>>> that it is good enough:
>>>
>>> https://lore.kernel.org/linux-block/ZwZzsPcXyazyeZnu@fedora/T/#m551e94f080b80ccbd2561e01da5ea8e17f7ee15d
Trying to catch up on the thread. I do think the patch is tolerable and
mergeable, but I do think it adds quite a bit of complication to the path if
you try to map out what states a request can be in and what the
dependencies are, and then the follow-up patches have to go into each and
every io_uring opcode and add support for leased buffers. And I'm afraid
that we'll also need feedback from the completion of those to let
the buffer know which ranges now have data / are initialised. One typical
problem for page flipping rx, for example, is that you need to have
a full page of data to map it, otherwise it should be prezeroed,
which is too expensive; the same problem exists without mmap'ing
and directly exposing pages to the user.
>> At least for me, patch 4 looks fine. The problem occurs when you start
>> needing to support this different buffer type, which is in patch 6. I'm
>> not saying we can necessarily solve this with OP_BUF_UPDATE, I just want
>> to explore that path because if we can, then patch 6 turns into "oh
>> let's just added registered/fixed buffer support to these ops that don't
>> currently support it". And that would be much nicer indeed.
...
>>>> would be totally fine in terms of performance. OP_BUF_UPDATE will
>>>> _always_ completely immediately and inline, which means that it'll
>>>> _always_ be immediately available post submission. The only think you'd
>>>> ever have to worry about in terms of failure is a badly formed request,
>>>> which is a programming issue, or running out of memory on the host.
>>>>
>>>>> Also it makes error handling more complicated, io_uring has to remove
>>>>> the kernel buffer when the current task is exit, dependency or order with
>>>>> buffer provider is introduced.
>>>>
>>>> Why would that be? They belong to the ring, so should be torn down as
>>>> part of the ring anyway? Why would they be task-private, but not
>>>> ring-private?
>>>
>>> It is kernel buffer, which belongs to provider(such as ublk) instead
>>> of uring, application may panic any time, then io_uring has to remove
>>> the buffer for notifying the buffer owner.
>>
>> But it could be an application buffer, no? You'd just need the
>> application to provide it to ublk and have it mapped, rather than have
>> ublk allocate it in-kernel and then use that.
>
> The buffer is actually kernel 'request/bio' pages of /dev/ublkbN, and now we
> forward and borrow it to io_uring OPs(fs rw, net send/recv), so it can't be
> application buffer, not same with net rx.
I don't see any problem in dropping buffers from the table
on exit; there is already a lot of stuff a thread does for io_uring
when it exits.
>>> In concept grouping is simpler because:
>>>
>>> - buffer lifetime is aligned with group leader lifetime, so we needn't
>>> worry buffer leak because of application accidental exit
>>
>> But if it was an application buffer, that would not be a concern.
>
> Yeah, but storage isn't same with network, here application buffer can't
> support zc.
Maybe I missed how it came to app buffers, but the thing I
initially mentioned is about storing the kernel buffer in
the table, without any user pointers and user buffers.
>>> - the buffer is borrowed to consumer OPs, and returned back after all
>>> consumers are done, this way avoids any dependency
>>>
>>> Meantime OP_BUF_UPDATE(provide buffer OP, remove buffer OP) becomes more
>>> complicated:
>>>
>>> - buffer leak because of app panic
Then io_uring dies and releases the buffers. Or we can even add
some code to remove them; as mentioned, any task that has ever
submitted a request already runs some io_uring code on exit.
>>> - buffer dependency issue: consumer OPs depend on provide buffer OP,
>>> remove buffer OP depends on consumer OPs; two syscalls has to be
>>> added for handling single ublk IO.
>>
>> Seems like most of this is because of the kernel buffer too, no?
>
> Yeah.
>
>>
>> I do like the concept of the ephemeral buffer, the downside is that we
>> need per-op support for it too. And while I'm not totally against doing
>
> Can you explain per-op support a bit?
>
> Now the buffer has been provided by one single uring command.
>
>> that, it would be lovely if we could utilize and existing mechanism for
>> that rather than add another one.
That would also be more flexible, as not everything can be
handled by linked request logic, and it wouldn't require hacking
into each and every request type to support "consuming" leased
buffers.
Overhead-wise, let's say we fix the buffer binding order and delay it
as elaborated on below; then you can provide a buffer and link a
consumer (e.g. a send request or anything else) just as you do
it now. You can also link a request returning the buffer to the
same chain if you don't need extra flexibility.
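A minimal sketch of that chain shape (the provide/return buffer opcodes
do not exist yet, so io_uring_prep_nop() stands in for them here; the
IOSQE_IO_LINK wiring and the liburing calls are the only real parts):

/*
 * Sketch of the provide -> consume -> return chain described above.
 * Error handling (NULL sqe, submit failures) omitted for brevity.
 */
#include <liburing.h>

static void queue_provide_consume_return(struct io_uring *ring, int sock,
					 void *buf, unsigned len)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_nop(sqe);				/* stand-in: provide/lease the buffer */
	sqe->flags |= IOSQE_IO_LINK;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_send(sqe, sock, buf, len, 0);	/* consumer op */
	sqe->flags |= IOSQE_IO_LINK;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_nop(sqe);				/* stand-in: return/remove the buffer */

	io_uring_submit(ring);				/* one syscall for the whole chain */
}

The whole chain goes in with a single io_uring_submit(), which is the
"no extra syscall" property being discussed here.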
As for groups, they're complicated because of the order inversion,
the notion of a leader and so on. If we get rid of the need to impose
more semantics onto it by mediating buffer transitions through the
table, I think we can still do groups if needed, but make them simpler.
>> What's preventing it from registering it in ->prep()? It would be a bit
>> odd, but there would be nothing preventing it codewise, outside of the
>> oddity of ->prep() not being idempotent at that point. Don't follow why
>> that would be necessary, though, can you expand?
>
> ->prep() doesn't export to uring cmd, and we may not want to bother
> drivers.
>
> Also remove buffer still can't be done in ->prep().
>
> Not dig into further, one big thing could be that dependency isn't
> respected in ->prep().
And we can just fix that and move the choosing of a buffer
to ->issue(), in which case a buffer provided by one request
will be observable to its linked requests.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer
2024-10-14 18:40 ` Pavel Begunkov
@ 2024-10-15 11:05 ` Ming Lei
0 siblings, 0 replies; 47+ messages in thread
From: Ming Lei @ 2024-10-15 11:05 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Jens Axboe, io-uring, linux-block, ming.lei
On Mon, Oct 14, 2024 at 07:40:40PM +0100, Pavel Begunkov wrote:
> On 10/11/24 16:45, Ming Lei wrote:
> > On Fri, Oct 11, 2024 at 08:41:03AM -0600, Jens Axboe wrote:
> > > On 10/11/24 8:20 AM, Ming Lei wrote:
> > > > On Fri, Oct 11, 2024 at 07:24:27AM -0600, Jens Axboe wrote:
> > > > > On 10/10/24 9:07 PM, Ming Lei wrote:
> > > > > > On Thu, Oct 10, 2024 at 08:39:12PM -0600, Jens Axboe wrote:
> > > > > > > On 10/10/24 8:30 PM, Ming Lei wrote:
> > > > > > > > Hi Jens,
> ...
> > > > > > Suppose we have N consumers OPs which depends on OP_BUF_UPDATE.
> > > > > >
> > > > > > 1) all N OPs are linked with OP_BUF_UPDATE
> > > > > >
> > > > > > Or
> > > > > >
> > > > > > 2) submit OP_BUF_UPDATE first, and wait its completion, then submit N
> > > > > > OPs concurrently.
> > > > >
> > > > > Correct
> > > > >
> > > > > > But 1) and 2) may slow the IO handing. In 1) all N OPs are serialized,
> > > > > > and 1 extra syscall is introduced in 2).
> > > > >
> > > > > Yes you don't want do do #1. But the OP_BUF_UPDATE is cheap enough that
> > > > > you can just do it upfront. It's not ideal in terms of usage, and I get
> > > > > where the grouping comes from. But is it possible to do the grouping in
> > > > > a less intrusive fashion with OP_BUF_UPDATE? Because it won't change any
> > > >
> > > > The most of 'intrusive' change is just on patch 4, and Pavel has commented
> > > > that it is good enough:
> > > >
> > > > https://lore.kernel.org/linux-block/ZwZzsPcXyazyeZnu@fedora/T/#m551e94f080b80ccbd2561e01da5ea8e17f7ee15d
>
> Trying to catch up on the thread. I do think the patch is tolerable and
> mergeable, but I do think it adds quite a bit of complication to the path if
> you try to map out what states a request can be in and what
I admit that sqe group adds a little complexity to the submission &
completion code, especially on the completion side.
But with your help, patch 4 has become easy to follow and sqe group
is well-defined now. It also adds the new feature of N:M dependency;
without it, one extra syscall would be required to support N:M dependency,
so this way not only saves a syscall but also simplifies the application.
> dependencies are there, and then the follow-up patches have to go to each and every
> io_uring opcode and add support for leased buffers. And I'm afraid
Only fast IO (net, fs) needs it; I don't see other OPs needing such support.
> that we'll also need feedback from completion of those to let
> the buffer know which ranges now have data / are initialised. One typical
> problem for page flipping rx, for example, is that you need to have
> a full page of data to map it, otherwise it has to be prezeroed,
> which is too expensive; the same problem exists without mmap'ing
> and directly exposing pages to the user.
In the current design, the callback is only for returning the leased
buffer to its owner, and we just need io_uring to do the driver a favor
by running aio with the leased buffer.
It can become quite complicated if we add feedback from completion.
Your catch on short read/recv is a good one, since it may leak kernel
data. The problem exists with any other approach (provide kbuf) too; the
point is that it is a kernel buffer. What do you think of the
following approach?
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index d72a6bbbbd12..c1bc4179b390 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -242,4 +242,13 @@ static inline void io_drop_leased_grp_kbuf(struct io_kiocb *req)
 	if (gbuf)
 		gbuf->grp_kbuf_ack(gbuf);
 }
+
+/* zero the remaining bytes of the kernel buffer to avoid leaking data */
+static inline void io_req_zero_remained(struct io_kiocb *req, struct iov_iter *iter)
+{
+	size_t left = iov_iter_count(iter);
+
+	if (iov_iter_rw(iter) == READ && left > 0)
+		iov_iter_zero(left, iter);
+}
#endif
diff --git a/io_uring/net.c b/io_uring/net.c
index 6c32be92646f..022d81b6fc65 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -899,6 +899,8 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
*ret = IOU_STOP_MULTISHOT;
else
*ret = IOU_OK;
+ if (io_use_leased_grp_kbuf(req))
+ io_req_zero_remained(req, &kmsg->msg.msg_iter);
io_req_msg_cleanup(req, issue_flags);
return true;
}
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 76a443fa593c..565b0e742ee5 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -479,6 +479,11 @@ static bool __io_complete_rw_common(struct io_kiocb *req, long res)
}
req_set_fail(req);
req->cqe.res = res;
+ if (io_use_leased_grp_kbuf(req)) {
+ struct io_async_rw *io = req->async_data;
+
+ io_req_zero_remained(req, &io->iter);
+ }
}
return false;
}
>
> > > At least for me, patch 4 looks fine. The problem occurs when you start
> > > needing to support this different buffer type, which is in patch 6. I'm
> > > not saying we can necessarily solve this with OP_BUF_UPDATE, I just want
> > > to explore that path because if we can, then patch 6 turns into "oh
> > > let's just added registered/fixed buffer support to these ops that don't
> > > currently support it". And that would be much nicer indeed.
> ...
> > > > > would be totally fine in terms of performance. OP_BUF_UPDATE will
> > > > > _always_ completely immediately and inline, which means that it'll
> > > > > _always_ be immediately available post submission. The only think you'd
> > > > > ever have to worry about in terms of failure is a badly formed request,
> > > > > which is a programming issue, or running out of memory on the host.
> > > > >
> > > > > > Also it makes error handling more complicated, io_uring has to remove
> > > > > > the kernel buffer when the current task is exit, dependency or order with
> > > > > > buffer provider is introduced.
> > > > >
> > > > > Why would that be? They belong to the ring, so should be torn down as
> > > > > part of the ring anyway? Why would they be task-private, but not
> > > > > ring-private?
> > > >
> > > > It is kernel buffer, which belongs to provider(such as ublk) instead
> > > > of uring, application may panic any time, then io_uring has to remove
> > > > the buffer for notifying the buffer owner.
> > >
> > > But it could be an application buffer, no? You'd just need the
> > > application to provide it to ublk and have it mapped, rather than have
> > > ublk allocate it in-kernel and then use that.
> >
> > The buffer is actually kernel 'request/bio' pages of /dev/ublkbN, and now we
> > forward and borrow it to io_uring OPs(fs rw, net send/recv), so it can't be
> > application buffer, not same with net rx.
>
> I don't see any problem in dropping buffers from the table
> on exit; there is already a lot of stuff a thread does for io_uring
> when it exits.
io_uring cancel handling is complicated enough already; with provide
kernel buffer added, uring command would have two cancel code paths:
1) io_uring_try_cancel_uring_cmd()
2) the kernel buffer cancel code path
There might be a dependency between the two.
>
>
> > > > In concept grouping is simpler because:
> > > >
> > > > - buffer lifetime is aligned with group leader lifetime, so we needn't
> > > > worry buffer leak because of application accidental exit
> > >
> > > But if it was an application buffer, that would not be a concern.
> >
> > Yeah, but storage isn't same with network, here application buffer can't
> > support zc.
>
> Maybe I missed how it came to app buffers, but the thing I
> initially mentioned is about storing the kernel buffer in
> the table, without any user pointers and user buffers.
Yeah, just some random words, please ignore it.
>
> > > > - the buffer is borrowed to consumer OPs, and returned back after all
> > > > consumers are done, this way avoids any dependency
> > > >
> > > > Meantime OP_BUF_UPDATE(provide buffer OP, remove buffer OP) becomes more
> > > > complicated:
> > > >
> > > > - buffer leak because of app panic
>
> Then io_uring dies and releases the buffers. Or we can even add
> some code to remove them; as mentioned, any task that has ever
> submitted a request already runs some io_uring code on exit.
>
> > > > - buffer dependency issue: consumer OPs depend on provide buffer OP,
> > > > remove buffer OP depends on consumer OPs; two syscalls has to be
> > > > added for handling single ublk IO.
> > >
> > > Seems like most of this is because of the kernel buffer too, no?
> >
> > Yeah.
> >
> > >
> > > I do like the concept of the ephemeral buffer, the downside is that we
> > > need per-op support for it too. And while I'm not totally against doing
> >
> > Can you explain per-op support a bit?
> >
> > Now the buffer has been provided by one single uring command.
> >
> > > that, it would be lovely if we could utilize and existing mechanism for
> > > that rather than add another one.
>
> That would also be more flexible, as not everything can be
> handled by linked request logic, and it wouldn't require hacking
> into each and every request type to support "consuming" leased
> buffers.
I guess by 'consuming' you mean the code added in net.c and rw.c. That
can't be avoided, because it is a kernel buffer and we are supporting
it for the first time (a rough sketch of the per-op import follows this list):
- there is no userspace address, unlike buffer select & fixed buffers
- the kernel buffer has to be returned to the provider
- the buffer has to be imported in ->issue(); it can't be done in ->prep()
- short read/recv has to be dealt with
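A rough sketch of what that per-op import boils down to (illustrative
only: io_uring_kernel_buf / grp_kbuf are hypothetical stand-ins for the
leased-buffer state the series adds; iov_iter_bvec() is the stock kernel
helper):

/*
 * Illustrative sketch: the leased buffer has no user address, so the op
 * builds its iov_iter directly over the provider's bvec pages in ->issue().
 */
static int io_import_leased_kbuf(struct io_kiocb *req, unsigned int ddir,
				 struct iov_iter *iter)
{
	const struct io_uring_kernel_buf *kbuf = req->grp_kbuf;	/* hypothetical field */

	if (!kbuf)
		return -EINVAL;

	iov_iter_bvec(iter, ddir, kbuf->bvec, kbuf->nr_bvecs, kbuf->len);
	return 0;
}

The short read/recv case is then handled at completion time, as in the
zeroing diff above.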
>
> Overhead-wise, let's say we fix the buffer binding order and delay it
> as elaborated on below; then you can provide a buffer and link a
> consumer (e.g. a send request or anything else) just as you do
> it now. You can also link a request returning the buffer to the
> same chain if you don't need extra flexibility.
>
> As for groups, they're complicated because of the order inversion,
IMO, the group complication only exists on the completion side, and
fortunately it is well defined now.
Going through the buffer table makes the application more complicated,
with worse performance:
- two syscalls (uring_enter trips) are added for each ublk IO
- one extra request is added (a group needs 2 requests, while the add/remove
buffer approach needs 3 requests in the simplest case), so bigger SQ & CQ sizes
- extra cancel handling
Grouping simplifies buffer lifetime a lot, since io_uring needn't
care about it at all.
> the notion of a leader and so on. If we get rid of the need to impose
> more semantics onto it by mediating buffer transitions through the
> table, I think we can still do groups if needed, but make them simpler.
The situation is simply that the driver leases the buffer to io_uring;
it doesn't have to transfer it to io_uring. And once a buffer is added
to the table, it also has to be removed from the table.
It is just like a local variable vs a global variable: the latter is
more complicated to use.
>
> > > What's preventing it from registering it in ->prep()? It would be a bit
> > > odd, but there would be nothing preventing it codewise, outside of the
> > > oddity of ->prep() not being idempotent at that point. Don't follow why
> > > that would be necessary, though, can you expand?
> >
> > ->prep() doesn't export to uring cmd, and we may not want to bother
> > drivers.
> >
> > Also remove buffer still can't be done in ->prep().
> >
> > Not dig into further, one big thing could be that dependency isn't
> > respected in ->prep().
>
> And we can just fix that and move the choosing of a buffer
> to ->issue(), in which case a buffer provided by one request
> will be observable to its linked requests.
This patch does import the buffer in ->issue(). As I explained to Jens:
- either all OPs are linked together with add_kbuf & remove_kbuf, in which
case the OPs can't be issued concurrently
- or two syscalls are added for handling a single ublk IO
Neither is great from a performance viewpoint, and both complicate the
application.
I don't think the above two can be avoided, or can you explain how to
do it?
thanks,
Ming
^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [PATCH V6 8/8] ublk: support provide io buffer
2024-09-12 10:49 ` [PATCH V6 8/8] ublk: support provide io buffer Ming Lei
@ 2024-10-17 22:31 ` Uday Shankar
2024-10-18 0:45 ` Ming Lei
0 siblings, 1 reply; 47+ messages in thread
From: Uday Shankar @ 2024-10-17 22:31 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block
On Thu, Sep 12, 2024 at 06:49:28PM +0800, Ming Lei wrote:
> +static int ublk_provide_io_buf(struct io_uring_cmd *cmd,
> + struct ublk_queue *ubq, int tag)
> +{
> + struct ublk_device *ub = cmd->file->private_data;
> + struct ublk_rq_data *data;
> + struct request *req;
> +
> + if (!ub)
> + return -EPERM;
> +
> + req = __ublk_check_and_get_req(ub, ubq, tag, 0);
> + if (!req)
> + return -EINVAL;
> +
> + pr_devel("%s: qid %d tag %u request bytes %u\n",
> + __func__, tag, ubq->q_id, blk_rq_bytes(req));
> +
> + data = blk_mq_rq_to_pdu(req);
> +
> + /*
> + * io_uring guarantees that the callback will be called after
> + * the provided buffer is consumed, and it is automatic removal
> + * before this uring command is freed.
> + *
> + * This request won't be completed unless the callback is called,
> + * so ublk module won't be unloaded too.
> + */
> + return io_uring_cmd_provide_kbuf(cmd, data->buf);
> +}
We did some testing with this patchset and saw some panics due to
grp_kbuf_ack being a garbage value. Turns out that's because we forgot
to set the UBLK_F_SUPPORT_ZERO_COPY flag on the device. But it looks
like the UBLK_IO_PROVIDE_IO_BUF command is still allowed for such
devices. Should this function test that the device has zero copy
configured and fail if it doesn't?
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [PATCH V6 8/8] ublk: support provide io buffer
2024-10-17 22:31 ` Uday Shankar
@ 2024-10-18 0:45 ` Ming Lei
0 siblings, 0 replies; 47+ messages in thread
From: Ming Lei @ 2024-10-18 0:45 UTC (permalink / raw)
To: Uday Shankar; +Cc: Jens Axboe, io-uring, Pavel Begunkov, linux-block
On Thu, Oct 17, 2024 at 04:31:26PM -0600, Uday Shankar wrote:
> On Thu, Sep 12, 2024 at 06:49:28PM +0800, Ming Lei wrote:
> > +static int ublk_provide_io_buf(struct io_uring_cmd *cmd,
> > + struct ublk_queue *ubq, int tag)
> > +{
> > + struct ublk_device *ub = cmd->file->private_data;
> > + struct ublk_rq_data *data;
> > + struct request *req;
> > +
> > + if (!ub)
> > + return -EPERM;
> > +
> > + req = __ublk_check_and_get_req(ub, ubq, tag, 0);
> > + if (!req)
> > + return -EINVAL;
> > +
> > + pr_devel("%s: qid %d tag %u request bytes %u\n",
> > + __func__, tag, ubq->q_id, blk_rq_bytes(req));
> > +
> > + data = blk_mq_rq_to_pdu(req);
> > +
> > + /*
> > + * io_uring guarantees that the callback will be called after
> > + * the provided buffer is consumed, and it is automatic removal
> > + * before this uring command is freed.
> > + *
> > + * This request won't be completed unless the callback is called,
> > + * so ublk module won't be unloaded too.
> > + */
> > + return io_uring_cmd_provide_kbuf(cmd, data->buf);
> > +}
>
> We did some testing with this patchset and saw some panics due to
> grp_kbuf_ack being a garbage value. Turns out that's because we forgot
> to set the UBLK_F_SUPPORT_ZERO_COPY flag on the device. But it looks
> like the UBLK_IO_PROVIDE_IO_BUF command is still allowed for such
> devices. Should this function test that the device has zero copy
> configured and fail if it doesn't?
Yeah, it should, thanks for the test & report.
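A minimal sketch of such a check at the top of ublk_provide_io_buf()
(assuming ubq->flags carries the device feature flags as in current
ublk_drv.c; exact placement may differ in the next revision):

	/* reject buffer lease unless the device was set up for zero copy */
	if (!(ubq->flags & UBLK_F_SUPPORT_ZERO_COPY))
		return -EINVAL;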
Thanks,
Ming
^ permalink raw reply [flat|nested] 47+ messages in thread
end of thread, other threads:[~2024-10-18 0:45 UTC | newest]
Thread overview: 47+ messages
2024-09-12 10:49 [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
2024-09-12 10:49 ` [PATCH V6 1/8] io_uring: add io_link_req() helper Ming Lei
2024-09-12 10:49 ` [PATCH V6 2/8] io_uring: add io_submit_fail_link() helper Ming Lei
2024-09-12 10:49 ` [PATCH V6 3/8] io_uring: add helper of io_req_commit_cqe() Ming Lei
2024-09-12 10:49 ` [PATCH V6 4/8] io_uring: support SQE group Ming Lei
2024-10-04 13:12 ` Pavel Begunkov
2024-10-06 3:54 ` Ming Lei
2024-10-09 11:53 ` Pavel Begunkov
2024-10-09 12:14 ` Ming Lei
2024-09-12 10:49 ` [PATCH V6 5/8] io_uring: support sqe group with members depending on leader Ming Lei
2024-10-04 13:18 ` Pavel Begunkov
2024-10-06 3:54 ` Ming Lei
2024-09-12 10:49 ` [PATCH V6 6/8] io_uring: support providing sqe group buffer Ming Lei
2024-10-04 15:32 ` Pavel Begunkov
2024-10-06 8:20 ` Ming Lei
2024-10-09 14:25 ` Pavel Begunkov
2024-10-10 3:00 ` Ming Lei
2024-10-10 18:51 ` Pavel Begunkov
2024-10-11 2:00 ` Ming Lei
2024-10-11 4:06 ` Ming Lei
2024-10-06 9:47 ` Ming Lei
2024-10-09 11:57 ` Pavel Begunkov
2024-10-09 12:21 ` Ming Lei
2024-09-12 10:49 ` [PATCH V6 7/8] io_uring/uring_cmd: support provide group kernel buffer Ming Lei
2024-10-04 15:44 ` Pavel Begunkov
2024-10-06 8:46 ` Ming Lei
2024-10-09 15:14 ` Pavel Begunkov
2024-10-10 3:28 ` Ming Lei
2024-10-10 15:48 ` Pavel Begunkov
2024-10-10 19:31 ` Jens Axboe
2024-10-11 2:30 ` Ming Lei
2024-10-11 2:39 ` Jens Axboe
2024-10-11 3:07 ` Ming Lei
2024-10-11 13:24 ` Jens Axboe
2024-10-11 14:20 ` Ming Lei
2024-10-11 14:41 ` Jens Axboe
2024-10-11 15:45 ` Ming Lei
2024-10-11 16:49 ` Jens Axboe
2024-10-12 3:35 ` Ming Lei
2024-10-14 18:40 ` Pavel Begunkov
2024-10-15 11:05 ` Ming Lei
2024-09-12 10:49 ` [PATCH V6 8/8] ublk: support provide io buffer Ming Lei
2024-10-17 22:31 ` Uday Shankar
2024-10-18 0:45 ` Ming Lei
2024-09-26 10:27 ` [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Ming Lei
2024-09-26 12:18 ` Jens Axboe
2024-09-26 19:46 ` Pavel Begunkov