[RFC 0/7] Rethinking splice

public inbox for [email protected]
 help / color / mirror / Atom feed

* [RFC 0/7] Rethinking splice
@ 2023-04-30  9:35 Pavel Begunkov
  2023-04-30  9:35 ` [RFC 1/7] io_uring: add io_mapped_ubuf caches Pavel Begunkov
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Pavel Begunkov @ 2023-04-30  9:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, ming.lei

IORING_OP_SPLICE has problems, many of them are fundamental and rooted
in the uapi design, see the patch 8 description. This patchset introduces
a different approach, which came from discussions about splices
and fused commands and absorbed ideas from both of them. We remove
reliance onto pipes and registering "spliced" buffers with data as an
io_uring's registered buffer. Then the user can use it as a usual
registered buffer, e.g. pass it to IORING_OP_WRITE_FIXED.

Once a buffer is released, it'll be returned back to the file it
originated from via a callback. It's carried on on the level of the
enitre buffer rather than on per-page basis as with splice, which,
as noted by Ming, will allow more optimisations.

The communication with the target file is done by a new fops callback,
however the end mean of getting a buffer might change. It also peels
layers of code compared to splice requests, which helps it to be more
flexible and support more cases. For instance, Ming has a case where
it's beneficial for the target file to provide a buffer to be filled
with read/recv/etc. requests and then returned back to the file.

Testing:

I was benchmarking using liburing/examples/splice-bench.t [1], which
also needs additional test kernel patches [2]. It implements get-buf
for /dev/null, and the test grabs one page from it and then feeds it
back without any actual IO, then repeats.

fairness:
IORING_OP_SPLICE performs very poorly not even reaching 450K qps, so one
of the patches enables inline execution of it to make it more
interesting but is only fine for testing.

Buffer removal is done by OP_GET_BUF without issuing a separate op
for that. "GET_BUF + nop" emulates the overhead by additional additional
nop requests.

Another aspect is that OP_GET_BUF issues OP_WRITE_FIXED, which, as
profiles show, are quite expensive, which is not exactly a problem of
GET_BUF but skews results. E.g. io_get_buf() - 10.7, io_write() - 24.3%

The last bit is that the buffer removal, if done by a separate request,
might and likely will be batched with other requests, so "GET_BUF + nop"
is rather the worst case.

The numbers below are "requests / s".

QD  | splice2() | OP_SPLICE | OP_GET_BUF | GET_BUF, link | GET_BUF + nop
1   | 5009035   | 3697020   | 3886356    | 4616123       | 2886171
2   | 4859523   | 5205564   | 5309510    | 5591521       | 4139125
4   | 4908353   | 6265771   | 6415036    | 6331249       | 5198505
8   | 4955003   | 7141326   | 7243434    | 6850088       | 5984588
16  | 4959496   | 7640409   | 7794564    | 7208221       | 6587212
32  | 4937463   | 7868501   | 8103406    | 7385890       | 6844390

The test is obviously not exhausting and it should further be tried
and with more complicated cases. E.g. need quantify performance with
sockets, where apoll feature will be involved, and it'll need to get
internal partial IO retry support.

[1] https://github.com/isilence/liburing.git io_uring/get-buf-op
[2] https://github.com/isilence/linux.git io_uring/get-buf-op

Links for convenience:
https://github.com/isilence/liburing/tree/io_uring/get-buf-op
https://github.com/isilence/linux/tree/io_uring/get-buf-op

Pavel Begunkov (7):
  io_uring: add io_mapped_ubuf caches
  io_uring: add reg-buffer data directions
  io_uring: fail loop_rw_iter with pure bvec bufs
  io_uring/rsrc: introduce struct iou_buf_desc
  io_uring/rsrc: add buffer release callbacks
  io_uring/rsrc: introduce helper installing one buffer
  io_uring,fs: introduce IORING_OP_GET_BUF

 include/linux/fs.h             |  2 +
 include/linux/io_uring.h       | 19 +++++++
 include/linux/io_uring_types.h |  2 +
 include/uapi/linux/io_uring.h  |  1 +
 io_uring/io_uring.c            |  9 ++++
 io_uring/opdef.c               | 11 +++++
 io_uring/rsrc.c                | 80 ++++++++++++++++++++++++++----
 io_uring/rsrc.h                | 24 +++++++--
 io_uring/rw.c                  |  7 +++
 io_uring/splice.c              | 90 ++++++++++++++++++++++++++++++++++
 io_uring/splice.h              |  4 ++
 11 files changed, 235 insertions(+), 14 deletions(-)

-- 
2.40.0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC 1/7] io_uring: add io_mapped_ubuf caches
  2023-04-30  9:35 [RFC 0/7] Rethinking splice Pavel Begunkov
@ 2023-04-30  9:35 ` Pavel Begunkov
  2023-04-30  9:35 ` [RFC 2/7] io_uring: add reg-buffer data directions Pavel Begunkov
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2023-04-30  9:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, ming.lei

We'll be allocating lots of io_mapped_ubuf shortly, add caches

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h |  2 ++
 io_uring/io_uring.c            |  9 ++++++++
 io_uring/rsrc.c                | 39 ++++++++++++++++++++++++++++++++--
 io_uring/rsrc.h                |  6 +++++-
 4 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 1b2a20a42413..3d103a00264c 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -336,6 +336,8 @@ struct io_ring_ctx {
 	struct wait_queue_head		rsrc_quiesce_wq;
 	unsigned			rsrc_quiesce;
 
+	struct io_alloc_cache		reg_buf_cache;
+
 	struct list_head		io_buffers_pages;
 
 	#if defined(CONFIG_UNIX)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3d43df8f1e4e..fdd62dbfd0ba 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -312,6 +312,8 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->io_buffers_cache);
 	io_alloc_cache_init(&ctx->rsrc_node_cache, IO_NODE_ALLOC_CACHE_MAX,
 			    sizeof(struct io_rsrc_node));
+	io_alloc_cache_init(&ctx->reg_buf_cache, IO_NODE_ALLOC_CACHE_MAX,
+			    sizeof(struct io_async_msghdr));
 	io_alloc_cache_init(&ctx->apoll_cache, IO_ALLOC_CACHE_MAX,
 			    sizeof(struct async_poll));
 	io_alloc_cache_init(&ctx->netmsg_cache, IO_ALLOC_CACHE_MAX,
@@ -2827,6 +2829,11 @@ static void io_rsrc_node_cache_free(struct io_cache_entry *entry)
 	kfree(container_of(entry, struct io_rsrc_node, cache));
 }
 
+static void io_reg_buf_cache_free(struct io_cache_entry *entry)
+{
+	kvfree(container_of(entry, struct io_mapped_ubuf, cache));
+}
+
 static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
 	io_sq_thread_finish(ctx);
@@ -2865,6 +2872,8 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
 
 	io_alloc_cache_free(&ctx->rsrc_node_cache, io_rsrc_node_cache_free);
+	io_alloc_cache_free(&ctx->reg_buf_cache, io_reg_buf_cache_free);
+
 	if (ctx->mm_account) {
 		mmdrop(ctx->mm_account);
 		ctx->mm_account = NULL;
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index ddee7adb4006..fef94f8d788d 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -33,6 +33,8 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 #define IORING_MAX_FIXED_FILES	(1U << 20)
 #define IORING_MAX_REG_BUFFERS	(1U << 14)
 
+#define IO_BUF_CACHE_MAX_BVECS	64
+
 int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
 {
 	unsigned long page_limit, cur_pages, new_pages;
@@ -78,6 +80,39 @@ static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
 	return 0;
 }
 
+static void io_put_reg_buf(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
+{
+	lockdep_assert_held(&ctx->uring_lock);
+
+	if ((imu->max_bvecs != IO_BUF_CACHE_MAX_BVECS) ||
+	    !io_alloc_cache_put(&ctx->reg_buf_cache, &imu->cache))
+		kvfree(imu);
+}
+
+static struct io_mapped_ubuf *io_alloc_reg_buf(struct io_ring_ctx *ctx,
+					       int nr_bvecs)
+{
+	struct io_cache_entry *entry;
+	struct io_mapped_ubuf *imu;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	if (nr_bvecs > IO_BUF_CACHE_MAX_BVECS) {
+do_alloc:
+		imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
+		if (!imu)
+			return NULL;
+	} else {
+		nr_bvecs = IO_BUF_CACHE_MAX_BVECS;
+		entry = io_alloc_cache_get(&ctx->reg_buf_cache);
+		if (!entry)
+			goto do_alloc;
+		imu = container_of(entry, struct io_mapped_ubuf, cache);
+	}
+	imu->max_bvecs = nr_bvecs;
+	return imu;
+}
+
 static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
 		       void __user *arg, unsigned index)
 {
@@ -137,7 +172,7 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf **slo
 			unpin_user_page(imu->bvec[i].bv_page);
 		if (imu->acct_pages)
 			io_unaccount_mem(ctx, imu->acct_pages);
-		kvfree(imu);
+		io_put_reg_buf(ctx, imu);
 	}
 	*slot = NULL;
 }
@@ -1134,7 +1169,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 		}
 	}
 
-	imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
+	imu = io_alloc_reg_buf(ctx, nr_pages);
 	if (!imu)
 		goto done;
 
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 0a8a95e9b99e..f34de451a79a 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -50,9 +50,13 @@ struct io_rsrc_node {
 };
 
 struct io_mapped_ubuf {
-	u64		ubuf;
+	union {
+		struct io_cache_entry		cache;
+		u64				ubuf;
+	};
 	u64		ubuf_end;
 	unsigned int	nr_bvecs;
+	unsigned int	max_bvecs;
 	unsigned long	acct_pages;
 	struct bio_vec	bvec[];
 };
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC 2/7] io_uring: add reg-buffer data directions
  2023-04-30  9:35 [RFC 0/7] Rethinking splice Pavel Begunkov
  2023-04-30  9:35 ` [RFC 1/7] io_uring: add io_mapped_ubuf caches Pavel Begunkov
@ 2023-04-30  9:35 ` Pavel Begunkov
  2023-04-30  9:35 ` [RFC 3/7] io_uring: fail loop_rw_iter with pure bvec bufs Pavel Begunkov
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2023-04-30  9:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, ming.lei

There will be buffers that only allow reading from or writing to it, so
add data directions, and check it when importing a buffer.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/rsrc.c | 5 +++++
 io_uring/rsrc.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index fef94f8d788d..b6305ae3538c 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1185,6 +1185,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 	imu->ubuf = (unsigned long) iov->iov_base;
 	imu->ubuf_end = imu->ubuf + iov->iov_len;
 	imu->nr_bvecs = nr_pages;
+	imu->dir_mask = (1U << ITER_SOURCE) | (1U << ITER_DEST);
 	*pimu = imu;
 	ret = 0;
 
@@ -1274,6 +1275,8 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
 	u64 buf_end;
 	size_t offset;
 
+	BUILD_BUG_ON((1U << ITER_SOURCE) & (1U << ITER_DEST));
+
 	if (WARN_ON_ONCE(!imu))
 		return -EFAULT;
 	if (unlikely(check_add_overflow(buf_addr, (u64)len, &buf_end)))
@@ -1281,6 +1284,8 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
 	/* not inside the mapped region */
 	if (unlikely(buf_addr < imu->ubuf || buf_end > imu->ubuf_end))
 		return -EFAULT;
+	if (unlikely(!((1U << ddir) & imu->dir_mask)))
+		return -EFAULT;
 
 	/*
 	 * Might not be a start of buffer, set size appropriately
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index f34de451a79a..10daa25d9194 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -57,6 +57,7 @@ struct io_mapped_ubuf {
 	u64		ubuf_end;
 	unsigned int	nr_bvecs;
 	unsigned int	max_bvecs;
+	unsigned int	dir_mask;
 	unsigned long	acct_pages;
 	struct bio_vec	bvec[];
 };
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC 3/7] io_uring: fail loop_rw_iter with pure bvec bufs
  2023-04-30  9:35 [RFC 0/7] Rethinking splice Pavel Begunkov
  2023-04-30  9:35 ` [RFC 1/7] io_uring: add io_mapped_ubuf caches Pavel Begunkov
  2023-04-30  9:35 ` [RFC 2/7] io_uring: add reg-buffer data directions Pavel Begunkov
@ 2023-04-30  9:35 ` Pavel Begunkov
  2023-04-30  9:35 ` [RFC 4/7] io_uring/rsrc: introduce struct iou_buf_desc Pavel Begunkov
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2023-04-30  9:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, ming.lei

There will be registered buffers that have never had a userspace mapping
and to use them the file have to work with iterators. Fail
loop_rw_iter() if it meets such a buffer.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/rw.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 6c7d2654770e..b2ad99e0e304 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -428,11 +428,18 @@ static inline loff_t *io_kiocb_ppos(struct kiocb *kiocb)
  */
 static ssize_t loop_rw_iter(int ddir, struct io_rw *rw, struct iov_iter *iter)
 {
+	struct io_kiocb *req = cmd_to_io_kiocb(rw);
 	struct kiocb *kiocb = &rw->kiocb;
 	struct file *file = kiocb->ki_filp;
 	ssize_t ret = 0;
 	loff_t *ppos;
 
+	if (req->opcode == IORING_OP_READ_FIXED ||
+	    req->opcode == IORING_OP_WRITE_FIXED) {
+		if (!req->imu->ubuf)
+			return -EFAULT;
+	}
+
 	/*
 	 * Don't support polled IO through this interface, and we can't
 	 * support non-blocking either. For the latter, this just causes
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC 4/7] io_uring/rsrc: introduce struct iou_buf_desc
  2023-04-30  9:35 [RFC 0/7] Rethinking splice Pavel Begunkov
                   ` (2 preceding siblings ...)
  2023-04-30  9:35 ` [RFC 3/7] io_uring: fail loop_rw_iter with pure bvec bufs Pavel Begunkov
@ 2023-04-30  9:35 ` Pavel Begunkov
  2023-04-30  9:35 ` [RFC 5/7] io_uring/rsrc: add buffer release callbacks Pavel Begunkov
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2023-04-30  9:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, ming.lei

Add struct iou_buf_desc, which will be used for new get_buf operations.
It'll be handed over to a file with via new operation to be filled.
After the content should eventually end up in struct io_mapped_ubuf,
and so to not make extra copies just place the descriptor inside struct
io_mapped_ubuf.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring.h |  6 ++++++
 io_uring/rsrc.c          | 13 +++++++------
 io_uring/rsrc.h          | 11 +++++------
 3 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 35b9328ca335..fddb5d52b776 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -22,6 +22,12 @@ enum io_uring_cmd_flags {
 	IO_URING_F_IOPOLL		= (1 << 10),
 };
 
+struct iou_buf_desc {
+	unsigned		nr_bvecs;
+	unsigned		max_bvecs;
+	struct bio_vec		*bvec;
+};
+
 struct io_uring_cmd {
 	struct file	*file;
 	const void	*cmd;
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index b6305ae3538c..0edcebb6b5cb 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -84,7 +84,7 @@ static void io_put_reg_buf(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
 {
 	lockdep_assert_held(&ctx->uring_lock);
 
-	if ((imu->max_bvecs != IO_BUF_CACHE_MAX_BVECS) ||
+	if ((imu->desc.max_bvecs != IO_BUF_CACHE_MAX_BVECS) ||
 	    !io_alloc_cache_put(&ctx->reg_buf_cache, &imu->cache))
 		kvfree(imu);
 }
@@ -109,7 +109,8 @@ static struct io_mapped_ubuf *io_alloc_reg_buf(struct io_ring_ctx *ctx,
 			goto do_alloc;
 		imu = container_of(entry, struct io_mapped_ubuf, cache);
 	}
-	imu->max_bvecs = nr_bvecs;
+	imu->desc.bvec = imu->bvec;
+	imu->desc.max_bvecs = nr_bvecs;
 	return imu;
 }
 
@@ -168,7 +169,7 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf **slo
 	unsigned int i;
 
 	if (imu != ctx->dummy_ubuf) {
-		for (i = 0; i < imu->nr_bvecs; i++)
+		for (i = 0; i < imu->desc.nr_bvecs; i++)
 			unpin_user_page(imu->bvec[i].bv_page);
 		if (imu->acct_pages)
 			io_unaccount_mem(ctx, imu->acct_pages);
@@ -1020,7 +1021,7 @@ static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
 	for (i = 0; i < ctx->nr_user_bufs; i++) {
 		struct io_mapped_ubuf *imu = ctx->user_bufs[i];
 
-		for (j = 0; j < imu->nr_bvecs; j++) {
+		for (j = 0; j < imu->desc.nr_bvecs; j++) {
 			if (!PageCompound(imu->bvec[j].bv_page))
 				continue;
 			if (compound_head(imu->bvec[j].bv_page) == hpage)
@@ -1184,7 +1185,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
 	/* store original address for later verification */
 	imu->ubuf = (unsigned long) iov->iov_base;
 	imu->ubuf_end = imu->ubuf + iov->iov_len;
-	imu->nr_bvecs = nr_pages;
+	imu->desc.nr_bvecs = nr_pages;
 	imu->dir_mask = (1U << ITER_SOURCE) | (1U << ITER_DEST);
 	*pimu = imu;
 	ret = 0;
@@ -1292,7 +1293,7 @@ int io_import_fixed(int ddir, struct iov_iter *iter,
 	 * and advance us to the beginning.
 	 */
 	offset = buf_addr - imu->ubuf;
-	iov_iter_bvec(iter, ddir, imu->bvec, imu->nr_bvecs, offset + len);
+	iov_iter_bvec(iter, ddir, imu->bvec, imu->desc.nr_bvecs, offset + len);
 
 	if (offset) {
 		/*
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 10daa25d9194..9ac10b3d25ac 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -54,12 +54,11 @@ struct io_mapped_ubuf {
 		struct io_cache_entry		cache;
 		u64				ubuf;
 	};
-	u64		ubuf_end;
-	unsigned int	nr_bvecs;
-	unsigned int	max_bvecs;
-	unsigned int	dir_mask;
-	unsigned long	acct_pages;
-	struct bio_vec	bvec[];
+	u64			ubuf_end;
+	struct iou_buf_desc	desc;
+	unsigned int		dir_mask;
+	unsigned long		acct_pages;
+	struct bio_vec		bvec[];
 };
 
 void io_rsrc_put_tw(struct callback_head *cb);
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC 5/7] io_uring/rsrc: add buffer release callbacks
  2023-04-30  9:35 [RFC 0/7] Rethinking splice Pavel Begunkov
                   ` (3 preceding siblings ...)
  2023-04-30  9:35 ` [RFC 4/7] io_uring/rsrc: introduce struct iou_buf_desc Pavel Begunkov
@ 2023-04-30  9:35 ` Pavel Begunkov
  2023-04-30  9:35 ` [RFC 6/7] io_uring/rsrc: introduce helper installing one buffer Pavel Begunkov
  2023-04-30  9:35 ` [RFC 7/7] io_uring,fs: introduce IORING_OP_GET_BUF Pavel Begunkov
  6 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2023-04-30  9:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, ming.lei

With other buffer types we may want to have a different way of releasing
them. For example, we may want to put pages in some kind of cache or
bulk put them. Add a release callback.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring.h |  2 ++
 io_uring/rsrc.c          | 14 ++++++++++----
 io_uring/rsrc.h          |  5 +++++
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index fddb5d52b776..e0e7df5beefc 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -26,6 +26,8 @@ struct iou_buf_desc {
 	unsigned		nr_bvecs;
 	unsigned		max_bvecs;
 	struct bio_vec		*bvec;
+	void			(*release)(struct iou_buf_desc *);
+	void			*private;
 };
 
 struct io_uring_cmd {
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 0edcebb6b5cb..3799470fd45e 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -111,6 +111,8 @@ static struct io_mapped_ubuf *io_alloc_reg_buf(struct io_ring_ctx *ctx,
 	}
 	imu->desc.bvec = imu->bvec;
 	imu->desc.max_bvecs = nr_bvecs;
+	imu->desc.private = NULL;
+	imu->desc.release = NULL;
 	return imu;
 }
 
@@ -169,10 +171,14 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf **slo
 	unsigned int i;
 
 	if (imu != ctx->dummy_ubuf) {
-		for (i = 0; i < imu->desc.nr_bvecs; i++)
-			unpin_user_page(imu->bvec[i].bv_page);
-		if (imu->acct_pages)
-			io_unaccount_mem(ctx, imu->acct_pages);
+		if (imu->desc.release) {
+			io_reg_buf_release(imu);
+		} else {
+			for (i = 0; i < imu->desc.nr_bvecs; i++)
+				unpin_user_page(imu->bvec[i].bv_page);
+			if (imu->acct_pages)
+				io_unaccount_mem(ctx, imu->acct_pages);
+		}
 		io_put_reg_buf(ctx, imu);
 	}
 	*slot = NULL;
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 9ac10b3d25ac..29ce9a8a2277 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -169,4 +169,9 @@ static inline void __io_unaccount_mem(struct user_struct *user,
 	atomic_long_sub(nr_pages, &user->locked_vm);
 }
 
+static inline void io_reg_buf_release(struct io_mapped_ubuf *imu)
+{
+	imu->desc.release(&imu->desc);
+}
+
 #endif
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC 6/7] io_uring/rsrc: introduce helper installing one buffer
  2023-04-30  9:35 [RFC 0/7] Rethinking splice Pavel Begunkov
                   ` (4 preceding siblings ...)
  2023-04-30  9:35 ` [RFC 5/7] io_uring/rsrc: add buffer release callbacks Pavel Begunkov
@ 2023-04-30  9:35 ` Pavel Begunkov
  2023-04-30  9:35 ` [RFC 7/7] io_uring,fs: introduce IORING_OP_GET_BUF Pavel Begunkov
  6 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2023-04-30  9:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, ming.lei

Add a new helper called io_install_buffer(), which will be used
later for operations willing to install buffers into the registered
buffer table.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/rsrc.c | 15 +++++++++++++++
 io_uring/rsrc.h |  3 +++
 2 files changed, 18 insertions(+)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 3799470fd45e..db4286b42dce 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -517,6 +517,21 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
 	return done ? done : err;
 }
 
+int io_install_buffer(struct io_ring_ctx *ctx,
+		      struct io_mapped_ubuf *imu,
+		      unsigned i)
+{
+	if (unlikely(i >= ctx->nr_user_bufs))
+		return -EFAULT;
+
+	i = array_index_nospec(i, ctx->nr_user_bufs);
+	if (unlikely(ctx->user_bufs[i] != ctx->dummy_ubuf))
+		return -EINVAL;
+
+	ctx->user_bufs[i] = imu;
+	return 0;
+}
+
 static int __io_register_rsrc_update(struct io_ring_ctx *ctx, unsigned type,
 				     struct io_uring_rsrc_update2 *up,
 				     unsigned nr_args)
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 29ce9a8a2277..aba95bdd060e 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -75,6 +75,9 @@ void __io_sqe_buffers_unregister(struct io_ring_ctx *ctx);
 int io_sqe_buffers_unregister(struct io_ring_ctx *ctx);
 int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 			    unsigned int nr_args, u64 __user *tags);
+int io_install_buffer(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu,
+		      unsigned i);
+
 void __io_sqe_files_unregister(struct io_ring_ctx *ctx);
 int io_sqe_files_unregister(struct io_ring_ctx *ctx);
 int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC 7/7] io_uring,fs: introduce IORING_OP_GET_BUF
  2023-04-30  9:35 [RFC 0/7] Rethinking splice Pavel Begunkov
                   ` (5 preceding siblings ...)
  2023-04-30  9:35 ` [RFC 6/7] io_uring/rsrc: introduce helper installing one buffer Pavel Begunkov
@ 2023-04-30  9:35 ` Pavel Begunkov
  2023-05-02 14:57   ` Ming Lei
  6 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2023-04-30  9:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, ming.lei

There are several problems with splice requests, aka IORING_OP_SPLICE:
1) They are always executed by a worker thread, which is a slow path,
   as we don't have any reliable way to execute it NOWAIT.
2) It can't easily poll for data, as there are 2 files it operates on.
   It would either need to track what file to poll or poll both of them,
   in both cases it'll be a mess and add lot of overhead.
3) It has to have pipes in the middle, which adds overhead and is not
   great from the uapi design perspective when it goes for io_uring
   requests.
4) We want to operate with spliced data as with a normal buffer, i.e.
   write / send / etc. data as normally while it's zerocopy.

It can partially be solved, but the root cause is a suboptimal for
io_uring design of IORING_OP_SPLICE. Introduce a new request type
called IORING_OP_GET_BUF, inspired by splice(2) as well as other
proposals like fused requests. The main idea is to use io_uring's
registered buffers as the middle man instead of pipes. Once a buffer
is fetched / spliced from a file using a new fops callback
->iou_get_buf, it's installed as a registered buffers and can be used
by all operations supporting the feature.

Once the userspace releases the buffer, io_uring will wait for all
requests using the buffer to complete and then use a file provided
callback ->release() to return the buffer back. It operates on the
level of the entire buffer instead of individual pages like it's with
splice(2). As it was noted by the fused cmd work from where it came,
this approach should be more flexible and efficient, and also leaves
the space for more optimisations like custom caching or avoiding page
refcounting altogether.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/fs.h            |  2 +
 include/linux/io_uring.h      | 11 +++++
 include/uapi/linux/io_uring.h |  1 +
 io_uring/opdef.c              | 11 +++++
 io_uring/rsrc.c               |  2 +-
 io_uring/rsrc.h               |  2 +
 io_uring/splice.c             | 90 +++++++++++++++++++++++++++++++++++
 io_uring/splice.h             |  4 ++
 8 files changed, 122 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 475d88640d3d..a2528a39571f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1753,6 +1753,7 @@ struct dir_context {
 
 struct iov_iter;
 struct io_uring_cmd;
+struct iou_get_buf_info;
 
 struct file_operations {
 	struct module *owner;
@@ -1798,6 +1799,7 @@ struct file_operations {
 	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
 	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
 				unsigned int poll_flags);
+	int (*iou_get_buf)(struct file *file, struct iou_get_buf_info *);
 } __randomize_layout;
 
 struct inode_operations {
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index e0e7df5beefc..9564db555bab 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -30,6 +30,17 @@ struct iou_buf_desc {
 	void			*private;
 };
 
+enum {
+	IOU_GET_BUF_F_NOWAIT	= 1,
+};
+
+struct iou_get_buf_info {
+	loff_t			off;
+	size_t			len;
+	unsigned		flags;
+	struct iou_buf_desc	*desc;
+};
+
 struct io_uring_cmd {
 	struct file	*file;
 	const void	*cmd;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 0716cb17e436..b244215d03ad 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -223,6 +223,7 @@ enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_GET_BUF,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index cca7c5b55208..d3b7144c685a 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -428,6 +428,13 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_GET_BUF] = {
+		.needs_file		= 1,
+		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
+		.prep			= io_get_buf_prep,
+		.issue			= io_get_buf,
+	},
 };
 
 
@@ -648,6 +655,10 @@ const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_GET_BUF] = {
+		.name			= "IORING_OP_GET_BUF",
+		.cleanup		= io_get_buf_cleanup,
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index db4286b42dce..bdcd417bca87 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -89,7 +89,7 @@ static void io_put_reg_buf(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
 		kvfree(imu);
 }
 
-static struct io_mapped_ubuf *io_alloc_reg_buf(struct io_ring_ctx *ctx,
+struct io_mapped_ubuf *io_alloc_reg_buf(struct io_ring_ctx *ctx,
 					       int nr_bvecs)
 {
 	struct io_cache_entry *entry;
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index aba95bdd060e..6aaf7acb60c5 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -177,4 +177,6 @@ static inline void io_reg_buf_release(struct io_mapped_ubuf *imu)
 	imu->desc.release(&imu->desc);
 }
 
+struct io_mapped_ubuf *io_alloc_reg_buf(struct io_ring_ctx *ctx, int nr_bvecs);
+
 #endif
diff --git a/io_uring/splice.c b/io_uring/splice.c
index 2a4bbb719531..3d50334caec5 100644
--- a/io_uring/splice.c
+++ b/io_uring/splice.c
@@ -13,6 +13,7 @@
 
 #include "io_uring.h"
 #include "splice.h"
+#include "rsrc.h"
 
 struct io_splice {
 	struct file			*file_out;
@@ -119,3 +120,92 @@ int io_splice(struct io_kiocb *req, unsigned int issue_flags)
 	io_req_set_res(req, ret, 0);
 	return IOU_OK;
 }
+
+struct io_get_buf {
+	struct file			*file;
+	struct io_mapped_ubuf		*imu;
+	int				max_pages;
+	loff_t				off;
+	u64				len;
+};
+
+void io_get_buf_cleanup(struct io_kiocb *req)
+{
+	struct io_get_buf *gb = io_kiocb_to_cmd(req, struct io_get_buf);
+	struct io_mapped_ubuf *imu = gb->imu;
+
+	if (!imu)
+		return;
+	if (imu->desc.nr_bvecs && !WARN_ON_ONCE(!imu->desc.release))
+		io_reg_buf_release(imu);
+
+	kvfree(imu);
+	gb->imu = NULL;
+}
+
+int io_get_buf_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_get_buf *gb = io_kiocb_to_cmd(req, struct io_get_buf);
+	struct io_mapped_ubuf *imu;
+	int nr_pages;
+
+	if (unlikely(sqe->splice_flags || sqe->splice_fd_in || sqe->ioprio ||
+		     sqe->addr || sqe->addr3))
+		return -EINVAL;
+
+	req->buf_index = READ_ONCE(sqe->buf_index);
+	gb->len = READ_ONCE(sqe->len);
+	gb->off = READ_ONCE(sqe->off);
+	nr_pages = (gb->len >> PAGE_SHIFT) + 2;
+	gb->max_pages = nr_pages;
+
+	gb->imu = imu = io_alloc_reg_buf(req->ctx, nr_pages);
+	if (!imu)
+		return -ENOMEM;
+	imu->desc.nr_bvecs = 0;
+	req->flags |= REQ_F_NEED_CLEANUP;
+	return 0;
+}
+
+int io_get_buf(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_get_buf *gb = io_kiocb_to_cmd(req, struct io_get_buf);
+	struct io_mapped_ubuf *imu = gb->imu;
+	struct iou_get_buf_info bi;
+	int ret, err;
+
+	bi.off = gb->off;
+	bi.len = gb->len;
+	bi.flags = (issue_flags & IO_URING_F_NONBLOCK) ? IOU_GET_BUF_F_NOWAIT : 0;
+	bi.desc = &imu->desc;
+
+	if (!gb->file->f_op->iou_get_buf)
+		return -ENOTSUPP;
+	ret = gb->file->f_op->iou_get_buf(gb->file, &bi);
+	if (ret < 0) {
+		if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK))
+			return -EAGAIN;
+		goto done;
+	}
+
+	imu->ubuf = 0;
+	imu->ubuf_end = ret;
+	imu->dir_mask = 1U << ITER_SOURCE;
+	imu->acct_pages = 0;
+
+	io_ring_submit_lock(req->ctx, issue_flags);
+	err = io_install_buffer(req->ctx, imu, req->buf_index);
+	io_ring_submit_unlock(req->ctx, issue_flags);
+	if (unlikely(err)) {
+		ret = err;
+		goto done;
+	}
+
+	gb->imu = NULL;
+	req->flags &= ~REQ_F_NEED_CLEANUP;
+done:
+	if (ret != gb->len)
+		req_set_fail(req);
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
diff --git a/io_uring/splice.h b/io_uring/splice.h
index 542f94168ad3..2b923fc2bbf1 100644
--- a/io_uring/splice.h
+++ b/io_uring/splice.h
@@ -5,3 +5,7 @@ int io_tee(struct io_kiocb *req, unsigned int issue_flags);
 
 int io_splice_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_splice(struct io_kiocb *req, unsigned int issue_flags);
+
+int io_get_buf_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_get_buf(struct io_kiocb *req, unsigned int issue_flags);
+void io_get_buf_cleanup(struct io_kiocb *req);
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC 7/7] io_uring,fs: introduce IORING_OP_GET_BUF
  2023-04-30  9:35 ` [RFC 7/7] io_uring,fs: introduce IORING_OP_GET_BUF Pavel Begunkov
@ 2023-05-02 14:57   ` Ming Lei
  2023-05-02 15:20     ` Ming Lei
  2023-05-03 14:54     ` Pavel Begunkov
  0 siblings, 2 replies; 14+ messages in thread
From: Ming Lei @ 2023-05-02 14:57 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, ming.lei

On Sun, Apr 30, 2023 at 10:35:29AM +0100, Pavel Begunkov wrote:
> There are several problems with splice requests, aka IORING_OP_SPLICE:
> 1) They are always executed by a worker thread, which is a slow path,
>    as we don't have any reliable way to execute it NOWAIT.
> 2) It can't easily poll for data, as there are 2 files it operates on.
>    It would either need to track what file to poll or poll both of them,
>    in both cases it'll be a mess and add lot of overhead.
> 3) It has to have pipes in the middle, which adds overhead and is not
>    great from the uapi design perspective when it goes for io_uring
>    requests.
> 4) We want to operate with spliced data as with a normal buffer, i.e.
>    write / send / etc. data as normally while it's zerocopy.
> 
> It can partially be solved, but the root cause is a suboptimal for
> io_uring design of IORING_OP_SPLICE. Introduce a new request type
> called IORING_OP_GET_BUF, inspired by splice(2) as well as other
> proposals like fused requests. The main idea is to use io_uring's
> registered buffers as the middle man instead of pipes. Once a buffer
> is fetched / spliced from a file using a new fops callback
> ->iou_get_buf, it's installed as a registered buffers and can be used
> by all operations supporting the feature.
> 
> Once the userspace releases the buffer, io_uring will wait for all
> requests using the buffer to complete and then use a file provided
> callback ->release() to return the buffer back. It operates on the

In the commit of "io_uring: add an example for buf-get op", I don't see
any code to release the buffer, can you explain it in details about how
to release the buffer in userspace? And add it in your example?

Here I guess the ->release() is called in the following code path:

io_buffer_unmap
    io_rsrc_buf_put
        io_rsrc_put_work
            io_rsrc_node_ref_zero
                io_put_rsrc_node

If it is true, what is counter-pair code for io_put_rsrc_node()?
So far, only see io_req_set_rsrc_node() is called from
io_file_get_fixed(), is it needed for consumer OP of the buffer?

Also io_buffer_unmap() is called after io_rsrc_node's reference drops
to zero, which means ->release() isn't called after all its consumer(s)
are done given io_rsrc_node is shared by in-flight requests. If it is
true, this way will increase buffer lifetime a lot.

ublk zero copy needs to call ->release() immediately after all
consumers are done, because the ublk disk request won't be completed
until the buffer is released(the buffer actually belongs to ublk block request).

Also the usage in liburing example needs two extra syscall(io_uring_enter) for
handling one IO, not take into account the "release OP". IMO, this way makes
application more complicated, also might perform worse:

1) for ublk zero copy, the original IO just needs one OP, but now it
takes three OPs, so application has to take coroutine for applying
3 stages batch submission(GET_BUF, IO, release buffer) since IO_LINK can't
or not suggested to be used. In case of low QD, batch size is reduced much,
and performance may hurt because IOs/syscall is 1/3 of fused command.

2) GET_BUF OP is separated from the consumer OP, this way may cause
more cache miss, and I know this way is for avoiding IO_LINK.

I'd understand the approach first before using it to implement ublk zero copy
and comparing its performance with fused command.

Thanks, 
Ming

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 7/7] io_uring,fs: introduce IORING_OP_GET_BUF
  2023-05-02 14:57   ` Ming Lei
@ 2023-05-02 15:20     ` Ming Lei
  2023-05-03 14:54     ` Pavel Begunkov
  1 sibling, 0 replies; 14+ messages in thread
From: Ming Lei @ 2023-05-02 15:20 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, ming.lei

On Tue, May 02, 2023 at 10:57:30PM +0800, Ming Lei wrote:
> On Sun, Apr 30, 2023 at 10:35:29AM +0100, Pavel Begunkov wrote:
> > There are several problems with splice requests, aka IORING_OP_SPLICE:
> > 1) They are always executed by a worker thread, which is a slow path,
> >    as we don't have any reliable way to execute it NOWAIT.
> > 2) It can't easily poll for data, as there are 2 files it operates on.
> >    It would either need to track what file to poll or poll both of them,
> >    in both cases it'll be a mess and add lot of overhead.
> > 3) It has to have pipes in the middle, which adds overhead and is not
> >    great from the uapi design perspective when it goes for io_uring
> >    requests.
> > 4) We want to operate with spliced data as with a normal buffer, i.e.
> >    write / send / etc. data as normally while it's zerocopy.
> > 
> > It can partially be solved, but the root cause is a suboptimal for
> > io_uring design of IORING_OP_SPLICE. Introduce a new request type
> > called IORING_OP_GET_BUF, inspired by splice(2) as well as other
> > proposals like fused requests. The main idea is to use io_uring's
> > registered buffers as the middle man instead of pipes. Once a buffer
> > is fetched / spliced from a file using a new fops callback
> > ->iou_get_buf, it's installed as a registered buffers and can be used
> > by all operations supporting the feature.
> > 
> > Once the userspace releases the buffer, io_uring will wait for all
> > requests using the buffer to complete and then use a file provided
> > callback ->release() to return the buffer back. It operates on the
> 
> In the commit of "io_uring: add an example for buf-get op", I don't see
> any code to release the buffer, can you explain it in details about how
> to release the buffer in userspace? And add it in your example?
> 
> Here I guess the ->release() is called in the following code path:
> 
> io_buffer_unmap
>     io_rsrc_buf_put
>         io_rsrc_put_work
>             io_rsrc_node_ref_zero
>                 io_put_rsrc_node
> 
> If it is true, what is counter-pair code for io_put_rsrc_node()?

It is io_req_set_rsrc_node() called from io_prep_rw() and
io_send_zc_prep().

So only fs IO and net zc supports it, however the fixed buffer
is used in ->prep(), that means this way can't work when the
rw/net_send_zc OP follows IORING_OP_GET_BUF by IO_LINK.

This way is one big defect because IOSQE_IO_LINK does simplify
application a lot.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 7/7] io_uring,fs: introduce IORING_OP_GET_BUF
  2023-05-02 14:57   ` Ming Lei
  2023-05-02 15:20     ` Ming Lei
@ 2023-05-03 14:54     ` Pavel Begunkov
  2023-05-04  2:06       ` Ming Lei
  1 sibling, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2023-05-03 14:54 UTC (permalink / raw)
  To: Ming Lei; +Cc: io-uring

On 5/2/23 15:57, Ming Lei wrote:
> On Sun, Apr 30, 2023 at 10:35:29AM +0100, Pavel Begunkov wrote:
>> There are several problems with splice requests, aka IORING_OP_SPLICE:
>> 1) They are always executed by a worker thread, which is a slow path,
>>     as we don't have any reliable way to execute it NOWAIT.
>> 2) It can't easily poll for data, as there are 2 files it operates on.
>>     It would either need to track what file to poll or poll both of them,
>>     in both cases it'll be a mess and add lot of overhead.
>> 3) It has to have pipes in the middle, which adds overhead and is not
>>     great from the uapi design perspective when it goes for io_uring
>>     requests.
>> 4) We want to operate with spliced data as with a normal buffer, i.e.
>>     write / send / etc. data as normally while it's zerocopy.
>>
>> It can partially be solved, but the root cause is a suboptimal for
>> io_uring design of IORING_OP_SPLICE. Introduce a new request type
>> called IORING_OP_GET_BUF, inspired by splice(2) as well as other
>> proposals like fused requests. The main idea is to use io_uring's
>> registered buffers as the middle man instead of pipes. Once a buffer
>> is fetched / spliced from a file using a new fops callback
>> ->iou_get_buf, it's installed as a registered buffers and can be used
>> by all operations supporting the feature.
>>
>> Once the userspace releases the buffer, io_uring will wait for all
>> requests using the buffer to complete and then use a file provided
>> callback ->release() to return the buffer back. It operates on the
> 
> In the commit of "io_uring: add an example for buf-get op", I don't see
> any code to release the buffer, can you explain it in details about how
> to release the buffer in userspace? And add it in your example?

Sure, we need to add buf updates via request.

Particularly, in this RFC, the removal from the table was happening
in io_install_buffer() by one of the test-only patches, the "remove
previous entry on update" style as it's with files. Then it's
released with the last ref put, either on removal with a request
like:

io_free_batch_list()
      io_req_put_rsrc_locked()
          ...

> Here I guess the ->release() is called in the following code path:
> 
> io_buffer_unmap
>      io_rsrc_buf_put
>          io_rsrc_put_work
>              io_rsrc_node_ref_zero
>                  io_put_rsrc_node
> 
> If it is true, what is counter-pair code for io_put_rsrc_node()?
> So far, only see io_req_set_rsrc_node() is called from
> io_file_get_fixed(), is it needed for consumer OP of the buffer?
> 
> Also io_buffer_unmap() is called after io_rsrc_node's reference drops
> to zero, which means ->release() isn't called after all its consumer(s)
> are done given io_rsrc_node is shared by in-flight requests. If it is
> true, this way will increase buffer lifetime a lot.

That's true. It's not a new downside, so might make more sense
to do counting per rsrc (file, buffer), which is not so bad for
now, but would be a bit concerning if we grow the number of rsrc
types.

> ublk zero copy needs to call ->release() immediately after all
> consumers are done, because the ublk disk request won't be completed
> until the buffer is released(the buffer actually belongs to ublk block request).
> 
> Also the usage in liburing example needs two extra syscall(io_uring_enter) for
> handling one IO, not take into account the "release OP". IMO, this way makes

Something is amiss here. It's 3 requests, which means 3 syscalls
if you send requests separately (each step can be batch more
requests), or 1 syscall if you link them together. There is an
example using links for 2 requests in the test case.

> application more complicated, also might perform worse:
> 
> 1) for ublk zero copy, the original IO just needs one OP, but now it
> takes three OPs, so application has to take coroutine for applying

Perhaps, you mean two requests for fused, IORING_OP_FUSED_CMD + IO
request, vs three for IORING_OP_GET_BUF. There might be some sort of
auto-remove on use, making it two requests, but that seems a bit ugly.

> 3 stages batch submission(GET_BUF, IO, release buffer) since IO_LINK can't
> or not suggested to be used. In case of low QD, batch size is reduced much,
> and performance may hurt because IOs/syscall is 1/3 of fused command.

I'm not a big fan of links for their inflexibility, but it can be
used. The point is rather it's better not to be the only way to
use the feature as we may need to stop in the middle, return
control to the userspace and let it handle errors, do data processing
and so on. The latter may need a partial memcpy() into the userspace,
e.g. copy a handful bytes of headers to decide what to do with the
rest of data.

I deem fused cmds to be a variant of linking, so it's rather with
it you link 2 requests vs optionally linking 3 with this patchset.

> 2) GET_BUF OP is separated from the consumer OP, this way may cause
> more cache miss, and I know this way is for avoiding IO_LINK.
> 
> I'd understand the approach first before using it to implement ublk zero copy
> and comparing its performance with fused command.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 7/7] io_uring,fs: introduce IORING_OP_GET_BUF
  2023-05-03 14:54     ` Pavel Begunkov
@ 2023-05-04  2:06       ` Ming Lei
  2023-05-08  2:30         ` Pavel Begunkov
  0 siblings, 1 reply; 14+ messages in thread
From: Ming Lei @ 2023-05-04  2:06 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, ming.lei

On Wed, May 03, 2023 at 03:54:02PM +0100, Pavel Begunkov wrote:
> On 5/2/23 15:57, Ming Lei wrote:
> > On Sun, Apr 30, 2023 at 10:35:29AM +0100, Pavel Begunkov wrote:
> > > There are several problems with splice requests, aka IORING_OP_SPLICE:
> > > 1) They are always executed by a worker thread, which is a slow path,
> > >     as we don't have any reliable way to execute it NOWAIT.
> > > 2) It can't easily poll for data, as there are 2 files it operates on.
> > >     It would either need to track what file to poll or poll both of them,
> > >     in both cases it'll be a mess and add lot of overhead.
> > > 3) It has to have pipes in the middle, which adds overhead and is not
> > >     great from the uapi design perspective when it goes for io_uring
> > >     requests.
> > > 4) We want to operate with spliced data as with a normal buffer, i.e.
> > >     write / send / etc. data as normally while it's zerocopy.
> > > 
> > > It can partially be solved, but the root cause is a suboptimal for
> > > io_uring design of IORING_OP_SPLICE. Introduce a new request type
> > > called IORING_OP_GET_BUF, inspired by splice(2) as well as other
> > > proposals like fused requests. The main idea is to use io_uring's
> > > registered buffers as the middle man instead of pipes. Once a buffer
> > > is fetched / spliced from a file using a new fops callback
> > > ->iou_get_buf, it's installed as a registered buffers and can be used
> > > by all operations supporting the feature.
> > > 
> > > Once the userspace releases the buffer, io_uring will wait for all
> > > requests using the buffer to complete and then use a file provided
> > > callback ->release() to return the buffer back. It operates on the
> > 
> > In the commit of "io_uring: add an example for buf-get op", I don't see
> > any code to release the buffer, can you explain it in details about how
> > to release the buffer in userspace? And add it in your example?
> 
> Sure, we need to add buf updates via request.

I guess it can't be buf update, here we need to release buf. At least
ublk needs to release the buffer explicitly after all consumers are done
with the buffer registered by IORING_OP_GET_BUF, when there may not be
any new buffer to be provided, and the buffer is per-IO for each io
request from /dev/ublkbN.

> 
> Particularly, in this RFC, the removal from the table was happening
> in io_install_buffer() by one of the test-only patches, the "remove
> previous entry on update" style as it's with files. Then it's
> released with the last ref put, either on removal with a request
> like:
> 
> io_free_batch_list()
>      io_req_put_rsrc_locked()
>          ...
> 
> > Here I guess the ->release() is called in the following code path:
> > 
> > io_buffer_unmap
> >      io_rsrc_buf_put
> >          io_rsrc_put_work
> >              io_rsrc_node_ref_zero
> >                  io_put_rsrc_node
> > 
> > If it is true, what is counter-pair code for io_put_rsrc_node()?
> > So far, only see io_req_set_rsrc_node() is called from
> > io_file_get_fixed(), is it needed for consumer OP of the buffer?
> > 
> > Also io_buffer_unmap() is called after io_rsrc_node's reference drops
> > to zero, which means ->release() isn't called after all its consumer(s)
> > are done given io_rsrc_node is shared by in-flight requests. If it is
> > true, this way will increase buffer lifetime a lot.
> 
> That's true. It's not a new downside, so might make more sense
> to do counting per rsrc (file, buffer), which is not so bad for
> now, but would be a bit concerning if we grow the number of rsrc
> types.

It may not be one deal for current fixed file user, which isn't usually
updated in fast path.

But IORING_OP_GET_BUF is supposed to be for fast path, this delay
release is really one problem. Cause if there is any new buffer provided &
consumed, old buffer can't be released any more. And this way can't
be used in ublk zero copy, let me share the ublk model a bit:

1) one batch ublk blk io requests(/dev/ublkbN) are coming, and batch size is
often QD of workload on /dev/ublkbN, and batch size could be 1 or more.

2) ublk driver notifies the blk io requests via io_uring command
completion

3) ublk server starts to handle this batch of io requests, by calling
IORING_OP_GET_BUF on /dev/ublkcN & normal OPs(rw, net recv/send, ...)
for each request in this batch

4) new batch of ublk blk io requests can come when handling the previous
batch of io requests, then the old buffer release is delayed until new
batch of ublk io requests are completed.

> 
> > ublk zero copy needs to call ->release() immediately after all
> > consumers are done, because the ublk disk request won't be completed
> > until the buffer is released(the buffer actually belongs to ublk block request).
> > 
> > Also the usage in liburing example needs two extra syscall(io_uring_enter) for
> > handling one IO, not take into account the "release OP". IMO, this way makes
> 
> Something is amiss here. It's 3 requests, which means 3 syscalls
> if you send requests separately (each step can be batch more
> requests), or 1 syscall if you link them together. There is an
> example using links for 2 requests in the test case.

See the final comment about link support.

> 
> > application more complicated, also might perform worse:
> > 
> > 1) for ublk zero copy, the original IO just needs one OP, but now it
> > takes three OPs, so application has to take coroutine for applying
> 
> Perhaps, you mean two requests for fused, IORING_OP_FUSED_CMD + IO
> request, vs three for IORING_OP_GET_BUF. There might be some sort of
> auto-remove on use, making it two requests, but that seems a bit ugly.

The most important part is that IORING_OP_GET_BUF adds two extra wait:

1) one io_uring_enter() is required after submitting IORING_OP_GET_BUF

2) another io_uring_enter() is needed after submitting buffer consumer
OPs, before calling buffer release OP(not added yet in your patchset)

The two waits not only causes application more complicated, but also
hurts performance, cause IOs/syscall is reduced. People loves io_uring
because it is really async io model, such as, all kinds of IO can be
submitted in single context, and wait in single io_uring_enter().

> 
> > 3 stages batch submission(GET_BUF, IO, release buffer) since IO_LINK can't
> > or not suggested to be used. In case of low QD, batch size is reduced much,
> > and performance may hurt because IOs/syscall is 1/3 of fused command.
> 
> I'm not a big fan of links for their inflexibility, but it can be
> used. The point is rather it's better not to be the only way to
> use the feature as we may need to stop in the middle, return

LINK always support to stop in the middle, right?

> control to the userspace and let it handle errors, do data processing
> and so on. The latter may need a partial memcpy() into the userspace,
> e.g. copy a handful bytes of headers to decide what to do with the
> rest of data.

At least this patchset doesn't work with IO_LINK, please see my previous
reply because the current rw/net zc retrieves fixed buffer in ->prep().

Yeah, the problem can be addressed by moving the buffer retrieving into
->issue().

But my concern is more about easy use and performance:

1) with io_link, the extra two waits(after IORING_OP_GET_BUF and before IORING_OP_GET_BUF)
required can be saved, then application doesn't need coroutine or
similar trick for avoiding the extra wait handling.

2) but if three OPs uses the buffer registered by IORING_OP_GET_BUF, the three
OPs have to be linked one by one, then all three have to be be submitted one
after the previous one is completed, performance is hurt a lot.

> 
> I deem fused cmds to be a variant of linking, so it's rather with
> it you link 2 requests vs optionally linking 3 with this patchset.

IMO, one extra 64byte touching may slow things a little, but I think it
won't be big deal, what really matters is that easy use of the interface
and performance.

Fused command solves both: allow to submit the consumer OPs concurrently;
avoid multiple wait point and make application easier to implement

For example:

1) fused command based application: (every IO command takes 1 syscall usually)

	for each reapped CQE:
		- if CQE is for io command from /dev/ublkc:
			- handle it by allocating/queuing three SQEs:
				- one SQE is for primary command,
				- the 2nd SQE is for READ to file 1
				- the 3rd SQE is for READ to file 2
		- if CQE is for primary command or normal OP:
			- just check if the whole fused command is completed
				- if yes, notify that the current io command is
				completed (the current io request from /dev/ublkbN will
				be completed in ublk driver, and this io command is
				queued for incoming request)
				- if any err, handle the err: retry or terminate,
				  similar with handling for 'yes'
				- otherwise, just return
		io_uring_enter();			//single wait point

2) IORING_OP_GET_BUF based application(every IO command needs two syscall)

	for each reapped CQE:
		- if CQE is for io command from /dev/ublkc:
			- handle it by:
				queuing IORING_OP_GET_BUF
				recoding this io needs to be hanlded

	io_uring_enter()	//wait until all IORING_OP_GET_BUF are done

	for each IO recorded in above, queue two new RW OPs(one for file1, one for file2)

	io_uring_enter()	//wait until the above two OPs are done

	for each reapped CQE:
		release the buffer registered by IORING_OP_GET_BUF

Or coroutine based approach: (every io command needs two syscalls)

	for each reaaped CQE:
		- if CQE is for command from /dev/ublkc:
				- queue IORING_OP_GET_BUF
				- corouine wait: until IORING_OP_GET_BUF is done
				- queue two RW OPs with the registered buffer
				- coroutine wait: until all above OP is done
				- release buffer registered by IORING_OP_GET_BUF
		- otherwise:
			- call related co routine wakeup handler for advancing in
			above co routine wait point
		io_uring_enter();

coroutine based approach may look clean and easier to implement, but
it depends on language support, and still needs extra care, also stackless
coroutine isn't easy to use.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 7/7] io_uring,fs: introduce IORING_OP_GET_BUF
  2023-05-04  2:06       ` Ming Lei
@ 2023-05-08  2:30         ` Pavel Begunkov
  2023-05-17  4:05           ` Ming Lei
  0 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2023-05-08  2:30 UTC (permalink / raw)
  To: Ming Lei; +Cc: io-uring

On 5/4/23 03:06, Ming Lei wrote:
> On Wed, May 03, 2023 at 03:54:02PM +0100, Pavel Begunkov wrote:
>> On 5/2/23 15:57, Ming Lei wrote:
>>> On Sun, Apr 30, 2023 at 10:35:29AM +0100, Pavel Begunkov wrote:
>>>> There are several problems with splice requests, aka IORING_OP_SPLICE:
>>>> 1) They are always executed by a worker thread, which is a slow path,
>>>>      as we don't have any reliable way to execute it NOWAIT.
>>>> 2) It can't easily poll for data, as there are 2 files it operates on.
>>>>      It would either need to track what file to poll or poll both of them,
>>>>      in both cases it'll be a mess and add lot of overhead.
>>>> 3) It has to have pipes in the middle, which adds overhead and is not
>>>>      great from the uapi design perspective when it goes for io_uring
>>>>      requests.
>>>> 4) We want to operate with spliced data as with a normal buffer, i.e.
>>>>      write / send / etc. data as normally while it's zerocopy.
>>>>
>>>> It can partially be solved, but the root cause is a suboptimal for
>>>> io_uring design of IORING_OP_SPLICE. Introduce a new request type
>>>> called IORING_OP_GET_BUF, inspired by splice(2) as well as other
>>>> proposals like fused requests. The main idea is to use io_uring's
>>>> registered buffers as the middle man instead of pipes. Once a buffer
>>>> is fetched / spliced from a file using a new fops callback
>>>> ->iou_get_buf, it's installed as a registered buffers and can be used
>>>> by all operations supporting the feature.
>>>>
>>>> Once the userspace releases the buffer, io_uring will wait for all
>>>> requests using the buffer to complete and then use a file provided
>>>> callback ->release() to return the buffer back. It operates on the
>>>
>>> In the commit of "io_uring: add an example for buf-get op", I don't see
>>> any code to release the buffer, can you explain it in details about how
>>> to release the buffer in userspace? And add it in your example?
>>
>> Sure, we need to add buf updates via request.
> 
> I guess it can't be buf update, here we need to release buf. At least
> ublk needs to release the buffer explicitly after all consumers are done
> with the buffer registered by IORING_OP_GET_BUF, when there may not be
> any new buffer to be provided, and the buffer is per-IO for each io
> request from /dev/ublkbN.

By "updates" I usually mean removals as well, just like
IORING_REGISTER_BUFFERS_UPDATE with iov_base == 0 would remove
the buffer.


>> Particularly, in this RFC, the removal from the table was happening
>> in io_install_buffer() by one of the test-only patches, the "remove
>> previous entry on update" style as it's with files. Then it's
>> released with the last ref put, either on removal with a request
>> like:
>>
>> io_free_batch_list()
>>       io_req_put_rsrc_locked()
>>           ...
>>
>>> Here I guess the ->release() is called in the following code path:
>>>
>>> io_buffer_unmap
>>>       io_rsrc_buf_put
>>>           io_rsrc_put_work
>>>               io_rsrc_node_ref_zero
>>>                   io_put_rsrc_node
>>>
>>> If it is true, what is counter-pair code for io_put_rsrc_node()?
>>> So far, only see io_req_set_rsrc_node() is called from
>>> io_file_get_fixed(), is it needed for consumer OP of the buffer?
>>>
>>> Also io_buffer_unmap() is called after io_rsrc_node's reference drops
>>> to zero, which means ->release() isn't called after all its consumer(s)
>>> are done given io_rsrc_node is shared by in-flight requests. If it is
>>> true, this way will increase buffer lifetime a lot.
>>
>> That's true. It's not a new downside, so might make more sense
>> to do counting per rsrc (file, buffer), which is not so bad for
>> now, but would be a bit concerning if we grow the number of rsrc
>> types.
> 
> It may not be one deal for current fixed file user, which isn't usually
> updated in fast path.

With opening/accepting right into io_uring skipping normal file
tables and cases like open->read->close, it's rather somewhat of
medium hotness, definitely not a slow path. And the problem affects
it as well
  
> But IORING_OP_GET_BUF is supposed to be for fast path, this delay
> release is really one problem. Cause if there is any new buffer provided &
> consumed, old buffer can't be released any more. And this way can't
> be used in ublk zero copy, let me share the ublk model a bit:
> 
> 1) one batch ublk blk io requests(/dev/ublkbN) are coming, and batch size is
> often QD of workload on /dev/ublkbN, and batch size could be 1 or more.
> 
> 2) ublk driver notifies the blk io requests via io_uring command
> completion
> 
> 3) ublk server starts to handle this batch of io requests, by calling
> IORING_OP_GET_BUF on /dev/ublkcN & normal OPs(rw, net recv/send, ...)
> for each request in this batch
> 
> 4) new batch of ublk blk io requests can come when handling the previous
> batch of io requests, then the old buffer release is delayed until new
> batch of ublk io requests are completed.
> 
>>
>>> ublk zero copy needs to call ->release() immediately after all
>>> consumers are done, because the ublk disk request won't be completed
>>> until the buffer is released(the buffer actually belongs to ublk block request).
>>>
>>> Also the usage in liburing example needs two extra syscall(io_uring_enter) for
>>> handling one IO, not take into account the "release OP". IMO, this way makes
>>
>> Something is amiss here. It's 3 requests, which means 3 syscalls
>> if you send requests separately (each step can be batch more
>> requests), or 1 syscall if you link them together. There is an
>> example using links for 2 requests in the test case.
> 
> See the final comment about link support.
> 
>>
>>> application more complicated, also might perform worse:
>>>
>>> 1) for ublk zero copy, the original IO just needs one OP, but now it
>>> takes three OPs, so application has to take coroutine for applying
>>
>> Perhaps, you mean two requests for fused, IORING_OP_FUSED_CMD + IO
>> request, vs three for IORING_OP_GET_BUF. There might be some sort of
>> auto-remove on use, making it two requests, but that seems a bit ugly.
> 
> The most important part is that IORING_OP_GET_BUF adds two extra wait:
> 
> 1) one io_uring_enter() is required after submitting IORING_OP_GET_BUF
> 
> 2) another io_uring_enter() is needed after submitting buffer consumer
> OPs, before calling buffer release OP(not added yet in your patchset)
> 
> The two waits not only causes application more complicated, but also
> hurts performance, cause IOs/syscall is reduced. People loves io_uring
> because it is really async io model, such as, all kinds of IO can be
> submitted in single context, and wait in single io_uring_enter().

*Pseudo code*
N = get_free_buffer_slot();
sqe1 = prep_getbuf_sqe(buf_idx = N);
sqe1->flags |= F_LINK;

sqe2 = prep_write_sqe(buf_idx = N);
sqe2->flags |= F_LINK;

sqe3 = prep_release_buf_sqe(buf_idx = N);
sqe3->flags |= F_LINK;

submit_and_wait(nr_wait=3);


That should work, there is only 1 syscall. We can also
play with SKIP_CQE_SUCCESS and/or HARDLINK to fold 3 cqes
into 1.

>>> 3 stages batch submission(GET_BUF, IO, release buffer) since IO_LINK can't
>>> or not suggested to be used. In case of low QD, batch size is reduced much,
>>> and performance may hurt because IOs/syscall is 1/3 of fused command.
>>
>> I'm not a big fan of links for their inflexibility, but it can be
>> used. The point is rather it's better not to be the only way to
>> use the feature as we may need to stop in the middle, return
> 
> LINK always support to stop in the middle, right?

Normal links will stop execution on "error", which is usually
cqe->res < 0, but read/write will also break the link on short
IO, and the short IO behaviour of send/recv will depend on
MSG_WAITALL. IOSQE_IO_HARDLINK will continue executing linked
requests even with prior errors.

>> control to the userspace and let it handle errors, do data processing
>> and so on. The latter may need a partial memcpy() into the userspace,
>> e.g. copy a handful bytes of headers to decide what to do with the
>> rest of data.
> 
> At least this patchset doesn't work with IO_LINK, please see my previous
> reply because the current rw/net zc retrieves fixed buffer in ->prep().
> 
> Yeah, the problem can be addressed by moving the buffer retrieving into
> ->issue().

Right, one of the test patches does exactly that (fwiw, not tested
seriously), and as it was previously done for files. It won't be
a big problem.

> But my concern is more about easy use and performance:
> 
> 1) with io_link, the extra two waits(after IORING_OP_GET_BUF and before IORING_OP_GET_BUF)
> required can be saved, then application doesn't need coroutine or
> similar trick for avoiding the extra wait handling.

It shouldn't need that, see above.

> 2) but if three OPs uses the buffer registered by IORING_OP_GET_BUF, the three
> OPs have to be linked one by one, then all three have to be be submitted one
> after the previous one is completed, performance is hurt a lot.

In general, it sounds like a generic feature and should be as such.
But

I agree with the sentiment, that's exactly why I was saying that
they're not perfectly flexible and don't cover all the cases, but
the same can be said for the 1->N kinds of dependency. And even
with a completely configurable graphs of requests there will be
questions like "how to take result from one request and give it
to another with a random meaning, i.e. as a buffer/fd/size/
index/whatnot", and there are more complicated examples.

I don't see any good scalable way for that without programmability.
It appeared before that it was faster to return back to the userspace
than using BPF, might be worth to test it again, especially with
DEFER_TASKRUN and recent optimisations.


>> I deem fused cmds to be a variant of linking, so it's rather with
>> it you link 2 requests vs optionally linking 3 with this patchset.
> 
> IMO, one extra 64byte touching may slow things a little, but I think it

Fwiw, the overhead comes from all io_uring hops a request goes
through, e.g. init, ->prep, ->issue, free, etc. I'd love to
see it getting even slimmer, and there are definitely spots
that can be optimised.

> won't be big deal, what really matters is that easy use of the interface
> and performance.

Which should also be balanced with flexibility and not only.

> Fused command solves both: allow to submit the consumer OPs concurrently;
> avoid multiple wait point and make application easier to implement
> 
> For example:
> 
> 1) fused command based application: (every IO command takes 1 syscall usually)
> 
> 	for each reapped CQE:
> 		- if CQE is for io command from /dev/ublkc:
> 			- handle it by allocating/queuing three SQEs:
> 				- one SQE is for primary command,
> 				- the 2nd SQE is for READ to file 1
> 				- the 3rd SQE is for READ to file 2
> 		- if CQE is for primary command or normal OP:
> 			- just check if the whole fused command is completed
> 				- if yes, notify that the current io command is
> 				completed (the current io request from /dev/ublkbN will
> 				be completed in ublk driver, and this io command is
> 				queued for incoming request)
> 				- if any err, handle the err: retry or terminate,
> 				  similar with handling for 'yes'
> 				- otherwise, just return
> 		io_uring_enter();			//single wait point
> 
> 2) IORING_OP_GET_BUF based application(every IO command needs two syscall)
> 
> 	for each reapped CQE:
> 		- if CQE is for io command from /dev/ublkc:
> 			- handle it by:
> 				queuing IORING_OP_GET_BUF
> 				recoding this io needs to be hanlded
> 
> 	io_uring_enter()	//wait until all IORING_OP_GET_BUF are done
> 
> 	for each IO recorded in above, queue two new RW OPs(one for file1, one for file2)
> 
> 	io_uring_enter()	//wait until the above two OPs are done
> 
> 	for each reapped CQE:
> 		release the buffer registered by IORING_OP_GET_BUF
> 
> Or coroutine based approach: (every io command needs two syscalls)
> 
> 	for each reaaped CQE:
> 		- if CQE is for command from /dev/ublkc:
> 				- queue IORING_OP_GET_BUF
> 				- corouine wait: until IORING_OP_GET_BUF is done
> 				- queue two RW OPs with the registered buffer
> 				- coroutine wait: until all above OP is done
> 				- release buffer registered by IORING_OP_GET_BUF
> 		- otherwise:
> 			- call related co routine wakeup handler for advancing in
> 			above co routine wait point
> 		io_uring_enter();
> 
> coroutine based approach may look clean and easier to implement, but
> it depends on language support, and still needs extra care, also stackless
> coroutine isn't easy to use.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 7/7] io_uring,fs: introduce IORING_OP_GET_BUF
  2023-05-08  2:30         ` Pavel Begunkov
@ 2023-05-17  4:05           ` Ming Lei
  0 siblings, 0 replies; 14+ messages in thread
From: Ming Lei @ 2023-05-17  4:05 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, ming.lei

On Mon, May 08, 2023 at 03:30:55AM +0100, Pavel Begunkov wrote:
> On 5/4/23 03:06, Ming Lei wrote:
> > On Wed, May 03, 2023 at 03:54:02PM +0100, Pavel Begunkov wrote:
> > > On 5/2/23 15:57, Ming Lei wrote:
> > > > On Sun, Apr 30, 2023 at 10:35:29AM +0100, Pavel Begunkov wrote:
> > > > > There are several problems with splice requests, aka IORING_OP_SPLICE:
> > > > > 1) They are always executed by a worker thread, which is a slow path,
> > > > >      as we don't have any reliable way to execute it NOWAIT.
> > > > > 2) It can't easily poll for data, as there are 2 files it operates on.
> > > > >      It would either need to track what file to poll or poll both of them,
> > > > >      in both cases it'll be a mess and add lot of overhead.
> > > > > 3) It has to have pipes in the middle, which adds overhead and is not
> > > > >      great from the uapi design perspective when it goes for io_uring
> > > > >      requests.
> > > > > 4) We want to operate with spliced data as with a normal buffer, i.e.
> > > > >      write / send / etc. data as normally while it's zerocopy.
> > > > > 
> > > > > It can partially be solved, but the root cause is a suboptimal for
> > > > > io_uring design of IORING_OP_SPLICE. Introduce a new request type
> > > > > called IORING_OP_GET_BUF, inspired by splice(2) as well as other
> > > > > proposals like fused requests. The main idea is to use io_uring's
> > > > > registered buffers as the middle man instead of pipes. Once a buffer
> > > > > is fetched / spliced from a file using a new fops callback
> > > > > ->iou_get_buf, it's installed as a registered buffers and can be used
> > > > > by all operations supporting the feature.
> > > > > 
> > > > > Once the userspace releases the buffer, io_uring will wait for all
> > > > > requests using the buffer to complete and then use a file provided
> > > > > callback ->release() to return the buffer back. It operates on the
> > > > 
> > > > In the commit of "io_uring: add an example for buf-get op", I don't see
> > > > any code to release the buffer, can you explain it in details about how
> > > > to release the buffer in userspace? And add it in your example?
> > > 
> > > Sure, we need to add buf updates via request.
> > 
> > I guess it can't be buf update, here we need to release buf. At least
> > ublk needs to release the buffer explicitly after all consumers are done
> > with the buffer registered by IORING_OP_GET_BUF, when there may not be
> > any new buffer to be provided, and the buffer is per-IO for each io
> > request from /dev/ublkbN.
> 
> By "updates" I usually mean removals as well, just like
> IORING_REGISTER_BUFFERS_UPDATE with iov_base == 0 would remove
> the buffer.

OK.

> 
> 
> > > Particularly, in this RFC, the removal from the table was happening
> > > in io_install_buffer() by one of the test-only patches, the "remove
> > > previous entry on update" style as it's with files. Then it's
> > > released with the last ref put, either on removal with a request
> > > like:

This way can't work for ublk, we definitely need to release the buffer immediately
after it is consumed, otherwise the ublk block request won't be completed.

> > > 
> > > io_free_batch_list()
> > >       io_req_put_rsrc_locked()
> > >           ...
> > > 
> > > > Here I guess the ->release() is called in the following code path:
> > > > 
> > > > io_buffer_unmap
> > > >       io_rsrc_buf_put
> > > >           io_rsrc_put_work
> > > >               io_rsrc_node_ref_zero
> > > >                   io_put_rsrc_node
> > > > 
> > > > If it is true, what is counter-pair code for io_put_rsrc_node()?
> > > > So far, only see io_req_set_rsrc_node() is called from
> > > > io_file_get_fixed(), is it needed for consumer OP of the buffer?
> > > > 
> > > > Also io_buffer_unmap() is called after io_rsrc_node's reference drops
> > > > to zero, which means ->release() isn't called after all its consumer(s)
> > > > are done given io_rsrc_node is shared by in-flight requests. If it is
> > > > true, this way will increase buffer lifetime a lot.
> > > 
> > > That's true. It's not a new downside, so might make more sense
> > > to do counting per rsrc (file, buffer), which is not so bad for
> > > now, but would be a bit concerning if we grow the number of rsrc
> > > types.
> > 
> > It may not be one deal for current fixed file user, which isn't usually
> > updated in fast path.
> 
> With opening/accepting right into io_uring skipping normal file
> tables and cases like open->read->close, it's rather somewhat of
> medium hotness, definitely not a slow path. And the problem affects
> it as well

OK, then it becomes more rationale to remove the delayed release by
using io_rsrc_node, I am wondering why not add reference counter
to the buffer which needs to be used in such fast path?

> > But IORING_OP_GET_BUF is supposed to be for fast path, this delay
> > release is really one problem. Cause if there is any new buffer provided &
> > consumed, old buffer can't be released any more. And this way can't
> > be used in ublk zero copy, let me share the ublk model a bit:
> > 
> > 1) one batch ublk blk io requests(/dev/ublkbN) are coming, and batch size is
> > often QD of workload on /dev/ublkbN, and batch size could be 1 or more.
> > 
> > 2) ublk driver notifies the blk io requests via io_uring command
> > completion
> > 
> > 3) ublk server starts to handle this batch of io requests, by calling
> > IORING_OP_GET_BUF on /dev/ublkcN & normal OPs(rw, net recv/send, ...)
> > for each request in this batch
> > 
> > 4) new batch of ublk blk io requests can come when handling the previous
> > batch of io requests, then the old buffer release is delayed until new
> > batch of ublk io requests are completed.
> > 
> > > 
> > > > ublk zero copy needs to call ->release() immediately after all
> > > > consumers are done, because the ublk disk request won't be completed
> > > > until the buffer is released(the buffer actually belongs to ublk block request).
> > > > 
> > > > Also the usage in liburing example needs two extra syscall(io_uring_enter) for
> > > > handling one IO, not take into account the "release OP". IMO, this way makes
> > > 
> > > Something is amiss here. It's 3 requests, which means 3 syscalls
> > > if you send requests separately (each step can be batch more
> > > requests), or 1 syscall if you link them together. There is an
> > > example using links for 2 requests in the test case.
> > 
> > See the final comment about link support.
> > 
> > > 
> > > > application more complicated, also might perform worse:
> > > > 
> > > > 1) for ublk zero copy, the original IO just needs one OP, but now it
> > > > takes three OPs, so application has to take coroutine for applying
> > > 
> > > Perhaps, you mean two requests for fused, IORING_OP_FUSED_CMD + IO
> > > request, vs three for IORING_OP_GET_BUF. There might be some sort of
> > > auto-remove on use, making it two requests, but that seems a bit ugly.
> > 
> > The most important part is that IORING_OP_GET_BUF adds two extra wait:
> > 
> > 1) one io_uring_enter() is required after submitting IORING_OP_GET_BUF
> > 
> > 2) another io_uring_enter() is needed after submitting buffer consumer
> > OPs, before calling buffer release OP(not added yet in your patchset)
> > 
> > The two waits not only causes application more complicated, but also
> > hurts performance, cause IOs/syscall is reduced. People loves io_uring
> > because it is really async io model, such as, all kinds of IO can be
> > submitted in single context, and wait in single io_uring_enter().
> 
> *Pseudo code*
> N = get_free_buffer_slot();
> sqe1 = prep_getbuf_sqe(buf_idx = N);
> sqe1->flags |= F_LINK;
> 
> sqe2 = prep_write_sqe(buf_idx = N);
> sqe2->flags |= F_LINK;
> 
> sqe3 = prep_release_buf_sqe(buf_idx = N);
> sqe3->flags |= F_LINK;

The above link shouldn't be needed.

> 
> submit_and_wait(nr_wait=3);
> 
> 
> That should work, there is only 1 syscall. We can also
> play with SKIP_CQE_SUCCESS and/or HARDLINK to fold 3 cqes
> into 1.

Yeah, this way does simplify application and is more efficient since more
OPs can be issued via single syscall, but another problem is raised, if
there are multiple prep_write_sqe() in this chain and all consume this same
buffer, all WRITEs have to be issued after the previous one is completed,
this way is slow, and no such problem in fused command.

> 
> > > > 3 stages batch submission(GET_BUF, IO, release buffer) since IO_LINK can't
> > > > or not suggested to be used. In case of low QD, batch size is reduced much,
> > > > and performance may hurt because IOs/syscall is 1/3 of fused command.
> > > 
> > > I'm not a big fan of links for their inflexibility, but it can be
> > > used. The point is rather it's better not to be the only way to
> > > use the feature as we may need to stop in the middle, return
> > 
> > LINK always support to stop in the middle, right?
> 
> Normal links will stop execution on "error", which is usually
> cqe->res < 0, but read/write will also break the link on short
> IO, and the short IO behaviour of send/recv will depend on
> MSG_WAITALL. IOSQE_IO_HARDLINK will continue executing linked
> requests even with prior errors.

I think ublk server is fine with current handling of link break.

> 
> > > control to the userspace and let it handle errors, do data processing
> > > and so on. The latter may need a partial memcpy() into the userspace,
> > > e.g. copy a handful bytes of headers to decide what to do with the
> > > rest of data.
> > 
> > At least this patchset doesn't work with IO_LINK, please see my previous
> > reply because the current rw/net zc retrieves fixed buffer in ->prep().
> > 
> > Yeah, the problem can be addressed by moving the buffer retrieving into
> > ->issue().
> 
> Right, one of the test patches does exactly that (fwiw, not tested
> seriously), and as it was previously done for files. It won't be
> a big problem.

OK, looks you must forget to include the patch in this series.

> 
> > But my concern is more about easy use and performance:
> > 
> > 1) with io_link, the extra two waits(after IORING_OP_GET_BUF and before IORING_OP_GET_BUF)
> > required can be saved, then application doesn't need coroutine or
> > similar trick for avoiding the extra wait handling.
> 
> It shouldn't need that, see above.
> 
> > 2) but if three OPs uses the buffer registered by IORING_OP_GET_BUF, the three
> > OPs have to be linked one by one, then all three have to be be submitted one
> > after the previous one is completed, performance is hurt a lot.
> 
> In general, it sounds like a generic feature and should be as such.
> But
> 
> I agree with the sentiment, that's exactly why I was saying that
> they're not perfectly flexible and don't cover all the cases, but
> the same can be said for the 1->N kinds of dependency. And even
> with a completely configurable graphs of requests there will be
> questions like "how to take result from one request and give it
> to another with a random meaning, i.e. as a buffer/fd/size/
> index/whatnot", and there are more complicated examples.
> 
> I don't see any good scalable way for that without programmability.
> It appeared before that it was faster to return back to the userspace
> than using BPF, might be worth to test it again, especially with
> DEFER_TASKRUN and recent optimisations.

I don't think result from one request need to be retrieved for the
following request, such as by taking your previous code:

	*Pseudo code*
	N = get_free_buffer_slot();
	sqe1 = prep_getbuf_sqe(buf_idx = N);
	sqe1->flags |= F_LINK;
	
	sqe2 = prep_write_sqe(buf_idx = N);
	sqe2->flags |= F_LINK;

	sqe3= prep_write_sqe(buf_idx = N);
	sqe3->flags |= F_LINK;
	
	sqe4 = prep_release_buf_sqe(buf_idx = N);
	sqe4->flags |= F_LINK;

After sqe1 is done, we just need to support to submit sqe2/sqe3/sqe4
concurrently, however prep_release_buf_sqe need to support async buf
release.

> 
> 
> > > I deem fused cmds to be a variant of linking, so it's rather with
> > > it you link 2 requests vs optionally linking 3 with this patchset.
> > 
> > IMO, one extra 64byte touching may slow things a little, but I think it
> 
> Fwiw, the overhead comes from all io_uring hops a request goes
> through, e.g. init, ->prep, ->issue, free, etc. I'd love to
> see it getting even slimmer, and there are definitely spots
> that can be optimised.
> 
> > won't be big deal, what really matters is that easy use of the interface
> > and performance.
> 
> Which should also be balanced with flexibility and not only.

That is why I prefer to fused command which can solve the problem in
one easy way without big core code change(include interface), then generic
core code still can keep its flexibility.

thanks,
Ming


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-05-17  4:06 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-30  9:35 [RFC 0/7] Rethinking splice Pavel Begunkov
2023-04-30  9:35 ` [RFC 1/7] io_uring: add io_mapped_ubuf caches Pavel Begunkov
2023-04-30  9:35 ` [RFC 2/7] io_uring: add reg-buffer data directions Pavel Begunkov
2023-04-30  9:35 ` [RFC 3/7] io_uring: fail loop_rw_iter with pure bvec bufs Pavel Begunkov
2023-04-30  9:35 ` [RFC 4/7] io_uring/rsrc: introduce struct iou_buf_desc Pavel Begunkov
2023-04-30  9:35 ` [RFC 5/7] io_uring/rsrc: add buffer release callbacks Pavel Begunkov
2023-04-30  9:35 ` [RFC 6/7] io_uring/rsrc: introduce helper installing one buffer Pavel Begunkov
2023-04-30  9:35 ` [RFC 7/7] io_uring,fs: introduce IORING_OP_GET_BUF Pavel Begunkov
2023-05-02 14:57   ` Ming Lei
2023-05-02 15:20     ` Ming Lei
2023-05-03 14:54     ` Pavel Begunkov
2023-05-04  2:06       ` Ming Lei
2023-05-08  2:30         ` Pavel Begunkov
2023-05-17  4:05           ` Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox