* [PATCH 1/6] block: const blk_rq_nr_phys_segments request
2025-02-03 15:45 [PATCH 0/6] ublk zero-copy support Keith Busch
@ 2025-02-03 15:45 ` Keith Busch
2025-02-03 15:45 ` [PATCH 2/6] io_uring: use node for import Keith Busch
` (7 subsequent siblings)
8 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2025-02-03 15:45 UTC (permalink / raw)
To: io-uring, linux-block, ming.lei, axboe, asml.silence; +Cc: Keith Busch
From: Keith Busch <[email protected]>
The request is not modified. Mark the parameter const so that functions
taking a const request can use this helper.
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/blk-mq.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index a0a9007cc1e36..56ef03bc68884 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1125,7 +1125,7 @@ void blk_abort_request(struct request *);
* own special payload. In that case we still return 1 here so that this
* special payload will be mapped.
*/
-static inline unsigned short blk_rq_nr_phys_segments(struct request *rq)
+static inline unsigned short blk_rq_nr_phys_segments(const struct request *rq)
{
if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
return 1;
--
2.43.5
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 2/6] io_uring: use node for import
2025-02-03 15:45 [PATCH 0/6] ublk zero-copy support Keith Busch
2025-02-03 15:45 ` [PATCH 1/6] block: const blk_rq_nr_phys_segments request Keith Busch
@ 2025-02-03 15:45 ` Keith Busch
2025-02-03 15:45 ` [PATCH 3/6] io_uring: add support for kernel registered bvecs Keith Busch
` (6 subsequent siblings)
8 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2025-02-03 15:45 UTC (permalink / raw)
To: io-uring, linux-block, ming.lei, axboe, asml.silence; +Cc: Keith Busch
From: Jens Axboe <[email protected]>
Pass the parent node to the import helper instead of the mapped buffer.
This prepares for future node types that need specific handling.
Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
---
io_uring/net.c | 3 +--
io_uring/rsrc.c | 6 +++---
io_uring/rsrc.h | 5 ++---
io_uring/rw.c | 2 +-
io_uring/uring_cmd.c | 2 +-
5 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/io_uring/net.c b/io_uring/net.c
index 85f55fbc25c94..4e9d0f04b902d 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1359,8 +1359,7 @@ static int io_send_zc_import(struct io_kiocb *req, unsigned int issue_flags)
return ret;
ret = io_import_fixed(ITER_SOURCE, &kmsg->msg.msg_iter,
- node->buf, (u64)(uintptr_t)sr->buf,
- sr->len);
+ node, (u64)(uintptr_t)sr->buf, sr->len);
if (unlikely(ret))
return ret;
kmsg->msg.sg_from_iter = io_sg_from_iter;
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index af39b69eb4fde..4d0e1c06c8bc6 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -860,10 +860,10 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
return ret;
}
-int io_import_fixed(int ddir, struct iov_iter *iter,
- struct io_mapped_ubuf *imu,
- u64 buf_addr, size_t len)
+int io_import_fixed(int ddir, struct iov_iter *iter, struct io_rsrc_node *node,
+ u64 buf_addr, size_t len)
{
+ struct io_mapped_ubuf *imu = node->buf;
u64 buf_end;
size_t offset;
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 190f7ee45de93..abd0d5d42c3e1 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -50,9 +50,8 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node);
void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data);
int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr);
-int io_import_fixed(int ddir, struct iov_iter *iter,
- struct io_mapped_ubuf *imu,
- u64 buf_addr, size_t len);
+int io_import_fixed(int ddir, struct iov_iter *iter, struct io_rsrc_node *node,
+ u64 buf_addr, size_t len);
int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg);
int io_sqe_buffers_unregister(struct io_ring_ctx *ctx);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index a9a2733be8420..d6332d019dd56 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -393,7 +393,7 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
io_req_assign_buf_node(req, node);
io = req->async_data;
- ret = io_import_fixed(ddir, &io->iter, node->buf, rw->addr, rw->len);
+ ret = io_import_fixed(ddir, &io->iter, node, rw->addr, rw->len);
iov_iter_save_state(&io->iter, &io->iter_state);
return ret;
}
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index fc94c465a9850..b7b9baf30d728 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -281,7 +281,7 @@ int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw,
/* Must have had rsrc_node assigned at prep time */
if (node)
- return io_import_fixed(rw, iter, node->buf, ubuf, len);
+ return io_import_fixed(rw, iter, node, ubuf, len);
return -EFAULT;
}
--
2.43.5
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 3/6] io_uring: add support for kernel registered bvecs
2025-02-03 15:45 [PATCH 0/6] ublk zero-copy support Keith Busch
2025-02-03 15:45 ` [PATCH 1/6] block: const blk_rq_nr_phys_segments request Keith Busch
2025-02-03 15:45 ` [PATCH 2/6] io_uring: use node for import Keith Busch
@ 2025-02-03 15:45 ` Keith Busch
2025-02-07 14:08 ` Pavel Begunkov
2025-02-10 14:12 ` Ming Lei
2025-02-03 15:45 ` [PATCH 4/6] ublk: zc register/unregister bvec Keith Busch
` (5 subsequent siblings)
8 siblings, 2 replies; 27+ messages in thread
From: Keith Busch @ 2025-02-03 15:45 UTC (permalink / raw)
To: io-uring, linux-block, ming.lei, axboe, asml.silence; +Cc: Keith Busch
From: Keith Busch <[email protected]>
Provide an interface for the kernel to leverage the existing
pre-registered buffers that io_uring provides. User space can reference
these later to achieve zero-copy IO.
User space must register an empty fixed buffer table with io_uring in
order for the kernel to make use of it.
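For illustration, the user space side of the setup looks roughly like this
(assuming liburing's sparse registration helper; error handling omitted):

	struct io_uring ring;

	io_uring_queue_init(64, &ring, 0);
	/* reserve 16 empty fixed buffer slots; the kernel fills them later */
	io_uring_register_buffers_sparse(&ring, 16);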
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/io_uring.h | 1 +
include/linux/io_uring_types.h | 3 +
io_uring/rsrc.c | 114 +++++++++++++++++++++++++++++++--
io_uring/rsrc.h | 1 +
4 files changed, 114 insertions(+), 5 deletions(-)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 85fe4e6b275c7..b5637a2aae340 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -5,6 +5,7 @@
#include <linux/sched.h>
#include <linux/xarray.h>
#include <uapi/linux/io_uring.h>
+#include <linux/blk-mq.h>
#if defined(CONFIG_IO_URING)
void __io_uring_cancel(bool cancel_all);
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 623d8e798a11a..7e5a5a70c35f2 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -695,4 +695,7 @@ static inline bool io_ctx_cqe32(struct io_ring_ctx *ctx)
return ctx->flags & IORING_SETUP_CQE32;
}
+int io_buffer_register_bvec(struct io_ring_ctx *ctx, const struct request *rq, unsigned int tag);
+void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int tag);
+
#endif
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 4d0e1c06c8bc6..8c4c374abcc10 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -111,7 +111,10 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
if (!refcount_dec_and_test(&imu->refs))
return;
for (i = 0; i < imu->nr_bvecs; i++)
- unpin_user_page(imu->bvec[i].bv_page);
+ if (node->type == IORING_RSRC_KBUF)
+ put_page(imu->bvec[i].bv_page);
+ else
+ unpin_user_page(imu->bvec[i].bv_page);
if (imu->acct_pages)
io_unaccount_mem(ctx, imu->acct_pages);
kvfree(imu);
@@ -240,6 +243,13 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
struct io_rsrc_node *node;
u64 tag = 0;
+ i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
+ node = io_rsrc_node_lookup(&ctx->buf_table, i);
+ if (node && node->type != IORING_RSRC_BUFFER) {
+ err = -EBUSY;
+ break;
+ }
+
uvec = u64_to_user_ptr(user_data);
iov = iovec_from_user(uvec, 1, 1, &fast_iov, ctx->compat);
if (IS_ERR(iov)) {
@@ -258,6 +268,7 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
err = PTR_ERR(node);
break;
}
+
if (tag) {
if (!node) {
err = -EINVAL;
@@ -265,7 +276,6 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
}
node->tag = tag;
}
- i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
io_reset_rsrc_node(ctx, &ctx->buf_table, i);
ctx->buf_table.nodes[i] = node;
if (ctx->compat)
@@ -453,6 +463,7 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
fput(io_slot_file(node));
break;
case IORING_RSRC_BUFFER:
+ case IORING_RSRC_KBUF:
if (node->buf)
io_buffer_unmap(ctx, node);
break;
@@ -860,6 +871,92 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
return ret;
}
+static struct io_rsrc_node *io_buffer_alloc_node(struct io_ring_ctx *ctx,
+ unsigned int nr_bvecs,
+ unsigned int len)
+{
+ struct io_mapped_ubuf *imu;
+ struct io_rsrc_node *node;
+
+ node = io_rsrc_node_alloc(IORING_RSRC_KBUF);
+ if (!node)
+ return NULL;
+
+ imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
+ if (!imu) {
+ io_put_rsrc_node(ctx, node);
+ return NULL;
+ }
+
+ imu->ubuf = 0;
+ imu->len = len;
+ imu->acct_pages = 0;
+ imu->nr_bvecs = nr_bvecs;
+ refcount_set(&imu->refs, 1);
+
+ node->buf = imu;
+ return node;
+}
+
+int io_buffer_register_bvec(struct io_ring_ctx *ctx, const struct request *rq,
+ unsigned int index)
+{
+ struct io_rsrc_data *data = &ctx->buf_table;
+ u16 nr_bvecs = blk_rq_nr_phys_segments(rq);
+ struct req_iterator rq_iter;
+ struct io_rsrc_node *node;
+ struct bio_vec bv;
+ int i = 0;
+
+ lockdep_assert_held(&ctx->uring_lock);
+
+ if (WARN_ON_ONCE(!data->nr))
+ return -EINVAL;
+ if (WARN_ON_ONCE(index >= data->nr))
+ return -EINVAL;
+
+ node = data->nodes[index];
+ if (WARN_ON_ONCE(node))
+ return -EBUSY;
+
+ node = io_buffer_alloc_node(ctx, nr_bvecs, blk_rq_bytes(rq));
+ if (!node)
+ return -ENOMEM;
+
+ rq_for_each_bvec(bv, rq, rq_iter) {
+ get_page(bv.bv_page);
+ node->buf->bvec[i].bv_page = bv.bv_page;
+ node->buf->bvec[i].bv_len = bv.bv_len;
+ node->buf->bvec[i].bv_offset = bv.bv_offset;
+ i++;
+ }
+ data->nodes[index] = node;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
+
+void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index)
+{
+ struct io_rsrc_data *data = &ctx->buf_table;
+ struct io_rsrc_node *node;
+
+ lockdep_assert_held(&ctx->uring_lock);
+
+ if (WARN_ON_ONCE(!data->nr))
+ return;
+ if (WARN_ON_ONCE(index >= data->nr))
+ return;
+
+ node = data->nodes[index];
+ if (WARN_ON_ONCE(!node || !node->buf))
+ return;
+ if (WARN_ON_ONCE(node->type != IORING_RSRC_KBUF))
+ return;
+ io_reset_rsrc_node(ctx, data, index);
+}
+EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
+
int io_import_fixed(int ddir, struct iov_iter *iter, struct io_rsrc_node *node,
u64 buf_addr, size_t len)
{
@@ -886,8 +983,8 @@ int io_import_fixed(int ddir, struct iov_iter *iter, struct io_rsrc_node *node,
/*
* Don't use iov_iter_advance() here, as it's really slow for
* using the latter parts of a big fixed buffer - it iterates
- * over each segment manually. We can cheat a bit here, because
- * we know that:
+ * over each segment manually. We can cheat a bit here for user
+ * registered nodes, because we know that:
*
* 1) it's a BVEC iter, we set it up
* 2) all bvecs are the same in size, except potentially the
@@ -901,7 +998,14 @@ int io_import_fixed(int ddir, struct iov_iter *iter, struct io_rsrc_node *node,
*/
const struct bio_vec *bvec = imu->bvec;
- if (offset < bvec->bv_len) {
+ /*
+ * Kernel buffer bvecs, on the other hand, don't necessarily
+ * have the size property of user registered ones, so we have
+ * to use the slow iter advance.
+ */
+ if (node->type == IORING_RSRC_KBUF)
+ iov_iter_advance(iter, offset);
+ else if (offset < bvec->bv_len) {
iter->iov_offset = offset;
} else {
unsigned long seg_skip;
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index abd0d5d42c3e1..d1d90d9cd2b43 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -13,6 +13,7 @@
enum {
IORING_RSRC_FILE = 0,
IORING_RSRC_BUFFER = 1,
+ IORING_RSRC_KBUF = 2,
};
struct io_rsrc_node {
--
2.43.5
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH 3/6] io_uring: add support for kernel registered bvecs
2025-02-03 15:45 ` [PATCH 3/6] io_uring: add support for kernel registered bvecs Keith Busch
@ 2025-02-07 14:08 ` Pavel Begunkov
2025-02-07 15:17 ` Keith Busch
2025-02-10 14:12 ` Ming Lei
1 sibling, 1 reply; 27+ messages in thread
From: Pavel Begunkov @ 2025-02-07 14:08 UTC (permalink / raw)
To: Keith Busch, io-uring, linux-block, ming.lei, axboe; +Cc: Keith Busch
On 2/3/25 15:45, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> Provide an interface for the kernel to leverage the existing
> pre-registered buffers that io_uring provides. User space can reference
> these later to achieve zero-copy IO.
>
> User space must register an empty fixed buffer table with io_uring in
> order for the kernel to make use of it.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> include/linux/io_uring.h | 1 +
> include/linux/io_uring_types.h | 3 +
> io_uring/rsrc.c | 114 +++++++++++++++++++++++++++++++--
> io_uring/rsrc.h | 1 +
> 4 files changed, 114 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> index 85fe4e6b275c7..b5637a2aae340 100644
> --- a/include/linux/io_uring.h
> +++ b/include/linux/io_uring.h
> @@ -5,6 +5,7 @@
> #include <linux/sched.h>
> #include <linux/xarray.h>
> #include <uapi/linux/io_uring.h>
> +#include <linux/blk-mq.h>
>
> #if defined(CONFIG_IO_URING)
> void __io_uring_cancel(bool cancel_all);
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 623d8e798a11a..7e5a5a70c35f2 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -695,4 +695,7 @@ static inline bool io_ctx_cqe32(struct io_ring_ctx *ctx)
> return ctx->flags & IORING_SETUP_CQE32;
> }
>
> +int io_buffer_register_bvec(struct io_ring_ctx *ctx, const struct request *rq, unsigned int tag);
> +void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int tag);
> +
> #endif
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 4d0e1c06c8bc6..8c4c374abcc10 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -111,7 +111,10 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
> if (!refcount_dec_and_test(&imu->refs))
> return;
> for (i = 0; i < imu->nr_bvecs; i++)
> - unpin_user_page(imu->bvec[i].bv_page);
> + if (node->type == IORING_RSRC_KBUF)
> + put_page(imu->bvec[i].bv_page);
Just a note, that's fine but I hope we'll be able to optimise
that later.
> + else
> + unpin_user_page(imu->bvec[i].bv_page);
> if (imu->acct_pages)
> io_unaccount_mem(ctx, imu->acct_pages);
> kvfree(imu);
> @@ -240,6 +243,13 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
> struct io_rsrc_node *node;
> u64 tag = 0;
>
> + i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
> + node = io_rsrc_node_lookup(&ctx->buf_table, i);
> + if (node && node->type != IORING_RSRC_BUFFER) {
We might need to rethink how it's unregistered. The next patch
does it as a ublk commands, but what happens if it gets ejected
by someone else? get_page might protect from kernel corruption
and here you try to forbid ejections, but there is io_rsrc_data_free()
and the io_uring ctx can die as well and it will have to drop it.
And then you don't really have clear ownership rules. Does ublk
release the block request and "return ownership" over the pages to
its user while io_uring is still dying and potentially has some
IO in flight against it?
That's why I liked more the option to allow removing buffers from
the table as per usual io_uring api / rules instead of a separate
unregister ublk cmd. And inside, when all node refs are dropped,
it'd call back to ublk. This way you have a single mechanism of
how buffers are dropped from io_uring perspective. Thoughts?
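Roughly what I have in mind, with made-up names just to illustrate:

	/* hypothetical: ublk passes a release callback at registration time */
	typedef void (*io_buf_release_t)(void *priv);

	int io_buffer_register_bvec(struct io_ring_ctx *ctx,
				    const struct request *rq,
				    unsigned int index,
				    io_buf_release_t release, void *priv);

io_buffer_unmap() would then call release(priv) once the last node reference
is dropped, and ublk does its ublk_put_req_ref() from there. Dropping the
buffer via a user update, ring teardown or a ublk cmd would all funnel
through that one path.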
> + err = -EBUSY;
> + break;
> + }
> +
> uvec = u64_to_user_ptr(user_data);
> iov = iovec_from_user(uvec, 1, 1, &fast_iov, ctx->compat);
> if (IS_ERR(iov)) {
> @@ -258,6 +268,7 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
> err = PTR_ERR(node);
> break;
> }
...
> +int io_buffer_register_bvec(struct io_ring_ctx *ctx, const struct request *rq,
> + unsigned int index)
> +{
> + struct io_rsrc_data *data = &ctx->buf_table;
> + u16 nr_bvecs = blk_rq_nr_phys_segments(rq);
> + struct req_iterator rq_iter;
> + struct io_rsrc_node *node;
> + struct bio_vec bv;
> + int i = 0;
> +
> + lockdep_assert_held(&ctx->uring_lock);
> +
> + if (WARN_ON_ONCE(!data->nr))
> + return -EINVAL;
IIUC you can trigger all of these from user space, so they
can't be warnings. Likely the same goes for unregister*()
> + if (WARN_ON_ONCE(index >= data->nr))
> + return -EINVAL;
> +
> + node = data->nodes[index];
> + if (WARN_ON_ONCE(node))
> + return -EBUSY;
> +
> + node = io_buffer_alloc_node(ctx, nr_bvecs, blk_rq_bytes(rq));
> + if (!node)
> + return -ENOMEM;
> +
> + rq_for_each_bvec(bv, rq, rq_iter) {
> + get_page(bv.bv_page);
> + node->buf->bvec[i].bv_page = bv.bv_page;
> + node->buf->bvec[i].bv_len = bv.bv_len;
> + node->buf->bvec[i].bv_offset = bv.bv_offset;
bvec_set_page() should be more convenient
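e.g. something like this (untested):

	rq_for_each_bvec(bv, rq, rq_iter) {
		get_page(bv.bv_page);
		bvec_set_page(&node->buf->bvec[i++], bv.bv_page, bv.bv_len,
			      bv.bv_offset);
	}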
> + i++;
> + }
> + data->nodes[index] = node;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> +
...
> unsigned long seg_skip;
> diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
> index abd0d5d42c3e1..d1d90d9cd2b43 100644
> --- a/io_uring/rsrc.h
> +++ b/io_uring/rsrc.h
> @@ -13,6 +13,7 @@
> enum {
> IORING_RSRC_FILE = 0,
> IORING_RSRC_BUFFER = 1,
> + IORING_RSRC_KBUF = 2,
The name "kbuf" is already used, to avoid confusion let's rename it.
Ming called it leased buffers before, I think it's a good name.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/6] io_uring: add support for kernel registered bvecs
2025-02-07 14:08 ` Pavel Begunkov
@ 2025-02-07 15:17 ` Keith Busch
2025-02-08 15:49 ` Pavel Begunkov
0 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-02-07 15:17 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Keith Busch, io-uring, linux-block, ming.lei, axboe
On Fri, Feb 07, 2025 at 02:08:23PM +0000, Pavel Begunkov wrote:
> On 2/3/25 15:45, Keith Busch wrote:
> > struct io_rsrc_node *node;
> > u64 tag = 0;
> > + i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
> > + node = io_rsrc_node_lookup(&ctx->buf_table, i);
> > + if (node && node->type != IORING_RSRC_BUFFER) {
>
> We might need to rethink how it's unregistered. The next patch
> does it as a ublk commands, but what happens if it gets ejected
> by someone else? get_page might protect from kernel corruption
> and here you try to forbid ejections, but there is io_rsrc_data_free()
> and the io_uring ctx can die as well and it will have to drop it.
We prevent clearing an index through the typical user register update
call. The expected way for a well-functioning program to clear it is
through the kernel interfaces.
Other than that, there's nothing special about kernel buffers here. You
can kill the ring or tear down registered buffer table, but that same
scenario exists for user registered buffers. The only thing io_uring
needs to ensure is that nothing gets corrupted. User registered buffers
hold a pin on the user pages while the node is referenced. Kernel
registered buffers hold a page reference while the node is referenced.
Nothing special.
> And then you don't really have clear ownership rules. Does ublk
> releases the block request and "returns ownership" over pages to
> its user while io_uring is still dying and potenially have some
> IO inflight against it?
>
> That's why I liked more the option to allow removing buffers from
> the table as per usual io_uring api / rules instead of a separate
> unregister ublk cmd.
ublk is the only entity that knows about the struct request that
provides the bvec we want to use for zero-copy, so it has to be ublk
that handles the registration. Moving the unregister outside of that
breaks the symmetry and requires an indirect call.
> And inside, when all node refs are dropped,
> it'd call back to ublk. This way you have a single mechanism of
> how buffers are dropped from io_uring perspective. Thoughts?
>
> > + err = -EBUSY;
> > + break;
> > + }
> > +
> > uvec = u64_to_user_ptr(user_data);
> > iov = iovec_from_user(uvec, 1, 1, &fast_iov, ctx->compat);
> > if (IS_ERR(iov)) {
> > @@ -258,6 +268,7 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
> > err = PTR_ERR(node);
> > break;
> > }
> ...
> > +int io_buffer_register_bvec(struct io_ring_ctx *ctx, const struct request *rq,
> > + unsigned int index)
> > +{
> > + struct io_rsrc_data *data = &ctx->buf_table;
> > + u16 nr_bvecs = blk_rq_nr_phys_segments(rq);
> > + struct req_iterator rq_iter;
> > + struct io_rsrc_node *node;
> > + struct bio_vec bv;
> > + int i = 0;
> > +
> > + lockdep_assert_held(&ctx->uring_lock);
> > +
> > + if (WARN_ON_ONCE(!data->nr))
> > + return -EINVAL;
>
> IIUC you can trigger all these from the user space, so they
> can't be warnings. Likely same goes for unregister*()
It helped with debugging, but sure, the warns don't need to be there.
> > + if (WARN_ON_ONCE(index >= data->nr))
> > + return -EINVAL;
> > +
> > + node = data->nodes[index];
> > + if (WARN_ON_ONCE(node))
> > + return -EBUSY;
> > +
> > + node = io_buffer_alloc_node(ctx, nr_bvecs, blk_rq_bytes(rq));
> > + if (!node)
> > + return -ENOMEM;
> > +
> > + rq_for_each_bvec(bv, rq, rq_iter) {
> > + get_page(bv.bv_page);
> > + node->buf->bvec[i].bv_page = bv.bv_page;
> > + node->buf->bvec[i].bv_len = bv.bv_len;
> > + node->buf->bvec[i].bv_offset = bv.bv_offset;
>
> bvec_set_page() should be more convenient
Indeed.
> > + i++;
> > + }
> > + data->nodes[index] = node;
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> > +
>
> ...
> > unsigned long seg_skip;
> > diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
> > index abd0d5d42c3e1..d1d90d9cd2b43 100644
> > --- a/io_uring/rsrc.h
> > +++ b/io_uring/rsrc.h
> > @@ -13,6 +13,7 @@
> > enum {
> > IORING_RSRC_FILE = 0,
> > IORING_RSRC_BUFFER = 1,
> > + IORING_RSRC_KBUF = 2,
>
> The name "kbuf" is already used, to avoid confusion let's rename it.
> Ming called it leased buffers before, I think it's a good name.
These are just fixed buffers, just like the user space ones. The only
difference is where the buffer comes from: kernel or user space. I don't
see what the term "lease" has to do with this.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/6] io_uring: add support for kernel registered bvecs
2025-02-07 15:17 ` Keith Busch
@ 2025-02-08 15:49 ` Pavel Begunkov
0 siblings, 0 replies; 27+ messages in thread
From: Pavel Begunkov @ 2025-02-08 15:49 UTC (permalink / raw)
To: Keith Busch; +Cc: Keith Busch, io-uring, linux-block, ming.lei, axboe
On 2/7/25 15:17, Keith Busch wrote:
> On Fri, Feb 07, 2025 at 02:08:23PM +0000, Pavel Begunkov wrote:
>> On 2/3/25 15:45, Keith Busch wrote:
>>> struct io_rsrc_node *node;
>>> u64 tag = 0;
>>> + i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
>>> + node = io_rsrc_node_lookup(&ctx->buf_table, i);
>>> + if (node && node->type != IORING_RSRC_BUFFER) {
>>
>> We might need to rethink how it's unregistered. The next patch
>> does it as a ublk commands, but what happens if it gets ejected
>> by someone else? get_page might protect from kernel corruption
>> and here you try to forbid ejections, but there is io_rsrc_data_free()
>> and the io_uring ctx can die as well and it will have to drop it.
>
> We prevent clearing an index through the typical user register update
> call. The expected way to clear for a well functioning program is
> through the kernel interfaces.
What I'm saying is that it's a sanity check, but it doesn't prevent it
from happening from other paths, and I understand that you're
trying to cover for that.
> Other than that, there's nothing special about kernel buffers here. You
> can kill the ring or tear down registered buffer table, but that same
> scenario exists for user registered buffers. The only thing io_uring
For registered buffers the user can and will have to handle it, but in
the case of this proposal the end ublk user wouldn't even know there is
an io_uring and registered buffers, so ultimately the ublk driver will
have to handle the edge cases. And for the ublk driver to be able to
handle them well even in the case of ublk server failures, it'll need to
be able to wait until io_uring releases the buffer.
For example, the ublk server crashes, which closes io_uring => there
is no way to do unregister cmd anymore. IIUC, the ublk driver will
want to complete the block request returning an error, but if it's
done before io_uring releases the buffer, the end ublk user may
attempt to reuse the memory while io_uring is still concurrently
writing to / reading from it, which would be disastrous.
One thing I like about ublk unregister cmd though, is that you can
add some more control like reporting back a short IO, but I doubt we
can do it sanely without sending some sort of a notification back
to ublk. So, maybe it should be both, and in case of forced
unregistration ublk will consider it to be a failure. Another option
is to do it all through normal(ish) io_uring buffer unregisteration
path, but maybe enhanced with additional custom arguments. This way
we have only one path doing that.
> needs to ensure is that nothing gets corrupted. User registered buffers
> hold a pin on the user pages while the node is referenced. Kernel
> registered buffers hold a page reference while the node is referenced.
> Nothing special.
>
>> And then you don't really have clear ownership rules. Does ublk
>> releases the block request and "returns ownership" over pages to
>> its user while io_uring is still dying and potenially have some
>> IO inflight against it?
>>
>> That's why I liked more the option to allow removing buffers from
>> the table as per usual io_uring api / rules instead of a separate
>> unregister ublk cmd.
>
> ublk is the only entity that knows about the struct request that
> provides the bvec we want to use for zero-copy, so it has to be ublk
> that handles the registration. Moving the unregister outside of that
> breaks the symmetry and requires an indirect call.
cmd execution takes 2 indirect calls, not like there is a
difference here.
>
>> And inside, when all node refs are dropped,
>> it'd call back to ublk. This way you have a single mechanism of
>> how buffers are dropped from io_uring perspective. Thoughts?
>>
>>> + err = -EBUSY;
>>> + break;
>>> + }
>>> +
...
>> ...
>>> unsigned long seg_skip;
>>> diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
>>> index abd0d5d42c3e1..d1d90d9cd2b43 100644
>>> --- a/io_uring/rsrc.h
>>> +++ b/io_uring/rsrc.h
>>> @@ -13,6 +13,7 @@
>>> enum {
>>> IORING_RSRC_FILE = 0,
>>> IORING_RSRC_BUFFER = 1,
>>> + IORING_RSRC_KBUF = 2,
>>
>> The name "kbuf" is already used, to avoid confusion let's rename it.
>> Ming called it leased buffers before, I think it's a good name.
>
> These are just fixed buffers, just like user space onces. The only
> difference is where the buffer comes from: kernel or userspace? I don't
> see what the term "lease" has to do with this.
In this particular case, there is a kernel component that expects
it back, and that's the leasing part. But thinking about it more, you're
right, the interface can support workflows different from it as well.
I actually like kbuf, but again it's confusing because it's already used
for an entirely different thing. Maybe it's fine if it doesn't leak
outside of the node types.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/6] io_uring: add support for kernel registered bvecs
2025-02-03 15:45 ` [PATCH 3/6] io_uring: add support for kernel registered bvecs Keith Busch
2025-02-07 14:08 ` Pavel Begunkov
@ 2025-02-10 14:12 ` Ming Lei
2025-02-10 15:05 ` Keith Busch
1 sibling, 1 reply; 27+ messages in thread
From: Ming Lei @ 2025-02-10 14:12 UTC (permalink / raw)
To: Keith Busch; +Cc: io-uring, linux-block, axboe, asml.silence, Keith Busch
On Mon, Feb 03, 2025 at 07:45:14AM -0800, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> Provide an interface for the kernel to leverage the existing
> pre-registered buffers that io_uring provides. User space can reference
> these later to achieve zero-copy IO.
>
> User space must register an empty fixed buffer table with io_uring in
> order for the kernel to make use of it.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> include/linux/io_uring.h | 1 +
> include/linux/io_uring_types.h | 3 +
> io_uring/rsrc.c | 114 +++++++++++++++++++++++++++++++--
> io_uring/rsrc.h | 1 +
> 4 files changed, 114 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> index 85fe4e6b275c7..b5637a2aae340 100644
> --- a/include/linux/io_uring.h
> +++ b/include/linux/io_uring.h
> @@ -5,6 +5,7 @@
> #include <linux/sched.h>
> #include <linux/xarray.h>
> #include <uapi/linux/io_uring.h>
> +#include <linux/blk-mq.h>
>
> #if defined(CONFIG_IO_URING)
> void __io_uring_cancel(bool cancel_all);
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 623d8e798a11a..7e5a5a70c35f2 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -695,4 +695,7 @@ static inline bool io_ctx_cqe32(struct io_ring_ctx *ctx)
> return ctx->flags & IORING_SETUP_CQE32;
> }
>
> +int io_buffer_register_bvec(struct io_ring_ctx *ctx, const struct request *rq, unsigned int tag);
> +void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int tag);
> +
> #endif
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 4d0e1c06c8bc6..8c4c374abcc10 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -111,7 +111,10 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
> if (!refcount_dec_and_test(&imu->refs))
> return;
> for (i = 0; i < imu->nr_bvecs; i++)
> - unpin_user_page(imu->bvec[i].bv_page);
> + if (node->type == IORING_RSRC_KBUF)
> + put_page(imu->bvec[i].bv_page);
> + else
> + unpin_user_page(imu->bvec[i].bv_page);
> if (imu->acct_pages)
> io_unaccount_mem(ctx, imu->acct_pages);
> kvfree(imu);
> @@ -240,6 +243,13 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
> struct io_rsrc_node *node;
> u64 tag = 0;
>
> + i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
> + node = io_rsrc_node_lookup(&ctx->buf_table, i);
> + if (node && node->type != IORING_RSRC_BUFFER) {
> + err = -EBUSY;
> + break;
> + }
> +
> uvec = u64_to_user_ptr(user_data);
> iov = iovec_from_user(uvec, 1, 1, &fast_iov, ctx->compat);
> if (IS_ERR(iov)) {
> @@ -258,6 +268,7 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
> err = PTR_ERR(node);
> break;
> }
> +
> if (tag) {
> if (!node) {
> err = -EINVAL;
> @@ -265,7 +276,6 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
> }
> node->tag = tag;
> }
> - i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
> io_reset_rsrc_node(ctx, &ctx->buf_table, i);
> ctx->buf_table.nodes[i] = node;
> if (ctx->compat)
> @@ -453,6 +463,7 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
> fput(io_slot_file(node));
> break;
> case IORING_RSRC_BUFFER:
> + case IORING_RSRC_KBUF:
> if (node->buf)
> io_buffer_unmap(ctx, node);
> break;
> @@ -860,6 +871,92 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
> return ret;
> }
>
> +static struct io_rsrc_node *io_buffer_alloc_node(struct io_ring_ctx *ctx,
> + unsigned int nr_bvecs,
> + unsigned int len)
> +{
> + struct io_mapped_ubuf *imu;
> + struct io_rsrc_node *node;
> +
> + node = io_rsrc_node_alloc(IORING_RSRC_KBUF);
> + if (!node)
> + return NULL;
> +
> + imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
> + if (!imu) {
> + io_put_rsrc_node(ctx, node);
> + return NULL;
> + }
> +
> + imu->ubuf = 0;
> + imu->len = len;
> + imu->acct_pages = 0;
> + imu->nr_bvecs = nr_bvecs;
> + refcount_set(&imu->refs, 1);
> +
> + node->buf = imu;
> + return node;
> +}
> +
> +int io_buffer_register_bvec(struct io_ring_ctx *ctx, const struct request *rq,
> + unsigned int index)
> +{
> + struct io_rsrc_data *data = &ctx->buf_table;
> + u16 nr_bvecs = blk_rq_nr_phys_segments(rq);
> + struct req_iterator rq_iter;
> + struct io_rsrc_node *node;
> + struct bio_vec bv;
> + int i = 0;
> +
> + lockdep_assert_held(&ctx->uring_lock);
> +
> + if (WARN_ON_ONCE(!data->nr))
> + return -EINVAL;
> + if (WARN_ON_ONCE(index >= data->nr))
> + return -EINVAL;
> +
> + node = data->nodes[index];
> + if (WARN_ON_ONCE(node))
> + return -EBUSY;
> +
> + node = io_buffer_alloc_node(ctx, nr_bvecs, blk_rq_bytes(rq));
> + if (!node)
> + return -ENOMEM;
> +
> + rq_for_each_bvec(bv, rq, rq_iter) {
> + get_page(bv.bv_page);
> + node->buf->bvec[i].bv_page = bv.bv_page;
> + node->buf->bvec[i].bv_len = bv.bv_len;
> + node->buf->bvec[i].bv_offset = bv.bv_offset;
> + i++;
In this patchset, a ublk request buffer may cross uring OPs, so it is inevitable
that a buggy application can complete the IO command & ublk request before
the io_uring read/write OP using the buffer/page has completed.
That is probably the reason why the page reference is increased here, but
then the bvec page lifetime is no longer aligned with the request from the
block layer viewpoint.
Not sure this way is safe:
1) for current block storage drivers, when a request is completed, ownership
of all request bvec pages is transferred to the upper layer (FS, application, ...),
but that is no longer true for ublk zero copy with this patchset
2) BIO_PAGE_PINNED may not be set for the bio, so the upper layer might think
that the bvec pages can be reused or reclaimed after this ublk bio is completed.
Thanks,
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/6] io_uring: add support for kernel registered bvecs
2025-02-10 14:12 ` Ming Lei
@ 2025-02-10 15:05 ` Keith Busch
0 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2025-02-10 15:05 UTC (permalink / raw)
To: Ming Lei; +Cc: Keith Busch, io-uring, linux-block, axboe, asml.silence
On Mon, Feb 10, 2025 at 10:12:44PM +0800, Ming Lei wrote:
> On Mon, Feb 03, 2025 at 07:45:14AM -0800, Keith Busch wrote:
> > + rq_for_each_bvec(bv, rq, rq_iter) {
> > + get_page(bv.bv_page);
> > + node->buf->bvec[i].bv_page = bv.bv_page;
> > + node->buf->bvec[i].bv_len = bv.bv_len;
> > + node->buf->bvec[i].bv_offset = bv.bv_offset;
> > + i++;
>
> In this patchset, a ublk request buffer may cross uring OPs, so it is inevitable
> that a buggy application can complete the IO command & ublk request before
> the io_uring read/write OP using the buffer/page has completed.
The buggy app would have to both complete the requests and unregister
the fixed buffer (the registration takes a reference, too) while having
backend requests in flight using that registered buffer. That could
happen, which is why the page references are elevated. It should contain
the fallout of the buggy application to the application's memory.
But if this is really a scenario that we must prevent from happening,
then I think the indirect callback is really the best option. It's not a
big deal, I just wanted to try to avoid it.
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 4/6] ublk: zc register/unregister bvec
2025-02-03 15:45 [PATCH 0/6] ublk zero-copy support Keith Busch
` (2 preceding siblings ...)
2025-02-03 15:45 ` [PATCH 3/6] io_uring: add support for kernel registered bvecs Keith Busch
@ 2025-02-03 15:45 ` Keith Busch
2025-02-08 5:50 ` Ming Lei
2025-02-03 15:45 ` [PATCH 5/6] io_uring: add abstraction for buf_table rsrc data Keith Busch
` (4 subsequent siblings)
8 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-02-03 15:45 UTC (permalink / raw)
To: io-uring, linux-block, ming.lei, axboe, asml.silence; +Cc: Keith Busch
From: Keith Busch <[email protected]>
Provide new operations for the user to request mapping an active request
to an io_uring instance's buf_table. The user has to provide the index at
which it wants to install the buffer.
A reference count is taken on the request to ensure it can't be
completed while it is active in a ring's buf_table.
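For illustration, a ublk server is expected to drive this roughly as follows
(pseudocode; ring, ublk_char_fd, backend_fd and idx are placeholders, and the
usual fetch/commit handling is omitted):

	struct io_uring_sqe *sqe;

	/* 1) install the ublk request's bvecs at buffer table slot 'idx' */
	sqe = io_uring_get_sqe(&ring);
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = ublk_char_fd;
	sqe->cmd_op = UBLK_U_IO_REGISTER_IO_BUF;
	/* struct ublksrv_io_cmd in sqe->cmd: q_id/tag pick the request,
	 * ->addr carries 'idx' */

	/* 2) issue the backend IO against that fixed buffer, no copy */
	sqe = io_uring_get_sqe(&ring);
	sqe->opcode = IORING_OP_READ_FIXED;	/* or IORING_OP_WRITE_FIXED */
	sqe->fd = backend_fd;
	sqe->buf_index = idx;

	/* 3) release the request reference once the backend IO has completed */
	sqe = io_uring_get_sqe(&ring);
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = ublk_char_fd;
	sqe->cmd_op = UBLK_U_IO_UNREGISTER_IO_BUF;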
Signed-off-by: Keith Busch <[email protected]>
---
drivers/block/ublk_drv.c | 139 +++++++++++++++++++++++++---------
include/uapi/linux/ublk_cmd.h | 4 +
2 files changed, 107 insertions(+), 36 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 529085181f355..58f224b5687b9 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -51,6 +51,9 @@
/* private ioctl command mirror */
#define UBLK_CMD_DEL_DEV_ASYNC _IOC_NR(UBLK_U_CMD_DEL_DEV_ASYNC)
+#define UBLK_IO_REGISTER_IO_BUF _IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
+#define UBLK_IO_UNREGISTER_IO_BUF _IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)
+
/* All UBLK_F_* have to be included into UBLK_F_ALL */
#define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \
| UBLK_F_URING_CMD_COMP_IN_TASK \
@@ -76,6 +79,9 @@ struct ublk_rq_data {
struct llist_node node;
struct kref ref;
+
+#define UBLK_ZC_REGISTERED 0
+ unsigned long flags;
};
struct ublk_uring_cmd_pdu {
@@ -201,7 +207,7 @@ static inline struct ublksrv_io_desc *ublk_get_iod(struct ublk_queue *ubq,
int tag);
static inline bool ublk_dev_is_user_copy(const struct ublk_device *ub)
{
- return ub->dev_info.flags & UBLK_F_USER_COPY;
+ return ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
}
static inline bool ublk_dev_is_zoned(const struct ublk_device *ub)
@@ -581,7 +587,7 @@ static void ublk_apply_params(struct ublk_device *ub)
static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
{
- return ubq->flags & UBLK_F_USER_COPY;
+ return ubq->flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
}
static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
@@ -1747,6 +1753,96 @@ static inline void ublk_prep_cancel(struct io_uring_cmd *cmd,
io_uring_cmd_mark_cancelable(cmd, issue_flags);
}
+
+static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
+ struct ublk_queue *ubq, int tag, size_t offset)
+{
+ struct request *req;
+
+ if (!ublk_need_req_ref(ubq))
+ return NULL;
+
+ req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+ if (!req)
+ return NULL;
+
+ if (!ublk_get_req_ref(ubq, req))
+ return NULL;
+
+ if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
+ goto fail_put;
+
+ if (!ublk_rq_has_data(req))
+ goto fail_put;
+
+ if (offset > blk_rq_bytes(req))
+ goto fail_put;
+
+ return req;
+fail_put:
+ ublk_put_req_ref(ubq, req);
+ return NULL;
+}
+
+static int ublk_register_io_buf(struct io_uring_cmd *cmd,
+ struct ublk_queue *ubq, int tag,
+ const struct ublksrv_io_cmd *ub_cmd)
+{
+ struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+ struct ublk_device *ub = cmd->file->private_data;
+ int index = (int)ub_cmd->addr, ret;
+ struct ublk_rq_data *data;
+ struct request *req;
+
+ if (!ub)
+ return -EPERM;
+
+ req = __ublk_check_and_get_req(ub, ubq, tag, 0);
+ if (!req)
+ return -EINVAL;
+
+ data = blk_mq_rq_to_pdu(req);
+ if (test_and_set_bit(UBLK_ZC_REGISTERED, &data->flags)) {
+ ublk_put_req_ref(ubq, req);
+ return -EBUSY;
+ }
+
+ ret = io_buffer_register_bvec(ctx, req, index);
+ if (ret) {
+ clear_bit(UBLK_ZC_REGISTERED, &data->flags);
+ ublk_put_req_ref(ubq, req);
+ return ret;
+ }
+
+ return 0;
+}
+
+static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
+ struct ublk_queue *ubq, int tag,
+ const struct ublksrv_io_cmd *ub_cmd)
+{
+ struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+ struct ublk_device *ub = cmd->file->private_data;
+ int index = (int)ub_cmd->addr;
+ struct ublk_rq_data *data;
+ struct request *req;
+
+ if (!ub)
+ return -EPERM;
+
+ req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+ if (!req)
+ return -EINVAL;
+
+ data = blk_mq_rq_to_pdu(req);
+ if (!test_and_clear_bit(UBLK_ZC_REGISTERED, &data->flags))
+ return -EINVAL;
+
+ ublk_put_req_ref(ubq, req);
+ io_buffer_unregister_bvec(ctx, index);
+ return 0;
+}
+
static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
unsigned int issue_flags,
const struct ublksrv_io_cmd *ub_cmd)
@@ -1798,6 +1894,10 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
ret = -EINVAL;
switch (_IOC_NR(cmd_op)) {
+ case UBLK_IO_REGISTER_IO_BUF:
+ return ublk_register_io_buf(cmd, ubq, tag, ub_cmd);
+ case UBLK_IO_UNREGISTER_IO_BUF:
+ return ublk_unregister_io_buf(cmd, ubq, tag, ub_cmd);
case UBLK_IO_FETCH_REQ:
/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
if (ublk_queue_ready(ubq)) {
@@ -1872,36 +1972,6 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
return -EIOCBQUEUED;
}
-static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
- struct ublk_queue *ubq, int tag, size_t offset)
-{
- struct request *req;
-
- if (!ublk_need_req_ref(ubq))
- return NULL;
-
- req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
- if (!req)
- return NULL;
-
- if (!ublk_get_req_ref(ubq, req))
- return NULL;
-
- if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
- goto fail_put;
-
- if (!ublk_rq_has_data(req))
- goto fail_put;
-
- if (offset > blk_rq_bytes(req))
- goto fail_put;
-
- return req;
-fail_put:
- ublk_put_req_ref(ubq, req);
- return NULL;
-}
-
static inline int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd,
unsigned int issue_flags)
{
@@ -2527,9 +2597,6 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
goto out_free_dev_number;
}
- /* We are not ready to support zero copy */
- ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
-
ub->dev_info.nr_hw_queues = min_t(unsigned int,
ub->dev_info.nr_hw_queues, nr_cpu_ids);
ublk_align_max_io_size(ub);
@@ -2860,7 +2927,7 @@ static int ublk_ctrl_get_features(struct io_uring_cmd *cmd)
{
const struct ublksrv_ctrl_cmd *header = io_uring_sqe_cmd(cmd->sqe);
void __user *argp = (void __user *)(unsigned long)header->addr;
- u64 features = UBLK_F_ALL & ~UBLK_F_SUPPORT_ZERO_COPY;
+ u64 features = UBLK_F_ALL;
if (header->len != UBLK_FEATURES_LEN || !header->addr)
return -EINVAL;
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index a8bc98bb69fce..74246c926b55f 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -94,6 +94,10 @@
_IOWR('u', UBLK_IO_COMMIT_AND_FETCH_REQ, struct ublksrv_io_cmd)
#define UBLK_U_IO_NEED_GET_DATA \
_IOWR('u', UBLK_IO_NEED_GET_DATA, struct ublksrv_io_cmd)
+#define UBLK_U_IO_REGISTER_IO_BUF \
+ _IOWR('u', 0x23, struct ublksrv_io_cmd)
+#define UBLK_U_IO_UNREGISTER_IO_BUF \
+ _IOWR('u', 0x24, struct ublksrv_io_cmd)
/* only ABORT means that no re-fetch */
#define UBLK_IO_RES_OK 0
--
2.43.5
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH 4/6] ublk: zc register/unregister bvec
2025-02-03 15:45 ` [PATCH 4/6] ublk: zc register/unregister bvec Keith Busch
@ 2025-02-08 5:50 ` Ming Lei
0 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2025-02-08 5:50 UTC (permalink / raw)
To: Keith Busch; +Cc: io-uring, linux-block, axboe, asml.silence, Keith Busch
On Mon, Feb 03, 2025 at 07:45:15AM -0800, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> Provide new operations for the user to request mapping an active request
> to an io uring instance's buf_table. The user has to provide the index
> it wants to install the buffer.
>
> A reference count is taken on the request to ensure it can't be
> completed while it is active in a ring's buf_table.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> drivers/block/ublk_drv.c | 139 +++++++++++++++++++++++++---------
> include/uapi/linux/ublk_cmd.h | 4 +
> 2 files changed, 107 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 529085181f355..58f224b5687b9 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -51,6 +51,9 @@
> /* private ioctl command mirror */
> #define UBLK_CMD_DEL_DEV_ASYNC _IOC_NR(UBLK_U_CMD_DEL_DEV_ASYNC)
>
> +#define UBLK_IO_REGISTER_IO_BUF _IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
> +#define UBLK_IO_UNREGISTER_IO_BUF _IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)
> +
...
> @@ -1798,6 +1894,10 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
>
> ret = -EINVAL;
> switch (_IOC_NR(cmd_op)) {
> + case UBLK_IO_REGISTER_IO_BUF:
> + return ublk_register_io_buf(cmd, ubq, tag, ub_cmd);
> + case UBLK_IO_UNREGISTER_IO_BUF:
> + return ublk_unregister_io_buf(cmd, ubq, tag, ub_cmd);
Here IO_BUF is a kernel buffer, so we have to make sure that it won't be
leaked.
For example, if the application panics, how does it get un-registered?
Thanks,
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 5/6] io_uring: add abstraction for buf_table rsrc data
2025-02-03 15:45 [PATCH 0/6] ublk zero-copy support Keith Busch
` (3 preceding siblings ...)
2025-02-03 15:45 ` [PATCH 4/6] ublk: zc register/unregister bvec Keith Busch
@ 2025-02-03 15:45 ` Keith Busch
2025-02-03 15:45 ` [PATCH 6/6] io_uring: cache nodes and mapped buffers Keith Busch
` (3 subsequent siblings)
8 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2025-02-03 15:45 UTC (permalink / raw)
To: io-uring, linux-block, ming.lei, axboe, asml.silence; +Cc: Keith Busch
From: Keith Busch <[email protected]>
We'll need to add more fields specific to the registered buffers, so
make a layer for it now. No functional change in this patch.
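For reference, the next patch in this series grows the new layer into:

	struct io_buf_table {
		struct io_rsrc_data data;
		struct io_alloc_cache node_cache;
		struct io_alloc_cache imu_cache;
	};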
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/io_uring_types.h | 6 +++-
io_uring/fdinfo.c | 8 +++---
io_uring/net.c | 2 +-
io_uring/nop.c | 2 +-
io_uring/register.c | 2 +-
io_uring/rsrc.c | 51 +++++++++++++++++-----------------
io_uring/rw.c | 2 +-
io_uring/uring_cmd.c | 2 +-
8 files changed, 39 insertions(+), 36 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 7e5a5a70c35f2..aa661ebfd6568 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -67,6 +67,10 @@ struct io_file_table {
unsigned int alloc_hint;
};
+struct io_buf_table {
+ struct io_rsrc_data data;
+};
+
struct io_hash_bucket {
struct hlist_head list;
} ____cacheline_aligned_in_smp;
@@ -290,7 +294,7 @@ struct io_ring_ctx {
struct io_wq_work_list iopoll_list;
struct io_file_table file_table;
- struct io_rsrc_data buf_table;
+ struct io_buf_table buf_table;
struct io_submit_state submit_state;
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index f60d0a9d505e2..d389c06cbce10 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -217,12 +217,12 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *file)
seq_puts(m, "\n");
}
}
- seq_printf(m, "UserBufs:\t%u\n", ctx->buf_table.nr);
- for (i = 0; has_lock && i < ctx->buf_table.nr; i++) {
+ seq_printf(m, "UserBufs:\t%u\n", ctx->buf_table.data.nr);
+ for (i = 0; has_lock && i < ctx->buf_table.data.nr; i++) {
struct io_mapped_ubuf *buf = NULL;
- if (ctx->buf_table.nodes[i])
- buf = ctx->buf_table.nodes[i]->buf;
+ if (ctx->buf_table.data.nodes[i])
+ buf = ctx->buf_table.data.nodes[i]->buf;
if (buf)
seq_printf(m, "%5u: 0x%llx/%u\n", i, buf->ubuf, buf->len);
else
diff --git a/io_uring/net.c b/io_uring/net.c
index 4e9d0f04b902d..4917786456cf8 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1348,7 +1348,7 @@ static int io_send_zc_import(struct io_kiocb *req, unsigned int issue_flags)
ret = -EFAULT;
io_ring_submit_lock(ctx, issue_flags);
- node = io_rsrc_node_lookup(&ctx->buf_table, sr->buf_index);
+ node = io_rsrc_node_lookup(&ctx->buf_table.data, sr->buf_index);
if (node) {
io_req_assign_buf_node(sr->notif, node);
ret = 0;
diff --git a/io_uring/nop.c b/io_uring/nop.c
index 5e5196df650a1..e3ebe5f019076 100644
--- a/io_uring/nop.c
+++ b/io_uring/nop.c
@@ -69,7 +69,7 @@ int io_nop(struct io_kiocb *req, unsigned int issue_flags)
ret = -EFAULT;
io_ring_submit_lock(ctx, issue_flags);
- node = io_rsrc_node_lookup(&ctx->buf_table, nop->buffer);
+ node = io_rsrc_node_lookup(&ctx->buf_table.data, nop->buffer);
if (node) {
io_req_assign_buf_node(req, node);
ret = 0;
diff --git a/io_uring/register.c b/io_uring/register.c
index 0db181437ae33..e8f00b19e75f6 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -919,7 +919,7 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
ret = __io_uring_register(ctx, opcode, arg, nr_args);
trace_io_uring_register(ctx, opcode, ctx->file_table.data.nr,
- ctx->buf_table.nr, ret);
+ ctx->buf_table.data.nr, ret);
mutex_unlock(&ctx->uring_lock);
fput(file);
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 8c4c374abcc10..864c2eabf8efd 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -234,17 +234,17 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
__u32 done;
int i, err;
- if (!ctx->buf_table.nr)
+ if (!ctx->buf_table.data.nr)
return -ENXIO;
- if (up->offset + nr_args > ctx->buf_table.nr)
+ if (up->offset + nr_args > ctx->buf_table.data.nr)
return -EINVAL;
for (done = 0; done < nr_args; done++) {
struct io_rsrc_node *node;
u64 tag = 0;
- i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
- node = io_rsrc_node_lookup(&ctx->buf_table, i);
+ i = array_index_nospec(up->offset + done, ctx->buf_table.data.nr);
+ node = io_rsrc_node_lookup(&ctx->buf_table.data, i);
if (node && node->type != IORING_RSRC_BUFFER) {
err = -EBUSY;
break;
@@ -276,8 +276,8 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
}
node->tag = tag;
}
- io_reset_rsrc_node(ctx, &ctx->buf_table, i);
- ctx->buf_table.nodes[i] = node;
+ io_reset_rsrc_node(ctx, &ctx->buf_table.data, i);
+ ctx->buf_table.data.nodes[i] = node;
if (ctx->compat)
user_data += sizeof(struct compat_iovec);
else
@@ -556,9 +556,9 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
{
- if (!ctx->buf_table.nr)
+ if (!ctx->buf_table.data.nr)
return -ENXIO;
- io_rsrc_data_free(ctx, &ctx->buf_table);
+ io_rsrc_data_free(ctx, &ctx->buf_table.data);
return 0;
}
@@ -585,8 +585,8 @@ static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
}
/* check previously registered pages */
- for (i = 0; i < ctx->buf_table.nr; i++) {
- struct io_rsrc_node *node = ctx->buf_table.nodes[i];
+ for (i = 0; i < ctx->buf_table.data.nr; i++) {
+ struct io_rsrc_node *node = ctx->buf_table.data.nodes[i];
struct io_mapped_ubuf *imu;
if (!node)
@@ -812,7 +812,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
BUILD_BUG_ON(IORING_MAX_REG_BUFFERS >= (1u << 16));
- if (ctx->buf_table.nr)
+ if (ctx->buf_table.data.nr)
return -EBUSY;
if (!nr_args || nr_args > IORING_MAX_REG_BUFFERS)
return -EINVAL;
@@ -865,7 +865,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
data.nodes[i] = node;
}
- ctx->buf_table = data;
+ ctx->buf_table.data = data;
if (ret)
io_sqe_buffers_unregister(ctx);
return ret;
@@ -901,7 +901,7 @@ static struct io_rsrc_node *io_buffer_alloc_node(struct io_ring_ctx *ctx,
int io_buffer_register_bvec(struct io_ring_ctx *ctx, const struct request *rq,
unsigned int index)
{
- struct io_rsrc_data *data = &ctx->buf_table;
+ struct io_rsrc_data *data = &ctx->buf_table.data;
u16 nr_bvecs = blk_rq_nr_phys_segments(rq);
struct req_iterator rq_iter;
struct io_rsrc_node *node;
@@ -938,7 +938,7 @@ EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index)
{
- struct io_rsrc_data *data = &ctx->buf_table;
+ struct io_rsrc_data *data = &ctx->buf_table.data;
struct io_rsrc_node *node;
lockdep_assert_held(&ctx->uring_lock);
@@ -1054,10 +1054,10 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
if (!arg->nr && (arg->dst_off || arg->src_off))
return -EINVAL;
/* not allowed unless REPLACE is set */
- if (ctx->buf_table.nr && !(arg->flags & IORING_REGISTER_DST_REPLACE))
+ if (ctx->buf_table.data.nr && !(arg->flags & IORING_REGISTER_DST_REPLACE))
return -EBUSY;
- nbufs = src_ctx->buf_table.nr;
+ nbufs = src_ctx->buf_table.data.nr;
if (!arg->nr)
arg->nr = nbufs;
else if (arg->nr > nbufs)
@@ -1067,13 +1067,13 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
if (check_add_overflow(arg->nr, arg->dst_off, &nbufs))
return -EOVERFLOW;
- ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.nr));
+ ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.data.nr));
if (ret)
return ret;
/* Fill entries in data from dst that won't overlap with src */
- for (i = 0; i < min(arg->dst_off, ctx->buf_table.nr); i++) {
- struct io_rsrc_node *src_node = ctx->buf_table.nodes[i];
+ for (i = 0; i < min(arg->dst_off, ctx->buf_table.data.nr); i++) {
+ struct io_rsrc_node *src_node = ctx->buf_table.data.nodes[i];
if (src_node) {
data.nodes[i] = src_node;
@@ -1082,7 +1082,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
}
ret = -ENXIO;
- nbufs = src_ctx->buf_table.nr;
+ nbufs = src_ctx->buf_table.data.nr;
if (!nbufs)
goto out_free;
ret = -EINVAL;
@@ -1102,7 +1102,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
while (nr--) {
struct io_rsrc_node *dst_node, *src_node;
- src_node = io_rsrc_node_lookup(&src_ctx->buf_table, i);
+ src_node = io_rsrc_node_lookup(&src_ctx->buf_table.data, i);
if (!src_node) {
dst_node = NULL;
} else {
@@ -1124,7 +1124,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
* old and new nodes at this point.
*/
if (arg->flags & IORING_REGISTER_DST_REPLACE)
- io_rsrc_data_free(ctx, &ctx->buf_table);
+ io_sqe_buffers_unregister(ctx);
/*
* ctx->buf_table must be empty now - either the contents are being
@@ -1132,10 +1132,9 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
* copied to a ring that does not have buffers yet (checked at function
* entry).
*/
- WARN_ON_ONCE(ctx->buf_table.nr);
- ctx->buf_table = data;
+ WARN_ON_ONCE(ctx->buf_table.data.nr);
+ ctx->buf_table.data = data;
return 0;
-
out_free:
io_rsrc_data_free(ctx, &data);
return ret;
@@ -1160,7 +1159,7 @@ int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg)
return -EFAULT;
if (buf.flags & ~(IORING_REGISTER_SRC_REGISTERED|IORING_REGISTER_DST_REPLACE))
return -EINVAL;
- if (!(buf.flags & IORING_REGISTER_DST_REPLACE) && ctx->buf_table.nr)
+ if (!(buf.flags & IORING_REGISTER_DST_REPLACE) && ctx->buf_table.data.nr)
return -EBUSY;
if (memchr_inv(buf.pad, 0, sizeof(buf.pad)))
return -EINVAL;
diff --git a/io_uring/rw.c b/io_uring/rw.c
index d6332d019dd56..f49ae3de94317 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -387,7 +387,7 @@ static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe
if (unlikely(ret))
return ret;
- node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
+ node = io_rsrc_node_lookup(&ctx->buf_table.data, req->buf_index);
if (!node)
return -EFAULT;
io_req_assign_buf_node(req, node);
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index b7b9baf30d728..5c9f14d700373 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -213,7 +213,7 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
struct io_rsrc_node *node;
u16 index = READ_ONCE(sqe->buf_index);
- node = io_rsrc_node_lookup(&ctx->buf_table, index);
+ node = io_rsrc_node_lookup(&ctx->buf_table.data, index);
if (unlikely(!node))
return -EFAULT;
/*
--
2.43.5
^ permalink raw reply related [flat|nested] 27+ messages in thread
* [PATCH 6/6] io_uring: cache nodes and mapped buffers
2025-02-03 15:45 [PATCH 0/6] ublk zero-copy support Keith Busch
` (4 preceding siblings ...)
2025-02-03 15:45 ` [PATCH 5/6] io_uring: add abstraction for buf_table rsrc data Keith Busch
@ 2025-02-03 15:45 ` Keith Busch
2025-02-07 12:41 ` Pavel Begunkov
2025-02-06 15:28 ` [PATCH 0/6] ublk zero-copy support Keith Busch
` (2 subsequent siblings)
8 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-02-03 15:45 UTC (permalink / raw)
To: io-uring, linux-block, ming.lei, axboe, asml.silence; +Cc: Keith Busch
From: Keith Busch <[email protected]>
Frequent alloc/free cycles on these are pretty costly. Use an io cache to
reuse these buffers more efficiently.
Signed-off-by: Keith Busch <[email protected]>
---
include/linux/io_uring_types.h | 16 ++---
io_uring/filetable.c | 2 +-
io_uring/rsrc.c | 108 ++++++++++++++++++++++++---------
io_uring/rsrc.h | 2 +-
4 files changed, 92 insertions(+), 36 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index aa661ebfd6568..c0e0c1f92e5b1 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -67,8 +67,17 @@ struct io_file_table {
unsigned int alloc_hint;
};
+struct io_alloc_cache {
+ void **entries;
+ unsigned int nr_cached;
+ unsigned int max_cached;
+ size_t elem_size;
+};
+
struct io_buf_table {
struct io_rsrc_data data;
+ struct io_alloc_cache node_cache;
+ struct io_alloc_cache imu_cache;
};
struct io_hash_bucket {
@@ -222,13 +231,6 @@ struct io_submit_state {
struct blk_plug plug;
};
-struct io_alloc_cache {
- void **entries;
- unsigned int nr_cached;
- unsigned int max_cached;
- size_t elem_size;
-};
-
struct io_ring_ctx {
/* const or read-mostly hot data */
struct {
diff --git a/io_uring/filetable.c b/io_uring/filetable.c
index dd8eeec97acf6..a21660e3145ab 100644
--- a/io_uring/filetable.c
+++ b/io_uring/filetable.c
@@ -68,7 +68,7 @@ static int io_install_fixed_file(struct io_ring_ctx *ctx, struct file *file,
if (slot_index >= ctx->file_table.data.nr)
return -EINVAL;
- node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+ node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
if (!node)
return -ENOMEM;
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 864c2eabf8efd..5434b0d992d62 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -117,23 +117,39 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
unpin_user_page(imu->bvec[i].bv_page);
if (imu->acct_pages)
io_unaccount_mem(ctx, imu->acct_pages);
- kvfree(imu);
+ if (struct_size(imu, bvec, imu->nr_bvecs) >
+ ctx->buf_table.imu_cache.elem_size ||
+ !io_alloc_cache_put(&ctx->buf_table.imu_cache, imu))
+ kvfree(imu);
}
}
-struct io_rsrc_node *io_rsrc_node_alloc(int type)
+struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type)
{
struct io_rsrc_node *node;
- node = kzalloc(sizeof(*node), GFP_KERNEL);
+ if (type == IORING_RSRC_FILE)
+ node = kmalloc(sizeof(*node), GFP_KERNEL);
+ else
+ node = io_cache_alloc(&ctx->buf_table.node_cache, GFP_KERNEL, NULL);
if (node) {
node->type = type;
node->refs = 1;
+ node->tag = 0;
+ node->file_ptr = 0;
}
return node;
}
-__cold void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data)
+static __cold void __io_rsrc_data_free(struct io_rsrc_data *data)
+{
+ kvfree(data->nodes);
+ data->nodes = NULL;
+ data->nr = 0;
+}
+
+__cold void io_rsrc_data_free(struct io_ring_ctx *ctx,
+ struct io_rsrc_data *data)
{
if (!data->nr)
return;
@@ -141,9 +157,7 @@ __cold void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data
if (data->nodes[data->nr])
io_put_rsrc_node(ctx, data->nodes[data->nr]);
}
- kvfree(data->nodes);
- data->nodes = NULL;
- data->nr = 0;
+ __io_rsrc_data_free(data);
}
__cold int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr)
@@ -157,6 +171,31 @@ __cold int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr)
return -ENOMEM;
}
+static __cold int io_rsrc_buffer_alloc(struct io_buf_table *table, unsigned nr)
+{
+ int ret;
+
+ ret = io_rsrc_data_alloc(&table->data, nr);
+ if (ret)
+ return ret;
+
+ ret = io_alloc_cache_init(&table->node_cache, nr,
+ sizeof(struct io_rsrc_node));
+ if (ret)
+ goto out_1;
+
+ ret = io_alloc_cache_init(&table->imu_cache, nr, 512);
+ if (ret)
+ goto out_2;
+
+ return 0;
+out_2:
+ io_alloc_cache_free(&table->node_cache, kfree);
+out_1:
+ __io_rsrc_data_free(&table->data);
+ return ret;
+}
+
static int __io_sqe_files_update(struct io_ring_ctx *ctx,
struct io_uring_rsrc_update2 *up,
unsigned nr_args)
@@ -206,7 +245,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
err = -EBADF;
break;
}
- node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+ node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
if (!node) {
err = -ENOMEM;
fput(file);
@@ -466,6 +505,8 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
case IORING_RSRC_KBUF:
if (node->buf)
io_buffer_unmap(ctx, node);
+ if (io_alloc_cache_put(&ctx->buf_table.node_cache, node))
+ return;
break;
default:
WARN_ON_ONCE(1);
@@ -534,7 +575,7 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
goto fail;
}
ret = -ENOMEM;
- node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+ node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
if (!node) {
fput(file);
goto fail;
@@ -554,11 +595,19 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
return ret;
}
+static void io_rsrc_buffer_free(struct io_ring_ctx *ctx,
+ struct io_buf_table *table)
+{
+ io_rsrc_data_free(ctx, &table->data);
+ io_alloc_cache_free(&table->node_cache, kfree);
+ io_alloc_cache_free(&table->imu_cache, kfree);
+}
+
int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
{
if (!ctx->buf_table.data.nr)
return -ENXIO;
- io_rsrc_data_free(ctx, &ctx->buf_table.data);
+ io_rsrc_buffer_free(ctx, &ctx->buf_table);
return 0;
}
@@ -739,7 +788,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
if (!iov->iov_base)
return NULL;
- node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+ node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
if (!node)
return ERR_PTR(-ENOMEM);
node->buf = NULL;
@@ -759,7 +808,10 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
coalesced = io_coalesce_buffer(&pages, &nr_pages, &data);
}
- imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
+ if (struct_size(imu, bvec, nr_pages) > ctx->buf_table.imu_cache.elem_size)
+ imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
+ else
+ imu = io_cache_alloc(&ctx->buf_table.imu_cache, GFP_KERNEL, NULL);
if (!imu)
goto done;
@@ -805,9 +857,9 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
unsigned int nr_args, u64 __user *tags)
{
struct page *last_hpage = NULL;
- struct io_rsrc_data data;
struct iovec fast_iov, *iov = &fast_iov;
const struct iovec __user *uvec;
+ struct io_buf_table table;
int i, ret;
BUILD_BUG_ON(IORING_MAX_REG_BUFFERS >= (1u << 16));
@@ -816,13 +868,14 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
return -EBUSY;
if (!nr_args || nr_args > IORING_MAX_REG_BUFFERS)
return -EINVAL;
- ret = io_rsrc_data_alloc(&data, nr_args);
+ ret = io_rsrc_buffer_alloc(&table, nr_args);
if (ret)
return ret;
if (!arg)
memset(iov, 0, sizeof(*iov));
+ ctx->buf_table = table;
for (i = 0; i < nr_args; i++) {
struct io_rsrc_node *node;
u64 tag = 0;
@@ -862,10 +915,8 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
}
node->tag = tag;
}
- data.nodes[i] = node;
+ table.data.nodes[i] = node;
}
-
- ctx->buf_table.data = data;
if (ret)
io_sqe_buffers_unregister(ctx);
return ret;
@@ -878,11 +929,14 @@ static struct io_rsrc_node *io_buffer_alloc_node(struct io_ring_ctx *ctx,
struct io_mapped_ubuf *imu;
struct io_rsrc_node *node;
- node = io_rsrc_node_alloc(IORING_RSRC_KBUF);
+ node = io_rsrc_node_alloc(ctx, IORING_RSRC_KBUF);
if (!node)
return NULL;
- imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
+ if (struct_size(imu, bvec, nr_bvecs) > ctx->buf_table.imu_cache.elem_size)
+ imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
+ else
+ imu = io_cache_alloc(&ctx->buf_table.imu_cache, GFP_KERNEL, NULL);
if (!imu) {
io_put_rsrc_node(ctx, node);
return NULL;
@@ -1036,7 +1090,7 @@ static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *ctx2)
static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx,
struct io_uring_clone_buffers *arg)
{
- struct io_rsrc_data data;
+ struct io_buf_table table;
int i, ret, off, nr;
unsigned int nbufs;
@@ -1067,7 +1121,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
if (check_add_overflow(arg->nr, arg->dst_off, &nbufs))
return -EOVERFLOW;
- ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.data.nr));
+ ret = io_rsrc_buffer_alloc(&table, max(nbufs, ctx->buf_table.data.nr));
if (ret)
return ret;
@@ -1076,7 +1130,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
struct io_rsrc_node *src_node = ctx->buf_table.data.nodes[i];
if (src_node) {
- data.nodes[i] = src_node;
+ table.data.nodes[i] = src_node;
src_node->refs++;
}
}
@@ -1106,7 +1160,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
if (!src_node) {
dst_node = NULL;
} else {
- dst_node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+ dst_node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
if (!dst_node) {
ret = -ENOMEM;
goto out_free;
@@ -1115,12 +1169,12 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
refcount_inc(&src_node->buf->refs);
dst_node->buf = src_node->buf;
}
- data.nodes[off++] = dst_node;
+ table.data.nodes[off++] = dst_node;
i++;
}
/*
- * If asked for replace, put the old table. data->nodes[] holds both
+ * If asked for replace, put the old table. table.data->nodes[] holds both
* old and new nodes at this point.
*/
if (arg->flags & IORING_REGISTER_DST_REPLACE)
@@ -1133,10 +1187,10 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
* entry).
*/
WARN_ON_ONCE(ctx->buf_table.data.nr);
- ctx->buf_table.data = data;
+ ctx->buf_table = table;
return 0;
out_free:
- io_rsrc_data_free(ctx, &data);
+ io_rsrc_buffer_free(ctx, &table);
return ret;
}
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index d1d90d9cd2b43..759ac373b0dc6 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -46,7 +46,7 @@ struct io_imu_folio_data {
unsigned int nr_folios;
};
-struct io_rsrc_node *io_rsrc_node_alloc(int type);
+struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type);
void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node);
void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data);
int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr);
--
2.43.5
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH 6/6] io_uring: cache nodes and mapped buffers
2025-02-03 15:45 ` [PATCH 6/6] io_uring: cache nodes and mapped buffers Keith Busch
@ 2025-02-07 12:41 ` Pavel Begunkov
2025-02-07 15:33 ` Keith Busch
2025-02-07 15:59 ` Keith Busch
0 siblings, 2 replies; 27+ messages in thread
From: Pavel Begunkov @ 2025-02-07 12:41 UTC (permalink / raw)
To: Keith Busch, io-uring, linux-block, ming.lei, axboe; +Cc: Keith Busch
On 2/3/25 15:45, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> Frequent alloc/free cycles on these is pretty costly. Use an io cache to
> more efficiently reuse these buffers.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> include/linux/io_uring_types.h | 16 ++---
> io_uring/filetable.c | 2 +-
> io_uring/rsrc.c | 108 ++++++++++++++++++++++++---------
> io_uring/rsrc.h | 2 +-
> 4 files changed, 92 insertions(+), 36 deletions(-)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index aa661ebfd6568..c0e0c1f92e5b1 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -67,8 +67,17 @@ struct io_file_table {
> unsigned int alloc_hint;
> };
>
> +struct io_alloc_cache {
> + void **entries;
> + unsigned int nr_cached;
> + unsigned int max_cached;
> + size_t elem_size;
> +};
> +
> struct io_buf_table {
> struct io_rsrc_data data;
> + struct io_alloc_cache node_cache;
> + struct io_alloc_cache imu_cache;
We can avoid all churn if you kill patch 5/6 and put the caches
directly into struct io_ring_ctx. It's a bit better for future cache
improvements, and we can even reuse the node cache for files.
...
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 864c2eabf8efd..5434b0d992d62 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -117,23 +117,39 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
> unpin_user_page(imu->bvec[i].bv_page);
> if (imu->acct_pages)
> io_unaccount_mem(ctx, imu->acct_pages);
> - kvfree(imu);
> + if (struct_size(imu, bvec, imu->nr_bvecs) >
> + ctx->buf_table.imu_cache.elem_size ||
It could be quite a large allocation; for now, let's not cache it if
it didn't come from the cache. We can always improve on top.
And can we invert how it's calculated? See below. You'll have
fewer calculations in the fast path, and I don't really like
users looking at ->elem_size when it's not necessary.
#define IO_CACHED_BVEC_SEGS N
io_alloc_cache_init(&table->imu_cache, ...,
/* elem_size */ struct_size(imu, ..., IO_CACHED_BVEC_SEGS));
alloc(bvec_segs) {
if (bvec_segs <= IO_CACHED_BVEC_SEGS)
/* use the cache */
...
}
free() {
if (imu->nr_segs == IO_CACHED_BVEC_SEGS)
/* return to cache */
else {
WARN_ON_ONCE(imu->nr_segs < IO_CACHED_BVEC_SEGS);
...
}
}
> + !io_alloc_cache_put(&ctx->buf_table.imu_cache, imu))
> + kvfree(imu);
> }
> }
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 6/6] io_uring: cache nodes and mapped buffers
2025-02-07 12:41 ` Pavel Begunkov
@ 2025-02-07 15:33 ` Keith Busch
2025-02-08 14:00 ` Pavel Begunkov
2025-02-07 15:59 ` Keith Busch
1 sibling, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-02-07 15:33 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Keith Busch, io-uring, linux-block, ming.lei, axboe
On Fri, Feb 07, 2025 at 12:41:17PM +0000, Pavel Begunkov wrote:
> On 2/3/25 15:45, Keith Busch wrote:
> > From: Keith Busch <[email protected]>
> >
> > Frequent alloc/free cycles on these is pretty costly. Use an io cache to
> > more efficiently reuse these buffers.
> >
> > Signed-off-by: Keith Busch <[email protected]>
> > ---
> > include/linux/io_uring_types.h | 16 ++---
> > io_uring/filetable.c | 2 +-
> > io_uring/rsrc.c | 108 ++++++++++++++++++++++++---------
> > io_uring/rsrc.h | 2 +-
> > 4 files changed, 92 insertions(+), 36 deletions(-)
> >
> > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > index aa661ebfd6568..c0e0c1f92e5b1 100644
> > --- a/include/linux/io_uring_types.h
> > +++ b/include/linux/io_uring_types.h
> > @@ -67,8 +67,17 @@ struct io_file_table {
> > unsigned int alloc_hint;
> > };
> > +struct io_alloc_cache {
> > + void **entries;
> > + unsigned int nr_cached;
> > + unsigned int max_cached;
> > + size_t elem_size;
> > +};
> > +
> > struct io_buf_table {
> > struct io_rsrc_data data;
> > + struct io_alloc_cache node_cache;
> > + struct io_alloc_cache imu_cache;
>
> We can avoid all churn if you kill patch 5/6 and place put the
> caches directly into struct io_ring_ctx. It's a bit better for
> future cache improvements and we can even reuse the node cache
> for files.
>
> ...
> > diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> > index 864c2eabf8efd..5434b0d992d62 100644
> > --- a/io_uring/rsrc.c
> > +++ b/io_uring/rsrc.c
> > @@ -117,23 +117,39 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
> > unpin_user_page(imu->bvec[i].bv_page);
> > if (imu->acct_pages)
> > io_unaccount_mem(ctx, imu->acct_pages);
> > - kvfree(imu);
> > + if (struct_size(imu, bvec, imu->nr_bvecs) >
> > + ctx->buf_table.imu_cache.elem_size ||
>
> It could be quite a large allocation, let's not cache it if
> it hasn't came from the cache for now. We can always improve
> on top.
Eh? This already skips inserting into the cache if it wasn't allocated
out of the cache.
I picked an arbitrary size, 512 bytes, as the threshold for caching. If you
need more bvecs than fit in that, it falls back to kvmalloc/kvfree. The
allocation overhead is pretty insignificant when you're transferring
large payloads like that, and 14 vectors was chosen as the tipping point
because that's what fits in a nice round number.
> And can we invert how it's calculated? See below. You'll have
> fewer calculations in the fast path, and I don't really like
> users looking at ->elem_size when it's not necessary.
>
>
> #define IO_CACHED_BVEC_SEGS N
Yah, that's fine.
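For reference, a minimal sketch of how the inverted check could look once the
threshold is expressed in segments rather than bytes. The constant names, the
512-byte element size and the helper are illustrative assumptions, not code
from the posted series; the free path mirrors the posted io_buffer_unmap()
logic:

#define IO_CACHED_IMU_SIZE	512
/* largest bvec count whose io_mapped_ubuf still fits in one cached element */
#define IO_CACHED_BVEC_SEGS	\
	((IO_CACHED_IMU_SIZE - sizeof(struct io_mapped_ubuf)) / \
	 sizeof(struct bio_vec))

static void io_free_imu(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
{
	/* only small mappings came from (and may return to) the imu cache */
	if (imu->nr_bvecs <= IO_CACHED_BVEC_SEGS &&
	    io_alloc_cache_put(&ctx->buf_table.imu_cache, imu))
		return;
	kvfree(imu);
}

The allocation side would then compare nr_bvecs against IO_CACHED_BVEC_SEGS
instead of recomputing struct_size() against ->elem_size in the fast path.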
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 6/6] io_uring: cache nodes and mapped buffers
2025-02-07 15:33 ` Keith Busch
@ 2025-02-08 14:00 ` Pavel Begunkov
0 siblings, 0 replies; 27+ messages in thread
From: Pavel Begunkov @ 2025-02-08 14:00 UTC (permalink / raw)
To: Keith Busch; +Cc: Keith Busch, io-uring, linux-block, ming.lei, axboe
On 2/7/25 15:33, Keith Busch wrote:
...
>>> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
>>> index 864c2eabf8efd..5434b0d992d62 100644
>>> --- a/io_uring/rsrc.c
>>> +++ b/io_uring/rsrc.c
>>> @@ -117,23 +117,39 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
>>> unpin_user_page(imu->bvec[i].bv_page);
>>> if (imu->acct_pages)
>>> io_unaccount_mem(ctx, imu->acct_pages);
>>> - kvfree(imu);
>>> + if (struct_size(imu, bvec, imu->nr_bvecs) >
>>> + ctx->buf_table.imu_cache.elem_size ||
>>
>> It could be quite a large allocation, let's not cache it if
>> it hasn't came from the cache for now. We can always improve
>> on top.
>
> Eh? This already skips inserting into the cache if it wasn't allocated
> out of the cache.
Ah yes, you're right, I overlooked it.
>
> I picked an arbitrary size, 512b, as the threshold for caching. If you
> need more bvecs than fit in that, it falls back to a kvmalloc/kvfree.
> The allocation overhead is pretty insignificant when you're transferring
> large payloads like that, and 14 vectors was chosen as the tipping point
> because it fits in a nice round number.
>
>> And can we invert how it's calculated? See below. You'll have
>> fewer calculations in the fast path, and I don't really like
>> users looking at ->elem_size when it's not necessary.
>>
>>
>> #define IO_CACHED_BVEC_SEGS N
>
> Yah, that's fine.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 6/6] io_uring: cache nodes and mapped buffers
2025-02-07 12:41 ` Pavel Begunkov
2025-02-07 15:33 ` Keith Busch
@ 2025-02-07 15:59 ` Keith Busch
2025-02-08 14:24 ` Pavel Begunkov
1 sibling, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-02-07 15:59 UTC (permalink / raw)
To: Pavel Begunkov; +Cc: Keith Busch, io-uring, linux-block, ming.lei, axboe
On Fri, Feb 07, 2025 at 12:41:17PM +0000, Pavel Begunkov wrote:
> On 2/3/25 15:45, Keith Busch wrote:
> > +struct io_alloc_cache {
> > + void **entries;
> > + unsigned int nr_cached;
> > + unsigned int max_cached;
> > + size_t elem_size;
> > +};
> > +
> > struct io_buf_table {
> > struct io_rsrc_data data;
> > + struct io_alloc_cache node_cache;
> > + struct io_alloc_cache imu_cache;
>
> We can avoid all churn if you kill patch 5/6 and place put the
> caches directly into struct io_ring_ctx. It's a bit better for
> future cache improvements and we can even reuse the node cache
> for files.
I had it that way in an earlier version. The cache is tightly
connected to the buf table, though, so splitting them up makes for some
awkward cleanup. Grouping them together makes it clear that their lifetimes
are managed as a single unit.
The filetable could have moved its bitmap into io_ring_ctx too, but it's
in its own structure like this, and that conceptually makes sense. This is
following the same pattern.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 6/6] io_uring: cache nodes and mapped buffers
2025-02-07 15:59 ` Keith Busch
@ 2025-02-08 14:24 ` Pavel Begunkov
0 siblings, 0 replies; 27+ messages in thread
From: Pavel Begunkov @ 2025-02-08 14:24 UTC (permalink / raw)
To: Keith Busch; +Cc: Keith Busch, io-uring, linux-block, ming.lei, axboe
On 2/7/25 15:59, Keith Busch wrote:
> On Fri, Feb 07, 2025 at 12:41:17PM +0000, Pavel Begunkov wrote:
>> On 2/3/25 15:45, Keith Busch wrote:
>>> +struct io_alloc_cache {
>>> + void **entries;
>>> + unsigned int nr_cached;
>>> + unsigned int max_cached;
>>> + size_t elem_size;
>>> +};
>>> +
>>> struct io_buf_table {
>>> struct io_rsrc_data data;
>>> + struct io_alloc_cache node_cache;
>>> + struct io_alloc_cache imu_cache;
>>
>> We can avoid all churn if you kill patch 5/6 and place put the
>> caches directly into struct io_ring_ctx. It's a bit better for
>> future cache improvements and we can even reuse the node cache
>> for files.
>
> I had this that way in an earlier version. The cache is tightly
> connected to the buf table, though, so splitting them up makes for some
> awkward cleanup. Grouping them together makes it clear their lifetimes
> are as a single unit.
I'd say it's tightly coupled with the context, as it depends on the
ctx synchronisation, and the table doesn't really care where the memory
comes from.
Anyway, there are basically two reasons. If we're adding a node cache,
we need to reuse it for files as well. And I'll likely need to pull
it all back into the ctx for cleaning up caches in general, and I'd rather
avoid such back and forth as it's not so great for backports and
things like that.
Is there a good reason why it needs to be in there?
> The filetable could have moved its bitmap into io_ring_ctx too, but it's
> in its own structure like this, and it conceptually makes sense. This is
> following in that same pattern.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] ublk zero-copy support
2025-02-03 15:45 [PATCH 0/6] ublk zero-copy support Keith Busch
` (5 preceding siblings ...)
2025-02-03 15:45 ` [PATCH 6/6] io_uring: cache nodes and mapped buffers Keith Busch
@ 2025-02-06 15:28 ` Keith Busch
2025-02-07 3:51 ` Ming Lei
2025-02-08 0:51 ` Bernd Schubert
8 siblings, 0 replies; 27+ messages in thread
From: Keith Busch @ 2025-02-06 15:28 UTC (permalink / raw)
To: Keith Busch; +Cc: io-uring, linux-block, ming.lei, axboe, asml.silence
On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> This is a new look at supporting zero copy with ublk.
Just to give some numbers behind this, since I didn't include any in the
original cover letter. The numbers are from a 1.6GHz Xeon platform.
Using the ublksrv patch I provided in the cover letter, I created two ublk
block devices backed by null_blk:
ublk add -t loop -f /dev/nullb0
ublk add -t loop -f /dev/nullb1 -z
Using t/io_uring, I compared the ublk device without zero-copy against the
one with zero-copy (-z) enabled:
4k read:
Legacy:
IOPS=387.78K, BW=1514MiB/s, IOS/call=32/32
IOPS=395.14K, BW=1543MiB/s, IOS/call=32/32
IOPS=395.68K, BW=1545MiB/s, IOS/call=32/31
Zero-copy:
IOPS=482.69K, BW=1885MiB/s, IOS/call=32/31
IOPS=481.34K, BW=1880MiB/s, IOS/call=32/32
IOPS=481.66K, BW=1881MiB/s, IOS/call=32/32
64k read:
Legacy:
IOPS=73248, BW=4.58GiB/s, IOS/call=32/32
IOPS=73664, BW=4.60GiB/s, IOS/call=32/32
IOPS=72288, BW=4.52GiB/s, IOS/call=32/32
Zero-copy:
IOPS=381.76K, BW=23.86GiB/s, IOS/call=32/31
IOPS=378.18K, BW=23.64GiB/s, IOS/call=32/32
IOPS=379.52K, BW=23.72GiB/s, IOS/call=32/32
The register/unregister overhead is low enough to show a decent
improvement even at 4k IO. And it's using half the memory with lower CPU
utilization per IO, so all good wins.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] ublk zero-copy support
2025-02-03 15:45 [PATCH 0/6] ublk zero-copy support Keith Busch
` (6 preceding siblings ...)
2025-02-06 15:28 ` [PATCH 0/6] ublk zero-copy support Keith Busch
@ 2025-02-07 3:51 ` Ming Lei
2025-02-07 14:06 ` Keith Busch
2025-02-08 0:51 ` Bernd Schubert
8 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2025-02-07 3:51 UTC (permalink / raw)
To: Keith Busch; +Cc: io-uring, linux-block, axboe, asml.silence, Keith Busch
On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> This is a new look at supporting zero copy with ublk.
>
> The previous version from Ming can be viewed here:
>
> https://lore.kernel.org/linux-block/[email protected]/
>
> Based on the feedback from that thread, the desired io_uring interfaces
> needed to be simpler, and the kernel registered resources need to behave
> more similiar to user registered buffers.
>
> This series introduces a new resource node type, KBUF, which, like the
> BUFFER resource, needs to be installed into an io_uring buf_node table
> in order for the user to access it in a fixed buffer command. The
> new io_uring kernel API provides a way for a user to register a struct
> request's bvec to a specific index, and a way to unregister it.
>
> When the ublk server receives notification of a new command, it must
> first select an index and register the zero copy buffer. It may use that
> index for any number of fixed buffer commands, then it must unregister
> the index when it's done. This can all be done in a single io_uring_enter
> if desired, or it can be split into multiple enters if needed.
I suspect it may not be done in a single io_uring_enter() because there
is a strict dependency among the three OPs (register buffer, read/write,
unregister buffer).
>
> The io_uring instance that gets the zero copy registration doesn't
> necessarily need to be the same ring that is receiving notifcations from
> the ublk_drv module. This allows you to split frontend and backend rings
> if desired.
>
> At the end of this cover letter, I've provided a patch to the ublksrv to
> demonstrate how to use this.
>
> Jens Axboe (1):
> io_uring: use node for import
>
> Keith Busch (5):
> block: const blk_rq_nr_phys_segments request
> io_uring: add support for kernel registered bvecs
> ublk: zc register/unregister bvec
> io_uring: add abstraction for buf_table rsrc data
> io_uring: cache nodes and mapped buffers
>
> drivers/block/ublk_drv.c | 139 +++++++++++++-----
> include/linux/blk-mq.h | 2 +-
> include/linux/io_uring.h | 1 +
> include/linux/io_uring_types.h | 25 +++-
> include/uapi/linux/ublk_cmd.h | 4 +
> io_uring/fdinfo.c | 8 +-
> io_uring/filetable.c | 2 +-
> io_uring/net.c | 5 +-
> io_uring/nop.c | 2 +-
> io_uring/register.c | 2 +-
> io_uring/rsrc.c | 259 ++++++++++++++++++++++++++-------
> io_uring/rsrc.h | 8 +-
> io_uring/rw.c | 4 +-
> io_uring/uring_cmd.c | 4 +-
> 14 files changed, 351 insertions(+), 114 deletions(-)
>
> --
> 2.43.5
>
> ublksrv:
>
> https://github.com/ublk-org/ublksrv
>
> ---
> include/ublk_cmd.h | 4 +++
> include/ublksrv_tgt.h | 13 ++++++++
> lib/ublksrv.c | 9 ++++++
> tgt_loop.cpp | 74 +++++++++++++++++++++++++++++++++++++++++--
> ublksrv_tgt.cpp | 2 +-
> 5 files changed, 99 insertions(+), 3 deletions(-)
>
> diff --git a/include/ublk_cmd.h b/include/ublk_cmd.h
> index 0150003..07439be 100644
> --- a/include/ublk_cmd.h
> +++ b/include/ublk_cmd.h
> @@ -94,6 +94,10 @@
> _IOWR('u', UBLK_IO_COMMIT_AND_FETCH_REQ, struct ublksrv_io_cmd)
> #define UBLK_U_IO_NEED_GET_DATA \
> _IOWR('u', UBLK_IO_NEED_GET_DATA, struct ublksrv_io_cmd)
> +#define UBLK_U_IO_REGISTER_IO_BUF \
> + _IOWR('u', 0x23, struct ublksrv_io_cmd)
> +#define UBLK_U_IO_UNREGISTER_IO_BUF \
> + _IOWR('u', 0x24, struct ublksrv_io_cmd)
>
> /* only ABORT means that no re-fetch */
> #define UBLK_IO_RES_OK 0
> diff --git a/include/ublksrv_tgt.h b/include/ublksrv_tgt.h
> index 1deee2b..6291531 100644
> --- a/include/ublksrv_tgt.h
> +++ b/include/ublksrv_tgt.h
> @@ -189,4 +189,17 @@ static inline void ublk_get_sqe_pair(struct io_uring *r,
> *sqe2 = io_uring_get_sqe(r);
> }
>
> +static inline void ublk_get_sqe_three(struct io_uring *r,
> + struct io_uring_sqe **sqe1, struct io_uring_sqe **sqe2,
> + struct io_uring_sqe **sqe3)
> +{
> + unsigned left = io_uring_sq_space_left(r);
> +
> + if (left < 3)
> + io_uring_submit(r);
> +
> + *sqe1 = io_uring_get_sqe(r);
> + *sqe2 = io_uring_get_sqe(r);
> + *sqe3 = io_uring_get_sqe(r);
> +}
> #endif
> diff --git a/lib/ublksrv.c b/lib/ublksrv.c
> index 16a9e13..7205247 100644
> --- a/lib/ublksrv.c
> +++ b/lib/ublksrv.c
> @@ -619,6 +619,15 @@ skip_alloc_buf:
> goto fail;
> }
>
> + if (ctrl_dev->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY) {
> + ret = io_uring_register_buffers_sparse(&q->ring, q->q_depth);
> + if (ret) {
> + ublk_err("ublk dev %d queue %d register spare buffers failed %d",
> + q->dev->ctrl_dev->dev_info.dev_id, q->q_id, ret);
> + goto fail;
> + }
> + }
> +
> io_uring_register_ring_fd(&q->ring);
>
> /*
> diff --git a/tgt_loop.cpp b/tgt_loop.cpp
> index 0f16676..ce44c7d 100644
> --- a/tgt_loop.cpp
> +++ b/tgt_loop.cpp
> @@ -246,12 +246,62 @@ static inline int loop_fallocate_mode(const struct ublksrv_io_desc *iod)
> return mode;
> }
>
> +static inline void io_uring_prep_buf_register(struct io_uring_sqe *sqe,
> + int dev_fd, int tag, int q_id, __u64 index)
> +{
> + struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
> +
> + io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
> + sqe->opcode = IORING_OP_URING_CMD;
> + sqe->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_FIXED_FILE;
IOSQE_IO_LINK is missing, because the following buffer consumer OP
has to be issued after this buffer register OP completes.
> + sqe->cmd_op = UBLK_U_IO_REGISTER_IO_BUF;
> +
> + cmd->tag = tag;
> + cmd->addr = index;
> + cmd->q_id = q_id;
> +}
> +
> +static inline void io_uring_prep_buf_unregister(struct io_uring_sqe *sqe,
> + int dev_fd, int tag, int q_id, __u64 index)
> +{
> + struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
> +
> + io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
> + sqe->opcode = IORING_OP_URING_CMD;
> + sqe->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_FIXED_FILE;
> + sqe->cmd_op = UBLK_U_IO_UNREGISTER_IO_BUF;
IOSQE_IO_LINK is missing, because the buffer unregister OP has to be issued
after the previous buffer consumer OP completes.
> +
> + cmd->tag = tag;
> + cmd->addr = index;
> + cmd->q_id = q_id;
> +}
> +
> static void loop_queue_tgt_read(const struct ublksrv_queue *q,
> const struct ublksrv_io_desc *iod, int tag)
> {
> + const struct ublksrv_ctrl_dev_info *info =
> + ublksrv_ctrl_get_dev_info(ublksrv_get_ctrl_dev(q->dev));
> unsigned ublk_op = ublksrv_get_op(iod);
>
> - if (user_copy) {
> + if (info->flags & UBLK_F_SUPPORT_ZERO_COPY) {
> + struct io_uring_sqe *reg;
> + struct io_uring_sqe *read;
> + struct io_uring_sqe *ureg;
> +
> + ublk_get_sqe_three(q->ring_ptr, ®, &read, &ureg);
> +
> + io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);
> +
> + io_uring_prep_read_fixed(read, 1 /*fds[1]*/,
> + 0,
> + iod->nr_sectors << 9,
> + iod->start_sector << 9,
> + tag);
> + io_uring_sqe_set_flags(read, IOSQE_FIXED_FILE);
> + read->user_data = build_user_data(tag, ublk_op, 0, 1);
Does this interface support reading into part of the buffer? That is useful
for stacking device cases.
Also, does this interface support consuming the buffer from multiple
OPs concurrently?
Thanks,
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] ublk zero-copy support
2025-02-07 3:51 ` Ming Lei
@ 2025-02-07 14:06 ` Keith Busch
2025-02-08 5:44 ` Ming Lei
2025-02-08 7:52 ` Ming Lei
0 siblings, 2 replies; 27+ messages in thread
From: Keith Busch @ 2025-02-07 14:06 UTC (permalink / raw)
To: Ming Lei; +Cc: Keith Busch, io-uring, linux-block, axboe, asml.silence
On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
> On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
> >
> > The previous version from Ming can be viewed here:
> >
> > https://lore.kernel.org/linux-block/[email protected]/
> >
> > Based on the feedback from that thread, the desired io_uring interfaces
> > needed to be simpler, and the kernel registered resources need to behave
> > more similiar to user registered buffers.
> >
> > This series introduces a new resource node type, KBUF, which, like the
> > BUFFER resource, needs to be installed into an io_uring buf_node table
> > in order for the user to access it in a fixed buffer command. The
> > new io_uring kernel API provides a way for a user to register a struct
> > request's bvec to a specific index, and a way to unregister it.
> >
> > When the ublk server receives notification of a new command, it must
> > first select an index and register the zero copy buffer. It may use that
> > index for any number of fixed buffer commands, then it must unregister
> > the index when it's done. This can all be done in a single io_uring_enter
> > if desired, or it can be split into multiple enters if needed.
>
> I suspect it may not be done in single io_uring_enter() because there
> is strict dependency among the three OPs(register buffer, read/write,
> unregister buffer).
The registration is synchronous. io_uring completes the SQE entirely
before it even looks at the read command in the next SQE.
The read or write is asynchronous, but its prep takes a reference on
the node before moving on to the next SQE.
The unregister is synchronous, and clears the index node, but the
possibly inflight read or write holds a reference on that node, so all
good.
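To make that ordering concrete, here is a minimal sketch of the single-enter
sequence, reusing the helpers from the ublksrv patch in the cover letter
(ublk_get_sqe_three, io_uring_prep_buf_register, io_uring_prep_buf_unregister);
q, tag and iod are the usual per-IO context, and fixed file 0 is the ublk char
device while fixed file 1 is the backing file, as in that patch. As discussed
later in this thread, IOSQE_IO_LINK would be needed if submission-order
execution cannot be assumed:

struct io_uring_sqe *reg, *read, *ureg;

ublk_get_sqe_three(q->ring_ptr, &reg, &read, &ureg);

/* 1) install the request's bvecs at buffer index 'tag' */
io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);

/* 2) read from the backing file into the just-registered buffer */
io_uring_prep_read_fixed(read, 1, 0, iod->nr_sectors << 9,
			 iod->start_sector << 9, tag);
io_uring_sqe_set_flags(read, IOSQE_FIXED_FILE);
read->user_data = build_user_data(tag, ublksrv_get_op(iod), 0, 1);

/* 3) drop the table's reference; the inflight read keeps its own */
io_uring_prep_buf_unregister(ureg, 0, tag, q->q_id, tag);

io_uring_submit(q->ring_ptr);	/* one io_uring_enter() for all three */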
> > + ublk_get_sqe_three(q->ring_ptr, ®, &read, &ureg);
> > +
> > + io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);
> > +
> > + io_uring_prep_read_fixed(read, 1 /*fds[1]*/,
> > + 0,
> > + iod->nr_sectors << 9,
> > + iod->start_sector << 9,
> > + tag);
> > + io_uring_sqe_set_flags(read, IOSQE_FIXED_FILE);
> > + read->user_data = build_user_data(tag, ublk_op, 0, 1);
>
> Does this interface support to read to partial buffer? Which is useful
> for stacking device cases.
Are you wanting to read into this buffer in parts, without copying? As in,
provide an offset and/or a smaller length across multiple commands? If
that's what you mean, then yes, you can do that here.
> Also does this interface support to consume the buffer from multiple
> OPs concurrently?
You can register as many kernel buffers from as many OPs as you have
space for in your table, and you can use them all concurrently. Pretty
much the same as user registered fixed buffers. The main difference from
user buffers is how you register them.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] ublk zero-copy support
2025-02-07 14:06 ` Keith Busch
@ 2025-02-08 5:44 ` Ming Lei
2025-02-08 14:16 ` Pavel Begunkov
2025-02-08 7:52 ` Ming Lei
1 sibling, 1 reply; 27+ messages in thread
From: Ming Lei @ 2025-02-08 5:44 UTC (permalink / raw)
To: Keith Busch
Cc: Keith Busch, io-uring, linux-block, axboe, asml.silence,
Bernd Schubert
On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
> On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
> > On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
> > >
> > > The previous version from Ming can be viewed here:
> > >
> > > https://lore.kernel.org/linux-block/[email protected]/
> > >
> > > Based on the feedback from that thread, the desired io_uring interfaces
> > > needed to be simpler, and the kernel registered resources need to behave
> > > more similiar to user registered buffers.
> > >
> > > This series introduces a new resource node type, KBUF, which, like the
> > > BUFFER resource, needs to be installed into an io_uring buf_node table
> > > in order for the user to access it in a fixed buffer command. The
> > > new io_uring kernel API provides a way for a user to register a struct
> > > request's bvec to a specific index, and a way to unregister it.
> > >
> > > When the ublk server receives notification of a new command, it must
> > > first select an index and register the zero copy buffer. It may use that
> > > index for any number of fixed buffer commands, then it must unregister
> > > the index when it's done. This can all be done in a single io_uring_enter
> > > if desired, or it can be split into multiple enters if needed.
> >
> > I suspect it may not be done in single io_uring_enter() because there
> > is strict dependency among the three OPs(register buffer, read/write,
> > unregister buffer).
>
> The registration is synchronous. io_uring completes the SQE entirely
> before it even looks at the read command in the next SQE.
Can you explain a bit what "synchronous" means here?
In patch 4, two ublk uring_cmds (UBLK_U_IO_REGISTER_IO_BUF/UBLK_U_IO_UNREGISTER_IO_BUF)
are added, and their handlers are called from uring_cmd's ->issue().
>
> The read or write is asynchronous, but it's prep takes a reference on
> the node before moving on to the next SQE..
The buffer is registered in ->issue() of UBLK_U_IO_REGISTER_IO_BUF,
and that hasn't happened yet when ->prep() of read_fixed/write_fixed runs,
which is where the buffer is looked up.
>
> The unregister is synchronous, and clears the index node, but the
> possibly inflight read or write has a reference on that node, so all
> good.
UBLK_U_IO_UNREGISTER_IO_BUF tells ublk that the buffer isn't used any
more, but it may still be in use by the async read/write.
It might work, but it looks a bit fragile. For example, one buggy
application may panic the kernel if the IO command is completed before the
read/write is done.
>
> > > + ublk_get_sqe_three(q->ring_ptr, ®, &read, &ureg);
> > > +
> > > + io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);
> > > +
> > > + io_uring_prep_read_fixed(read, 1 /*fds[1]*/,
> > > + 0,
> > > + iod->nr_sectors << 9,
> > > + iod->start_sector << 9,
> > > + tag);
> > > + io_uring_sqe_set_flags(read, IOSQE_FIXED_FILE);
> > > + read->user_data = build_user_data(tag, ublk_op, 0, 1);
> >
> > Does this interface support to read to partial buffer? Which is useful
> > for stacking device cases.
>
> Are you wanting to read into this buffer without copying in parts? As in
> provide an offset and/or smaller length across multiple commands? If
> that's what you mean, then yes, you can do that here.
OK.
>
> > Also does this interface support to consume the buffer from multiple
> > OPs concurrently?
>
> You can register as many kernel buffers from as many OPs as you have
> space for in your table, and you can use them all concurrently. Pretty
> much the same as user registered fixed buffers. The main difference from
> user buffers is how you register them.
Here it depends on whether a LINK between buffer register and read/write is
required. If it is, multiple OPs consuming the buffer have to
be linked one by one, and then they can't be issued concurrently.
Thanks,
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] ublk zero-copy support
2025-02-08 5:44 ` Ming Lei
@ 2025-02-08 14:16 ` Pavel Begunkov
2025-02-08 20:13 ` Keith Busch
0 siblings, 1 reply; 27+ messages in thread
From: Pavel Begunkov @ 2025-02-08 14:16 UTC (permalink / raw)
To: Ming Lei, Keith Busch
Cc: Keith Busch, io-uring, linux-block, axboe, Bernd Schubert
On 2/8/25 05:44, Ming Lei wrote:
> On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
>> On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
>>> On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
>>>>
>>>> The previous version from Ming can be viewed here:
>>>>
>>>> https://lore.kernel.org/linux-block/[email protected]/
>>>>
>>>> Based on the feedback from that thread, the desired io_uring interfaces
>>>> needed to be simpler, and the kernel registered resources need to behave
>>>> more similiar to user registered buffers.
>>>>
>>>> This series introduces a new resource node type, KBUF, which, like the
>>>> BUFFER resource, needs to be installed into an io_uring buf_node table
>>>> in order for the user to access it in a fixed buffer command. The
>>>> new io_uring kernel API provides a way for a user to register a struct
>>>> request's bvec to a specific index, and a way to unregister it.
>>>>
>>>> When the ublk server receives notification of a new command, it must
>>>> first select an index and register the zero copy buffer. It may use that
>>>> index for any number of fixed buffer commands, then it must unregister
>>>> the index when it's done. This can all be done in a single io_uring_enter
>>>> if desired, or it can be split into multiple enters if needed.
>>>
>>> I suspect it may not be done in single io_uring_enter() because there
>>> is strict dependency among the three OPs(register buffer, read/write,
>>> unregister buffer).
>>
>> The registration is synchronous. io_uring completes the SQE entirely
>> before it even looks at the read command in the next SQE.
>
> Can you explain a bit "synchronous" here?
I'd believe synchronous here means "executed during submission from
the submit syscall path". And I agree that we can't rely on that.
That's an implementation detail and io_uring doesn't promise that,
but even now it relies on not using certain features like drain and
the async flag.
> In patch 4, two ublk uring_cmd(UBLK_U_IO_REGISTER_IO_BUF/UBLK_U_IO_UNREGISTER_IO_BUF)
> are added, and their handlers are called from uring_cmd's ->issue().
>
>>
>> The read or write is asynchronous, but it's prep takes a reference on
>> the node before moving on to the next SQE..
>
> The buffer is registered in ->issue() of UBLK_U_IO_REGISTER_IO_BUF,
> and it isn't done yet when calling ->prep() of read_fixed/write_fixed,
> in which buffer is looked up in ->prep().
I believe we should eventually move all such binding to ->issue
to be consistent with file handling. Not super happy about either
of those, but those are the kinds of problems that come from supporting
links.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] ublk zero-copy support
2025-02-08 14:16 ` Pavel Begunkov
@ 2025-02-08 20:13 ` Keith Busch
2025-02-08 21:40 ` Pavel Begunkov
0 siblings, 1 reply; 27+ messages in thread
From: Keith Busch @ 2025-02-08 20:13 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Ming Lei, Keith Busch, io-uring, linux-block, axboe,
Bernd Schubert
On Sat, Feb 08, 2025 at 02:16:15PM +0000, Pavel Begunkov wrote:
> On 2/8/25 05:44, Ming Lei wrote:
> > On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
> > > On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
> > > > On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
> > > > >
> > > > > The previous version from Ming can be viewed here:
> > > > >
> > > > > https://lore.kernel.org/linux-block/[email protected]/
> > > > >
> > > > > Based on the feedback from that thread, the desired io_uring interfaces
> > > > > needed to be simpler, and the kernel registered resources need to behave
> > > > > more similiar to user registered buffers.
> > > > >
> > > > > This series introduces a new resource node type, KBUF, which, like the
> > > > > BUFFER resource, needs to be installed into an io_uring buf_node table
> > > > > in order for the user to access it in a fixed buffer command. The
> > > > > new io_uring kernel API provides a way for a user to register a struct
> > > > > request's bvec to a specific index, and a way to unregister it.
> > > > >
> > > > > When the ublk server receives notification of a new command, it must
> > > > > first select an index and register the zero copy buffer. It may use that
> > > > > index for any number of fixed buffer commands, then it must unregister
> > > > > the index when it's done. This can all be done in a single io_uring_enter
> > > > > if desired, or it can be split into multiple enters if needed.
> > > >
> > > > I suspect it may not be done in single io_uring_enter() because there
> > > > is strict dependency among the three OPs(register buffer, read/write,
> > > > unregister buffer).
> > >
> > > The registration is synchronous. io_uring completes the SQE entirely
> > > before it even looks at the read command in the next SQE.
> >
> > Can you explain a bit "synchronous" here?
>
> I'd believe synchronous here means "executed during submission from
> the submit syscall path". And I agree that we can't rely on that.
> That's an implementation detail and io_uring doesn't promise that,
The commands are processed in order under the ctx's uring_lock. What are
you thinking you might do to make this happen in any different order?
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] ublk zero-copy support
2025-02-08 20:13 ` Keith Busch
@ 2025-02-08 21:40 ` Pavel Begunkov
0 siblings, 0 replies; 27+ messages in thread
From: Pavel Begunkov @ 2025-02-08 21:40 UTC (permalink / raw)
To: Keith Busch
Cc: Ming Lei, Keith Busch, io-uring, linux-block, axboe,
Bernd Schubert
On 2/8/25 20:13, Keith Busch wrote:
> On Sat, Feb 08, 2025 at 02:16:15PM +0000, Pavel Begunkov wrote:
>> On 2/8/25 05:44, Ming Lei wrote:
>>> On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
>>>> On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
>>>>> On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
>>>>>>
>>>>>> The previous version from Ming can be viewed here:
>>>>>>
>>>>>> https://lore.kernel.org/linux-block/[email protected]/
>>>>>>
>>>>>> Based on the feedback from that thread, the desired io_uring interfaces
>>>>>> needed to be simpler, and the kernel registered resources need to behave
>>>>>> more similiar to user registered buffers.
>>>>>>
>>>>>> This series introduces a new resource node type, KBUF, which, like the
>>>>>> BUFFER resource, needs to be installed into an io_uring buf_node table
>>>>>> in order for the user to access it in a fixed buffer command. The
>>>>>> new io_uring kernel API provides a way for a user to register a struct
>>>>>> request's bvec to a specific index, and a way to unregister it.
>>>>>>
>>>>>> When the ublk server receives notification of a new command, it must
>>>>>> first select an index and register the zero copy buffer. It may use that
>>>>>> index for any number of fixed buffer commands, then it must unregister
>>>>>> the index when it's done. This can all be done in a single io_uring_enter
>>>>>> if desired, or it can be split into multiple enters if needed.
>>>>>
>>>>> I suspect it may not be done in single io_uring_enter() because there
>>>>> is strict dependency among the three OPs(register buffer, read/write,
>>>>> unregister buffer).
>>>>
>>>> The registration is synchronous. io_uring completes the SQE entirely
>>>> before it even looks at the read command in the next SQE.
>>>
>>> Can you explain a bit "synchronous" here?
>>
>> I'd believe synchronous here means "executed during submission from
>> the submit syscall path". And I agree that we can't rely on that.
>> That's an implementation detail and io_uring doesn't promise that,
>
> The commands are processed in order under the ctx's uring_lock. What are
> you thinking you might do to make this happen in any different order?
Bunch of stuff. IOSQE_ASYNC will reorder them. Drain can push it to
a different path with no guarantees about what happens there, even when you
used drain only for some past requests. Or it can get reordered
by racing with draining. Someone floated a (not merged) idea before of
hybrid task / sqpoll execution; things like that might be needed at
some point, and that will reorder requests. Or you might want to
offload more aggressively, e.g. to already-waiting tasks or the
thread pool.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] ublk zero-copy support
2025-02-07 14:06 ` Keith Busch
2025-02-08 5:44 ` Ming Lei
@ 2025-02-08 7:52 ` Ming Lei
1 sibling, 0 replies; 27+ messages in thread
From: Ming Lei @ 2025-02-08 7:52 UTC (permalink / raw)
To: Keith Busch; +Cc: Keith Busch, io-uring, linux-block, axboe, asml.silence
On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
> On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
> > On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
...
> > Does this interface support to read to partial buffer? Which is useful
> > for stacking device cases.
>
> Are you wanting to read into this buffer without copying in parts? As in
> provide an offset and/or smaller length across multiple commands? If
> that's what you mean, then yes, you can do that here.
Sorry, I forgot to mention another problem here.
For a short read, we need to zero the remaining bytes in each part of the
buffer; how do we deal with that?
Please see io_req_zero_remained() in my patch of "[PATCH V10 10/12] io_uring:
support leased group buffer with REQ_F_GROUP_BUF":
https://lore.kernel.org/linux-block/[email protected]/#r
Thanks,
Ming
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] ublk zero-copy support
2025-02-03 15:45 [PATCH 0/6] ublk zero-copy support Keith Busch
` (7 preceding siblings ...)
2025-02-07 3:51 ` Ming Lei
@ 2025-02-08 0:51 ` Bernd Schubert
8 siblings, 0 replies; 27+ messages in thread
From: Bernd Schubert @ 2025-02-08 0:51 UTC (permalink / raw)
To: Keith Busch, io-uring, linux-block, ming.lei, axboe, asml.silence
Cc: Keith Busch
Hi Keith,
On 2/3/25 16:45, Keith Busch wrote:
> From: Keith Busch <[email protected]>
>
> This is a new look at supporting zero copy with ublk.
will try to look at it over the weekend. Could you please keep me in the
loop for future versions?
Thanks,
Bernd
^ permalink raw reply [flat|nested] 27+ messages in thread