public inbox for [email protected]
From: Pavel Begunkov <[email protected]>
To: Keith Busch <[email protected]>,
	[email protected], [email protected],
	[email protected], [email protected]
Cc: Keith Busch <[email protected]>
Subject: Re: [PATCH 3/6] io_uring: add support for kernel registered bvecs
Date: Fri, 7 Feb 2025 14:08:23 +0000
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>

On 2/3/25 15:45, Keith Busch wrote:
> From: Keith Busch <[email protected]>
> 
> Provide an interface for the kernel to leverage the existing
> pre-registered buffers that io_uring provides. User space can reference
> these later to achieve zero-copy IO.
> 
> User space must register an empty fixed buffer table with io_uring in
> order for the kernel to make use of it.
> 
> Signed-off-by: Keith Busch <[email protected]>
> ---
>   include/linux/io_uring.h       |   1 +
>   include/linux/io_uring_types.h |   3 +
>   io_uring/rsrc.c                | 114 +++++++++++++++++++++++++++++++--
>   io_uring/rsrc.h                |   1 +
>   4 files changed, 114 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> index 85fe4e6b275c7..b5637a2aae340 100644
> --- a/include/linux/io_uring.h
> +++ b/include/linux/io_uring.h
> @@ -5,6 +5,7 @@
>   #include <linux/sched.h>
>   #include <linux/xarray.h>
>   #include <uapi/linux/io_uring.h>
> +#include <linux/blk-mq.h>
>   
>   #if defined(CONFIG_IO_URING)
>   void __io_uring_cancel(bool cancel_all);
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 623d8e798a11a..7e5a5a70c35f2 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -695,4 +695,7 @@ static inline bool io_ctx_cqe32(struct io_ring_ctx *ctx)
>   	return ctx->flags & IORING_SETUP_CQE32;
>   }
>   
> +int io_buffer_register_bvec(struct io_ring_ctx *ctx, const struct request *rq, unsigned int tag);
> +void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int tag);
> +
>   #endif
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 4d0e1c06c8bc6..8c4c374abcc10 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -111,7 +111,10 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
>   		if (!refcount_dec_and_test(&imu->refs))
>   			return;
>   		for (i = 0; i < imu->nr_bvecs; i++)
> -			unpin_user_page(imu->bvec[i].bv_page);
> +			if (node->type == IORING_RSRC_KBUF)
> +				put_page(imu->bvec[i].bv_page);

Just a note: that's fine, but I hope we'll be able to optimise
that later.

> +			else
> +				unpin_user_page(imu->bvec[i].bv_page);
>   		if (imu->acct_pages)
>   			io_unaccount_mem(ctx, imu->acct_pages);
>   		kvfree(imu);
> @@ -240,6 +243,13 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
>   		struct io_rsrc_node *node;
>   		u64 tag = 0;
>   
> +		i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
> +		node = io_rsrc_node_lookup(&ctx->buf_table, i);
> +		if (node && node->type != IORING_RSRC_BUFFER) {

We might need to rethink how it's unregistered. The next patch
does it as a ublk command, but what happens if it gets ejected
by someone else? get_page() might protect against kernel corruption,
and here you try to forbid ejections, but there is io_rsrc_data_free(),
and the io_uring ctx can die as well, at which point it will have to
drop the buffer. And then you don't really have clear ownership rules.
Does ublk release the block request and "return ownership" of the pages
to its user while io_uring is still dying and potentially has some
IO in flight against them?

That's why I preferred the option of allowing buffers to be removed
from the table as per the usual io_uring API / rules instead of a
separate unregister ublk cmd. Internally, when all node refs are
dropped, it'd call back into ublk. This way you have a single mechanism
for how buffers are dropped from io_uring's perspective. Thoughts?
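
To make the ownership rule concrete, here is a minimal userspace
sketch of the idea, not kernel code. The names struct leased_node,
node_alloc(), node_get(), node_put() and owner_release() are all
hypothetical; the real thing would live in io_rsrc_node and use
proper kernel refcounting and locking. The only point is that the
owner (ublk here) hears about the buffer exactly once, when the last
reference goes away, no matter who removed it from the table.

```c
#include <stdlib.h>

/* Hypothetical stand-in for an io_uring buffer node; not the kernel struct. */
struct leased_node {
	int refs;                     /* table reference plus one per in-flight IO */
	void (*release)(void *priv);  /* hypothetical callback into the owner, e.g. ublk */
	void *priv;                   /* owner's cookie, e.g. the block request */
};

/* Allocate a node holding the single reference owned by the buffer table. */
static struct leased_node *node_alloc(void (*release)(void *), void *priv)
{
	struct leased_node *node = malloc(sizeof(*node));

	if (!node)
		return NULL;
	node->refs = 1;
	node->release = release;
	node->priv = priv;
	return node;
}

/* Take an extra reference, e.g. when an IO starts using the buffer. */
static void node_get(struct leased_node *node)
{
	node->refs++;
}

/*
 * Drop one reference. Single-threaded model: the owner is called back
 * only when the last reference is dropped, then the node is freed.
 */
static void node_put(struct leased_node *node)
{
	if (--node->refs)
		return;
	node->release(node->priv);
	free(node);
}

/* Example owner callback: counts how many times ownership was returned. */
static void owner_release(void *priv)
{
	int *released = priv;

	(*released)++;
}
```

With this shape, removing the slot from the table, tearing down the
table and ctx exit all reduce to node_put(), and the owner never gets
the pages back while an IO still holds a reference.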

> +			err = -EBUSY;
> +			break;
> +		}
> +
>   		uvec = u64_to_user_ptr(user_data);
>   		iov = iovec_from_user(uvec, 1, 1, &fast_iov, ctx->compat);
>   		if (IS_ERR(iov)) {
> @@ -258,6 +268,7 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
>   			err = PTR_ERR(node);
>   			break;
>   		}
...
> +int io_buffer_register_bvec(struct io_ring_ctx *ctx, const struct request *rq,
> +			    unsigned int index)
> +{
> +	struct io_rsrc_data *data = &ctx->buf_table;
> +	u16 nr_bvecs = blk_rq_nr_phys_segments(rq);
> +	struct req_iterator rq_iter;
> +	struct io_rsrc_node *node;
> +	struct bio_vec bv;
> +	int i = 0;
> +
> +	lockdep_assert_held(&ctx->uring_lock);
> +
> +	if (WARN_ON_ONCE(!data->nr))
> +		return -EINVAL;

IIUC you can trigger all of these from user space, so they
can't be warnings. The same likely goes for unregister*().
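
As a rough illustration, the same checks can simply return the error
without WARN_ON_ONCE(). This is a userspace sketch under assumptions:
struct buf_table and check_slot() are hypothetical, simplified
stand-ins for io_rsrc_data and the checks in the patch, not the
actual kernel code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the buffer table. */
struct buf_table {
	unsigned int nr;  /* number of slots */
	void **nodes;     /* NULL means the slot is free */
};

/* Returns 0 if the slot can be used, or a negative errno, without warning. */
static int check_slot(const struct buf_table *data, unsigned int index)
{
	if (!data->nr)
		return -EINVAL;  /* no table registered */
	if (index >= data->nr)
		return -EINVAL;  /* index out of range */
	if (data->nodes[index])
		return -EBUSY;   /* slot already occupied */
	return 0;
}
```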

> +	if (WARN_ON_ONCE(index >= data->nr))
> +		return -EINVAL;
> +
> +	node = data->nodes[index];
> +	if (WARN_ON_ONCE(node))
> +		return -EBUSY;
> +
> +	node = io_buffer_alloc_node(ctx, nr_bvecs, blk_rq_bytes(rq));
> +	if (!node)
> +		return -ENOMEM;
> +
> +	rq_for_each_bvec(bv, rq, rq_iter) {
> +		get_page(bv.bv_page);
> +		node->buf->bvec[i].bv_page = bv.bv_page;
> +		node->buf->bvec[i].bv_len = bv.bv_len;
> +		node->buf->bvec[i].bv_offset = bv.bv_offset;

bvec_set_page() should be more convenient
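
For reference, a sketch of what that looks like. struct bio_vec and
bvec_set_page() below are userspace stand-ins that mirror the kernel
definitions in <linux/bvec.h> so the snippet compiles on its own;
copy_segment() is a hypothetical helper used only for illustration.

```c
/* Userspace stand-ins so the sketch compiles outside the kernel. */
struct page;  /* opaque here */

struct bio_vec {
	struct page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

/* Mirrors the kernel helper: sets page, length and offset in one call. */
static inline void bvec_set_page(struct bio_vec *bv, struct page *page,
				 unsigned int len, unsigned int offset)
{
	bv->bv_page = page;
	bv->bv_len = len;
	bv->bv_offset = offset;
}

/* Hypothetical helper: copy one source segment into a destination slot. */
static void copy_segment(struct bio_vec *dst, const struct bio_vec *src)
{
	bvec_set_page(dst, src->bv_page, src->bv_len, src->bv_offset);
}
```

In the loop above, that would turn the three assignments into a
single bvec_set_page(&node->buf->bvec[i], ...) call.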

> +		i++;
> +	}
> +	data->nodes[index] = node;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
> +

...
>   			unsigned long seg_skip;
> diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
> index abd0d5d42c3e1..d1d90d9cd2b43 100644
> --- a/io_uring/rsrc.h
> +++ b/io_uring/rsrc.h
> @@ -13,6 +13,7 @@
>   enum {
>   	IORING_RSRC_FILE		= 0,
>   	IORING_RSRC_BUFFER		= 1,
> +	IORING_RSRC_KBUF		= 2,

The name "kbuf" is already in use, so let's rename it to avoid
confusion. Ming called these leased buffers before, and I think
that's a good name.


-- 
Pavel Begunkov


