public inbox for io-uring@vger.kernel.org
From: Caleb Sander Mateos <csander@purestorage.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	io-uring@vger.kernel.org,
	 Pavel Begunkov <asml.silence@gmail.com>,
	Stefan Metzmacher <metze@samba.org>
Subject: Re: [PATCH V2 08/13] io_uring: bpf: add uring_bpf_memcpy() kfunc
Date: Fri, 13 Mar 2026 22:18:53 -0700	[thread overview]
Message-ID: <CADUfDZqaM2cNn=QJzY1G912F5mX+Uhz5wTV4r3wvhAg-qhoBFQ@mail.gmail.com> (raw)
In-Reply-To: <20260106101126.4064990-9-ming.lei@redhat.com>

On Tue, Jan 6, 2026 at 2:12 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add uring_bpf_memcpy() kfunc that copies data between io_uring BPF
> buffers. This kfunc supports all 5 buffer types defined in
> io_bpf_buf_desc:
>
> - IO_BPF_BUF_USER: plain userspace buffer
> - IO_BPF_BUF_FIXED: fixed buffer (absolute address within buffer)
> - IO_BPF_BUF_VEC: vectored userspace buffer (iovec array)
> - IO_BPF_BUF_KFIXED: kernel fixed buffer (offset-based addressing)
> - IO_BPF_BUF_KVEC: kernel vectored buffer
>
> Add helper functions for buffer import:
> - io_bpf_import_fixed_buf(): handles FIXED/KFIXED types with proper
>   node reference counting
> - io_bpf_import_kvec_buf(): handles KVEC using __io_prep_reg_iovec()
>   and __io_import_reg_vec()
> - io_bpf_import_buffer(): unified dispatcher for all buffer types
> - io_bpf_copy_iters(): page-based copy between iov_iters
>
> The kfunc properly manages buffer node references and submit lock
> for registered buffer access.

Ming,

Thank you for working on this patch series and sorry I didn't have a
chance to take a look at this revision earlier.

My main reservation with this approach is that a dedicated kfunc is
needed for each data manipulation operation an io_uring BPF program
conceivably wants to perform. Even for the initial RAID parity
calculation use case, I imagine we'd need a different kfunc for each
RAID stripe data width and parity width combination. Other data
operations added in the future would also need new kfuncs to support
them.
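
To make the per-variant concern concrete, here is a trivial userspace
sketch of XOR P-parity over N data stripes (purely illustrative, not
kernel code; all names are made up). The computation itself is easy to
express generically; the problem is that the kfunc-per-operation model
would surface each stripe/parity combination as its own exported kfunc:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: XOR P-parity over nr_data equal-length stripe
 * buffers. Trivially generic in userspace, but under a
 * kfunc-per-operation model each (data width, parity width) variant
 * would need its own kfunc exported to BPF. */
static void xor_gen_parity(uint8_t *parity, uint8_t *const data[],
			   size_t nr_data, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		uint8_t p = 0;

		for (size_t d = 0; d < nr_data; d++)
			p ^= data[d][i];
		parity[i] = p;
	}
}
```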

struct io_bpf_buf_desc is certainly more flexible than the approach in
v1, but I worry that fully consuming the iterator in a single kfunc
call prevents composing data operations during a single pass over the
buffer. For example, if the input data needs to be encrypted and then
parity needs to be generated over the encrypted data, calling an
encryption kfunc and a parity generation kfunc in sequence would
perform two passes over the data. To do it in a single pass, you'd
need something like an "encrypt_and_gen_parity()" kfunc, which seems
very niche.
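
For what it's worth, the single-pass argument is easy to sketch in
userspace (XOR is used as a stand-in for both the cipher and the
parity math; every name here is hypothetical): two sequential
kfunc-style calls walk the buffer twice, while the fused variant
touches each byte exactly once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pass 1 of 2: "encrypt" the buffer (XOR with a key byte stands in
 * for a real cipher). */
static void encrypt_buf(uint8_t *buf, size_t len, uint8_t key)
{
	for (size_t i = 0; i < len; i++)
		buf[i] ^= key;
}

/* Pass 2 of 2: fold the (already encrypted) buffer into an 8-byte
 * XOR parity block. */
static void gen_parity(const uint8_t *buf, uint8_t *parity, size_t len)
{
	for (size_t i = 0; i < len; i++)
		parity[i % 8] ^= buf[i];
}

/* Fused variant: encrypt and accumulate parity in one pass, touching
 * each byte once. This is the "encrypt_and_gen_parity()" shape that a
 * kfunc-per-operation model would force into a dedicated kfunc. */
static void encrypt_and_gen_parity(uint8_t *buf, uint8_t *parity,
				   size_t len, uint8_t key)
{
	for (size_t i = 0; i < len; i++) {
		buf[i] ^= key;
		parity[i % 8] ^= buf[i];
	}
}
```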

The hope was that io_uring BPF programs would provide a generic
approach to registered buffer data manipulation, but the dependency on
a large set of kfuncs seems like a significant hurdle. I suspect a
fully generic approach would require essentially reproducing the
struct iov_iter abstraction in kfunc form.

Here's an alternate idea I've been mulling over: could ublk devices
implement the proposed user dma-buf abstraction
(https://lore.kernel.org/io-uring/4796d2f7-5300-4884-bd2e-3fcc7fdd7cea@gmail.com/)?
A client process could then pre-register its data pages as a dma-buf
against a ublk device, which would provide a file handle to the ublk
server for the dma-buf memory that it could mmap() into its address
space. A ublk I/O using a dma-buf would be dispatched to the ublk
server with the virtual address of the data buffer in the ublk
server's address space. Then the ublk server could directly access the
data buffer in userspace.

From what I understand, this is similar to the original concept for
ublk zero-copy, but with an additional pre-registration step. I think
that may address the concern about mapping pages into the ublk
server's address space in the I/O path. It would be an opt-in
optimization for client applications, so it could reasonably require
that the pages shared with the ublk server be mapped MAP_SHARED.
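
As a rough analogy for the data path (using a memfd in place of the
proposed user dma-buf, since the latter doesn't exist yet), the key
property is just shared-mapping visibility: once the client's
MAP_SHARED pages are mapped by the server, the server sees the
client's writes directly, with no per-I/O mapping step:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* A memfd stands in for the proposed user dma-buf: the "client" and
 * "server" map the same fd MAP_SHARED, so client writes are visible
 * through the server's mapping without any per-I/O page mapping.
 * Returns 0 if the server read back exactly what the client wrote. */
static int shared_mapping_demo(void)
{
	int fd = memfd_create("ublk-data", 0);
	char *client, *server;

	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return -1;
	client = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	server = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (client == MAP_FAILED || server == MAP_FAILED)
		return -1;

	strcpy(client, "io payload");		/* client fills the buffer */
	return strcmp(server, "io payload");	/* server reads it directly */
}
```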

>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  io_uring/bpf_op.c | 320 +++++++++++++++++++++++++++++++++++++++++++++-
>  io_uring/bpf_op.h |   3 +-
>  2 files changed, 321 insertions(+), 2 deletions(-)
>
> diff --git a/io_uring/bpf_op.c b/io_uring/bpf_op.c
> index d6f146abe304..3c577aa3dfc4 100644
> --- a/io_uring/bpf_op.c
> +++ b/io_uring/bpf_op.c
> @@ -10,9 +10,11 @@
>  #include <linux/btf.h>
>  #include <linux/btf_ids.h>
>  #include <linux/filter.h>
> +#include <linux/uio.h>
>  #include <uapi/linux/io_uring.h>
>  #include "io_uring.h"
>  #include "register.h"
> +#include "rsrc.h"
>  #include "bpf_op.h"
>
>  static inline unsigned char uring_bpf_get_op(u32 op_flags)
> @@ -47,6 +49,7 @@ int io_uring_bpf_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>
>         data->opf = opf;
>         data->ops = ops;
> +       data->issue_flags = 0;
>         ret = ops->prep_fn(data, sqe);
>         if (!ret) {
>                 /* Only increment refcount on success (uring_lock already held) */
> @@ -74,7 +77,13 @@ static int __io_uring_bpf_issue(struct io_kiocb *req)
>
>  int io_uring_bpf_issue(struct io_kiocb *req, unsigned int issue_flags)
>  {
> -       return __io_uring_bpf_issue(req);
> +       struct uring_bpf_data *data = io_kiocb_to_cmd(req, struct uring_bpf_data);
> +       int ret;
> +
> +       data->issue_flags = issue_flags;
> +       ret = __io_uring_bpf_issue(req);
> +       data->issue_flags = 0;
> +       return ret;
>  }
>
>  void io_uring_bpf_fail(struct io_kiocb *req)
> @@ -291,6 +300,235 @@ static struct bpf_struct_ops bpf_uring_bpf_ops = {
>         .owner = THIS_MODULE,
>  };
>
> +/*
> + * Helper to copy data between two iov_iters using page extraction.
> + * Extracts pages from source iterator and copies them to destination.
> + * Returns number of bytes copied or negative error code.
> + */
> +static ssize_t io_bpf_copy_iters(struct iov_iter *src, struct iov_iter *dst,
> +                                size_t len)
> +{
> +#define MAX_PAGES_PER_LOOP 32
> +       struct page *pages[MAX_PAGES_PER_LOOP];
> +       size_t total_copied = 0;
> +       bool need_unpin;
> +
> +       need_unpin = iov_iter_extract_will_pin(src);
> +
> +       while (len > 0) {
> +               struct page **page_array = pages;
> +               size_t offset, copied = 0;
> +               ssize_t extracted;
> +               unsigned int nr_pages;
> +               size_t chunk_len;
> +               int i;
> +
> +               chunk_len = min_t(size_t, len, MAX_PAGES_PER_LOOP * PAGE_SIZE);
> +               extracted = iov_iter_extract_pages(src, &page_array, chunk_len,
> +                                                  MAX_PAGES_PER_LOOP, 0, &offset);
> +               if (extracted <= 0) {
> +                       if (total_copied > 0)
> +                               break;
> +                       return extracted < 0 ? extracted : -EFAULT;
> +               }
> +
> +               nr_pages = DIV_ROUND_UP(offset + extracted, PAGE_SIZE);
> +
> +               for (i = 0; i < nr_pages && copied < extracted; i++) {
> +                       size_t page_offset = (i == 0) ? offset : 0;
> +                       size_t page_len = min_t(size_t, extracted - copied,
> +                                               PAGE_SIZE - page_offset);
> +                       size_t n;
> +
> +                       n = copy_page_to_iter(page_array[i], page_offset, page_len, dst);
> +                       copied += n;
> +                       if (n < page_len)
> +                               break;
> +               }
> +
> +               if (need_unpin)
> +                       unpin_user_pages(page_array, nr_pages);
> +
> +               total_copied += copied;
> +               len -= copied;
> +
> +               if (copied < extracted)
> +                       break;
> +       }
> +
> +       return total_copied;
> +#undef MAX_PAGES_PER_LOOP
> +}
> +
> +/*
> + * Helper to import fixed buffer (FIXED or KFIXED).
> + * Must be called with submit lock held.
> + *
> + * FIXED: addr is absolute userspace address within buffer
> + * KFIXED: addr is offset from buffer start
> + *
> + * Returns node with incremented refcount on success, ERR_PTR on failure.
> + */
> +static struct io_rsrc_node *io_bpf_import_fixed_buf(struct io_ring_ctx *ctx,
> +                                                   struct iov_iter *iter,
> +                                                   const struct io_bpf_buf_desc *desc,
> +                                                   int ddir)
> +{
> +       struct io_rsrc_node *node;
> +       struct io_mapped_ubuf *imu;
> +       int ret;
> +
> +       node = io_rsrc_node_lookup(&ctx->buf_table, desc->buf_index);
> +       if (!node)
> +               return ERR_PTR(-EFAULT);
> +
> +       imu = node->buf;
> +       if (!(imu->dir & (1 << ddir)))
> +               return ERR_PTR(-EFAULT);

This is already checked in io_import_fixed()? Same comment for
io_bpf_import_reg_vec().

> +
> +       node->refs++;
> +
> +       ret = io_import_fixed(ddir, iter, imu, desc->addr, desc->len);
> +       if (ret) {
> +               node->refs--;

Just wait to increment node->refs until io_import_fixed() succeeds?
Same comment for io_bpf_import_reg_vec().

> +               return ERR_PTR(ret);
> +       }
> +
> +       return node;
> +}
> +
> +/*
> + * Helper to import registered vectored buffer (KVEC).
> + * Must be called with submit lock held.
> + *
> + * addr: userspace iovec pointer
> + * len: number of iovecs
> + * buf_index: registered buffer index
> + *
> + * Returns node with incremented refcount on success, ERR_PTR on failure.
> + * Caller must call io_vec_free(vec) after use.
> + */
> +static struct io_rsrc_node *io_bpf_import_reg_vec(struct io_ring_ctx *ctx,
> +                                                  struct iov_iter *iter,
> +                                                  const struct io_bpf_buf_desc *desc,
> +                                                  int ddir, struct iou_vec *vec)
> +{
> +       struct io_rsrc_node *node;
> +       struct io_mapped_ubuf *imu;
> +       int ret;
> +
> +       node = io_rsrc_node_lookup(&ctx->buf_table, desc->buf_index);
> +       if (!node)
> +               return ERR_PTR(-EFAULT);
> +
> +       imu = node->buf;
> +       if (!(imu->dir & (1 << ddir)))
> +               return ERR_PTR(-EFAULT);
> +
> +       node->refs++;
> +
> +       /* Prepare iovec from userspace */
> +       ret = __io_prep_reg_iovec(vec, u64_to_user_ptr(desc->addr),
> +                                 desc->len, ctx->compat, NULL);
> +       if (ret)
> +               goto err;
> +
> +       /* Import vectored buffer from registered buffer */
> +       ret = __io_import_reg_vec(ddir, iter, imu, vec, desc->len, NULL);
> +       if (ret)
> +               goto err;
> +
> +       return node;
> +err:
> +       node->refs--;
> +       return ERR_PTR(ret);
> +}
> +
> +/*
> + * Helper to import a vectored user buffer (VEC) into iou_vec.
> + * Allocates space in vec and copies iovec from userspace.
> + *
> + * Returns 0 on success, negative error code on failure.
> + * Caller must call io_vec_free(vec) after use.
> + */
> +static int io_bpf_import_vec_buf(struct io_ring_ctx *ctx,
> +                                struct iov_iter *iter,
> +                                const struct io_bpf_buf_desc *desc,
> +                                int ddir, struct iou_vec *vec)
> +{
> +       unsigned nr_vecs = desc->len;
> +       struct iovec *iov;
> +       size_t total_len = 0;
> +       void *res;
> +       int ret, i;
> +
> +       if (nr_vecs > vec->nr) {
> +               ret = io_vec_realloc(vec, nr_vecs);
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       iov = vec->iovec;
> +       res = iovec_from_user(u64_to_user_ptr(desc->addr), nr_vecs,
> +                             nr_vecs, iov, ctx->compat);
> +       if (IS_ERR(res))
> +               return PTR_ERR(res);
> +
> +       for (i = 0; i < nr_vecs; i++)
> +               total_len += iov[i].iov_len;
> +
> +       iov_iter_init(iter, ddir, iov, nr_vecs, total_len);
> +       return 0;
> +}
> +
> +/*
> + * Helper to import a buffer into an iov_iter based on io_bpf_buf_desc.
> + * Supports all 5 buffer types: USER, FIXED, VEC, KFIXED, KVEC.
> + * Must be called with submit lock held for FIXED/KFIXED/KVEC types.
> + *
> + * @ctx: ring context
> + * @iter: output iterator
> + * @desc: buffer descriptor
> + * @ddir: direction (ITER_SOURCE for source, ITER_DEST for destination)
> + * @vec: iou_vec for VEC/KVEC types (caller must call io_vec_free after use)
> + *
> + * Returns node pointer (may be NULL for USER/VEC), or ERR_PTR on failure.
> + * Caller must drop node reference when done if non-NULL.
> + */
> +static struct io_rsrc_node *io_bpf_import_buffer(struct io_ring_ctx *ctx,
> +                                                struct iov_iter *iter,
> +                                                const struct io_bpf_buf_desc *desc,
> +                                                int ddir, struct iou_vec *vec)
> +{
> +       int ret;
> +
> +       switch (desc->type) {
> +       case IO_BPF_BUF_USER:
> +               /* Plain user buffer */
> +               ret = import_ubuf(ddir, u64_to_user_ptr(desc->addr),
> +                                 desc->len, iter);
> +               return ret ? ERR_PTR(ret) : NULL;
> +
> +       case IO_BPF_BUF_FIXED:
> +       case IO_BPF_BUF_KFIXED:

Seems like these are handled identically, are both needed?

> +               /* FIXED: addr is absolute address within buffer */
> +               /* KFIXED: addr is offset from buffer start */
> +               return io_bpf_import_fixed_buf(ctx, iter, desc, ddir);
> +
> +       case IO_BPF_BUF_VEC:
> +               /* Vectored user buffer - addr is iovec ptr, len is nr_vecs */
> +               ret = io_bpf_import_vec_buf(ctx, iter, desc, ddir, vec);
> +               return ret ? ERR_PTR(ret) : NULL;
> +
> +       case IO_BPF_BUF_REG_VEC:
> +               /* Registered vectored buffer */
> +               return io_bpf_import_reg_vec(ctx, iter, desc, ddir, vec);
> +
> +       default:
> +               return ERR_PTR(-EINVAL);
> +       }
> +}
> +
>  __bpf_kfunc_start_defs();
>  __bpf_kfunc void uring_bpf_set_result(struct uring_bpf_data *data, int res)
>  {
> @@ -300,10 +538,90 @@ __bpf_kfunc void uring_bpf_set_result(struct uring_bpf_data *data, int res)
>                 req_set_fail(req);
>         io_req_set_res(req, res, 0);
>  }
> +
> +/**
> + * uring_bpf_memcpy - Copy data between io_uring BPF buffers
> + * @data: BPF request data containing request context
> + * @dst: Destination buffer descriptor
> + * @src: Source buffer descriptor
> + *
> + * Copies data from source buffer to destination buffer.
> + * Supports all 5 buffer types: USER, FIXED, VEC, KFIXED, REG_VEC.
> + * The copy length is min of actual buffer sizes (for VEC types,
> + * total bytes across all vectors, not nr_vecs).
> + *
> + * Returns: Number of bytes copied on success, negative error code on failure
> + */
> +__bpf_kfunc ssize_t uring_bpf_memcpy(const struct uring_bpf_data *data,
> +                                    struct io_bpf_buf_desc *dst,
> +                                    struct io_bpf_buf_desc *src)
> +{
> +       struct io_kiocb *req = cmd_to_io_kiocb((void *)data);

Could omit the explicit void * cast

Best,
Caleb

> +       struct io_ring_ctx *ctx = req->ctx;
> +       unsigned int issue_flags = data->issue_flags;
> +       struct io_rsrc_node *src_node, *dst_node;
> +       struct iov_iter src_iter, dst_iter;
> +       struct iou_vec src_vec = {};
> +       struct iou_vec dst_vec = {};
> +       ssize_t ret;
> +       size_t len;
> +
> +       /* Validate buffer types */
> +       if (src->type > IO_BPF_BUF_REG_VEC || dst->type > IO_BPF_BUF_REG_VEC)
> +               return -EINVAL;
> +
> +       io_ring_submit_lock(ctx, issue_flags);
> +
> +       /* Import source buffer */
> +       src_node = io_bpf_import_buffer(ctx, &src_iter, src, ITER_SOURCE,
> +                                       &src_vec);
> +       if (IS_ERR(src_node)) {
> +               ret = PTR_ERR(src_node);
> +               goto unlock;
> +       }
> +
> +       /* Import destination buffer */
> +       dst_node = io_bpf_import_buffer(ctx, &dst_iter, dst, ITER_DEST,
> +                                       &dst_vec);
> +       if (IS_ERR(dst_node)) {
> +               ret = PTR_ERR(dst_node);
> +               goto put_src;
> +       }
> +
> +       /*
> +        * Calculate copy length from actual iterator sizes.
> +        * For VEC types, desc->len is nr_vecs, not total bytes.
> +        */
> +       len = min(iov_iter_count(&src_iter), iov_iter_count(&dst_iter));
> +       if (!len) {
> +               ret = 0;
> +               goto put_dst;
> +       }
> +       if (len > MAX_RW_COUNT) {
> +               ret = -EINVAL;
> +               goto put_dst;
> +       }
> +
> +       /* Copy data between iterators */
> +       ret = io_bpf_copy_iters(&src_iter, &dst_iter, len);
> +
> +put_dst:
> +       io_vec_free(&dst_vec);
> +       if (dst_node)
> +               io_put_rsrc_node(ctx, dst_node);
> +put_src:
> +       io_vec_free(&src_vec);
> +       if (src_node)
> +               io_put_rsrc_node(ctx, src_node);
> +unlock:
> +       io_ring_submit_unlock(ctx, issue_flags);
> +       return ret;
> +}
>  __bpf_kfunc_end_defs();
>
>  BTF_KFUNCS_START(uring_bpf_kfuncs)
>  BTF_ID_FLAGS(func, uring_bpf_set_result)
> +BTF_ID_FLAGS(func, uring_bpf_memcpy)
>  BTF_KFUNCS_END(uring_bpf_kfuncs)
>
>  static const struct btf_kfunc_id_set uring_kfunc_set = {
> diff --git a/io_uring/bpf_op.h b/io_uring/bpf_op.h
> index 9de0606f5d25..6004fb906983 100644
> --- a/io_uring/bpf_op.h
> +++ b/io_uring/bpf_op.h
> @@ -13,10 +13,11 @@ struct uring_bpf_data {
>         void                            *req_data;  /* not for bpf prog */
>         const struct uring_bpf_ops      *ops;
>         u32                             opf;
> +       u32                             issue_flags; /* io_uring issue flags */
>
>         /* writeable for bpf prog */
>         u8              pdu[64 - sizeof(void *) -
> -               sizeof(struct uring_bpf_ops *) - sizeof(u32)];
> +               sizeof(struct uring_bpf_ops *) - 2 * sizeof(u32)];
>  };
>
>  typedef int (*uring_bpf_prep_t)(struct uring_bpf_data *data,
> --
> 2.47.0
>
