public inbox for io-uring@vger.kernel.org
From: Ming Lei <ming.lei@redhat.com>
To: Caleb Sander Mateos <csander@purestorage.com>
Cc: Stefan Metzmacher <metze@samba.org>, Jens Axboe <axboe@kernel.dk>,
	io-uring@vger.kernel.org, Akilesh Kailash <akailash@google.com>,
	bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>
Subject: Re: [PATCH 3/5] io_uring: bpf: extend io_uring with bpf struct_ops
Date: Tue, 9 Dec 2025 11:08:36 +0800	[thread overview]
Message-ID: <aTeStJ9_Tu0i5_wH@fedora> (raw)
In-Reply-To: <CADUfDZqpTSihuYnTqUbtctrX4OGT7Szr-_wWb4xLgg11RcwYkA@mail.gmail.com>

On Mon, Dec 08, 2025 at 02:45:35PM -0800, Caleb Sander Mateos wrote:
> On Thu, Nov 13, 2025 at 7:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Thu, Nov 13, 2025 at 12:19:33PM +0100, Stefan Metzmacher wrote:
> > > Am 13.11.25 um 11:59 schrieb Ming Lei:
> > > > On Thu, Nov 13, 2025 at 11:32:56AM +0100, Stefan Metzmacher wrote:
> > > > > Hi Ming,
> > > > >
> > > > > > io_uring can be extended with bpf struct_ops in the following ways:
> > > > > >
> > > > > > 1) add a new io_uring operation from the application
> > > > > > - one typical use case is operating on a device zero-copy buffer, which
> > > > > > belongs to the kernel and is either not visible to userspace or too
> > > > > > expensive to export, such as copying data from this buffer to userspace,
> > > > > > decompressing data into the zero-copy buffer in the Android case[1][2], or
> > > > > > checksumming/decrypting.
> > > > > >
> > > > > > [1] https://lpc.events/event/18/contributions/1710/attachments/1440/3070/LPC2024_ublk_zero_copy.pdf
> > > > > >
> > > > > > 2) extend the 64-byte SQE, since a bpf map can conveniently be used to
> > > > > >      store IO data
> > > > > >
> > > > > > 3) communicate within an IO chain, since a bpf map can be shared among
> > > > > > IOs: when one bpf IO completes, data can be written to a chain-wide
> > > > > > bpf map, and the following bpf IO can retrieve the data from this bpf
> > > > > > map; this is more flexible than io_uring's built-in buffer
> > > > > >
> > > > > > 4) pretty handy for injecting errors for test purposes
> > > > > >
> > > > > > bpf struct_ops is a very handy way to attach a bpf prog to the kernel, and
> > > > > > this patch simply wires the existing io_uring operation callbacks to the
> > > > > > added uring bpf struct_ops, so an application can define its own uring bpf
> > > > > > operations.
> > > > >
> > > > > This sounds useful to me.
> > > > >
> > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > > ---
> > > > > >    include/uapi/linux/io_uring.h |   9 ++
> > > > > >    io_uring/bpf.c                | 271 +++++++++++++++++++++++++++++++++-
> > > > > >    io_uring/io_uring.c           |   1 +
> > > > > >    io_uring/io_uring.h           |   3 +-
> > > > > >    io_uring/uring_bpf.h          |  30 ++++
> > > > > >    5 files changed, 311 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> > > > > > index b8c49813b4e5..94d2050131ac 100644
> > > > > > --- a/include/uapi/linux/io_uring.h
> > > > > > +++ b/include/uapi/linux/io_uring.h
> > > > > > @@ -74,6 +74,7 @@ struct io_uring_sqe {
> > > > > >                 __u32           install_fd_flags;
> > > > > >                 __u32           nop_flags;
> > > > > >                 __u32           pipe_flags;
> > > > > > +               __u32           bpf_op_flags;
> > > > > >         };
> > > > > >         __u64   user_data;      /* data to be passed back at completion time */
> > > > > >         /* pack this to avoid bogus arm OABI complaints */
> > > > > > @@ -427,6 +428,13 @@ enum io_uring_op {
> > > > > >    #define IORING_RECVSEND_BUNDLE               (1U << 4)
> > > > > >    #define IORING_SEND_VECTORIZED               (1U << 5)
> > > > > > +/*
> > > > > > + * sqe->bpf_op_flags           top 8bits is for storing bpf op
> > > > > > + *                             The other 24bits are used for bpf prog
> > > > > > + */
> > > > > > +#define IORING_BPF_OP_BITS     (8)
> > > > > > +#define IORING_BPF_OP_SHIFT    (24)
> > > > > > +
> > > > > >    /*
> > > > > >     * cqe.res for IORING_CQE_F_NOTIF if
> > > > > >     * IORING_SEND_ZC_REPORT_USAGE was requested
> > > > > > @@ -631,6 +639,7 @@ struct io_uring_params {
> > > > > >    #define IORING_FEAT_MIN_TIMEOUT              (1U << 15)
> > > > > >    #define IORING_FEAT_RW_ATTR          (1U << 16)
> > > > > >    #define IORING_FEAT_NO_IOWAIT                (1U << 17)
> > > > > > +#define IORING_FEAT_BPF                        (1U << 18)
> > > > > >    /*
> > > > > >     * io_uring_register(2) opcodes and arguments
> > > > > > diff --git a/io_uring/bpf.c b/io_uring/bpf.c
> > > > > > index bb1e37d1e804..8227be6d5a10 100644
> > > > > > --- a/io_uring/bpf.c
> > > > > > +++ b/io_uring/bpf.c
> > > > > > @@ -4,28 +4,95 @@
> > > > > >    #include <linux/kernel.h>
> > > > > >    #include <linux/errno.h>
> > > > > >    #include <uapi/linux/io_uring.h>
> > > > > > +#include <linux/init.h>
> > > > > > +#include <linux/types.h>
> > > > > > +#include <linux/bpf_verifier.h>
> > > > > > +#include <linux/bpf.h>
> > > > > > +#include <linux/btf.h>
> > > > > > +#include <linux/btf_ids.h>
> > > > > > +#include <linux/filter.h>
> > > > > >    #include "io_uring.h"
> > > > > >    #include "uring_bpf.h"
> > > > > > +#define MAX_BPF_OPS_COUNT      (1 << IORING_BPF_OP_BITS)
> > > > > > +
> > > > > >    static DEFINE_MUTEX(uring_bpf_ctx_lock);
> > > > > >    static LIST_HEAD(uring_bpf_ctx_list);
> > > > > > +DEFINE_STATIC_SRCU(uring_bpf_srcu);
> > > > > > +static struct uring_bpf_ops bpf_ops[MAX_BPF_OPS_COUNT];
> > > > >
> > > > > This indicates to me that the whole system with all applications in all namespaces
> > > > > needs to coordinate in order to use these 256 ops?
> > > >
> > > > So far there are only 62 in-tree io_uring operations defined; I feel 256
> > > > should be enough.
> > > >
> > > > > I think in order to have something useful, this should be per
> > > > > struct io_ring_ctx and each application should be able to load
> > > > > its own bpf programs.
> > > >
> > > > per-ctx requirement looks reasonable, and it shouldn't be hard to
> > > > support.
> > > >
> > > > >
> > > > > Something that uses bpf_prog_get_type() based on a bpf_fd
> > > > > like SIOCKCMATTACH in net/kcm/kcmsock.c.
> > > >
> > > > I considered a per-ctx prog before; one drawback is that the prog can't be
> > > > shared among io_ring_ctx instances, which could waste memory. In my ublk
> > > > case, there can be lots of devices sharing the same bpf prog.
> > >
> > > Can't the ublk instances coordinate and use the same bpf_fd?
> > > New instances could request it via a unix socket and SCM_RIGHTS
> > > from a long-running loading process. On the other hand, do they
> > > really want to share?
> >
> > struct_ops is typically registered once and used everywhere, as in the
> > sched_ext and socket examples.
> >
> > This patch follows this usage, so every io_uring application can access it like the
> > in-kernel operations.
> >
> > I can understand the requirement for per-io-ring-ctx struct_ops, which
> > won't cause conflicts among different applications.
> >
> > Take ublk/raid5 for example: there are 100 such devices, each created in a dedicated
> > process and using its own io_uring, so 100 copies of the same struct_ops prog are registered
> > in memory. If the struct_ops prog is registered per-io-ring-ctx, it may not be shareable
> > via `bpf_fd`, IMO.
> 
> I agree with Stefan that a global IORING_OP_BPF op to BPF program
> mapping will be difficult to coordinate between processes. For
> example, consider two different ublk server programs that each want to
> use a different BPF program. Ideally, each should be an independent
> program and not need to know the op ids used by the other.

Each process can query the free slots by checking `bpftool struct_ops`
(e.g. `bpftool struct_ops list`).

> On the other hand, a multithreaded process may have multiple
> io_ring_ctxs and want to use the same IORING_OP_BPF ops with all of
> them. So a process-level mapping seems to make the most sense. And
> that's exactly the mapping level that we would get from using the BPF
> program file descriptor to specify the IORING_OP_BPF op. Additionally,
> as Stefan points out, the IORING_OP_BPF program could be shared with
> another process by sending the file descriptor using SCM_RIGHTS. And

io_uring FD doesn't support SCM_RIGHTS.

If one privileged process sends the bpf prog FD via SCM_RIGHTS, that means
this privileged process may effectively allow any IORING_OP_BPF program to be
registered, which sounds like `CONFIG_BPF_UNPRIV_DEFAULT_OFF == n`. It is
probably fine if we just expose `struct uring_bpf_data` and do not expose
`struct io_kiocb` to the bpf prog.
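
Just to illustrate the idea (not what the patch does; the struct layout and
the callback prototypes below are made up), the prog-visible context could be
restricted to something like:

        /*
         * Hypothetical sketch only: the bpf prog sees a small per-request
         * structure instead of struct io_kiocb itself, so a shared prog
         * cannot poke at core io_uring state.  The real layout is whatever
         * uring_bpf.h ends up exporting.
         */
        struct uring_bpf_data {
                __u64   user_data;      /* copied from the SQE */
                __u32   op_flags;       /* low 24 bits of sqe->bpf_op_flags */
                __s32   res;            /* result to post in the CQE */
        };

        struct uring_bpf_ops {
                int (*prep)(struct uring_bpf_data *data);
                int (*issue)(struct uring_bpf_data *data);
        };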

Another way is to register global struct_ops progs in the following way
(see the sketch below):

- the first 256 progs are stored in a plain array, reserved for truly
  global/generic progs

- the remaining progs (256 ~ 65535) are stored in an xarray, and processes
  can look up free slots and claim them dynamically.
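
A minimal sketch of that two-tier lookup (the helper names and the pointer
array below are made up, just to show the shape; the current patch only has
the plain bpf_ops[MAX_BPF_OPS_COUNT] array):

        /* hypothetical: ops 0-255 in a plain array for truly global progs,
         * ops 256-65535 allocated dynamically from an xarray */
        static struct uring_bpf_ops *global_ops[MAX_BPF_OPS_COUNT];
        static DEFINE_XARRAY_ALLOC(dyn_ops);

        static struct uring_bpf_ops *uring_bpf_find_ops(unsigned int op)
        {
                if (op < MAX_BPF_OPS_COUNT)
                        return global_ops[op];
                return xa_load(&dyn_ops, op);
        }

        /* hand out a free dynamic slot in [256, 65535] to the registering process */
        static int uring_bpf_alloc_dyn_op(struct uring_bpf_ops *ops)
        {
                u32 op;
                int ret;

                ret = xa_alloc(&dyn_ops, &op, ops,
                               XA_LIMIT(MAX_BPF_OPS_COUNT, U16_MAX), GFP_KERNEL);
                return ret ? ret : op;
        }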

I'd suggest starting with the global registration, which is easy to use, and
extending to per-uring-ctx struct_ops in the future.


Thanks,
Ming



Thread overview: 41+ messages
2025-11-04 16:21 [PATCH 0/5] io_uring: add IORING_OP_BPF for extending io_uring Ming Lei
2025-11-04 16:21 ` [PATCH 1/5] io_uring: prepare for extending io_uring with bpf Ming Lei
2025-12-31  1:13   ` Caleb Sander Mateos
2025-12-31  9:33     ` Ming Lei
2025-11-04 16:21 ` [PATCH 2/5] io_uring: bpf: add io_uring_ctx setup for BPF into one list Ming Lei
2025-12-31  1:13   ` Caleb Sander Mateos
2025-12-31  9:49     ` Ming Lei
2025-12-31 16:19       ` Caleb Sander Mateos
2025-11-04 16:21 ` [PATCH 3/5] io_uring: bpf: extend io_uring with bpf struct_ops Ming Lei
2025-11-07 19:02   ` kernel test robot
2025-11-08  6:53   ` kernel test robot
2025-11-13 10:32   ` Stefan Metzmacher
2025-11-13 10:59     ` Ming Lei
2025-11-13 11:19       ` Stefan Metzmacher
2025-11-14  3:00         ` Ming Lei
2025-12-08 22:45           ` Caleb Sander Mateos
2025-12-09  3:08             ` Ming Lei [this message]
2025-12-10 16:11               ` Caleb Sander Mateos
2025-11-19 14:39   ` Jonathan Corbet
2025-11-20  1:46     ` Ming Lei
2025-11-20  1:51       ` Ming Lei
2025-12-31  1:19   ` Caleb Sander Mateos
2025-12-31 10:32     ` Ming Lei
2025-12-31 16:48       ` Caleb Sander Mateos
2025-11-04 16:21 ` [PATCH 4/5] io_uring: bpf: add buffer support for IORING_OP_BPF Ming Lei
2025-11-13 10:42   ` Stefan Metzmacher
2025-11-13 11:04     ` Ming Lei
2025-11-13 11:25       ` Stefan Metzmacher
2025-12-31  1:42   ` Caleb Sander Mateos
2025-12-31 11:02     ` Ming Lei
2025-12-31 17:02       ` Caleb Sander Mateos
2025-11-04 16:21 ` [PATCH 5/5] io_uring: bpf: add io_uring_bpf_req_memcpy() kfunc Ming Lei
2025-11-07 18:51   ` kernel test robot
2025-12-31  1:42   ` Caleb Sander Mateos
2025-11-05 12:47 ` [PATCH 0/5] io_uring: add IORING_OP_BPF for extending io_uring Pavel Begunkov
2025-11-05 15:57   ` Ming Lei
2025-11-06 16:03     ` Pavel Begunkov
2025-11-07 15:54       ` Ming Lei
2025-11-11 14:07         ` Pavel Begunkov
2025-11-13  4:18           ` Ming Lei
2025-11-19 19:00             ` Pavel Begunkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aTeStJ9_Tu0i5_wH@fedora \
    --to=ming.lei@redhat.com \
    --cc=akailash@google.com \
    --cc=ast@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=bpf@vger.kernel.org \
    --cc=csander@purestorage.com \
    --cc=io-uring@vger.kernel.org \
    --cc=metze@samba.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox