From: Ming Lei <ming.lei@redhat.com>
To: Pavel Begunkov <asml.silence@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>,
io-uring@vger.kernel.org,
Caleb Sander Mateos <csander@purestorage.com>,
Akilesh Kailash <akailash@google.com>,
bpf@vger.kernel.org, Alexei Starovoitov <ast@kernel.org>
Subject: Re: [PATCH 0/5] io_uring: add IORING_OP_BPF for extending io_uring
Date: Thu, 13 Nov 2025 12:18:08 +0800
Message-ID: <aRVcAFOsb7X3kxB9@fedora>
In-Reply-To: <9b59b165-1f57-4cb6-ae62-403d922ad4da@gmail.com>
On Tue, Nov 11, 2025 at 02:07:47PM +0000, Pavel Begunkov wrote:
> On 11/7/25 15:54, Ming Lei wrote:
> > On Thu, Nov 06, 2025 at 04:03:29PM +0000, Pavel Begunkov wrote:
> > > On 11/5/25 15:57, Ming Lei wrote:
> > > > On Wed, Nov 05, 2025 at 12:47:58PM +0000, Pavel Begunkov wrote:
> > > > > On 11/4/25 16:21, Ming Lei wrote:
> > > > > > Hello,
> > > > > >
> > > > > > Add IORING_OP_BPF for extending io_uring operations; the following are typical cases:
> > > > >
> > > > > BPF requests were tried a long time ago and it wasn't great. Performance
> > > >
> > > > Care to share the link so I can learn from the lesson? Maybe things have
> > > > changed now...
> > >
> > > https://lore.kernel.org/io-uring/a83f147b-ea9d-e693-a2e9-c6ce16659749@gmail.com/T/#m31d0a2ac6e2213f912a200f5e8d88bd74f81406b
> > >
> > > There were some extra features and testing from folks, but I don't
> > > think it was ever posted to the list.
> >
> > Thanks for sharing the link:
> >
> > ```
> > The main problem solved is feeding completion information of other
> > requests in a form of CQEs back into BPF. I decided to wire up support
> > for multiple completion queues (aka CQs) and give BPF programs access to
> > them, so leaving userspace in control over synchronisation that should
> > be much more flexible than the link-based approach.
> > ```
>
> FWIW, those extensions were a sign that the approach
> wasn't flexible enough.
>
> > It looks totally different from my patch in motivation and policy.
> >
> > I do _not_ want to move application logic into the kernel by building SQEs in a
> > kernel prog. With IORING_OP_BPF, the whole io_uring application is
> > built & maintained completely in userspace, so I don't need to do cumbersome
> > kernel/user communication just to set up one SQE in the prog, not to mention
> > maintaining the SQE's relationship with the userspace side.
>
> It's built and maintained in userspace in either case, and in
No.

A BPF prog is not userspace; it is definitely kernel code, but it belongs to
the application's scope.
> both cases you have bpf implementing some logic that was previously
> done in userspace. To emphasize, you can do the desired parts of
> handling in BPF, and I'm not suggesting moving the entirety of
> request processing in there.
The problem with your patch is that the SQE is built in the bpf prog (kernel), so
application logic inevitably moves into the bpf prog, which isn't good at
handling complicated logic.

People then have to do kernel<->user communication just to set up the SQE.

And the SQE in the bpf prog may need to be linked with previous and following SQEs in
userspace, which basically partitions the application logic into two parts: one
in userspace, the other in the bpf prog (kernel).

The patch I am suggesting doesn't have this problem: all SQEs are built in
userspace, and only the minimal part (a standalone, well-defined function) is
done in the bpf prog.
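To make this concrete, here is a rough userspace sketch. Only the IORING_OP_BPF
opcode comes from this patchset; using sqe->off to select the bpf operation, and
the BPF_OP_ID value, are placeholders for illustration, not the final UAPI:

```c
/* Sketch only: IORING_OP_BPF is from this series, while BPF_OP_ID and
 * the use of sqe->off to pick the struct_ops operation are placeholders.
 * Error handling is omitted for brevity.
 */
#include <string.h>
#include <liburing.h>

#define BPF_OP_ID	1	/* hypothetical operation selector */

static void queue_chain(struct io_uring *ring, int fd_in, int fd_out,
			void *buf, unsigned len)
{
	struct io_uring_sqe *sqe;

	/* normal read, built in userspace as usual */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, fd_in, buf, len, 0);
	sqe->flags |= IOSQE_IO_LINK;

	/* the bpf-defined operation sits in the middle of the chain */
	sqe = io_uring_get_sqe(ring);
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_BPF;	/* from this patchset */
	sqe->off = BPF_OP_ID;		/* placeholder op selector */
	sqe->flags |= IOSQE_IO_LINK;

	/* normal write consuming the bpf op's output */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_write(sqe, fd_out, buf, len, 0);

	io_uring_submit(ring);
}
```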
>
> > > > > for short BPF programs is not great because of io_uring request handling
> > > > > overhead. And flexibility was severely lacking, so even simple use cases
> > > >
> > > > What is the overhead? In this patch, OP's prep() and issue() are defined in
> > >
> > > The overhead of creating, freeing and executing a request. If you use
> > > it with links, there is also the overhead of that. That prototype could also
> > > optionally wait for completions, and it wasn't free either.
> >
> > IORING_OP_BPF is the same as existing normal io_uring requests and links, wrt
> > everything you mentioned above.
>
> It is, but it's an extra request, and in previous testing the overhead
> of that extra request was affecting total performance; that's why
> linking or not is also important.
Yes, but does the extra request matter for overall performance?

I did run such a test:

1) in tools/testing/selftests/ublk/null.c
   - for the zero-copy test, one extra nop is submitted

2) rublk test
   - for the zero-copy test, it simply returns without submitting the nop

The IOPS gap is pretty small.
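The extra request in that test is just a linked nop, along these lines
(illustrative sketch only, not the actual selftest code):

```c
/* Illustrative only; the real measurement lives in
 * tools/testing/selftests/ublk/null.c.
 */
#include <liburing.h>

static void submit_extra_nop(struct io_uring *ring)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_nop(sqe);	/* the extra request being measured */
	io_uring_submit(ring);
}
```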
Also, in your approach, without allocating one new SQE in bpf, how do you
provide a generic interface for a bpf prog to work on different functions,
such as memory copy, raid5 parity, or compression? All of these require
flexible handling: variable parameters, and buffers that could be plain user
memory, fixed, vectored, or fixed vectored. So one SQE, i.e. a new operation,
is the easiest way to provide the abstraction and a generic bpf prog interface.
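For example, the bpf side can stay a small struct_ops while the SQE carries all
the variable parameters. A minimal sketch follows; the struct/callback names and
the io_uring_bpf_req_memcpy() signature are assumptions loosely based on patches
3 and 5 of this series, not the actual code:

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch only: struct io_uring_bpf_req, struct io_uring_bpf_ops and the
 * io_uring_bpf_req_memcpy() prototype are assumed for illustration,
 * loosely following patches 3 and 5 of this series.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* kfunc added in patch 5; exact signature assumed here */
extern int io_uring_bpf_req_memcpy(struct io_uring_bpf_req *req,
				   __u64 dst_off, __u64 src_off,
				   __u64 len) __ksym;

SEC("struct_ops/prep")
int BPF_PROG(bpf_memcpy_prep, struct io_uring_bpf_req *req,
	     const struct io_uring_sqe *sqe)
{
	/* validate the parameters userspace packed into the SQE */
	return 0;
}

SEC("struct_ops/issue")
int BPF_PROG(bpf_memcpy_issue, struct io_uring_bpf_req *req)
{
	/* one small, well-defined piece of fast-path work */
	return io_uring_bpf_req_memcpy(req, 0, 0, 4096);
}

SEC(".struct_ops.link")
struct io_uring_bpf_ops memcpy_ops = {
	.prep	= (void *)bpf_memcpy_prep,
	.issue	= (void *)bpf_memcpy_issue,
};
```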
>
> > IORING_OP_BPF's motivation is to be a functional supplement or extension
> > of io_uring, not to improve performance.
> >
> > >
> > > > bpf prog, but in the typical use case the code size is pretty small, and the
> > > > bpf prog code is supposed to run in the fast path.
> > > > > were looking pretty ugly, internally, and for BPF writers as well.
> > > >
> > > > I am not sure what `simple use cases` you are talking about.
> > >
> > > As an example, creating a loop reading a file:
> > > read N bytes; wait for completion; repeat
> >
> > IORING_OP_BPF isn't supposed to implement FS operations in a bpf prog.
> >
> > It doesn't mean IORING_OP_BPF can't support async issuing:
> >
> > - issue_wait() can be added for offload in io-wq context
> >
> > OR
> >
> > - for typical FS AIO, in theory it can be supported too; the struct_ops just
> >   needs to define one completion callback, which can be called from
> >   ->ki_complete().
>
> There is more to IO than read/write, and I'm afraid each new type of
> operation would need some extra kfunc glue. And even then there is
> more to handling rw requests in io_uring than just calling the
> callback. It's nicer to be able to reuse all io_uring request
> handling, which wouldn't even need extra kfuncs.
It looks like you are trying to propose a generic bpf io_uring request, which
is an ambitious goal :-)

But that isn't my patchset's motivation; it just serves as a supplement or
extension of existing io_uring.

Another big case could be network IO, which would need -EAGAIN handling to be
covered; or are there other main cases?
>
> ...
> > > > and it can't be used in my case.
> > > Hmm, how so? Let's say ublk registers a buffer and posts a
> > > completion. Then BPF runs, it sees the completion and does the
> > > necessary processing, probably using some kfuncs like the ones
> >
> > It is easy to say, but how can the BPF prog know exactly which completion it
> > is waiting for? You have to rely on a bpf map to communicate with userspace
>
> By taking a peek at and maybe dereferencing cqe->user_data.
Yes, but you have to pass the ->user_data values of interest to the bpf prog
first.

There could be many inflight IOs of interest; how do you query them efficiently?
Scan each one after every CQE is posted? But eBPF only supports bounded loops,
so the complexity budget can easily be exhausted [1].

[1] https://docs.ebpf.io/linux/concepts/loops/
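To illustrate the cost, a per-CQE scan ends up looking something like this
(sketch; the map layout and the MAX_INFLIGHT bound are hypothetical):

```c
/* Sketch: matching a posted CQE against every tracked user_data.
 * The verifier requires a compile-time bound, so the worst-case scan
 * cost is paid on every CQE; all names here are hypothetical.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_INFLIGHT	128

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, MAX_INFLIGHT);
	__type(key, __u32);
	__type(value, __u64);	/* tracked user_data values */
} inflight SEC(".maps");

static int match_cqe(__u64 user_data)
{
	__u32 i;

	/* bounded loop: O(MAX_INFLIGHT) work per posted CQE */
	for (i = 0; i < MAX_INFLIGHT; i++) {
		__u64 *tracked = bpf_map_lookup_elem(&inflight, &i);

		if (tracked && *tracked == user_data)
			return i;
	}
	return -1;
}
```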
>
> > to understand which completion is the one you are interested in; you also
> > need all the information from userspace for preparing the SQE for submission
> > from the bpf prog. Tons of userspace/kernel communication.
>
> You can set up a BPF arena, and all that comm will be working with
> a block of shared memory. Or same but via io_uring parameter region.
> That sounds pretty simple.
But the application logic has to be split into two parts, and both have to
rely on the shared memory to communicate.

The existing io_uring application is complicated enough already; adding one
extra shared-memory communication channel just to hold application logic makes
things worse. Even in userspace programming, it is horrible to encode logic
into data; that is why the state machine pattern is usually hard to read.

Think about writing a high-performance raid5 application based on ublk zero
copy & io_uring; for example, handling one simple write:

- one ublk write command comes in for raid5
- suppose the command writes data to exactly one single stripe
- each write is submitted to one of the N - 1 data disks
- when all N - 1 writes are done, the new SQE needs to do its work:
  - calculate parity by reading buffers from the N - 1 requests' kernel buffers
    and writing the resulting XOR parity to one user-specified buffer
  - then a new FS IO needs to be submitted to write the parity data to the
    calculated disk (N)

So the things involved for the bpf prog SQE are:

- monitoring the N - 1 writes
- doing the parity calculation job, which requires defining one kfunc
- marking the parity as ready & notifying userspace to write the parity (how
  to notify?)

Now there can be a variable (possibly large) number of such WRITEs to handle
concurrently, and the bpf prog has to cover them all.

The above is just the simplest case; the write command may not be aligned with
a stripe, so the parity calculation may need to read data from other stripes.

If you think it is `pretty simple`, care to provide one example showing your
approach is workable?
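For contrast, with IORING_OP_BPF the aligned single-stripe case stays entirely
in userspace, roughly like the sketch below. The RAID5_PARITY_OP selector and
the SQE encoding are hypothetical, and plain IOSQE_IO_LINK serializes the data
writes, so real code would likely track completions itself before submitting
the parity op:

```c
/* Sketch of the simple aligned-stripe write using IORING_OP_BPF.
 * RAID5_PARITY_OP and the parameter encoding are hypothetical; note
 * that IOSQE_IO_LINK serializes the data writes, so a real app would
 * likely submit them unlinked and queue the parity op once all of
 * them complete.
 */
#include <string.h>
#include <liburing.h>

#define RAID5_PARITY_OP	2	/* hypothetical op selector */

static void submit_stripe_write(struct io_uring *ring, int disk_fd[],
				int ndisks, void *chunk[],
				unsigned chunk_len, void *parity_buf,
				off_t off)
{
	struct io_uring_sqe *sqe;
	int i;

	/* N - 1 data writes, chained ahead of the parity op */
	for (i = 0; i < ndisks - 1; i++) {
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_write(sqe, disk_fd[i], chunk[i], chunk_len, off);
		sqe->flags |= IOSQE_IO_LINK;
	}

	/* one bpf op XORs the N - 1 buffers into parity_buf */
	sqe = io_uring_get_sqe(ring);
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_BPF;
	sqe->off = RAID5_PARITY_OP;
	sqe->flags |= IOSQE_IO_LINK;

	/* final write pushes the parity to disk N */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_write(sqe, disk_fd[ndisks - 1], parity_buf,
			    chunk_len, off);

	io_uring_submit(ring);
}
```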
>
> > > you introduced. Afterwards it can optionally queue up requests
> > > writing it to storage or anything else.
> >
> > Again, I do not want to move userspace logic into the bpf prog (kernel); what
> > IORING_OP_BPF provides is a way to define one operation that userspace
> > can then use just like in-kernel operations.
>
> Right, but that's rather limited. I want to cover all those
> use cases with one implementation instead of fragmenting users,
> if that can be achieved.
I don't know when your ambitious plan can land or whether it is doable.

I am going to write V2 with the IORING_OP_BPF approach, which is at least
workable for some cases and much easier to consume in userspace. It also
doesn't conflict with your approach.
Thanks,
Ming