From: Ming Lei <tom.leiming@gmail.com>
To: Bernd Schubert <bernd@niova.io>
Cc: Ming Lei <ming.lei@redhat.com>,
fuse-devel@lists.linux.dev, Joanne Koong <joannelkoong@gmail.com>,
io-uring <io-uring@vger.kernel.org>, Jens Axboe <axboe@kernel.dk>,
Pavel Begunkov <asml.silence@gmail.com>,
Miklos Szeredi <miklos@szeredi.hu>
Subject: Re: fuse/io-uring: Proposal to support pBuf in addition to kBuf
Date: Fri, 17 Apr 2026 22:35:38 +0800 [thread overview]
Message-ID: <aeJFOmvCF3ArL9iq@fedora> (raw)
In-Reply-To: <55db9a65-4408-42d2-8958-3bf3aa79d554@niova.io>
On Thu, Apr 16, 2026 at 09:13:41PM +0200, Bernd Schubert wrote:
>
>
> On 4/16/26 17:48, Ming Lei wrote:
> > On Thu, Apr 16, 2026 at 04:46:01PM +0200, Bernd Schubert wrote:
> >> Hi Ming,
> >>
> >> On 4/16/26 15:49, Ming Lei wrote:
> >>> Hi Bernd,
> >>>
> >>> On Tue, Apr 14, 2026 at 5:33 AM Bernd Schubert <bernd@niova.io> wrote:
> >>>>
> >>>> Hi Joanne, et al,
> >>>>
> >>>> this is a bit of a duplication of the discussion we had before, but I was
> >>>> badly distracted with other work and also with switching employers - I
> >>>> didn't manage to reply [1].
> >>>>
> >>>>
> >>>> I'm still not too happy about kBuf and its restriction of locked-only
> >>>> memory. Right now I'm reviewing your patches from the view of what needs
> >>>> to be done for ublk (for my current employer) and also for fuse to
> >>>> support different buffer sizes. Let's say fuse only supports kBuf and its
> >>>> restriction of pinned memory; I think we would then be forced to add support
> >>>> for different buffer sizes to both the current ring-entry-provides-the-buffer
> >>>> interface and the new kBuf interface - from my point of view, code duplication.
> >>>> If we allowed pBuf for fuse, we could put the current
> >>>> 'ring-entry-provides-the-buffer' interface into maintenance mode and
> >>>> support new features with the new interface only. I know you disagree on
> >>>> using pBuf [1], with the argument that userspace could free the buffer.
> >>>> Well, if it does, it is doing something totally wrong, and the same could
> >>>> happen today over /dev/fuse and also with the existing fuse-over-io-uring.
> >>>> It's just that the window is smaller, as the pages are extracted from the
> >>>> buffer during the copy.
> >>>>
> >>>> I was looking into what would be needed to support pBuf, and I think
> >>>> io-uring could extract pages from the pBuf when the buffer is obtained - it
> >>>> would limit the window in which userspace can do something wrong, in a
> >>>> similar way to how current fuse and ublk work.
> >>>>
> >>>> Suggested changes:
> >>>>
> >>>> io_uring:
> >>>>
> >>>> - io_pin_pages() gets a 'bool longterm' parameter.
> >>>> The new pBuf path would pass false, every other existing caller true.
> >>>>
> >>>> - io_ring_buf_pin_user() / io_ring_buf_unpin_user()
> >>>> - io_ring_buf_get_pages()/io_ring_buf_put_pages() -> fills the
> >>>> provided bvec
> >>>> - New struct io_ring_buf (in cmd.h)
> >>>>
> >>>> struct io_ring_buf {
> >>>>         size_t len;
> >>>>         unsigned int buf_id;
> >>>>         unsigned int nr_bvecs;
> >>>>
> >>>>         /* private */
> >>>>         u64 addr;
> >>>>         u8 is_pinned;
> >>>> };
> >>>>
> >>>>
> >>>> Fuse changes:
> >>>>
> >>>> - fuse_ring_ent (bufring union side): payload_kvec and ringbuf_buf_id
> >>>> replaced by io_ring_buf + pre-allocated bvec array.
> >>>> - Buffer selection under queue->lock removed. The lock only protects
> >>>> request dequeue and entry state transitions. Page access happens
> >>>> after the lock is dropped, in the context where the copy runs.
> >>>> - setup_fuse_copy_state bufring branch: is_kaddr/kaddr replaced by
> >>>> iov_iter_bvec() and would continue to use iov_iter_get_pages2()
> >>>>
> >>>> What do you think?
> >>>>
> >>>> And my current primary goal is to let ublk support multiple buffer
> >>>> sizes - ublk would also need to get support for kBuf/pBuf and I'm
> >>>
> >>> The ublk server is just a liburing application, and it supports all generic
> >>> io_uring buffer types, so kbuf/pbuf should be fine for your ublk server
> >>> in theory.
> >>>
> >>> It really depends on how your ublk server is implemented.
> >>>
> >>> Maybe you can share your motivation first before discussing kbuf/pbuf support.
> >>> If it is for DMA, there are other candidates too, such as hugepages or the
> >>> recently added UBLK_U_CMD_REG_BUF, ...
> >> Joanne had actually removed kBuf and switched to pBuf alone, and that
> >> simplifies things a bit.
> >>
> >> Motivation is to reduce memory usage. Let's say you need 4 IOs of 1MB to
> >> saturate streaming bandwidth, but still want to get smaller IOs through;
> >> for those smaller IOs you don't want to assign a 1MB buffer to each
> >> queue entry / tag.
> >
> > Thanks for sharing the motivation.
> >
> > Maybe you can pass UBLK_F_USER_COPY, so each IO buffer can be allocated
> > dynamically, entirely from userspace, and pre-allocation can be avoided.
>
> I had looked into it, but that is still another syscall / round trip; it will
> have the same performance issue as UBLK_F_NEED_GET_DATA, and probably worse,
> because compared to ring IO it is a syscall per IO.
Yeah, that seems true in your use case, where compression follows, so
pread/pwrite for the read/write IO buffer can't be linked into the io_uring
SQE pipeline.

However, I am not sure how you would use pBuf for this use case. One big issue
is that the buffer has to be provided to the ublk FETCH_REQ /
COMMIT_AND_FETCH_REQ command beforehand, for handling the incoming ublk IO
request, whose size can't be known at that time. I will study the pBuf patchset
later, but it also depends on how the ublk driver uses it, IMO.
Meanwhile, another (more flexible) way is to use bpf struct_ops for allocating
& freeing the IO buffer, following this basic idea:

- define struct_ops (alloc_io_buf, free_io_buf) for allocating & freeing the IO
  buffer, which is used for copying data between the request pages and this
  buffer
- ->alloc_io_buf() can be called from ublk_map_io(), and ->free_io_buf()
  can be called from ublk_unmap_io()
- the allocated buffer can be accessed directly from both the userspace ublk
  server and the bpf prog; bpf arena is one perfect match for this use case,
  and page pinning is avoided as well
- the two callbacks are not called when any of the following is set for this
  IO: UBLK_F_SUPPORT_ZERO_COPY, UBLK_F_USER_COPY, UBLK_F_AUTO_BUF_REG or
  UBLK_IO_F_SHMEM_ZC
- the motivation is to avoid big pre-allocation, so the ublk server can use a
  dynamic per-queue heap for allocating IO buffers in a space-efficient way
- with this feature, userspace needn't pre-allocate IO buffers at the max
  buffer size; a typical implementation would provide one bpf arena heap for
  the bpf prog to alloc & free buffers from, and it can still fall back to the
  usercopy code path in case of allocation failure from the bpf prog
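For illustration only, the per-IO callback idea can be sketched in plain
userspace C. The ops-table name, the signatures, and the fallback path below
are my assumptions for demonstration, not an existing ublk or bpf kernel
interface:

```c
/* Userspace sketch of the callback idea above: a per-IO buffer is
 * allocated at its actual size via an ops table, with a static max-size
 * buffer standing in for the usercopy fallback path. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct io_buf_ops {
	/* would correspond to ->alloc_io_buf(), called from ublk_map_io() */
	void *(*alloc_io_buf)(size_t len);
	/* would correspond to ->free_io_buf(), called from ublk_unmap_io() */
	void (*free_io_buf)(void *buf);
};

/* dynamic per-IO allocation sized to the actual request, instead of a
 * pre-allocated max-size buffer per queue entry / tag */
static void *heap_alloc_io_buf(size_t len) { return malloc(len); }
static void heap_free_io_buf(void *buf) { free(buf); }

static const struct io_buf_ops ops = {
	.alloc_io_buf = heap_alloc_io_buf,
	.free_io_buf = heap_free_io_buf,
};

/* static max-size buffer standing in for the fallback code path */
static char fallback_buf[1024 * 1024];

/* map_io(): get a buffer for one IO; fall back if allocation fails */
static void *map_io(size_t len, int *used_fallback)
{
	void *buf = ops.alloc_io_buf ? ops.alloc_io_buf(len) : NULL;

	*used_fallback = (buf == NULL);
	return buf ? buf : fallback_buf;
}

/* unmap_io(): release the buffer unless it was the fallback */
static void unmap_io(void *buf, int used_fallback)
{
	if (!used_fallback)
		ops.free_io_buf(buf);
}
```

The point of the sketch is the space saving: a 4KB IO only costs 4KB here,
while the max-size buffer exists once as a fallback rather than once per tag.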
You may compare the two approaches for your use case.
>
> >
> >> Zero copy is currently still out of the question for us, although I will
> >> look into your recent work on eBPF integration and whether erasure
> >> coding, compression and checksums could be done with that (I guess
> >> checksums are the easy part).
> >
> > Got it. Compression could be the hardest one; however, the recently added
> > bpf-iterator-based buffer interface may simplify everything. I'd suggest you
> > look at it, and provide some feedback if possible.
> >
> > Also, if your client application uses direct IO, the recently added
> > UBLK_F_SHMEM_ZC could simplify the implementation a lot, with zero copy and
> > a user-mapped address at the same time.
>
> Oh I see, that was just merged. Nice, thank you! I don't think our users will
> be DIO only, but it's nice to have that ZC option!
It can be thought of as a speedup or optimization for the DIO use case.
Thanks,
Ming
Thread overview: 10+ messages
2026-04-13 21:33 fuse/io-uring: Proposal to support pBuf in addition to kBuf Bernd Schubert
2026-04-14 0:56 ` Joanne Koong
2026-04-14 17:34 ` Bernd Schubert
2026-04-15 0:19 ` Joanne Koong
2026-04-16 13:49 ` Ming Lei
2026-04-16 14:46 ` Bernd Schubert
2026-04-16 15:48 ` Ming Lei
2026-04-16 19:13 ` Bernd Schubert
2026-04-17 14:35 ` Ming Lei [this message]
2026-04-17 21:02 ` Joanne Koong