From: Ming Lei <tom.leiming@gmail.com>
To: Bernd Schubert <bernd@niova.io>
Cc: Ming Lei <ming.lei@redhat.com>,
fuse-devel@lists.linux.dev, Joanne Koong <joannelkoong@gmail.com>,
io-uring <io-uring@vger.kernel.org>, Jens Axboe <axboe@kernel.dk>,
Pavel Begunkov <asml.silence@gmail.com>,
Miklos Szeredi <miklos@szeredi.hu>
Subject: Re: fuse/io-uring: Proposal to support pBuf in additon to kBuf
Date: Thu, 16 Apr 2026 23:48:48 +0800 [thread overview]
Message-ID: <aeEE4FVGdi5RqKs_@fedora> (raw)
In-Reply-To: <fcad39e2-37b5-46a9-a280-2315e0397985@niova.io>
On Thu, Apr 16, 2026 at 04:46:01PM +0200, Bernd Schubert wrote:
> Hi Ming,
>
> On 4/16/26 15:49, Ming Lei wrote:
> > Hi Bernd,
> >
> > On Tue, Apr 14, 2026 at 5:33 AM Bernd Schubert <bernd@niova.io> wrote:
> >>
> >> Hi Joanne, et al,
> >>
> >> this is a bit of a duplication of the discussion we had before, but I was
> >> badly distracted with other work and also switching employers - I didn't
> >> manage to reply [1].
> >>
> >>
> >> I'm still not too happy about kBuf and its restriction of locked-only
> >> memory. Right now I'm reviewing your patches from the view of what needs
> >> to be done for ublk (for my current employer) and also for fuse to
> >> support different buffer sizes. If fuse only supports kBuf and its
> >> restriction of pinned memory, I think we would be forced to add support
> >> for different buffer sizes to both the current ring-entry-provides-the-buffer
> >> interface and the new kBuf interface - code duplication, from my point of view.
> >> If we allowed pBuf for fuse, we could put the current
> >> 'ring-entry-provides-the-buffer' interface into maintenance mode and
> >> support new features only with the new interface. I know you disagree with
> >> using pBuf [1], with the argument that userspace could free the buffer.
> >> Well, if it does, it does something totally wrong, and the same could
> >> happen today over /dev/fuse and also with the existing fuse-over-io-uring.
> >> It's just that the window is smaller, as the pages are extracted from the
> >> buffer during the copy.
> >>
> >> I was looking into what would be needed to support pBuf and I think
> >> io-uring could extract pages from pBuf when the buffer is obtained - that
> >> would limit the window in which userspace can do something wrong, in a
> >> similar way to how current fuse and ublk work.
> >>
> >> Suggested changes:
> >>
> >> io_uring:
> >>
> >> - io_pin_pages() gets a 'bool longterm' parameter.
> >> The new pBuf path would pass false, every other existing caller true.
> >>
> >> - io_ring_buf_pin_user() / io_ring_buf_unpin_user()
> >> - io_ring_buf_get_pages()/io_ring_buf_put_pages() -> fills the
> >> provided bvec
> >> - New struct io_ring_buf (in cmd.h)
> >>
> >> struct io_ring_buf {
> >> size_t len;
> >> unsigned int buf_id;
> >> unsigned int nr_bvecs;
> >>
> >> /* private */
> >> u64 addr;
> >> u8 is_pinned;
> >> };
> >>
> >>
> >> Fuse changes:
> >>
> >> - fuse_ring_ent (bufring union side): payload_kvec and ringbuf_buf_id
> >> replaced by io_ring_buf + pre-allocated bvec array.
> >> - Buffer selection under queue->lock removed. The lock only protects
> >> request dequeue and entry state transitions. Page access happens
> >> after the lock is dropped, in the context where the copy runs.
> >> - setup_fuse_copy_state bufring branch: is_kaddr/kaddr replaced by
> >> iov_iter_bvec() and would continue to use iov_iter_get_pages2()
> >>
> >> What do you think?
> >>
> >> And my current primary goal is to let ublk to support multiple buffer
> >> sizes - ublk would also need to get support for kBuf/pBuf and I'm
> >
> > The ublk server is just a liburing application, and it supports all generic
> > io_uring buffer types, so kBuf/pBuf should be fine for your ublk server
> > in theory.
> >
> > It really depends on how your ublk server is implemented.
> >
> > Maybe you can share your motivation first before discussing kbuf/pbuf support.
> > If it is for DMA, there are other candidates too, such as hugepages and the
> > recently added UBLK_U_CMD_REG_BUF, ...
> Joanne had actually removed kBuf and switched to pBuf alone, and that
> simplifies things a bit.
>
> Motivation is to reduce memory usage. Let's say you need 4 IOs of 1MB to
> saturate streaming bandwidth but still want to get smaller IOs through;
> for those smaller IOs you don't want to assign a 1MB buffer to each
> queue entry / tag.
Thanks for sharing the motivation.
Maybe you can pass UBLK_F_USER_COPY; then each IO buffer can be allocated
dynamically from userspace, and pre-allocation can be avoided entirely.
> Zero copy is currently still out of the question for us, although I will
> look into your recent work on eBPF integration and whether erasure
> coding, compression and checksums could be done with that (I guess
> checksums are the easy part).
Got it. Compression could be the hardest one; however, the recently added
BPF-iterator-based buffer interface may simplify everything. I'd suggest you
look at it and provide some feedback if possible.
Also, if your client application uses direct IO, the recently added
UBLK_F_SHMEM_ZC could simplify the implementation a lot, while still
providing zero copy and a user-mapped address.
>
> Ublk already has UBLK_F_NEED_GET_DATA, but that has two issues
> - needs another round trip (testing on my laptop shows a perf loss of 10
> to 15% per queue)
> - It does not release the application buffer on read. I have an idea how
> to fix that, but here at Niova we would like to take the dynamic memory
> approach with pBufs to avoid additional round-trip overhead.
>
> Idea with pBufs: Several pBufs registered per queue at registration
> time. Every pBuf represents a different IO size. Optionally as with
> Joanne's patches [1] the buffers can get pinned to avoid mapping to pages
> for every access.
I feel a plain fixed buffer might work too, but I may not have gotten the
whole idea yet; it looks like I need to dig into pBuf first.
> I'm currently working on a patch series; with some luck I will send an RFC
> tomorrow. The harder part compared to fuse is that ublk_drv does not
> have its own queues/lists so far. This is my first work on the block layer -
> I'm not sure if internal struct request queuing is allowed at all.
> Testing will show in a bit :)
Great, glad to take a look after your RFC is out.
Thanks,
Ming
Thread overview: 10+ messages
2026-04-13 21:33 fuse/io-uring: Proposal to support pBuf in additon to kBuf Bernd Schubert
2026-04-14 0:56 ` Joanne Koong
2026-04-14 17:34 ` Bernd Schubert
2026-04-15 0:19 ` Joanne Koong
2026-04-16 13:49 ` Ming Lei
2026-04-16 14:46 ` Bernd Schubert
2026-04-16 15:48 ` Ming Lei [this message]
2026-04-16 19:13 ` Bernd Schubert
2026-04-17 14:35 ` Ming Lei
2026-04-17 21:02 ` Joanne Koong