public inbox for io-uring@vger.kernel.org
* fuse/io-uring: Proposal to support pBuf in addition to kBuf
@ 2026-04-13 21:33 Bernd Schubert
  2026-04-14  0:56 ` Joanne Koong
  2026-04-16 13:49 ` Ming Lei
  0 siblings, 2 replies; 10+ messages in thread
From: Bernd Schubert @ 2026-04-13 21:33 UTC (permalink / raw)
  To: fuse-devel
  Cc: Joanne Koong, io-uring, Jens Axboe, Pavel Begunkov, Ming Lei,
	Miklos Szeredi

Hi Joanne, et al,

this is a bit of a duplication of the discussion we had before, but I
was badly distracted with other work and with switching employers, so I
didn't manage to reply [1].


I'm still not too happy about kBuf and its restriction to locked-only
memory. Right now I'm reviewing your patches from the view of what needs
to be done for ublk (for my current employer) and also for fuse to
support different buffer sizes. If fuse only supports kBuf and its
restriction to pinned memory, I think we would be forced to add support
for different buffer sizes both to the current
ring-entry-provides-the-buffer interface and to the new kBuf interface -
from my point of view, code duplication.
If we allowed pBuf for fuse, we could put the current
'ring-entry-provides-the-buffer' interface into maintenance mode and
support new features with the new interface only. I know you disagree
with using pBuf [1], with the argument that userspace could free the
buffer. Well, if it does, it is doing something totally wrong, and the
same could happen today over /dev/fuse and also with the existing
fuse-over-io-uring. It is just that the window is smaller, as the pages
are extracted from the buffer during the copy.

I was looking into what would be needed to support pBuf, and I think
io-uring could extract pages from the pBuf when the buffer is obtained -
that would limit the window in which userspace can do something wrong,
in a way similar to how current fuse and ublk work.

Suggested changes:

io_uring:

  - io_pin_pages() gets a 'bool longterm' parameter. The new pBuf path
    would pass false, every other existing caller true.

  - io_ring_buf_pin_user() / io_ring_buf_unpin_user()
  - io_ring_buf_get_pages() / io_ring_buf_put_pages() -> fill the
    provided bvec array
  - New struct io_ring_buf (in cmd.h)

struct io_ring_buf {
       size_t                  len;
       unsigned int            buf_id;
       unsigned int            nr_bvecs;

       /* private */
       u64                     addr;
       u8                      is_pinned;
};


Fuse changes:

  - fuse_ring_ent (bufring union side): payload_kvec and ringbuf_buf_id
    replaced by io_ring_buf + pre-allocated bvec array.
  - Buffer selection under queue->lock removed.  The lock only protects
    request dequeue and entry state transitions.  Page access happens
    after the lock is dropped, in the context where the copy runs.
  - setup_fuse_copy_state bufring branch: is_kaddr/kaddr replaced by
    iov_iter_bvec(); it would continue to use iov_iter_get_pages2()

What do you think?

And my current primary goal is to let ublk support multiple buffer
sizes - ublk would also need to get support for kBuf/pBuf, and I'm
currently assuming that fuse and ublk rings should just get multiple
kBufs/pBufs and a config option that maps bufs to IO size. I'm still
looking into the details of that.


Thanks,
Bernd


[1]
https://lore.kernel.org/r/CAJnrk1armV9VzBqrrdfr15K5ySBx2YJRk_P0okGnkzyMx_eDOw@mail.gmail.com



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: fuse/io-uring: Proposal to support pBuf in addition to kBuf
  2026-04-13 21:33 fuse/io-uring: Proposal to support pBuf in addition to kBuf Bernd Schubert
@ 2026-04-14  0:56 ` Joanne Koong
  2026-04-14 17:34   ` Bernd Schubert
  2026-04-16 13:49 ` Ming Lei
  1 sibling, 1 reply; 10+ messages in thread
From: Joanne Koong @ 2026-04-14  0:56 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: fuse-devel, io-uring, Jens Axboe, Pavel Begunkov, Ming Lei,
	Miklos Szeredi

On Mon, Apr 13, 2026 at 2:33 PM Bernd Schubert <bernd@niova.io> wrote:
>
> [...]

Hi Bernd,

Thanks for your email. There were some changes made from v1 -> v2, so
please see the v2 "fuse: add io-uring buffer rings and zero-copy"
patchset [1], as I think it will hopefully address your concerns about
mlock. In short, what changed from v1 -> v2 is that I dropped the
approach where kernel-managed buffers are io-uring native
infrastructure. When trying to implement integration between the
io-uring networking layer and kmbuf rings, I realized that kmbufs
didn't tie in as nicely as I'd thought with io-uring native requests,
and fuse has too many constraints for the kmbuf ring (locking
semantics, request lifecycle, etc.) that made the io-uring side less
clean. This made me realize this logic would be better off not being
part of the io-uring infrastructure and instead self-contained in fuse,
as Pavel had suggested.

In v2, the fuse headers and payload buffers are passed as user
allocations at registration time through the sqe iovs, and the server
side has control over whether to pin the headers, the payload buffers,
both, or neither - e.g. bufrings can be used without pinning (no mlock
requirement), and pinning is an opt-in optimization. Zero-copy requires
pinning both headers and payload buffers, but zero copy requires
CAP_SYS_ADMIN privileges anyway. In this design, the buffers are only
recyclable by the kernel (unlike pbufs). Unlike pbufs, where the API
contract is that any buffers not explicitly put into the ring by
userspace are under the full control of userspace and not touched by
the kernel, this design continues the existing fuse-uring contract that
any buffers passed in through sqe iovs during registration will be
copied to/from the kernel as long as the fuse connection is alive. In
the future, if the buffers need to be kernel-allocated for DMA
contiguity or other reasons, that could be added separately if/when it
becomes necessary.

Does this address your concerns?

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/20260402162840.2989717-1-joannelkoong@gmail.com/T/#mb8f96895aa2773424005ee06bb62ae980e95e604



* Re: fuse/io-uring: Proposal to support pBuf in addition to kBuf
  2026-04-14  0:56 ` Joanne Koong
@ 2026-04-14 17:34   ` Bernd Schubert
  2026-04-15  0:19     ` Joanne Koong
  0 siblings, 1 reply; 10+ messages in thread
From: Bernd Schubert @ 2026-04-14 17:34 UTC (permalink / raw)
  To: Joanne Koong
  Cc: fuse-devel, io-uring, Jens Axboe, Pavel Begunkov, Ming Lei,
	Miklos Szeredi



On 4/14/26 02:56, Joanne Koong wrote:
> On Mon, Apr 13, 2026 at 2:33 PM Bernd Schubert <bernd@niova.io> wrote:
>>
>> [...]
> 
> Hi Bernd,
> 
> [...]
> 
> Does this address your concerns?

Yes, it absolutely does. I had actually thought that Jens had already
accepted the kBuf changes. I had seen the discussion with Pavel, but
then saw (at least I think) that Jens had accepted it. And with the
different versions, I didn't notice that v2 (v5 in my counting, I
think) doesn't use kBuf anymore.

If I understand it right, the io-uring bvec changes are only needed for
zero-copy?

Thanks,
Bernd


* Re: fuse/io-uring: Proposal to support pBuf in addition to kBuf
  2026-04-14 17:34   ` Bernd Schubert
@ 2026-04-15  0:19     ` Joanne Koong
  0 siblings, 0 replies; 10+ messages in thread
From: Joanne Koong @ 2026-04-15  0:19 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: fuse-devel, io-uring, Jens Axboe, Pavel Begunkov, Ming Lei,
	Miklos Szeredi

On Tue, Apr 14, 2026 at 10:34 AM Bernd Schubert <bernd@niova.io> wrote:
>
>
> On 4/14/26 02:56, Joanne Koong wrote:
> > On Mon, Apr 13, 2026 at 2:33 PM Bernd Schubert <bernd@niova.io> wrote:
> >>
> >> [...]
> >
> > Hi Bernd,
> >
> > [...]
> >
> > Does this address your concerns?
>

Hi Bernd,

> Yes, it absolutely does. I had actually thought that Jens had already
> accepted the kBuf changes. I had seen the discussion with Pavel, but

Yes, it had been briefly merged into Jens's io_uring/for-7.1 staging
tree and then dropped.

> then saw (at least I think) that Jens had accepted it. And with the
> different versions, I didn't notice that v2 (v5 in my counting, I
> think) doesn't use kBuf anymore.

No worries, thanks for taking a look at it now.

>
> If I understand it right, the io-uring bvec changes are only needed for
> zero-copy?

Yes, that is correct.

Thanks,
Joanne
>
> Thanks,
> Bernd


* Re: fuse/io-uring: Proposal to support pBuf in addition to kBuf
  2026-04-13 21:33 fuse/io-uring: Proposal to support pBuf in addition to kBuf Bernd Schubert
  2026-04-14  0:56 ` Joanne Koong
@ 2026-04-16 13:49 ` Ming Lei
  2026-04-16 14:46   ` Bernd Schubert
  1 sibling, 1 reply; 10+ messages in thread
From: Ming Lei @ 2026-04-16 13:49 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: fuse-devel, Joanne Koong, io-uring, Jens Axboe, Pavel Begunkov,
	Miklos Szeredi, Lei, Ming

Hi Bernd,

On Tue, Apr 14, 2026 at 5:33 AM Bernd Schubert <bernd@niova.io> wrote:
>
> [...]
>
> And my current primary goal is to let ublk support multiple buffer
> sizes - ublk would also need to get support for kBuf/pBuf and I'm

The ublk server is just a liburing application, and it supports all
generic io_uring buffer types, so kbuf/pbuf should be fine for your
ublk server in theory.

It really depends on how your ublk server is implemented.

Maybe you can share your motivation first before discussing kbuf/pbuf
support. If it is for DMA, there are other candidates too, such as
hugepages or the recently added UBLK_U_CMD_REG_BUF, ...


Thanks,
Ming



* Re: fuse/io-uring: Proposal to support pBuf in addition to kBuf
  2026-04-16 13:49 ` Ming Lei
@ 2026-04-16 14:46   ` Bernd Schubert
  2026-04-16 15:48     ` Ming Lei
  2026-04-17 21:02     ` Joanne Koong
  0 siblings, 2 replies; 10+ messages in thread
From: Bernd Schubert @ 2026-04-16 14:46 UTC (permalink / raw)
  To: Ming Lei
  Cc: fuse-devel, Joanne Koong, io-uring, Jens Axboe, Pavel Begunkov,
	Miklos Szeredi, Lei, Ming

Hi Ming,

On 4/16/26 15:49, Ming Lei wrote:
> Hi Bernd,
> 
> On Tue, Apr 14, 2026 at 5:33 AM Bernd Schubert <bernd@niova.io> wrote:
>>
>> [...]
>>
>> And my current primary goal is to let ublk support multiple buffer
>> sizes - ublk would also need to get support for kBuf/pBuf and I'm
> 
> The ublk server is just a liburing application, and it supports all
> generic io_uring buffer types, so kbuf/pbuf should be fine for your
> ublk server in theory.
> 
> It really depends on how your ublk server is implemented.
> 
> Maybe you can share your motivation first before discussing kbuf/pbuf
> support. If it is for DMA, there are other candidates too, such as
> hugepages or the recently added UBLK_U_CMD_REG_BUF, ...

Joanne had actually removed kBuf and switched to pBuf alone, and that
simplifies things a bit.

The motivation is to reduce memory usage. Let's say you need 4 IOs of
1MB to saturate streaming bandwidth, but still want to get smaller IOs
through - for these smaller IOs you don't want to assign the 1MB buffer
to each queue entry / tag.
Zero copy is currently still out of the question for us, although I
will look into your recent work on eBPF integration and whether erasure
coding, compression and checksums could be done with that (I guess
checksums are the easy part).

Ublk already has UBLK_F_NEED_GET_DATA, but that has two issues:
- it needs another round trip (testing on my laptop shows a perf loss
  of 10 to 15% per queue)
- it does not release the application buffer on read. I have an idea
  how to fix that, but here at Niova we would like to go the dynamic
  memory approach with pBufs to avoid the additional round-trip
  overhead.

The idea with pBufs: several pBufs are registered per queue at
registration time, and every pBuf represents a different IO size.
Optionally, as with Joanne's patches [1], the buffers can get pinned to
avoid mapping to pages on every access.
I'm currently working on a patch series and with some luck will send an
RFC tomorrow. The harder part compared to fuse is that ublk_drv does
not have its own queues/lists so far. This is my first work on the
block layer - I'm not sure if internal struct request queuing is
allowed at all. Testing will show in a bit :)


Thanks,
Bernd


[1]
https://lore.kernel.org/linux-fsdevel/20260402162840.2989717-1-joannelkoong@gmail.com/T/#mb8f96895aa2773424005ee06bb62ae980e95e604




* Re: fuse/io-uring: Proposal to support pBuf in addition to kBuf
  2026-04-16 14:46   ` Bernd Schubert
@ 2026-04-16 15:48     ` Ming Lei
  2026-04-16 19:13       ` Bernd Schubert
  2026-04-17 21:02     ` Joanne Koong
  1 sibling, 1 reply; 10+ messages in thread
From: Ming Lei @ 2026-04-16 15:48 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Ming Lei, fuse-devel, Joanne Koong, io-uring, Jens Axboe,
	Pavel Begunkov, Miklos Szeredi

On Thu, Apr 16, 2026 at 04:46:01PM +0200, Bernd Schubert wrote:
> Hi Ming,
> 
> On 4/16/26 15:49, Ming Lei wrote:
> > Hi Bernd,
> > 
> > On Tue, Apr 14, 2026 at 5:33 AM Bernd Schubert <bernd@niova.io> wrote:
> >>
> >> Hi Joanne, et al,
> >>
> >> this is a bit of duplication of the discussion we had before, but I was
> >> badly distracted with other work and also switching employer - didn't
> >> manage to reply [1].
> >>
> >>
> >> I'm still not too happy about kBuf and its restriction of locked-only
> >> memory. Right now I'm reviewing your patches from the view of what needs
> >> to be done for ublk (for my current employer) and also for fuse to
> >> support different buffer sizes. Let's say fuse only support kBuf and its
> >> restriction of pinned memory, I think we would be forced to add support
> >> for different buffer sizes to the current ring-entry-provides-the-buffer
> >> and the new kBuf interface - from my point of view code dup.
> >> If we would allow pBuf for fuse, we could put the current
> >> 'ring-entry-provides-the-buffer' interface into maintenance mode and
> >> support new features with the new interface only. I know you disagree on
> >> using pBuf [1] with the argument that userspace could free the buffer.
> >> Well, if it does, it does something totally wrong and the same could
> >> happen today over /dev/fuse and also the existing fuse-over-io-uring.
> >> Just the window is smaller, as the pages are extracted from the buffer
> >> during the copy.
> >>
> >> I was looking into what would be needed to support pBuf and I think
> >> io-uring could extract pages from pBuf when the buffer is obtained - it
> >> would limit the window when userspace can do something wrong in a
> >> similar way current fuse and ublk works.
> >>
> >> Suggested changes:
> >>
> >> io_uring:
> >>
> >>   - io_pin_pages() gets a 'bool longterm' parameter.
> >> The new pBuf path would pass false, every other exsting caller true.
> >>
> >>   - io_ring_buf_pin_user() / io_ring_buf_unpin_user()
> >>   - io_ring_buf_get_pages()/io_ring_buf_put_pages() -> fills the
> >> provided bvec
> >>   - New struct io_ring_buf (in cmd.h)
> >>
> >> struct io_ring_buf {
> >>        size_t                  len;
> >>        unsigned int            buf_id;
> >>        unsigned int            nr_bvecs;
> >>
> >>        /* private */
> >>        u64                     addr;
> >>        u8                      is_pinned;
> >> };
> >>
> >>
> >> Fuse changes:
> >>
> >>   - fuse_ring_ent (bufring union side): payload_kvec and ringbuf_buf_id
> >>     replaced by io_ring_buf + pre-allocated bvec array.
> >>   - Buffer selection under queue->lock removed.  The lock only protects
> >>     request dequeue and entry state transitions.  Page access happens
> >>     after the lock is dropped, in the context where the copy runs.
> >>   - setup_fuse_copy_state bufring branch: is_kaddr/kaddr replaced by
> >>     iov_iter_bvec() and would continue to use iov_iter_get_pages2()
> >>
> >> What do you think?
> >>
> >> And my current primary goal is to let ublk to support multiple buffer
> >> sizes - ublk would also need to get support for kBuf/pBuf and I'm
> > 
> > Ublk server is just one liburing application, and it supports all generic
> > io_uring buffer types, so kbuf/pbuf should be fine for your ublk server
> > in theory.
> > 
> > It really depends on how your ublk server is implemented.
> > 
> > Maybe you can share your motivation first before discussing kbuf/pbuf support.
> > If it is for DMA,  there are other candidates too, such as hugepage,
> > recent added
> > UBLK_U_CMD_REG_BUF, ...
> Joanne had actually removed kBuf and switched to pBuf alone and that
> simiplifies things a bit.
> 
> Motivation is to reduce memory usage. Let's say you need 4 IOs of 1MB to
> saturate streaming bandwidth, but still want to get smaller IOs through,
> for these smaller IOs you don't want to assign the 1MB buffer for each
> queue entry / tag.

Thanks for sharing the motivation.

Maybe you can pass UBLK_F_USER_COPY; then each IO buffer can be allocated
dynamically from userspace, and pre-allocation can be avoided.
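For reference, with UBLK_F_USER_COPY the data copy is driven by the
server via pread()/pwrite() on the ublk char device at an offset that
encodes the queue id and tag of the IO. A sketch of that offset
computation follows; the constants are copied from my reading of
include/uapi/linux/ublk_cmd.h, so please verify them against your
kernel headers before relying on them.

```c
#include <stdint.h>

/* Offset encoding constants as in include/uapi/linux/ublk_cmd.h
 * (reproduced here only to make the sketch self-contained). */
#define UBLKSRV_IO_BUF_OFFSET	0x80000000ULL
#define UBLK_IO_BUF_BITS	25
#define UBLK_TAG_BITS		16
#define UBLK_TAG_OFF		UBLK_IO_BUF_BITS
#define UBLK_QID_OFF		(UBLK_TAG_OFF + UBLK_TAG_BITS)

/*
 * Compute the pread()/pwrite() offset on /dev/ublkcN that addresses
 * the data of the IO identified by (q_id, tag). The buffer itself
 * can be allocated on demand, so no per-tag pre-allocation is needed.
 */
static uint64_t ublk_user_copy_off(uint16_t q_id, uint16_t tag)
{
	return UBLKSRV_IO_BUF_OFFSET +
	       ((uint64_t)q_id << UBLK_QID_OFF) +
	       ((uint64_t)tag << UBLK_TAG_OFF);
}
```

A read would then be served roughly as
pread(cdev_fd, buf, len, ublk_user_copy_off(q, tag)) on a buffer
allocated on demand; the cost is the extra syscall per IO.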

> Zero copy is currently still out of question for us, although I will
> look into your recent work for integration of eBPF and if erasure
> coding, compression and checksums could be done with that (I guess
> checksums is the easy part).

Got it, compression could be the hardest one; however, the recently added
bpf iterator based buffer interface may simplify everything. I'd suggest
you look at it and provide some feedback if possible.

Also, if your client application uses direct IO, the recently added
UBLK_F_SHMEM_ZC could simplify the implementation a lot, while also
providing zero copy and a user-mapped address.

> 
> Ublk already has UBLK_F_NEED_GET_DATA, but that has two issues
> - needs another round trip (testing on my laptop shows a perf loss of 10
> to 15% per queue)
> - It does not release the application buffer on read. I have an idea how
> to fix that, but here at Niova we would like to go the dynamic memory
> appraoch with pBufs to avoid additional round trip overhead.
> 
> Idea with pBufs: Several pBufs registered per queue at registration
> time. Every pBuf represents a different IO size. Optionally as with
> Joannes patches [1] the buffers can get pinned to avoid mapping to pages
> for every access.

I feel the plain fixed buffer might work too, but I may not get the whole
idea yet; it looks like I need to dig into pBuf first.

> I'm currently working on a patch series with some luck will sent an RFC
> tomorrow. The harder part compared to fuse is that ublk_drv does not
> have its own queues/lists so far. This is my first work on block layer -
> I'm not sure if internal struct request queuing is allowed at all.
> Testing will show in a bit :)

Great, glad to take a look after your RFC is out.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: fuse/io-uring: Proposal to support pBuf in additon to kBuf
  2026-04-16 15:48     ` Ming Lei
@ 2026-04-16 19:13       ` Bernd Schubert
  2026-04-17 14:35         ` Ming Lei
  0 siblings, 1 reply; 10+ messages in thread
From: Bernd Schubert @ 2026-04-16 19:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, fuse-devel, Joanne Koong, io-uring, Jens Axboe,
	Pavel Begunkov, Miklos Szeredi



On 4/16/26 17:48, Ming Lei wrote:
> On Thu, Apr 16, 2026 at 04:46:01PM +0200, Bernd Schubert wrote:
>> Hi Ming,
>>
>> On 4/16/26 15:49, Ming Lei wrote:
>>> Hi Bernd,
>>>
>>> On Tue, Apr 14, 2026 at 5:33 AM Bernd Schubert <bernd@niova.io> wrote:
>>>>
>>>> Hi Joanne, et al,
>>>>
>>>> this is a bit of duplication of the discussion we had before, but I was
>>>> badly distracted with other work and also switching employer - didn't
>>>> manage to reply [1].
>>>>
>>>>
>>>> I'm still not too happy about kBuf and its restriction of locked-only
>>>> memory. Right now I'm reviewing your patches from the view of what needs
>>>> to be done for ublk (for my current employer) and also for fuse to
>>>> support different buffer sizes. Let's say fuse only support kBuf and its
>>>> restriction of pinned memory, I think we would be forced to add support
>>>> for different buffer sizes to the current ring-entry-provides-the-buffer
>>>> and the new kBuf interface - from my point of view code dup.
>>>> If we would allow pBuf for fuse, we could put the current
>>>> 'ring-entry-provides-the-buffer' interface into maintenance mode and
>>>> support new features with the new interface only. I know you disagree on
>>>> using pBuf [1] with the argument that userspace could free the buffer.
>>>> Well, if it does, it does something totally wrong and the same could
>>>> happen today over /dev/fuse and also the existing fuse-over-io-uring.
>>>> Just the window is smaller, as the pages are extracted from the buffer
>>>> during the copy.
>>>>
>>>> I was looking into what would be needed to support pBuf and I think
>>>> io-uring could extract pages from pBuf when the buffer is obtained - it
>>>> would limit the window when userspace can do something wrong in a
>>>> similar way current fuse and ublk works.
>>>>
>>>> Suggested changes:
>>>>
>>>> io_uring:
>>>>
>>>>   - io_pin_pages() gets a 'bool longterm' parameter.
>>>> The new pBuf path would pass false, every other exsting caller true.
>>>>
>>>>   - io_ring_buf_pin_user() / io_ring_buf_unpin_user()
>>>>   - io_ring_buf_get_pages()/io_ring_buf_put_pages() -> fills the
>>>> provided bvec
>>>>   - New struct io_ring_buf (in cmd.h)
>>>>
>>>> struct io_ring_buf {
>>>>        size_t                  len;
>>>>        unsigned int            buf_id;
>>>>        unsigned int            nr_bvecs;
>>>>
>>>>        /* private */
>>>>        u64                     addr;
>>>>        u8                      is_pinned;
>>>> };
>>>>
>>>>
>>>> Fuse changes:
>>>>
>>>>   - fuse_ring_ent (bufring union side): payload_kvec and ringbuf_buf_id
>>>>     replaced by io_ring_buf + pre-allocated bvec array.
>>>>   - Buffer selection under queue->lock removed.  The lock only protects
>>>>     request dequeue and entry state transitions.  Page access happens
>>>>     after the lock is dropped, in the context where the copy runs.
>>>>   - setup_fuse_copy_state bufring branch: is_kaddr/kaddr replaced by
>>>>     iov_iter_bvec() and would continue to use iov_iter_get_pages2()
>>>>
>>>> What do you think?
>>>>
>>>> And my current primary goal is to let ublk to support multiple buffer
>>>> sizes - ublk would also need to get support for kBuf/pBuf and I'm
>>>
>>> Ublk server is just one liburing application, and it supports all generic
>>> io_uring buffer types, so kbuf/pbuf should be fine for your ublk server
>>> in theory.
>>>
>>> It really depends on how your ublk server is implemented.
>>>
>>> Maybe you can share your motivation first before discussing kbuf/pbuf support.
>>> If it is for DMA,  there are other candidates too, such as hugepage,
>>> recent added
>>> UBLK_U_CMD_REG_BUF, ...
>> Joanne had actually removed kBuf and switched to pBuf alone and that
>> simiplifies things a bit.
>>
>> Motivation is to reduce memory usage. Let's say you need 4 IOs of 1MB to
>> saturate streaming bandwidth, but still want to get smaller IOs through,
>> for these smaller IOs you don't want to assign the 1MB buffer for each
>> queue entry / tag.
> 
> Thanks for sharing the motivation.
> 
> Maybe you can pass UBLK_F_USER_COPY, and each IO buffer can be allocated
> dynamically completely from userspace, then pre-allocation can be avoided.

I had looked into it, but that is still another syscall / round trip; it
will have the same performance issue as UBLK_F_NEED_GET_DATA, and
probably worse, because compared to ring IO that is a syscall per IO.

> 
>> Zero copy is currently still out of question for us, although I will
>> look into your recent work for integration of eBPF and if erasure
>> coding, compression and checksums could be done with that (I guess
>> checksums is the easy part).
> 
> Got it, compression could be the hardest one, however, the recent added bpf
> iterator based buffer interface may simplify everything. I'd suggest you to look
> at it, and provide some feedback if possible.
> 
> Also if your client application uses direct IO, recent added UBLK_F_SHMEM_ZC
> could simplify implementation a lot, meantime with zero copy & user-mapped
> address.

Oh I see, that was just merged. Nice, thank you! I don't expect our users
will be DIO only, but it is nice to have that ZC option!


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: fuse/io-uring: Proposal to support pBuf in additon to kBuf
  2026-04-16 19:13       ` Bernd Schubert
@ 2026-04-17 14:35         ` Ming Lei
  0 siblings, 0 replies; 10+ messages in thread
From: Ming Lei @ 2026-04-17 14:35 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Ming Lei, fuse-devel, Joanne Koong, io-uring, Jens Axboe,
	Pavel Begunkov, Miklos Szeredi

On Thu, Apr 16, 2026 at 09:13:41PM +0200, Bernd Schubert wrote:
> 
> 
> On 4/16/26 17:48, Ming Lei wrote:
> > On Thu, Apr 16, 2026 at 04:46:01PM +0200, Bernd Schubert wrote:
> >> Hi Ming,
> >>
> >> On 4/16/26 15:49, Ming Lei wrote:
> >>> Hi Bernd,
> >>>
> >>> On Tue, Apr 14, 2026 at 5:33 AM Bernd Schubert <bernd@niova.io> wrote:
> >>>>
> >>>> Hi Joanne, et al,
> >>>>
> >>>> this is a bit of duplication of the discussion we had before, but I was
> >>>> badly distracted with other work and also switching employer - didn't
> >>>> manage to reply [1].
> >>>>
> >>>>
> >>>> I'm still not too happy about kBuf and its restriction of locked-only
> >>>> memory. Right now I'm reviewing your patches from the view of what needs
> >>>> to be done for ublk (for my current employer) and also for fuse to
> >>>> support different buffer sizes. Let's say fuse only support kBuf and its
> >>>> restriction of pinned memory, I think we would be forced to add support
> >>>> for different buffer sizes to the current ring-entry-provides-the-buffer
> >>>> and the new kBuf interface - from my point of view code dup.
> >>>> If we would allow pBuf for fuse, we could put the current
> >>>> 'ring-entry-provides-the-buffer' interface into maintenance mode and
> >>>> support new features with the new interface only. I know you disagree on
> >>>> using pBuf [1] with the argument that userspace could free the buffer.
> >>>> Well, if it does, it does something totally wrong and the same could
> >>>> happen today over /dev/fuse and also the existing fuse-over-io-uring.
> >>>> Just the window is smaller, as the pages are extracted from the buffer
> >>>> during the copy.
> >>>>
> >>>> I was looking into what would be needed to support pBuf and I think
> >>>> io-uring could extract pages from pBuf when the buffer is obtained - it
> >>>> would limit the window when userspace can do something wrong in a
> >>>> similar way current fuse and ublk works.
> >>>>
> >>>> Suggested changes:
> >>>>
> >>>> io_uring:
> >>>>
> >>>>   - io_pin_pages() gets a 'bool longterm' parameter.
> >>>> The new pBuf path would pass false, every other exsting caller true.
> >>>>
> >>>>   - io_ring_buf_pin_user() / io_ring_buf_unpin_user()
> >>>>   - io_ring_buf_get_pages()/io_ring_buf_put_pages() -> fills the
> >>>> provided bvec
> >>>>   - New struct io_ring_buf (in cmd.h)
> >>>>
> >>>> struct io_ring_buf {
> >>>>        size_t                  len;
> >>>>        unsigned int            buf_id;
> >>>>        unsigned int            nr_bvecs;
> >>>>
> >>>>        /* private */
> >>>>        u64                     addr;
> >>>>        u8                      is_pinned;
> >>>> };
> >>>>
> >>>>
> >>>> Fuse changes:
> >>>>
> >>>>   - fuse_ring_ent (bufring union side): payload_kvec and ringbuf_buf_id
> >>>>     replaced by io_ring_buf + pre-allocated bvec array.
> >>>>   - Buffer selection under queue->lock removed.  The lock only protects
> >>>>     request dequeue and entry state transitions.  Page access happens
> >>>>     after the lock is dropped, in the context where the copy runs.
> >>>>   - setup_fuse_copy_state bufring branch: is_kaddr/kaddr replaced by
> >>>>     iov_iter_bvec() and would continue to use iov_iter_get_pages2()
> >>>>
> >>>> What do you think?
> >>>>
> >>>> And my current primary goal is to let ublk to support multiple buffer
> >>>> sizes - ublk would also need to get support for kBuf/pBuf and I'm
> >>>
> >>> Ublk server is just one liburing application, and it supports all generic
> >>> io_uring buffer types, so kbuf/pbuf should be fine for your ublk server
> >>> in theory.
> >>>
> >>> It really depends on how your ublk server is implemented.
> >>>
> >>> Maybe you can share your motivation first before discussing kbuf/pbuf support.
> >>> If it is for DMA,  there are other candidates too, such as hugepage,
> >>> recent added
> >>> UBLK_U_CMD_REG_BUF, ...
> >> Joanne had actually removed kBuf and switched to pBuf alone and that
> >> simiplifies things a bit.
> >>
> >> Motivation is to reduce memory usage. Let's say you need 4 IOs of 1MB to
> >> saturate streaming bandwidth, but still want to get smaller IOs through,
> >> for these smaller IOs you don't want to assign the 1MB buffer for each
> >> queue entry / tag.
> > 
> > Thanks for sharing the motivation.
> > 
> > Maybe you can pass UBLK_F_USER_COPY, and each IO buffer can be allocated
> > dynamically completely from userspace, then pre-allocation can be avoided.
> 
> I had looked into, but that is still another syscall / roundtrip, will
> have the same performance issue as UBLK_F_NEED_GET_DATA and probably
> worse because compared to ring IO that is a syscall per IO.

Yeah, that seems true in your use case, where compression follows, so the
pread/pwrite for the read/write IO buffer can't be linked into the
io_uring SQE pipeline.

However, I am not sure how you would use pBuf for this use case; one big
thing is that the buffer has to be provided to the ublk FETCH_AND_COMMAND
command beforehand for handling the incoming ublk IO request, whose size
can't be known at that time. I will study the pBuf patchset later, but it
also depends on how the ublk driver uses it, IMO.

Meanwhile, another (more flexible) way is to use bpf struct_ops for
allocating & freeing the IO buffer, following this basic idea:

- define struct_ops (alloc_io_buf, free_io_buf) for allocating & freeing
the IO buffer that is used for copying data between request pages and this buffer

- ->alloc_io_buf() can be called from ublk_map_io() and ->free_io_buf()
can be called from ublk_unmap_io()

- the allocated buffer can be accessed directly from both the userspace
ublk server and the bpf prog; a bpf arena is a perfect match for this use
case, and page pinning is avoided as well

- the two callbacks are not called when any of the following features is
set for this IO: UBLK_F_SUPPORT_ZERO_COPY, UBLK_F_USER_COPY,
UBLK_F_AUTO_BUF_REG or UBLK_IO_F_SHMEM_ZC

- the motivation is to avoid big pre-allocations, so the ublk server can
use a dynamic per-queue heap for allocating IO buffers in a space-efficient way

- with this feature, userspace needn't pre-allocate IO buffers at the max
  buffer size; a typical implementation provides one bpf arena heap for
  the bpf prog to alloc & free buffers from, and it can still fall back to
  the usercopy code path in case of allocation failure in the bpf prog
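The control flow of that callback pair plus the usercopy fallback could
look roughly like the following. This is a plain C illustration only,
not the actual bpf struct_ops declaration; demo_alloc is a hypothetical
stand-in for a bpf arena heap.

```c
#include <stddef.h>

/* Illustration of the ops pair described above. */
struct ublk_io_buf_ops {
	void *(*alloc_io_buf)(unsigned int q_id, unsigned int tag, size_t len);
	void  (*free_io_buf)(unsigned int q_id, unsigned int tag, void *buf);
};

enum io_path { IO_PATH_BPF_BUF, IO_PATH_USER_COPY };

/*
 * Called where ublk_map_io() would run: try the bpf-provided heap
 * first, fall back to the usercopy code path if allocation fails
 * (or no ops are registered).
 */
static enum io_path map_io(const struct ublk_io_buf_ops *ops,
			   unsigned int q, unsigned int tag,
			   size_t len, void **buf)
{
	*buf = (ops && ops->alloc_io_buf) ?
		ops->alloc_io_buf(q, tag, len) : NULL;
	return *buf ? IO_PATH_BPF_BUF : IO_PATH_USER_COPY;
}

/* Hypothetical demo allocator standing in for a bpf arena heap:
 * it only serves requests up to 64KB. */
static char demo_slab[1 << 16];
static void *demo_alloc(unsigned int q, unsigned int tag, size_t len)
{
	(void)q; (void)tag;
	return len <= sizeof(demo_slab) ? demo_slab : NULL;
}

static const struct ublk_io_buf_ops demo_ops = { demo_alloc, NULL };
```

The key property shown is the last bullet above: an allocation failure
in the bpf prog degrades gracefully to the usercopy path instead of
failing the IO.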

You may compare the two approaches for your use case.

> 
> > 
> >> Zero copy is currently still out of question for us, although I will
> >> look into your recent work for integration of eBPF and if erasure
> >> coding, compression and checksums could be done with that (I guess
> >> checksums is the easy part).
> > 
> > Got it, compression could be the hardest one, however, the recent added bpf
> > iterator based buffer interface may simplify everything. I'd suggest you to look
> > at it, and provide some feedback if possible.
> > 
> > Also if your client application uses direct IO, recent added UBLK_F_SHMEM_ZC
> > could simplify implementation a lot, meantime with zero copy & user-mapped
> > address.
> 
> Oh I see, that was just merged. Nice, thank you! I don't our users will
> be DIO only, but nice to have that ZC option!

It can be thought of as a speedup or optimization for the DIO use case.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: fuse/io-uring: Proposal to support pBuf in additon to kBuf
  2026-04-16 14:46   ` Bernd Schubert
  2026-04-16 15:48     ` Ming Lei
@ 2026-04-17 21:02     ` Joanne Koong
  1 sibling, 0 replies; 10+ messages in thread
From: Joanne Koong @ 2026-04-17 21:02 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Ming Lei, fuse-devel, io-uring, Jens Axboe, Pavel Begunkov,
	Miklos Szeredi, Lei, Ming

On Thu, Apr 16, 2026 at 7:46 AM Bernd Schubert <bernd@niova.io> wrote:
>
> Hi Ming,
>
> On 4/16/26 15:49, Ming Lei wrote:
> > Hi Bernd,
> >
> > On Tue, Apr 14, 2026 at 5:33 AM Bernd Schubert <bernd@niova.io> wrote:
> >>
> >> And my current primary goal is to let ublk to support multiple buffer
> >> sizes - ublk would also need to get support for kBuf/pBuf and I'm
> >
> > Ublk server is just one liburing application, and it supports all generic
> > io_uring buffer types, so kbuf/pbuf should be fine for your ublk server
> > in theory.
> >
> > It really depends on how your ublk server is implemented.
> >
> > Maybe you can share your motivation first before discussing kbuf/pbuf support.
> > If it is for DMA,  there are other candidates too, such as hugepage,
> > recent added
> > UBLK_U_CMD_REG_BUF, ...
> Joanne had actually removed kBuf and switched to pBuf alone and that
> simiplifies things a bit.

Hi Bernd,

>
> Motivation is to reduce memory usage. Let's say you need 4 IOs of 1MB to
> saturate streaming bandwidth, but still want to get smaller IOs through,
> for these smaller IOs you don't want to assign the 1MB buffer for each
> queue entry / tag.

Have you considered having separate rings take separate payload sizes
instead of having each ring support multiple different payload sizes?
I think this gives a few non-trivial benefits over per-ring
multi-buffer-size support:

* less head-of-line blocking - with a single ring, the large io
requests can block smaller metadata requests until the io completes,
since fuse processes cqes sequentially from a single ring. Separate
rings would allow smaller requests to proceed independently of io
* makes kernel-side request dispatching more efficient + simpler - if
for example there are 10 different rings and each of them supports 4
categories of buffer sizes, imo it gets non-trivially complicated to
find an available ring that supports the payload size that needs to be
sent, if there are lots of parallel requests going on. In the worst
case, we would have to check each of the 10 rings' various categories
of buffer sizes to see if there's a slot that's big enough.
* simpler kernel-side buffer management - keeping track of the payload
buffers in the ring becomes a lot simpler, since there's just one
buffer size the ring supports
* more dynamic / deterministic scalability - I think you mentioned on
another thread you were interested in dynamically adding ents to
rings. Having separate rings for separate payload sizes would make
independently scaling queues based on workload characteristics a lot
easier. for example if there were 10 rings that each support 4
different buffer sizes, one question I would have is which ring would
the extra entry be added to? It kind of seems like at request dispatch
time, it would have to do that non-trivial ent searching logic across
all rings mentioned earlier to find that extra ent?
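To illustrate the dispatch-cost point above: with one ring per payload
size, picking a ring is a direct lookup over the size classes, with no
per-ring availability search (the sizes and ring count below are
hypothetical, just for illustration):

```c
#include <stddef.h>

/* Hypothetical payload size classes, one dedicated ring per class. */
static const size_t ring_payload_size[] = { 4096, 65536, 1048576 };
#define NR_RINGS (sizeof(ring_payload_size) / sizeof(ring_payload_size[0]))

/*
 * With separate rings per payload size, dispatch reduces to finding
 * the first size class big enough for the request - no scanning of
 * per-ring buffer availability is needed.
 * Returns the ring index, or -1 if len exceeds every supported size.
 */
static int ring_for_len(size_t len)
{
	for (size_t i = 0; i < NR_RINGS; i++) {
		if (len <= ring_payload_size[i])
			return (int)i;
	}
	return -1;
}
```

Contrast this with the multi-size-per-ring case, where the equivalent
step has to search rings x size classes for a free, large-enough slot.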

These are just my 2 cents, but it kind of seems to me that having
separate rings take separate payload sizes could be more scalable for
your use case?

Thanks,
Joanne

> Zero copy is currently still out of question for us, although I will
> look into your recent work for integration of eBPF and if erasure
> coding, compression and checksums could be done with that (I guess
> checksums is the easy part).
>
> Ublk already has UBLK_F_NEED_GET_DATA, but that has two issues
> - needs another round trip (testing on my laptop shows a perf loss of 10
> to 15% per queue)
> - It does not release the application buffer on read. I have an idea how
> to fix that, but here at Niova we would like to go the dynamic memory
> appraoch with pBufs to avoid additional round trip overhead.
>
> Idea with pBufs: Several pBufs registered per queue at registration
> time. Every pBuf represents a different IO size. Optionally as with
> Joannes patches [1] the buffers can get pinned to avoid mapping to pages
> for every access.
> I'm currently working on a patch series with some luck will sent an RFC
> tomorrow. The harder part compared to fuse is that ublk_drv does not
> have its own queues/lists so far. This is my first work on block layer -
> I'm not sure if internal struct request queuing is allowed at all.
> Testing will show in a bit :)
>
>
> Thanks,
> Bernd
>
>
> [1]
> https://lore.kernel.org/linux-fsdevel/20260402162840.2989717-1-joannelkoong@gmail.com/T/#mb8f96895aa2773424005ee06bb62ae980e95e604
>
>

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-04-17 21:02 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-13 21:33 fuse/io-uring: Proposal to support pBuf in additon to kBuf Bernd Schubert
2026-04-14  0:56 ` Joanne Koong
2026-04-14 17:34   ` Bernd Schubert
2026-04-15  0:19     ` Joanne Koong
2026-04-16 13:49 ` Ming Lei
2026-04-16 14:46   ` Bernd Schubert
2026-04-16 15:48     ` Ming Lei
2026-04-16 19:13       ` Bernd Schubert
2026-04-17 14:35         ` Ming Lei
2026-04-17 21:02     ` Joanne Koong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox