public inbox for io-uring@vger.kernel.org
From: Bernd Schubert <bernd@bsbernd.com>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: axboe@kernel.dk, hch@infradead.org, asml.silence@gmail.com,
	csander@purestorage.com, krisman@suse.de,
	linux-fsdevel@vger.kernel.org, io-uring@vger.kernel.org,
	Horst Birthelmer <hbirthelmer@ddn.com>
Subject: Re: [PATCH v3 0/8] io_uring: add kernel-managed buffer rings
Date: Fri, 20 Mar 2026 20:45:01 +0100	[thread overview]
Message-ID: <d9aae2bf-b81d-42c8-b919-5e64292323e8@bsbernd.com> (raw)
In-Reply-To: <CAJnrk1YLtQF=SF-GoG4irKYzzePNewNgyTeU7VLvUN6Ub_NFVw@mail.gmail.com>



On 3/20/26 20:20, Joanne Koong wrote:
> On Fri, Mar 20, 2026 at 10:16 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>>
>> On 3/6/26 01:32, Joanne Koong wrote:
>>> Currently, io_uring buffer rings require the application to allocate and
>>> manage the backing buffers. This series introduces buffer rings where the
>>> kernel allocates and manages the buffers on behalf of the application. From
>>> the uapi side, this goes through the pbuf ring interface, through the
>>> IOU_PBUF_RING_KERNEL_MANAGED flag.
>>>
>>> There was a long discussion with Pavel on v1 [1] regarding the design. The
>>> alternatives were to have the buffers allocated and registered through a
>>> memory region or through the registered buffers interface and have fuse
>>> implement ring buffer logic internally outside of io-uring. However, because
>>> the buffers need to be contiguous for DMA and some high-performance fuse
>>> servers may need non-fuse io-uring requests to use the buffer ring directly,
>>> v3 keeps the design.
>>>
>>> This is split out from the fuse-over-io_uring series in [2], which needs the
>>> kernel to own and manage buffers shared between the fuse server and the
>>> kernel. The link to the fuse tree that uses the commits in this series is in
>>> [3].
>>>
>>> This series is on top of the for-7.1/io_uring branch in Jens' io-uring
>>> tree (commit ee1d7dc33990). The corresponding liburing changes are in [4] and
>>> will be submitted after the changes in this patchset have landed.
>>>
>>> Thanks,
>>> Joanne
>>>
>>> [1] https://lore.kernel.org/linux-fsdevel/20260210002852.1394504-1-joannelkoong@gmail.com/T/#t
>>> [2] https://lore.kernel.org/linux-fsdevel/20260116233044.1532965-1-joannelkoong@gmail.com/
>>> [3] https://github.com/joannekoong/linux/commits/fuse_zero_copy_for_v3/
>>> [4] https://github.com/joannekoong/liburing/commits/pbuf_kernel_managed/
>>
> 
> Hi Bernd,
> 
>> Hi Joanne,
>>
>> I'm a bit late, but could we have a design discussion about fuse here?
>> From my point of view it would be good if we could have different
>> request sizes for the ring buffers. Without kbuf I thought we would just
> 
> Is your motivation for wanting different request sizes for the ring
> buffers to optimize the memory cost of the buffers? I
> agree that trying to reduce the memory footprint of the buffers is
> very important. The main reason I ended up going with the buffer ring
> design was for that purpose. When kbuf incremental buffer consumption
> is added in the future (I plan to submit it separately once all the
> io-uring pieces of the fuse-zero-copy patchset land), this will allow
> non-overlapping regions of the individual buffer to be used across
> multiple different-sized requests concurrently.

That is also fine.

> 
> From my point of view, this is better than allocating variable-sized
> buffers upfront because:
> a) entries are fully maximized. With variable-sized buffers, the big
> buffers would be reserved specifically for payload requests while the
> small buffers would be reserved specifically for metadata requests. We
> could allocate '# entries' amount of small buffers, but for big
> buffers there would be fewer than '# entries'. If the server needs to
> service a lot of concurrent I/O requests, then the ring gets throttled
> on the limited number of big buffers available.

I would like to see something like 8K, 16K, 32K, 128K.

> 
> b) it makes the best use of buffer memory. A request could need a
> buffer of any size, so with variable-sized buffers there's extra
> space in the buffer that goes to waste. For example, for large payload
> requests, the big buffers would need to be the size of the max payload
> size (eg default 1 MB) but a lot of requests will fall under that.
> With incremental buffer consumption, only however many bytes used by
> the request are reserved in the buffer.

Doesn't that cause fragmentation?

> 
> c) there's no overhead with having to (as you pointed out) keep the
> buffers tracked and sorted into per-sized lists. If we wanted to use
> variable-sized buffers with kbufs instead of using incremental buffer
> consumption, the best way to do that would be to allocate a separate
> kbufring to support payload requests vs metadata requests.

Yeah, I had thought of multiple kbuf rings, with different sizes.

> 
>> register entries with different sizes, which would then get sorted into
>> per size lists. Now with kbuf that will not work anymore and we need
>> different kbuf sizes. But then kbuf is not suitable for non-privileged
>> users. So in order to support different request sizes one basically has
> 
> Non-privileged fuse servers use kbufs as well. It's only zero-copying
> that is not possible for non-privileged servers.

Non-privileged users cannot pin memory beyond the mlock limit, which by
default is 8 MB. I was under the impression that kbuf buffers would
always be pinned, but I need to read over the code again.

> 
>> to implement things twice - not ideal. Couldn't we have pbuf for
>> non-privileged users and basically deprecate the existing fuse io-uring
> 
> I don't think this is necessary because kbufs works for both
> non-privileged and privileged servers. For how the buffer gets used by
> the server/kernel, pbufs are not an option here because the kernel has
> to be the one to recycle back the buffer (since it needs to read /
> copy data the server returns back in the buffer).

I was thinking of setting a flag or taking a ref count to disallow pbuf
destruction.

> 
>> buffer API? In the sense that it needs to be further supported for some
>> time, but won't get any new feature. Different buffer sizes would then
>> only be supported through kbuf/pbuf?
> 
> I hope I understood your questions correctly, but if I misread
> anything, please let me know. I am going to be updating and submitting
> the fuse patches next week - the main update will be changing the
> headers to go through a registered memory region (which I only
> realized existed after the discussion with Pavel in v1) instead of as
> a registered buffer, as that will allow us to avoid the per I/O lookup
> overhead and drop the patch for the
> "io_uring_fixed_index_get()/io_uring_fixed_index_put()" refcount dance
> altogether.

I will try to review ASAP when you submit.


Thanks,
Bernd



Thread overview: 14+ messages
2026-03-06  0:32 [PATCH v3 0/8] io_uring: add kernel-managed buffer rings Joanne Koong
2026-03-06  0:32 ` [PATCH v3 1/8] io_uring/kbuf: add support for " Joanne Koong
2026-03-06  0:32 ` [PATCH v3 2/8] io_uring/kbuf: support kernel-managed buffer rings in buffer selection Joanne Koong
2026-03-06  0:32 ` [PATCH v3 3/8] io_uring/kbuf: add buffer ring pinning/unpinning Joanne Koong
2026-03-06  0:32 ` [PATCH v3 4/8] io_uring/kbuf: return buffer id in buffer selection Joanne Koong
2026-03-06  0:32 ` [PATCH v3 5/8] io_uring/kbuf: add recycling for kernel managed buffer rings Joanne Koong
2026-03-06  0:32 ` [PATCH v3 6/8] io_uring/kbuf: add io_uring_is_kmbuf_ring() Joanne Koong
2026-03-06  0:32 ` [PATCH v3 7/8] io_uring/kbuf: export io_ring_buffer_select() Joanne Koong
2026-03-06  0:32 ` [PATCH v3 8/8] io_uring/cmd: set selected buffer index in __io_uring_cmd_done() Joanne Koong
2026-03-20 16:45 ` [PATCH v3 0/8] io_uring: add kernel-managed buffer rings Jens Axboe
2026-03-20 17:16 ` Bernd Schubert
2026-03-20 19:20   ` Joanne Koong
2026-03-20 19:45     ` Bernd Schubert [this message]
2026-03-20 21:58       ` Joanne Koong
