From: Bernd Schubert <bernd@bsbernd.com>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: axboe@kernel.dk, hch@infradead.org, asml.silence@gmail.com,
csander@purestorage.com, krisman@suse.de,
linux-fsdevel@vger.kernel.org, io-uring@vger.kernel.org,
Horst Birthelmer <hbirthelmer@ddn.com>
Subject: Re: [PATCH v3 0/8] io_uring: add kernel-managed buffer rings
Date: Fri, 20 Mar 2026 23:44:34 +0100 [thread overview]
Message-ID: <f8a1808b-4068-49a7-a17d-8dcab1ae9cdf@bsbernd.com> (raw)
In-Reply-To: <CAJnrk1bYLwHpZsW85XiyWfM=gXXXS6pHg4=p9fcbDOwpca8UXQ@mail.gmail.com>
On 3/20/26 22:58, Joanne Koong wrote:
> On Fri, Mar 20, 2026 at 12:45 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>>
>> On 3/20/26 20:20, Joanne Koong wrote:
>>> On Fri, Mar 20, 2026 at 10:16 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>>>>
>>>> On 3/6/26 01:32, Joanne Koong wrote:
>>> Hi Bernd,
>>>
>>>> Hi Joanne,
>>>>
>>>> I'm a bit late, but could we have a design discussion about fuse here?
>>>> From my point of view it would be good if we could have different
>>>> request sizes for the ring buffers. Without kbuf I thought we would just
>>>
>>> Is your motivation for wanting different request sizes for the ring
>>> buffers so that it can optimize the memory costs of the buffers? I
>>> agree that trying to reduce the memory footprint of the buffers is
>>> very important. The main reason I ended up going with the buffer ring
>>> design was for that purpose. When kbuf incremental buffer consumption
>>> is added in the future (I plan to submit it separately once all the
>>> io-uring pieces of the fuse-zero-copy patchset land), this will allow
>>> non-overlapping regions of the individual buffer to be used across
>>> multiple different-sized requests concurrently.
>>
>> That is also fine.
>>
>>>
>>> From my point of view, this is better than allocating variable-sized
>>> buffers upfront because:
>>> a) entries are fully maximized. With variable-sized buffers, the big
>>> buffers would be reserved specifically for payload requests while the
>>> small buffers would be reserved specifically for metadata requests. We
>>> could allocate '# entries' worth of small buffers, but for big
>>> buffers there would be fewer than '# entries'. If the server needs to
>>> service many concurrent I/O requests, then the ring gets throttled
>>> on the limited number of big buffers available.
>>
>> I would like to see something like 8K, 16K, 32K, 128K.
>
> My worry is that for I/O heavy workloads with large read/write
> payloads (eg client access patterns reading/writing MBs at a time),
> the limited number of big enough buffers becomes the throttling
> bottleneck.
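Right - to make concrete what I had in mind, the selection would just be
a smallest-fit lookup over a handful of size classes. A user-space
sketch (the class sizes are the ones I suggested above; everything else
is made-up illustration, not actual fuse/io_uring code):

```c
#include <stddef.h>

/* Smallest-fit lookup over fixed buffer size classes (sketch only). */
static const size_t classes[] = { 8192, 16384, 32768, 131072 };
#define NR_CLASSES (sizeof(classes) / sizeof(classes[0]))

/* Return the index of the smallest class that fits a request of
 * req_len bytes, or -1 if it exceeds the largest class. */
static int pick_class(size_t req_len)
{
	for (size_t i = 0; i < NR_CLASSES; i++)
		if (req_len <= classes[i])
			return (int)i;
	return -1;
}
```

The worry about large I/O then indeed translates into how many entries
the last (128K) class gets.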
>
>>
>>>
>>> b) it best maximizes buffer memory. A request could need a buffer of
>>> any size so with variable-sized buffers, there's extra space in the
>>> buffer that is still being wasted. For example, for large payload
>>> requests, the big buffers would need to be the size of the max payload
>>> size (e.g. the default 1 MB), but many requests will be smaller than
>>> that. With incremental buffer consumption, only the bytes actually
>>> used by the request are reserved in the buffer.
>>
>> Doesn't that cause fragmentation?
>
> With incremental buffer consumption, there's no fragmentation in the
> classical sense (e.g. scattered unusable holes). The buffer gets
> recycled back into the ring as a whole once all the requests in it
> have completed (tracked by refcounting).
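Ok, so to make sure I am reading the recycling scheme right - in a
user-space sketch it would behave roughly like this (names and struct
layout are illustrative, not the actual kbuf code)?

```c
#include <stddef.h>

/* Simplified model of one kernel-managed ring buffer that is carved up
 * incrementally across requests and recycled as a whole once every
 * request that borrowed a slice has completed (hypothetical names). */
struct kmbuf {
	size_t size;    /* total buffer size */
	size_t used;    /* bytes handed out so far */
	int    refs;    /* outstanding requests using slices of this buffer */
	int    in_ring; /* 1 once recycled back into the ring */
};

/* Reserve len bytes for a request; returns the offset into the buffer,
 * or -1 if it is full (the caller would move on to the next buffer). */
static long kmbuf_reserve(struct kmbuf *b, size_t len)
{
	long off;

	if (b->used + len > b->size)
		return -1;
	off = (long)b->used;
	b->used += len;
	b->refs++;
	return off;
}

/* Request completed: drop the reference; recycle the whole buffer
 * back into the ring once the last slice is released. */
static void kmbuf_put(struct kmbuf *b)
{
	if (--b->refs == 0) {
		b->used = 0;    /* whole buffer becomes reusable at once */
		b->in_ring = 1; /* re-post to the buffer ring */
	}
}
```

I.e. one slow request keeps the entire buffer out of the ring until it
completes, which matches the concern you describe below.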
>
> I think the concern is that if the server is very slow to fulfill
> requests, and the workload pattern packs slow requests into the same
> buffers as fast ones across all the buffers in the queue, and that
> queue has all its buffers saturated, then the next buffer becomes
> available only once the slow request has completed. We can
> mitigate this by assigning the request to a queue on the nearest numa
> node as a fallback if we detect that case. We could also do the same
> thing to mitigate the variable-sized buffer scenario where there's not
> enough big buffers for the queue, but I think that logic ends up a bit
> more complex.
>
> I think overall we're able to support both incremental buffer
> consumption + variable-sized buffers if there's a need for it in the
> future where the server would like to choose.
>
>>
>>>
>>> c) there's no overhead with having to (as you pointed out) keep the
>>> buffers tracked and sorted into per-sized lists. If we wanted to use
>>> variable-sized buffers with kbufs instead of using incremental buffer
>>> consumption, the best way to do that would be to allocate a separate
>>> kbufring to support payload requests vs metadata requests.
>>
>> Yeah, I had thought of multiple kbuf rings, with different sizes.
>>
>>>
>>>> register entries with different sizes, which would then get sorted into
>>>> per size lists. Now with kbuf that will not work anymore and we need
>>>> different kbuf sizes. But then kbuf is not suitable for non-privileged
>>>> users. So in order to support different request sizes one basically has
>>>
>>> Non-privileged fuse servers use kbufs as well. It's only zero-copying
>>> that is not possible for non-privileged servers.
>>
>> Non-privileged users cannot pin much - by default the mlock limit is
>> only 8MB. I was under the impression that kbuf would always be pinned,
>> but I need to read over it again.
>
> The kbufs get accounted to the user's mlock usage (this happens in
> __io_account_mem()). If the user running the unprivileged server
> doesn't belong to a group that has high enough mlock limits, they'll
> have to use regular fuse over-io-uring buffers instead of kbufs for
> most of their queues.
That is exactly what I mean - in practice, unprivileged servers will not
be able to use kbufs. For them it would be good if the server could use
unpinned pbufs instead.
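For reference, my understanding of the accounting is roughly the
following (a simplified sketch of the __io_account_mem()-style check;
the struct and function names here are made up for illustration):

```c
#include <stddef.h>

/* Rough model of pinned-memory accounting against RLIMIT_MEMLOCK,
 * which is what makes kbufs impractical for unprivileged servers
 * (illustrative names, not the actual io_uring code). */
struct user_acct {
	size_t locked_pages;  /* pages already charged to this user */
	size_t memlock_limit; /* RLIMIT_MEMLOCK in pages */
};

/* Returns 0 on success, -1 (ENOMEM-style) if the charge would push
 * the user past its mlock limit; nothing is charged on failure. */
static int account_pinned(struct user_acct *u, size_t nr_pages)
{
	if (u->locked_pages + nr_pages > u->memlock_limit)
		return -1;
	u->locked_pages += nr_pages;
	return 0;
}
```

With the default 8MB limit (2048 4K pages), a single decently sized
kbuf ring already exhausts the budget.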
Thanks,
Bernd
Thread overview: 16+ messages
2026-03-06 0:32 [PATCH v3 0/8] io_uring: add kernel-managed buffer rings Joanne Koong
2026-03-06 0:32 ` [PATCH v3 1/8] io_uring/kbuf: add support for " Joanne Koong
2026-03-06 0:32 ` [PATCH v3 2/8] io_uring/kbuf: support kernel-managed buffer rings in buffer selection Joanne Koong
2026-03-06 0:32 ` [PATCH v3 3/8] io_uring/kbuf: add buffer ring pinning/unpinning Joanne Koong
2026-03-06 0:32 ` [PATCH v3 4/8] io_uring/kbuf: return buffer id in buffer selection Joanne Koong
2026-03-06 0:32 ` [PATCH v3 5/8] io_uring/kbuf: add recycling for kernel managed buffer rings Joanne Koong
2026-03-06 0:32 ` [PATCH v3 6/8] io_uring/kbuf: add io_uring_is_kmbuf_ring() Joanne Koong
2026-03-06 0:32 ` [PATCH v3 7/8] io_uring/kbuf: export io_ring_buffer_select() Joanne Koong
2026-03-06 0:32 ` [PATCH v3 8/8] io_uring/cmd: set selected buffer index in __io_uring_cmd_done() Joanne Koong
2026-03-20 16:45 ` [PATCH v3 0/8] io_uring: add kernel-managed buffer rings Jens Axboe
2026-03-20 17:16 ` Bernd Schubert
2026-03-20 19:20 ` Joanne Koong
2026-03-20 19:45 ` Bernd Schubert
2026-03-20 21:58 ` Joanne Koong
2026-03-20 22:44 ` Bernd Schubert [this message]
2026-03-21 1:19 ` Joanne Koong