public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
From: Bernd Schubert <bernd@bsbernd.com>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: axboe@kernel.dk, hch@infradead.org, asml.silence@gmail.com,
	csander@purestorage.com, krisman@suse.de,
	linux-fsdevel@vger.kernel.org, io-uring@vger.kernel.org,
	Horst Birthelmer <hbirthelmer@ddn.com>
Subject: Re: [PATCH v3 0/8] io_uring: add kernel-managed buffer rings
Date: Fri, 20 Mar 2026 23:44:34 +0100	[thread overview]
Message-ID: <f8a1808b-4068-49a7-a17d-8dcab1ae9cdf@bsbernd.com> (raw)
In-Reply-To: <CAJnrk1bYLwHpZsW85XiyWfM=gXXXS6pHg4=p9fcbDOwpca8UXQ@mail.gmail.com>



On 3/20/26 22:58, Joanne Koong wrote:
> On Fri, Mar 20, 2026 at 12:45 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>>
>> On 3/20/26 20:20, Joanne Koong wrote:
>>> On Fri, Mar 20, 2026 at 10:16 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>>>>
>>>> On 3/6/26 01:32, Joanne Koong wrote:
>>> Hi Bernd,
>>>
>>>> Hi Joanne,
>>>>
>>>> I'm a bit late, but could we have a design discussion about fuse here?
>>>> From my point of view it would be good if we could have different
>>>> request sizes for the ring buffers. Without kbuf I thought we would just
>>>
>>> Is your motivation for wanting different request sizes for the ring
>>> buffers so that it can optimize the memory costs of the buffers? I
>>> agree that trying to reduce the memory footprint of the buffers is
>>> very important. The main reason I ended up going with the buffer ring
>>> design was for that purpose. When kbuf incremental buffer consumption
>>> is added in the future (I plan to submit it separately once all the
>>> io-uring pieces of the fuse-zero-copy patchset land), this will allow
>>> non-overlapping regions of the individual buffer to be used across
>>> multiple different-sized requests concurrently.
>>
>> That is also fine.
>>
>>>
>>> From my point of view, this is better than allocating variable-sized
>>> buffers upfront because:
>>> a) entries are fully maximized. With variable-sized buffers, the big
>>> buffers would be reserved specifically for payload requests while the
>>> small buffers would be reserved specifically for metadata requests. We
>>> could allocate '# entries' amount of small buffers, but for big
>>> buffers there would be less than '# entries'. If the server needs to
>>> service a lot of concurrent I/O requests, then the ring gets throttled
>>> on the limited number of big buffers available.
>>
>> I would like to see something like 8K, 16K, 32K, 128K.
> 
> My worry is that for I/O heavy workloads with large read/write
> payloads (eg client access patterns reading/writing MBs at a time),
> the limited number of big enough buffers becomes the throttling
> bottleneck.
> 
>>
>>>
>>> b) it best maximizes buffer memory. A request could need a buffer of
>>> any size so with variable-sized buffers, there's extra space in the
>>> buffer that is still being wasted. For example, for large payload
>>> requests, the big buffers would need to be the size of the max payload
>>> size (eg default 1 MB) but a lot of requests will fall under that.
>>> With incremental buffer consumption, only however many bytes used by
>>> the request are reserved in the buffer.
>>
>> Doesn't that cause fragmentation?
> 
> With incremetnal buffer consumption, there's not fragmentation in the
> classical sense (eg scattered unusable holes). The buffer gets
> recycled back into the ring as a whole once all the requests in it
> have completed (tracked by refcounting).
> 
> I think the concern is that if the server is very slow to fulfill
> requests and the workload pattern has it so that slow requests are
> packed into the same buffer as fast requests across all the buffers in
> the queue and that queue has all its buffers saturated, then the next
> buffer is available only once the slow request has completed. We can
> mitigate this by assigning the request to a queue on the nearest numa
> node as a fallback if we detect that case. We could also do the same
> thing to mitigate the variable-sized buffer scenario where there's not
> enough big buffers for the queue, but I think that logic ends up a bit
> more complex.
> 
> I think overall we're able to support both incremental buffer
> consumption + variable-sized buffers if there's a need for it in the
> future where the server would like to choose.
> 
>>
>>>
>>> c) there's no overhead with having to (as you pointed out) keep the
>>> buffers tracked and sorted into per-sized lists. If we wanted to use
>>> variable-sized buffers with kbufs instead of using incremental buffer
>>> consumption, the best way to do that would be to allocate a separate
>>> kbufring to support payload requests vs metadata requests.
>>
>> Yeah, I had thought of multiple kbuf rings, with different sizes.
>>
>>>
>>>> register entries with different sizes, which would then get sorted into
>>>> per size lists. Now with kbuf that will not work anymore and we need
>>>> different kbuf sizes. But then kbuf is not suitable for non-privileged
>>>> users. So in order to support different request sizes one basically has
>>>
>>> Non-privileged fuse servers use kbufs as well. It's only zero-copying
>>> that is not possible for non-privileged servers.
>>
>> Non-privileged cannot pin, at least by default mlock size is 8MB. I was
>> under the impression that kbuf would be always pinned, but I need to
>> read over it again.
> 
> The kbufs get accounted to the user's mlock usage (this happens in
> __io_account_mem()). If the user running the unprivileged server
> doesn't belong to a group that has high enough mlock limits, they'll
> have to use regular fuse over-io-uring buffers instead of kbufs for
> most of their queues.

That is exactly what I mean - in reality  unprivileged servers will not
be able to use kbufs. And there it would be good, if that server could
use unpinned pbufs.


Thanks,
Bernd

  reply	other threads:[~2026-03-20 22:44 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-06  0:32 [PATCH v3 0/8] io_uring: add kernel-managed buffer rings Joanne Koong
2026-03-06  0:32 ` [PATCH v3 1/8] io_uring/kbuf: add support for " Joanne Koong
2026-03-06  0:32 ` [PATCH v3 2/8] io_uring/kbuf: support kernel-managed buffer rings in buffer selection Joanne Koong
2026-03-06  0:32 ` [PATCH v3 3/8] io_uring/kbuf: add buffer ring pinning/unpinning Joanne Koong
2026-03-06  0:32 ` [PATCH v3 4/8] io_uring/kbuf: return buffer id in buffer selection Joanne Koong
2026-03-06  0:32 ` [PATCH v3 5/8] io_uring/kbuf: add recycling for kernel managed buffer rings Joanne Koong
2026-03-06  0:32 ` [PATCH v3 6/8] io_uring/kbuf: add io_uring_is_kmbuf_ring() Joanne Koong
2026-03-06  0:32 ` [PATCH v3 7/8] io_uring/kbuf: export io_ring_buffer_select() Joanne Koong
2026-03-06  0:32 ` [PATCH v3 8/8] io_uring/cmd: set selected buffer index in __io_uring_cmd_done() Joanne Koong
2026-03-20 16:45 ` [PATCH v3 0/8] io_uring: add kernel-managed buffer rings Jens Axboe
2026-03-20 17:16 ` Bernd Schubert
2026-03-20 19:20   ` Joanne Koong
2026-03-20 19:45     ` Bernd Schubert
2026-03-20 21:58       ` Joanne Koong
2026-03-20 22:44         ` Bernd Schubert [this message]
2026-03-21  1:19           ` Joanne Koong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f8a1808b-4068-49a7-a17d-8dcab1ae9cdf@bsbernd.com \
    --to=bernd@bsbernd.com \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=csander@purestorage.com \
    --cc=hbirthelmer@ddn.com \
    --cc=hch@infradead.org \
    --cc=io-uring@vger.kernel.org \
    --cc=joannelkoong@gmail.com \
    --cc=krisman@suse.de \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox