From: Bernd Schubert <bernd@bsbernd.com>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: miklos@szeredi.hu, axboe@kernel.dk,
linux-fsdevel@vger.kernel.org, bschubert@ddn.com,
asml.silence@gmail.com, io-uring@vger.kernel.org,
xiaobing.li@samsung.com, csander@purestorage.com,
kernel-team@meta.com
Subject: Re: [PATCH v2 8/8] fuse: support io-uring registered buffers
Date: Sun, 23 Nov 2025 21:12:47 +0100 [thread overview]
Message-ID: <941f06eb-2429-4752-bf56-fbc413da436f@bsbernd.com> (raw)
In-Reply-To: <CAJnrk1YPEDUbOu2N0EjfrkwK3Ge2XrNeaCY0YKL+E1t7Z8Xtvg@mail.gmail.com>
On 11/7/25 00:09, Joanne Koong wrote:
> On Thu, Nov 6, 2025 at 11:48 AM Bernd Schubert <bernd@bsbernd.com> wrote:
>>
>> On 10/27/25 23:28, Joanne Koong wrote:
>>> Add support for io-uring registered buffers for fuse daemons
>>> communicating through the io-uring interface. Daemons may register
>>> buffers ahead of time, which will eliminate the overhead of
>>> pinning/unpinning user pages and translating virtual addresses for every
>>> server-kernel interaction.
>>>
>>> To support page-aligned payloads, the buffer is structured such that the
>>> payload is at the front of the buffer and the fuse_uring_req_header is
>>> offset from the end of the buffer.
>>>
>>> To be backwards compatible, fuse uring still needs to support non-registered
>>> buffers as well.
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>> ---
>>> fs/fuse/dev_uring.c | 200 +++++++++++++++++++++++++++++++++---------
>>> fs/fuse/dev_uring_i.h | 27 +++++-
>>> 2 files changed, 183 insertions(+), 44 deletions(-)
>>>
>>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>>> index c6b22b14b354..f501bc81f331 100644
>>> --- a/fs/fuse/dev_uring.c
>>> +++ b/fs/fuse/dev_uring.c
>>>
>>> +/*
>>> + * Prepare fixed buffer for access. Sets up the payload iter and kmaps the
>>> + * header.
>>> + *
>>> + * Callers must call fuse_uring_unmap_buffer() in the same scope to release the
>>> + * header mapping.
>>> + *
>>> + * For non-fixed buffers, this is a no-op.
>>> + */
>>> +static int fuse_uring_map_buffer(struct fuse_ring_ent *ent)
>>> +{
>>> + size_t header_size = sizeof(struct fuse_uring_req_header);
>>> + struct iov_iter iter;
>>> + struct page *header_page;
>>> + size_t count, start;
>>> + ssize_t copied;
>>> + int err;
>>> +
>>> + if (!ent->fixed_buffer)
>>> + return 0;
>>> +
>>> + err = io_uring_cmd_import_fixed_full(ITER_DEST, &iter, ent->cmd, 0);
>>
>> This seems to be a rather expensive call, especially as it gets
>> called twice (during submit and fetch).
>> Wouldn't be there be a possibility to check if the user buffer changed
>> and then keep the existing iter? I think Caleb had a similar idea
>> in patch 1/8.
>
> I think the best approach is to get rid of the call entirely by
> returning -EBUSY to the server if it tries unregistering the buffers
> while a connection is still alive. Then we would just have to set this
> up once at registration time, and use that for the lifetime of the
> connection. The discussion about this with Pavel is in [1] - I'm
> planning to do this as a separate follow-up.
>
> [1] https://lore.kernel.org/linux-fsdevel/9f0debb1-ce0e-4085-a3fe-0da7a8fd76a6@gmail.com/
>
>>
>>> + if (err)
>>> + return err;
>>> +
>>> + count = iov_iter_count(&iter);
>>> + if (count < header_size || count & (PAGE_SIZE - 1))
>>> + return -EINVAL;
>>
>> || !PAGE_ALIGNED(count)) ?
>
> Nice, I didn't realize this macro existed. Thanks.
>
>>
>>> +
>>> + /* Adjust the payload iter to protect the header from any overwrites */
>>> + ent->payload_iter = iter;
>>> + iov_iter_truncate(&ent->payload_iter, count - header_size);
>>> +
>>> + /* Set up the headers */
>>> + iov_iter_advance(&iter, count - header_size);
>>> + copied = iov_iter_get_pages2(&iter, &header_page, header_size, 1, &start);
>>
>> The iter is later used for the payload, but I miss a reset? iov_iter_revert()?
>
> This iter is separate from the payload iter and doesn't affect the
> payload iter's values because the "ent->payload_iter = iter;"
> assignment above shallow copies that out first.
>
>>
>>> + if (copied < header_size)
>>> + return -EFAULT;
>>> + ent->headers = kmap_local_page(header_page) + start;
>>
>> My plan for the alternative pinning patch (with io-uring) was to let the
>> header be shared by multiple entries. Current libfuse master handles
>> a fixed page size buffer for the payload (prepared page pinning - I
>> didn't expect I was blocked for 9 months on other work), missing is to
>> share it between ring entries.
>> I think this wouldn't work with registered buffer approach - it
>> always needs one full page?
>
> I've been working on the patches for zero-copy and that has required
> the design for registered buffers in this patch to change, namely that
> the payload and the headers must be separated out. For v3, I have them
> separate now.
>>
>> I would also like to discuss dynamic multiple payload sizes per queue.
>> For example to have something like
>>
>> 256 x 4K
>> 8 x 128K
>> 4 x 1M
>
> I think zero-copy might obviate the need for this. The way I have it
> right now, it uses sparse buffers for payloads, which prevents the
> server from needing to allocate the 1M buffer per ent. I'm hoping to
> send out the patches for this as part of v3 at the end of next week or
> next next week.
>
> Thanks,
> joanne
>
>>
>> I think there are currently two ways to do that
>>
>> 1) Sort entries into pools
>> 2) Sort buffers into pools and let entries use these. Here the header
>> would be fixed and payload would come from a pool.
>>
>> With the appraoch to have payload and header in one buffer we couldn't
>> use 2). Using 1) should be fine, though.
>>
>>>
>>> /*
>>> @@ -1249,20 +1358,29 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
>>> {
>>> struct fuse_ring_ent *ent = uring_cmd_to_ring_ent(cmd);
>>> struct fuse_ring_queue *queue = ent->queue;
>>> + bool send_ent = true;
>>> int err;
>>>
>>> - if (!(issue_flags & IO_URING_F_TASK_DEAD)) {
>>> - err = fuse_uring_prepare_send(ent, ent->fuse_req);
>>> - if (err) {
>>> - if (!fuse_uring_get_next_fuse_req(ent, queue))
>>> - return;
>>> - err = 0;
>>> - }
>>> - } else {
>>> - err = -ECANCELED;
>>> + if (issue_flags & IO_URING_F_TASK_DEAD) {
>>> + fuse_uring_send(ent, cmd, -ECANCELED, issue_flags);
>>> + return;
>>> + }
>>> +
>>> + err = fuse_uring_map_buffer(ent);
>>> + if (err) {
>>> + fuse_uring_req_end(ent, ent->fuse_req, err);
>>> + return;
>>
>> I think this needs to abort the connection now. There could be multiple
>> commands on the queue and they would be stuck now and there is no
>> notification to fuse server either.
>
> This approach makes sense to me and makes things a bit simpler. I'll
> add this to v3.
This is a just heads up, while I'm testing sync FUSE_INIT I'm running
in several issues. Part of it needs to be a re-send on -EGAIN and -EINTR
in libfuse - had slipped through so far, but another is related to our
page pinning. I just thought I can easy abort the connection when
that fails, but that results in a double lock, because so far we
assumed IO_URING_F_UNLOCKED in teardown context.
I'm going to submit a patch to teardown in uring task context, so that
we are sure about the flags.
I.e. I think my suggestion to abort here would run into the same
issue as below.
[10375.669761] fuse: FUSE_IO_URING_CMD_REGISTER returned -4
[10375.670632] fuse: FUSE_IO_URING_CMD_REGISTER failed err=-4
[10375.671922] ============================================
[10375.672754] WARNING: possible recursive locking detected
[10375.673577] 6.8.12+ #6 Tainted: G W O
[10375.674377] --------------------------------------------
[10375.675208] fuse-ring-0/7658 is trying to acquire lock:
[10375.676022] ffff8881040910b0 (&ctx->uring_lock){+.+.}-{4:4}, at: io_uring_cmd_done+0x14f/0x210
[10375.677335]
but task is already holding lock:
[10375.678292] ffff8881040910b0 (&ctx->uring_lock){+.+.}-{4:4}, at: __x64_sys_io_uring_enter+0x68a/0xbf0
[10375.679678]
other info that might help us debug this:
[10375.680722] Possible unsafe locking scenario:
[10375.681692] CPU0
[10375.682156] ----
[10375.682633] lock(&ctx->uring_lock);
[10375.683260] lock(&ctx->uring_lock);
[10375.683879]
*** DEADLOCK ***
[10375.684916] May be due to missing lock nesting notation
[10375.685994] 1 lock held by fuse-ring-0/7658:
[10375.686699] #0: ffff8881040910b0 (&ctx->uring_lock){+.+.}-{4:4}, at: __x64_sys_io_uring_enter+0x68a/0xbf0
[10375.688141]
stack backtrace:
[10375.688923] CPU: 0 PID: 7658 Comm: fuse-ring-0 Tainted: G W O 6.8.12+ #6
[10375.690150] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[10375.691534] Call Trace:
[10375.692019] <TASK>
[10375.692445] dump_stack_lvl+0x66/0x90
[10375.695219] __lock_acquire+0x13b8/0x25c0
[10375.695901] lock_acquire+0xb7/0x290
[10375.696507] ? io_uring_cmd_done+0x14f/0x210
[10375.697226] ? __lock_acquire+0x459/0x25c0
[10375.697913] __mutex_lock+0x7d/0xae0
[10375.698530] ? io_uring_cmd_done+0x14f/0x210
[10375.699251] ? io_uring_cmd_done+0x14f/0x210
[10375.699967] ? lock_acquire+0xb7/0x290
[10375.700606] ? fuse_uring_stop_list_entries+0x1c2/0x280 [fuse]
[10375.701508] ? io_uring_cmd_done+0x14f/0x210
[10375.702210] ? lock_release+0x24b/0x390
[10375.702862] io_uring_cmd_done+0x14f/0x210
[10375.703546] fuse_uring_stop_list_entries+0x24f/0x280 [fuse]
[10375.704439] fuse_uring_stop_queues+0xf7/0x240 [fuse]
[10375.705251] fuse_abort_conn+0x3e6/0x3f0 [fuse]
[10375.705998] fuse_uring_cmd+0x8de/0xf40 [fuse]
[10375.706742] ? lock_acquired+0xb1/0x320
[10375.707415] io_uring_cmd+0x6b/0x170
[10375.708080] io_issue_sqe+0x4e/0x460
[10375.708705] io_submit_sqes+0x22e/0x710
[10375.709356] __x64_sys_io_uring_enter+0x696/0xbf0
[10375.710109] do_syscall_64+0x6a/0x130
[10375.710743] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[10375.711541] RIP: 0033:0x7f31a073db95
[10375.712160] Code: 00 00 00 44 89 d0 41 b9 08 00 00 00 83 c8 10 f6 87 d0 00 00 00 01 8b bf cc 00 00 00 44 0f 45 d0 45 31 c0 b8 aa 01 00 00 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 41 83 e2 02 74 c2 f0 48 83 0c 24
[10375.714772] RSP: 002b:00007f319a4f4c88 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
[10375.715946] RAX: ffffffffffffffda RBX: 00007f31992f5060 RCX: 00007f31a073db95
[10375.717000] RDX: 0000000000000001 RSI: 0000000000000008 RDI: 0000000000000006
[10375.718056] RBP: 00007f319a4f4d60 R08: 0000000000000000 R09: 0000000000000008
[10375.719113] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f31992f5000
[10375.720162] R13: 00000fe63325ea00 R14: 0000000000000000 R15: 00007f319a4f4cd0
[10375.721223] </TASK>
Thanks,
Bernd
next prev parent reply other threads:[~2025-11-23 20:12 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-27 22:27 [PATCH v2 0/8] fuse: support io-uring registered buffers Joanne Koong
2025-10-27 22:28 ` [PATCH v2 1/8] io_uring/uring_cmd: add io_uring_cmd_import_fixed_full() Joanne Koong
2025-10-28 1:28 ` Caleb Sander Mateos
2025-10-29 14:01 ` Pavel Begunkov
2025-10-29 18:37 ` Joanne Koong
2025-10-29 19:59 ` Bernd Schubert
2025-10-30 17:42 ` Pavel Begunkov
2025-10-30 18:06 ` Pavel Begunkov
2025-10-30 22:23 ` Bernd Schubert
2025-10-30 23:50 ` Joanne Koong
2025-10-31 10:27 ` Bernd Schubert
2025-10-31 21:19 ` Joanne Koong
2025-10-30 23:13 ` Joanne Koong
2025-10-27 22:28 ` [PATCH v2 2/8] fuse: refactor io-uring logic for getting next fuse request Joanne Koong
2025-10-30 23:07 ` Bernd Schubert
2025-10-27 22:28 ` [PATCH v2 3/8] fuse: refactor io-uring header copying to ring Joanne Koong
2025-10-30 23:15 ` Bernd Schubert
2025-10-30 23:52 ` Joanne Koong
2025-10-27 22:28 ` [PATCH v2 4/8] fuse: refactor io-uring header copying from ring Joanne Koong
2025-10-27 22:28 ` [PATCH v2 5/8] fuse: use enum types for header copying Joanne Koong
2025-11-05 23:01 ` Bernd Schubert
2025-11-06 21:59 ` Joanne Koong
2025-11-07 22:11 ` Bernd Schubert
2025-10-27 22:28 ` [PATCH v2 6/8] fuse: add user_ prefix to userspace headers and payload fields Joanne Koong
2025-10-28 1:32 ` Caleb Sander Mateos
2025-10-28 23:56 ` Joanne Koong
2025-11-06 13:35 ` Bernd Schubert
2025-10-27 22:28 ` [PATCH v2 7/8] fuse: refactor setting up copy state for payload copying Joanne Koong
2025-11-06 16:53 ` Bernd Schubert
2025-11-06 22:01 ` Joanne Koong
2025-10-27 22:28 ` [PATCH v2 8/8] fuse: support io-uring registered buffers Joanne Koong
2025-10-28 1:42 ` Caleb Sander Mateos
2025-10-28 23:56 ` Joanne Koong
2025-11-06 19:48 ` Bernd Schubert
2025-11-06 23:09 ` Joanne Koong
2025-11-07 22:16 ` Bernd Schubert
2025-11-07 22:23 ` Bernd Schubert
2025-11-23 20:12 ` Bernd Schubert [this message]
2025-11-25 1:13 ` Joanne Koong
2025-11-14 23:59 ` [PATCH v2 0/8] " Joanne Koong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=941f06eb-2429-4752-bf56-fbc413da436f@bsbernd.com \
--to=bernd@bsbernd.com \
--cc=asml.silence@gmail.com \
--cc=axboe@kernel.dk \
--cc=bschubert@ddn.com \
--cc=csander@purestorage.com \
--cc=io-uring@vger.kernel.org \
--cc=joannelkoong@gmail.com \
--cc=kernel-team@meta.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=miklos@szeredi.hu \
--cc=xiaobing.li@samsung.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox