From: Joanne Koong <joannelkoong@gmail.com>
To: Bernd Schubert <bernd@bsbernd.com>
Cc: miklos@szeredi.hu, axboe@kernel.dk,
linux-fsdevel@vger.kernel.org, bschubert@ddn.com,
asml.silence@gmail.com, io-uring@vger.kernel.org,
xiaobing.li@samsung.com, csander@purestorage.com,
kernel-team@meta.com
Subject: Re: [PATCH v2 8/8] fuse: support io-uring registered buffers
Date: Mon, 24 Nov 2025 17:13:04 -0800 [thread overview]
Message-ID: <CAJnrk1b3TUgVJz7dTtGjOu4vuSkx2_POh671PpkACZnp9FOkrw@mail.gmail.com> (raw)
In-Reply-To: <941f06eb-2429-4752-bf56-fbc413da436f@bsbernd.com>
On Sun, Nov 23, 2025 at 12:12 PM Bernd Schubert <bernd@bsbernd.com> wrote:
>
>
>
> On 11/7/25 00:09, Joanne Koong wrote:
> > On Thu, Nov 6, 2025 at 11:48 AM Bernd Schubert <bernd@bsbernd.com> wrote:
> >>
> >> On 10/27/25 23:28, Joanne Koong wrote:
> >>> Add support for io-uring registered buffers for fuse daemons
> >>> communicating through the io-uring interface. Daemons may register
> >>> buffers ahead of time, which will eliminate the overhead of
> >>> pinning/unpinning user pages and translating virtual addresses for every
> >>> server-kernel interaction.
> >>>
> >>> To support page-aligned payloads, the buffer is structured such that the
> >>> payload is at the front of the buffer and the fuse_uring_req_header is
> >>> offset from the end of the buffer.
> >>>
> >>> To be backwards compatible, fuse uring still needs to support non-registered
> >>> buffers as well.
> >>>
> >>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> >>> ---
> >>> fs/fuse/dev_uring.c | 200 +++++++++++++++++++++++++++++++++---------
> >>> fs/fuse/dev_uring_i.h | 27 +++++-
> >>> 2 files changed, 183 insertions(+), 44 deletions(-)
> >>>
> >>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> >>> index c6b22b14b354..f501bc81f331 100644
> >>> --- a/fs/fuse/dev_uring.c
> >>> +++ b/fs/fuse/dev_uring.c
> >>>
> >>> +/*
> >>> + * Prepare fixed buffer for access. Sets up the payload iter and kmaps the
> >>> + * header.
> >>> + *
> >>> + * Callers must call fuse_uring_unmap_buffer() in the same scope to release the
> >>> + * header mapping.
> >>> + *
> >>> + * For non-fixed buffers, this is a no-op.
> >>> + */
> >>> +static int fuse_uring_map_buffer(struct fuse_ring_ent *ent)
> >>> +{
> >>> + size_t header_size = sizeof(struct fuse_uring_req_header);
> >>> + struct iov_iter iter;
> >>> + struct page *header_page;
> >>> + size_t count, start;
> >>> + ssize_t copied;
> >>> + int err;
> >>> +
> >>> + if (!ent->fixed_buffer)
> >>> + return 0;
> >>> +
> >>> + err = io_uring_cmd_import_fixed_full(ITER_DEST, &iter, ent->cmd, 0);
> >>
> >> This seems to be a rather expensive call, especially as it gets
> >> called twice (during submit and fetch).
> >> Wouldn't be there be a possibility to check if the user buffer changed
> >> and then keep the existing iter? I think Caleb had a similar idea
> >> in patch 1/8.
> >
> > I think the best approach is to get rid of the call entirely by
> > returning -EBUSY to the server if it tries unregistering the buffers
> > while a connection is still alive. Then we would just have to set this
> > up once at registration time, and use that for the lifetime of the
> > connection. The discussion about this with Pavel is in [1] - I'm
> > planning to do this as a separate follow-up.
> >
> > [1] https://lore.kernel.org/linux-fsdevel/9f0debb1-ce0e-4085-a3fe-0da7a8fd76a6@gmail.com/
> >
> >>
> >>> + if (err)
> >>> + return err;
> >>> +
> >>> + count = iov_iter_count(&iter);
> >>> + if (count < header_size || count & (PAGE_SIZE - 1))
> >>> + return -EINVAL;
> >>
> >> || !PAGE_ALIGNED(count)) ?
> >
> > Nice, I didn't realize this macro existed. Thanks.
> >
> >>
> >>> +
> >>> + /* Adjust the payload iter to protect the header from any overwrites */
> >>> + ent->payload_iter = iter;
> >>> + iov_iter_truncate(&ent->payload_iter, count - header_size);
> >>> +
> >>> + /* Set up the headers */
> >>> + iov_iter_advance(&iter, count - header_size);
> >>> + copied = iov_iter_get_pages2(&iter, &header_page, header_size, 1, &start);
> >>
> >> The iter is later used for the payload, but I miss a reset? iov_iter_revert()?
> >
> > This iter is separate from the payload iter and doesn't affect the
> > payload iter's values because the "ent->payload_iter = iter;"
> > assignment above shallow copies that out first.
> >
> >>
> >>> + if (copied < header_size)
> >>> + return -EFAULT;
> >>> + ent->headers = kmap_local_page(header_page) + start;
> >>
> >> My plan for the alternative pinning patch (with io-uring) was to let the
> >> header be shared by multiple entries. Current libfuse master handles
> >> a fixed page size buffer for the payload (prepared page pinning - I
> >> didn't expect I was blocked for 9 months on other work), missing is to
> >> share it between ring entries.
> >> I think this wouldn't work with registered buffer approach - it
> >> always needs one full page?
> >
> > I've been working on the patches for zero-copy and that has required
> > the design for registered buffers in this patch to change, namely that
> > the payload and the headers must be separated out. For v3, I have them
> > separate now.
> >>
> >> I would also like to discuss dynamic multiple payload sizes per queue.
> >> For example to have something like
> >>
> >> 256 x 4K
> >> 8 x 128K
> >> 4 x 1M
> >
> > I think zero-copy might obviate the need for this. The way I have it
> > right now, it uses sparse buffers for payloads, which prevents the
> > server from needing to allocate the 1M buffer per ent. I'm hoping to
> > send out the patches for this as part of v3 at the end of next week or
> > next next week.
> >
> > Thanks,
> > joanne
> >
> >>
> >> I think there are currently two ways to do that
> >>
> >> 1) Sort entries into pools
> >> 2) Sort buffers into pools and let entries use these. Here the header
> >> would be fixed and payload would come from a pool.
> >>
> >> With the appraoch to have payload and header in one buffer we couldn't
> >> use 2). Using 1) should be fine, though.
> >>
> >>>
> >>> /*
> >>> @@ -1249,20 +1358,29 @@ static void fuse_uring_send_in_task(struct io_uring_cmd *cmd,
> >>> {
> >>> struct fuse_ring_ent *ent = uring_cmd_to_ring_ent(cmd);
> >>> struct fuse_ring_queue *queue = ent->queue;
> >>> + bool send_ent = true;
> >>> int err;
> >>>
> >>> - if (!(issue_flags & IO_URING_F_TASK_DEAD)) {
> >>> - err = fuse_uring_prepare_send(ent, ent->fuse_req);
> >>> - if (err) {
> >>> - if (!fuse_uring_get_next_fuse_req(ent, queue))
> >>> - return;
> >>> - err = 0;
> >>> - }
> >>> - } else {
> >>> - err = -ECANCELED;
> >>> + if (issue_flags & IO_URING_F_TASK_DEAD) {
> >>> + fuse_uring_send(ent, cmd, -ECANCELED, issue_flags);
> >>> + return;
> >>> + }
> >>> +
> >>> + err = fuse_uring_map_buffer(ent);
> >>> + if (err) {
> >>> + fuse_uring_req_end(ent, ent->fuse_req, err);
> >>> + return;
> >>
> >> I think this needs to abort the connection now. There could be multiple
> >> commands on the queue and they would be stuck now and there is no
> >> notification to fuse server either.
> >
> > This approach makes sense to me and makes things a bit simpler. I'll
> > add this to v3.
>
> This is a just heads up, while I'm testing sync FUSE_INIT I'm running
> in several issues. Part of it needs to be a re-send on -EGAIN and -EINTR
> in libfuse - had slipped through so far, but another is related to our
> page pinning. I just thought I can easy abort the connection when
> that fails, but that results in a double lock, because so far we
> assumed IO_URING_F_UNLOCKED in teardown context.
> I'm going to submit a patch to teardown in uring task context, so that
> we are sure about the flags.
>
> I.e. I think my suggestion to abort here would run into the same
> issue as below.
>
>
> [10375.669761] fuse: FUSE_IO_URING_CMD_REGISTER returned -4
> [10375.670632] fuse: FUSE_IO_URING_CMD_REGISTER failed err=-4
>
> [10375.671922] ============================================
> [10375.672754] WARNING: possible recursive locking detected
> [10375.673577] 6.8.12+ #6 Tainted: G W O
> [10375.674377] --------------------------------------------
> [10375.675208] fuse-ring-0/7658 is trying to acquire lock:
> [10375.676022] ffff8881040910b0 (&ctx->uring_lock){+.+.}-{4:4}, at: io_uring_cmd_done+0x14f/0x210
> [10375.677335]
> but task is already holding lock:
> [10375.678292] ffff8881040910b0 (&ctx->uring_lock){+.+.}-{4:4}, at: __x64_sys_io_uring_enter+0x68a/0xbf0
> [10375.679678]
> other info that might help us debug this:
> [10375.680722] Possible unsafe locking scenario:
>
> [10375.681692] CPU0
> [10375.682156] ----
> [10375.682633] lock(&ctx->uring_lock);
> [10375.683260] lock(&ctx->uring_lock);
> [10375.683879]
> *** DEADLOCK ***
>
> [10375.684916] May be due to missing lock nesting notation
>
> [10375.685994] 1 lock held by fuse-ring-0/7658:
> [10375.686699] #0: ffff8881040910b0 (&ctx->uring_lock){+.+.}-{4:4}, at: __x64_sys_io_uring_enter+0x68a/0xbf0
> [10375.688141]
> stack backtrace:
> [10375.688923] CPU: 0 PID: 7658 Comm: fuse-ring-0 Tainted: G W O 6.8.12+ #6
> [10375.690150] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [10375.691534] Call Trace:
> [10375.692019] <TASK>
> [10375.692445] dump_stack_lvl+0x66/0x90
> [10375.695219] __lock_acquire+0x13b8/0x25c0
> [10375.695901] lock_acquire+0xb7/0x290
> [10375.696507] ? io_uring_cmd_done+0x14f/0x210
> [10375.697226] ? __lock_acquire+0x459/0x25c0
> [10375.697913] __mutex_lock+0x7d/0xae0
> [10375.698530] ? io_uring_cmd_done+0x14f/0x210
> [10375.699251] ? io_uring_cmd_done+0x14f/0x210
> [10375.699967] ? lock_acquire+0xb7/0x290
> [10375.700606] ? fuse_uring_stop_list_entries+0x1c2/0x280 [fuse]
> [10375.701508] ? io_uring_cmd_done+0x14f/0x210
> [10375.702210] ? lock_release+0x24b/0x390
> [10375.702862] io_uring_cmd_done+0x14f/0x210
> [10375.703546] fuse_uring_stop_list_entries+0x24f/0x280 [fuse]
> [10375.704439] fuse_uring_stop_queues+0xf7/0x240 [fuse]
> [10375.705251] fuse_abort_conn+0x3e6/0x3f0 [fuse]
> [10375.705998] fuse_uring_cmd+0x8de/0xf40 [fuse]
> [10375.706742] ? lock_acquired+0xb1/0x320
> [10375.707415] io_uring_cmd+0x6b/0x170
> [10375.708080] io_issue_sqe+0x4e/0x460
> [10375.708705] io_submit_sqes+0x22e/0x710
> [10375.709356] __x64_sys_io_uring_enter+0x696/0xbf0
> [10375.710109] do_syscall_64+0x6a/0x130
> [10375.710743] entry_SYSCALL_64_after_hwframe+0x4b/0x53
> [10375.711541] RIP: 0033:0x7f31a073db95
> [10375.712160] Code: 00 00 00 44 89 d0 41 b9 08 00 00 00 83 c8 10 f6 87 d0 00 00 00 01 8b bf cc 00 00 00 44 0f 45 d0 45 31 c0 b8 aa 01 00 00 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 41 83 e2 02 74 c2 f0 48 83 0c 24
> [10375.714772] RSP: 002b:00007f319a4f4c88 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
> [10375.715946] RAX: ffffffffffffffda RBX: 00007f31992f5060 RCX: 00007f31a073db95
> [10375.717000] RDX: 0000000000000001 RSI: 0000000000000008 RDI: 0000000000000006
> [10375.718056] RBP: 00007f319a4f4d60 R08: 0000000000000000 R09: 0000000000000008
> [10375.719113] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f31992f5000
> [10375.720162] R13: 00000fe63325ea00 R14: 0000000000000000 R15: 00007f319a4f4cd0
> [10375.721223] </TASK>
>
Thanks for the update, Bernd!
>
> Thanks,
> Bernd
>
next prev parent reply other threads:[~2025-11-25 1:13 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-27 22:27 [PATCH v2 0/8] fuse: support io-uring registered buffers Joanne Koong
2025-10-27 22:28 ` [PATCH v2 1/8] io_uring/uring_cmd: add io_uring_cmd_import_fixed_full() Joanne Koong
2025-10-28 1:28 ` Caleb Sander Mateos
2025-10-29 14:01 ` Pavel Begunkov
2025-10-29 18:37 ` Joanne Koong
2025-10-29 19:59 ` Bernd Schubert
2025-10-30 17:42 ` Pavel Begunkov
2025-10-30 18:06 ` Pavel Begunkov
2025-10-30 22:23 ` Bernd Schubert
2025-10-30 23:50 ` Joanne Koong
2025-10-31 10:27 ` Bernd Schubert
2025-10-31 21:19 ` Joanne Koong
2025-10-30 23:13 ` Joanne Koong
2025-10-27 22:28 ` [PATCH v2 2/8] fuse: refactor io-uring logic for getting next fuse request Joanne Koong
2025-10-30 23:07 ` Bernd Schubert
2025-10-27 22:28 ` [PATCH v2 3/8] fuse: refactor io-uring header copying to ring Joanne Koong
2025-10-30 23:15 ` Bernd Schubert
2025-10-30 23:52 ` Joanne Koong
2025-10-27 22:28 ` [PATCH v2 4/8] fuse: refactor io-uring header copying from ring Joanne Koong
2025-10-27 22:28 ` [PATCH v2 5/8] fuse: use enum types for header copying Joanne Koong
2025-11-05 23:01 ` Bernd Schubert
2025-11-06 21:59 ` Joanne Koong
2025-11-07 22:11 ` Bernd Schubert
2025-10-27 22:28 ` [PATCH v2 6/8] fuse: add user_ prefix to userspace headers and payload fields Joanne Koong
2025-10-28 1:32 ` Caleb Sander Mateos
2025-10-28 23:56 ` Joanne Koong
2025-11-06 13:35 ` Bernd Schubert
2025-10-27 22:28 ` [PATCH v2 7/8] fuse: refactor setting up copy state for payload copying Joanne Koong
2025-11-06 16:53 ` Bernd Schubert
2025-11-06 22:01 ` Joanne Koong
2025-10-27 22:28 ` [PATCH v2 8/8] fuse: support io-uring registered buffers Joanne Koong
2025-10-28 1:42 ` Caleb Sander Mateos
2025-10-28 23:56 ` Joanne Koong
2025-11-06 19:48 ` Bernd Schubert
2025-11-06 23:09 ` Joanne Koong
2025-11-07 22:16 ` Bernd Schubert
2025-11-07 22:23 ` Bernd Schubert
2025-11-23 20:12 ` Bernd Schubert
2025-11-25 1:13 ` Joanne Koong [this message]
2025-11-14 23:59 ` [PATCH v2 0/8] " Joanne Koong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAJnrk1b3TUgVJz7dTtGjOu4vuSkx2_POh671PpkACZnp9FOkrw@mail.gmail.com \
--to=joannelkoong@gmail.com \
--cc=asml.silence@gmail.com \
--cc=axboe@kernel.dk \
--cc=bernd@bsbernd.com \
--cc=bschubert@ddn.com \
--cc=csander@purestorage.com \
--cc=io-uring@vger.kernel.org \
--cc=kernel-team@meta.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=miklos@szeredi.hu \
--cc=xiaobing.li@samsung.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox