From: Bernd Schubert <[email protected]>
To: Miklos Szeredi <[email protected]>
Cc: Jens Axboe <[email protected]>,
Pavel Begunkov <[email protected]>,
[email protected], [email protected],
Joanne Koong <[email protected]>,
Josef Bacik <[email protected]>,
Amir Goldstein <[email protected]>,
Ming Lei <[email protected]>, David Wei <[email protected]>,
[email protected], Bernd Schubert <[email protected]>
Subject: [PATCH RFC v5 00/16] fuse: fuse-over-io-uring
Date: Thu, 07 Nov 2024 18:03:44 +0100
Message-ID: <[email protected]>
This adds support for io-uring communication between the kernel and the
userspace daemon using the opcode IORING_OP_URING_CMD. The basic
approach was taken from ublk. The patches are in RFC state;
some major changes are still to be expected.
The motivation for these patches is to increase fuse performance.
With fuse-over-io-uring, requests avoid core switching (application
on core X, processing by the fuse server on a random core Y) and use
shared memory between kernel and userspace to transfer data.
Similar approaches have been taken by ZUFS and FUSE2, though
not over io-uring but through ioctl IOs:
https://lwn.net/Articles/756625/
https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/log/?h=fuse2
Avoiding cache line bouncing / NUMA systems was discussed
between Amir and Miklos before, and Miklos posted
part of the private discussion here:
https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/
This cache line bouncing should be reduced by these patches, as
a) Switching between kernel and userspace is reduced by 50%,
   as the request fetch (by read) and result commit (by write) are replaced
   by a single commit-and-fetch command.
b) Submitting via the ring can avoid context switches altogether.
   Note: As of now, userspace still needs to transition to the kernel to
   submit the result, though it might be possible to avoid that as well,
   for example either with IORING_SETUP_SQPOLL (basic testing did not
   show a performance advantage for now), or the task that is submitting
   fuse requests to the ring could also poll for results (needs
   additional work).
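
The 50% figure in a) can be illustrated with a small counting model. This is only a sketch: the function names are hypothetical and not kernel or libfuse APIs, and it assumes one initial fetch per ring entry followed by one combined commit-and-fetch per request.

```python
# Hypothetical counting model, not kernel or libfuse code.

def transitions_dev_fuse(n_requests: int) -> int:
    # /dev/fuse: one read() to fetch each request plus one write() to commit it
    return 2 * n_requests


def transitions_uring(n_requests: int) -> int:
    # io-uring: one initial fetch, then each commit also fetches the next request
    return 1 + n_requests


def reduction(n_requests: int) -> float:
    # Fraction of kernel/userspace transitions saved by the combined command
    return 1 - transitions_uring(n_requests) / transitions_dev_fuse(n_requests)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(f"{n:>5} requests: {reduction(n):.1%} fewer transitions")
```

Under these assumptions the saving approaches 50% as the number of requests grows, since the single initial fetch is amortized.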
I had also noticed waitq wake-up latencies in fuse before
https://lore.kernel.org/lkml/[email protected]/T/
The spinning approach proposed there helped with performance (>40%
improvement for file creates), but due to random server-side thread/core
utilization, spinning cannot be well controlled in /dev/fuse mode.
With fuse-over-io-uring, requests are handled on the same core
(sync requests) or on core+1 (large async requests), and performance
improvements are achieved without spinning.
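
The core placement described above can be sketched as follows. This is an illustration only: the function name is hypothetical, the wrap-around on the last core is an assumption, and what counts as a "large" async request is left to the caller.

```python
# Hypothetical illustration of the placement rule, not the kernel implementation.

def pick_queue_core(submitting_core: int, n_cores: int, large_async: bool) -> int:
    """Return the core whose ring queue would handle the request."""
    if large_async:
        # Large async requests go to the neighbouring core.
        # Assumption: wrap around to core 0 after the last core.
        return (submitting_core + 1) % n_cores
    # Sync requests stay on the core of the submitting application.
    return submitting_core


if __name__ == "__main__":
    print(pick_queue_core(3, 32, large_async=False))
    print(pick_queue_core(3, 32, large_async=True))
```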
Splice/zero-copy is not supported yet. Ming Lei is working
on io-uring support for ublk_drv; we can probably also use
that approach for fuse and get better zero copy than splice:
https://lore.kernel.org/io-uring/[email protected]/
RFCv1 and RFCv2 have been tested with multiple xfstest runs in a VM
(32 cores) with a kernel that has several debug options
enabled (like KASAN and MSAN). RFCv3 is not that well tested yet.
O_DIRECT is currently not working well with /dev/fuse, nor
with these patches. A patch has been submitted to fix that (although
the approach was rejected):
https://www.spinics.net/lists/linux-fsdevel/msg280028.html
Up to RFCv2, a nice effect in io-uring mode was that xfstests ran faster
(like generic/522: ~2400s with /dev/fuse vs. ~1600s patched), though still
slowly, as this is with ASAN/leak-detection/etc.
With RFCv3 and mmap removed, the overall run time is approximately the same,
though some optimizations are removed in RFCv3, like submitting to
the ring from the task that created the fuse request (hence, without
io_uring_cmd_complete_in_task()).
The corresponding libfuse patches are on my uring branch
but need cleanup for submission, which will happen during the next
few days.
https://github.com/bsbernd/libfuse/tree/uring
Testing with that libfuse branch is possible by running something
like:
    example/passthrough_hp -o allow_other --debug-fuse --nopassthrough \
        --uring --uring-per-core-queue --uring-fg-depth=1 --uring-bg-depth=1 \
        /scratch/source /scratch/dest
With the --debug-fuse option one should see "cqe" in the request type
if requests are received via io-uring:
    cqe unique: 4, opcode: GETATTR (3), nodeid: 1, insize: 16, pid: 7060
        unique: 4, result=104
Without the --uring option "cqe" is replaced by the default "dev":
    dev unique: 4, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 7117
        unique: 4, success, outsize: 120
TODO list for next RFC version
- make the buffer layout exactly the same as /dev/fuse IO
- different request size - a large ring queue size currently needs
too much memory, even if most of the queue size is needed for small
IOs
Future work
- zero copy
I had run quite a few benchmarks with linux-6.2 before LSFMMBPF2023,
which resulted in some tuning patches (at the end of the
patch series).
Benchmark results (with RFC v1)
=======================================
The system used for the benchmark is a 32-core (HyperThreading enabled)
Xeon E5-2650 system. I don't have local disks attached that could do
>5GB/s IOs, so for paged and dio results a patched version of passthrough-hp
was used that bypasses final reads/writes.
paged reads
-----------
       128K IO size                 1024K IO size
jobs   /dev/fuse   uring   gain     /dev/fuse   uring   gain
1           1117    1921   1.72          1902    1942   1.02
2           2502    3527   1.41          3066    3260   1.06
4           5052    6125   1.21          5994    6097   1.02
8           6273   10855   1.73          7101   10491   1.48
16          6373   11320   1.78          7660   11419   1.49
24          6111    9015   1.48          7600    9029   1.19
32          5725    7968   1.39          6986    7961   1.14
dio reads (1024K)
-----------------
jobs   /dev/fuse   uring   gain
1           2023    3998   2.42
2           3375    7950   2.83
4           3823   15022   3.58
8           7796   22591   2.77
16          8520   27864   3.27
24          8361   20617   2.55
32          8717   12971   1.55
mmap reads (4K)
---------------
(Sequential; I probably should have made it random. Sequential access exposes
a rather interesting/weird 'optimized' memcpy issue: sequential reads become
reverse-order 4K reads.)
https://lore.kernel.org/linux-fsdevel/[email protected]/
jobs   /dev/fuse   uring   gain
1            130     323   2.49
2            219     538   2.46
4            503    1040   2.07
8           1472    2039   1.38
16          2191    3518   1.61
24          2453    4561   1.86
32          2178    5628   2.58
(Results on request. Setting MAP_HUGETLB much improves performance
for both; io-uring mode then has only a slight advantage.)
creates/s
----------
threads   /dev/fuse    uring   gain
1              3944    10121   2.57
2              8580    24524   2.86
4             16628    44426   2.67
8             46746    56716   1.21
16            79740   102966   1.29
20            80284   119502   1.49
(the gain drop with >=8 cores needs to be investigated)
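
For reference, the gain column is the io-uring figure divided by the /dev/fuse figure. A minimal sketch that recomputes it from the creates/s numbers above (the variable and function names are illustrative only):

```python
# creates/s figures copied from the table above: threads -> (/dev/fuse, uring)
CREATES_PER_SEC = {
    1: (3944, 10121),
    2: (8580, 24524),
    4: (16628, 44426),
    8: (46746, 56716),
    16: (79740, 102966),
    20: (80284, 119502),
}


def gain(dev_fuse: float, uring: float) -> float:
    """Speedup of io-uring mode relative to /dev/fuse, rounded to two decimals."""
    return round(uring / dev_fuse, 2)


gains = {
    threads: gain(dev_fuse, uring)
    for threads, (dev_fuse, uring) in CREATES_PER_SEC.items()
}

if __name__ == "__main__":
    for threads, g in gains.items():
        print(f"{threads:>2} threads: gain {g:.2f}")
```

The same ratio can be applied to the other tables to see where the advantage of io-uring mode narrows as the job count grows.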
Jens had done some benchmarks with v3 and noticed only a
25% improvement and half the CPU time usage, but v3
removes several optimizations (like waking on the same core
and avoiding io_uring_cmd_done in an extra task context).
These optimizations will be submitted once the core work
is merged.
Signed-off-by: Bernd Schubert <[email protected]>
---
Changes in v5:
- Main focus in v5 is the separation of headers from payload,
  which required introducing 'struct fuse_zero_in'.
- Addressed several teardown issues that were a regression in v4.
- Fixed "BUG: sleeping function called" due to allocation while
  holding a lock, reported by David Wei
- Fixed function comment reported by kernel test robot
- Fixed set-but-unused variable reported by kernel test robot
- Link to v4: https://lore.kernel.org/r/[email protected]
Changes in v4:
- Removal of ioctls, all configuration is done dynamically
  on the arrival of FUSE_URING_REQ_FETCH
- ring entries are not (and cannot be without config ioctls)
  allocated as array of the ring/queue - removal of the tag
  variable. Finding ring entries on FUSE_URING_REQ_COMMIT_AND_FETCH
  is more cumbersome now and needs an almost unused
  struct fuse_pqueue per fuse_ring_queue and uses the unique
  id of fuse requests.
- No device clones needed to work around hanging mounts
  on fuse-server/daemon termination, handled by IO_URING_F_CANCEL
- Removal of sync/async ring entry types
- Addressed some of Joanne's comments, but probably not all
- Only very basic tests run for v3, as more updates should follow quickly.
Changes in v3:
- Removed the __wake_on_current_cpu optimization (for now,
  as that needs to go through another subsystem/tree);
  removing it means a significant performance drop
- Removed MMAP (Miklos)
- Switched to two IOCTLs, instead of one ioctl that had a field
  for subcommands (ring and queue config) (Miklos)
- The ring entry state is a single state and not a bitmask anymore
  (Josef)
- Addressed several other comments from Josef (I need to go over
  the RFCv2 review again, I'm not sure if everything is addressed
  already)
- Link to v3: https://lore.kernel.org/r/20240901-b4-fuse-uring-rfcv3-without-mmap-v3-0-9207f7391444@ddn.com
- Link to v2: https://lore.kernel.org/all/[email protected]/
- Link to v1: https://lore.kernel.org/r/[email protected]
---
Bernd Schubert (15):
      fuse: rename to fuse_dev_end_requests and make non-static
      fuse: Move fuse_get_dev to header file
      fuse: Move request bits
      fuse: Add fuse-io-uring design documentation
      fuse: make args->in_args[0] to be always the header
      fuse: {uring} Handle SQEs - register commands
      fuse: Make fuse_copy non static
      fuse: Add fuse-io-uring handling into fuse_copy
      fuse: {uring} Add uring sqe commit and fetch support
      fuse: {uring} Handle teardown of ring entries
      fuse: {uring} Add a ring queue and send method
      fuse: {uring} Allow to queue to the ring
      fuse: {uring} Handle IO_URING_F_TASK_DEAD
      fuse: {io-uring} Prevent mount point hang on fuse-server termination
      fuse: enable fuse-over-io-uring
Pavel Begunkov (1):
      io_uring/cmd: let cmds to know about dying task
 Documentation/filesystems/fuse-io-uring.rst |  101 +++
 fs/fuse/Kconfig                             |   12 +
 fs/fuse/Makefile                            |    1 +
 fs/fuse/dax.c                               |   13 +-
 fs/fuse/dev.c                               |  174 ++--
 fs/fuse/dev_uring.c                         | 1208 +++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h                       |  191 +++++
 fs/fuse/dir.c                               |   41 +-
 fs/fuse/fuse_dev_i.h                        |   64 ++
 fs/fuse/fuse_i.h                            |   21 +
 fs/fuse/inode.c                             |    5 +-
 fs/fuse/xattr.c                             |    9 +-
 include/linux/io_uring_types.h              |    1 +
 include/uapi/linux/fuse.h                   |   57 ++
 io_uring/uring_cmd.c                        |    6 +-
 15 files changed, 1827 insertions(+), 77 deletions(-)
---
base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd
change-id: 20241015-fuse-uring-for-6-10-rfc4-61d0fc6851f8
Best regards,
--
Bernd Schubert <[email protected]>