public inbox for [email protected]
* [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
@ 2024-05-29 18:00 Bernd Schubert
From: Bernd Schubert @ 2024-05-29 18:00 UTC (permalink / raw)
  To: Miklos Szeredi, Amir Goldstein, linux-fsdevel, Bernd Schubert,
	bernd.schubert
  Cc: Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra,
	Andrei Vagin, io-uring

From: Bernd Schubert <[email protected]>

This adds support for uring communication between the kernel and the
userspace daemon using the opcode IORING_OP_URING_CMD. The basic
approach was taken from ublk. The patches are in RFC state;
some major changes are still to be expected.

The motivation for these patches is to increase fuse performance.
With fuse-over-io-uring, requests avoid core switching (application
on core X, processing by the fuse server on a random core Y) and use
shared memory between kernel and userspace to transfer data.
Similar approaches have been taken by ZUFS and FUSE2, though
not over io-uring, but through ioctl IOs:

https://lwn.net/Articles/756625/
https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/log/?h=fuse2

Avoiding cache line bouncing / NUMA systems was discussed
between Amir and Miklos before, and Miklos had posted
part of the private discussion here:
https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/

This cache line bouncing should be addressed by these patches
as well.

I had also noticed waitq wake-up latencies in fuse before
https://lore.kernel.org/lkml/[email protected]/T/

This spinning approach helped with performance (>40% improvement
for file creates), but due to random server-side thread/core utilization,
spinning cannot be well controlled in /dev/fuse mode.
With fuse-over-io-uring, requests are handled on the same core
(sync requests) or on core+1 (large async requests), and performance
improvements are achieved without spinning.
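
The queue selection described above might look roughly like the sketch
below. It is a userspace illustration only, not code from the series:
pick_qid(), its parameters, the one-queue-per-core mapping, and the
wrap-around at the last queue are all assumptions made for the example.
The size threshold stands in for the minimum io-size hinted at by the
patch title "Set a min cpu offset io-size for reads/writes".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: choose the ring queue for a request.
 * Assumes one queue per core, so qid == core number.
 */
static unsigned int pick_qid(unsigned int cpu, unsigned int nr_queues,
                             bool is_async, size_t io_size,
                             size_t min_async_size)
{
    assert(nr_queues > 0 && cpu < nr_queues);

    /* Sync requests and small async requests stay on the submitting core. */
    if (!is_async || io_size < min_async_size)
        return cpu;

    /* Large async requests go to core + 1; the wrap-around is an assumption. */
    return (cpu + 1) % nr_queues;
}
```

With 8 queues and a 4096 byte threshold, a 1 MiB async read submitted
on core 3 would land on queue 4, while a sync request from the same
core stays on queue 3.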

Splice/zero-copy is not supported yet. Ming Lei is working
on io-uring support for ublk_drv, but I think there is
no final agreement on the approach to be taken yet.
Fuse-over-io-uring runs significantly faster than reads/writes
over /dev/fuse, even with splice enabled, so missing zc
should not be a blocking issue.

The patches have been tested with multiple xfstests runs in a VM
(32 cores) with a kernel that has several debug options
enabled (like KASAN and MSAN).
For some tests xfstests reports that O_DIRECT is not supported;
I need to investigate that. The interesting part is that exactly
these tests fail in plain /dev/fuse posix mode. I had to disable
generic/650, which enables/disables cpu cores. Given that ring
threads are bound to cores, issues with that are not totally
unexpected, but there are also (scheduler) kernel messages saying
that core binding for these threads is removed. This needs
to be investigated further.
A nice effect in io-uring mode is that tests run faster (like
generic/522: ~2400s with /dev/fuse vs. ~1600s patched), though they
are still slow as this is with ASAN/leak-detection/etc.

The corresponding libfuse patches are on my uring branch,
but need cleanup for submission; that will happen over the next
few days.
https://github.com/bsbernd/libfuse/tree/uring

In case it makes review easier, the patches posted here are on
this branch:
https://github.com/bsbernd/linux/tree/fuse-uring-for-6.9-rfc2

TODO list for next RFC versions
- Let the ring configure ioctl return information, like mmap/queue-buf size
- Request kernel side address and len for a request - avoid calculation in userspace?
- multiple IO sizes per queue (avoiding a calculation in userspace is probably even
  more important)
- FUSE_INTERRUPT handling?
- Logging (adds fields in the ioctl and also ring-request);
  any mismatch between client and server is currently very hard to understand
  through error codes

Future work
- notifications, probably on their own ring
- zero copy

I had run quite a few benchmarks with linux-6.2 before LSFMMBPF2023,
which resulted in some tuning patches (at the end of the
patch series).

Some benchmark results
======================

The system used for the benchmark is a 32 core (HyperThreading enabled)
Xeon E5-2650 system. I don't have local disks attached that could do
more than 5GB/s IOs, so for the paged and dio results a patched version
of passthrough-hp was used that bypasses the final reads/writes.

paged reads
-----------
            128K IO size                      1024K IO size
jobs   /dev/fuse     uring    gain     /dev/fuse    uring   gain
 1        1117        1921    1.72        1902       1942   1.02
 2        2502        3527    1.41        3066       3260   1.06
 4        5052        6125    1.21        5994       6097   1.02
 8        6273       10855    1.73        7101      10491   1.48
16        6373       11320    1.78        7660      11419   1.49
24        6111        9015    1.48        7600       9029   1.19
32        5725        7968    1.39        6986       7961   1.14

dio reads (1024K)
-----------------

jobs   /dev/fuse     uring    gain
 1        2023        3998    2.42
 2        3375        7950    2.83
 4        3823       15022    3.58
 8        7796       22591    2.77
16        8520       27864    3.27
24        8361       20617    2.55
32        8717       12971    1.55

mmap reads (4K)
---------------
(Sequential. I probably should have made it random; sequential exposes
a rather interesting/weird 'optimized' memcpy issue, where sequential
access becomes a reversed-order 4K read.)
https://lore.kernel.org/linux-fsdevel/[email protected]/

jobs  /dev/fuse     uring    gain
1       130          323     2.49
2       219          538     2.46
4       503         1040     2.07
8       1472        2039     1.38
16      2191        3518     1.61
24      2453        4561     1.86
32      2178        5628     2.58

(Results on request. Setting MAP_HUGETLB improves performance a lot
for both; io-uring mode then has only a slight advantage.)

creates/s
----------
threads /dev/fuse     uring   gain
1          3944       10121   2.57
2          8580       24524   2.86
4         16628       44426   2.67
8         46746       56716   1.21
16        79740      102966   1.29
20        80284      119502   1.49

(the gain drop with >=8 cores needs to be investigated)
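
For readers who want to re-check the tables: the gain column appears to
be the uring value divided by the /dev/fuse value of the same row, which
matches the creates/s rows above. The helper below is only an
illustration of that arithmetic; gain_x100() is a hypothetical name and
it returns hundredths to avoid floating point.

```c
#include <assert.h>

/*
 * Gain in hundredths (257 means 2.57), rounded to nearest.
 * Both arguments are the values from one table row.
 */
static unsigned long gain_x100(unsigned long dev_fuse, unsigned long uring)
{
    assert(dev_fuse > 0);
    return (uring * 100 + dev_fuse / 2) / dev_fuse;
}
```

For example, gain_x100(3944, 10121) corresponds to the single-thread
creates/s row.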

Remaining TODO list for RFCv3:
--------------------------------
1) Let the ring configure ioctl return information,
like mmap/queue-buf size

Right now libfuse and kernel have lots of duplicated setup code,
and any kind of pointer/offset mismatch results in a non-working
ring that is hard to debug. It is probably better when the kernel does
the calculations and returns the results to the server side.

2) In combination with 1, ring requests should retrieve their
userspace address and length from kernel side instead of
calculating it through the mmaped queue buffer on their own.
(Introduction of FUSE_URING_BUF_ADDR_FETCH)
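
To illustrate the kind of userspace calculation items 1 and 2 want to
get rid of, here is a minimal sketch. The layout (an array of equally
sized request buffers inside the mmaped queue buffer), struct queue_buf
and ring_entry_addr() are assumptions for the example and are not taken
from libfuse or the kernel patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of one mmaped queue buffer. */
struct queue_buf {
    char *base;         /* start of the mmaped region */
    size_t entry_size;  /* bytes per ring entry, must match the kernel */
    size_t nr_entries;  /* ring depth */
};

/* Userspace address of the request buffer that belongs to entry idx. */
static char *ring_entry_addr(const struct queue_buf *q, size_t idx)
{
    assert(idx < q->nr_entries);
    return q->base + idx * q->entry_size;
}
```

If entry_size differs between server and kernel, every entry after the
first points to the wrong place, which is the kind of mismatch that is
hard to debug today.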

3) Add log buffer into the ioctl and ring-request

This is to provide better error messages (instead of just
errno)

4) Multiple IO sizes per queue

Small IOs and metadata requests do not need large buffer sizes,
so we need multiple IO sizes per queue.

5) FUSE_INTERRUPT handling

These are not handled yet; the kernel side is probably not difficult
anymore, as ring entries take fuse requests through lists.

Long term TODO:
--------------
Notifications through io-uring, maybe with a separated ring,
but I'm not sure yet.

Changes since RFCv1
-------------------
- No need to hold the task of the server side anymore.  Also no
  ioctls/threads waiting for shutdown anymore.  Shutdown now works
  more like the traditional fuse way.
- Each queue clones the fuse device, and device release makes an
  exception for io-uring. The reason is that queued IORING_OP_URING_CMD
  commands (through .uring_cmd) prevent a device release. I.e. a killed
  server side typically triggers fuse_abort_conn(). This was the
  reason for the async stop-monitor in v1 and the reference on the
  daemon task. However, it was very racy, which Miklos pointed out
  immediately.
- In v1 the offset parameter to mmap identified the QID. In v2 the
  server side is expected to call mmap from a core-bound ring thread
  in numa mode, and the numa node is taken from the core of that thread.
  The kernel side of the mmap buffer is stored in an rbtree and assigned
  to the right qid through an additional queue ioctl.
- Release of IORING_OP_URING_CMD is done through lists now, instead
  of iterating over the entire array of queues/entries and does not
  depend on the entry state anymore (a bit of the state is still left
  for sanity check).
- Finding free ring queue entries is done through lists and not through
  a bitmap anymore
- Many other code changes and bug fixes
- Performance tunings
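
As a rough illustration of the list-based lookup of free ring entries
mentioned in the changelog above, the sketch below keeps free entries on
a singly linked list so that taking one is O(1) and needs no bitmap
scan. It is not the kernel code: the kernel would use struct list_head,
and struct ring_ent, struct ring_queue, ent_get() and ent_release() are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ring entry, linked into the free list while unused. */
struct ring_ent {
    struct ring_ent *next;
    unsigned int tag;
};

struct ring_queue {
    struct ring_ent *free_list; /* entries available for a new request */
};

/* Put an entry back on the free list. */
static void ent_release(struct ring_queue *q, struct ring_ent *ent)
{
    assert(ent != NULL);
    ent->next = q->free_list;
    q->free_list = ent;
}

/* Take a free entry in O(1), or return NULL if all entries are busy. */
static struct ring_ent *ent_get(struct ring_queue *q)
{
    struct ring_ent *ent = q->free_list;

    if (ent) {
        q->free_list = ent->next;
        ent->next = NULL;
    }
    return ent;
}
```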

---
Bernd Schubert (19):
      fuse: rename to fuse_dev_end_requests and make non-static
      fuse: Move fuse_get_dev to header file
      fuse: Move request bits
      fuse: Add fuse-io-uring design documentation
      fuse: Add a uring config ioctl
      Add a vmalloc_node_user function
      fuse uring: Add an mmap method
      fuse: Add the queue configuration ioctl
      fuse: {uring} Add a dev_release exception for fuse-over-io-uring
      fuse: {uring} Handle SQEs - register commands
      fuse: Add support to copy from/to the ring buffer
      fuse: {uring} Add uring sqe commit and fetch support
      fuse: {uring} Handle uring shutdown
      fuse: {uring} Allow to queue to the ring
      export __wake_on_current_cpu
      fuse: {uring} Wake requests on the the current cpu
      fuse: {uring} Send async requests to qid of core + 1
      fuse: {uring} Set a min cpu offset io-size for reads/writes
      fuse: {uring} Optimize async sends

 Documentation/filesystems/fuse-io-uring.rst |  167 ++++
 fs/fuse/Kconfig                             |   12 +
 fs/fuse/Makefile                            |    1 +
 fs/fuse/dev.c                               |  310 +++++--
 fs/fuse/dev_uring.c                         | 1232 +++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h                       |  395 +++++++++
 fs/fuse/file.c                              |   15 +-
 fs/fuse/fuse_dev_i.h                        |   67 ++
 fs/fuse/fuse_i.h                            |    9 +
 fs/fuse/inode.c                             |    3 +
 include/linux/vmalloc.h                     |    1 +
 include/uapi/linux/fuse.h                   |  135 +++
 kernel/sched/wait.c                         |    1 +
 mm/nommu.c                                  |    6 +
 mm/vmalloc.c                                |   41 +-
 15 files changed, 2330 insertions(+), 65 deletions(-)
---
base-commit: dd5a440a31fae6e459c0d6271dddd62825505361
change-id: 20240529-fuse-uring-for-6-9-rfc2-out-f0a009005fdf

Best regards,
-- 
Bernd Schubert <[email protected]>


Thread overview: 56+ messages
2024-05-29 18:00 [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Bernd Schubert
2024-05-29 18:00 ` [PATCH RFC v2 19/19] fuse: {uring} Optimize async sends Bernd Schubert
2024-05-31 16:24   ` Jens Axboe
2024-05-31 17:36     ` Bernd Schubert
2024-05-31 19:10       ` Jens Axboe
2024-06-01 16:37         ` Bernd Schubert
2024-05-30  7:07 ` [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Amir Goldstein
2024-05-30 12:09   ` Bernd Schubert
2024-05-30 15:36 ` Kent Overstreet
2024-05-30 16:02   ` Bernd Schubert
2024-05-30 16:10     ` Kent Overstreet
2024-05-30 16:17       ` Bernd Schubert
2024-05-30 17:30         ` Kent Overstreet
2024-05-30 19:09         ` Josef Bacik
2024-05-30 20:05           ` Kent Overstreet
2024-05-31  3:53         ` [PATCH] fs: sys_ringbuffer() (WIP) Kent Overstreet
2024-05-31 13:11           ` kernel test robot
2024-05-31 15:49           ` kernel test robot
2024-05-30 16:21     ` [PATCH RFC v2 00/19] fuse: fuse-over-io-uring Jens Axboe
2024-05-30 16:32       ` Bernd Schubert
2024-05-30 17:26         ` Jens Axboe
2024-05-30 17:16       ` Kent Overstreet
2024-05-30 17:28         ` Jens Axboe
2024-05-30 17:58           ` Kent Overstreet
2024-05-30 18:48             ` Jens Axboe
2024-05-30 19:35               ` Kent Overstreet
2024-05-31  0:11                 ` Jens Axboe
2024-06-04 23:45       ` Ming Lei
2024-05-30 20:47 ` Josef Bacik
2024-06-11  8:20 ` Miklos Szeredi
2024-06-11 10:26   ` Bernd Schubert
2024-06-11 15:35     ` Miklos Szeredi
2024-06-11 17:37       ` Bernd Schubert
2024-06-11 23:35         ` Kent Overstreet
2024-06-12 13:53           ` Bernd Schubert
2024-06-12 14:19             ` Kent Overstreet
2024-06-12 15:40               ` Bernd Schubert
2024-06-12 15:55                 ` Kent Overstreet
2024-06-12 16:15                   ` Bernd Schubert
2024-06-12 16:24                     ` Kent Overstreet
2024-06-12 16:44                       ` Bernd Schubert
2024-06-12  7:39         ` Miklos Szeredi
2024-06-12 13:32           ` Bernd Schubert
2024-06-12 13:46             ` Bernd Schubert
2024-06-12 14:07             ` Miklos Szeredi
2024-06-12 14:56               ` Bernd Schubert
2024-08-02 23:03                 ` Bernd Schubert
2024-08-29 22:32                 ` Bernd Schubert
2024-08-30 13:12                   ` Jens Axboe
2024-08-30 13:28                     ` Bernd Schubert
2024-08-30 13:33                       ` Jens Axboe
2024-08-30 14:55                         ` Pavel Begunkov
2024-08-30 15:10                           ` Bernd Schubert
2024-08-30 20:08                           ` Jens Axboe
2024-08-31  0:02                             ` Bernd Schubert
2024-08-31  0:49                               ` Bernd Schubert
