* [PATCH RFC v3 01/17] fuse: rename to fuse_dev_end_requests and make non-static
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
@ 2024-09-01 13:36 ` Bernd Schubert
2024-09-01 13:36 ` [PATCH RFC v3 02/17] fuse: Move fuse_get_dev to header file Bernd Schubert
` (16 subsequent siblings)
17 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:36 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
This function is needed by fuse_uring.c to clean ring queues,
so make it non-static. As a non-static function, the generic name
'end_requests' should be prefixed with fuse_.
Signed-off-by: Bernd Schubert <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
---
fs/fuse/dev.c | 7 ++++---
fs/fuse/fuse_dev_i.h | 15 +++++++++++++++
2 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 9eb191b5c4de..74cb9ae90052 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -7,6 +7,7 @@
*/
#include "fuse_i.h"
+#include "fuse_dev_i.h"
#include <linux/init.h>
#include <linux/module.h>
@@ -2136,7 +2137,7 @@ static __poll_t fuse_dev_poll(struct file *file, poll_table *wait)
}
/* Abort all requests on the given list (pending or processing) */
-static void end_requests(struct list_head *head)
+void fuse_dev_end_requests(struct list_head *head)
{
while (!list_empty(head)) {
struct fuse_req *req;
@@ -2239,7 +2240,7 @@ void fuse_abort_conn(struct fuse_conn *fc)
wake_up_all(&fc->blocked_waitq);
spin_unlock(&fc->lock);
- end_requests(&to_end);
+ fuse_dev_end_requests(&to_end);
} else {
spin_unlock(&fc->lock);
}
@@ -2269,7 +2270,7 @@ int fuse_dev_release(struct inode *inode, struct file *file)
list_splice_init(&fpq->processing[i], &to_end);
spin_unlock(&fpq->lock);
- end_requests(&to_end);
+ fuse_dev_end_requests(&to_end);
/* Are we the last open device? */
if (atomic_dec_and_test(&fc->dev_count)) {
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
new file mode 100644
index 000000000000..5a1b8a2775d8
--- /dev/null
+++ b/fs/fuse/fuse_dev_i.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * FUSE: Filesystem in Userspace
+ * Copyright (C) 2001-2008 Miklos Szeredi <[email protected]>
+ */
+#ifndef _FS_FUSE_DEV_I_H
+#define _FS_FUSE_DEV_I_H
+
+#include <linux/types.h>
+
+void fuse_dev_end_requests(struct list_head *head);
+
+#endif
+
+
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH RFC v3 02/17] fuse: Move fuse_get_dev to header file
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
2024-09-01 13:36 ` [PATCH RFC v3 01/17] fuse: rename to fuse_dev_end_requests and make non-static Bernd Schubert
@ 2024-09-01 13:36 ` Bernd Schubert
2024-09-01 13:36 ` [PATCH RFC v3 03/17] fuse: Move request bits Bernd Schubert
` (15 subsequent siblings)
17 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:36 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
Another preparation patch, as this function will be needed by
fuse/dev.c and fuse/dev_uring.c.
Signed-off-by: Bernd Schubert <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
---
fs/fuse/dev.c | 9 ---------
fs/fuse/fuse_dev_i.h | 9 +++++++++
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 74cb9ae90052..9ac69fd2cead 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -32,15 +32,6 @@ MODULE_ALIAS("devname:fuse");
static struct kmem_cache *fuse_req_cachep;
-static struct fuse_dev *fuse_get_dev(struct file *file)
-{
- /*
- * Lockless access is OK, because file->private data is set
- * once during mount and is valid until the file is released.
- */
- return READ_ONCE(file->private_data);
-}
-
static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
{
INIT_LIST_HEAD(&req->list);
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index 5a1b8a2775d8..b38e67b3f889 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -8,6 +8,15 @@
#include <linux/types.h>
+static inline struct fuse_dev *fuse_get_dev(struct file *file)
+{
+ /*
+ * Lockless access is OK, because file->private data is set
+ * once during mount and is valid until the file is released.
+ */
+ return READ_ONCE(file->private_data);
+}
+
void fuse_dev_end_requests(struct list_head *head);
#endif
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH RFC v3 03/17] fuse: Move request bits
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
2024-09-01 13:36 ` [PATCH RFC v3 01/17] fuse: rename to fuse_dev_end_requests and make non-static Bernd Schubert
2024-09-01 13:36 ` [PATCH RFC v3 02/17] fuse: Move fuse_get_dev to header file Bernd Schubert
@ 2024-09-01 13:36 ` Bernd Schubert
2024-09-01 13:36 ` [PATCH RFC v3 04/17] fuse: Add fuse-io-uring design documentation Bernd Schubert
` (14 subsequent siblings)
17 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:36 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
These request bits are needed by the dev_uring functions as well.
Signed-off-by: Bernd Schubert <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
---
fs/fuse/dev.c | 4 ----
fs/fuse/fuse_dev_i.h | 4 ++++
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 9ac69fd2cead..dbc222f9b0f0 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -26,10 +26,6 @@
MODULE_ALIAS_MISCDEV(FUSE_MINOR);
MODULE_ALIAS("devname:fuse");
-/* Ordinary requests have even IDs, while interrupts IDs are odd */
-#define FUSE_INT_REQ_BIT (1ULL << 0)
-#define FUSE_REQ_ID_STEP (1ULL << 1)
-
static struct kmem_cache *fuse_req_cachep;
static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index b38e67b3f889..6c506f040d5f 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -8,6 +8,10 @@
#include <linux/types.h>
+/* Ordinary requests have even IDs, while interrupts IDs are odd */
+#define FUSE_INT_REQ_BIT (1ULL << 0)
+#define FUSE_REQ_ID_STEP (1ULL << 1)
+
static inline struct fuse_dev *fuse_get_dev(struct file *file)
{
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH RFC v3 04/17] fuse: Add fuse-io-uring design documentation
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (2 preceding siblings ...)
2024-09-01 13:36 ` [PATCH RFC v3 03/17] fuse: Move request bits Bernd Schubert
@ 2024-09-01 13:36 ` Bernd Schubert
2024-09-01 13:36 ` [PATCH RFC v3 05/17] fuse: Add a uring config ioctl Bernd Schubert
` (13 subsequent siblings)
17 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:36 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
Signed-off-by: Bernd Schubert <[email protected]>
---
Documentation/filesystems/fuse-io-uring.rst | 108 ++++++++++++++++++++++++++++
1 file changed, 108 insertions(+)
diff --git a/Documentation/filesystems/fuse-io-uring.rst b/Documentation/filesystems/fuse-io-uring.rst
new file mode 100644
index 000000000000..0f1e8bb7eca6
--- /dev/null
+++ b/Documentation/filesystems/fuse-io-uring.rst
@@ -0,0 +1,108 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+FUSE Uring design documentation
+===============================
+
+This documentation covers the basic details of how the fuse
+kernel/userspace communication over io-uring is configured
+and works. For generic details about FUSE see fuse.rst.
+
+This document also covers the current interface, which is
+still in development and might change.
+
+Limitations
+===========
+As of now not all request types are supported through uring; userspace
+is required to also handle some requests through /dev/fuse after
+uring setup is complete, specifically notifications (initiated from
+the daemon side) and interrupts.
+
+Fuse uring configuration
+========================
+
+Fuse kernel requests are queued through the classical /dev/fuse
+read/write interface - until uring setup is complete.
+
+In order to set up fuse-over-io-uring, userspace has to send the ring
+initiation and configuration ioctl (FUSE_DEV_IOC_URING_CFG) and then,
+per queue, the FUSE_DEV_IOC_URING_QUEUE_CFG. The latter is mostly
+identical to FUSE_DEV_IOC_CLONE, but additionally contains the qid.
+
+Kernel - userspace interface using uring
+========================================
+
+After queue ioctl setup userspace submits
+SQEs (opcode = IORING_OP_URING_CMD) in order to fetch
+fuse requests. Initial submit is with the sub command
+FUSE_URING_REQ_FETCH, which will just register entries
+to be available in the kernel.
+
+Once all entries for all queues are submitted, the kernel starts
+to enqueue requests to the ring queues.
+Userspace handles the CQE/fuse-request and submits the result with the
+subcommand FUSE_URING_REQ_COMMIT_AND_FETCH - the kernel completes
+the request and also marks the entry as available again. If there
+are pending requests waiting, the next request is immediately
+submitted to the daemon again.
+
+Initial SQE
+-----------
+
+ | | FUSE filesystem daemon
+ | |
+ | | >io_uring_submit()
+ | | IORING_OP_URING_CMD /
+ | | FUSE_URING_REQ_FETCH
+ | | [wait cqe]
+ | | >io_uring_wait_cqe() or
+ | | >io_uring_submit_and_wait()
+ | |
+ | >fuse_uring_cmd() |
+ | >fuse_uring_fetch() |
+ | >fuse_uring_ent_release() |
+
+
+Sending requests with CQEs
+--------------------------
+
+ | | FUSE filesystem daemon
+ | | [waiting for CQEs]
+ | "rm /mnt/fuse/file" |
+ | |
+ | >sys_unlink() |
+ | >fuse_unlink() |
+ | [allocate request] |
+ | >__fuse_request_send() |
+ | ... |
+ | >fuse_uring_queue_fuse_req |
+ | [queue request on fg or |
+ | bg queue] |
+ | >fuse_uring_assign_ring_entry() |
+ | >fuse_uring_send_to_ring() |
+ | >fuse_uring_copy_to_ring() |
+ | >io_uring_cmd_done() |
+ | >request_wait_answer() |
+ | [sleep on req->waitq] |
+ | | [receives and handles CQE]
+ | | [submit result and fetch next]
+ | | >io_uring_submit()
+ | | IORING_OP_URING_CMD/
+ | | FUSE_URING_REQ_COMMIT_AND_FETCH
+ | >fuse_uring_cmd() |
+ | >fuse_uring_commit_and_release() |
+ | >fuse_uring_copy_from_ring() |
+ | [ copy the result to the fuse req] |
+ | >fuse_uring_req_end_and_get_next() |
+ | >fuse_request_end() |
+ | [wake up req->waitq] |
+ | >fuse_uring_ent_release_and_fetch()|
+ | [wait or handle next req] |
+ | |
+ | |
+ | [req->waitq woken up] |
+ | <fuse_unlink() |
+ | <sys_unlink() |
+
+
+
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH RFC v3 05/17] fuse: Add a uring config ioctl
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (3 preceding siblings ...)
2024-09-01 13:36 ` [PATCH RFC v3 04/17] fuse: Add fuse-io-uring design documentation Bernd Schubert
@ 2024-09-01 13:36 ` Bernd Schubert
2024-09-04 0:43 ` Joanne Koong
2024-09-01 13:37 ` [PATCH RFC v3 06/17] fuse: Add the queue configuration ioctl Bernd Schubert
` (12 subsequent siblings)
17 siblings, 1 reply; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:36 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
This only adds the initial ioctl for basic fuse-uring initialization.
More ioctl types will be added later to initialize queues.
It also adds the data structures that are needed or initialized by
the ioctl command and that will be used later.
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/fuse/Kconfig | 12 ++++
fs/fuse/Makefile | 1 +
fs/fuse/dev.c | 33 ++++++++---
fs/fuse/dev_uring.c | 141 ++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/dev_uring_i.h | 113 +++++++++++++++++++++++++++++++++++++
fs/fuse/fuse_dev_i.h | 1 +
fs/fuse/fuse_i.h | 5 ++
fs/fuse/inode.c | 3 +
include/uapi/linux/fuse.h | 47 ++++++++++++++++
9 files changed, 349 insertions(+), 7 deletions(-)
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 8674dbfbe59d..11f37cefc94b 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -63,3 +63,15 @@ config FUSE_PASSTHROUGH
to be performed directly on a backing file.
If you want to allow passthrough operations, answer Y.
+
+config FUSE_IO_URING
+ bool "FUSE communication over io-uring"
+ default y
+ depends on FUSE_FS
+ depends on IO_URING
+ help
+ This allows sending FUSE requests over the IO uring interface and
+ also adds request core affinity.
+
+ If you want to allow fuse server/client communication through io-uring,
+ answer Y
diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 6e0228c6d0cb..7193a14374fd 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -11,5 +11,6 @@ fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
fuse-y += iomode.o
fuse-$(CONFIG_FUSE_DAX) += dax.o
fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
+fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
virtiofs-y := virtio_fs.o
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index dbc222f9b0f0..6489179e7260 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -8,6 +8,7 @@
#include "fuse_i.h"
#include "fuse_dev_i.h"
+#include "dev_uring_i.h"
#include <linux/init.h>
#include <linux/module.h>
@@ -26,6 +27,13 @@
MODULE_ALIAS_MISCDEV(FUSE_MINOR);
MODULE_ALIAS("devname:fuse");
+#ifdef CONFIG_FUSE_IO_URING
+static bool __read_mostly enable_uring;
+module_param(enable_uring, bool, 0644);
+MODULE_PARM_DESC(enable_uring,
+ "Enable uring userspace communication through uring.");
+#endif
+
static struct kmem_cache *fuse_req_cachep;
static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
@@ -2298,16 +2306,12 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
return 0;
}
-static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
+static long _fuse_dev_ioctl_clone(struct file *file, int oldfd)
{
int res;
- int oldfd;
struct fuse_dev *fud = NULL;
struct fd f;
- if (get_user(oldfd, argp))
- return -EFAULT;
-
f = fdget(oldfd);
if (!f.file)
return -EINVAL;
@@ -2330,6 +2334,16 @@ static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
return res;
}
+static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
+{
+ int oldfd;
+
+ if (get_user(oldfd, argp))
+ return -EFAULT;
+
+ return _fuse_dev_ioctl_clone(file, oldfd);
+}
+
static long fuse_dev_ioctl_backing_open(struct file *file,
struct fuse_backing_map __user *argp)
{
@@ -2365,8 +2379,9 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
return fuse_backing_close(fud->fc, backing_id);
}
-static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
- unsigned long arg)
+static long
+fuse_dev_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
{
void __user *argp = (void __user *)arg;
@@ -2380,6 +2395,10 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
case FUSE_DEV_IOC_BACKING_CLOSE:
return fuse_dev_ioctl_backing_close(file, argp);
+#ifdef CONFIG_FUSE_IO_URING
+ case FUSE_DEV_IOC_URING_CFG:
+ return fuse_uring_conn_cfg(file, argp);
+#endif
default:
return -ENOTTY;
}
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
new file mode 100644
index 000000000000..4e7518ef6527
--- /dev/null
+++ b/fs/fuse/dev_uring.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * FUSE: Filesystem in Userspace
+ * Copyright (c) 2023-2024 DataDirect Networks.
+ */
+
+#include "fuse_dev_i.h"
+#include "fuse_i.h"
+#include "dev_uring_i.h"
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/sched/signal.h>
+#include <linux/uio.h>
+#include <linux/miscdevice.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/pipe_fs_i.h>
+#include <linux/swap.h>
+#include <linux/splice.h>
+#include <linux/sched.h>
+#include <linux/io_uring.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include <linux/io_uring.h>
+#include <linux/io_uring/cmd.h>
+#include <linux/topology.h>
+#include <linux/io_uring/cmd.h>
+
+static void fuse_uring_queue_cfg(struct fuse_ring_queue *queue, int qid,
+ struct fuse_ring *ring)
+{
+ int tag;
+
+ queue->qid = qid;
+ queue->ring = ring;
+
+ for (tag = 0; tag < ring->queue_depth; tag++) {
+ struct fuse_ring_ent *ent = &queue->ring_ent[tag];
+
+ ent->queue = queue;
+ ent->tag = tag;
+ }
+}
+
+static int _fuse_uring_conn_cfg(struct fuse_ring_config *rcfg,
+ struct fuse_conn *fc, struct fuse_ring *ring,
+ size_t queue_sz)
+{
+ ring->numa_aware = rcfg->numa_aware;
+ ring->nr_queues = rcfg->nr_queues;
+ ring->per_core_queue = rcfg->nr_queues > 1;
+
+ ring->max_nr_sync = rcfg->sync_queue_depth;
+ ring->max_nr_async = rcfg->async_queue_depth;
+ ring->queue_depth = ring->max_nr_sync + ring->max_nr_async;
+
+ ring->req_buf_sz = rcfg->user_req_buf_sz;
+
+ ring->queue_size = queue_sz;
+
+ fc->ring = ring;
+ ring->fc = fc;
+
+ return 0;
+}
+
+static int fuse_uring_cfg_sanity(struct fuse_ring_config *rcfg)
+{
+ if (rcfg->nr_queues == 0) {
+ pr_info("zero number of queues is invalid.\n");
+ return -EINVAL;
+ }
+
+ if (rcfg->nr_queues > 1 && rcfg->nr_queues != num_present_cpus()) {
+ pr_info("nr-queues (%d) does not match nr-cores (%d).\n",
+ rcfg->nr_queues, num_present_cpus());
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+/*
+ * Basic ring setup for this connection based on the provided configuration
+ */
+int fuse_uring_conn_cfg(struct file *file, void __user *argp)
+{
+ struct fuse_ring_config rcfg;
+ int res;
+ struct fuse_dev *fud;
+ struct fuse_conn *fc;
+ struct fuse_ring *ring = NULL;
+ struct fuse_ring_queue *queue;
+ int qid;
+
+ res = copy_from_user(&rcfg, (void *)argp, sizeof(rcfg));
+ if (res != 0)
+ return -EFAULT;
+ res = fuse_uring_cfg_sanity(&rcfg);
+ if (res != 0)
+ return res;
+
+ fud = fuse_get_dev(file);
+ if (fud == NULL)
+ return -ENODEV;
+ fc = fud->fc;
+
+ if (fc->ring == NULL) {
+ size_t queue_depth = rcfg.async_queue_depth +
+ rcfg.sync_queue_depth;
+ size_t queue_sz = sizeof(struct fuse_ring_queue) +
+ sizeof(struct fuse_ring_ent) * queue_depth;
+
+ ring = kvzalloc(sizeof(*fc->ring) + queue_sz * rcfg.nr_queues,
+ GFP_KERNEL_ACCOUNT);
+ if (ring == NULL)
+ return -ENOMEM;
+
+ spin_lock(&fc->lock);
+ if (fc->ring == NULL)
+ res = _fuse_uring_conn_cfg(&rcfg, fc, ring, queue_sz);
+ else
+ res = -EALREADY;
+ spin_unlock(&fc->lock);
+ if (res != 0)
+ goto err;
+ }
+
+ for (qid = 0; qid < ring->nr_queues; qid++) {
+ queue = fuse_uring_get_queue(ring, qid);
+ fuse_uring_queue_cfg(queue, qid, ring);
+ }
+
+ return 0;
+err:
+ kvfree(ring);
+ return res;
+}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
new file mode 100644
index 000000000000..d4eff87bcd1f
--- /dev/null
+++ b/fs/fuse/dev_uring_i.h
@@ -0,0 +1,113 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * FUSE: Filesystem in Userspace
+ * Copyright (c) 2023-2024 DataDirect Networks.
+ */
+
+#ifndef _FS_FUSE_DEV_URING_I_H
+#define _FS_FUSE_DEV_URING_I_H
+
+#include "fuse_i.h"
+
+#ifdef CONFIG_FUSE_IO_URING
+
+/* IORING_MAX_ENTRIES */
+#define FUSE_URING_MAX_QUEUE_DEPTH 32768
+
+/* A fuse ring entry, part of the ring queue */
+struct fuse_ring_ent {
+ /* back pointer */
+ struct fuse_ring_queue *queue;
+
+ /* array index in the ring-queue */
+ unsigned int tag;
+};
+
+struct fuse_ring_queue {
+ /*
+ * back pointer to the main fuse uring structure that holds this
+ * queue
+ */
+ struct fuse_ring *ring;
+
+ /* queue id, typically also corresponds to the cpu core */
+ unsigned int qid;
+
+ /* size depends on queue depth */
+ struct fuse_ring_ent ring_ent[] ____cacheline_aligned_in_smp;
+};
+
+/**
+ * Describes if uring is used for communication and holds all the data
+ * needed for uring communication
+ */
+struct fuse_ring {
+ /* back pointer */
+ struct fuse_conn *fc;
+
+ /* number of ring queues */
+ size_t nr_queues;
+
+ /* number of entries per queue */
+ size_t queue_depth;
+
+ /* req_arg_len + sizeof(struct fuse_req) */
+ size_t req_buf_sz;
+
+ /* max number of background requests per queue */
+ size_t max_nr_async;
+
+ /* max number of foreground requests */
+ size_t max_nr_sync;
+
+ /* size of struct fuse_ring_queue + queue-depth * entry-size */
+ size_t queue_size;
+
+ /* one queue per core or a single queue only ? */
+	/* one queue per core or a single queue only? */
+
+ /* numa aware memory allocation */
+ unsigned int numa_aware : 1;
+
+ struct fuse_ring_queue queues[] ____cacheline_aligned_in_smp;
+};
+
+void fuse_uring_abort_end_requests(struct fuse_ring *ring);
+int fuse_uring_conn_cfg(struct file *file, void __user *argp);
+
+static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
+{
+ if (fc->ring == NULL)
+ return;
+
+ kvfree(fc->ring);
+ fc->ring = NULL;
+}
+
+static inline struct fuse_ring_queue *
+fuse_uring_get_queue(struct fuse_ring *ring, int qid)
+{
+ char *ptr = (char *)ring->queues;
+
+ if (WARN_ON(qid > ring->nr_queues))
+ qid = 0;
+
+ return (struct fuse_ring_queue *)(ptr + qid * ring->queue_size);
+}
+
+#else /* CONFIG_FUSE_IO_URING */
+
+struct fuse_ring;
+
+static inline void fuse_uring_conn_init(struct fuse_ring *ring,
+ struct fuse_conn *fc)
+{
+}
+
+static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
+{
+}
+
+#endif /* CONFIG_FUSE_IO_URING */
+
+#endif /* _FS_FUSE_DEV_URING_I_H */
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index 6c506f040d5f..e6289bafb788 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -7,6 +7,7 @@
#define _FS_FUSE_DEV_I_H
#include <linux/types.h>
+#include <linux/fs.h>
/* Ordinary requests have even IDs, while interrupts IDs are odd */
#define FUSE_INT_REQ_BIT (1ULL << 0)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f23919610313..33e81b895fee 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -917,6 +917,11 @@ struct fuse_conn {
/** IDR for backing files ids */
struct idr backing_files_map;
#endif
+
+#ifdef CONFIG_FUSE_IO_URING
+	/** uring connection information */
+ struct fuse_ring *ring;
+#endif
};
/*
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 99e44ea7d875..33a080b24d65 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -7,6 +7,7 @@
*/
#include "fuse_i.h"
+#include "dev_uring_i.h"
#include <linux/pagemap.h>
#include <linux/slab.h>
@@ -947,6 +948,8 @@ static void delayed_release(struct rcu_head *p)
{
struct fuse_conn *fc = container_of(p, struct fuse_conn, rcu);
+ fuse_uring_conn_destruct(fc);
+
put_user_ns(fc->user_ns);
fc->release(fc);
}
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index d08b99d60f6f..a1c35e0338f0 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1079,12 +1079,53 @@ struct fuse_backing_map {
uint64_t padding;
};
+enum fuse_uring_ioctl_cmd {
+ /* not correctly initialized when set */
+ FUSE_URING_IOCTL_CMD_INVALID = 0,
+
+	/* Ioctl to prepare communication with io-uring */
+ FUSE_URING_IOCTL_CMD_RING_CFG = 1,
+
+ /* Ring queue configuration ioctl */
+ FUSE_URING_IOCTL_CMD_QUEUE_CFG = 2,
+};
+
+enum fuse_uring_cfg_flags {
+ /* server/daemon side requests numa awareness */
+ FUSE_URING_WANT_NUMA = 1ul << 0,
+};
+
+struct fuse_ring_config {
+ /* number of queues */
+ uint32_t nr_queues;
+
+ /* number of foreground entries per queue */
+ uint32_t sync_queue_depth;
+
+ /* number of background entries per queue */
+ uint32_t async_queue_depth;
+
+ /*
+ * buffer size userspace allocated per request buffer
+ * from the mmaped queue buffer
+ */
+ uint32_t user_req_buf_sz;
+
+ /* ring config flags */
+ uint64_t numa_aware:1;
+
+ /* for future extensions */
+ uint8_t padding[64];
+};
+
/* Device ioctls: */
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
#define FUSE_DEV_IOC_BACKING_OPEN _IOW(FUSE_DEV_IOC_MAGIC, 1, \
struct fuse_backing_map)
#define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
+#define FUSE_DEV_IOC_URING_CFG _IOR(FUSE_DEV_IOC_MAGIC, 3, \
+ struct fuse_ring_config)
struct fuse_lseek_in {
uint64_t fh;
@@ -1186,4 +1227,10 @@ struct fuse_supp_groups {
uint32_t groups[];
};
+/**
+ * Size of the ring buffer header
+ */
+#define FUSE_RING_HEADER_BUF_SIZE 4096
+#define FUSE_RING_MIN_IN_OUT_ARG_SIZE 4096
+
#endif /* _LINUX_FUSE_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH RFC v3 05/17] fuse: Add a uring config ioctl
2024-09-01 13:36 ` [PATCH RFC v3 05/17] fuse: Add a uring config ioctl Bernd Schubert
@ 2024-09-04 0:43 ` Joanne Koong
2024-09-04 22:24 ` Bernd Schubert
0 siblings, 1 reply; 37+ messages in thread
From: Joanne Koong @ 2024-09-04 0:43 UTC (permalink / raw)
To: Bernd Schubert
Cc: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd, linux-fsdevel,
io-uring, Josef Bacik, Amir Goldstein
On Sun, Sep 1, 2024 at 6:37 AM Bernd Schubert <[email protected]> wrote:
>
> This only adds the initial ioctl for basic fuse-uring initialization.
> More ioctl types will be added later to initialize queues.
>
> This also adds data structures needed or initialized by the ioctl
> command and that will be used later.
>
> Signed-off-by: Bernd Schubert <[email protected]>
Exciting to read through the work in this patchset!
I left some comments, lots of which are more granular / implementation
details than high-level design, in case that would be helpful to you
in reducing the turnaround time for this patchset. Let me know if
you'd prefer a hold-off on that though, if your intention with the RFC
is more to get high-level feedback.
Thanks,
Joanne
> ---
> fs/fuse/Kconfig | 12 ++++
> fs/fuse/Makefile | 1 +
> fs/fuse/dev.c | 33 ++++++++---
> fs/fuse/dev_uring.c | 141 ++++++++++++++++++++++++++++++++++++++++++++++
> fs/fuse/dev_uring_i.h | 113 +++++++++++++++++++++++++++++++++++++
> fs/fuse/fuse_dev_i.h | 1 +
> fs/fuse/fuse_i.h | 5 ++
> fs/fuse/inode.c | 3 +
> include/uapi/linux/fuse.h | 47 ++++++++++++++++
> 9 files changed, 349 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index 8674dbfbe59d..11f37cefc94b 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -63,3 +63,15 @@ config FUSE_PASSTHROUGH
> to be performed directly on a backing file.
>
> If you want to allow passthrough operations, answer Y.
> +
> +config FUSE_IO_URING
> + bool "FUSE communication over io-uring"
> + default y
> + depends on FUSE_FS
> + depends on IO_URING
> + help
> + This allows sending FUSE requests over the IO uring interface and
> + also adds request core affinity.
nit: this wording is a little bit awkward imo. Maybe something like
"... over the IO uring interface and enables core affinity for each
request" or "... over the IO uring interface and pins each request to
a specific core"?
I think there's an extra whitespace here in front of "also".
> +
> + If you want to allow fuse server/client communication through io-uring,
> + answer Y
super nit: missing a period at the end of Y.
> diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
> index 6e0228c6d0cb..7193a14374fd 100644
> --- a/fs/fuse/Makefile
> +++ b/fs/fuse/Makefile
> @@ -11,5 +11,6 @@ fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
> fuse-y += iomode.o
> fuse-$(CONFIG_FUSE_DAX) += dax.o
> fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
> +fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
>
> virtiofs-y := virtio_fs.o
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index dbc222f9b0f0..6489179e7260 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -8,6 +8,7 @@
>
> #include "fuse_i.h"
> #include "fuse_dev_i.h"
> +#include "dev_uring_i.h"
>
> #include <linux/init.h>
> #include <linux/module.h>
> @@ -26,6 +27,13 @@
> MODULE_ALIAS_MISCDEV(FUSE_MINOR);
> MODULE_ALIAS("devname:fuse");
>
> +#ifdef CONFIG_FUSE_IO_URING
> +static bool __read_mostly enable_uring;
I don't see where enable_uring gets used in this patchset, are you
planning to use this in a separate future patchset?
> +module_param(enable_uring, bool, 0644);
> +MODULE_PARM_DESC(enable_uring,
> + "Enable uring userspace communication through uring.");
^^^ extra "uring" here?
> +#endif
> +
> static struct kmem_cache *fuse_req_cachep;
>
> static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
> @@ -2298,16 +2306,12 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
> return 0;
> }
>
> -static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
> +static long _fuse_dev_ioctl_clone(struct file *file, int oldfd)
I think it'd be a bit clearer if this change was moved to patch 06/17
"Add the queue configuration ioctl" where it gets used
> {
> int res;
> - int oldfd;
> struct fuse_dev *fud = NULL;
> struct fd f;
>
> - if (get_user(oldfd, argp))
> - return -EFAULT;
> -
> f = fdget(oldfd);
> if (!f.file)
> return -EINVAL;
> @@ -2330,6 +2334,16 @@ static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
> return res;
> }
>
> +static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
> +{
> + int oldfd;
> +
> + if (get_user(oldfd, argp))
> + return -EFAULT;
> +
> + return _fuse_dev_ioctl_clone(file, oldfd);
> +}
> +
> static long fuse_dev_ioctl_backing_open(struct file *file,
> struct fuse_backing_map __user *argp)
> {
> @@ -2365,8 +2379,9 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
> return fuse_backing_close(fud->fc, backing_id);
> }
>
> -static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
> - unsigned long arg)
> +static long
> +fuse_dev_ioctl(struct file *file, unsigned int cmd,
> + unsigned long arg)
I think you accidentally added this line break here?
> {
> void __user *argp = (void __user *)arg;
>
> @@ -2380,6 +2395,10 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
> case FUSE_DEV_IOC_BACKING_CLOSE:
> return fuse_dev_ioctl_backing_close(file, argp);
>
> +#ifdef CONFIG_FUSE_IO_URING
> + case FUSE_DEV_IOC_URING_CFG:
> + return fuse_uring_conn_cfg(file, argp);
> +#endif
> default:
> return -ENOTTY;
> }
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> new file mode 100644
> index 000000000000..4e7518ef6527
> --- /dev/null
> +++ b/fs/fuse/dev_uring.c
> @@ -0,0 +1,141 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * FUSE: Filesystem in Userspace
> + * Copyright (c) 2023-2024 DataDirect Networks.
> + */
> +
> +#include "fuse_dev_i.h"
> +#include "fuse_i.h"
> +#include "dev_uring_i.h"
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/poll.h>
> +#include <linux/sched/signal.h>
> +#include <linux/uio.h>
> +#include <linux/miscdevice.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/slab.h>
> +#include <linux/pipe_fs_i.h>
> +#include <linux/swap.h>
> +#include <linux/splice.h>
> +#include <linux/sched.h>
> +#include <linux/io_uring.h>
> +#include <linux/mm.h>
> +#include <linux/io.h>
> +#include <linux/io_uring.h>
> +#include <linux/io_uring/cmd.h>
> +#include <linux/topology.h>
> +#include <linux/io_uring/cmd.h>
Are all of these headers (eg miscdevice.h, pipe_fs_i.h, topology.h) needed?
> +
> +static void fuse_uring_queue_cfg(struct fuse_ring_queue *queue, int qid,
> + struct fuse_ring *ring)
> +{
> + int tag;
> +
> + queue->qid = qid;
> + queue->ring = ring;
> +
> + for (tag = 0; tag < ring->queue_depth; tag++) {
> + struct fuse_ring_ent *ent = &queue->ring_ent[tag];
> +
> + ent->queue = queue;
> + ent->tag = tag;
> + }
> +}
> +
> +static int _fuse_uring_conn_cfg(struct fuse_ring_config *rcfg,
> + struct fuse_conn *fc, struct fuse_ring *ring,
> + size_t queue_sz)
Should this function just be marked "void" as the return type?
> +{
> + ring->numa_aware = rcfg->numa_aware;
> + ring->nr_queues = rcfg->nr_queues;
> + ring->per_core_queue = rcfg->nr_queues > 1;
> +
> + ring->max_nr_sync = rcfg->sync_queue_depth;
> + ring->max_nr_async = rcfg->async_queue_depth;
> + ring->queue_depth = ring->max_nr_sync + ring->max_nr_async;
> +
> + ring->req_buf_sz = rcfg->user_req_buf_sz;
> +
> + ring->queue_size = queue_sz;
> +
> + fc->ring = ring;
> + ring->fc = fc;
> +
> + return 0;
> +}
> +
> +static int fuse_uring_cfg_sanity(struct fuse_ring_config *rcfg)
> +{
> + if (rcfg->nr_queues == 0) {
> + pr_info("zero number of queues is invalid.\n");
I think this might get misinterpreted as "zero queues are invalid" -
maybe something like: "fuse_ring_config nr_queues=0 is invalid arg"
might be clearer?
> + return -EINVAL;
> + }
> +
> + if (rcfg->nr_queues > 1 && rcfg->nr_queues != num_present_cpus()) {
Will it always be that nr_queues must be the number of CPUs on the
system or will that constraint be relaxed in the future?
> + pr_info("nr-queues (%d) does not match nr-cores (%d).\n",
nit: %u for nr_queues, s/nr-queues/nr_queues
It might be useful here to specify "uring nr_queues" as well to make
it more obvious
> + rcfg->nr_queues, num_present_cpus());
> + return -EINVAL;
> + }
> +
Should this function also sanity check that the queue depth is <=
FUSE_URING_MAX_QUEUE_DEPTH?
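For illustration, a userspace sketch of what such a bound check could look like (fields trimmed to the relevant ones; `nr_cpus` stands in for num_present_cpus(), and `cfg_sanity` is a hypothetical helper, not the patch's code):

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define FUSE_URING_MAX_QUEUE_DEPTH 32768

struct fuse_ring_config {
	uint32_t nr_queues;
	uint32_t sync_queue_depth;
	uint32_t async_queue_depth;
};

/* Hypothetical sanity check with the extra depth bound added */
static int cfg_sanity(const struct fuse_ring_config *rcfg, unsigned int nr_cpus)
{
	/* sum in size_t so the two uint32_t depths cannot overflow */
	size_t depth = (size_t)rcfg->sync_queue_depth + rcfg->async_queue_depth;

	if (rcfg->nr_queues == 0)
		return -EINVAL;
	if (rcfg->nr_queues > 1 && rcfg->nr_queues != nr_cpus)
		return -EINVAL;
	if (depth == 0 || depth > FUSE_URING_MAX_QUEUE_DEPTH)
		return -EINVAL;
	return 0;
}
```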
> + return 0;
> +}
> +
> +/*
> + * Basic ring setup for this connection based on the provided configuration
> + */
> +int fuse_uring_conn_cfg(struct file *file, void __user *argp)
Is there a reason we pass in "void __user *argp" instead of "struct
fuse_ring_config __user *argp"?
> +{
> + struct fuse_ring_config rcfg;
> + int res;
> + struct fuse_dev *fud;
> + struct fuse_conn *fc;
> + struct fuse_ring *ring = NULL;
> + struct fuse_ring_queue *queue;
> + int qid;
> +
> + res = copy_from_user(&rcfg, (void *)argp, sizeof(rcfg));
I don't think we need this "(void *)" cast here
> + if (res != 0)
> + return -EFAULT;
> + res = fuse_uring_cfg_sanity(&rcfg);
> + if (res != 0)
> + return res;
> +
> + fud = fuse_get_dev(file);
> + if (fud == NULL)
nit: if (!fud)
> + return -ENODEV;
Should this be -ENODEV or -EPERM? -ENODEV makes sense to me but I'm
seeing other callers of fuse_get_dev() in fuse/dev.c return -EPERM
when fud is NULL.
> + fc = fud->fc;
> +
Should we add a check
if (fc->ring)
return -EINVAL (or -EALREADY);
If not, then I think we need to move the "for (qid = 0; ...)" logic
below to be within the "if (fc->ring == NULL)" check
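To sketch what I mean, a userspace model of the alloc-outside-lock / recheck-under-lock flow (the spinlock is elided and all names and sizes are made up, so this is purely illustrative):

```c
#include <errno.h>
#include <stdlib.h>

struct fuse_conn {
	void *ring;	/* stand-in for struct fuse_ring * */
};

/*
 * Model of the configuration flow: bail out early if already
 * configured, allocate speculatively outside the lock, recheck
 * fc->ring under it (elided here), and free the allocation if
 * another configuration won the race.
 */
static int conn_cfg(struct fuse_conn *fc)
{
	void *ring;
	int res = 0;

	if (fc->ring)
		return -EALREADY;	/* suggested early check */

	ring = calloc(1, 64);	/* placeholder size */
	if (!ring)
		return -ENOMEM;

	/* spin_lock(&fc->lock); */
	if (!fc->ring)
		fc->ring = ring;
	else
		res = -EALREADY;
	/* spin_unlock(&fc->lock); */

	if (res)
		free(ring);
	return res;
}
```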
> + if (fc->ring == NULL) {
nit: if (!fc->ring)
> + size_t queue_depth = rcfg.async_queue_depth +
> + rcfg.sync_queue_depth;
> + size_t queue_sz = sizeof(struct fuse_ring_queue) +
> + sizeof(struct fuse_ring_ent) * queue_depth;
> +
> + ring = kvzalloc(sizeof(*fc->ring) + queue_sz * rcfg.nr_queues,
> + GFP_KERNEL_ACCOUNT);
> + if (ring == NULL)
nit: if (!ring)
> + return -ENOMEM;
> +
> + spin_lock(&fc->lock);
> + if (fc->ring == NULL)
if (!fc->ring)
> + res = _fuse_uring_conn_cfg(&rcfg, fc, ring, queue_sz);
> + else
> + res = -EALREADY;
> + spin_unlock(&fc->lock);
> + if (res != 0)
nit: if (res)
> + goto err;
> + }
> +
> + for (qid = 0; qid < ring->nr_queues; qid++) {
> + queue = fuse_uring_get_queue(ring, qid);
> + fuse_uring_queue_cfg(queue, qid, ring);
> + }
> +
> + return 0;
> +err:
> + kvfree(ring);
> + return res;
> +}
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> new file mode 100644
> index 000000000000..d4eff87bcd1f
> --- /dev/null
> +++ b/fs/fuse/dev_uring_i.h
> @@ -0,0 +1,113 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * FUSE: Filesystem in Userspace
> + * Copyright (c) 2023-2024 DataDirect Networks.
> + */
> +
> +#ifndef _FS_FUSE_DEV_URING_I_H
> +#define _FS_FUSE_DEV_URING_I_H
> +
> +#include "fuse_i.h"
> +
> +#ifdef CONFIG_FUSE_IO_URING
> +
> +/* IORING_MAX_ENTRIES */
nit: I'm not sure this comment is that helpful. The
"FUSE_URING_MAX_QUEUE_DEPTH" name is clear enough, I think.
> +#define FUSE_URING_MAX_QUEUE_DEPTH 32768
> +
> +/* A fuse ring entry, part of the ring queue */
> +struct fuse_ring_ent {
> + /* back pointer */
> + struct fuse_ring_queue *queue;
Do you think it's worth using the tag to find the queue (I think we
can just use some container_of magic to get the queue backpointer here
since ring_ent is embedded within struct fuse_ring_queue?) instead of
having this be an explicit 8 byte pointer? I'm thinking about the case
where the user sets a queue depth of 32k (eg
FUSE_URING_MAX_QUEUE_DEPTH) and is on an 8-core system where they set
nr_queues to 8. This would end up in 8 * 32k * 8 = 2 MiB extra memory
allocated which seems non-trivial (but I guess this is also an extreme
case). Curious what your thoughts on this are.
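To sketch the container_of idea in userspace (layout reduced to the relevant fields; `ent_to_queue` is a hypothetical helper, not part of the patch):

```c
#include <stddef.h>
#include <stdlib.h>

/* userspace stand-in for the kernel's container_of() */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct fuse_ring_ent {
	unsigned int tag;	/* array index in the ring-queue */
};

struct fuse_ring_queue {
	unsigned int qid;
	struct fuse_ring_ent ring_ent[];	/* queue_depth entries */
};

/*
 * Hypothetical helper: step back 'tag' elements to ring_ent[0], then
 * up to the enclosing queue - no per-entry back pointer needed.
 */
static struct fuse_ring_queue *ent_to_queue(struct fuse_ring_ent *ent)
{
	return container_of(ent - ent->tag, struct fuse_ring_queue, ring_ent);
}
```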
> +
> + /* array index in the ring-queue */
> + unsigned int tag;
Just wondering, is this called "tag" instead of "index" to be
consistent with an io-uring naming convention?
> +};
> +
> +struct fuse_ring_queue {
> + /*
> + * back pointer to the main fuse uring structure that holds this
> + * queue
> + */
> + struct fuse_ring *ring;
> +
> + /* queue id, typically also corresponds to the cpu core */
> + unsigned int qid;
> +
> + /* size depends on queue depth */
> + struct fuse_ring_ent ring_ent[] ____cacheline_aligned_in_smp;
> +};
> +
> +/**
> + * Describes if uring is for communication and holds alls the data needed
nit: maybe this should just be "Holds all the data needed for uring
communication"?
nit: s/alls/all
> + * for uring communication
> + */
> +struct fuse_ring {
> + /* back pointer */
> + struct fuse_conn *fc;
> +
> + /* number of ring queues */
I think it's worth calling out here too that this must be the number
of CPUs on the system and that each CPU operates its own ring queue.
> + size_t nr_queues;
> +
> + /* number of entries per queue */
> + size_t queue_depth;
> +
> + /* req_arg_len + sizeof(struct fuse_req) */
What is req_arg_len? In _fuse_uring_conn_cfg(), it looks like this
gets set to rcfg->user_req_buf_sz which is passed in from userspace,
but from what I understand, "struct fuse_req" is a kernel-defined
struct? I'm a bit confused overall what the comment refers to, but I
also haven't looked through the libfuse change yet for this
patchset.
> + size_t req_buf_sz;
> +
> + /* max number of background requests per queue */
> + size_t max_nr_async;
> +
> + /* max number of foreground requests */
nit: for consistency with the comment for max_nr_async,
s/requests/"requests per queue"
> + size_t max_nr_sync;
It's interesting to me that this can get configured by userspace for
background requests vs foreground requests. My perspective was that
from the userspace POV, there's no differentiation between background
vs foreground requests. Personally, I'm still not really even sure yet
which of the read requests are async vs sync when I do a 8 MiB read
call for example (iirc, I was seeing both, when it first tried the
readahead path). It seems a bit like overkill to me but maybe there
are some servers that actually do care a lot about this.
> +
> + /* size of struct fuse_ring_queue + queue-depth * entry-size */
> + size_t queue_size;
> +
> + /* one queue per core or a single queue only ? */
> + unsigned int per_core_queue : 1;
> +
> + /* numa aware memory allocation */
> + unsigned int numa_aware : 1;
> +
> + struct fuse_ring_queue queues[] ____cacheline_aligned_in_smp;
> +};
> +
> +void fuse_uring_abort_end_requests(struct fuse_ring *ring);
nit: I think it'd be a bit cleaner if this got moved to patch 12/17
(fuse: {uring} Handle teardown of ring entries) when it gets used
> +int fuse_uring_conn_cfg(struct file *file, void __user *argp);
> +
> +static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
> +{
> + if (fc->ring == NULL)
nit: if (!fc->ring)
> + return;
> +
> + kvfree(fc->ring);
> + fc->ring = NULL;
> +}
> +
> +static inline struct fuse_ring_queue *
> +fuse_uring_get_queue(struct fuse_ring *ring, int qid)
> +{
> + char *ptr = (char *)ring->queues;
Do we need to cast this to char * or can we just do the math below as
return ring->queues + qid;
> +
> + if (WARN_ON(qid > ring->nr_queues))
Should this be >= since qid is 0-indexed?
We should never reach here, but it feels like if we do, we should just
automatically return NULL.
> + qid = 0;
> +
> + return (struct fuse_ring_queue *)(ptr + qid * ring->queue_size);
> +}
> +
> +#else /* CONFIG_FUSE_IO_URING */
> +
> +struct fuse_ring;
> +
> +static inline void fuse_uring_conn_init(struct fuse_ring *ring,
> + struct fuse_conn *fc)
> +{
> +}
> +
> +static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
> +{
> +}
> +
> +#endif /* CONFIG_FUSE_IO_URING */
> +
> +#endif /* _FS_FUSE_DEV_URING_I_H */
> diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
> index 6c506f040d5f..e6289bafb788 100644
> --- a/fs/fuse/fuse_dev_i.h
> +++ b/fs/fuse/fuse_dev_i.h
> @@ -7,6 +7,7 @@
> #define _FS_FUSE_DEV_I_H
>
> #include <linux/types.h>
> +#include <linux/fs.h>
I think you accidentally included this.
>
> /* Ordinary requests have even IDs, while interrupts IDs are odd */
> #define FUSE_INT_REQ_BIT (1ULL << 0)
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index f23919610313..33e81b895fee 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -917,6 +917,11 @@ struct fuse_conn {
> /** IDR for backing files ids */
> struct idr backing_files_map;
> #endif
> +
> +#ifdef CONFIG_FUSE_IO_URING
> + /** uring connection information*/
nit: need extra space between information and */
> + struct fuse_ring *ring;
> +#endif
> };
>
> /*
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 99e44ea7d875..33a080b24d65 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -7,6 +7,7 @@
> */
>
> #include "fuse_i.h"
> +#include "dev_uring_i.h"
>
> #include <linux/pagemap.h>
> #include <linux/slab.h>
> @@ -947,6 +948,8 @@ static void delayed_release(struct rcu_head *p)
> {
> struct fuse_conn *fc = container_of(p, struct fuse_conn, rcu);
>
> + fuse_uring_conn_destruct(fc);
I think it's cleaner if this is moved to fuse_free_conn than here.
> +
> put_user_ns(fc->user_ns);
> fc->release(fc);
> }
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index d08b99d60f6f..a1c35e0338f0 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -1079,12 +1079,53 @@ struct fuse_backing_map {
> uint64_t padding;
> };
>
> +enum fuse_uring_ioctl_cmd {
Do you have a link to the libfuse side? I'm not seeing these get used
in the patchset - I'm curious how libfuse will be using these then?
> + /* not correctly initialized when set */
> + FUSE_URING_IOCTL_CMD_INVALID = 0,
> +
> + /* Ioctl to prepare communucation with io-uring */
nit: communication spelling
> + FUSE_URING_IOCTL_CMD_RING_CFG = 1,
> +
> + /* Ring queue configuration ioctl */
> + FUSE_URING_IOCTL_CMD_QUEUE_CFG = 2,
> +};
> +
> +enum fuse_uring_cfg_flags {
> + /* server/daemon side requests numa awareness */
> + FUSE_URING_WANT_NUMA = 1ul << 0,
nit: 1UL for consistency
> +};
> +
> +struct fuse_ring_config {
> + /* number of queues */
> + uint32_t nr_queues;
> +
> + /* number of foreground entries per queue */
> + uint32_t sync_queue_depth;
> +
> + /* number of background entries per queue */
> + uint32_t async_queue_depth;
> +
> + /*
> + * buffer size userspace allocated per request buffer
> + * from the mmaped queue buffer
> + */
> + uint32_t user_req_buf_sz;
> +
> + /* ring config flags */
> + uint64_t numa_aware:1;
> +
> + /* for future extensions */
> + uint8_t padding[64];
> +};
> +
> /* Device ioctls: */
> #define FUSE_DEV_IOC_MAGIC 229
> #define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
> #define FUSE_DEV_IOC_BACKING_OPEN _IOW(FUSE_DEV_IOC_MAGIC, 1, \
> struct fuse_backing_map)
> #define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
> +#define FUSE_DEV_IOC_URING_CFG _IOR(FUSE_DEV_IOC_MAGIC, 3, \
> + struct fuse_ring_config)
>
> struct fuse_lseek_in {
> uint64_t fh;
> @@ -1186,4 +1227,10 @@ struct fuse_supp_groups {
> uint32_t groups[];
> };
>
> +/**
> + * Size of the ring buffer header
> + */
> +#define FUSE_RING_HEADER_BUF_SIZE 4096
> +#define FUSE_RING_MIN_IN_OUT_ARG_SIZE 4096
I think this'd be cleaner to review if this got moved to the patch
where this gets used
> +
> #endif /* _LINUX_FUSE_H */
>
> --
> 2.43.0
>
* Re: [PATCH RFC v3 05/17] fuse: Add a uring config ioctl
2024-09-04 0:43 ` Joanne Koong
@ 2024-09-04 22:24 ` Bernd Schubert
2024-09-06 19:23 ` Joanne Koong
0 siblings, 1 reply; 37+ messages in thread
From: Bernd Schubert @ 2024-09-04 22:24 UTC (permalink / raw)
To: Joanne Koong, Bernd Schubert
Cc: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd, linux-fsdevel,
io-uring, Josef Bacik, Amir Goldstein
On 9/4/24 02:43, Joanne Koong wrote:
> On Sun, Sep 1, 2024 at 6:37 AM Bernd Schubert <[email protected]> wrote:
>>
>> This only adds the initial ioctl for basic fuse-uring initialization.
>> More ioctl types will be added later to initialize queues.
>>
>> This also adds data structures needed or initialized by the ioctl
>> command and that will be used later.
>>
>> Signed-off-by: Bernd Schubert <[email protected]>
>
> Exciting to read through the work in this patchset!
>
> I left some comments, lots of which are more granular / implementation
> details than high-level design, in case that would be helpful to you
> in reducing the turnaround time for this patchset. Let me know if
> you'd prefer a hold-off on that though, if your intention with the RFC
> is more to get high-level feedback.
Thanks Joanne! I'm going to address your comments later this week.
>
>
> Thanks,
> Joanne
>
>> ---
>> fs/fuse/Kconfig | 12 ++++
>> fs/fuse/Makefile | 1 +
>> fs/fuse/dev.c | 33 ++++++++---
>> fs/fuse/dev_uring.c | 141 ++++++++++++++++++++++++++++++++++++++++++++++
>> fs/fuse/dev_uring_i.h | 113 +++++++++++++++++++++++++++++++++++++
>> fs/fuse/fuse_dev_i.h | 1 +
>> fs/fuse/fuse_i.h | 5 ++
>> fs/fuse/inode.c | 3 +
>> include/uapi/linux/fuse.h | 47 ++++++++++++++++
>> 9 files changed, 349 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>> index 8674dbfbe59d..11f37cefc94b 100644
>> --- a/fs/fuse/Kconfig
>> +++ b/fs/fuse/Kconfig
>> @@ -63,3 +63,15 @@ config FUSE_PASSTHROUGH
>> to be performed directly on a backing file.
>>
>> If you want to allow passthrough operations, answer Y.
>> +
>> +config FUSE_IO_URING
>> + bool "FUSE communication over io-uring"
>> + default y
>> + depends on FUSE_FS
>> + depends on IO_URING
>> + help
>> + This allows sending FUSE requests over the IO uring interface and
>> + also adds request core affinity.
>
> nit: this wording is a little bit awkward imo. Maybe something like
> "... over the IO uring interface and enables core affinity for each
> request" or "... over the IO uring interface and pins each request to
> a specific core"?
> I think there's an extra whitespace here in front of "also".
>
>> +
>> + If you want to allow fuse server/client communication through io-uring,
>> + answer Y
>
> super nit: missing a period at the end of Y.
>
>> diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
>> index 6e0228c6d0cb..7193a14374fd 100644
>> --- a/fs/fuse/Makefile
>> +++ b/fs/fuse/Makefile
>> @@ -11,5 +11,6 @@ fuse-y := dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
>> fuse-y += iomode.o
>> fuse-$(CONFIG_FUSE_DAX) += dax.o
>> fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o
>> +fuse-$(CONFIG_FUSE_IO_URING) += dev_uring.o
>>
>> virtiofs-y := virtio_fs.o
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index dbc222f9b0f0..6489179e7260 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -8,6 +8,7 @@
>>
>> #include "fuse_i.h"
>> #include "fuse_dev_i.h"
>> +#include "dev_uring_i.h"
>>
>> #include <linux/init.h>
>> #include <linux/module.h>
>> @@ -26,6 +27,13 @@
>> MODULE_ALIAS_MISCDEV(FUSE_MINOR);
>> MODULE_ALIAS("devname:fuse");
>>
>> +#ifdef CONFIG_FUSE_IO_URING
>> +static bool __read_mostly enable_uring;
>
> I don't see where enable_uring gets used in this patchset, are you
> planning to use this in a separate future patchset?
Ouch, thanks, I broke it from the previous patch when I refactored ioctls. As I wrote in the introduction, this patch set is not completely tested - I had missed it. Thanks again for spotting.
>
>> +module_param(enable_uring, bool, 0644);
>> +MODULE_PARM_DESC(enable_uring,
>> + "Enable uring userspace communication through uring.");
> ^^^ extra "uring" here?
>
>> +#endif
>> +
>> static struct kmem_cache *fuse_req_cachep;
>>
>> static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
>> @@ -2298,16 +2306,12 @@ static int fuse_device_clone(struct fuse_conn *fc, struct file *new)
>> return 0;
>> }
>>
>> -static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
>> +static long _fuse_dev_ioctl_clone(struct file *file, int oldfd)
>
> I think it'd be a bit clearer if this change was moved to patch 06/17
> "Add the queue configuration ioctl" where it gets used
Oh, yeah, accidentally in here.
>
>> {
>> int res;
>> - int oldfd;
>> struct fuse_dev *fud = NULL;
>> struct fd f;
>>
>> - if (get_user(oldfd, argp))
>> - return -EFAULT;
>> -
>> f = fdget(oldfd);
>> if (!f.file)
>> return -EINVAL;
>> @@ -2330,6 +2334,16 @@ static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
>> return res;
>> }
>>
>> +static long fuse_dev_ioctl_clone(struct file *file, __u32 __user *argp)
>> +{
>> + int oldfd;
>> +
>> + if (get_user(oldfd, argp))
>> + return -EFAULT;
>> +
>> + return _fuse_dev_ioctl_clone(file, oldfd);
>> +}
>> +
>> static long fuse_dev_ioctl_backing_open(struct file *file,
>> struct fuse_backing_map __user *argp)
>> {
>> @@ -2365,8 +2379,9 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
>> return fuse_backing_close(fud->fc, backing_id);
>> }
>>
>> -static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
>> - unsigned long arg)
>> +static long
>> +fuse_dev_ioctl(struct file *file, unsigned int cmd,
>> + unsigned long arg)
>
> I think you accidentally added this line break here?
Yeah :(
>
>> {
>> void __user *argp = (void __user *)arg;
>>
>> @@ -2380,6 +2395,10 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
>> case FUSE_DEV_IOC_BACKING_CLOSE:
>> return fuse_dev_ioctl_backing_close(file, argp);
>>
>> +#ifdef CONFIG_FUSE_IO_URING
>> + case FUSE_DEV_IOC_URING_CFG:
>> + return fuse_uring_conn_cfg(file, argp);
>> +#endif
>> default:
>> return -ENOTTY;
>> }
>> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
>> new file mode 100644
>> index 000000000000..4e7518ef6527
>> --- /dev/null
>> +++ b/fs/fuse/dev_uring.c
>> @@ -0,0 +1,141 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * FUSE: Filesystem in Userspace
>> + * Copyright (c) 2023-2024 DataDirect Networks.
>> + */
>> +
>> +#include "fuse_dev_i.h"
>> +#include "fuse_i.h"
>> +#include "dev_uring_i.h"
>> +
>> +#include <linux/init.h>
>> +#include <linux/module.h>
>> +#include <linux/poll.h>
>> +#include <linux/sched/signal.h>
>> +#include <linux/uio.h>
>> +#include <linux/miscdevice.h>
>> +#include <linux/pagemap.h>
>> +#include <linux/file.h>
>> +#include <linux/slab.h>
>> +#include <linux/pipe_fs_i.h>
>> +#include <linux/swap.h>
>> +#include <linux/splice.h>
>> +#include <linux/sched.h>
>> +#include <linux/io_uring.h>
>> +#include <linux/mm.h>
>> +#include <linux/io.h>
>> +#include <linux/io_uring.h>
>> +#include <linux/io_uring/cmd.h>
>> +#include <linux/topology.h>
>> +#include <linux/io_uring/cmd.h>
>
> Are all of these headers (eg miscdevice.h, pipe_fs_i.h, topology.h) needed?
Actually already removed in my local v4 branch. I noticed myself on Monday.
>
>> +
>> +static void fuse_uring_queue_cfg(struct fuse_ring_queue *queue, int qid,
>> + struct fuse_ring *ring)
>> +{
>> + int tag;
>> +
>> + queue->qid = qid;
>> + queue->ring = ring;
>> +
>> + for (tag = 0; tag < ring->queue_depth; tag++) {
>> + struct fuse_ring_ent *ent = &queue->ring_ent[tag];
>> +
>> + ent->queue = queue;
>> + ent->tag = tag;
>> + }
>> +}
>> +
>> +static int _fuse_uring_conn_cfg(struct fuse_ring_config *rcfg,
>> + struct fuse_conn *fc, struct fuse_ring *ring,
>> + size_t queue_sz)
>
> Should this function just be marked "void" as the return type?
Yeah, I had missed it.
>
>> +{
>> + ring->numa_aware = rcfg->numa_aware;
>> + ring->nr_queues = rcfg->nr_queues;
>> + ring->per_core_queue = rcfg->nr_queues > 1;
>> +
>> + ring->max_nr_sync = rcfg->sync_queue_depth;
>> + ring->max_nr_async = rcfg->async_queue_depth;
>> + ring->queue_depth = ring->max_nr_sync + ring->max_nr_async;
>> +
>> + ring->req_buf_sz = rcfg->user_req_buf_sz;
>> +
>> + ring->queue_size = queue_sz;
>> +
>> + fc->ring = ring;
>> + ring->fc = fc;
>> +
>> + return 0;
>> +}
>> +
>> +static int fuse_uring_cfg_sanity(struct fuse_ring_config *rcfg)
>> +{
>> + if (rcfg->nr_queues == 0) {
>> + pr_info("zero number of queues is invalid.\n");
>
> I think this might get misinterpreted as "zero queues are invalid" -
> maybe something like: "fuse_ring_config nr_queues=0 is invalid arg"
> might be clearer?
>
>> + return -EINVAL;
>> + }
>> +
>> + if (rcfg->nr_queues > 1 && rcfg->nr_queues != num_present_cpus()) {
>
> Will it always be that nr_queues must be the number of CPUs on the
> system or will that constraint be relaxed in the future?
In all my testing, performance rather suffered when any kind of CPU switching was involved. I guess we should first find a good reason to relax it, and then think about which queue to use when a request comes in on a different core. Do you have a use case?
>
>> + pr_info("nr-queues (%d) does not match nr-cores (%d).\n",
>
> nit: %u for nr_queues, s/nr-queues/nr_queues
> It might be useful here to specify "uring nr_queues" as well to make
> it more obvious
>
>> + rcfg->nr_queues, num_present_cpus());
>> + return -EINVAL;
>> + }
>> +
>
> Should this function also sanity check that the queue depth is <=
> FUSE_URING_MAX_QUEUE_DEPTH?
Right.
>
>> + return 0;
>> +}
>> +
>> +/*
>> + * Basic ring setup for this connection based on the provided configuration
>> + */
>> +int fuse_uring_conn_cfg(struct file *file, void __user *argp)
>
> Is there a reason we pass in "void __user *argp" instead of "struct
> fuse_ring_config __user *argp"?
Will fix it.
>
>> +{
>> + struct fuse_ring_config rcfg;
>> + int res;
>> + struct fuse_dev *fud;
>> + struct fuse_conn *fc;
>> + struct fuse_ring *ring = NULL;
>> + struct fuse_ring_queue *queue;
>> + int qid;
>> +
>> + res = copy_from_user(&rcfg, (void *)argp, sizeof(rcfg));
>
> I don't think we need this "(void *)" cast here
>
>> + if (res != 0)
>> + return -EFAULT;
>> + res = fuse_uring_cfg_sanity(&rcfg);
>> + if (res != 0)
>> + return res;
>> +
>> + fud = fuse_get_dev(file);
>> + if (fud == NULL)
>
> nit: if (!fud)
>
>> + return -ENODEV;
>
> Should this be -ENODEV or -EPERM? -ENODEV makes sense to me but I'm
> seeing other callers of fuse_get_dev() in fuse/dev.c return -EPERM
> when fud is NULL.
>
>> + fc = fud->fc;
>> +
>
> Should we add a check
> if (fc->ring)
> return -EINVAL (or -EALREADY);
>
> if not, then i think we need to move the "for (qid = 0; ...)" logic
> below to be within the "if (fc->ring == NULL)" check
>
>> + if (fc->ring == NULL) {
>
> nit: if (!fc->ring)
>
>> + size_t queue_depth = rcfg.async_queue_depth +
>> + rcfg.sync_queue_depth;
>> + size_t queue_sz = sizeof(struct fuse_ring_queue) +
>> + sizeof(struct fuse_ring_ent) * queue_depth;
>> +
>> + ring = kvzalloc(sizeof(*fc->ring) + queue_sz * rcfg.nr_queues,
>> + GFP_KERNEL_ACCOUNT);
>> + if (ring == NULL)
>
> nit: if (!ring)
>
>> + return -ENOMEM;
>> +
>> + spin_lock(&fc->lock);
>> + if (fc->ring == NULL)
>
> if (!fc->ring)
>
>> + res = _fuse_uring_conn_cfg(&rcfg, fc, ring, queue_sz);
>> + else
>> + res = -EALREADY;
>> + spin_unlock(&fc->lock);
>> + if (res != 0)
>
> nit: if (res)
>
>> + goto err;
>> + }
>> +
>> + for (qid = 0; qid < ring->nr_queues; qid++) {
>> + queue = fuse_uring_get_queue(ring, qid);
>> + fuse_uring_queue_cfg(queue, qid, ring);
>> + }
>> +
>> + return 0;
>> +err:
>> + kvfree(ring);
>> + return res;
>> +}
>> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
>> new file mode 100644
>> index 000000000000..d4eff87bcd1f
>> --- /dev/null
>> +++ b/fs/fuse/dev_uring_i.h
>> @@ -0,0 +1,113 @@
>> +/* SPDX-License-Identifier: GPL-2.0
>> + *
>> + * FUSE: Filesystem in Userspace
>> + * Copyright (c) 2023-2024 DataDirect Networks.
>> + */
>> +
>> +#ifndef _FS_FUSE_DEV_URING_I_H
>> +#define _FS_FUSE_DEV_URING_I_H
>> +
>> +#include "fuse_i.h"
>> +
>> +#ifdef CONFIG_FUSE_IO_URING
>> +
>> +/* IORING_MAX_ENTRIES */
>
> nit: I'm not sure this comment is that helpful. The
> "FUSE_URING_MAX_QUEUE_DEPTH" name is clear enough, I think.
>
>> +#define FUSE_URING_MAX_QUEUE_DEPTH 32768
>> +
>> +/* A fuse ring entry, part of the ring queue */
>> +struct fuse_ring_ent {
>> + /* back pointer */
>> + struct fuse_ring_queue *queue;
>
> Do you think it's worth using the tag to find the queue (i think we
> can just use some containerof magic to get the queue backpointer here
> since ring_ent is embedded within struct fuse_ring_queue?) instead of
> having this be an explicit 8 byte pointer? I'm thinking about the case
> where the user sets a queue depth of 32k (eg
> FUSE_URING_MAX_QUEUE_DEPTH) and is on an 8-core system where they set
> nr_queues to 8. This would end up in 8 * 32k * 8 = 2 MiB extra memory
> allocated which seems non-trivial (but I guess this is also an extreme
> case). Curious what your thoughts on this are.
>
>> +
>> + /* array index in the ring-queue */
>> + unsigned int tag;
>
> Just wondering, is this called "tag" instead of "index" to be
> consistent with an io-ring naming convention?
>
>> +};
>> +
>> +struct fuse_ring_queue {
>> + /*
>> + * back pointer to the main fuse uring structure that holds this
>> + * queue
>> + */
>> + struct fuse_ring *ring;
>> +
>> + /* queue id, typically also corresponds to the cpu core */
>> + unsigned int qid;
>> +
>> + /* size depends on queue depth */
>> + struct fuse_ring_ent ring_ent[] ____cacheline_aligned_in_smp;
>> +};
>> +
>> +/**
>> + * Describes if uring is for communication and holds alls the data needed
>
> nit: maybe this should just be "Holds all the data needed for uring
> communication"?
>
> nit: s/alls/all
>
>> + * for uring communication
>> + */
>> +struct fuse_ring {
>> + /* back pointer */
>> + struct fuse_conn *fc;
>> +
>> + /* number of ring queues */
>
> I think it's worth calling out here too that this must be the number
> of CPUs on the system and that each CPU operates its own ring queue.
>
>> + size_t nr_queues;
>> +
>> + /* number of entries per queue */
>> + size_t queue_depth;
>> +
>> + /* req_arg_len + sizeof(struct fuse_req) */
>
> What is req_arg_len? In _fuse_uring_conn_cfg(), it looks like this
> gets set to rcfg->user_req_buf_sz which is passed in from userspace,
> but from what I understand, "struct fuse_req" is a kernel-defined
> struct? I'm a bit confused overall what the comment refers to, but I
> also haven't yet looked through the libfuse change yet for this
> patchset.
Sorry, it is a typo, it is supposed to be 'sizeof(struct fuse_ring_req)'.
>
>> + size_t req_buf_sz;
>> +
>> + /* max number of background requests per queue */
>> + size_t max_nr_async;
>> +
>> + /* max number of foreground requests */
>
> nit: for consistency with the comment for max_nr_async,
> s/requests/"requests per queue"
>
>> + size_t max_nr_sync;
>
> It's interesting to me that this can get configured by userspace for
> background requests vs foreground requests. My perspective was that
> from the userspace POV, there's no differentiation between background
> vs foreground requests. Personally, I'm still not really even sure yet
> which of the read requests are async vs sync when I do a 8 MiB read
> call for example (iirc, I was seeing both, when it first tried the
> readahead path). It seems a bit like overkill to me but maybe there
> are some servers that actually do care a lot about this.
I think I need to rework this a bit. What I actually want is credits. With /dev/fuse, bg requests get moved into the main request list and can then block everything. To keep the series small, it is maybe better if I entirely remove that in v4.
>
>> +
>> + /* size of struct fuse_ring_queue + queue-depth * entry-size */
>> + size_t queue_size;
>> +
>> + /* one queue per core or a single queue only ? */
>> + unsigned int per_core_queue : 1;
>> +
>> + /* numa aware memory allocation */
>> + unsigned int numa_aware : 1;
>> +
>> + struct fuse_ring_queue queues[] ____cacheline_aligned_in_smp;
>> +};
>> +
>> +void fuse_uring_abort_end_requests(struct fuse_ring *ring);
>
> nit: I think it'd be a bit cleaner if this got moved to patch 12/17
> (fuse: {uring} Handle teardown of ring entries) when it gets used
>
>> +int fuse_uring_conn_cfg(struct file *file, void __user *argp);
>> +
>> +static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
>> +{
>> + if (fc->ring == NULL)
>
> nit: if (!fc->ring)
Maybe I'll find a coccinelle script or write one for such things;
checkpatch.pl doesn't annotate it.
>
>> + return;
>> +
>> + kvfree(fc->ring);
>> + fc->ring = NULL;
>> +}
>> +
>> +static inline struct fuse_ring_queue *
>> +fuse_uring_get_queue(struct fuse_ring *ring, int qid)
>> +{
>> + char *ptr = (char *)ring->queues;
>
> Do we need to cast this to char * or can we just do the math below as
> return ring->queues + qid;
It is qid * ring->queue_size, as we have the variable length
array 'struct fuse_ring_ent ring_ent[]'. I'm still looking for a better
name for 'ring->queue_size'. Its meaning is
sizeof(struct fuse_ring_queue) + queue_depth * sizeof(struct fuse_ring_ent)
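Roughly like this, modeled in userspace with the queues placed directly after the ring header in one allocation (only the fields needed for the arithmetic; a sketch, not the kernel code):

```c
#include <stddef.h>
#include <stdlib.h>

struct fuse_ring_ent {
	unsigned int tag;
};

struct fuse_ring_queue {
	unsigned int qid;
	struct fuse_ring_ent ring_ent[];	/* queue_depth entries */
};

struct fuse_ring {
	size_t queue_depth;
	/* sizeof(struct fuse_ring_queue) + queue_depth * sizeof(struct fuse_ring_ent) */
	size_t queue_size;
	/* nr_queues * queue_size bytes of queues follow this header */
};

static struct fuse_ring_queue *get_queue(struct fuse_ring *ring, int qid)
{
	/*
	 * Each element occupies queue_size bytes because of the flexible
	 * ring_ent[] array, so indexing with a plain
	 * sizeof(struct fuse_ring_queue) stride would land inside the
	 * previous queue's entry array.
	 */
	char *ptr = (char *)(ring + 1);

	return (struct fuse_ring_queue *)(ptr + qid * ring->queue_size);
}
```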
>
>> +
>> + if (WARN_ON(qid > ring->nr_queues))
>
> Should this be >= since qid is 0-indexed?
Ouch, yeah.
>
> we should never reach here, but it feels like if we do, we should just
> automatically return NULL.
>
>> + qid = 0;
>> +
>> + return (struct fuse_ring_queue *)(ptr + qid * ring->queue_size);
>> +}
>> +
>> +#else /* CONFIG_FUSE_IO_URING */
>> +
>> +struct fuse_ring;
>> +
>> +static inline void fuse_uring_conn_init(struct fuse_ring *ring,
>> + struct fuse_conn *fc)
>> +{
>> +}
>> +
>> +static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
>> +{
>> +}
>> +
>> +#endif /* CONFIG_FUSE_IO_URING */
>> +
>> +#endif /* _FS_FUSE_DEV_URING_I_H */
>> diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
>> index 6c506f040d5f..e6289bafb788 100644
>> --- a/fs/fuse/fuse_dev_i.h
>> +++ b/fs/fuse/fuse_dev_i.h
>> @@ -7,6 +7,7 @@
>> #define _FS_FUSE_DEV_I_H
>>
>> #include <linux/types.h>
>> +#include <linux/fs.h>
>
> I think you accidentally included this.
>
When I remove it:
bschubert2@imesrv6 linux.git>make M=fs/fuse/
CC [M] fs/fuse/dev_uring.o
In file included from fs/fuse/dev_uring.c:7:
fs/fuse/fuse_dev_i.h:15:52: warning: declaration of 'struct file' will not be visible outside of this function [-Wvisibility]
static inline struct fuse_dev *fuse_get_dev(struct file *file)
^
fs/fuse/fuse_dev_i.h:21:9: error: call to undeclared function 'READ_ONCE'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
return READ_ONCE(file->private_data);
^
fs/fuse/fuse_dev_i.h:21:23: error: incomplete definition of type 'struct file'
return READ_ONCE(file->private_data);
~~~~^
I could also include <linux/fs.h> in dev_uring.c, but isn't it cleaner
to have the include in fuse_dev_i.h, since it is that file that
adds the dependency?
>>
>> /* Ordinary requests have even IDs, while interrupts IDs are odd */
>> #define FUSE_INT_REQ_BIT (1ULL << 0)
>> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
>> index f23919610313..33e81b895fee 100644
>> --- a/fs/fuse/fuse_i.h
>> +++ b/fs/fuse/fuse_i.h
>> @@ -917,6 +917,11 @@ struct fuse_conn {
>> /** IDR for backing files ids */
>> struct idr backing_files_map;
>> #endif
>> +
>> +#ifdef CONFIG_FUSE_IO_URING
>> + /** uring connection information*/
> nit: need extra space between information and */
>> + struct fuse_ring *ring;
>> +#endif
>> };
>>
>> /*
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index 99e44ea7d875..33a080b24d65 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -7,6 +7,7 @@
>> */
>>
>> #include "fuse_i.h"
>> +#include "dev_uring_i.h"
>>
>> #include <linux/pagemap.h>
>> #include <linux/slab.h>
>> @@ -947,6 +948,8 @@ static void delayed_release(struct rcu_head *p)
>> {
>> struct fuse_conn *fc = container_of(p, struct fuse_conn, rcu);
>>
>> + fuse_uring_conn_destruct(fc);
>
> I think it's cleaner if this is moved to fuse_free_conn than here.
>
>> +
>> put_user_ns(fc->user_ns);
>> fc->release(fc);
>> }
>> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
>> index d08b99d60f6f..a1c35e0338f0 100644
>> --- a/include/uapi/linux/fuse.h
>> +++ b/include/uapi/linux/fuse.h
>> @@ -1079,12 +1079,53 @@ struct fuse_backing_map {
>> uint64_t padding;
>> };
>>
>> +enum fuse_uring_ioctl_cmd {
>
> Do you have a link to the libfuse side? I'm not seeing these get used
> in the patchset - I'm curious how libfuse will be using these then?
I had written it in the introduction
https://github.com/bsbernd/libfuse/tree/uring
Please don't look at the individual patches yet; I will clean that up
once we agree on the basic approach.
>
>> + /* not correctly initialized when set */
>> + FUSE_URING_IOCTL_CMD_INVALID = 0,
>> +
>> + /* Ioctl to prepare communucation with io-uring */
>
> nit: communication spelling
>
>> + FUSE_URING_IOCTL_CMD_RING_CFG = 1,
>> +
>> + /* Ring queue configuration ioctl */
>> + FUSE_URING_IOCTL_CMD_QUEUE_CFG = 2,
>> +};
>> +
>> +enum fuse_uring_cfg_flags {
>> + /* server/daemon side requests numa awareness */
>> + FUSE_URING_WANT_NUMA = 1ul << 0,
>
> nit: 1UL for consistency
>
>> +};
>> +
>> +struct fuse_ring_config {
>> + /* number of queues */
>> + uint32_t nr_queues;
>> +
>> + /* number of foreground entries per queue */
>> + uint32_t sync_queue_depth;
>> +
>> + /* number of background entries per queue */
>> + uint32_t async_queue_depth;
>> +
>> + /*
>> + * buffer size userspace allocated per request buffer
>> + * from the mmaped queue buffer
>> + */
>> + uint32_t user_req_buf_sz;
>> +
>> + /* ring config flags */
>> + uint64_t numa_aware:1;
>> +
>> + /* for future extensions */
>> + uint8_t padding[64];
>> +};
>> +
>> /* Device ioctls: */
>> #define FUSE_DEV_IOC_MAGIC 229
>> #define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
>> #define FUSE_DEV_IOC_BACKING_OPEN _IOW(FUSE_DEV_IOC_MAGIC, 1, \
>> struct fuse_backing_map)
>> #define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
>> +#define FUSE_DEV_IOC_URING_CFG _IOR(FUSE_DEV_IOC_MAGIC, 3, \
>> + struct fuse_ring_config)
>>
>> struct fuse_lseek_in {
>> uint64_t fh;
>> @@ -1186,4 +1227,10 @@ struct fuse_supp_groups {
>> uint32_t groups[];
>> };
>>
>> +/**
>> + * Size of the ring buffer header
>> + */
>> +#define FUSE_RING_HEADER_BUF_SIZE 4096
>> +#define FUSE_RING_MIN_IN_OUT_ARG_SIZE 4096
>
> I think this'd be cleaner to review if this got moved to the patch
> where this gets used
>
>> +
>> #endif /* _LINUX_FUSE_H */
>>
>> --
>> 2.43.0
>>
I will get it all fixed later this week! I will also review my own
patches before v4; I just wanted to get v3 out asap, as it was already
taking so much time after v2.
Thanks,
Bernd
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH RFC v3 05/17] fuse: Add a uring config ioctl
2024-09-04 22:24 ` Bernd Schubert
@ 2024-09-06 19:23 ` Joanne Koong
0 siblings, 0 replies; 37+ messages in thread
From: Joanne Koong @ 2024-09-06 19:23 UTC (permalink / raw)
To: Bernd Schubert
Cc: Bernd Schubert, Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd,
linux-fsdevel, io-uring, Josef Bacik, Amir Goldstein
On Wed, Sep 4, 2024 at 3:24 PM Bernd Schubert
<[email protected]> wrote:
>
> On 9/4/24 02:43, Joanne Koong wrote:
> > On Sun, Sep 1, 2024 at 6:37 AM Bernd Schubert <[email protected]> wrote:
> >>
> >> This only adds the initial ioctl for basic fuse-uring initialization.
> >> More ioctl types will be added later to initialize queues.
...
> >
> >> + return -EINVAL;
> >> + }
> >> +
> >> + if (rcfg->nr_queues > 1 && rcfg->nr_queues != num_present_cpus()) {
> >
> > Will it always be that nr_queues must be the number of CPUs on the
> > system or will that constraint be relaxed in the future?
>
> In all my testing, performance suffered when any kind of CPU switching was involved. I guess we should first find a good reason to relax it and then think about which queue to use when a request comes in on a different core. Do you have a use case?
Ah, gotcha. I don't have a use case in mind, just thought it'd be
common for some users to want more than 1 queue but not as many queues
as they have cores. This could always be added in the future, though,
if this use case actually comes up.
>
> >> diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
> >> index 6c506f040d5f..e6289bafb788 100644
> >> --- a/fs/fuse/fuse_dev_i.h
> >> +++ b/fs/fuse/fuse_dev_i.h
> >> @@ -7,6 +7,7 @@
> >> #define _FS_FUSE_DEV_I_H
> >>
> >> #include <linux/types.h>
> >> +#include <linux/fs.h>
> >
> > I think you accidentally included this.
> >
>
> When I remove it:
>
> bschubert2@imesrv6 linux.git>make M=fs/fuse/
> CC [M] fs/fuse/dev_uring.o
> In file included from fs/fuse/dev_uring.c:7:
> fs/fuse/fuse_dev_i.h:15:52: warning: declaration of 'struct file' will not be visible outside of this function [-Wvisibility]
> static inline struct fuse_dev *fuse_get_dev(struct file *file)
> ^
> fs/fuse/fuse_dev_i.h:21:9: error: call to undeclared function 'READ_ONCE'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
> return READ_ONCE(file->private_data);
> ^
> fs/fuse/fuse_dev_i.h:21:23: error: incomplete definition of type 'struct file'
> return READ_ONCE(file->private_data);
> ~~~~^
>
>
> I could also include <linux/fs.h> in dev_uring.c, but isn't it cleaner
> to have the include in fuse_dev_i.h as it is that file that
> adds dependencies?
>
You're totally right, I had missed that this patch adds a new
caller of this header (dev_uring.c) - sorry for the noise!
> >>
...
> >> +
> >> #endif /* _LINUX_FUSE_H */
> >>
> >> --
> >> 2.43.0
> >>
>
> I will get it all fixed later this week! I will also review my own
> patches before v4; I just wanted to get v3 out asap, as it was already
> taking so much time after v2.
>
Gotcha, I'll wait until v4 to review the other patches in this set then.
Excited to follow all the progress on this!
Thanks,
Joanne
>
> Thanks,
> Bernd
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH RFC v3 06/17] fuse: Add the queue configuration ioctl
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (4 preceding siblings ...)
2024-09-01 13:36 ` [PATCH RFC v3 05/17] fuse: Add a uring config ioctl Bernd Schubert
@ 2024-09-01 13:37 ` Bernd Schubert
2024-09-04 22:23 ` Joanne Koong
2024-09-01 13:37 ` [PATCH RFC v3 07/17] fuse: {uring} Add a dev_release exception for fuse-over-io-uring Bernd Schubert
` (11 subsequent siblings)
17 siblings, 1 reply; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:37 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/fuse/dev.c | 30 ++++++++++++++++++++++++++++++
fs/fuse/dev_uring.c | 2 ++
fs/fuse/dev_uring_i.h | 13 +++++++++++++
fs/fuse/fuse_i.h | 4 ++++
include/uapi/linux/fuse.h | 39 +++++++++++++++++++++++++++++++++++++++
5 files changed, 88 insertions(+)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 6489179e7260..06ea4dc5ffe1 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2379,6 +2379,33 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
return fuse_backing_close(fud->fc, backing_id);
}
+#ifdef CONFIG_FUSE_IO_URING
+static long fuse_uring_queue_ioc(struct file *file, __u32 __user *argp)
+{
+ int res = 0;
+ struct fuse_dev *fud;
+ struct fuse_conn *fc;
+ struct fuse_ring_queue_config qcfg;
+
+ res = copy_from_user(&qcfg, (void *)argp, sizeof(qcfg));
+ if (res != 0)
+ return -EFAULT;
+
+ res = _fuse_dev_ioctl_clone(file, qcfg.control_fd);
+ if (res != 0)
+ return res;
+
+ fud = fuse_get_dev(file);
+ if (fud == NULL)
+ return -ENODEV;
+ fc = fud->fc;
+
+ fud->ring_q = fuse_uring_get_queue(fc->ring, qcfg.qid);
+
+ return 0;
+}
+#endif
+
static long
fuse_dev_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
@@ -2398,6 +2425,9 @@ fuse_dev_ioctl(struct file *file, unsigned int cmd,
#ifdef CONFIG_FUSE_IO_URING
case FUSE_DEV_IOC_URING_CFG:
return fuse_uring_conn_cfg(file, argp);
+
+ case FUSE_DEV_IOC_URING_QUEUE_CFG:
+ return fuse_uring_queue_ioc(file, argp);
#endif
default:
return -ENOTTY;
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 4e7518ef6527..4dcb4972242e 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -42,6 +42,8 @@ static void fuse_uring_queue_cfg(struct fuse_ring_queue *queue, int qid,
ent->queue = queue;
ent->tag = tag;
+
+ ent->state = FRRS_INIT;
}
}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index d4eff87bcd1f..301b37d16506 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -14,6 +14,13 @@
/* IORING_MAX_ENTRIES */
#define FUSE_URING_MAX_QUEUE_DEPTH 32768
+enum fuse_ring_req_state {
+
+ /* request is basially initialized */
+ FRRS_INIT = 1,
+
+};
+
/* A fuse ring entry, part of the ring queue */
struct fuse_ring_ent {
/* back pointer */
@@ -21,6 +28,12 @@ struct fuse_ring_ent {
/* array index in the ring-queue */
unsigned int tag;
+
+ /*
+ * state the request is currently in
+ * (enum fuse_ring_req_state)
+ */
+ unsigned long state;
};
struct fuse_ring_queue {
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 33e81b895fee..5eb8552d9d7f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -540,6 +540,10 @@ struct fuse_dev {
/** list entry on fc->devices */
struct list_head entry;
+
+#ifdef CONFIG_FUSE_IO_URING
+ struct fuse_ring_queue *ring_q;
+#endif
};
enum fuse_dax_mode {
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index a1c35e0338f0..143ed3c1c7b3 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1118,6 +1118,18 @@ struct fuse_ring_config {
uint8_t padding[64];
};
+struct fuse_ring_queue_config {
+ /* qid the command is for */
+ uint32_t qid;
+
+ /* /dev/fuse fd that initiated the mount. */
+ uint32_t control_fd;
+
+ /* for future extensions */
+ uint8_t padding[64];
+};
+
+
/* Device ioctls: */
#define FUSE_DEV_IOC_MAGIC 229
#define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
@@ -1126,6 +1138,8 @@ struct fuse_ring_config {
#define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
#define FUSE_DEV_IOC_URING_CFG _IOR(FUSE_DEV_IOC_MAGIC, 3, \
struct fuse_ring_config)
+#define FUSE_DEV_IOC_URING_QUEUE_CFG _IOR(FUSE_DEV_IOC_MAGIC, 3, \
+ struct fuse_ring_queue_config)
struct fuse_lseek_in {
uint64_t fh;
@@ -1233,4 +1247,29 @@ struct fuse_supp_groups {
#define FUSE_RING_HEADER_BUF_SIZE 4096
#define FUSE_RING_MIN_IN_OUT_ARG_SIZE 4096
+/**
+ * This structure mapped onto the
+ */
+struct fuse_ring_req {
+ union {
+ /* The first 4K are command data */
+ char ring_header[FUSE_RING_HEADER_BUF_SIZE];
+
+ struct {
+ uint64_t flags;
+
+ uint32_t in_out_arg_len;
+ uint32_t padding;
+
+ /* kernel fills in, reads out */
+ union {
+ struct fuse_in_header in;
+ struct fuse_out_header out;
+ };
+ };
+ };
+
+ char in_out_arg[];
+};
+
#endif /* _LINUX_FUSE_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH RFC v3 06/17] fuse: Add the queue configuration ioctl
2024-09-01 13:37 ` [PATCH RFC v3 06/17] fuse: Add the queue configuration ioctl Bernd Schubert
@ 2024-09-04 22:23 ` Joanne Koong
2024-09-04 22:38 ` Bernd Schubert
0 siblings, 1 reply; 37+ messages in thread
From: Joanne Koong @ 2024-09-04 22:23 UTC (permalink / raw)
To: Bernd Schubert
Cc: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd, linux-fsdevel,
io-uring, Josef Bacik, Amir Goldstein
On Sun, Sep 1, 2024 at 6:37 AM Bernd Schubert <[email protected]> wrote:
>
> Signed-off-by: Bernd Schubert <[email protected]>
> ---
> fs/fuse/dev.c | 30 ++++++++++++++++++++++++++++++
> fs/fuse/dev_uring.c | 2 ++
> fs/fuse/dev_uring_i.h | 13 +++++++++++++
> fs/fuse/fuse_i.h | 4 ++++
> include/uapi/linux/fuse.h | 39 +++++++++++++++++++++++++++++++++++++++
> 5 files changed, 88 insertions(+)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 6489179e7260..06ea4dc5ffe1 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -2379,6 +2379,33 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
> return fuse_backing_close(fud->fc, backing_id);
> }
>
> +#ifdef CONFIG_FUSE_IO_URING
> +static long fuse_uring_queue_ioc(struct file *file, __u32 __user *argp)
> +{
> + int res = 0;
> + struct fuse_dev *fud;
> + struct fuse_conn *fc;
> + struct fuse_ring_queue_config qcfg;
> +
> + res = copy_from_user(&qcfg, (void *)argp, sizeof(qcfg));
> + if (res != 0)
> + return -EFAULT;
> +
> + res = _fuse_dev_ioctl_clone(file, qcfg.control_fd);
I'm confused how this works for more than one queue. If I'm understanding this
correctly, if a system has multiple cores and the server would like
multi-queues, then the server needs to call the ioctl
FUSE_DEV_IOC_URING_QUEUE_CFG multiple times (each with a different
qid).
In this handler, when we get to _fuse_dev_ioctl_clone() ->
fuse_device_clone(), it allocates and installs a new fud and then sets
file->private_data to fud, but isn't this underlying file the same for
all of the queues since they are using the same fd for the ioctl
calls? It seems like every queue after the 1st would fail with -EINVAL
from the "if (new->private_data)" check in fuse_device_clone()?
Not sure if I'm missing something or if this intentionally doesn't
support multi-queue yet. If the latter, then I'm curious how you're
planning to get the fud for a specific queue given that
file->private_data and fuse_get_dev() only can support the single
queue case.
Thanks,
Joanne
> + if (res != 0)
> + return res;
> +
> + fud = fuse_get_dev(file);
> + if (fud == NULL)
> + return -ENODEV;
> + fc = fud->fc;
> +
> + fud->ring_q = fuse_uring_get_queue(fc->ring, qcfg.qid);
> +
> + return 0;
> +}
> +#endif
> +
> static long
> fuse_dev_ioctl(struct file *file, unsigned int cmd,
> unsigned long arg)
> @@ -2398,6 +2425,9 @@ fuse_dev_ioctl(struct file *file, unsigned int cmd,
> #ifdef CONFIG_FUSE_IO_URING
> case FUSE_DEV_IOC_URING_CFG:
> return fuse_uring_conn_cfg(file, argp);
> +
> + case FUSE_DEV_IOC_URING_QUEUE_CFG:
> + return fuse_uring_queue_ioc(file, argp);
> #endif
> default:
> return -ENOTTY;
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index 4e7518ef6527..4dcb4972242e 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -42,6 +42,8 @@ static void fuse_uring_queue_cfg(struct fuse_ring_queue *queue, int qid,
>
> ent->queue = queue;
> ent->tag = tag;
> +
> + ent->state = FRRS_INIT;
> }
> }
>
> diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
> index d4eff87bcd1f..301b37d16506 100644
> --- a/fs/fuse/dev_uring_i.h
> +++ b/fs/fuse/dev_uring_i.h
> @@ -14,6 +14,13 @@
> /* IORING_MAX_ENTRIES */
> #define FUSE_URING_MAX_QUEUE_DEPTH 32768
>
> +enum fuse_ring_req_state {
> +
> + /* request is basially initialized */
> + FRRS_INIT = 1,
> +
> +};
> +
> /* A fuse ring entry, part of the ring queue */
> struct fuse_ring_ent {
> /* back pointer */
> @@ -21,6 +28,12 @@ struct fuse_ring_ent {
>
> /* array index in the ring-queue */
> unsigned int tag;
> +
> + /*
> + * state the request is currently in
> + * (enum fuse_ring_req_state)
> + */
> + unsigned long state;
> };
>
> struct fuse_ring_queue {
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 33e81b895fee..5eb8552d9d7f 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -540,6 +540,10 @@ struct fuse_dev {
>
> /** list entry on fc->devices */
> struct list_head entry;
> +
> +#ifdef CONFIG_FUSE_IO_URING
> + struct fuse_ring_queue *ring_q;
> +#endif
> };
>
> enum fuse_dax_mode {
> diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> index a1c35e0338f0..143ed3c1c7b3 100644
> --- a/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -1118,6 +1118,18 @@ struct fuse_ring_config {
> uint8_t padding[64];
> };
>
> +struct fuse_ring_queue_config {
> + /* qid the command is for */
> + uint32_t qid;
> +
> + /* /dev/fuse fd that initiated the mount. */
> + uint32_t control_fd;
> +
> + /* for future extensions */
> + uint8_t padding[64];
> +};
> +
> +
> /* Device ioctls: */
> #define FUSE_DEV_IOC_MAGIC 229
> #define FUSE_DEV_IOC_CLONE _IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
> @@ -1126,6 +1138,8 @@ struct fuse_ring_config {
> #define FUSE_DEV_IOC_BACKING_CLOSE _IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
> #define FUSE_DEV_IOC_URING_CFG _IOR(FUSE_DEV_IOC_MAGIC, 3, \
> struct fuse_ring_config)
> +#define FUSE_DEV_IOC_URING_QUEUE_CFG _IOR(FUSE_DEV_IOC_MAGIC, 3, \
> + struct fuse_ring_queue_config)
>
> struct fuse_lseek_in {
> uint64_t fh;
> @@ -1233,4 +1247,29 @@ struct fuse_supp_groups {
> #define FUSE_RING_HEADER_BUF_SIZE 4096
> #define FUSE_RING_MIN_IN_OUT_ARG_SIZE 4096
>
> +/**
> + * This structure mapped onto the
> + */
> +struct fuse_ring_req {
> + union {
> + /* The first 4K are command data */
> + char ring_header[FUSE_RING_HEADER_BUF_SIZE];
> +
> + struct {
> + uint64_t flags;
> +
> + uint32_t in_out_arg_len;
> + uint32_t padding;
> +
> + /* kernel fills in, reads out */
> + union {
> + struct fuse_in_header in;
> + struct fuse_out_header out;
> + };
> + };
> + };
> +
> + char in_out_arg[];
> +};
> +
> #endif /* _LINUX_FUSE_H */
>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH RFC v3 06/17] fuse: Add the queue configuration ioctl
2024-09-04 22:23 ` Joanne Koong
@ 2024-09-04 22:38 ` Bernd Schubert
2024-09-04 22:42 ` Joanne Koong
0 siblings, 1 reply; 37+ messages in thread
From: Bernd Schubert @ 2024-09-04 22:38 UTC (permalink / raw)
To: Joanne Koong, Bernd Schubert
Cc: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd, linux-fsdevel,
io-uring, Josef Bacik, Amir Goldstein
On 9/5/24 00:23, Joanne Koong wrote:
> On Sun, Sep 1, 2024 at 6:37 AM Bernd Schubert <[email protected]> wrote:
>>
>> Signed-off-by: Bernd Schubert <[email protected]>
>> ---
>> fs/fuse/dev.c | 30 ++++++++++++++++++++++++++++++
>> fs/fuse/dev_uring.c | 2 ++
>> fs/fuse/dev_uring_i.h | 13 +++++++++++++
>> fs/fuse/fuse_i.h | 4 ++++
>> include/uapi/linux/fuse.h | 39 +++++++++++++++++++++++++++++++++++++++
>> 5 files changed, 88 insertions(+)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 6489179e7260..06ea4dc5ffe1 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -2379,6 +2379,33 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
>> return fuse_backing_close(fud->fc, backing_id);
>> }
>>
>> +#ifdef CONFIG_FUSE_IO_URING
>> +static long fuse_uring_queue_ioc(struct file *file, __u32 __user *argp)
>> +{
>> + int res = 0;
>> + struct fuse_dev *fud;
>> + struct fuse_conn *fc;
>> + struct fuse_ring_queue_config qcfg;
>> +
>> + res = copy_from_user(&qcfg, (void *)argp, sizeof(qcfg));
>> + if (res != 0)
>> + return -EFAULT;
>> +
>> + res = _fuse_dev_ioctl_clone(file, qcfg.control_fd);
>
> I'm confused how this works for more than one queue. If I'm understanding this
> correctly, if a system has multiple cores and the server would like
> multi-queues, then the server needs to call the ioctl
> FUSE_DEV_IOC_URING_QUEUE_CFG multiple times (each with a different
> qid).
>
> In this handler, when we get to _fuse_dev_ioctl_clone() ->
> fuse_device_clone(), it allocates and installs a new fud and then sets
> file->private_data to fud, but isn't this underlying file the same for
> all of the queues since they are using the same fd for the ioctl
> calls? It seems like every queue after the 1st would fail with -EINVAL
> from the "if (new->private_data)" check in fuse_device_clone()?
Each queue is using its own fd - this works exactly the same as
the existing FUSE_DEV_IOC_CLONE - each clone has to open /dev/fuse on its
own. A bit of a pity that dup() isn't sufficient. The only difference to
FUSE_DEV_IOC_CLONE is the additional qid.
>
> Not sure if I'm missing something or if this intentionally doesn't
> support multi-queue yet. If the latter, then I'm curious how you're
> planning to get the fud for a specific queue given that
> file->private_data and fuse_get_dev() only can support the single
> queue case.
Strictly in the current patch set, the clone is only needed in the
next patch,
"07/17] fuse: {uring} Add a dev_release exception for fuse-over-io-uring".
Though, since we have the fud and its link to the ring queue anyway,
"08/17] fuse: {uring} Handle SQEs - register commands"
makes use of it in fuse_uring_cmd().
I hope I understood your question right.
Thanks,
Bernd
^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: [PATCH RFC v3 06/17] fuse: Add the queue configuration ioctl
2024-09-04 22:38 ` Bernd Schubert
@ 2024-09-04 22:42 ` Joanne Koong
0 siblings, 0 replies; 37+ messages in thread
From: Joanne Koong @ 2024-09-04 22:42 UTC (permalink / raw)
To: Bernd Schubert
Cc: Bernd Schubert, Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd,
linux-fsdevel, io-uring, Josef Bacik, Amir Goldstein
On Wed, Sep 4, 2024 at 3:38 PM Bernd Schubert
<[email protected]> wrote:
>
> On 9/5/24 00:23, Joanne Koong wrote:
> > On Sun, Sep 1, 2024 at 6:37 AM Bernd Schubert <[email protected]> wrote:
> >>
> >> Signed-off-by: Bernd Schubert <[email protected]>
> >> ---
> >> fs/fuse/dev.c | 30 ++++++++++++++++++++++++++++++
> >> fs/fuse/dev_uring.c | 2 ++
> >> fs/fuse/dev_uring_i.h | 13 +++++++++++++
> >> fs/fuse/fuse_i.h | 4 ++++
> >> include/uapi/linux/fuse.h | 39 +++++++++++++++++++++++++++++++++++++++
> >> 5 files changed, 88 insertions(+)
> >>
> >> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> >> index 6489179e7260..06ea4dc5ffe1 100644
> >> --- a/fs/fuse/dev.c
> >> +++ b/fs/fuse/dev.c
> >> @@ -2379,6 +2379,33 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
> >> return fuse_backing_close(fud->fc, backing_id);
> >> }
> >>
> >> +#ifdef CONFIG_FUSE_IO_URING
> >> +static long fuse_uring_queue_ioc(struct file *file, __u32 __user *argp)
> >> +{
> >> + int res = 0;
> >> + struct fuse_dev *fud;
> >> + struct fuse_conn *fc;
> >> + struct fuse_ring_queue_config qcfg;
> >> +
> >> + res = copy_from_user(&qcfg, (void *)argp, sizeof(qcfg));
> >> + if (res != 0)
> >> + return -EFAULT;
> >> +
> >> + res = _fuse_dev_ioctl_clone(file, qcfg.control_fd);
> >
> > I'm confused how this works for > 1 queues. If I'm understanding this
> > correctly, if a system has multiple cores and the server would like
> > multi-queues, then the server needs to call the ioctl
> > FUSE_DEV_IOC_URING_QUEUE_CFG multiple times (each with a different
> > qid).
> >
> > In this handler, when we get to _fuse_dev_ioctl_clone() ->
> > fuse_device_clone(), it allocates and installs a new fud and then sets
> > file->private_data to fud, but isn't this underlying file the same for
> > all of the queues since they are using the same fd for the ioctl
> > calls? It seems like every queue after the 1st would fail with -EINVAL
> > from the "if (new->private_data)" check in fuse_device_clone()?
>
> Each queue is using it's own fd - this works exactly the same as
> a existing FUSE_DEV_IOC_CLONE - each clone has to open /dev/fuse on its
Ah gotcha, this is the part I was missing. I didn't realize the
expectation is that the server needs to open a new /dev/fuse for each
queue. This makes more sense to me now, thanks.
> own. A bit a pity that dup() isn't sufficient. Only difference to
> FUSE_DEV_IOC_CLONE is the additional qid.
>
> >
> > Not sure if I'm missing something or if this intentionally doesn't
> > support multi-queue yet. If the latter, then I'm curious how you're
> > planning to get the fud for a specific queue given that
> > file->private_data and fuse_get_dev() only can support the single
> > queue case.
>
>
> Strictly in the current patch set, the clone is only needed in the
> next patch
> "07/17] fuse: {uring} Add a dev_release exception for fuse-over-io-uring"
> Though, since we have the fud anyway and link to the ring-queue, it makes
> use of it in
> 08/17] fuse: {uring} Handle SQEs - register commands
>
> in fuse_uring_cmd().
>
>
> I hope I understood your question right.
>
>
> Thanks,
> Bernd
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH RFC v3 07/17] fuse: {uring} Add a dev_release exception for fuse-over-io-uring
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (5 preceding siblings ...)
2024-09-01 13:37 ` [PATCH RFC v3 06/17] fuse: Add the queue configuration ioctl Bernd Schubert
@ 2024-09-01 13:37 ` Bernd Schubert
2024-09-01 13:37 ` [PATCH RFC v3 08/17] fuse: {uring} Handle SQEs - register commands Bernd Schubert
` (10 subsequent siblings)
17 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:37 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
fuse-over-io-uring needs an implicit device clone, which is done per
queue to avoid a hanging "umount" when the daemon side is already terminated.
The reason is that fuse_dev_release() is not called when there are queued
(waiting) io_uring commands.
The solution is the implicit device clone and an exception in fuse_dev_release
for uring devices to abort the connection when only uring devices
are left.
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/fuse/dev.c | 34 ++++++++++++++++++++++++++++++++--
fs/fuse/dev_uring_i.h | 24 +++++++++++++++++-------
2 files changed, 49 insertions(+), 9 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 06ea4dc5ffe1..fec995818a9e 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2258,6 +2258,8 @@ int fuse_dev_release(struct inode *inode, struct file *file)
struct fuse_pqueue *fpq = &fud->pq;
LIST_HEAD(to_end);
unsigned int i;
+ int dev_cnt;
+ bool abort_conn = false;
spin_lock(&fpq->lock);
WARN_ON(!list_empty(&fpq->io));
@@ -2267,8 +2269,36 @@ int fuse_dev_release(struct inode *inode, struct file *file)
fuse_dev_end_requests(&to_end);
- /* Are we the last open device? */
- if (atomic_dec_and_test(&fc->dev_count)) {
+ /* Are we the last open device? */
+ dev_cnt = atomic_dec_return(&fc->dev_count);
+ if (dev_cnt == 0)
+ abort_conn = true;
+
+#ifdef CONFIG_FUSE_IO_URING
+ /*
+ * Or is this with io_uring and only ring devices left?
+ * These devices will not receive a ->release() as long as
+ * there are io_uring_cmd's waiting and not completed
+ * with io_uring_cmd_done yet
+ */
+ if (fuse_uring_configured(fc)) {
+ struct fuse_dev *list_dev;
+ bool all_uring = true;
+
+ spin_lock(&fc->lock);
+ list_for_each_entry(list_dev, &fc->devices, entry) {
+ if (list_dev == fud)
+ continue;
+ if (!list_dev->ring_q)
+ all_uring = false;
+ }
+ spin_unlock(&fc->lock);
+ if (all_uring)
+ abort_conn = true;
+ }
+#endif
+
+ if (abort_conn) {
WARN_ON(fc->iq.fasync != NULL);
fuse_abort_conn(fc);
}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 301b37d16506..26266f923321 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -15,10 +15,10 @@
#define FUSE_URING_MAX_QUEUE_DEPTH 32768
enum fuse_ring_req_state {
+ FRRS_INVALID = 0,
/* request is basially initialized */
- FRRS_INIT = 1,
-
+ FRRS_INIT,
};
/* A fuse ring entry, part of the ring queue */
@@ -29,11 +29,8 @@ struct fuse_ring_ent {
/* array index in the ring-queue */
unsigned int tag;
- /*
- * state the request is currently in
- * (enum fuse_ring_req_state)
- */
- unsigned long state;
+ /* state the request is currently in */
+ enum fuse_ring_req_state state;
};
struct fuse_ring_queue {
@@ -108,6 +105,14 @@ fuse_uring_get_queue(struct fuse_ring *ring, int qid)
return (struct fuse_ring_queue *)(ptr + qid * ring->queue_size);
}
+static inline bool fuse_uring_configured(struct fuse_conn *fc)
+{
+ if (fc->ring != NULL)
+ return true;
+
+ return false;
+}
+
#else /* CONFIG_FUSE_IO_URING */
struct fuse_ring;
@@ -121,6 +126,11 @@ static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
{
}
+static inline bool fuse_uring_configured(struct fuse_conn *fc)
+{
+ return false;
+}
+
#endif /* CONFIG_FUSE_IO_URING */
#endif /* _FS_FUSE_DEV_URING_I_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH RFC v3 08/17] fuse: {uring} Handle SQEs - register commands
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (6 preceding siblings ...)
2024-09-01 13:37 ` [PATCH RFC v3 07/17] fuse: {uring} Add a dev_release exception for fuse-over-io-uring Bernd Schubert
@ 2024-09-01 13:37 ` Bernd Schubert
2024-09-04 15:40 ` Jens Axboe
2024-09-01 13:37 ` [PATCH RFC v3 09/17] fuse: Make fuse_copy non static Bernd Schubert
` (9 subsequent siblings)
17 siblings, 1 reply; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:37 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
This adds basic support for ring SQEs (with opcode=IORING_OP_URING_CMD).
For now only FUSE_URING_REQ_FETCH is handled to register queue entries.
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/fuse/dev.c | 3 +
fs/fuse/dev_uring.c | 231 ++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/dev_uring_i.h | 60 ++++++++++++
include/uapi/linux/fuse.h | 38 ++++++++
4 files changed, 332 insertions(+)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index fec995818a9e..998027825481 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2477,6 +2477,9 @@ const struct file_operations fuse_dev_operations = {
.fasync = fuse_dev_fasync,
.unlocked_ioctl = fuse_dev_ioctl,
.compat_ioctl = compat_ptr_ioctl,
+#ifdef CONFIG_FUSE_IO_URING
+ .uring_cmd = fuse_uring_cmd,
+#endif
};
EXPORT_SYMBOL_GPL(fuse_dev_operations);
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 4dcb4972242e..46c2274193bf 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -29,6 +29,30 @@
#include <linux/topology.h>
#include <linux/io_uring/cmd.h>
+static int fuse_ring_ring_ent_unset_userspace(struct fuse_ring_ent *ent)
+{
+ if (WARN_ON_ONCE(ent->state != FRRS_USERSPACE))
+ return -EIO;
+
+ ent->state = FRRS_COMMIT;
+ list_del_init(&ent->list);
+
+ return 0;
+}
+
+/* Update conn limits according to ring values */
+static void fuse_uring_conn_cfg_limits(struct fuse_ring *ring)
+{
+ struct fuse_conn *fc = ring->fc;
+
+ /*
+ * This not ideal, as multiplication with nr_queue assumes the limit
+ * gets reached when all queues are used, but even a single queue
+ * might reach the limit.
+ */
+ WRITE_ONCE(fc->max_background, ring->nr_queues * ring->max_nr_async);
+}
+
static void fuse_uring_queue_cfg(struct fuse_ring_queue *queue, int qid,
struct fuse_ring *ring)
{
@@ -37,6 +61,11 @@ static void fuse_uring_queue_cfg(struct fuse_ring_queue *queue, int qid,
queue->qid = qid;
queue->ring = ring;
+ spin_lock_init(&queue->lock);
+
+ INIT_LIST_HEAD(&queue->sync_ent_avail_queue);
+ INIT_LIST_HEAD(&queue->async_ent_avail_queue);
+
for (tag = 0; tag < ring->queue_depth; tag++) {
struct fuse_ring_ent *ent = &queue->ring_ent[tag];
@@ -44,6 +73,8 @@ static void fuse_uring_queue_cfg(struct fuse_ring_queue *queue, int qid,
ent->tag = tag;
ent->state = FRRS_INIT;
+
+ INIT_LIST_HEAD(&ent->list);
}
}
@@ -141,3 +172,203 @@ int fuse_uring_conn_cfg(struct file *file, void __user *argp)
kvfree(ring);
return res;
}
+
+/*
+ * Put a ring request onto hold, it is no longer used for now.
+ */
+static void fuse_uring_ent_avail(struct fuse_ring_ent *ring_ent,
+ struct fuse_ring_queue *queue)
+ __must_hold(&queue->lock)
+{
+ struct fuse_ring *ring = queue->ring;
+
+ lockdep_assert_held(&queue->lock);
+
+ /* unsets all previous flags - basically resets */
+ pr_devel("%s ring=%p qid=%d tag=%d state=%d async=%d\n", __func__,
+ ring, ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
+ ring_ent->async);
+
+ if (WARN_ON(ring_ent->state != FRRS_COMMIT)) {
+ pr_warn("%s qid=%d tag=%d state=%d async=%d\n", __func__,
+ ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
+ ring_ent->async);
+ return;
+ }
+
+ WARN_ON_ONCE(!list_empty(&ring_ent->list));
+
+ if (ring_ent->async)
+ list_add(&ring_ent->list, &queue->async_ent_avail_queue);
+ else
+ list_add(&ring_ent->list, &queue->sync_ent_avail_queue);
+
+ ring_ent->state = FRRS_WAIT;
+}
+
+/*
+ * fuse_uring_req_fetch command handling
+ */
+static int _fuse_uring_fetch(struct fuse_ring_ent *ring_ent,
+ struct io_uring_cmd *cmd, unsigned int issue_flags)
+__must_hold(ring_ent->queue->lock)
+{
+ struct fuse_ring_queue *queue = ring_ent->queue;
+ struct fuse_ring *ring = queue->ring;
+ int nr_ring_sqe;
+
+ lockdep_assert_held(&queue->lock);
+
+ /* register foreground (sync) requests first, then background (async) */
+ if (queue->nr_req_sync >= ring->max_nr_sync) {
+ queue->nr_req_async++;
+ ring_ent->async = 1;
+ } else {
+ queue->nr_req_sync++;
+ }
+
+ fuse_uring_ent_avail(ring_ent, queue);
+
+ if (WARN_ON_ONCE(queue->nr_req_sync +
+ queue->nr_req_async > ring->queue_depth)) {
+ /*
+ * should have been caught by the earlier ring state and
+ * queue depth checks
+ */
+ pr_info("qid=%d tag=%d req count (sync=%d async=%d) exceeds depth=%zu\n",
+ queue->qid, ring_ent->tag, queue->nr_req_sync,
+ queue->nr_req_async, ring->queue_depth);
+ return -ERANGE;
+ }
+
+ WRITE_ONCE(ring_ent->cmd, cmd);
+
+ nr_ring_sqe = ring->queue_depth * ring->nr_queues;
+ if (atomic_inc_return(&ring->nr_sqe_init) == nr_ring_sqe) {
+ fuse_uring_conn_cfg_limits(ring);
+ ring->ready = 1;
+ }
+
+ return 0;
+}
+
+static int fuse_uring_fetch(struct fuse_ring_ent *ring_ent,
+ struct io_uring_cmd *cmd, unsigned int issue_flags)
+ __releases(ring_ent->queue->lock)
+{
+ struct fuse_ring *ring = ring_ent->queue->ring;
+ struct fuse_ring_queue *queue = ring_ent->queue;
+ int ret;
+
+ /* The entry must still be in its initial state */
+ ret = -EINVAL;
+ if (ring_ent->state != FRRS_INIT)
+ goto err;
+
+ /*
+ * FUSE_URING_REQ_FETCH is an initialization exception and needs a
+ * state override
+ */
+ ring_ent->state = FRRS_USERSPACE;
+ ret = fuse_ring_ring_ent_unset_userspace(ring_ent);
+ if (ret != 0) {
+ pr_info_ratelimited(
+ "qid=%d tag=%d register req state %d expected %d",
+ queue->qid, ring_ent->tag, ring_ent->state,
+ FRRS_INIT);
+ goto err;
+ }
+
+ ret = _fuse_uring_fetch(ring_ent, cmd, issue_flags);
+ if (ret)
+ goto err;
+
+ /*
+ * The ring entry is registered now and needs to be handled
+ * during shutdown.
+ */
+ atomic_inc(&ring->queue_refs);
+err:
+ spin_unlock(&queue->lock);
+ return ret;
+}
+
+/**
+ * Entry function from io_uring to handle the given passthrough command
+ * (op code IORING_OP_URING_CMD)
+ */
+int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+ const struct fuse_uring_cmd_req *cmd_req = io_uring_sqe_cmd(cmd->sqe);
+ struct fuse_dev *fud;
+ struct fuse_conn *fc;
+ struct fuse_ring *ring;
+ struct fuse_ring_queue *queue;
+ struct fuse_ring_ent *ring_ent = NULL;
+ u32 cmd_op = cmd->cmd_op;
+ int ret = 0;
+
+ ret = -ENODEV;
+ fud = fuse_get_dev(cmd->file);
+ if (!fud)
+ goto out;
+ fc = fud->fc;
+
+ ring = fc->ring;
+ if (!ring)
+ goto out;
+
+ queue = fud->ring_q;
+ if (!queue)
+ goto out;
+
+ ret = -EINVAL;
+ if (queue->qid != cmd_req->qid)
+ goto out;
+
+ ret = -ERANGE;
+ if (cmd_req->tag >= ring->queue_depth)
+ goto out;
+
+ ring_ent = &queue->ring_ent[cmd_req->tag];
+
+ pr_devel("%s:%d received: cmd op %d qid %d (%p) tag %d (%p)\n",
+ __func__, __LINE__, cmd_op, cmd_req->qid, queue, cmd_req->tag,
+ ring_ent);
+
+ spin_lock(&queue->lock);
+ ret = -ENOTCONN;
+ if (unlikely(fc->aborted || queue->stopped))
+ goto err_unlock;
+
+ switch (cmd_op) {
+ case FUSE_URING_REQ_FETCH:
+ ret = fuse_uring_fetch(ring_ent, cmd, issue_flags);
+ break;
+ default:
+ ret = -EINVAL;
+ pr_devel("Unknown uring command %d\n", cmd_op);
+ goto err_unlock;
+ }
+out:
+ pr_devel("uring cmd op=%d, qid=%d tag=%d ret=%d\n", cmd_op,
+ cmd_req->qid, cmd_req->tag, ret);
+
+ if (ret < 0) {
+ if (ring_ent != NULL) {
+ pr_info_ratelimited("error: uring cmd op=%d, qid=%d tag=%d ret=%d\n",
+ cmd_op, cmd_req->qid, cmd_req->tag,
+ ret);
+
+ /* must not change the entry state, as userspace
+ * might have sent random data, but valid requests
+ * might be registered already - don't confuse those.
+ */
+ }
+ io_uring_cmd_done(cmd, ret, 0, issue_flags);
+ }
+
+ return -EIOCBQUEUED;
+
+err_unlock:
+ spin_unlock(&queue->lock);
+ goto out;
+}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 26266f923321..6561f4178cac 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -19,6 +19,15 @@ enum fuse_ring_req_state {
/* request is basically initialized */
FRRS_INIT,
+
+ /* ring entry received from userspace and is being processed */
+ FRRS_COMMIT,
+
+ /* The ring request waits for a new fuse request */
+ FRRS_WAIT,
+
+ /* request is in or on the way to user space */
+ FRRS_USERSPACE,
};
/* A fuse ring entry, part of the ring queue */
@@ -31,6 +40,13 @@ struct fuse_ring_ent {
/* state the request is currently in */
enum fuse_ring_req_state state;
+
+ /* is this an async or sync entry */
+ unsigned int async : 1;
+
+ struct list_head list;
+
+ struct io_uring_cmd *cmd;
};
struct fuse_ring_queue {
@@ -43,6 +59,30 @@ struct fuse_ring_queue {
/* queue id, typically also corresponds to the cpu core */
unsigned int qid;
+ /*
+ * queue lock, taken when any value in the queue changes _and_ also
+ * when a ring entry state changes
+ */
+ spinlock_t lock;
+
+ /* available ring entries (struct fuse_ring_ent) */
+ struct list_head async_ent_avail_queue;
+ struct list_head sync_ent_avail_queue;
+
+ /*
+ * available number of sync requests,
+ * loosely bound to fuse foreground requests
+ */
+ int nr_req_sync;
+
+ /*
+ * available number of async requests
+ * loosely bound to fuse background requests
+ */
+ int nr_req_async;
+
+ unsigned int stopped : 1;
+
/* size depends on queue depth */
struct fuse_ring_ent ring_ent[] ____cacheline_aligned_in_smp;
};
@@ -79,11 +119,21 @@ struct fuse_ring {
/* numa aware memory allocation */
unsigned int numa_aware : 1;
+ /* Is the ring ready to take requests */
+ unsigned int ready : 1;
+
+ /* number of SQEs initialized */
+ atomic_t nr_sqe_init;
+
+ /* Used to release the ring on stop */
+ atomic_t queue_refs;
+
struct fuse_ring_queue queues[] ____cacheline_aligned_in_smp;
};
void fuse_uring_abort_end_requests(struct fuse_ring *ring);
int fuse_uring_conn_cfg(struct file *file, void __user *argp);
+int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
{
@@ -113,6 +163,11 @@ static inline bool fuse_uring_configured(struct fuse_conn *fc)
return false;
}
+static inline bool fuse_per_core_queue(struct fuse_conn *fc)
+{
+ return fc->ring && fc->ring->per_core_queue;
+}
+
#else /* CONFIG_FUSE_IO_URING */
struct fuse_ring;
@@ -131,6 +186,11 @@ static inline bool fuse_uring_configured(struct fuse_conn *fc)
return false;
}
+static inline bool fuse_per_core_queue(struct fuse_conn *fc)
+{
+ return false;
+}
+
#endif /* CONFIG_FUSE_IO_URING */
#endif /* _FS_FUSE_DEV_URING_I_H */
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 143ed3c1c7b3..586358e9992c 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -1247,6 +1247,12 @@ struct fuse_supp_groups {
#define FUSE_RING_HEADER_BUF_SIZE 4096
#define FUSE_RING_MIN_IN_OUT_ARG_SIZE 4096
+/*
+ * Request is background type. Daemon side is free to use this information
+ * to handle foreground/background CQEs with different priorities.
+ */
+#define FUSE_RING_REQ_FLAG_ASYNC (1ull << 0)
+
/**
* This structure is mapped onto the mmaped ring buffer
*/
@@ -1272,4 +1278,36 @@ struct fuse_ring_req {
char in_out_arg[];
};
+/**
+ * sqe commands to the kernel
+ */
+enum fuse_uring_cmd {
+ FUSE_URING_REQ_INVALID = 0,
+
+ /* submit sqe to kernel to get a request */
+ FUSE_URING_REQ_FETCH = 1,
+
+ /* commit result and fetch next request */
+ FUSE_URING_REQ_COMMIT_AND_FETCH = 2,
+};
+
+/**
+ * In the 80B command area of the SQE.
+ */
+struct fuse_uring_cmd_req {
+ /* User buffer */
+ uint64_t buf_ptr;
+
+ /* length of the user buffer */
+ uint32_t buf_len;
+
+ /* queue the command is for (queue index) */
+ uint16_t qid;
+
+ /* queue entry (array index) */
+ uint16_t tag;
+
+ uint32_t flags;
+};
+
#endif /* _LINUX_FUSE_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH RFC v3 08/17] fuse: {uring} Handle SQEs - register commands
2024-09-01 13:37 ` [PATCH RFC v3 08/17] fuse: {uring} Handle SQEs - register commands Bernd Schubert
@ 2024-09-04 15:40 ` Jens Axboe
0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-09-04 15:40 UTC (permalink / raw)
To: Bernd Schubert, Miklos Szeredi, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein
On 9/1/24 7:37 AM, Bernd Schubert wrote:
> +/**
> + * Entry function from io_uring to handle the given passthrough command
> + * (op code IORING_OP_URING_CMD)
> + */
> +int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
> +{
> + const struct fuse_uring_cmd_req *cmd_req = io_uring_sqe_cmd(cmd->sqe);
> + struct fuse_dev *fud;
> + struct fuse_conn *fc;
> + struct fuse_ring *ring;
> + struct fuse_ring_queue *queue;
> + struct fuse_ring_ent *ring_ent = NULL;
> + u32 cmd_op = cmd->cmd_op;
> + int ret = 0;
> +
> + ret = -ENODEV;
> + fud = fuse_get_dev(cmd->file);
> + if (!fud)
> + goto out;
> + fc = fud->fc;
> +
> + ring = fc->ring;
> + if (!ring)
> + goto out;
> +
> + queue = fud->ring_q;
> + if (!queue)
> + goto out;
> +
> + ret = -EINVAL;
> + if (queue->qid != cmd_req->qid)
> + goto out;
> +
> + ret = -ERANGE;
> + if (cmd_req->tag > ring->queue_depth)
> + goto out;
> +
> + ring_ent = &queue->ring_ent[cmd_req->tag];
> +
> + pr_devel("%s:%d received: cmd op %d qid %d (%p) tag %d (%p)\n",
> + __func__, __LINE__, cmd_op, cmd_req->qid, queue, cmd_req->tag,
> + ring_ent);
> +
> + spin_lock(&queue->lock);
> + ret = -ENOTCONN;
> + if (unlikely(fc->aborted || queue->stopped))
> + goto err_unlock;
> +
> + switch (cmd_op) {
> + case FUSE_URING_REQ_FETCH:
> + ret = fuse_uring_fetch(ring_ent, cmd, issue_flags);
> + break;
> + default:
> + ret = -EINVAL;
> + pr_devel("Unknown uring command %d", cmd_op);
> + goto err_unlock;
> + }
> +out:
> + pr_devel("uring cmd op=%d, qid=%d tag=%d ret=%d\n", cmd_op,
> + cmd_req->qid, cmd_req->tag, ret);
> +
> + if (ret < 0) {
> + if (ring_ent != NULL) {
> + pr_info_ratelimited("error: uring cmd op=%d, qid=%d tag=%d ret=%d\n",
> + cmd_op, cmd_req->qid, cmd_req->tag,
> + ret);
> +
> + /* must not change the entry state, as userspace
> + * might have sent random data, but valid requests
> + * might be registered already - don't confuse those.
> + */
> + }
> + io_uring_cmd_done(cmd, ret, 0, issue_flags);
> + }
> +
> + return -EIOCBQUEUED;
> +
> +err_unlock:
> + spin_unlock(&queue->lock);
> + goto out;
> +}
Just a minor thing, but you should be able to just return an error from
here, at least for commands that don't yet have teardown associated
with them, rather than punting through task_work for that too. Doesn't
really matter, and maybe it's cleaner to just keep it the same.
--
Jens Axboe
^ permalink raw reply [flat|nested] 37+ messages in thread
* [PATCH RFC v3 09/17] fuse: Make fuse_copy non static
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (7 preceding siblings ...)
2024-09-01 13:37 ` [PATCH RFC v3 08/17] fuse: {uring} Handle SQEs - register commands Bernd Schubert
@ 2024-09-01 13:37 ` Bernd Schubert
2024-09-01 13:37 ` [PATCH RFC v3 10/17] fuse: Add buffer offset for uring into fuse_copy_state Bernd Schubert
` (8 subsequent siblings)
17 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:37 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
Move 'struct fuse_copy_state' and fuse_copy_* functions
to fuse_dev_i.h to make it available for fuse-uring.
'copy_out_args()' is renamed to 'fuse_copy_out_args'.
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/fuse/dev.c | 30 ++++++++----------------------
fs/fuse/fuse_dev_i.h | 25 +++++++++++++++++++++++++
2 files changed, 33 insertions(+), 22 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 998027825481..9e012c902df2 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -637,22 +637,8 @@ static int unlock_request(struct fuse_req *req)
return err;
}
-struct fuse_copy_state {
- int write;
- struct fuse_req *req;
- struct iov_iter *iter;
- struct pipe_buffer *pipebufs;
- struct pipe_buffer *currbuf;
- struct pipe_inode_info *pipe;
- unsigned long nr_segs;
- struct page *pg;
- unsigned len;
- unsigned offset;
- unsigned move_pages:1;
-};
-
-static void fuse_copy_init(struct fuse_copy_state *cs, int write,
- struct iov_iter *iter)
+void fuse_copy_init(struct fuse_copy_state *cs, int write,
+ struct iov_iter *iter)
{
memset(cs, 0, sizeof(*cs));
cs->write = write;
@@ -1006,9 +992,9 @@ static int fuse_copy_one(struct fuse_copy_state *cs, void *val, unsigned size)
}
/* Copy request arguments to/from userspace buffer */
-static int fuse_copy_args(struct fuse_copy_state *cs, unsigned numargs,
- unsigned argpages, struct fuse_arg *args,
- int zeroing)
+int fuse_copy_args(struct fuse_copy_state *cs, unsigned numargs,
+ unsigned argpages, struct fuse_arg *args,
+ int zeroing)
{
int err = 0;
unsigned i;
@@ -1874,8 +1860,8 @@ static struct fuse_req *request_find(struct fuse_pqueue *fpq, u64 unique)
return NULL;
}
-static int copy_out_args(struct fuse_copy_state *cs, struct fuse_args *args,
- unsigned nbytes)
+int fuse_copy_out_args(struct fuse_copy_state *cs, struct fuse_args *args,
+ unsigned nbytes)
{
unsigned reqsize = sizeof(struct fuse_out_header);
@@ -1977,7 +1963,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
if (oh.error)
err = nbytes != sizeof(oh) ? -EINVAL : 0;
else
- err = copy_out_args(cs, req->args, nbytes);
+ err = fuse_copy_out_args(cs, req->args, nbytes);
fuse_copy_finish(cs);
spin_lock(&fpq->lock);
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index e6289bafb788..0fc7a0ff7b1c 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -13,6 +13,23 @@
#define FUSE_INT_REQ_BIT (1ULL << 0)
#define FUSE_REQ_ID_STEP (1ULL << 1)
+struct fuse_arg;
+struct fuse_args;
+
+struct fuse_copy_state {
+ int write;
+ struct fuse_req *req;
+ struct iov_iter *iter;
+ struct pipe_buffer *pipebufs;
+ struct pipe_buffer *currbuf;
+ struct pipe_inode_info *pipe;
+ unsigned long nr_segs;
+ struct page *pg;
+ unsigned int len;
+ unsigned int offset;
+ unsigned int move_pages:1;
+};
+
static inline struct fuse_dev *fuse_get_dev(struct file *file)
{
/*
@@ -24,6 +41,14 @@ static inline struct fuse_dev *fuse_get_dev(struct file *file)
void fuse_dev_end_requests(struct list_head *head);
+void fuse_copy_init(struct fuse_copy_state *cs, int write,
+ struct iov_iter *iter);
+int fuse_copy_args(struct fuse_copy_state *cs, unsigned int numargs,
+ unsigned int argpages, struct fuse_arg *args,
+ int zeroing);
+int fuse_copy_out_args(struct fuse_copy_state *cs, struct fuse_args *args,
+ unsigned int nbytes);
+
#endif
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH RFC v3 10/17] fuse: Add buffer offset for uring into fuse_copy_state
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (8 preceding siblings ...)
2024-09-01 13:37 ` [PATCH RFC v3 09/17] fuse: Make fuse_copy non static Bernd Schubert
@ 2024-09-01 13:37 ` Bernd Schubert
2024-09-01 13:37 ` [PATCH RFC v3 11/17] fuse: {uring} Add uring sqe commit and fetch support Bernd Schubert
` (7 subsequent siblings)
17 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:37 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
This adds a buffer offset to fuse_copy_state, so that the overall
size of the copy is known.
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/fuse/dev.c | 13 ++++++++++++-
fs/fuse/fuse_dev_i.h | 5 +++++
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 9e012c902df2..71443a93f1d4 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -745,6 +745,9 @@ static int fuse_copy_do(struct fuse_copy_state *cs, void **val, unsigned *size)
*size -= ncpy;
cs->len -= ncpy;
cs->offset += ncpy;
+ if (cs->is_uring)
+ cs->ring.offset += ncpy;
+
return ncpy;
}
@@ -1863,7 +1866,15 @@ static struct fuse_req *request_find(struct fuse_pqueue *fpq, u64 unique)
int fuse_copy_out_args(struct fuse_copy_state *cs, struct fuse_args *args,
unsigned nbytes)
{
- unsigned reqsize = sizeof(struct fuse_out_header);
+
+ unsigned int reqsize = 0;
+
+ /*
+ * Uring has the out header outside of args
+ * XXX: This uring exception will probably change
+ */
+ if (!cs->is_uring)
+ reqsize = sizeof(struct fuse_out_header);
reqsize += fuse_len_args(args->out_numargs, args->out_args);
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index 0fc7a0ff7b1c..0fbb4f28261c 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -28,6 +28,11 @@ struct fuse_copy_state {
unsigned int len;
unsigned int offset;
unsigned int move_pages:1;
+ unsigned int is_uring:1;
+ struct {
+ /* overall offset within the user buffer */
+ unsigned int offset;
+ } ring;
};
static inline struct fuse_dev *fuse_get_dev(struct file *file)
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH RFC v3 11/17] fuse: {uring} Add uring sqe commit and fetch support
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (9 preceding siblings ...)
2024-09-01 13:37 ` [PATCH RFC v3 10/17] fuse: Add buffer offset for uring into fuse_copy_state Bernd Schubert
@ 2024-09-01 13:37 ` Bernd Schubert
2024-09-01 13:37 ` [PATCH RFC v3 12/17] fuse: {uring} Handle teardown of ring entries Bernd Schubert
` (6 subsequent siblings)
17 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:37 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
This adds support for fuse request completion through ring SQEs
(FUSE_URING_REQ_COMMIT_AND_FETCH handling). After committing
the ring entry it becomes available for new fuse requests.
Handling of requests through the ring (SQE/CQE handling)
is complete now.
Fuse request data are copied through the mmaped ring buffer,
there is no support for any zero copy yet.
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/fuse/dev_uring.c | 407 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/dev_uring_i.h | 21 +++
2 files changed, 428 insertions(+)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 46c2274193bf..96347751668e 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -29,6 +29,26 @@
#include <linux/topology.h>
#include <linux/io_uring/cmd.h>
+struct fuse_uring_cmd_pdu {
+ struct fuse_ring_ent *ring_ent;
+};
+
+/*
+ * Finalize a fuse request, then fetch and send the next entry, if available
+ */
+static void fuse_uring_req_end(struct fuse_ring_ent *ring_ent,
+ bool set_err, int error)
+{
+ struct fuse_req *req = ring_ent->fuse_req;
+
+ if (set_err)
+ req->out.h.error = error;
+
+ clear_bit(FR_SENT, &req->flags);
+ fuse_request_end(ring_ent->fuse_req);
+ ring_ent->fuse_req = NULL;
+}
+
static int fuse_ring_ring_ent_unset_userspace(struct fuse_ring_ent *ent)
{
if (WARN_ON_ONCE(ent->state != FRRS_USERSPACE))
@@ -40,6 +60,13 @@ static int fuse_ring_ring_ent_unset_userspace(struct fuse_ring_ent *ent)
return 0;
}
+static void
+fuse_uring_async_send_to_ring(struct io_uring_cmd *cmd,
+ unsigned int issue_flags)
+{
+ io_uring_cmd_done(cmd, 0, 0, issue_flags);
+}
+
/* Update conn limits according to ring values */
static void fuse_uring_conn_cfg_limits(struct fuse_ring *ring)
{
@@ -65,6 +92,9 @@ static void fuse_uring_queue_cfg(struct fuse_ring_queue *queue, int qid,
INIT_LIST_HEAD(&queue->sync_ent_avail_queue);
INIT_LIST_HEAD(&queue->async_ent_avail_queue);
+ INIT_LIST_HEAD(&queue->ent_in_userspace);
+ INIT_LIST_HEAD(&queue->sync_fuse_req_queue);
+ INIT_LIST_HEAD(&queue->async_fuse_req_queue);
for (tag = 0; tag < ring->queue_depth; tag++) {
struct fuse_ring_ent *ent = &queue->ring_ent[tag];
@@ -173,6 +203,200 @@ int fuse_uring_conn_cfg(struct file *file, void __user *argp)
return res;
}
+/*
+ * Checks for errors and stores them in the request
+ */
+static int fuse_uring_out_header_has_err(struct fuse_out_header *oh,
+ struct fuse_req *req,
+ struct fuse_conn *fc)
+{
+ int err;
+
+ if (oh->unique == 0) {
+ /* Not supported through request based uring, this needs another
+ * ring from user space to kernel
+ */
+ pr_warn("Unsupported fuse-notify\n");
+ err = -EINVAL;
+ goto seterr;
+ }
+
+ if (oh->error <= -512 || oh->error > 0) {
+ err = -EINVAL;
+ goto seterr;
+ }
+
+ if (oh->error) {
+ err = oh->error;
+ pr_devel("%s:%d err=%d op=%d req-ret=%d", __func__, __LINE__,
+ err, req->args->opcode, req->out.h.error);
+ goto err; /* error already set */
+ }
+
+ if ((oh->unique & ~FUSE_INT_REQ_BIT) != req->in.h.unique) {
+ pr_warn("Unexpected seqno mismatch, expected: %llu got %llu\n",
+ req->in.h.unique, oh->unique & ~FUSE_INT_REQ_BIT);
+ err = -ENOENT;
+ goto seterr;
+ }
+
+ /* Is it an interrupt reply ID? */
+ if (oh->unique & FUSE_INT_REQ_BIT) {
+ err = 0;
+ if (oh->error == -ENOSYS)
+ fc->no_interrupt = 1;
+ else if (oh->error == -EAGAIN) {
+ /* XXX Interrupts not handled yet */
+ /* err = queue_interrupt(req); */
+ pr_warn("Interrupt EAGAIN not supported yet\n");
+ err = -EINVAL;
+ }
+
+ goto seterr;
+ }
+
+ return 0;
+
+seterr:
+ pr_devel("%s:%d err=%d op=%d req-ret=%d", __func__, __LINE__, err,
+ req->args->opcode, req->out.h.error);
+ oh->error = err;
+err:
+ pr_devel("%s:%d err=%d op=%d req-ret=%d", __func__, __LINE__, err,
+ req->args->opcode, req->out.h.error);
+ return err;
+}
+
+static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
+ struct fuse_req *req,
+ struct fuse_ring_ent *ent)
+{
+ struct fuse_ring_req __user *rreq = ent->rreq;
+ struct fuse_copy_state cs;
+ struct fuse_args *args = req->args;
+ struct iov_iter iter;
+ int err;
+ int res_arg_len;
+
+ err = copy_from_user(&res_arg_len, &rreq->in_out_arg_len,
+ sizeof(res_arg_len));
+ if (err)
+ return err;
+
+ err = import_ubuf(ITER_SOURCE, (void __user *)&rreq->in_out_arg,
+ ent->max_arg_len, &iter);
+ if (err)
+ return err;
+
+ fuse_copy_init(&cs, 0, &iter);
+ cs.is_uring = 1;
+ cs.req = req;
+
+ return fuse_copy_out_args(&cs, args, res_arg_len);
+}
+
+ /*
+ * Copy data from the req to the ring buffer
+ */
+static int fuse_uring_copy_to_ring(struct fuse_ring *ring, struct fuse_req *req,
+ struct fuse_ring_ent *ent)
+{
+ struct fuse_ring_req __user *rreq = ent->rreq;
+ struct fuse_copy_state cs;
+ struct fuse_args *args = req->args;
+ int err, res;
+ struct iov_iter iter;
+
+ err = import_ubuf(ITER_DEST, (void __user *)&rreq->in_out_arg,
+ ent->max_arg_len, &iter);
+ if (err) {
+ pr_info("Import user buffer failed\n");
+ return err;
+ }
+
+ fuse_copy_init(&cs, 1, &iter);
+ cs.is_uring = 1;
+ cs.req = req;
+ err = fuse_copy_args(&cs, args->in_numargs, args->in_pages,
+ (struct fuse_arg *)args->in_args, 0);
+ if (err) {
+ pr_info("%s fuse_copy_args failed\n", __func__);
+ return err;
+ }
+
+ BUILD_BUG_ON((sizeof(rreq->in_out_arg_len) != sizeof(cs.ring.offset)));
+ res = copy_to_user(&rreq->in_out_arg_len, &cs.ring.offset,
+ sizeof(rreq->in_out_arg_len));
+ err = res > 0 ? -EFAULT : res;
+
+ return err;
+}
+
+static int
+fuse_uring_prepare_send(struct fuse_ring_ent *ring_ent)
+{
+ struct fuse_ring_req *rreq = ring_ent->rreq;
+ struct fuse_ring_queue *queue = ring_ent->queue;
+ struct fuse_ring *ring = queue->ring;
+ struct fuse_req *req = ring_ent->fuse_req;
+ int err = 0, res;
+
+ if (WARN_ON(ring_ent->state != FRRS_FUSE_REQ)) {
+ pr_err("qid=%d tag=%d ring-req=%p buf_req=%p invalid state %d on send\n",
+ queue->qid, ring_ent->tag, ring_ent, rreq,
+ ring_ent->state);
+ return -EIO;
+ }
+
+ pr_devel("%s qid=%d tag=%d state=%d cmd-done op=%d unique=%llu\n",
+ __func__, queue->qid, ring_ent->tag, ring_ent->state,
+ req->in.h.opcode, req->in.h.unique);
+
+ /* copy the request */
+ err = fuse_uring_copy_to_ring(ring, req, ring_ent);
+ if (unlikely(err)) {
+ pr_info("Copy to ring failed: %d\n", err);
+ goto err;
+ }
+
+ /* copy fuse_in_header */
+ res = copy_to_user(&rreq->in, &req->in.h, sizeof(rreq->in));
+ err = res > 0 ? -EFAULT : res;
+ if (err)
+ goto err;
+
+ set_bit(FR_SENT, &req->flags);
+ return 0;
+
+err:
+ fuse_uring_req_end(ring_ent, true, err);
+ return err;
+}
+
+/*
+ * Write data to the ring buffer and send the request to userspace;
+ * userspace will read it.
+ * This is comparable to a classical read(/dev/fuse).
+ */
+static int fuse_uring_send_next_to_ring(struct fuse_ring_ent *ring_ent)
+{
+ int err = 0;
+
+ err = fuse_uring_prepare_send(ring_ent);
+ if (err)
+ goto err;
+
+ io_uring_cmd_complete_in_task(ring_ent->cmd,
+ fuse_uring_async_send_to_ring);
+ return 0;
+
+err:
+ return err;
+}
+
/*
* Put a ring request onto hold, it is no longer used for now.
*/
@@ -206,6 +430,166 @@ static void fuse_uring_ent_avail(struct fuse_ring_ent *ring_ent,
ring_ent->state = FRRS_WAIT;
}
+/*
+ * Assign a fuse queue entry to the given entry
+ */
+static void fuse_uring_add_req_to_ring_ent(struct fuse_ring_ent *ring_ent,
+ struct fuse_req *req)
+{
+ lockdep_assert_held(&ring_ent->queue->lock);
+
+ if (WARN_ON_ONCE(ring_ent->state != FRRS_WAIT &&
+ ring_ent->state != FRRS_COMMIT)) {
+ pr_warn("%s qid=%d tag=%d state=%d async=%d\n", __func__,
+ ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
+ ring_ent->async);
+ }
+ list_del_init(&req->list);
+ clear_bit(FR_PENDING, &req->flags);
+ ring_ent->fuse_req = req;
+ ring_ent->state = FRRS_FUSE_REQ;
+}
+
+/*
+ * Release the ring entry and fetch the next fuse request if available
+ *
+ * @return true if a new request has been fetched
+ */
+static bool fuse_uring_ent_assign_req(struct fuse_ring_ent *ring_ent)
+ __must_hold(&queue->lock)
+{
+ struct fuse_req *req = NULL;
+ struct fuse_ring_queue *queue = ring_ent->queue;
+ struct list_head *req_queue = ring_ent->async ?
+ &queue->async_fuse_req_queue :
+ &queue->sync_fuse_req_queue;
+
+ lockdep_assert_held(&queue->lock);
+
+ /* get and assign the next entry while still holding the lock */
+ if (!list_empty(req_queue)) {
+ req = list_first_entry(req_queue, struct fuse_req, list);
+ fuse_uring_add_req_to_ring_ent(ring_ent, req);
+ list_del_init(&ring_ent->list);
+ }
+
+ return req ? true : false;
+}
+
+/*
+ * Read data from the ring buffer, which user space has written to
+ * This is comparable to the handling of a classical write(/dev/fuse).
+ * Also make the ring request available again for new fuse requests.
+ */
+static void fuse_uring_commit(struct fuse_ring_ent *ring_ent,
+ unsigned int issue_flags)
+{
+ struct fuse_ring *ring = ring_ent->queue->ring;
+ struct fuse_conn *fc = ring->fc;
+ struct fuse_ring_req *rreq = ring_ent->rreq;
+ struct fuse_req *req = ring_ent->fuse_req;
+ ssize_t err = 0;
+ bool set_err = false;
+
+ err = copy_from_user(&req->out.h, &rreq->out, sizeof(req->out.h));
+ if (err) {
+ req->out.h.error = err;
+ goto out;
+ }
+
+ err = fuse_uring_out_header_has_err(&req->out.h, req, fc);
+ if (err) {
+ /* req->out.h.error already set */
+ pr_devel("%s:%d err=%zd oh->err=%d\n", __func__, __LINE__, err,
+ req->out.h.error);
+ goto out;
+ }
+
+ err = fuse_uring_copy_from_ring(ring, req, ring_ent);
+ if (err)
+ set_err = true;
+
+out:
+ pr_devel("%s:%d ret=%zd op=%d req-ret=%d\n", __func__, __LINE__, err,
+ req->args->opcode, req->out.h.error);
+ fuse_uring_req_end(ring_ent, set_err, err);
+}
+
+/*
+ * Get the next fuse req and send it
+ */
+static void fuse_uring_next_fuse_req(struct fuse_ring_ent *ring_ent,
+ struct fuse_ring_queue *queue)
+{
+ int has_next, err;
+ int prev_state = ring_ent->state;
+
+ WARN_ON_ONCE(!list_empty(&ring_ent->list));
+
+ do {
+ spin_lock(&queue->lock);
+ has_next = fuse_uring_ent_assign_req(ring_ent);
+ if (!has_next) {
+ fuse_uring_ent_avail(ring_ent, queue);
+ spin_unlock(&queue->lock);
+ break; /* no request left */
+ }
+ spin_unlock(&queue->lock);
+
+ err = fuse_uring_send_next_to_ring(ring_ent);
+ if (err) {
+ ring_ent->state = prev_state;
+ continue;
+ }
+
+ err = 0;
+ spin_lock(&queue->lock);
+ ring_ent->state = FRRS_USERSPACE;
+ list_add(&ring_ent->list, &queue->ent_in_userspace);
+ spin_unlock(&queue->lock);
+ } while (err);
+}
+
+/* FUSE_URING_REQ_COMMIT_AND_FETCH handler */
+static int fuse_uring_commit_fetch(struct fuse_ring_ent *ring_ent,
+ struct io_uring_cmd *cmd, int issue_flags)
+__releases(ring_ent->queue->lock)
+{
+ int err;
+ struct fuse_ring_queue *queue = ring_ent->queue;
+ struct fuse_ring *ring = queue->ring;
+
+ err = -ENOTCONN;
+ if (unlikely(!ring->ready)) {
+ pr_info("commit and fetch, but fuse-uring is not ready\n");
+ return err;
+ }
+
+ err = -EALREADY;
+ if (ring_ent->state != FRRS_USERSPACE) {
+ pr_info("qid=%d tag=%d state %d SQE already handled\n",
+ queue->qid, ring_ent->tag, ring_ent->state);
+ return err;
+ }
+
+ fuse_ring_ring_ent_unset_userspace(ring_ent);
+
+ ring_ent->cmd = cmd;
+ spin_unlock(&queue->lock);
+
+ /* without the queue lock, as other locks are taken */
+ fuse_uring_commit(ring_ent, issue_flags);
+
+ /*
+ * Fetching the next request is absolutely required as queued
+ * fuse requests would otherwise not get processed - committing
+ * and fetching is done in one step vs legacy fuse, which has separated
+ * read (fetch request) and write (commit result).
+ */
+ fuse_uring_next_fuse_req(ring_ent, queue);
+ return 0;
+}
+
/*
* fuse_uring_req_fetch command handling
*/
@@ -250,6 +634,7 @@ __must_hold(ring_ent->queue->lock)
return 0;
}
+/* FUSE_URING_REQ_FETCH handler */
static int fuse_uring_fetch(struct fuse_ring_ent *ring_ent,
struct io_uring_cmd *cmd, unsigned int issue_flags)
__releases(ring_ent->queue->lock)
@@ -339,10 +724,32 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
if (unlikely(fc->aborted || queue->stopped))
goto err_unlock;
+ ring_ent->rreq = (void __user *)cmd_req->buf_ptr;
+ ring_ent->max_arg_len = cmd_req->buf_len -
+ offsetof(struct fuse_ring_req, in_out_arg);
+ ret = -EINVAL;
+ if (cmd_req->buf_len < ring->req_buf_sz) {
+ pr_info("Invalid req buf len, expected: %zd got %d\n",
+ ring->req_buf_sz, cmd_req->buf_len);
+ goto err_unlock;
+ }
+
switch (cmd_op) {
case FUSE_URING_REQ_FETCH:
ret = fuse_uring_fetch(ring_ent, cmd, issue_flags);
break;
+ case FUSE_URING_REQ_COMMIT_AND_FETCH:
+ ret = fuse_uring_commit_fetch(ring_ent, cmd, issue_flags);
+ break;
default:
ret = -EINVAL;
pr_devel("Unknown uring command %d", cmd_op);
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 6561f4178cac..697963e5d524 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -26,6 +26,9 @@ enum fuse_ring_req_state {
/* The ring request waits for a new fuse request */
FRRS_WAIT,
+ /* The ring req got assigned a fuse req */
+ FRRS_FUSE_REQ,
+
/* request is in or on the way to user space */
FRRS_USERSPACE,
};
@@ -47,6 +50,17 @@ struct fuse_ring_ent {
struct list_head list;
struct io_uring_cmd *cmd;
+
+ /* fuse_req assigned to the ring entry */
+ struct fuse_req *fuse_req;
+
+ /*
+ * buffer provided by fuse server
+ */
+ struct fuse_ring_req __user *rreq;
+
+ /* struct fuse_ring_req::in_out_arg size */
+ size_t max_arg_len;
};
struct fuse_ring_queue {
@@ -69,6 +83,13 @@ struct fuse_ring_queue {
struct list_head async_ent_avail_queue;
struct list_head sync_ent_avail_queue;
+ /* fuse fg/bg request types */
+ struct list_head async_fuse_req_queue;
+ struct list_head sync_fuse_req_queue;
+
+ /* entries sent to userspace */
+ struct list_head ent_in_userspace;
+
/*
* available number of sync requests,
* loosely bound to fuse foreground requests
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [PATCH RFC v3 12/17] fuse: {uring} Handle teardown of ring entries
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (10 preceding siblings ...)
2024-09-01 13:37 ` [PATCH RFC v3 11/17] fuse: {uring} Add uring sqe commit and fetch support Bernd Schubert
@ 2024-09-01 13:37 ` Bernd Schubert
2024-09-01 13:37 ` [PATCH RFC v3 13/17] fuse: {uring} Add a ring queue and send method Bernd Schubert
` (5 subsequent siblings)
17 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:37 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
On teardown, struct file_operations::uring_cmd requests
need to be completed by calling io_uring_cmd_done().
Not completing all ring entries would result in busy io-uring
tasks that print warning messages at intervals and in an
unreleased struct file.
Additionally, the fuse connection, and with it the ring, can
only get released when all io-uring commands are completed.
Completion is done for ring entries that are
a) in waiting state for new fuse requests - io_uring_cmd_done
is needed
b) already in userspace - io_uring_cmd_done through teardown
is not needed; the request can just get released. If the fuse
server is still active and commits such a ring entry, fuse_uring_cmd()
already checks whether the connection is active and then completes
the io-uring command itself with -ENOTCONN, i.e. special handling
is not needed.
This scheme is basically represented by the ring entry states
FRRS_WAIT and FRRS_USERSPACE.
Entries in state:
- FRRS_INIT: No action needed, these do not contribute to
ring->queue_refs yet
- All other states: Currently processed by other tasks;
async teardown is needed and it has to wait for the two
states above. It could also be solved without an async
teardown task, but that would require additional if conditions
in hot code paths. Also, in my personal opinion, the code
looks cleaner with async teardown.
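The per-state teardown decisions described in a) and b) can be modelled as a small state table. This is a plain-C sketch written for this description: the enum values mirror the patch's enum fuse_ring_req_state, but the helper names are made up and this is not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Ring entry states, mirroring the patch's enum fuse_ring_req_state */
enum frrs_state {
	FRRS_INIT,
	FRRS_WAIT,
	FRRS_FUSE_REQ,
	FRRS_USERSPACE,
	FRRS_FREED,
};

/*
 * Case a): entries not yet handed to userspace still own an io_uring
 * command that teardown must complete with io_uring_cmd_done().
 * Case b): FRRS_USERSPACE entries were already completed when sent,
 * so no cmd_done is needed for them.
 */
static bool teardown_needs_cmd_done(enum frrs_state state)
{
	return state != FRRS_USERSPACE;
}

/*
 * FRRS_INIT entries do not contribute to ring->queue_refs yet and
 * FRRS_FREED entries were already torn down; only the remaining
 * states hold a queue reference that teardown has to drop.
 */
static bool teardown_drops_queue_ref(enum frrs_state state)
{
	return state != FRRS_INIT && state != FRRS_FREED;
}
```

The async teardown worker then simply re-checks queue_refs until all entries have passed through one of these two cases.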
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/fuse/dev.c | 10 +++
fs/fuse/dev_uring.c | 196 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/dev_uring_i.h | 81 +++++++++++++++++++++
3 files changed, 287 insertions(+)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 71443a93f1d4..3485752e25aa 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2190,6 +2190,8 @@ void fuse_abort_conn(struct fuse_conn *fc)
fc->connected = 0;
spin_unlock(&fc->bg_lock);
+ fuse_uring_set_stopped(fc);
+
fuse_set_initialized(fc);
list_for_each_entry(fud, &fc->devices, entry) {
struct fuse_pqueue *fpq = &fud->pq;
@@ -2233,6 +2235,12 @@ void fuse_abort_conn(struct fuse_conn *fc)
spin_unlock(&fc->lock);
fuse_dev_end_requests(&to_end);
+
+ /*
+ * fc->lock must not be taken to avoid conflicts with io-uring
+ * locks
+ */
+ fuse_uring_abort(fc);
} else {
spin_unlock(&fc->lock);
}
@@ -2244,6 +2252,8 @@ void fuse_wait_aborted(struct fuse_conn *fc)
/* matches implicit memory barrier in fuse_drop_waiting() */
smp_mb();
wait_event(fc->blocked_waitq, atomic_read(&fc->num_waiting) == 0);
+
+ fuse_uring_wait_stopped_queues(fc);
}
int fuse_dev_release(struct inode *inode, struct file *file)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 96347751668e..52e2323cc258 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -67,6 +67,41 @@ fuse_uring_async_send_to_ring(struct io_uring_cmd *cmd,
io_uring_cmd_done(cmd, 0, 0, issue_flags);
}
+/* Abort all list-queued requests on the given ring queue */
+static void fuse_uring_abort_end_queue_requests(struct fuse_ring_queue *queue)
+{
+ struct fuse_req *req;
+ LIST_HEAD(sync_list);
+ LIST_HEAD(async_list);
+
+ spin_lock(&queue->lock);
+
+ list_for_each_entry(req, &queue->sync_fuse_req_queue, list)
+ clear_bit(FR_PENDING, &req->flags);
+ list_for_each_entry(req, &queue->async_fuse_req_queue, list)
+ clear_bit(FR_PENDING, &req->flags);
+
+ list_splice_init(&queue->async_fuse_req_queue, &async_list);
+ list_splice_init(&queue->sync_fuse_req_queue, &sync_list);
+
+ spin_unlock(&queue->lock);
+
+ /* must not hold queue lock to avoid order issues with fi->lock */
+ fuse_dev_end_requests(&sync_list);
+ fuse_dev_end_requests(&async_list);
+}
+
+void fuse_uring_abort_end_requests(struct fuse_ring *ring)
+{
+ int qid;
+
+ for (qid = 0; qid < ring->nr_queues; qid++) {
+ struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
+
+ fuse_uring_abort_end_queue_requests(queue);
+ }
+}
+
/* Update conn limits according to ring values */
static void fuse_uring_conn_cfg_limits(struct fuse_ring *ring)
{
@@ -124,6 +159,8 @@ static int _fuse_uring_conn_cfg(struct fuse_ring_config *rcfg,
ring->queue_size = queue_sz;
+ init_waitqueue_head(&ring->stop_waitq);
+
fc->ring = ring;
ring->fc = fc;
@@ -203,6 +240,165 @@ int fuse_uring_conn_cfg(struct file *file, void __user *argp)
return res;
}
+static void fuse_uring_stop_fuse_req_end(struct fuse_ring_ent *ent)
+{
+ struct fuse_req *req = ent->fuse_req;
+
+ ent->fuse_req = NULL;
+ clear_bit(FR_SENT, &req->flags);
+ req->out.h.error = -ECONNABORTED;
+ fuse_request_end(req);
+}
+
+/*
+ * Release a request/entry on connection tear down
+ */
+static void fuse_uring_entry_teardown(struct fuse_ring_ent *ent,
+ bool need_cmd_done)
+{
+ struct fuse_ring_queue *queue = ent->queue;
+
+ /*
+ * fuse_request_end() might take other locks like fi->lock and
+ * can lead to lock ordering issues
+ */
+ lockdep_assert_not_held(&ent->queue->lock);
+
+ if (need_cmd_done) {
+ pr_devel("qid=%d tag=%d sending cmd_done\n", queue->qid,
+ ent->tag);
+
+ io_uring_cmd_done(ent->cmd, -ENOTCONN, 0,
+ IO_URING_F_UNLOCKED);
+ }
+
+ if (ent->fuse_req)
+ fuse_uring_stop_fuse_req_end(ent);
+
+ ent->state = FRRS_FREED;
+}
+
+static void fuse_uring_stop_list_entries(struct list_head *head,
+ struct fuse_ring_queue *queue,
+ enum fuse_ring_req_state exp_state)
+{
+ struct fuse_ring *ring = queue->ring;
+ struct fuse_ring_ent *ent, *next;
+ ssize_t queue_refs = SSIZE_MAX;
+ LIST_HEAD(to_teardown);
+
+ spin_lock(&queue->lock);
+ list_for_each_entry_safe(ent, next, head, list) {
+ if (ent->state != exp_state) {
+ pr_warn("entry teardown qid=%d tag=%d state=%d expected=%d\n",
+ queue->qid, ent->tag, ent->state, exp_state);
+ continue;
+ }
+
+ list_move(&ent->list, &to_teardown);
+ }
+ spin_unlock(&queue->lock);
+
+ /* no queue lock to avoid lock order issues */
+ list_for_each_entry_safe(ent, next, &to_teardown, list) {
+ bool need_cmd_done = ent->state != FRRS_USERSPACE;
+
+ fuse_uring_entry_teardown(ent, need_cmd_done);
+ queue_refs = atomic_dec_return(&ring->queue_refs);
+
+ if (WARN_ON_ONCE(queue_refs < 0))
+ pr_warn("qid=%d queue_refs=%zd\n", queue->qid,
+ queue_refs);
+ }
+}
+
+static void fuse_uring_stop_queue(struct fuse_ring_queue *queue)
+{
+ fuse_uring_stop_list_entries(&queue->ent_in_userspace, queue,
+ FRRS_USERSPACE);
+ fuse_uring_stop_list_entries(&queue->async_ent_avail_queue, queue,
+ FRRS_WAIT);
+ fuse_uring_stop_list_entries(&queue->sync_ent_avail_queue, queue,
+ FRRS_WAIT);
+}
+
+/*
+ * Log state debug info
+ */
+static void fuse_uring_log_ent_state(struct fuse_ring *ring)
+{
+ int qid, tag;
+
+ for (qid = 0; qid < ring->nr_queues; qid++) {
+ struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
+
+ for (tag = 0; tag < ring->queue_depth; tag++) {
+ struct fuse_ring_ent *ent = &queue->ring_ent[tag];
+
+ if (ent->state != FRRS_FREED && ent->state != FRRS_INIT)
+ pr_info("ring=%p qid=%d tag=%d state=%d\n",
+ ring, qid, tag, ent->state);
+ }
+ }
+ ring->stop_debug_log = 1;
+}
+
+static void fuse_uring_async_stop_queues(struct work_struct *work)
+{
+ int qid;
+ struct fuse_ring *ring =
+ container_of(work, struct fuse_ring, async_teardown_work.work);
+
+ for (qid = 0; qid < ring->nr_queues; qid++) {
+ struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
+
+ fuse_uring_stop_queue(queue);
+ }
+
+ /*
+ * Some ring entries might still be in the middle of IO operations,
+ * i.e. in the process of getting handled by file_operations::uring_cmd
+ * or on the way to userspace - we could handle that with conditions in
+ * run-time code, but it is easier/cleaner to have an async teardown
+ * handler if there are still queue references left.
+ */
+ if (atomic_read(&ring->queue_refs) > 0) {
+ if (time_after(jiffies,
+ ring->teardown_time + FUSE_URING_TEARDOWN_TIMEOUT))
+ fuse_uring_log_ent_state(ring);
+
+ schedule_delayed_work(&ring->async_teardown_work,
+ FUSE_URING_TEARDOWN_INTERVAL);
+ } else {
+ wake_up_all(&ring->stop_waitq);
+ }
+}
+
+/*
+ * Stop the ring queues
+ */
+void fuse_uring_stop_queues(struct fuse_ring *ring)
+{
+ int qid;
+
+ for (qid = 0; qid < ring->nr_queues; qid++) {
+ struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
+
+ fuse_uring_stop_queue(queue);
+ }
+
+ if (atomic_read(&ring->queue_refs) > 0) {
+ pr_info("ring=%p scheduling async queue stop\n", ring);
+ ring->teardown_time = jiffies;
+ INIT_DELAYED_WORK(&ring->async_teardown_work,
+ fuse_uring_async_stop_queues);
+ schedule_delayed_work(&ring->async_teardown_work,
+ FUSE_URING_TEARDOWN_INTERVAL);
+ } else {
+ wake_up_all(&ring->stop_waitq);
+ }
+}
+
/*
* Checks for errors and stores it into the request
*/
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 697963e5d524..432465d4bfce 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -14,6 +14,9 @@
/* IORING_MAX_ENTRIES */
#define FUSE_URING_MAX_QUEUE_DEPTH 32768
+#define FUSE_URING_TEARDOWN_TIMEOUT (5 * HZ)
+#define FUSE_URING_TEARDOWN_INTERVAL (HZ/20)
+
enum fuse_ring_req_state {
FRRS_INVALID = 0,
@@ -31,6 +34,9 @@ enum fuse_ring_req_state {
/* request is in or on the way to user space */
FRRS_USERSPACE,
+
+ /* request is released */
+ FRRS_FREED,
};
/* A fuse ring entry, part of the ring queue */
@@ -143,17 +149,32 @@ struct fuse_ring {
/* Is the ring ready to take requests */
unsigned int ready : 1;
+ /*
+ * Log ring entry states once on stop when entries cannot be
+ * released
+ */
+ unsigned int stop_debug_log : 1;
+
/* number of SQEs initialized */
atomic_t nr_sqe_init;
/* Used to release the ring on stop */
atomic_t queue_refs;
+ wait_queue_head_t stop_waitq;
+
+ /* async tear down */
+ struct delayed_work async_teardown_work;
+
+ /* log */
+ unsigned long teardown_time;
+
struct fuse_ring_queue queues[] ____cacheline_aligned_in_smp;
};
void fuse_uring_abort_end_requests(struct fuse_ring *ring);
int fuse_uring_conn_cfg(struct file *file, void __user *argp);
+void fuse_uring_stop_queues(struct fuse_ring *ring);
int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
@@ -189,6 +210,55 @@ static inline bool fuse_per_core_queue(struct fuse_conn *fc)
return fc->ring && fc->ring->per_core_queue;
}
+static inline void fuse_uring_set_stopped_queues(struct fuse_ring *ring)
+{
+ int qid;
+
+ for (qid = 0; qid < ring->nr_queues; qid++) {
+ struct fuse_ring_queue *queue = fuse_uring_get_queue(ring, qid);
+
+ spin_lock(&queue->lock);
+ queue->stopped = 1;
+ spin_unlock(&queue->lock);
+ }
+}
+
+/*
+ * Set per queue aborted flag
+ */
+static inline void fuse_uring_set_stopped(struct fuse_conn *fc)
+ __must_hold(fc->lock)
+{
+ if (fc->ring == NULL)
+ return;
+
+ fc->ring->ready = false;
+
+ fuse_uring_set_stopped_queues(fc->ring);
+}
+
+static inline void fuse_uring_abort(struct fuse_conn *fc)
+{
+ struct fuse_ring *ring = fc->ring;
+
+ if (ring == NULL)
+ return;
+
+ if (atomic_read(&ring->queue_refs) > 0) {
+ fuse_uring_abort_end_requests(ring);
+ fuse_uring_stop_queues(ring);
+ }
+}
+
+static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
+{
+ struct fuse_ring *ring = fc->ring;
+
+ if (ring)
+ wait_event(ring->stop_waitq,
+ atomic_read(&ring->queue_refs) == 0);
+}
+
#else /* CONFIG_FUSE_IO_URING */
struct fuse_ring;
@@ -212,6 +282,17 @@ static inline bool fuse_per_core_queue(struct fuse_conn *fc)
return false;
}
+static inline void fuse_uring_set_stopped(struct fuse_conn *fc)
+{
+}
+
+static inline void fuse_uring_abort(struct fuse_conn *fc)
+{
+}
+
+static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
+{
+}
#endif /* CONFIG_FUSE_IO_URING */
#endif /* _FS_FUSE_DEV_URING_I_H */
--
2.43.0
* [PATCH RFC v3 13/17] fuse: {uring} Add a ring queue and send method
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (11 preceding siblings ...)
2024-09-01 13:37 ` [PATCH RFC v3 12/17] fuse: {uring} Handle teardown of ring entries Bernd Schubert
@ 2024-09-01 13:37 ` Bernd Schubert
2024-09-01 13:37 ` [PATCH RFC v3 14/17] fuse: {uring} Allow to queue to the ring Bernd Schubert
` (4 subsequent siblings)
17 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:37 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
This prepares queueing and sending through io-uring.
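As a rough model of the queue selection this patch introduces in fuse_uring_queue_fuse_req(): with per-core queues the queue id follows the CPU of the submitting task, and background (async) requests use the async lists while foreground (sync) ones stay on the sync lists. This is a hypothetical userspace sketch, not the kernel code; select_qid() and select_ent_queue() are made-up names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Pick a queue id: with per-core queues, follow the submitting
 * task's CPU; fall back to queue 0 (the patch additionally emits
 * a WARN_ONCE) if the CPU number exceeds the number of queues.
 */
static size_t select_qid(bool per_core_queue, unsigned int task_cpu,
			 size_t nr_queues)
{
	size_t qid = 0;

	if (per_core_queue) {
		qid = task_cpu;
		if (qid >= nr_queues)
			qid = 0;
	}
	return qid;
}

/*
 * Background (async) requests are handed to the async entry list so
 * a server thread on another core can pick them up; sync requests
 * stay on the sync list to continue on the current core.
 */
static const char *select_ent_queue(bool background)
{
	return background ? "async_ent_avail_queue" : "sync_ent_avail_queue";
}
```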
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/fuse/dev_uring.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/dev_uring_i.h | 7 ++++
2 files changed, 111 insertions(+)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 52e2323cc258..43e7486d9f93 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -975,3 +975,107 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
spin_unlock(&queue->lock);
goto out;
}
+
+/*
+ * This prepares and sends the ring request in fuse-uring task context.
+ * User buffers are not mapped yet - the application does not have
+ * permission to write to them - so this has to be executed in ring
+ * task context.
+ * XXX: Map and pin user pages and avoid this function.
+ */
+static void
+fuse_uring_send_req_in_task(struct io_uring_cmd *cmd,
+ unsigned int issue_flags)
+{
+ struct fuse_uring_cmd_pdu *pdu = (struct fuse_uring_cmd_pdu *)cmd->pdu;
+ struct fuse_ring_ent *ring_ent = pdu->ring_ent;
+ struct fuse_ring_queue *queue = ring_ent->queue;
+ int err;
+
+ BUILD_BUG_ON(sizeof(pdu) > sizeof(cmd->pdu));
+
+ err = fuse_uring_prepare_send(ring_ent);
+ if (err)
+ goto err;
+
+ io_uring_cmd_done(cmd, 0, 0, issue_flags);
+
+ spin_lock(&queue->lock);
+ ring_ent->state = FRRS_USERSPACE;
+ list_add(&ring_ent->list, &queue->ent_in_userspace);
+ spin_unlock(&queue->lock);
+
+ return;
+err:
+ fuse_uring_next_fuse_req(ring_ent, queue);
+}
+
+/* queue a fuse request and send it if a ring entry is available */
+int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
+{
+ struct fuse_ring *ring = fc->ring;
+ struct fuse_ring_queue *queue;
+ int qid = 0;
+ struct fuse_ring_ent *ring_ent = NULL;
+ int res;
+ bool async = test_bit(FR_BACKGROUND, &req->flags);
+ struct list_head *req_queue, *ent_queue;
+
+ if (ring->per_core_queue) {
+ /*
+ * Async requests are best handled on another core; the current
+ * core can do application/page handling, while the async request
+ * is handled on another core in userspace.
+ * For sync requests the application has to wait - no processing, so
+ * the request should continue on the current core to avoid context
+ * switches.
+ * XXX This should be on the same numa node and not busy - is there
+ * a scheduler function available that could make this decision?
+ * It should also not persistently switch between cores - that makes
+ * it hard for the scheduler.
+ */
+ qid = task_cpu(current);
+
+ if (WARN_ONCE(qid >= ring->nr_queues,
+ "Core number (%u) exceeds nr queues (%zu)\n",
+ qid, ring->nr_queues))
+ qid = 0;
+ }
+
+ queue = fuse_uring_get_queue(ring, qid);
+ req_queue = async ? &queue->async_fuse_req_queue :
+ &queue->sync_fuse_req_queue;
+ ent_queue = async ? &queue->async_ent_avail_queue :
+ &queue->sync_ent_avail_queue;
+
+ spin_lock(&queue->lock);
+
+ if (unlikely(queue->stopped)) {
+ res = -ENOTCONN;
+ goto err_unlock;
+ }
+
+ list_add_tail(&req->list, req_queue);
+
+ if (!list_empty(ent_queue)) {
+ ring_ent =
+ list_first_entry(ent_queue, struct fuse_ring_ent, list);
+ list_del_init(&ring_ent->list);
+ fuse_uring_add_req_to_ring_ent(ring_ent, req);
+ }
+ spin_unlock(&queue->lock);
+
+ if (ring_ent != NULL) {
+ struct io_uring_cmd *cmd = ring_ent->cmd;
+ struct fuse_uring_cmd_pdu *pdu =
+ (struct fuse_uring_cmd_pdu *)cmd->pdu;
+
+ pdu->ring_ent = ring_ent;
+ io_uring_cmd_complete_in_task(cmd, fuse_uring_send_req_in_task);
+ }
+
+ return 0;
+
+err_unlock:
+ spin_unlock(&queue->lock);
+ return res;
+}
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index 432465d4bfce..d9988d4beeed 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -176,6 +176,7 @@ void fuse_uring_abort_end_requests(struct fuse_ring *ring);
int fuse_uring_conn_cfg(struct file *file, void __user *argp);
void fuse_uring_stop_queues(struct fuse_ring *ring);
int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
+int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req);
static inline void fuse_uring_conn_destruct(struct fuse_conn *fc)
{
@@ -293,6 +294,12 @@ static inline void fuse_uring_abort(struct fuse_conn *fc)
static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
{
}
+
+static inline int
+fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
+{
+ return -EPFNOSUPPORT;
+}
#endif /* CONFIG_FUSE_IO_URING */
#endif /* _FS_FUSE_DEV_URING_I_H */
--
2.43.0
* [PATCH RFC v3 14/17] fuse: {uring} Allow to queue to the ring
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (12 preceding siblings ...)
2024-09-01 13:37 ` [PATCH RFC v3 13/17] fuse: {uring} Add a ring queue and send method Bernd Schubert
@ 2024-09-01 13:37 ` Bernd Schubert
2024-09-01 13:37 ` [PATCH RFC v3 15/17] ate: 2024-08-30 15:43:32 +0100 Bernd Schubert
` (3 subsequent siblings)
17 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:37 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
This enables enqueuing requests through fuse uring queues.
For initial simplicity, requests are always allocated the normal way,
then added to the ring queue lists, and only then copied to ring
queue entries. Later on, the allocation and the list handling can
be avoided by directly using a ring entry. That would introduce
some code complexity and is therefore not done for now.
FIXME: Needs update with new function pointers in fuse-next.
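The background accounting that fuse_request_queue_background_uring() performs after a successful ring queueing can be modelled in isolation. This is a sketch with made-up struct and helper names; the fields mirror fc->num_background, fc->active_background, fc->max_background and fc->blocked from the diff.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant struct fuse_conn fields */
struct bg_state {
	unsigned int num_background;
	unsigned int active_background;
	unsigned int max_background;
	bool blocked;
};

/*
 * Accounting after a background request was queued to the ring:
 * bump both counters and block further background requests once
 * the max_background limit is reached (done under fc->bg_lock in
 * the patch; the XXX there suggests per-queue values instead).
 */
static void account_bg_queued(struct bg_state *fc)
{
	fc->num_background++;
	fc->active_background++;
	if (fc->num_background == fc->max_background)
		fc->blocked = true;
}
```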
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/fuse/dev.c | 74 +++++++++++++++++++++++++++++++++++++++++++++------
fs/fuse/dev_uring_i.h | 10 +++++++
2 files changed, 76 insertions(+), 8 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 3485752e25aa..9f0f2120b1fa 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -218,13 +218,24 @@ const struct fuse_iqueue_ops fuse_dev_fiq_ops = {
};
EXPORT_SYMBOL_GPL(fuse_dev_fiq_ops);
-static void queue_request_and_unlock(struct fuse_iqueue *fiq,
- struct fuse_req *req)
+
+static void queue_request_and_unlock(struct fuse_conn *fc,
+ struct fuse_req *req, bool allow_uring)
__releases(fiq->lock)
{
+ struct fuse_iqueue *fiq = &fc->iq;
+
req->in.h.len = sizeof(struct fuse_in_header) +
fuse_len_args(req->args->in_numargs,
(struct fuse_arg *) req->args->in_args);
+
+ if (allow_uring && fuse_uring_ready(fc)) {
+ /* this lock is not needed at all for ring req handling */
+ spin_unlock(&fiq->lock);
+ fuse_uring_queue_fuse_req(fc, req);
+ return;
+ }
+
list_add_tail(&req->list, &fiq->pending);
fiq->ops->wake_pending_and_unlock(fiq);
}
@@ -261,7 +272,7 @@ static void flush_bg_queue(struct fuse_conn *fc)
fc->active_background++;
spin_lock(&fiq->lock);
req->in.h.unique = fuse_get_unique(fiq);
- queue_request_and_unlock(fiq, req);
+ queue_request_and_unlock(fc, req, true);
}
}
@@ -405,7 +416,8 @@ static void request_wait_answer(struct fuse_req *req)
static void __fuse_request_send(struct fuse_req *req)
{
- struct fuse_iqueue *fiq = &req->fm->fc->iq;
+ struct fuse_conn *fc = req->fm->fc;
+ struct fuse_iqueue *fiq = &fc->iq;
BUG_ON(test_bit(FR_BACKGROUND, &req->flags));
spin_lock(&fiq->lock);
@@ -417,7 +429,7 @@ static void __fuse_request_send(struct fuse_req *req)
/* acquire extra reference, since request is still needed
after fuse_request_end() */
__fuse_get_request(req);
- queue_request_and_unlock(fiq, req);
+ queue_request_and_unlock(fc, req, true);
request_wait_answer(req);
/* Pairs with smp_wmb() in fuse_request_end() */
@@ -487,6 +499,10 @@ ssize_t fuse_simple_request(struct fuse_mount *fm, struct fuse_args *args)
if (args->force) {
atomic_inc(&fc->num_waiting);
req = fuse_request_alloc(fm, GFP_KERNEL | __GFP_NOFAIL);
+ if (unlikely(!req)) {
+ ret = -ENOTCONN;
+ goto err;
+ }
if (!args->nocreds)
fuse_force_creds(req);
@@ -514,16 +530,55 @@ ssize_t fuse_simple_request(struct fuse_mount *fm, struct fuse_args *args)
}
fuse_put_request(req);
+err:
return ret;
}
-static bool fuse_request_queue_background(struct fuse_req *req)
+static bool fuse_request_queue_background_uring(struct fuse_conn *fc,
+ struct fuse_req *req)
+{
+ struct fuse_iqueue *fiq = &fc->iq;
+ int err;
+
+ req->in.h.unique = fuse_get_unique(fiq);
+ req->in.h.len = sizeof(struct fuse_in_header) +
+ fuse_len_args(req->args->in_numargs,
+ (struct fuse_arg *) req->args->in_args);
+
+ err = fuse_uring_queue_fuse_req(fc, req);
+ if (!err) {
+ /* XXX remove and let the users of that use per-queue values -
+ * avoid the shared spin lock...
+ * Is this needed at all?
+ */
+ spin_lock(&fc->bg_lock);
+ fc->num_background++;
+ fc->active_background++;
+
+
+ /* XXX block when per ring queues get occupied */
+ if (fc->num_background == fc->max_background)
+ fc->blocked = 1;
+ spin_unlock(&fc->bg_lock);
+ }
+
+ return err ? false : true;
+}
+
+/*
+ * @return true if queued
+ */
+static int fuse_request_queue_background(struct fuse_req *req)
{
struct fuse_mount *fm = req->fm;
struct fuse_conn *fc = fm->fc;
bool queued = false;
WARN_ON(!test_bit(FR_BACKGROUND, &req->flags));
+
+ if (fuse_uring_ready(fc))
+ return fuse_request_queue_background_uring(fc, req);
+
if (!test_bit(FR_WAITING, &req->flags)) {
__set_bit(FR_WAITING, &req->flags);
atomic_inc(&fc->num_waiting);
@@ -576,7 +631,8 @@ static int fuse_simple_notify_reply(struct fuse_mount *fm,
struct fuse_args *args, u64 unique)
{
struct fuse_req *req;
- struct fuse_iqueue *fiq = &fm->fc->iq;
+ struct fuse_conn *fc = fm->fc;
+ struct fuse_iqueue *fiq = &fc->iq;
int err = 0;
req = fuse_get_req(fm, false);
@@ -590,7 +646,8 @@ static int fuse_simple_notify_reply(struct fuse_mount *fm,
spin_lock(&fiq->lock);
if (fiq->connected) {
- queue_request_and_unlock(fiq, req);
+ /* uring for notify not supported yet */
+ queue_request_and_unlock(fc, req, false);
} else {
err = -ENODEV;
spin_unlock(&fiq->lock);
@@ -2193,6 +2250,7 @@ void fuse_abort_conn(struct fuse_conn *fc)
fuse_uring_set_stopped(fc);
fuse_set_initialized(fc);
+
list_for_each_entry(fud, &fc->devices, entry) {
struct fuse_pqueue *fpq = &fud->pq;
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index d9988d4beeed..f1247ee57dc4 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -260,6 +260,11 @@ static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
atomic_read(&ring->queue_refs) == 0);
}
+static inline bool fuse_uring_ready(struct fuse_conn *fc)
+{
+ return fc->ring && fc->ring->ready;
+}
+
#else /* CONFIG_FUSE_IO_URING */
struct fuse_ring;
@@ -295,6 +300,11 @@ static inline void fuse_uring_wait_stopped_queues(struct fuse_conn *fc)
{
}
+static inline bool fuse_uring_ready(struct fuse_conn *fc)
+{
+ return false;
+}
+
static inline int
fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
{
--
2.43.0
* [PATCH RFC v3 15/17] ate: 2024-08-30 15:43:32 +0100
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (13 preceding siblings ...)
2024-09-01 13:37 ` [PATCH RFC v3 14/17] fuse: {uring} Allow to queue to the ring Bernd Schubert
@ 2024-09-01 13:37 ` Bernd Schubert
2024-09-04 15:43 ` Jens Axboe
2024-09-01 13:37 ` [PATCH RFC v3 16/17] fuse: {uring} Handle IO_URING_F_TASK_DEAD Bernd Schubert
` (2 subsequent siblings)
17 siblings, 1 reply; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:37 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
From: Pavel Begunkov <[email protected]>
io_uring/cmd: let cmds tw know about dying task
When the task that submitted a request is dying, a task work for that
request might get run by a kernel thread or even worse by a half
dismantled task. We can't just cancel the task work without running the
callback as the cmd might need to do some clean up, so pass a flag
instead. If set, it's not safe to access any task resources and the
callback is expected to cancel the cmd ASAP.
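The flag propagation this patch adds can be sketched as a tiny pure function. The IO_URING_F_TASK_DEAD value comes from the patch; the values used here for PF_EXITING and IO_URING_F_COMPLETE_DEFER are assumptions for the sketch, and cmd_tw_flags() is a made-up name modelling io_uring_cmd_work().

```c
#include <assert.h>

#define PF_EXITING		0x00000004	/* assumed kernel value */
#define IO_URING_F_COMPLETE_DEFER (1 << 0)	/* assumed value */
#define IO_URING_F_TASK_DEAD	(1 << 13)	/* value from the patch */

/*
 * Model of io_uring_cmd_work(): the callback always runs with
 * deferred completion; if the submitting task is exiting, it is
 * additionally told that task resources are off limits and it
 * should cancel the cmd ASAP.
 */
static unsigned int cmd_tw_flags(unsigned int task_flags)
{
	unsigned int flags = IO_URING_F_COMPLETE_DEFER;

	if (task_flags & PF_EXITING)
		flags |= IO_URING_F_TASK_DEAD;
	return flags;
}
```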
Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 1 +
io_uring/uring_cmd.c | 6 +++++-
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 7abdc0927124..869a81c63e49 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -37,6 +37,7 @@ enum io_uring_cmd_flags {
/* set when uring wants to cancel a previously issued command */
IO_URING_F_CANCEL = (1 << 11),
IO_URING_F_COMPAT = (1 << 12),
+ IO_URING_F_TASK_DEAD = (1 << 13),
};
struct io_wq_work_node {
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 21ac5fb2d5f0..e6d22b6fc0f4 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -119,9 +119,13 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_mark_cancelable);
static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts)
{
struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+ unsigned flags = IO_URING_F_COMPLETE_DEFER;
+
+ if (req->task->flags & PF_EXITING)
+ flags |= IO_URING_F_TASK_DEAD;
/* task_work executor checks the deffered list completion */
- ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER);
+ ioucmd->task_work_cb(ioucmd, flags);
}
void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
--
2.43.0
* Re: [PATCH RFC v3 15/17] ate: 2024-08-30 15:43:32 +0100
2024-09-01 13:37 ` [PATCH RFC v3 15/17] ate: 2024-08-30 15:43:32 +0100 Bernd Schubert
@ 2024-09-04 15:43 ` Jens Axboe
2024-09-04 15:54 ` Bernd Schubert
0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-09-04 15:43 UTC (permalink / raw)
To: Bernd Schubert, Miklos Szeredi, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein
Something went wrong with the subject line in this one.
--
Jens Axboe
* Re: [PATCH RFC v3 15/17] ate: 2024-08-30 15:43:32 +0100
2024-09-04 15:43 ` Jens Axboe
@ 2024-09-04 15:54 ` Bernd Schubert
0 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-04 15:54 UTC (permalink / raw)
To: Jens Axboe, Bernd Schubert, Miklos Szeredi, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein
On 9/4/24 17:43, Jens Axboe wrote:
> Something went wrong with the subject line in this one.
>
Oh sorry,
b4 shazam -C --single-message [email protected]
didn't find the message-id and I had then manually added in the patch -
something must have gone wrong.
Thanks,
Bernd
* [PATCH RFC v3 16/17] fuse: {uring} Handle IO_URING_F_TASK_DEAD
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (14 preceding siblings ...)
2024-09-01 13:37 ` [PATCH RFC v3 15/17] ate: 2024-08-30 15:43:32 +0100 Bernd Schubert
@ 2024-09-01 13:37 ` Bernd Schubert
2024-09-01 13:37 ` [PATCH RFC v3 17/17] fuse: {uring} Pin the user buffer Bernd Schubert
2024-09-04 16:42 ` [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Jens Axboe
17 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:37 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
The ring task is terminating and it is not safe to still access
its resources. Also, no further actions are needed.
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/fuse/dev_uring.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 43e7486d9f93..a65c5d08fce1 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -993,6 +993,9 @@ fuse_uring_send_req_in_task(struct io_uring_cmd *cmd,
BUILD_BUG_ON(sizeof(pdu) > sizeof(cmd->pdu));
+ if (unlikely(issue_flags & IO_URING_F_TASK_DEAD))
+ goto terminating;
+
err = fuse_uring_prepare_send(ring_ent);
if (err)
goto err;
@@ -1007,6 +1010,10 @@ fuse_uring_send_req_in_task(struct io_uring_cmd *cmd,
return;
err:
fuse_uring_next_fuse_req(ring_ent, queue);
+
+terminating:
+ /* Avoid all actions as the task that issued the ring cmd is terminating */
+ io_uring_cmd_done(cmd, -ECANCELED, 0, issue_flags);
}
/* queue a fuse request and send it if a ring entry is available */
--
2.43.0
* [PATCH RFC v3 17/17] fuse: {uring} Pin the user buffer
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (15 preceding siblings ...)
2024-09-01 13:37 ` [PATCH RFC v3 16/17] fuse: {uring} Handle IO_URING_F_TASK_DEAD Bernd Schubert
@ 2024-09-01 13:37 ` Bernd Schubert
2024-09-04 15:47 ` Jens Axboe
2024-09-04 18:59 ` Jens Axboe
2024-09-04 16:42 ` [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Jens Axboe
17 siblings, 2 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-01 13:37 UTC (permalink / raw)
To: Miklos Szeredi, Jens Axboe, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein, Bernd Schubert
This allows copying into the buffer from the application
without the need to copy in ring context (and with that,
without the need for the ring task to be active in kernel
space).
It is also absolutely needed for now to avoid this teardown issue:
[ 1525.905504] KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7]
[ 1525.910431] CPU: 15 PID: 183 Comm: kworker/15:1 Tainted: G O 6.10.0+ #48
[ 1525.916449] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 1525.922470] Workqueue: events io_fallback_req_func
[ 1525.925840] RIP: 0010:__lock_acquire+0x74/0x7b80
[ 1525.929010] Code: 89 bc 24 80 00 00 00 0f 85 1c 5f 00 00 83 3d 6e 80 b0 02 00 0f 84 1d 12 00 00 83 3d 65 c7 67 02 00 74 27 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 0d e8 50 44 42 00 48 8b bc 24 80 00 00 00 48 c7
[ 1525.942211] RSP: 0018:ffff88810b2af490 EFLAGS: 00010002
[ 1525.945672] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000001
[ 1525.950421] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000001a0
[ 1525.955200] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[ 1525.959979] R10: dffffc0000000000 R11: fffffbfff07b1cbe R12: 0000000000000000
[ 1525.964252] R13: 0000000000000001 R14: dffffc0000000000 R15: 0000000000000001
[ 1525.968225] FS: 0000000000000000(0000) GS:ffff88875b200000(0000) knlGS:0000000000000000
[ 1525.973932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1525.976694] CR2: 00005555b6a381f0 CR3: 000000012f5f1000 CR4: 00000000000006f0
[ 1525.980030] Call Trace:
[ 1525.981371] <TASK>
[ 1525.982567] ? __die_body+0x66/0xb0
[ 1525.984376] ? die_addr+0xc1/0x100
[ 1525.986111] ? exc_general_protection+0x1c6/0x330
[ 1525.988401] ? asm_exc_general_protection+0x22/0x30
[ 1525.990864] ? __lock_acquire+0x74/0x7b80
[ 1525.992901] ? mark_lock+0x9f/0x360
[ 1525.994635] ? __lock_acquire+0x1420/0x7b80
[ 1525.996629] ? attach_entity_load_avg+0x47d/0x550
[ 1525.998765] ? hlock_conflict+0x5a/0x1f0
[ 1526.000515] ? __bfs+0x2dc/0x5a0
[ 1526.001993] lock_acquire+0x1fb/0x3d0
[ 1526.004727] ? gup_fast_fallback+0x13f/0x1d80
[ 1526.006586] ? gup_fast_fallback+0x13f/0x1d80
[ 1526.008412] gup_fast_fallback+0x158/0x1d80
[ 1526.010170] ? gup_fast_fallback+0x13f/0x1d80
[ 1526.011999] ? __lock_acquire+0x2b07/0x7b80
[ 1526.013793] __iov_iter_get_pages_alloc+0x36e/0x980
[ 1526.015876] ? do_raw_spin_unlock+0x5a/0x8a0
[ 1526.017734] iov_iter_get_pages2+0x56/0x70
[ 1526.019491] fuse_copy_fill+0x48e/0x980 [fuse]
[ 1526.021400] fuse_copy_args+0x174/0x6a0 [fuse]
[ 1526.023199] fuse_uring_prepare_send+0x319/0x6c0 [fuse]
[ 1526.025178] fuse_uring_send_req_in_task+0x42/0x100 [fuse]
[ 1526.027163] io_fallback_req_func+0xb4/0x170
[ 1526.028737] ? process_scheduled_works+0x75b/0x1160
[ 1526.030445] process_scheduled_works+0x85c/0x1160
[ 1526.032073] worker_thread+0x8ba/0xce0
[ 1526.033388] kthread+0x23e/0x2b0
[ 1526.035404] ? pr_cont_work_flush+0x290/0x290
[ 1526.036958] ? kthread_blkcg+0xa0/0xa0
[ 1526.038321] ret_from_fork+0x30/0x60
[ 1526.039600] ? kthread_blkcg+0xa0/0xa0
[ 1526.040942] ret_from_fork_asm+0x11/0x20
[ 1526.042353] </TASK>
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/fuse/dev.c | 9 +++
fs/fuse/dev_uring.c | 186 ++++++++++++++++++++++++++++++++------------------
fs/fuse/dev_uring_i.h | 15 ++--
fs/fuse/fuse_dev_i.h | 2 +
4 files changed, 143 insertions(+), 69 deletions(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 9f0f2120b1fa..492bb95fde4e 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -769,6 +769,15 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
cs->pipebufs++;
cs->nr_segs++;
}
+ } else if (cs->is_uring) {
+ cs->pg = cs->ring.pages[cs->ring.page_idx++];
+ /*
+ * not strictly needed, just to avoid a uring exception in
+ * fuse_copy_finish
+ */
+ get_page(cs->pg);
+ cs->len = PAGE_SIZE;
+ cs->offset = 0;
} else {
size_t off;
err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index a65c5d08fce1..4cc0facaaae3 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -29,6 +29,9 @@
#include <linux/topology.h>
#include <linux/io_uring/cmd.h>
+#define FUSE_RING_HEADER_PG 0
+#define FUSE_RING_PAYLOAD_PG 1
+
struct fuse_uring_cmd_pdu {
struct fuse_ring_ent *ring_ent;
};
@@ -250,6 +253,21 @@ static void fuse_uring_stop_fuse_req_end(struct fuse_ring_ent *ent)
fuse_request_end(req);
}
+/*
+ * Copy from memmap.c, should be exported
+ */
+static void io_pages_free(struct page ***pages, int npages)
+{
+ struct page **page_array = *pages;
+
+ if (!page_array)
+ return;
+
+ unpin_user_pages(page_array, npages);
+ kvfree(page_array);
+ *pages = NULL;
+}
+
/*
* Release a request/entry on connection tear down
*/
@@ -275,6 +293,8 @@ static void fuse_uring_entry_teardown(struct fuse_ring_ent *ent,
if (ent->fuse_req)
fuse_uring_stop_fuse_req_end(ent);
+ io_pages_free(&ent->user_pages, ent->nr_user_pages);
+
ent->state = FRRS_FREED;
}
@@ -417,6 +437,7 @@ static int fuse_uring_out_header_has_err(struct fuse_out_header *oh,
goto seterr;
}
+ /* FIXME copied from dev.c, check what 512 means */
if (oh->error <= -512 || oh->error > 0) {
err = -EINVAL;
goto seterr;
@@ -465,53 +486,41 @@ static int fuse_uring_out_header_has_err(struct fuse_out_header *oh,
static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
struct fuse_req *req,
- struct fuse_ring_ent *ent)
+ struct fuse_ring_ent *ent,
+ struct fuse_ring_req *rreq)
{
- struct fuse_ring_req __user *rreq = ent->rreq;
struct fuse_copy_state cs;
struct fuse_args *args = req->args;
struct iov_iter iter;
- int err;
- int res_arg_len;
+ int res_arg_len, err;
- err = copy_from_user(&res_arg_len, &rreq->in_out_arg_len,
- sizeof(res_arg_len));
- if (err)
- return err;
-
- err = import_ubuf(ITER_SOURCE, (void __user *)&rreq->in_out_arg,
- ent->max_arg_len, &iter);
- if (err)
- return err;
+ res_arg_len = rreq->in_out_arg_len;
fuse_copy_init(&cs, 0, &iter);
cs.is_uring = 1;
+ cs.ring.pages = &ent->user_pages[FUSE_RING_PAYLOAD_PG];
cs.req = req;
- return fuse_copy_out_args(&cs, args, res_arg_len);
+ err = fuse_copy_out_args(&cs, args, res_arg_len);
+
+ return err;
}
- /*
- * Copy data from the req to the ring buffer
- */
+/*
+ * Copy data from the req to the ring buffer
+ */
static int fuse_uring_copy_to_ring(struct fuse_ring *ring, struct fuse_req *req,
- struct fuse_ring_ent *ent)
+ struct fuse_ring_ent *ent,
+ struct fuse_ring_req *rreq)
{
- struct fuse_ring_req __user *rreq = ent->rreq;
struct fuse_copy_state cs;
struct fuse_args *args = req->args;
- int err, res;
+ int err;
struct iov_iter iter;
- err = import_ubuf(ITER_DEST, (void __user *)&rreq->in_out_arg,
- ent->max_arg_len, &iter);
- if (err) {
- pr_info("Import user buffer failed\n");
- return err;
- }
-
fuse_copy_init(&cs, 1, &iter);
cs.is_uring = 1;
+ cs.ring.pages = &ent->user_pages[FUSE_RING_PAYLOAD_PG];
cs.req = req;
err = fuse_copy_args(&cs, args->in_numargs, args->in_pages,
(struct fuse_arg *)args->in_args, 0);
@@ -520,10 +529,7 @@ static int fuse_uring_copy_to_ring(struct fuse_ring *ring, struct fuse_req *req,
return err;
}
- BUILD_BUG_ON((sizeof(rreq->in_out_arg_len) != sizeof(cs.ring.offset)));
- res = copy_to_user(&rreq->in_out_arg_len, &cs.ring.offset,
- sizeof(rreq->in_out_arg_len));
- err = res > 0 ? -EFAULT : res;
+ rreq->in_out_arg_len = cs.ring.offset;
return err;
}
@@ -531,11 +537,11 @@ static int fuse_uring_copy_to_ring(struct fuse_ring *ring, struct fuse_req *req,
static int
fuse_uring_prepare_send(struct fuse_ring_ent *ring_ent)
{
- struct fuse_ring_req *rreq = ring_ent->rreq;
+ struct fuse_ring_req *rreq = NULL;
struct fuse_ring_queue *queue = ring_ent->queue;
struct fuse_ring *ring = queue->ring;
struct fuse_req *req = ring_ent->fuse_req;
- int err = 0, res;
+ int err = 0;
if (WARN_ON(ring_ent->state != FRRS_FUSE_REQ)) {
pr_err("qid=%d tag=%d ring-req=%p buf_req=%p invalid state %d on send\n",
@@ -551,25 +557,27 @@ fuse_uring_prepare_send(struct fuse_ring_ent *ring_ent)
__func__, queue->qid, ring_ent->tag, ring_ent->state,
req->in.h.opcode, req->in.h.unique);
+ rreq = kmap_local_page(ring_ent->user_pages[FUSE_RING_HEADER_PG]);
+
/* copy the request */
- err = fuse_uring_copy_to_ring(ring, req, ring_ent);
+ err = fuse_uring_copy_to_ring(ring, req, ring_ent, rreq);
if (unlikely(err)) {
pr_info("Copy to ring failed: %d\n", err);
goto err;
}
/* copy fuse_in_header */
- res = copy_to_user(&rreq->in, &req->in.h, sizeof(rreq->in));
- err = res > 0 ? -EFAULT : res;
- if (err)
- goto err;
+ rreq->in = req->in.h;
+ err = 0;
set_bit(FR_SENT, &req->flags);
- return 0;
-
+out:
+ if (rreq)
+ kunmap_local(rreq);
+ return err;
err:
fuse_uring_req_end(ring_ent, true, err);
- return err;
+ goto out;
}
/*
@@ -682,16 +690,13 @@ static void fuse_uring_commit(struct fuse_ring_ent *ring_ent,
{
struct fuse_ring *ring = ring_ent->queue->ring;
struct fuse_conn *fc = ring->fc;
- struct fuse_ring_req *rreq = ring_ent->rreq;
+ struct fuse_ring_req *rreq;
struct fuse_req *req = ring_ent->fuse_req;
ssize_t err = 0;
bool set_err = false;
- err = copy_from_user(&req->out.h, &rreq->out, sizeof(req->out.h));
- if (err) {
- req->out.h.error = err;
- goto out;
- }
+ rreq = kmap_local_page(ring_ent->user_pages[FUSE_RING_HEADER_PG]);
+ req->out.h = rreq->out;
err = fuse_uring_out_header_has_err(&req->out.h, req, fc);
if (err) {
@@ -701,7 +706,8 @@ static void fuse_uring_commit(struct fuse_ring_ent *ring_ent,
goto out;
}
- err = fuse_uring_copy_from_ring(ring, req, ring_ent);
+ err = fuse_uring_copy_from_ring(ring, req, ring_ent, rreq);
+ kunmap_local(rreq);
if (err)
set_err = true;
@@ -830,6 +836,46 @@ __must_hold(ring_ent->queue->lock)
return 0;
}
+/*
+ * Copy from memmap.c, should be exported there
+ */
+static struct page **io_pin_pages(unsigned long uaddr, unsigned long len,
+ int *npages)
+{
+ unsigned long start, end, nr_pages;
+ struct page **pages;
+ int ret;
+
+ end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ start = uaddr >> PAGE_SHIFT;
+ nr_pages = end - start;
+ if (WARN_ON_ONCE(!nr_pages))
+ return ERR_PTR(-EINVAL);
+
+ pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL);
+ if (!pages)
+ return ERR_PTR(-ENOMEM);
+
+ ret = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE | FOLL_LONGTERM,
+ pages);
+ /* success, mapped all pages */
+ if (ret == nr_pages) {
+ *npages = nr_pages;
+ return pages;
+ }
+
+ /* partial map, or didn't map anything */
+ if (ret >= 0) {
+ /* if we did partial map, release any pages we did get */
+ if (ret)
+ unpin_user_pages(pages, ret);
+ ret = -EFAULT;
+ }
+ kvfree(pages);
+ return ERR_PTR(ret);
+}
+
+
/* FUSE_URING_REQ_FETCH handler */
static int fuse_uring_fetch(struct fuse_ring_ent *ring_ent,
struct io_uring_cmd *cmd, unsigned int issue_flags)
@@ -837,39 +883,48 @@ static int fuse_uring_fetch(struct fuse_ring_ent *ring_ent,
{
struct fuse_ring *ring = ring_ent->queue->ring;
struct fuse_ring_queue *queue = ring_ent->queue;
- int ret;
+ int err;
/* No other bit must be set here */
- ret = -EINVAL;
+ err = -EINVAL;
if (ring_ent->state != FRRS_INIT)
- goto err;
+ goto err_unlock;
/*
* FUSE_URING_REQ_FETCH is an initialization exception, needs
* state override
*/
ring_ent->state = FRRS_USERSPACE;
- ret = fuse_ring_ring_ent_unset_userspace(ring_ent);
- if (ret != 0) {
- pr_info_ratelimited(
- "qid=%d tag=%d register req state %d expected %d",
- queue->qid, ring_ent->tag, ring_ent->state,
- FRRS_INIT);
+ fuse_ring_ring_ent_unset_userspace(ring_ent);
+
+ err = _fuse_uring_fetch(ring_ent, cmd, issue_flags);
+ if (err)
+ goto err_unlock;
+
+ spin_unlock(&queue->lock);
+
+ /* must not hold the queue->lock */
+ ring_ent->user_pages = io_pin_pages(ring_ent->user_buf,
+ ring_ent->user_buf_len,
+ &ring_ent->nr_user_pages);
+ if (IS_ERR(ring_ent->user_pages)) {
+ err = PTR_ERR(ring_ent->user_pages);
+ pr_info("qid=%d ent=%d pin-res=%d\n",
+ queue->qid, ring_ent->tag, err);
goto err;
}
- ret = _fuse_uring_fetch(ring_ent, cmd, issue_flags);
- if (ret)
- goto err;
-
/*
* The ring entry is registered now and needs to be handled
* for shutdown.
*/
atomic_inc(&ring->queue_refs);
-err:
+ return 0;
+
+err_unlock:
spin_unlock(&queue->lock);
- return ret;
+err:
+ return err;
}
/**
@@ -920,7 +975,9 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
if (unlikely(fc->aborted || queue->stopped))
goto err_unlock;
- ring_ent->rreq = (void __user *)cmd_req->buf_ptr;
+ ring_ent->user_buf = cmd_req->buf_ptr;
+ ring_ent->user_buf_len = cmd_req->buf_len;
+
ring_ent->max_arg_len = cmd_req->buf_len -
offsetof(struct fuse_ring_req, in_out_arg);
ret = -EINVAL;
@@ -930,7 +987,6 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
goto err_unlock;
}
- ring_ent->rreq = (void __user *)cmd_req->buf_ptr;
ring_ent->max_arg_len = cmd_req->buf_len -
offsetof(struct fuse_ring_req, in_out_arg);
if (cmd_req->buf_len < ring->req_buf_sz) {
diff --git a/fs/fuse/dev_uring_i.h b/fs/fuse/dev_uring_i.h
index f1247ee57dc4..2e43b2e9bcf2 100644
--- a/fs/fuse/dev_uring_i.h
+++ b/fs/fuse/dev_uring_i.h
@@ -60,10 +60,17 @@ struct fuse_ring_ent {
/* fuse_req assigned to the ring entry */
struct fuse_req *fuse_req;
- /*
- * buffer provided by fuse server
- */
- struct fuse_ring_req __user *rreq;
+ /* buffer provided by fuse server */
+ unsigned long __user user_buf;
+
+ /* length of user_buf */
+ size_t user_buf_len;
+
+ /* mapped user_buf pages */
+ struct page **user_pages;
+
+ /* number of user pages */
+ int nr_user_pages;
/* struct fuse_ring_req::in_out_arg size*/
size_t max_arg_len;
diff --git a/fs/fuse/fuse_dev_i.h b/fs/fuse/fuse_dev_i.h
index 0fbb4f28261c..63e0e5dcb9f4 100644
--- a/fs/fuse/fuse_dev_i.h
+++ b/fs/fuse/fuse_dev_i.h
@@ -32,6 +32,8 @@ struct fuse_copy_state {
struct {
/* overall offset with the user buffer */
unsigned int offset;
+ struct page **pages;
+ int page_idx;
} ring;
};
--
2.43.0
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [PATCH RFC v3 17/17] fuse: {uring} Pin the user buffer
2024-09-01 13:37 ` [PATCH RFC v3 17/17] fuse: {uring} Pin the user buffer Bernd Schubert
@ 2024-09-04 15:47 ` Jens Axboe
2024-09-04 16:08 ` Bernd Schubert
2024-09-04 18:59 ` Jens Axboe
1 sibling, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-09-04 15:47 UTC (permalink / raw)
To: Bernd Schubert, Miklos Szeredi, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein
On 9/1/24 7:37 AM, Bernd Schubert wrote:
> This is to allow copying into the buffer from the application
> without the need to copy in ring context (and with that,
> the need that the ring task is active in kernel space).
>
> Also absolutely needed for now to avoid this teardown issue
I'm fine using these helpers, but they are absolutely not needed to
avoid that teardown issue - well they may help because it's already
mapped, but it's really the fault of your handler from attempting to map
in user pages from when it's teardown/fallback task_work. If invoked and
the ring is dying or not in the right task (as per the patch from
Pavel), then just cleanup and return -ECANCELED.
> +/*
> + * Copy from memmap.c, should be exported
> + */
> +static void io_pages_free(struct page ***pages, int npages)
> +{
> + struct page **page_array = *pages;
> +
> + if (!page_array)
> + return;
> +
> + unpin_user_pages(page_array, npages);
> + kvfree(page_array);
> + *pages = NULL;
> +}
I noticed this and the mapping helper being copied before seeing the
comments - just export them from memmap.c and use those rather than
copying in the code. Add that as a prep patch.
> @@ -417,6 +437,7 @@ static int fuse_uring_out_header_has_err(struct fuse_out_header *oh,
> goto seterr;
> }
>
> + /* FIXME copied from dev.c, check what 512 means */
> if (oh->error <= -512 || oh->error > 0) {
> err = -EINVAL;
> goto seterr;
-512 is -ERESTARTSYS
--
Jens Axboe
* Re: [PATCH RFC v3 17/17] fuse: {uring} Pin the user buffer
2024-09-04 15:47 ` Jens Axboe
@ 2024-09-04 16:08 ` Bernd Schubert
2024-09-04 16:16 ` Jens Axboe
0 siblings, 1 reply; 37+ messages in thread
From: Bernd Schubert @ 2024-09-04 16:08 UTC (permalink / raw)
To: Jens Axboe, Bernd Schubert, Miklos Szeredi, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein
Hi Jens,
thanks for your help.
On 9/4/24 17:47, Jens Axboe wrote:
> On 9/1/24 7:37 AM, Bernd Schubert wrote:
>> This is to allow copying into the buffer from the application
>> without the need to copy in ring context (and with that,
>> the need that the ring task is active in kernel space).
>>
>> Also absolutely needed for now to avoid this teardown issue
>
> I'm fine using these helpers, but they are absolutely not needed to
> avoid that teardown issue - well they may help because it's already
> mapped, but it's really the fault of your handler from attempting to map
> in user pages from when it's teardown/fallback task_work. If invoked and
> the ring is dying or not in the right task (as per the patch from
> Pavel), then just cleanup and return -ECANCELED.
As I had posted on Friday/Saturday, it didn't work. I had added a
debug pr_info to Pavel's patch, but somehow it didn't trigger on
PF_EXITING, and I haven't debugged it further yet as I was working on
the pin anyway. And since Monday I have been occupied with other work...
For this series it is needed to avoid kernel crashes. If we can fix
patches 15 and 16, all the better. Although we will still need it later
on as an optimization.
>
>> +/*
>> + * Copy from memmap.c, should be exported
>> + */
>> +static void io_pages_free(struct page ***pages, int npages)
>> +{
>> + struct page **page_array = *pages;
>> +
>> + if (!page_array)
>> + return;
>> +
>> + unpin_user_pages(page_array, npages);
>> + kvfree(page_array);
>> + *pages = NULL;
>> +}
>
> I noticed this and the mapping helper being copied before seeing the
> comments - just export them from memmap.c and use those rather than
> copying in the code. Add that as a prep patch.
No issue doing that either. The hard part is then getting it through
the different branches. I had removed the big optimization of
__wake_up_on_current_cpu from this series because it needs another
export.
>
>> @@ -417,6 +437,7 @@ static int fuse_uring_out_header_has_err(struct fuse_out_header *oh,
>> goto seterr;
>> }
>>
>> + /* FIXME copied from dev.c, check what 512 means */
>> if (oh->error <= -512 || oh->error > 0) {
>> err = -EINVAL;
>> goto seterr;
>
> -512 is -ERESTARTSYS
>
Ah, thank you! I'm going to add a separate patch for dev.c; as I wrote, this
was just a copy-and-paste.
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 592d0d96a106..779b23fa01c2 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2028,7 +2028,7 @@ static ssize_t fuse_dev_do_write(struct fuse_dev *fud,
}
err = -EINVAL;
- if (oh.error <= -512 || oh.error > 0)
+ if (oh.error <= -ERESTARTSYS || oh.error > 0)
goto copy_finish;
spin_lock(&fpq->lock);
Thanks,
Bernd
* Re: [PATCH RFC v3 17/17] fuse: {uring} Pin the user buffer
2024-09-04 16:08 ` Bernd Schubert
@ 2024-09-04 16:16 ` Jens Axboe
2024-09-04 19:25 ` Bernd Schubert
0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-09-04 16:16 UTC (permalink / raw)
To: Bernd Schubert, Bernd Schubert, Miklos Szeredi, Pavel Begunkov,
bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein
On 9/4/24 10:08 AM, Bernd Schubert wrote:
> Hi Jens,
>
> thanks for your help.
>
> On 9/4/24 17:47, Jens Axboe wrote:
>> On 9/1/24 7:37 AM, Bernd Schubert wrote:
>>> This is to allow copying into the buffer from the application
>>> without the need to copy in ring context (and with that,
>>> the need that the ring task is active in kernel space).
>>>
>>> Also absolutely needed for now to avoid this teardown issue
>>
>> I'm fine using these helpers, but they are absolutely not needed to
>> avoid that teardown issue - well they may help because it's already
>> mapped, but it's really the fault of your handler from attempting to map
>> in user pages from when it's teardown/fallback task_work. If invoked and
>> the ring is dying or not in the right task (as per the patch from
>> Pavel), then just cleanup and return -ECANCELED.
>
> As I had posted on Friday/Saturday, it didn't work. I had added a
> debug pr_info into Pavels patch, somehow it didn't trigger on PF_EXITING
> and I didn't further debug it yet as I was working on the pin anyway.
> And since Monday occupied with other work...
Then there's something wrong with that patch, as it definitely should
work. How did you reproduce the teardown crash? I'll take a look here.
That said, it may indeed be the better approach to pin upfront. I just
want to make sure it's not done as a bug fix for something that should
not be happening.
> For this series it is needed to avoid kernel crashes. If we can can fix
> patch 15 and 16, the better. Although we will still later on need it as
> optimization.
Yeah exactly, didn't see this before typing the above :-)
>>> +/*
>>> + * Copy from memmap.c, should be exported
>>> + */
>>> +static void io_pages_free(struct page ***pages, int npages)
>>> +{
>>> + struct page **page_array = *pages;
>>> +
>>> + if (!page_array)
>>> + return;
>>> +
>>> + unpin_user_pages(page_array, npages);
>>> + kvfree(page_array);
>>> + *pages = NULL;
>>> +}
>>
>> I noticed this and the mapping helper being copied before seeing the
>> comments - just export them from memmap.c and use those rather than
>> copying in the code. Add that as a prep patch.
>
> No issue to do that either. The hard part is then to get it through
> different branches. I had removed the big optimization of
> __wake_up_on_current_cpu in this series, because it needs another
> export.
It's not that hard, just split it out in the next patch and I'll be
happy to ack/review it so it can go in with the other patches rather
than needing to go in separately.
--
Jens Axboe
* Re: [PATCH RFC v3 17/17] fuse: {uring} Pin the user buffer
2024-09-04 16:16 ` Jens Axboe
@ 2024-09-04 19:25 ` Bernd Schubert
2024-09-04 19:40 ` Jens Axboe
0 siblings, 1 reply; 37+ messages in thread
From: Bernd Schubert @ 2024-09-04 19:25 UTC (permalink / raw)
To: Jens Axboe, Bernd Schubert, Miklos Szeredi, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein
On 9/4/24 18:16, Jens Axboe wrote:
> On 9/4/24 10:08 AM, Bernd Schubert wrote:
>> Hi Jens,
>>
>> thanks for your help.
>>
>> On 9/4/24 17:47, Jens Axboe wrote:
>>> On 9/1/24 7:37 AM, Bernd Schubert wrote:
>>>> This is to allow copying into the buffer from the application
>>>> without the need to copy in ring context (and with that,
>>>> the need that the ring task is active in kernel space).
>>>>
>>>> Also absolutely needed for now to avoid this teardown issue
>>>
>>> I'm fine using these helpers, but they are absolutely not needed to
>>> avoid that teardown issue - well they may help because it's already
>>> mapped, but it's really the fault of your handler from attempting to map
>>> in user pages from when it's teardown/fallback task_work. If invoked and
>>> the ring is dying or not in the right task (as per the patch from
>>> Pavel), then just cleanup and return -ECANCELED.
>>
>> As I had posted on Friday/Saturday, it didn't work. I had added a
>> debug pr_info into Pavels patch, somehow it didn't trigger on PF_EXITING
>> and I didn't further debug it yet as I was working on the pin anyway.
>> And since Monday occupied with other work...
>
> Then there's something wrong with that patch, as it definitely should
> work. How did you reproduce the teardown crash? I'll take a look here.
Thank you! In this specific case
1) Run passthrough_hp with --debug-fuse
2) dd if=/dev/zero of=/scratch/test/testfile bs=1M count=1
Then on the console that has passthrough_hp output and runs slow with my
ASAN/etc kernel: ctrl-z and kill -9 %
I guess a pkill -9 passthrough_hp should also work
But I can investigate later on myself what is the issue with PF_EXITING,
just not today and maybe not tomorrow either.
>
> That said, it may indeed be the better approach to pin upfront. I just
> want to make sure it's not done as a bug fix for something that should
> not be happening.
>
>> For this series it is needed to avoid kernel crashes. If we can can fix
>> patch 15 and 16, the better. Although we will still later on need it as
>> optimization.
>
> Yeah exactly, didn't see this before typing the above :-)
>
>>>> +/*
>>>> + * Copy from memmap.c, should be exported
>>>> + */
>>>> +static void io_pages_free(struct page ***pages, int npages)
>>>> +{
>>>> + struct page **page_array = *pages;
>>>> +
>>>> + if (!page_array)
>>>> + return;
>>>> +
>>>> + unpin_user_pages(page_array, npages);
>>>> + kvfree(page_array);
>>>> + *pages = NULL;
>>>> +}
>>>
>>> I noticed this and the mapping helper being copied before seeing the
>>> comments - just export them from memmap.c and use those rather than
>>> copying in the code. Add that as a prep patch.
>>
>> No issue to do that either. The hard part is then to get it through
>> different branches. I had removed the big optimization of
>> __wake_up_on_current_cpu in this series, because it needs another
>> export.
>
> It's not that hard, just split it out in the next patch and I'll be
> happy to ack/review it so it can go in with the other patches rather
> than needing to go in separately.
Great thank you very much, will do!
Thanks,
Bernd
* Re: [PATCH RFC v3 17/17] fuse: {uring} Pin the user buffer
2024-09-04 19:25 ` Bernd Schubert
@ 2024-09-04 19:40 ` Jens Axboe
2024-09-05 21:04 ` Bernd Schubert
0 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-09-04 19:40 UTC (permalink / raw)
To: Bernd Schubert, Bernd Schubert, Miklos Szeredi, Pavel Begunkov,
bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein
On 9/4/24 1:25 PM, Bernd Schubert wrote:
>
>
> On 9/4/24 18:16, Jens Axboe wrote:
>> On 9/4/24 10:08 AM, Bernd Schubert wrote:
>>> Hi Jens,
>>>
>>> thanks for your help.
>>>
>>> On 9/4/24 17:47, Jens Axboe wrote:
>>>> On 9/1/24 7:37 AM, Bernd Schubert wrote:
>>>>> This is to allow copying into the buffer from the application
>>>>> without the need to copy in ring context (and with that,
>>>>> the need that the ring task is active in kernel space).
>>>>>
>>>>> Also absolutely needed for now to avoid this teardown issue
>>>>
>>>> I'm fine using these helpers, but they are absolutely not needed to
>>>> avoid that teardown issue - well they may help because it's already
>>>> mapped, but it's really the fault of your handler from attempting to map
>>>> in user pages from when it's teardown/fallback task_work. If invoked and
>>>> the ring is dying or not in the right task (as per the patch from
>>>> Pavel), then just cleanup and return -ECANCELED.
>>>
>>> As I had posted on Friday/Saturday, it didn't work. I had added a
>>> debug pr_info into Pavels patch, somehow it didn't trigger on PF_EXITING
>>> and I didn't further debug it yet as I was working on the pin anyway.
>>> And since Monday occupied with other work...
>>
>> Then there's something wrong with that patch, as it definitely should
>> work. How did you reproduce the teardown crash? I'll take a look here.
>
> Thank you! In this specific case
>
> 1) Run passthrough_hp with --debug-fuse
>
> 2) dd if=/dev/zero of=/scratch/test/testfile bs=1M count=1
>
> Then on the console that has passthrough_hp output and runs slow with my
> ASAN/etc kernel: ctrl-z and kill -9 %
> I guess a pkill -9 passthrough_hp should also work
Eerily similar to what I tried, but I managed to get it to trigger.
What's in there should work, but I think checking for task != current
is better and not race prone like PF_EXITING is. So maybe? Try the
below incremental.
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 55bdcb4b63b3..fa5a0f724a84 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -121,7 +121,8 @@ static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts)
struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
unsigned flags = IO_URING_F_COMPLETE_DEFER;
- if (req->task->flags & PF_EXITING)
+ /* Different task should only happen if the original is going away */
+ if (req->task != current)
flags |= IO_URING_F_TASK_DEAD;
/* task_work executor checks the deffered list completion */
--
Jens Axboe
* Re: [PATCH RFC v3 17/17] fuse: {uring} Pin the user buffer
2024-09-04 19:40 ` Jens Axboe
@ 2024-09-05 21:04 ` Bernd Schubert
0 siblings, 0 replies; 37+ messages in thread
From: Bernd Schubert @ 2024-09-05 21:04 UTC (permalink / raw)
To: Jens Axboe, Bernd Schubert, Miklos Szeredi, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein
On 9/4/24 21:40, Jens Axboe wrote:
> On 9/4/24 1:25 PM, Bernd Schubert wrote:
>>
>>
>> On 9/4/24 18:16, Jens Axboe wrote:
>>> On 9/4/24 10:08 AM, Bernd Schubert wrote:
>>>> Hi Jens,
>>>>
>>>> thanks for your help.
>>>>
>>>> On 9/4/24 17:47, Jens Axboe wrote:
>>>>> On 9/1/24 7:37 AM, Bernd Schubert wrote:
>>>>>> This is to allow copying into the buffer from the application
>>>>>> without the need to copy in ring context (and with that,
>>>>>> the need that the ring task is active in kernel space).
>>>>>>
>>>>>> Also absolutely needed for now to avoid this teardown issue
>>>>>
>>>>> I'm fine using these helpers, but they are absolutely not needed to
>>>>> avoid that teardown issue - well they may help because it's already
>>>>> mapped, but it's really the fault of your handler from attempting to map
>>>>> in user pages from when it's teardown/fallback task_work. If invoked and
>>>>> the ring is dying or not in the right task (as per the patch from
>>>>> Pavel), then just cleanup and return -ECANCELED.
>>>>
>>>> As I had posted on Friday/Saturday, it didn't work. I had added a
>>>> debug pr_info into Pavels patch, somehow it didn't trigger on PF_EXITING
>>>> and I didn't further debug it yet as I was working on the pin anyway.
>>>> And since Monday occupied with other work...
>>>
>>> Then there's something wrong with that patch, as it definitely should
>>> work. How did you reproduce the teardown crash? I'll take a look here.
>>
>> Thank you! In this specific case
>>
>> 1) Run passthrough_hp with --debug-fuse
>>
>> 2) dd if=/dev/zero of=/scratch/test/testfile bs=1M count=1
>>
>> Then on the console that has passthrough_hp output and runs slow with my
>> ASAN/etc kernel: ctrl-z and kill -9 %
>> I guess a pkill -9 passthrough_hp should also work
>
> Eerily similar to what I tried, but I managed to get it to trigger.
> Should work what's in there, but I think checking for task != current is
> better and not race prone like PF_EXITING is. So maybe? Try with the
> below incremental.
>
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 55bdcb4b63b3..fa5a0f724a84 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -121,7 +121,8 @@ static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts)
> struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> unsigned flags = IO_URING_F_COMPLETE_DEFER;
>
> - if (req->task->flags & PF_EXITING)
> + /* Different task should only happen if the original is going away */
> + if (req->task != current)
> flags |= IO_URING_F_TASK_DEAD;
>
> /* task_work executor checks the deffered list completion */
>
Thanks, just tested: this version works fine!
My user of that (patch 16/17) left the fuse ring entry in a bad state -
fixed in my v4 branch.
Thanks,
Bernd
* Re: [PATCH RFC v3 17/17] fuse: {uring} Pin the user buffer
2024-09-01 13:37 ` [PATCH RFC v3 17/17] fuse: {uring} Pin the user buffer Bernd Schubert
2024-09-04 15:47 ` Jens Axboe
@ 2024-09-04 18:59 ` Jens Axboe
1 sibling, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-09-04 18:59 UTC (permalink / raw)
To: Bernd Schubert, Miklos Szeredi, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein
On 9/1/24 7:37 AM, Bernd Schubert wrote:
> @@ -465,53 +486,41 @@ static int fuse_uring_out_header_has_err(struct fuse_out_header *oh,
>
> static int fuse_uring_copy_from_ring(struct fuse_ring *ring,
> struct fuse_req *req,
> - struct fuse_ring_ent *ent)
> + struct fuse_ring_ent *ent,
> + struct fuse_ring_req *rreq)
> {
> - struct fuse_ring_req __user *rreq = ent->rreq;
> struct fuse_copy_state cs;
> struct fuse_args *args = req->args;
> struct iov_iter iter;
> - int err;
> - int res_arg_len;
> + int res_arg_len, err;
>
> - err = copy_from_user(&res_arg_len, &rreq->in_out_arg_len,
> - sizeof(res_arg_len));
> - if (err)
> - return err;
> -
> - err = import_ubuf(ITER_SOURCE, (void __user *)&rreq->in_out_arg,
> - ent->max_arg_len, &iter);
> - if (err)
> - return err;
> + res_arg_len = rreq->in_out_arg_len;
>
> fuse_copy_init(&cs, 0, &iter);
> cs.is_uring = 1;
> + cs.ring.pages = &ent->user_pages[FUSE_RING_PAYLOAD_PG];
> cs.req = req;
>
> - return fuse_copy_out_args(&cs, args, res_arg_len);
> + err = fuse_copy_out_args(&cs, args, res_arg_len);
> +
> + return err;
> }
This last assignment, and 'err' in general, can go away after this
patch.
--
Jens Axboe
* Re: [PATCH RFC v3 00/17] fuse: fuse-over-io-uring
2024-09-01 13:36 [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Bernd Schubert
` (16 preceding siblings ...)
2024-09-01 13:37 ` [PATCH RFC v3 17/17] fuse: {uring} Pin the user buffer Bernd Schubert
@ 2024-09-04 16:42 ` Jens Axboe
2024-09-04 19:37 ` Bernd Schubert
17 siblings, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2024-09-04 16:42 UTC (permalink / raw)
To: Bernd Schubert, Miklos Szeredi, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein
Overall I think this looks pretty reasonable from an io_uring point of
view. Some minor comments in the replies that would need to get
resolved, and we'll need to get Ming's buffer work done to reap the dio
benefits.
I ran a quick benchmark here, doing 4k buffered random reads from a big
file. I see about 25% improvement for that case, and notably at half the
CPU usage.
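A fio job along these lines would approximate that workload (hypothetical -
the file path, size, and runtime are assumptions, not taken from the thread):

```ini
# Buffered (page-cache) 4k random reads against a file on the FUSE mount.
[fuse-randread]
rw=randread
bs=4k
ioengine=psync
direct=0
filename=/mnt/fuse/bigfile
size=16G
time_based=1
runtime=30
```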
--
Jens Axboe
* Re: [PATCH RFC v3 00/17] fuse: fuse-over-io-uring
2024-09-04 16:42 ` [PATCH RFC v3 00/17] fuse: fuse-over-io-uring Jens Axboe
@ 2024-09-04 19:37 ` Bernd Schubert
2024-09-04 19:41 ` Jens Axboe
0 siblings, 1 reply; 37+ messages in thread
From: Bernd Schubert @ 2024-09-04 19:37 UTC (permalink / raw)
To: Jens Axboe, Bernd Schubert, Miklos Szeredi, Pavel Begunkov, bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein
On 9/4/24 18:42, Jens Axboe wrote:
> Overall I think this looks pretty reasonable from an io_uring point of
> view. Some minor comments in the replies that would need to get
> resolved, and we'll need to get Ming's buffer work done to reap the dio
> benefits.
>
> I ran a quick benchmark here, doing 4k buffered random reads from a big
> file. I see about 25% improvement for that case, and notably at half the
> CPU usage.
That is a bit low for my needs, but you will definitely need to wake up on
the same core - that is not applied in this patch version. I also need to
re-test with current kernel versions, but I think even that is not perfect.
We had a rather long discussion here
https://lore.kernel.org/lkml/[email protected]/T/#r58884ee2c68f9ac5fdb89c4e3a968007ff08468e
and there is a seesaw hack, which makes it work perfectly.
Then I got persistently distracted by other work - so far I haven't tracked
down why __wake_up_on_current_cpu didn't work. Back then it was also still
only a patch and not in Linux yet. I need to retest and possibly figure out
where the task switch happens.
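For illustration, the kind of wakeup being discussed might look like this
(hypothetical call site - the waitqueue field name is an assumption, only
__wake_up_on_current_cpu() itself comes from the discussion):

```c
/* Wake the ring-queue waiter on the CPU that queued the request,
 * instead of letting the scheduler pick a core for the waker.
 * 'queue->waitq' is an assumed field name, not from the series.
 */
__wake_up_on_current_cpu(&queue->waitq, TASK_INTERRUPTIBLE, NULL);
```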
Also, if you are testing with buffered writes, the v2 series had more
optimizations, like a core+1 hack for async IO.
I think in order to get it landed and to agree on the approach with
Miklos, it is better to first remove all these optimizations and fix
it later... Though for performance testing it is not optimal.
Thanks,
Bernd
* Re: [PATCH RFC v3 00/17] fuse: fuse-over-io-uring
2024-09-04 19:37 ` Bernd Schubert
@ 2024-09-04 19:41 ` Jens Axboe
0 siblings, 0 replies; 37+ messages in thread
From: Jens Axboe @ 2024-09-04 19:41 UTC (permalink / raw)
To: Bernd Schubert, Bernd Schubert, Miklos Szeredi, Pavel Begunkov,
bernd
Cc: linux-fsdevel, io-uring, Joanne Koong, Josef Bacik,
Amir Goldstein
On 9/4/24 1:37 PM, Bernd Schubert wrote:
>
>
> On 9/4/24 18:42, Jens Axboe wrote:
>> Overall I think this looks pretty reasonable from an io_uring point of
>> view. Some minor comments in the replies that would need to get
>> resolved, and we'll need to get Ming's buffer work done to reap the dio
>> benefits.
>>
>> I ran a quick benchmark here, doing 4k buffered random reads from a big
>> file. I see about 25% improvement for that case, and notably at half the
>> CPU usage.
>
> That is a bit low for my needs, but you will definitely need to wake up on
> the same core - that is not applied in this patch version. I also need to
> re-test with current kernel versions, but I think even that is not perfect.
>
> We had a rather long discussion here
> https://lore.kernel.org/lkml/[email protected]/T/#r58884ee2c68f9ac5fdb89c4e3a968007ff08468e
> and there is a seesaw hack, which makes it work perfectly.
> Then I got persistently distracted by other work - so far I haven't tracked
> down why __wake_up_on_current_cpu didn't work. Back then it was also still
> only a patch and not in Linux yet. I need to retest and possibly figure out
> where the task switch happens.
I'll give it a look, wasn't too worried about it as we're also still
missing the zero copy bits. More concerned with just getting the core of
it sane, which I think we're pretty close to. Then we can work on making
it even faster post that.
> Also, if you are testing with buffered writes, the v2 series had more
> optimizations, like a core+1 hack for async IO.
> I think in order to get it landed and to agree on the approach with
> Miklos, it is better to first remove all these optimizations and fix
> it later... Though for performance testing it is not optimal.
Exactly, that's why I objected to some of the v2 io_uring hackery that
just wasn't palatable.
--
Jens Axboe