* [PATCH 0/2] Fix 100% CPU usage issue in IOU worker threads
From: Zhiwei Jiang @ 2025-04-22 10:45 UTC
To: viro
Cc: brauner, jack, akpm, peterx, axboe, asml.silence, linux-fsdevel,
linux-mm, linux-kernel, io-uring, Zhiwei Jiang
In the Firecracker VM scenario, we sporadically encountered threads
stuck in the UN (uninterruptible sleep) state with the following call
stack:
[<0>] io_wq_put_and_exit+0xa1/0x210
[<0>] io_uring_clean_tctx+0x8e/0xd0
[<0>] io_uring_cancel_generic+0x19f/0x370
[<0>] __io_uring_cancel+0x14/0x20
[<0>] do_exit+0x17f/0x510
[<0>] do_group_exit+0x35/0x90
[<0>] get_signal+0x963/0x970
[<0>] arch_do_signal_or_restart+0x39/0x120
[<0>] syscall_exit_to_user_mode+0x206/0x260
[<0>] do_syscall_64+0x8d/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
The cause is a large number of IOU kernel worker threads saturating the
CPU and never exiting. When the issue occurs, CPU usage stays at 100%
and can only be resolved by rebooting. Each thread's profile appears as
follows:
iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork_asm
iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork
iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker
iou-wrk-44588 [kernel.kallsyms] [k] io_worker_handle_work
iou-wrk-44588 [kernel.kallsyms] [k] io_wq_submit_work
iou-wrk-44588 [kernel.kallsyms] [k] io_issue_sqe
iou-wrk-44588 [kernel.kallsyms] [k] io_write
iou-wrk-44588 [kernel.kallsyms] [k] blkdev_write_iter
iou-wrk-44588 [kernel.kallsyms] [k] iomap_file_buffered_write
iou-wrk-44588 [kernel.kallsyms] [k] iomap_write_iter
iou-wrk-44588 [kernel.kallsyms] [k] fault_in_iov_iter_readable
iou-wrk-44588 [kernel.kallsyms] [k] fault_in_readable
iou-wrk-44588 [kernel.kallsyms] [k] asm_exc_page_fault
iou-wrk-44588 [kernel.kallsyms] [k] exc_page_fault
iou-wrk-44588 [kernel.kallsyms] [k] do_user_addr_fault
iou-wrk-44588 [kernel.kallsyms] [k] handle_mm_fault
iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_fault
iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_no_page
iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_handle_userfault
iou-wrk-44588 [kernel.kallsyms] [k] handle_userfault
iou-wrk-44588 [kernel.kallsyms] [k] schedule
iou-wrk-44588 [kernel.kallsyms] [k] __schedule
iou-wrk-44588 [kernel.kallsyms] [k] __raw_spin_unlock_irq
iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker_sleeping
I tracked the faulting address and the related function graph, as well
as the wake-up side of the userfault, and found the following: when an
IOU worker faults in a user-space page that is registered with
userfaultfd, the worker does not sleep. This is because, during
scheduling, the check for the IOU worker context leads to an early
return. Meanwhile, the userfaultfd listener in user space never responds
with a UFFDIO_COPY, so the page table entry remains empty. Because of
the early return, the worker never sleeps waiting to be woken as a
normal userfault would; it keeps faulting on the same address, and the
CPU spins. I therefore believe userfaults need special handling: set a
new flag so that the schedule path proceeds in such cases and the thread
actually sleeps.
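Schematically, the loop looks like this (a simplified trace of the
path in the stack above, not the literal kernel code):

        /* iou-wrk thread; the page is UFFD-registered and missing */
        for (;;) {
                fault_in_readable(addr, len);   /* touch the page */
                /*
                 * -> page fault -> handle_userfault():
                 *      set_current_state(TASK_INTERRUPTIBLE);
                 *      schedule();  <- returns without really
                 *                      sleeping on the io-wq path
                 *      return VM_FAULT_RETRY;
                 * No UFFDIO_COPY ever arrives, the PTE stays empty,
                 * and the retried access faults again at once.
                 */
        }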
Patch 1 io_uring: Add new functions to handle user fault scenarios
Patch 2 userfaultfd: Set the corresponding flag in IOU worker context
fs/userfaultfd.c | 7 ++++++
io_uring/io-wq.c | 57 +++++++++++++++---------------------------------
io_uring/io-wq.h | 45 ++++++++++++++++++++++++++++++++++++--
3 files changed, 68 insertions(+), 41 deletions(-)
--
2.34.1
* [PATCH 1/2] io_uring: Add new functions to handle user fault scenarios
From: Zhiwei Jiang @ 2025-04-22 10:45 UTC
To: viro
Cc: brauner, jack, akpm, peterx, axboe, asml.silence, linux-fsdevel,
linux-mm, linux-kernel, io-uring, Zhiwei Jiang
In the Firecracker VM scenario, we sporadically encountered threads
stuck in the UN (uninterruptible sleep) state with the following call
stack:
[<0>] io_wq_put_and_exit+0xa1/0x210
[<0>] io_uring_clean_tctx+0x8e/0xd0
[<0>] io_uring_cancel_generic+0x19f/0x370
[<0>] __io_uring_cancel+0x14/0x20
[<0>] do_exit+0x17f/0x510
[<0>] do_group_exit+0x35/0x90
[<0>] get_signal+0x963/0x970
[<0>] arch_do_signal_or_restart+0x39/0x120
[<0>] syscall_exit_to_user_mode+0x206/0x260
[<0>] do_syscall_64+0x8d/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
The cause is a large number of IOU kernel worker threads saturating the
CPU and never exiting. When the issue occurs, CPU usage stays at 100%
and can only be resolved by rebooting. Each thread's profile appears as
follows:
iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork_asm
iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork
iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker
iou-wrk-44588 [kernel.kallsyms] [k] io_worker_handle_work
iou-wrk-44588 [kernel.kallsyms] [k] io_wq_submit_work
iou-wrk-44588 [kernel.kallsyms] [k] io_issue_sqe
iou-wrk-44588 [kernel.kallsyms] [k] io_write
iou-wrk-44588 [kernel.kallsyms] [k] blkdev_write_iter
iou-wrk-44588 [kernel.kallsyms] [k] iomap_file_buffered_write
iou-wrk-44588 [kernel.kallsyms] [k] iomap_write_iter
iou-wrk-44588 [kernel.kallsyms] [k] fault_in_iov_iter_readable
iou-wrk-44588 [kernel.kallsyms] [k] fault_in_readable
iou-wrk-44588 [kernel.kallsyms] [k] asm_exc_page_fault
iou-wrk-44588 [kernel.kallsyms] [k] exc_page_fault
iou-wrk-44588 [kernel.kallsyms] [k] do_user_addr_fault
iou-wrk-44588 [kernel.kallsyms] [k] handle_mm_fault
iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_fault
iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_no_page
iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_handle_userfault
iou-wrk-44588 [kernel.kallsyms] [k] handle_userfault
iou-wrk-44588 [kernel.kallsyms] [k] schedule
iou-wrk-44588 [kernel.kallsyms] [k] __schedule
iou-wrk-44588 [kernel.kallsyms] [k] __raw_spin_unlock_irq
iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker_sleeping
I tracked the faulting address and the related function graph, as well
as the wake-up side of the userfault, and found the following: when an
IOU worker faults in a user-space page that is registered with
userfaultfd, the worker does not sleep. This is because, during
scheduling, the check for the IOU worker context leads to an early
return. Meanwhile, the userfaultfd listener in user space never responds
with a UFFDIO_COPY, so the page table entry remains empty. Because of
the early return, the worker never sleeps waiting to be woken as a
normal userfault would; it keeps faulting on the same address, and the
CPU spins. I therefore believe userfaults need special handling: set a
new flag so that the schedule path proceeds in such cases and the thread
actually sleeps. Export the relevant functions and the struct so that
userfaultfd can use them.
Signed-off-by: Zhiwei Jiang <qq282012236@gmail.com>
---
io_uring/io-wq.c | 57 +++++++++++++++---------------------------------
io_uring/io-wq.h | 45 ++++++++++++++++++++++++++++++++++++--
2 files changed, 61 insertions(+), 41 deletions(-)
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 04a75d666195..8faad766d565 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -26,12 +26,6 @@
 #define WORKER_IDLE_TIMEOUT     (5 * HZ)
 #define WORKER_INIT_LIMIT       3
 
-enum {
-        IO_WORKER_F_UP = 0,      /* up and active */
-        IO_WORKER_F_RUNNING = 1, /* account as running */
-        IO_WORKER_F_FREE = 2,    /* worker on free list */
-};
-
 enum {
         IO_WQ_BIT_EXIT = 0,      /* wq exiting */
 };
@@ -40,33 +34,6 @@ enum {
         IO_ACCT_STALLED_BIT = 0, /* stalled on hash */
 };
 
-/*
- * One for each thread in a wq pool
- */
-struct io_worker {
-        refcount_t ref;
-        unsigned long flags;
-        struct hlist_nulls_node nulls_node;
-        struct list_head all_list;
-        struct task_struct *task;
-        struct io_wq *wq;
-        struct io_wq_acct *acct;
-
-        struct io_wq_work *cur_work;
-        raw_spinlock_t lock;
-
-        struct completion ref_done;
-
-        unsigned long create_state;
-        struct callback_head create_work;
-        int init_retries;
-
-        union {
-                struct rcu_head rcu;
-                struct delayed_work work;
-        };
-};
-
 #if BITS_PER_LONG == 64
 #define IO_WQ_HASH_ORDER        6
 #else
@@ -706,6 +673,16 @@ static int io_wq_worker(void *data)
         return 0;
 }
 
+void set_userfault_flag_for_ioworker(struct io_worker *worker)
+{
+        set_bit(IO_WORKER_F_FAULT, &worker->flags);
+}
+
+void clear_userfault_flag_for_ioworker(struct io_worker *worker)
+{
+        clear_bit(IO_WORKER_F_FAULT, &worker->flags);
+}
+
 /*
  * Called when a worker is scheduled in. Mark us as currently running.
  */
@@ -715,12 +692,14 @@ void io_wq_worker_running(struct task_struct *tsk)
         if (!worker)
                 return;
 
-        if (!test_bit(IO_WORKER_F_UP, &worker->flags))
-                return;
-        if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
-                return;
-        set_bit(IO_WORKER_F_RUNNING, &worker->flags);
-        io_wq_inc_running(worker);
+        if (!test_bit(IO_WORKER_F_FAULT, &worker->flags)) {
+                if (!test_bit(IO_WORKER_F_UP, &worker->flags))
+                        return;
+                if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
+                        return;
+                set_bit(IO_WORKER_F_RUNNING, &worker->flags);
+                io_wq_inc_running(worker);
+        }
 }
 
 /*
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index d4fb2940e435..9444912d038d 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -15,6 +15,13 @@ enum {
         IO_WQ_HASH_SHIFT = 24,  /* upper 8 bits are used for hash key */
 };
 
+enum {
+        IO_WORKER_F_UP = 0,      /* up and active */
+        IO_WORKER_F_RUNNING = 1, /* account as running */
+        IO_WORKER_F_FREE = 2,    /* worker on free list */
+        IO_WORKER_F_FAULT = 3,   /* used for userfault */
+};
+
 enum io_wq_cancel {
         IO_WQ_CANCEL_OK,        /* cancelled before started */
         IO_WQ_CANCEL_RUNNING,   /* found, running, and attempted cancelled */
@@ -24,6 +31,32 @@ enum io_wq_cancel {
 typedef struct io_wq_work *(free_work_fn)(struct io_wq_work *);
 typedef void (io_wq_work_fn)(struct io_wq_work *);
 
+/*
+ * One for each thread in a wq pool
+ */
+struct io_worker {
+        refcount_t ref;
+        unsigned long flags;
+        struct hlist_nulls_node nulls_node;
+        struct list_head all_list;
+        struct task_struct *task;
+        struct io_wq *wq;
+        struct io_wq_acct *acct;
+
+        struct io_wq_work *cur_work;
+        raw_spinlock_t lock;
+        struct completion ref_done;
+
+        unsigned long create_state;
+        struct callback_head create_work;
+        int init_retries;
+
+        union {
+                struct rcu_head rcu;
+                struct delayed_work work;
+        };
+};
+
 struct io_wq_hash {
         refcount_t refs;
         unsigned long map;
@@ -70,8 +103,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
                                   void *data, bool cancel_all);
 
 #if defined(CONFIG_IO_WQ)
-extern void io_wq_worker_sleeping(struct task_struct *);
-extern void io_wq_worker_running(struct task_struct *);
+extern void io_wq_worker_sleeping(struct task_struct *tsk);
+extern void io_wq_worker_running(struct task_struct *tsk);
+extern void set_userfault_flag_for_ioworker(struct io_worker *worker);
+extern void clear_userfault_flag_for_ioworker(struct io_worker *worker);
 #else
 static inline void io_wq_worker_sleeping(struct task_struct *tsk)
 {
@@ -79,6 +114,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
 static inline void io_wq_worker_running(struct task_struct *tsk)
 {
 }
+static inline void set_userfault_flag_for_ioworker(struct io_worker *worker)
+{
+}
+static inline void clear_userfault_flag_for_ioworker(struct io_worker *worker)
+{
+}
 #endif
 
 static inline bool io_wq_current_is_worker(void)
--
2.34.1
* [PATCH 2/2] userfaultfd: Set the corresponding flag in IOU worker context
From: Zhiwei Jiang @ 2025-04-22 10:45 UTC
To: viro
Cc: brauner, jack, akpm, peterx, axboe, asml.silence, linux-fsdevel,
linux-mm, linux-kernel, io-uring, Zhiwei Jiang
Set this flag to avoid a premature return from schedule() in IOU worker
threads, ensuring the thread sleeps and waits to be woken up as in the
normal userfault case.
Signed-off-by: Zhiwei Jiang <qq282012236@gmail.com>
---
fs/userfaultfd.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index d80f94346199..74bead069e85 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -32,6 +32,7 @@
 #include <linux/swapops.h>
 #include <linux/miscdevice.h>
 #include <linux/uio.h>
+#include "../io_uring/io-wq.h"
 
 static int sysctl_unprivileged_userfaultfd __read_mostly;
 
@@ -369,7 +370,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
         vm_fault_t ret = VM_FAULT_SIGBUS;
         bool must_wait;
         unsigned int blocking_state;
+        struct io_worker *worker = current->worker_private;
 
+        if (worker)
+                set_userfault_flag_for_ioworker(worker);
         /*
          * We don't do userfault handling for the final child pid update
          * and when coredumping (faults triggered by get_dump_page()).
@@ -506,6 +510,9 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
         __set_current_state(TASK_RUNNING);
 
+        if (worker)
+                clear_userfault_flag_for_ioworker(worker);
+
         /*
          * Here we race with the list_del; list_add in
          * userfaultfd_ctx_read(), however because we don't ever run
--
2.34.1
* Re: [PATCH 0/2] Fix 100% CPU usage issue in IOU worker threads
From: Jens Axboe @ 2025-04-22 13:34 UTC
To: Zhiwei Jiang, viro
Cc: brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
linux-mm, linux-kernel, io-uring
On 4/22/25 4:45 AM, Zhiwei Jiang wrote:
> In the Firecracker VM scenario, we sporadically encountered threads
> stuck in the UN (uninterruptible sleep) state with the following call
> stack:
[...]
> Patch 1 io_uring: Add new functions to handle user fault scenarios
> Patch 2 userfaultfd: Set the corresponding flag in IOU worker context
>
> fs/userfaultfd.c | 7 ++++++
> io_uring/io-wq.c | 57 +++++++++++++++---------------------------------
> io_uring/io-wq.h | 45 ++++++++++++++++++++++++++++++++++++--
> 3 files changed, 68 insertions(+), 41 deletions(-)
Do you have a test case for this? I don't think the proposed solution is
very elegant; userfaultfd should not need to know about thread workers.
I'll ponder this a bit...
--
Jens Axboe
* Re: [PATCH 0/2] Fix 100% CPU usage issue in IOU worker threads
2025-04-22 13:34 ` [PATCH 0/2] Fix 100% CPU usage issue in IOU worker threads Jens Axboe
@ 2025-04-22 14:10 ` 姜智伟
2025-04-22 14:13 ` Jens Axboe
0 siblings, 1 reply; 11+ messages in thread
From: 姜智伟 @ 2025-04-22 14:10 UTC (permalink / raw)
To: Jens Axboe
Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
linux-mm, linux-kernel, io-uring
On Tue, Apr 22, 2025 at 9:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/22/25 4:45 AM, Zhiwei Jiang wrote:
[...]
> Do you have a test case for this? I don't think the proposed solution is
> very elegant; userfaultfd should not need to know about thread workers.
> I'll ponder this a bit...
>
> --
> Jens Axboe
Sorry, the issue occurs very infrequently, and I can't reproduce it
manually. The approach is not very elegant, but for corner cases it
seems necessary to make some compromises.
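For what it's worth, the setup where we hit it looks roughly like the
sketch below. It is untested as a standalone reproducer (the hang is
timing-dependent, and error handling is omitted); it assumes liburing
and preallocated hugepages, needs root (or
vm.unprivileged_userfaultfd=1), and /dev/nullb0 is only an example
target device:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
        /* Register a hugetlb buffer with userfaultfd, but never
         * serve the faults (no UFFDIO_COPY ever issued). */
        long uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        size_t len = 2UL << 20;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)buf, .len = len },
                .mode = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /* Queue a buffered write sourced from the still-unfaulted
         * buffer; io_uring punts it to an iou-wrk thread, which then
         * faults on the UFFD-registered pages. */
        int fd = open("/dev/nullb0", O_WRONLY);
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, buf, len, 0);
        io_uring_submit(&ring);

        /* Never read uffd events: the worker should now sleep in
         * handle_userfault(). In the bad case it spins instead. */
        pause();
        return 0;
}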
* Re: [PATCH 0/2] Fix 100% CPU usage issue in IOU worker threads
From: Jens Axboe @ 2025-04-22 14:13 UTC
To: 姜智伟
Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
linux-mm, linux-kernel, io-uring
On 4/22/25 8:10 AM, 姜智伟 wrote:
> On Tue, Apr 22, 2025 at 9:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 4/22/25 4:45 AM, Zhiwei Jiang wrote:
[...]
>> Do you have a test case for this? I don't think the proposed solution is
>> very elegant; userfaultfd should not need to know about thread workers.
>> I'll ponder this a bit...
> Sorry, the issue occurs very infrequently, and I can't reproduce it
> manually. The approach is not very elegant, but for corner cases it
> seems necessary to make some compromises.
I'm going to see if I can create one. Not sure I fully understand the
issue yet, but I'd be surprised if there isn't a more appropriate and
elegant solution rather than exposing the io-wq guts and having
userfaultfd manipulate them. That really should not be necessary.
--
Jens Axboe
* Re: [PATCH 0/2] Fix 100% CPU usage issue in IOU worker threads
From: 姜智伟 @ 2025-04-22 14:18 UTC
To: Jens Axboe
Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
linux-mm, linux-kernel, io-uring
On Tue, Apr 22, 2025 at 10:13 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/22/25 8:10 AM, 姜智伟 wrote:
[...]
> I'm going to see if I can create one. Not sure I fully understand the
> issue yet, but I'd be surprised if there isn't a more appropriate and
> elegant solution rather than exposing the io-wq guts and having
> userfaultfd manipulate them. That really should not be necessary.
>
> --
> Jens Axboe
Thanks. I'm looking forward to your good news.
* Re: [PATCH 0/2] Fix 100% CPU usage issue in IOU worker threads
From: Jens Axboe @ 2025-04-22 14:29 UTC
To: 姜智伟
Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
linux-mm, linux-kernel, io-uring
On 4/22/25 8:18 AM, 姜智伟 wrote:
> On Tue, Apr 22, 2025 at 10:13 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
[...]
>> I'm going to see if I can create one. Not sure I fully understand the
>> issue yet, but I'd be surprised if there isn't a more appropriate and
>> elegant solution rather than exposing the io-wq guts and having
>> userfaultfd manipulate them. That really should not be necessary.
> Thanks. I'm looking forward to your good news.
Well, let's hope there is! In any case, your patches could be
considerably improved if you did:
void set_userfault_flag_for_ioworker(void)
{
        struct io_worker *worker;

        if (!(current->flags & PF_IO_WORKER))
                return;
        worker = current->worker_private;
        set_bit(IO_WORKER_F_FAULT, &worker->flags);
}

void clear_userfault_flag_for_ioworker(void)
{
        struct io_worker *worker;

        if (!(current->flags & PF_IO_WORKER))
                return;
        worker = current->worker_private;
        clear_bit(IO_WORKER_F_FAULT, &worker->flags);
}
and then userfaultfd would not need any odd checking, or need the io-wq
structures made public. That'd drastically cut down on the size of the
patches, and make them a bit more palatable.
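With those, the calls in fs/userfaultfd.c would reduce to something
like this (just a sketch, untested):

vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
{
        ...
        set_userfault_flag_for_ioworker();
        ...
        __set_current_state(TASK_RUNNING);
        clear_userfault_flag_for_ioworker();
        ...
}

No NULL check of worker_private needed, and struct io_worker can stay
private to io-wq.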
--
Jens Axboe
* Re: [PATCH 0/2] Fix 100% CPU usage issue in IOU worker threads
From: Jens Axboe @ 2025-04-22 15:49 UTC
To: 姜智伟
Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
linux-mm, linux-kernel, io-uring
On 4/22/25 8:29 AM, Jens Axboe wrote:
> On 4/22/25 8:18 AM, 姜智伟 wrote:
>> Thanks. I'm looking forward to your good news.
>
> Well, let's hope there is! In any case, your patches could be
> considerably improved if you did:
[...]
> and then userfaultfd would not need any odd checking, or need the io-wq
> structures made public. That'd drastically cut down on the size of the
> patches, and make them a bit more palatable.
Forgot to ask, what kernel are you running on?
--
Jens Axboe
* Re: [PATCH 0/2] Fix 100% CPU usage issue in IOU worker threads
From: 姜智伟 @ 2025-04-22 16:14 UTC
To: Jens Axboe
Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
linux-mm, linux-kernel, io-uring
On Tue, Apr 22, 2025 at 11:50 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/22/25 8:29 AM, Jens Axboe wrote:
[...]
> Forgot to ask, what kernel are you running on?
>
> --
> Jens Axboe
Thanks, Jens. It is linux-image-6.8.0-1026-gcp.
* Re: [PATCH 0/2] Fix 100% CPU usage issue in IOU worker threads
From: Jens Axboe @ 2025-04-22 16:24 UTC
To: 姜智伟
Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
linux-mm, linux-kernel, io-uring
On 4/22/25 10:14 AM, 姜智伟 wrote:
> On Tue, Apr 22, 2025 at 11:50 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
[...]
>> Forgot to ask, what kernel are you running on?
> Thanks, Jens. It is linux-image-6.8.0-1026-gcp.
OK, that's ancient and unsupported in that no stable release is
happening for that kernel. Does it happen on newer kernels too?
FWIW, I haven't been able to reproduce anything odd so far. The io_uring
writes going via io-wq and hitting the userfaultfd path end up sleeping
in the schedule() in handle_userfault() - which is what I'd expect.
Do you know how many pending writes there are? I have a hard time
understanding your description of the problem, but it sounds like a ton
of workers are being created. It's still not clear to me why that would
be; workers only get created if there's more work to do and the current
worker is going to sleep.
Puzzled...
--
Jens Axboe