public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/2] Fix 100% CPU usage issue in IOU worker threads
@ 2025-04-22 16:29 Zhiwei Jiang
  2025-04-22 16:29 ` [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios Zhiwei Jiang
  2025-04-22 16:29 ` [PATCH v2 2/2] userfaultfd: Set the corresponding flag in IOU worker context Zhiwei Jiang
  0 siblings, 2 replies; 25+ messages in thread
From: Zhiwei Jiang @ 2025-04-22 16:29 UTC (permalink / raw)
  To: viro
  Cc: brauner, jack, akpm, peterx, axboe, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring, Zhiwei Jiang

In the Firecracker VM scenario, we sporadically encountered threads stuck
in the UN (uninterruptible) state with the following call stack:
[<0>] io_wq_put_and_exit+0xa1/0x210
[<0>] io_uring_clean_tctx+0x8e/0xd0
[<0>] io_uring_cancel_generic+0x19f/0x370
[<0>] __io_uring_cancel+0x14/0x20
[<0>] do_exit+0x17f/0x510
[<0>] do_group_exit+0x35/0x90
[<0>] get_signal+0x963/0x970
[<0>] arch_do_signal_or_restart+0x39/0x120
[<0>] syscall_exit_to_user_mode+0x206/0x260
[<0>] do_syscall_64+0x8d/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
The cause is a large number of IOU worker kernel threads saturating the
CPU and never exiting. When the issue occurs, CPU usage stays at 100% and
can only be resolved by rebooting. Each thread's stack appears as follows:
iou-wrk-44588  [kernel.kallsyms]  [k] ret_from_fork_asm
iou-wrk-44588  [kernel.kallsyms]  [k] ret_from_fork
iou-wrk-44588  [kernel.kallsyms]  [k] io_wq_worker
iou-wrk-44588  [kernel.kallsyms]  [k] io_worker_handle_work
iou-wrk-44588  [kernel.kallsyms]  [k] io_wq_submit_work
iou-wrk-44588  [kernel.kallsyms]  [k] io_issue_sqe
iou-wrk-44588  [kernel.kallsyms]  [k] io_write
iou-wrk-44588  [kernel.kallsyms]  [k] blkdev_write_iter
iou-wrk-44588  [kernel.kallsyms]  [k] iomap_file_buffered_write
iou-wrk-44588  [kernel.kallsyms]  [k] iomap_write_iter
iou-wrk-44588  [kernel.kallsyms]  [k] fault_in_iov_iter_readable
iou-wrk-44588  [kernel.kallsyms]  [k] fault_in_readable
iou-wrk-44588  [kernel.kallsyms]  [k] asm_exc_page_fault
iou-wrk-44588  [kernel.kallsyms]  [k] exc_page_fault
iou-wrk-44588  [kernel.kallsyms]  [k] do_user_addr_fault
iou-wrk-44588  [kernel.kallsyms]  [k] handle_mm_fault
iou-wrk-44588  [kernel.kallsyms]  [k] hugetlb_fault
iou-wrk-44588  [kernel.kallsyms]  [k] hugetlb_no_page
iou-wrk-44588  [kernel.kallsyms]  [k] hugetlb_handle_userfault
iou-wrk-44588  [kernel.kallsyms]  [k] handle_userfault
iou-wrk-44588  [kernel.kallsyms]  [k] schedule
iou-wrk-44588  [kernel.kallsyms]  [k] __schedule
iou-wrk-44588  [kernel.kallsyms]  [k] __raw_spin_unlock_irq
iou-wrk-44588  [kernel.kallsyms]  [k] io_wq_worker_sleeping

I tracked the address that triggered the fault and the related function
graph, as well as the wake-up side of the user fault, and discovered the
following: when an IOU worker faults in a user-space page that is
registered with userfaultfd, it does not sleep. During scheduling, the
IOU worker context check leads to an early return. Meanwhile, the
userfaultfd listener in user space never responds with a copy
(UFFDIO_COPY), so the page table entry remains empty. Because of the
early return, the worker does not sleep and wait to be woken as in a
normal user fault, and thus keeps faulting at the same address, spinning
on the CPU. Therefore, I believe it is necessary to handle user faults
specially by setting a new flag that allows the schedule path to proceed
in such cases, ensuring the thread actually sleeps.

Patch 1  io_uring: Add new functions to handle user fault scenarios
Patch 2  userfaultfd: Set the corresponding flag in IOU worker context

Changes since v1:
- Reworked the code per Jens Axboe's suggestion to reduce exposure of
  the io_worker structure.


 fs/userfaultfd.c |  4 ++++
 io_uring/io-wq.c | 35 +++++++++++++++++++++++++++++------
 io_uring/io-wq.h | 12 ++++++++++--
 3 files changed, 43 insertions(+), 8 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-22 16:29 [PATCH v2 0/2] Fix 100% CPU usage issue in IOU worker threads Zhiwei Jiang
@ 2025-04-22 16:29 ` Zhiwei Jiang
  2025-04-22 16:32   ` Jens Axboe
  2025-04-22 16:29 ` [PATCH v2 2/2] userfaultfd: Set the corresponding flag in IOU worker context Zhiwei Jiang
  1 sibling, 1 reply; 25+ messages in thread
From: Zhiwei Jiang @ 2025-04-22 16:29 UTC (permalink / raw)
  To: viro
  Cc: brauner, jack, akpm, peterx, axboe, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring, Zhiwei Jiang

In the Firecracker VM scenario, we sporadically encountered threads stuck
in the UN (uninterruptible) state with the following call stack:
[<0>] io_wq_put_and_exit+0xa1/0x210
[<0>] io_uring_clean_tctx+0x8e/0xd0
[<0>] io_uring_cancel_generic+0x19f/0x370
[<0>] __io_uring_cancel+0x14/0x20
[<0>] do_exit+0x17f/0x510
[<0>] do_group_exit+0x35/0x90
[<0>] get_signal+0x963/0x970
[<0>] arch_do_signal_or_restart+0x39/0x120
[<0>] syscall_exit_to_user_mode+0x206/0x260
[<0>] do_syscall_64+0x8d/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
The cause is a large number of IOU worker kernel threads saturating the
CPU and never exiting. When the issue occurs, CPU usage stays at 100% and
can only be resolved by rebooting. Each thread's stack appears as follows:
iou-wrk-44588  [kernel.kallsyms]  [k] ret_from_fork_asm
iou-wrk-44588  [kernel.kallsyms]  [k] ret_from_fork
iou-wrk-44588  [kernel.kallsyms]  [k] io_wq_worker
iou-wrk-44588  [kernel.kallsyms]  [k] io_worker_handle_work
iou-wrk-44588  [kernel.kallsyms]  [k] io_wq_submit_work
iou-wrk-44588  [kernel.kallsyms]  [k] io_issue_sqe
iou-wrk-44588  [kernel.kallsyms]  [k] io_write
iou-wrk-44588  [kernel.kallsyms]  [k] blkdev_write_iter
iou-wrk-44588  [kernel.kallsyms]  [k] iomap_file_buffered_write
iou-wrk-44588  [kernel.kallsyms]  [k] iomap_write_iter
iou-wrk-44588  [kernel.kallsyms]  [k] fault_in_iov_iter_readable
iou-wrk-44588  [kernel.kallsyms]  [k] fault_in_readable
iou-wrk-44588  [kernel.kallsyms]  [k] asm_exc_page_fault
iou-wrk-44588  [kernel.kallsyms]  [k] exc_page_fault
iou-wrk-44588  [kernel.kallsyms]  [k] do_user_addr_fault
iou-wrk-44588  [kernel.kallsyms]  [k] handle_mm_fault
iou-wrk-44588  [kernel.kallsyms]  [k] hugetlb_fault
iou-wrk-44588  [kernel.kallsyms]  [k] hugetlb_no_page
iou-wrk-44588  [kernel.kallsyms]  [k] hugetlb_handle_userfault
iou-wrk-44588  [kernel.kallsyms]  [k] handle_userfault
iou-wrk-44588  [kernel.kallsyms]  [k] schedule
iou-wrk-44588  [kernel.kallsyms]  [k] __schedule
iou-wrk-44588  [kernel.kallsyms]  [k] __raw_spin_unlock_irq
iou-wrk-44588  [kernel.kallsyms]  [k] io_wq_worker_sleeping

I tracked the address that triggered the fault and the related function
graph, as well as the wake-up side of the user fault, and discovered the
following: when an IOU worker faults in a user-space page that is
registered with userfaultfd, it does not sleep. During scheduling, the
IOU worker context check leads to an early return. Meanwhile, the
userfaultfd listener in user space never responds with a copy
(UFFDIO_COPY), so the page table entry remains empty. Because of the
early return, the worker does not sleep and wait to be woken as in a
normal user fault, and thus keeps faulting at the same address, spinning
on the CPU.

Therefore, I believe it is necessary to handle user faults specially by
setting a new flag that allows the schedule path to proceed in such
cases, ensuring the thread actually sleeps. Export the relevant
functions for userfaultfd to use.

Signed-off-by: Zhiwei Jiang <qq282012236@gmail.com>
---
 io_uring/io-wq.c | 35 +++++++++++++++++++++++++++++------
 io_uring/io-wq.h | 12 ++++++++++--
 2 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 04a75d666195..4a4f65de6699 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -30,6 +30,7 @@ enum {
 	IO_WORKER_F_UP		= 0,	/* up and active */
 	IO_WORKER_F_RUNNING	= 1,	/* account as running */
 	IO_WORKER_F_FREE	= 2,	/* worker on free list */
+	IO_WORKER_F_FAULT	= 3,	/* used for userfault */
 };
 
 enum {
@@ -706,6 +707,26 @@ static int io_wq_worker(void *data)
 	return 0;
 }
 
+void set_userfault_flag_for_ioworker(void)
+{
+	struct io_worker *worker;
+
+	if (!(current->flags & PF_IO_WORKER))
+		return;
+	worker = current->worker_private;
+	set_bit(IO_WORKER_F_FAULT, &worker->flags);
+}
+
+void clear_userfault_flag_for_ioworker(void)
+{
+	struct io_worker *worker;
+
+	if (!(current->flags & PF_IO_WORKER))
+		return;
+	worker = current->worker_private;
+	clear_bit(IO_WORKER_F_FAULT, &worker->flags);
+}
+
 /*
  * Called when a worker is scheduled in. Mark us as currently running.
  */
@@ -715,12 +736,14 @@ void io_wq_worker_running(struct task_struct *tsk)
 
 	if (!worker)
 		return;
-	if (!test_bit(IO_WORKER_F_UP, &worker->flags))
-		return;
-	if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
-		return;
-	set_bit(IO_WORKER_F_RUNNING, &worker->flags);
-	io_wq_inc_running(worker);
+	if (!test_bit(IO_WORKER_F_FAULT, &worker->flags)) {
+		if (!test_bit(IO_WORKER_F_UP, &worker->flags))
+			return;
+		if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
+			return;
+		set_bit(IO_WORKER_F_RUNNING, &worker->flags);
+		io_wq_inc_running(worker);
+	}
 }
 
 /*
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index d4fb2940e435..8567a9c819db 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
 					void *data, bool cancel_all);
 
 #if defined(CONFIG_IO_WQ)
-extern void io_wq_worker_sleeping(struct task_struct *);
-extern void io_wq_worker_running(struct task_struct *);
+extern void io_wq_worker_sleeping(struct task_struct *tsk);
+extern void io_wq_worker_running(struct task_struct *tsk);
+extern void set_userfault_flag_for_ioworker(void);
+extern void clear_userfault_flag_for_ioworker(void);
 #else
 static inline void io_wq_worker_sleeping(struct task_struct *tsk)
 {
@@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
 static inline void io_wq_worker_running(struct task_struct *tsk)
 {
 }
+static inline void set_userfault_flag_for_ioworker(void)
+{
+}
+static inline void clear_userfault_flag_for_ioworker(void)
+{
+}
 #endif
 
 static inline bool io_wq_current_is_worker(void)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 2/2] userfaultfd: Set the corresponding flag in IOU worker context
  2025-04-22 16:29 [PATCH v2 0/2] Fix 100% CPU usage issue in IOU worker threads Zhiwei Jiang
  2025-04-22 16:29 ` [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios Zhiwei Jiang
@ 2025-04-22 16:29 ` Zhiwei Jiang
  1 sibling, 0 replies; 25+ messages in thread
From: Zhiwei Jiang @ 2025-04-22 16:29 UTC (permalink / raw)
  To: viro
  Cc: brauner, jack, akpm, peterx, axboe, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring, Zhiwei Jiang

Set the flag to avoid a premature return from schedule() in IOU worker
threads, ensuring the worker sleeps and waits to be woken up as in the
normal case.

Signed-off-by: Zhiwei Jiang <qq282012236@gmail.com>
---
 fs/userfaultfd.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index d80f94346199..972eb10925a9 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -32,6 +32,7 @@
 #include <linux/swapops.h>
 #include <linux/miscdevice.h>
 #include <linux/uio.h>
+#include "../io_uring/io-wq.h"
 
 static int sysctl_unprivileged_userfaultfd __read_mostly;
 
@@ -370,6 +371,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	bool must_wait;
 	unsigned int blocking_state;
 
+	set_userfault_flag_for_ioworker();
 	/*
 	 * We don't do userfault handling for the final child pid update
 	 * and when coredumping (faults triggered by get_dump_page()).
@@ -506,6 +508,8 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	__set_current_state(TASK_RUNNING);
 
+	clear_userfault_flag_for_ioworker();
+
 	/*
 	 * Here we race with the list_del; list_add in
 	 * userfaultfd_ctx_read(), however because we don't ever run
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-22 16:29 ` [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios Zhiwei Jiang
@ 2025-04-22 16:32   ` Jens Axboe
  2025-04-22 17:04     ` 姜智伟
  0 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2025-04-22 16:32 UTC (permalink / raw)
  To: Zhiwei Jiang, viro
  Cc: brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
> index d4fb2940e435..8567a9c819db 100644
> --- a/io_uring/io-wq.h
> +++ b/io_uring/io-wq.h
> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
>  					void *data, bool cancel_all);
>  
>  #if defined(CONFIG_IO_WQ)
> -extern void io_wq_worker_sleeping(struct task_struct *);
> -extern void io_wq_worker_running(struct task_struct *);
> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
> +extern void io_wq_worker_running(struct task_struct *tsk);
> +extern void set_userfault_flag_for_ioworker(void);
> +extern void clear_userfault_flag_for_ioworker(void);
>  #else
>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>  {
> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>  static inline void io_wq_worker_running(struct task_struct *tsk)
>  {
>  }
> +static inline void set_userfault_flag_for_ioworker(void)
> +{
> +}
> +static inline void clear_userfault_flag_for_ioworker(void)
> +{
> +}
>  #endif
>  
>  static inline bool io_wq_current_is_worker(void)

This should go in include/linux/io_uring.h and then userfaultfd would
not have to include io_uring private headers.

But that's beside the point, like I said we still need to get to the
bottom of what is going on here first, rather than try and paper around
it. So please don't post more versions of this before we have that
understanding.

See previous emails on 6.8 and other kernel versions.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-22 16:32   ` Jens Axboe
@ 2025-04-22 17:04     ` 姜智伟
  2025-04-22 17:33       ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: 姜智伟 @ 2025-04-22 17:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

On Wed, Apr 23, 2025 at 12:32 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
> > diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
> > index d4fb2940e435..8567a9c819db 100644
> > --- a/io_uring/io-wq.h
> > +++ b/io_uring/io-wq.h
> > @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
> >                                       void *data, bool cancel_all);
> >
> >  #if defined(CONFIG_IO_WQ)
> > -extern void io_wq_worker_sleeping(struct task_struct *);
> > -extern void io_wq_worker_running(struct task_struct *);
> > +extern void io_wq_worker_sleeping(struct task_struct *tsk);
> > +extern void io_wq_worker_running(struct task_struct *tsk);
> > +extern void set_userfault_flag_for_ioworker(void);
> > +extern void clear_userfault_flag_for_ioworker(void);
> >  #else
> >  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> >  {
> > @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> >  static inline void io_wq_worker_running(struct task_struct *tsk)
> >  {
> >  }
> > +static inline void set_userfault_flag_for_ioworker(void)
> > +{
> > +}
> > +static inline void clear_userfault_flag_for_ioworker(void)
> > +{
> > +}
> >  #endif
> >
> >  static inline bool io_wq_current_is_worker(void)
>
> This should go in include/linux/io_uring.h and then userfaultfd would
> not have to include io_uring private headers.
>
> But that's beside the point, like I said we still need to get to the
> bottom of what is going on here first, rather than try and paper around
> it. So please don't post more versions of this before we have that
> understanding.
>
> See previous emails on 6.8 and other kernel versions.
>
> --
> Jens Axboe
The issue did not involve creating new worker threads. Instead, the
existing IOU worker kernel threads (about a dozen) associated with the VM
process were fully utilizing the CPU without writing any data, because of
a fault taken while reading user data pages in fault_in_iov_iter_readable()
when pulling user memory into kernel space.

This issue occurs, for example, during VM snapshot loading (which uses
userfaultfd for on-demand memory loading) while a task in the guest is
writing data to disk.

Normally, the VM first triggers a user fault to fill the page table, so
by the time the IOU worker thread runs, the page tables are already
populated and no fault occurs when faulting in memory pages in
fault_in_iov_iter_readable().

I suspect that during snapshot loading, a memory access in the VM
triggers an async page fault handled by a kernel thread while the IOU
worker's async kernel thread is also running; the problem may arise if
the IOU worker's thread is scheduled first. I'm going to bed now.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-22 17:04     ` 姜智伟
@ 2025-04-22 17:33       ` Jens Axboe
  2025-04-23  2:49         ` 姜智伟
  0 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2025-04-22 17:33 UTC (permalink / raw)
  To: 姜智伟
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

On 4/22/25 11:04 AM, 姜智伟 wrote:
> On Wed, Apr 23, 2025 at 12:32 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
>>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
>>> index d4fb2940e435..8567a9c819db 100644
>>> --- a/io_uring/io-wq.h
>>> +++ b/io_uring/io-wq.h
>>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
>>>                                       void *data, bool cancel_all);
>>>
>>>  #if defined(CONFIG_IO_WQ)
>>> -extern void io_wq_worker_sleeping(struct task_struct *);
>>> -extern void io_wq_worker_running(struct task_struct *);
>>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
>>> +extern void io_wq_worker_running(struct task_struct *tsk);
>>> +extern void set_userfault_flag_for_ioworker(void);
>>> +extern void clear_userfault_flag_for_ioworker(void);
>>>  #else
>>>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>  {
>>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>  static inline void io_wq_worker_running(struct task_struct *tsk)
>>>  {
>>>  }
>>> +static inline void set_userfault_flag_for_ioworker(void)
>>> +{
>>> +}
>>> +static inline void clear_userfault_flag_for_ioworker(void)
>>> +{
>>> +}
>>>  #endif
>>>
>>>  static inline bool io_wq_current_is_worker(void)
>>
>> This should go in include/linux/io_uring.h and then userfaultfd would
>> not have to include io_uring private headers.
>>
>> But that's beside the point, like I said we still need to get to the
>> bottom of what is going on here first, rather than try and paper around
>> it. So please don't post more versions of this before we have that
>> understanding.
>>
>> See previous emails on 6.8 and other kernel versions.
>>
>> --
>> Jens Axboe
> The issue did not involve creating new worker processes. Instead, the
> existing IOU worker kernel threads (about a dozen) associated with the VM
> process were fully utilizing CPU without writing data, caused by a fault
> while reading user data pages in the fault_in_iov_iter_readable function
> when pulling user memory into kernel space.

OK that makes more sense, I can certainly reproduce a loop in this path:

iou-wrk-726     729    36.910071:       9737 cycles:P: 
        ffff800080456c44 handle_userfault+0x47c
        ffff800080381fc0 hugetlb_fault+0xb68
        ffff80008031fee4 handle_mm_fault+0x2fc
        ffff8000812ada6c do_page_fault+0x1e4
        ffff8000812ae024 do_translation_fault+0x9c
        ffff800080049a9c do_mem_abort+0x44
        ffff80008129bd78 el1_abort+0x38
        ffff80008129ceb4 el1h_64_sync_handler+0xd4
        ffff8000800112b4 el1h_64_sync+0x6c
        ffff80008030984c fault_in_readable+0x74
        ffff800080476f3c iomap_file_buffered_write+0x14c
        ffff8000809b1230 blkdev_write_iter+0x1a8
        ffff800080a1f378 io_write+0x188
        ffff800080a14f30 io_issue_sqe+0x68
        ffff800080a155d0 io_wq_submit_work+0xa8
        ffff800080a32afc io_worker_handle_work+0x1f4
        ffff800080a332b8 io_wq_worker+0x110
        ffff80008002dd38 ret_from_fork+0x10

which seems to be expected, we'd continually try and fault in the
ranges, if the userfaultfd handler isn't filling them.

I guess this is where I'm still confused, because I don't see how this
is different from if you have a normal write(2) syscall doing the same
thing - you'd get the same looping.

??

> This issue occurs like during VM snapshot loading (which uses
> userfaultfd for on-demand memory loading), while the task in the guest is
> writing data to disk.
> 
> Normally, the VM first triggers a user fault to fill the page table.
> So in the IOU worker thread, the page tables are already filled,
> fault no chance happens when faulting in memory pages
> in fault_in_iov_iter_readable.
> 
> I suspect that during snapshot loading, a memory access in the
> VM triggers an async page fault handled by the kernel thread,
> while the IOU worker's async kernel thread is also running.
> Maybe If the IOU worker's thread is scheduled first.
> I'm going to bed now.

Ah ok, so what you're saying is that because we end up not sleeping
(because a signal is pending, it seems), the fault will never get
filled and hence no progress is made? And the signal is pending because
someone tried to create a new worker, and this work is not getting
processed.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-22 17:33       ` Jens Axboe
@ 2025-04-23  2:49         ` 姜智伟
  2025-04-23  3:11           ` 姜智伟
  2025-04-23 13:34           ` Jens Axboe
  0 siblings, 2 replies; 25+ messages in thread
From: 姜智伟 @ 2025-04-23  2:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

On Wed, Apr 23, 2025 at 1:33 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/22/25 11:04 AM, 姜智伟 wrote:
> > On Wed, Apr 23, 2025 at 12:32 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
> >>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
> >>> index d4fb2940e435..8567a9c819db 100644
> >>> --- a/io_uring/io-wq.h
> >>> +++ b/io_uring/io-wq.h
> >>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
> >>>                                       void *data, bool cancel_all);
> >>>
> >>>  #if defined(CONFIG_IO_WQ)
> >>> -extern void io_wq_worker_sleeping(struct task_struct *);
> >>> -extern void io_wq_worker_running(struct task_struct *);
> >>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
> >>> +extern void io_wq_worker_running(struct task_struct *tsk);
> >>> +extern void set_userfault_flag_for_ioworker(void);
> >>> +extern void clear_userfault_flag_for_ioworker(void);
> >>>  #else
> >>>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> >>>  {
> >>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> >>>  static inline void io_wq_worker_running(struct task_struct *tsk)
> >>>  {
> >>>  }
> >>> +static inline void set_userfault_flag_for_ioworker(void)
> >>> +{
> >>> +}
> >>> +static inline void clear_userfault_flag_for_ioworker(void)
> >>> +{
> >>> +}
> >>>  #endif
> >>>
> >>>  static inline bool io_wq_current_is_worker(void)
> >>
> >> This should go in include/linux/io_uring.h and then userfaultfd would
> >> not have to include io_uring private headers.
> >>
> >> But that's beside the point, like I said we still need to get to the
> >> bottom of what is going on here first, rather than try and paper around
> >> it. So please don't post more versions of this before we have that
> >> understanding.
> >>
> >> See previous emails on 6.8 and other kernel versions.
> >>
> >> --
> >> Jens Axboe
> > The issue did not involve creating new worker processes. Instead, the
> > existing IOU worker kernel threads (about a dozen) associated with the VM
> > process were fully utilizing CPU without writing data, caused by a fault
> > while reading user data pages in the fault_in_iov_iter_readable function
> > when pulling user memory into kernel space.
>
> OK that makes more sense, I can certainly reproduce a loop in this path:
>
> iou-wrk-726     729    36.910071:       9737 cycles:P:
>         ffff800080456c44 handle_userfault+0x47c
>         ffff800080381fc0 hugetlb_fault+0xb68
>         ffff80008031fee4 handle_mm_fault+0x2fc
>         ffff8000812ada6c do_page_fault+0x1e4
>         ffff8000812ae024 do_translation_fault+0x9c
>         ffff800080049a9c do_mem_abort+0x44
>         ffff80008129bd78 el1_abort+0x38
>         ffff80008129ceb4 el1h_64_sync_handler+0xd4
>         ffff8000800112b4 el1h_64_sync+0x6c
>         ffff80008030984c fault_in_readable+0x74
>         ffff800080476f3c iomap_file_buffered_write+0x14c
>         ffff8000809b1230 blkdev_write_iter+0x1a8
>         ffff800080a1f378 io_write+0x188
>         ffff800080a14f30 io_issue_sqe+0x68
>         ffff800080a155d0 io_wq_submit_work+0xa8
>         ffff800080a32afc io_worker_handle_work+0x1f4
>         ffff800080a332b8 io_wq_worker+0x110
>         ffff80008002dd38 ret_from_fork+0x10
>
> which seems to be expected, we'd continually try and fault in the
> ranges, if the userfaultfd handler isn't filling them.
>
> I guess this is where I'm still confused, because I don't see how this
> is different from if you have a normal write(2) syscall doing the same
> thing - you'd get the same looping.
>
> ??
>
> > This issue occurs like during VM snapshot loading (which uses
> > userfaultfd for on-demand memory loading), while the task in the guest is
> > writing data to disk.
> >
> > Normally, the VM first triggers a user fault to fill the page table.
> > So in the IOU worker thread, the page tables are already filled,
> > fault no chance happens when faulting in memory pages
> > in fault_in_iov_iter_readable.
> >
> > I suspect that during snapshot loading, a memory access in the
> > VM triggers an async page fault handled by the kernel thread,
> > while the IOU worker's async kernel thread is also running.
> > Maybe If the IOU worker's thread is scheduled first.
> > I'm going to bed now.
>
> Ah ok, so what you're saying is that because we end up not sleeping
> (because a signal is pending, it seems), then the fault will never get
> filled and hence progress not made? And the signal is pending because
> someone tried to create a net worker, and this work is not getting
> processed.
>
> --
> Jens Axboe
        handle_userfault() {
          hugetlb_vma_lock_read();
          _raw_spin_lock_irq() {
            __pv_queued_spin_lock_slowpath();
          }
          vma_mmu_pagesize() {
            hugetlb_vm_op_pagesize();
          }
          huge_pte_offset();
          hugetlb_vma_unlock_read();
          up_read();
          __wake_up() {
            _raw_spin_lock_irqsave() {
              __pv_queued_spin_lock_slowpath();
            }
            __wake_up_common();
            _raw_spin_unlock_irqrestore();
          }
          schedule() {
            io_wq_worker_sleeping() {
              io_wq_dec_running();
            }
            rcu_note_context_switch();
            raw_spin_rq_lock_nested() {
              _raw_spin_lock();
            }
            update_rq_clock();
            pick_next_task() {
              pick_next_task_fair() {
                update_curr() {
                  update_curr_se();
                  __calc_delta.constprop.0();
                  update_min_vruntime();
                }
                check_cfs_rq_runtime();
                pick_next_entity() {
                  pick_eevdf();
                }
                update_curr() {
                  update_curr_se();
                  __calc_delta.constprop.0();
                  update_min_vruntime();
                }
                check_cfs_rq_runtime();
                pick_next_entity() {
                  pick_eevdf();
                }
                update_curr() {
                  update_curr_se();
                  update_min_vruntime();
                  cpuacct_charge();
                  __cgroup_account_cputime() {
                    cgroup_rstat_updated();
                  }
                }
                check_cfs_rq_runtime();
                pick_next_entity() {
                  pick_eevdf();
                }
              }
            }
            raw_spin_rq_unlock();
            io_wq_worker_running();
          }
          _raw_spin_lock_irq() {
            __pv_queued_spin_lock_slowpath();
          }
          userfaultfd_ctx_put();
        }
      }
The execution flow above is the one that kept faulting repeatedly in
the IOU worker while the issue was occurring. The entire fault path,
including the final userfault handling code you're seeing, is triggered
in an infinite loop. That's how I traced it and found that
io_wq_worker_running() returns early, causing the flow to differ from a
normal user fault, where the task would be sleeping.

However, your call stack appears to behave normally, which makes me
curious about what's different in the execution flow. Would you be able
to share your test case code so I can study it and try to reproduce the
behavior on my side?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-23  2:49         ` 姜智伟
@ 2025-04-23  3:11           ` 姜智伟
  2025-04-23  6:22             ` 姜智伟
  2025-04-23 13:34           ` Jens Axboe
  1 sibling, 1 reply; 25+ messages in thread
From: 姜智伟 @ 2025-04-23  3:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

Sorry, I may have misunderstood; I thought your test case was working
correctly. In io_wq_worker_running(), the function returns early when in
IO worker context, which differs from a normal process context. I hope
the graph above helps you understand.

On Wed, Apr 23, 2025 at 10:49 AM 姜智伟 <qq282012236@gmail.com> wrote:
>
> On Wed, Apr 23, 2025 at 1:33 AM Jens Axboe <axboe@kernel.dk> wrote:
> >
> > On 4/22/25 11:04 AM, 姜智伟 wrote:
> > > On Wed, Apr 23, 2025 at 12:32 AM Jens Axboe <axboe@kernel.dk> wrote:
> > >>
> > >> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
> > >>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
> > >>> index d4fb2940e435..8567a9c819db 100644
> > >>> --- a/io_uring/io-wq.h
> > >>> +++ b/io_uring/io-wq.h
> > >>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
> > >>>                                       void *data, bool cancel_all);
> > >>>
> > >>>  #if defined(CONFIG_IO_WQ)
> > >>> -extern void io_wq_worker_sleeping(struct task_struct *);
> > >>> -extern void io_wq_worker_running(struct task_struct *);
> > >>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
> > >>> +extern void io_wq_worker_running(struct task_struct *tsk);
> > >>> +extern void set_userfault_flag_for_ioworker(void);
> > >>> +extern void clear_userfault_flag_for_ioworker(void);
> > >>>  #else
> > >>>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> > >>>  {
> > >>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> > >>>  static inline void io_wq_worker_running(struct task_struct *tsk)
> > >>>  {
> > >>>  }
> > >>> +static inline void set_userfault_flag_for_ioworker(void)
> > >>> +{
> > >>> +}
> > >>> +static inline void clear_userfault_flag_for_ioworker(void)
> > >>> +{
> > >>> +}
> > >>>  #endif
> > >>>
> > >>>  static inline bool io_wq_current_is_worker(void)
> > >>
> > >> This should go in include/linux/io_uring.h and then userfaultfd would
> > >> not have to include io_uring private headers.
> > >>
> > >> But that's beside the point, like I said we still need to get to the
> > >> bottom of what is going on here first, rather than try and paper around
> > >> it. So please don't post more versions of this before we have that
> > >> understanding.
> > >>
> > >> See previous emails on 6.8 and other kernel versions.
> > >>
> > >> --
> > >> Jens Axboe
> > > The issue did not involve creating new worker processes. Instead, the
> > > existing IOU worker kernel threads (about a dozen) associated with the VM
> > > process were fully utilizing CPU without writing data, caused by a fault
> > > while reading user data pages in the fault_in_iov_iter_readable function
> > > when pulling user memory into kernel space.
> >
> > OK that makes more sense, I can certainly reproduce a loop in this path:
> >
> > iou-wrk-726     729    36.910071:       9737 cycles:P:
> >         ffff800080456c44 handle_userfault+0x47c
> >         ffff800080381fc0 hugetlb_fault+0xb68
> >         ffff80008031fee4 handle_mm_fault+0x2fc
> >         ffff8000812ada6c do_page_fault+0x1e4
> >         ffff8000812ae024 do_translation_fault+0x9c
> >         ffff800080049a9c do_mem_abort+0x44
> >         ffff80008129bd78 el1_abort+0x38
> >         ffff80008129ceb4 el1h_64_sync_handler+0xd4
> >         ffff8000800112b4 el1h_64_sync+0x6c
> >         ffff80008030984c fault_in_readable+0x74
> >         ffff800080476f3c iomap_file_buffered_write+0x14c
> >         ffff8000809b1230 blkdev_write_iter+0x1a8
> >         ffff800080a1f378 io_write+0x188
> >         ffff800080a14f30 io_issue_sqe+0x68
> >         ffff800080a155d0 io_wq_submit_work+0xa8
> >         ffff800080a32afc io_worker_handle_work+0x1f4
> >         ffff800080a332b8 io_wq_worker+0x110
> >         ffff80008002dd38 ret_from_fork+0x10
> >
> > which seems to be expected, we'd continually try and fault in the
> > ranges, if the userfaultfd handler isn't filling them.
> >
> > I guess this is where I'm still confused, because I don't see how this
> > is different from if you have a normal write(2) syscall doing the same
> > thing - you'd get the same looping.
> >
> > ??
> >
> > > This issue occurs like during VM snapshot loading (which uses
> > > userfaultfd for on-demand memory loading), while the task in the guest is
> > > writing data to disk.
> > >
> > > Normally, the VM first triggers a user fault to fill the page table.
> > > So in the IOU worker thread, the page tables are already filled,
> > > fault no chance happens when faulting in memory pages
> > > in fault_in_iov_iter_readable.
> > >
> > > I suspect that during snapshot loading, a memory access in the
> > > VM triggers an async page fault handled by the kernel thread,
> > > while the IOU worker's async kernel thread is also running.
> > > Maybe If the IOU worker's thread is scheduled first.
> > > I?m going to bed now.
> >
> > Ah ok, so what you're saying is that because we end up not sleeping
> > (because a signal is pending, it seems), then the fault will never get
> > filled and hence progress not made? And the signal is pending because
> > someone tried to create a net worker, and this work is not getting
> > processed.
> >
> > --
> > Jens Axboe
>         handle_userfault() {
>           hugetlb_vma_lock_read();
>           _raw_spin_lock_irq() {
>             __pv_queued_spin_lock_slowpath();
>           }
>           vma_mmu_pagesize() {
>             hugetlb_vm_op_pagesize();
>           }
>           huge_pte_offset();
>           hugetlb_vma_unlock_read();
>           up_read();
>           __wake_up() {
>             _raw_spin_lock_irqsave() {
>               __pv_queued_spin_lock_slowpath();
>             }
>             __wake_up_common();
>             _raw_spin_unlock_irqrestore();
>           }
>           schedule() {
>             io_wq_worker_sleeping() {
>               io_wq_dec_running();
>             }
>             rcu_note_context_switch();
>             raw_spin_rq_lock_nested() {
>               _raw_spin_lock();
>             }
>             update_rq_clock();
>             pick_next_task() {
>               pick_next_task_fair() {
>                 update_curr() {
>                   update_curr_se();
>                   __calc_delta.constprop.0();
>                   update_min_vruntime();
>                 }
>                 check_cfs_rq_runtime();
>                 pick_next_entity() {
>                   pick_eevdf();
>                 }
>                 update_curr() {
>                   update_curr_se();
>                   __calc_delta.constprop.0();
>                   update_min_vruntime();
>                 }
>                 check_cfs_rq_runtime();
>                 pick_next_entity() {
>                   pick_eevdf();
>                 }
>                 update_curr() {
>                   update_curr_se();
>                   update_min_vruntime();
>                   cpuacct_charge();
>                   __cgroup_account_cputime() {
>                     cgroup_rstat_updated();
>                   }
>                 }
>                 check_cfs_rq_runtime();
>                 pick_next_entity() {
>                   pick_eevdf();
>                 }
>               }
>             }
>             raw_spin_rq_unlock();
>             io_wq_worker_running();
>           }
>           _raw_spin_lock_irq() {
>             __pv_queued_spin_lock_slowpath();
>           }
>           userfaultfd_ctx_put();
>         }
>       }
> The execution flow above is the one that kept faulting
> repeatedly in the IOU worker during the issue. The entire fault path,
> including this final userfault handling code you're seeing, would be
> triggered in an infinite loop. That's why I traced and found that the
> io_wq_worker_running() function returns early, causing the flow to
> differ from a normal user fault, where it should be sleeping.
>
> However, your call stack appears to behave normally,
> which makes me curious about what's different about execution flow.
> Would you be able to share your test case code so I can study it
> and try to reproduce the behavior on my side?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-23  3:11           ` 姜智伟
@ 2025-04-23  6:22             ` 姜智伟
  0 siblings, 0 replies; 25+ messages in thread
From: 姜智伟 @ 2025-04-23  6:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

On Wed, Apr 23, 2025 at 11:11 AM 姜智伟 <qq282012236@gmail.com> wrote:
>
> Sorry, I may have misunderstood. I thought your test case
> was working correctly. In io_wq_worker_running() it will return
> if in io worker context, that is different from common progress
> context.I hope the graph above can help you understand.
>
> On Wed, Apr 23, 2025 at 10:49 AM 姜智伟 <qq282012236@gmail.com> wrote:
> >
> > On Wed, Apr 23, 2025 at 1:33 AM Jens Axboe <axboe@kernel.dk> wrote:
> > >
> > > On 4/22/25 11:04 AM, ??? wrote:
> > > > On Wed, Apr 23, 2025 at 12:32?AM Jens Axboe <axboe@kernel.dk> wrote:
> > > >>
> > > >> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
> > > >>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
> > > >>> index d4fb2940e435..8567a9c819db 100644
> > > >>> --- a/io_uring/io-wq.h
> > > >>> +++ b/io_uring/io-wq.h
> > > >>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
> > > >>>                                       void *data, bool cancel_all);
> > > >>>
> > > >>>  #if defined(CONFIG_IO_WQ)
> > > >>> -extern void io_wq_worker_sleeping(struct task_struct *);
> > > >>> -extern void io_wq_worker_running(struct task_struct *);
> > > >>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
> > > >>> +extern void io_wq_worker_running(struct task_struct *tsk);
> > > >>> +extern void set_userfault_flag_for_ioworker(void);
> > > >>> +extern void clear_userfault_flag_for_ioworker(void);
> > > >>>  #else
> > > >>>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> > > >>>  {
> > > >>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> > > >>>  static inline void io_wq_worker_running(struct task_struct *tsk)
> > > >>>  {
> > > >>>  }
> > > >>> +static inline void set_userfault_flag_for_ioworker(void)
> > > >>> +{
> > > >>> +}
> > > >>> +static inline void clear_userfault_flag_for_ioworker(void)
> > > >>> +{
> > > >>> +}
> > > >>>  #endif
> > > >>>
> > > >>>  static inline bool io_wq_current_is_worker(void)
> > > >>
> > > >> This should go in include/linux/io_uring.h and then userfaultfd would
> > > >> not have to include io_uring private headers.
> > > >>
> > > >> But that's beside the point, like I said we still need to get to the
> > > >> bottom of what is going on here first, rather than try and paper around
> > > >> it. So please don't post more versions of this before we have that
> > > >> understanding.
> > > >>
> > > >> See previous emails on 6.8 and other kernel versions.
> > > >>
> > > >> --
> > > >> Jens Axboe
> > > > The issue did not involve creating new worker processes. Instead, the
> > > > existing IOU worker kernel threads (about a dozen) associated with the VM
> > > > process were fully utilizing CPU without writing data, caused by a fault
> > > > while reading user data pages in the fault_in_iov_iter_readable function
> > > > when pulling user memory into kernel space.
> > >
> > > OK that makes more sense, I can certainly reproduce a loop in this path:
> > >
> > > iou-wrk-726     729    36.910071:       9737 cycles:P:
> > >         ffff800080456c44 handle_userfault+0x47c
> > >         ffff800080381fc0 hugetlb_fault+0xb68
> > >         ffff80008031fee4 handle_mm_fault+0x2fc
> > >         ffff8000812ada6c do_page_fault+0x1e4
> > >         ffff8000812ae024 do_translation_fault+0x9c
> > >         ffff800080049a9c do_mem_abort+0x44
> > >         ffff80008129bd78 el1_abort+0x38
> > >         ffff80008129ceb4 el1h_64_sync_handler+0xd4
> > >         ffff8000800112b4 el1h_64_sync+0x6c
> > >         ffff80008030984c fault_in_readable+0x74
> > >         ffff800080476f3c iomap_file_buffered_write+0x14c
> > >         ffff8000809b1230 blkdev_write_iter+0x1a8
> > >         ffff800080a1f378 io_write+0x188
> > >         ffff800080a14f30 io_issue_sqe+0x68
> > >         ffff800080a155d0 io_wq_submit_work+0xa8
> > >         ffff800080a32afc io_worker_handle_work+0x1f4
> > >         ffff800080a332b8 io_wq_worker+0x110
> > >         ffff80008002dd38 ret_from_fork+0x10
> > >
> > > which seems to be expected, we'd continually try and fault in the
> > > ranges, if the userfaultfd handler isn't filling them.
> > >
> > > I guess this is where I'm still confused, because I don't see how this
> > > is different from if you have a normal write(2) syscall doing the same
> > > thing - you'd get the same looping.
> > >
> > > ??
> > >
> > > > This issue occurs like during VM snapshot loading (which uses
> > > > userfaultfd for on-demand memory loading), while the task in the guest is
> > > > writing data to disk.
> > > >
> > > > Normally, the VM first triggers a user fault to fill the page table.
> > > > So in the IOU worker thread, the page tables are already filled,
> > > > fault no chance happens when faulting in memory pages
> > > > in fault_in_iov_iter_readable.
> > > >
> > > > I suspect that during snapshot loading, a memory access in the
> > > > VM triggers an async page fault handled by the kernel thread,
> > > > while the IOU worker's async kernel thread is also running.
> > > > Maybe If the IOU worker's thread is scheduled first.
> > > > I?m going to bed now.
> > >
> > > Ah ok, so what you're saying is that because we end up not sleeping
> > > (because a signal is pending, it seems), then the fault will never get
> > > filled and hence progress not made? And the signal is pending because
> > > someone tried to create a net worker, and this work is not getting
> > > processed.
> > >
> > > --
> > > Jens Axboe
> >         handle_userfault() {
> >           hugetlb_vma_lock_read();
> >           _raw_spin_lock_irq() {
> >             __pv_queued_spin_lock_slowpath();
> >           }
> >           vma_mmu_pagesize() {
> >             hugetlb_vm_op_pagesize();
> >           }
> >           huge_pte_offset();
> >           hugetlb_vma_unlock_read();
> >           up_read();
> >           __wake_up() {
> >             _raw_spin_lock_irqsave() {
> >               __pv_queued_spin_lock_slowpath();
> >             }
> >             __wake_up_common();
> >             _raw_spin_unlock_irqrestore();
> >           }
> >           schedule() {
> >             io_wq_worker_sleeping() {
> >               io_wq_dec_running();
> >             }
> >             rcu_note_context_switch();
> >             raw_spin_rq_lock_nested() {
> >               _raw_spin_lock();
> >             }
> >             update_rq_clock();
> >             pick_next_task() {
> >               pick_next_task_fair() {
> >                 update_curr() {
> >                   update_curr_se();
> >                   __calc_delta.constprop.0();
> >                   update_min_vruntime();
> >                 }
> >                 check_cfs_rq_runtime();
> >                 pick_next_entity() {
> >                   pick_eevdf();
> >                 }
> >                 update_curr() {
> >                   update_curr_se();
> >                   __calc_delta.constprop.0();
> >                   update_min_vruntime();
> >                 }
> >                 check_cfs_rq_runtime();
> >                 pick_next_entity() {
> >                   pick_eevdf();
> >                 }
> >                 update_curr() {
> >                   update_curr_se();
> >                   update_min_vruntime();
> >                   cpuacct_charge();
> >                   __cgroup_account_cputime() {
> >                     cgroup_rstat_updated();
> >                   }
> >                 }
> >                 check_cfs_rq_runtime();
> >                 pick_next_entity() {
> >                   pick_eevdf();
> >                 }
> >               }
> >             }
> >             raw_spin_rq_unlock();
> >             io_wq_worker_running();
> >           }
> >           _raw_spin_lock_irq() {
> >             __pv_queued_spin_lock_slowpath();
> >           }
> >           userfaultfd_ctx_put();
> >         }
> >       }
> > The execution flow above is the one that kept faulting
> > repeatedly in the IOU worker during the issue. The entire fault path,
> > including this final userfault handling code you're seeing, would be
> > triggered in an infinite loop. That's why I traced and found that the
> > io_wq_worker_running() function returns early, causing the flow to
> > differ from a normal user fault, where it should be sleeping.
> >
> > However, your call stack appears to behave normally,
> > which makes me curious about what's different about execution flow.
> > Would you be able to share your test case code so I can study it
> > and try to reproduce the behavior on my side?
Sorry, I may have misunderstood; I thought your test case
was working correctly. io_wq_worker_running() returns early
when called in io worker context, which is different from an
ordinary process context. I hope the graph above helps clarify the flow.

Also, regarding your initial suggestion to move the function into
include/linux/io_uring.h: I'm not sure that's the best fit. The
problematic context (io_wq_worker) and the function needing changes
(io_wq_worker_running) are both heavily tied to the internals of io-wq.
I'm wondering whether doing something like #include "../../io_uring/io-wq.h",
as in kernel/sched/core.c:96, might actually be a better choice here?

And I'd still really appreciate it if you could share your test case code;
it would help a lot. Thanks!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-23  2:49         ` 姜智伟
  2025-04-23  3:11           ` 姜智伟
@ 2025-04-23 13:34           ` Jens Axboe
  2025-04-23 14:29             ` 姜智伟
  2025-04-23 15:55             ` Jens Axboe
  1 sibling, 2 replies; 25+ messages in thread
From: Jens Axboe @ 2025-04-23 13:34 UTC (permalink / raw)
  To: 姜智伟
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

On 4/22/25 8:49 PM, 姜智伟 wrote:
> On Wed, Apr 23, 2025 at 1:33?AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 4/22/25 11:04 AM, ??? wrote:
>>> On Wed, Apr 23, 2025 at 12:32?AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
>>>>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
>>>>> index d4fb2940e435..8567a9c819db 100644
>>>>> --- a/io_uring/io-wq.h
>>>>> +++ b/io_uring/io-wq.h
>>>>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
>>>>>                                       void *data, bool cancel_all);
>>>>>
>>>>>  #if defined(CONFIG_IO_WQ)
>>>>> -extern void io_wq_worker_sleeping(struct task_struct *);
>>>>> -extern void io_wq_worker_running(struct task_struct *);
>>>>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
>>>>> +extern void io_wq_worker_running(struct task_struct *tsk);
>>>>> +extern void set_userfault_flag_for_ioworker(void);
>>>>> +extern void clear_userfault_flag_for_ioworker(void);
>>>>>  #else
>>>>>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>>>  {
>>>>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>>>  static inline void io_wq_worker_running(struct task_struct *tsk)
>>>>>  {
>>>>>  }
>>>>> +static inline void set_userfault_flag_for_ioworker(void)
>>>>> +{
>>>>> +}
>>>>> +static inline void clear_userfault_flag_for_ioworker(void)
>>>>> +{
>>>>> +}
>>>>>  #endif
>>>>>
>>>>>  static inline bool io_wq_current_is_worker(void)
>>>>
>>>> This should go in include/linux/io_uring.h and then userfaultfd would
>>>> not have to include io_uring private headers.
>>>>
>>>> But that's beside the point, like I said we still need to get to the
>>>> bottom of what is going on here first, rather than try and paper around
>>>> it. So please don't post more versions of this before we have that
>>>> understanding.
>>>>
>>>> See previous emails on 6.8 and other kernel versions.
>>>>
>>>> --
>>>> Jens Axboe
>>> The issue did not involve creating new worker processes. Instead, the
>>> existing IOU worker kernel threads (about a dozen) associated with the VM
>>> process were fully utilizing CPU without writing data, caused by a fault
>>> while reading user data pages in the fault_in_iov_iter_readable function
>>> when pulling user memory into kernel space.
>>
>> OK that makes more sense, I can certainly reproduce a loop in this path:
>>
>> iou-wrk-726     729    36.910071:       9737 cycles:P:
>>         ffff800080456c44 handle_userfault+0x47c
>>         ffff800080381fc0 hugetlb_fault+0xb68
>>         ffff80008031fee4 handle_mm_fault+0x2fc
>>         ffff8000812ada6c do_page_fault+0x1e4
>>         ffff8000812ae024 do_translation_fault+0x9c
>>         ffff800080049a9c do_mem_abort+0x44
>>         ffff80008129bd78 el1_abort+0x38
>>         ffff80008129ceb4 el1h_64_sync_handler+0xd4
>>         ffff8000800112b4 el1h_64_sync+0x6c
>>         ffff80008030984c fault_in_readable+0x74
>>         ffff800080476f3c iomap_file_buffered_write+0x14c
>>         ffff8000809b1230 blkdev_write_iter+0x1a8
>>         ffff800080a1f378 io_write+0x188
>>         ffff800080a14f30 io_issue_sqe+0x68
>>         ffff800080a155d0 io_wq_submit_work+0xa8
>>         ffff800080a32afc io_worker_handle_work+0x1f4
>>         ffff800080a332b8 io_wq_worker+0x110
>>         ffff80008002dd38 ret_from_fork+0x10
>>
>> which seems to be expected, we'd continually try and fault in the
>> ranges, if the userfaultfd handler isn't filling them.
>>
>> I guess this is where I'm still confused, because I don't see how this
>> is different from if you have a normal write(2) syscall doing the same
>> thing - you'd get the same looping.
>>
>> ??
>>
>>> This issue occurs like during VM snapshot loading (which uses
>>> userfaultfd for on-demand memory loading), while the task in the guest is
>>> writing data to disk.
>>>
>>> Normally, the VM first triggers a user fault to fill the page table.
>>> So in the IOU worker thread, the page tables are already filled,
>>> fault no chance happens when faulting in memory pages
>>> in fault_in_iov_iter_readable.
>>>
>>> I suspect that during snapshot loading, a memory access in the
>>> VM triggers an async page fault handled by the kernel thread,
>>> while the IOU worker's async kernel thread is also running.
>>> Maybe If the IOU worker's thread is scheduled first.
>>> I?m going to bed now.
>>
>> Ah ok, so what you're saying is that because we end up not sleeping
>> (because a signal is pending, it seems), then the fault will never get
>> filled and hence progress not made? And the signal is pending because
>> someone tried to create a net worker, and this work is not getting
>> processed.
>>
>> --
>> Jens Axboe
>         handle_userfault() {
>           hugetlb_vma_lock_read();
>           _raw_spin_lock_irq() {
>             __pv_queued_spin_lock_slowpath();
>           }
>           vma_mmu_pagesize() {
>             hugetlb_vm_op_pagesize();
>           }
>           huge_pte_offset();
>           hugetlb_vma_unlock_read();
>           up_read();
>           __wake_up() {
>             _raw_spin_lock_irqsave() {
>               __pv_queued_spin_lock_slowpath();
>             }
>             __wake_up_common();
>             _raw_spin_unlock_irqrestore();
>           }
>           schedule() {
>             io_wq_worker_sleeping() {
>               io_wq_dec_running();
>             }
>             rcu_note_context_switch();
>             raw_spin_rq_lock_nested() {
>               _raw_spin_lock();
>             }
>             update_rq_clock();
>             pick_next_task() {
>               pick_next_task_fair() {
>                 update_curr() {
>                   update_curr_se();
>                   __calc_delta.constprop.0();
>                   update_min_vruntime();
>                 }
>                 check_cfs_rq_runtime();
>                 pick_next_entity() {
>                   pick_eevdf();
>                 }
>                 update_curr() {
>                   update_curr_se();
>                   __calc_delta.constprop.0();
>                   update_min_vruntime();
>                 }
>                 check_cfs_rq_runtime();
>                 pick_next_entity() {
>                   pick_eevdf();
>                 }
>                 update_curr() {
>                   update_curr_se();
>                   update_min_vruntime();
>                   cpuacct_charge();
>                   __cgroup_account_cputime() {
>                     cgroup_rstat_updated();
>                   }
>                 }
>                 check_cfs_rq_runtime();
>                 pick_next_entity() {
>                   pick_eevdf();
>                 }
>               }
>             }
>             raw_spin_rq_unlock();
>             io_wq_worker_running();
>           }
>           _raw_spin_lock_irq() {
>             __pv_queued_spin_lock_slowpath();
>           }
>           userfaultfd_ctx_put();
>         }
>       }
> The execution flow above is the one that kept faulting
> repeatedly in the IOU worker during the issue. The entire fault path,
> including this final userfault handling code you're seeing, would be
> triggered in an infinite loop. That's why I traced and found that the
> io_wq_worker_running() function returns early, causing the flow to
> differ from a normal user fault, where it should be sleeping.

io_wq_worker_running() is called when the task is scheduled back in.
There's no "returning early" here, it simply updates the accounting.
Which is part of why your patch makes very little sense to me, we
would've called both io_wq_worker_sleeping() and _running() from the
userfaultfd path. The latter doesn't really do much; it simply
increments the running worker count if the worker was previously marked
as sleeping.

And I strongly suspect that the latter is the issue, not the marking of
running. The above loop is fine if we do go to sleep in schedule.
However, if there's task_work (either TWA_SIGNAL or TWA_NOTIFY_SIGNAL
based) pending, then schedule() will be a no-op and we're going to
repeatedly go through that loop. This is because the expectation here is
that the loop will be aborted if either of those is true, so that
task_work can get run (or a signal handled, whatever), and then the
operation retried.

> However, your call stack appears to behave normally,
> which makes me curious about what's different about execution flow.
> Would you be able to share your test case code so I can study it
> and try to reproduce the behavior on my side?

It behaves normally for the initial attempt - we end up sleeping in
schedule(). However, then a new worker gets created, or the ring
shutdown, in which case schedule() ends up being a no-op because
TWA_NOTIFY_SIGNAL is set, and then we just sit there in a loop running
the same code again and again to no avail. So I do think my test case
and your issue are the same; I just reproduce it by calling
io_uring_queue_exit(), but the exact same thing would happen if worker
creation is attempted while an io-wq worker is blocked in
handle_userfault().

This is why I want to fully understand the issue rather than paper
around it, as I don't think the fix is correct as-is. We really want to
abort the loop and allow the task to handle whatever signaling is
currently preventing proper sleeps.

I'll dabble a bit more and send out the test case too, in case it'll
help on your end.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-23 13:34           ` Jens Axboe
@ 2025-04-23 14:29             ` 姜智伟
  2025-04-23 15:10               ` Jens Axboe
  2025-04-23 15:55             ` Jens Axboe
  1 sibling, 1 reply; 25+ messages in thread
From: 姜智伟 @ 2025-04-23 14:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

Jens Axboe <axboe@kernel.dk> 于2025年4月23日周三 21:34写道:
>
> On 4/22/25 8:49 PM, ??? wrote:
> > On Wed, Apr 23, 2025 at 1:33?AM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 4/22/25 11:04 AM, ??? wrote:
> >>> On Wed, Apr 23, 2025 at 12:32?AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>
> >>>> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
> >>>>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
> >>>>> index d4fb2940e435..8567a9c819db 100644
> >>>>> --- a/io_uring/io-wq.h
> >>>>> +++ b/io_uring/io-wq.h
> >>>>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
> >>>>>                                       void *data, bool cancel_all);
> >>>>>
> >>>>>  #if defined(CONFIG_IO_WQ)
> >>>>> -extern void io_wq_worker_sleeping(struct task_struct *);
> >>>>> -extern void io_wq_worker_running(struct task_struct *);
> >>>>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
> >>>>> +extern void io_wq_worker_running(struct task_struct *tsk);
> >>>>> +extern void set_userfault_flag_for_ioworker(void);
> >>>>> +extern void clear_userfault_flag_for_ioworker(void);
> >>>>>  #else
> >>>>>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> >>>>>  {
> >>>>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
> >>>>>  static inline void io_wq_worker_running(struct task_struct *tsk)
> >>>>>  {
> >>>>>  }
> >>>>> +static inline void set_userfault_flag_for_ioworker(void)
> >>>>> +{
> >>>>> +}
> >>>>> +static inline void clear_userfault_flag_for_ioworker(void)
> >>>>> +{
> >>>>> +}
> >>>>>  #endif
> >>>>>
> >>>>>  static inline bool io_wq_current_is_worker(void)
> >>>>
> >>>> This should go in include/linux/io_uring.h and then userfaultfd would
> >>>> not have to include io_uring private headers.
> >>>>
> >>>> But that's beside the point, like I said we still need to get to the
> >>>> bottom of what is going on here first, rather than try and paper around
> >>>> it. So please don't post more versions of this before we have that
> >>>> understanding.
> >>>>
> >>>> See previous emails on 6.8 and other kernel versions.
> >>>>
> >>>> --
> >>>> Jens Axboe
> >>> The issue did not involve creating new worker processes. Instead, the
> >>> existing IOU worker kernel threads (about a dozen) associated with the VM
> >>> process were fully utilizing CPU without writing data, caused by a fault
> >>> while reading user data pages in the fault_in_iov_iter_readable function
> >>> when pulling user memory into kernel space.
> >>
> >> OK that makes more sense, I can certainly reproduce a loop in this path:
> >>
> >> iou-wrk-726     729    36.910071:       9737 cycles:P:
> >>         ffff800080456c44 handle_userfault+0x47c
> >>         ffff800080381fc0 hugetlb_fault+0xb68
> >>         ffff80008031fee4 handle_mm_fault+0x2fc
> >>         ffff8000812ada6c do_page_fault+0x1e4
> >>         ffff8000812ae024 do_translation_fault+0x9c
> >>         ffff800080049a9c do_mem_abort+0x44
> >>         ffff80008129bd78 el1_abort+0x38
> >>         ffff80008129ceb4 el1h_64_sync_handler+0xd4
> >>         ffff8000800112b4 el1h_64_sync+0x6c
> >>         ffff80008030984c fault_in_readable+0x74
> >>         ffff800080476f3c iomap_file_buffered_write+0x14c
> >>         ffff8000809b1230 blkdev_write_iter+0x1a8
> >>         ffff800080a1f378 io_write+0x188
> >>         ffff800080a14f30 io_issue_sqe+0x68
> >>         ffff800080a155d0 io_wq_submit_work+0xa8
> >>         ffff800080a32afc io_worker_handle_work+0x1f4
> >>         ffff800080a332b8 io_wq_worker+0x110
> >>         ffff80008002dd38 ret_from_fork+0x10
> >>
> >> which seems to be expected, we'd continually try and fault in the
> >> ranges, if the userfaultfd handler isn't filling them.
> >>
> >> I guess this is where I'm still confused, because I don't see how this
> >> is different from if you have a normal write(2) syscall doing the same
> >> thing - you'd get the same looping.
> >>
> >> ??
> >>
> >>> This issue occurs like during VM snapshot loading (which uses
> >>> userfaultfd for on-demand memory loading), while the task in the guest is
> >>> writing data to disk.
> >>>
> >>> Normally, the VM first triggers a user fault to fill the page table.
> >>> So in the IOU worker thread, the page tables are already filled,
> >>> fault no chance happens when faulting in memory pages
> >>> in fault_in_iov_iter_readable.
> >>>
> >>> I suspect that during snapshot loading, a memory access in the
> >>> VM triggers an async page fault handled by the kernel thread,
> >>> while the IOU worker's async kernel thread is also running.
> >>> Maybe If the IOU worker's thread is scheduled first.
> >>> I?m going to bed now.
> >>
> >> Ah ok, so what you're saying is that because we end up not sleeping
> >> (because a signal is pending, it seems), then the fault will never get
> >> filled and hence progress not made? And the signal is pending because
> >> someone tried to create a new worker, and this work is not getting
> >> processed.
> >>
> >> --
> >> Jens Axboe
> >         handle_userfault() {
> >           hugetlb_vma_lock_read();
> >           _raw_spin_lock_irq() {
> >             __pv_queued_spin_lock_slowpath();
> >           }
> >           vma_mmu_pagesize() {
> >             hugetlb_vm_op_pagesize();
> >           }
> >           huge_pte_offset();
> >           hugetlb_vma_unlock_read();
> >           up_read();
> >           __wake_up() {
> >             _raw_spin_lock_irqsave() {
> >               __pv_queued_spin_lock_slowpath();
> >             }
> >             __wake_up_common();
> >             _raw_spin_unlock_irqrestore();
> >           }
> >           schedule() {
> >             io_wq_worker_sleeping() {
> >               io_wq_dec_running();
> >             }
> >             rcu_note_context_switch();
> >             raw_spin_rq_lock_nested() {
> >               _raw_spin_lock();
> >             }
> >             update_rq_clock();
> >             pick_next_task() {
> >               pick_next_task_fair() {
> >                 update_curr() {
> >                   update_curr_se();
> >                   __calc_delta.constprop.0();
> >                   update_min_vruntime();
> >                 }
> >                 check_cfs_rq_runtime();
> >                 pick_next_entity() {
> >                   pick_eevdf();
> >                 }
> >                 update_curr() {
> >                   update_curr_se();
> >                   __calc_delta.constprop.0();
> >                   update_min_vruntime();
> >                 }
> >                 check_cfs_rq_runtime();
> >                 pick_next_entity() {
> >                   pick_eevdf();
> >                 }
> >                 update_curr() {
> >                   update_curr_se();
> >                   update_min_vruntime();
> >                   cpuacct_charge();
> >                   __cgroup_account_cputime() {
> >                     cgroup_rstat_updated();
> >                   }
> >                 }
> >                 check_cfs_rq_runtime();
> >                 pick_next_entity() {
> >                   pick_eevdf();
> >                 }
> >               }
> >             }
> >             raw_spin_rq_unlock();
> >             io_wq_worker_running();
> >           }
> >           _raw_spin_lock_irq() {
> >             __pv_queued_spin_lock_slowpath();
> >           }
> >           userfaultfd_ctx_put();
> >         }
> >       }
> > The execution flow above is the one that kept faulting
> > repeatedly in the IOU worker during the issue. The entire fault path,
> > including this final userfault handling code you're seeing, would be
> > triggered in an infinite loop. That's why I traced and found that the
> > io_wq_worker_running() function returns early, causing the flow to
> > differ from a normal user fault, where it should be sleeping.
>
> io_wq_worker_running() is called when the task is scheduled back in.
> There's no "returning early" here, it simply updates the accounting.
> Which is part of why your patch makes very little sense to me, we
> would've called both io_wq_worker_sleeping() and _running() from the
> userfaultfd path. The latter doesn't really do much, it simply just
> increments the running worker count, if the worker was previously marked
> as sleeping.
>
> And I strongly suspect that the latter is the issue, not the marking of
> running. The above loop is fine if we do go to sleep in schedule.
> However, if there's task_work (either TWA_SIGNAL or TWA_NOTIFY_SIGNAL
> based) pending, then schedule() will be a no-op and we're going to
> repeatedly go through that loop. This is because the expectation here is
> that the loop will be aborted if either of those is true, so that
> task_work can get run (or a signal handled, whatever), and then the
> operation retried.
>
> > However, your call stack appears to behave normally,
> > which makes me curious about what's different about execution flow.
> > Would you be able to share your test case code so I can study it
> > and try to reproduce the behavior on my side?
>
> It behaves normally for the initial attempt - we end up sleeping in
> schedule(). However, then a new worker gets created, or the ring
> shutdown, in which case schedule() ends up being a no-op because
> TWA_NOTIFY_SIGNAL is set, and then we just sit there in a loop running
> the same code again and again to no avail. So I do think my test case
> and your issue is the same, I just reproduce it by calling
> io_uring_queue_exit(), but the exact same thing would happen if worker
> creation is attempted while an io-wq worker is blocked
> handle_userfault().
>
> This is why I want to fully understand the issue rather than paper
> around it, as I don't think the fix is correct as-is. We really want to
> abort the loop and allow the task to handle whatever signaling is
> currently preventing proper sleeps.
>
> I'll dabble a bit more and send out the test case too, in case it'll
> help on your end.
>
> --
> Jens Axboe
I’m really looking forward to your test case. Also, I’d like to emphasize one
more point: the handle_userfault graph path I sent you, including the schedule
function, is complete and unmodified. You can see that the schedule function
is very, very short. I understand your point about signal handling, but in this
very brief function graph, I haven’t yet seen any functions related to signal
handling. Additionally, there is no context switch here, nor is it the situation
where the thread is being scheduled back in. Perhaps the scenario you’ve
reproduced is still different from the one I’ve encountered in some subtle way?

void io_wq_worker_running(struct task_struct *tsk)
{
	struct io_worker *worker = tsk->worker_private;

	if (!worker)
		return;
	if (!test_bit(IO_WORKER_F_FAULT, &worker->flags)) {
		if (!test_bit(IO_WORKER_F_UP, &worker->flags))
			return;
		if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
			return;
		set_bit(IO_WORKER_F_RUNNING, &worker->flags);
		io_wq_inc_running(worker);
	}
}
However, from my observation during the crash live memory analysis,
when this happens in the IOU worker thread, the
IO_WORKER_F_RUNNING flag is set. This is what I meant by "early return",
rather than it being just a simple accounting function. I look forward
to your deeper analysis and any corrections you may have.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-23 14:29             ` 姜智伟
@ 2025-04-23 15:10               ` Jens Axboe
  2025-04-23 18:55                 ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2025-04-23 15:10 UTC (permalink / raw)
  To: 姜智伟
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

On 4/23/25 8:29 AM, 姜智伟 wrote:
> Jens Axboe <axboe@kernel.dk> 于2025年4月23日周三 21:34写道:
>>
>> On 4/22/25 8:49 PM, 姜智伟 wrote:
>>> On Wed, Apr 23, 2025 at 1:33 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 4/22/25 11:04 AM, 姜智伟 wrote:
>>>>> On Wed, Apr 23, 2025 at 12:32 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
>>>>>>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
>>>>>>> index d4fb2940e435..8567a9c819db 100644
>>>>>>> --- a/io_uring/io-wq.h
>>>>>>> +++ b/io_uring/io-wq.h
>>>>>>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
>>>>>>>                                       void *data, bool cancel_all);
>>>>>>>
>>>>>>>  #if defined(CONFIG_IO_WQ)
>>>>>>> -extern void io_wq_worker_sleeping(struct task_struct *);
>>>>>>> -extern void io_wq_worker_running(struct task_struct *);
>>>>>>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
>>>>>>> +extern void io_wq_worker_running(struct task_struct *tsk);
>>>>>>> +extern void set_userfault_flag_for_ioworker(void);
>>>>>>> +extern void clear_userfault_flag_for_ioworker(void);
>>>>>>>  #else
>>>>>>>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>>>>>  {
>>>>>>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>>>>>  static inline void io_wq_worker_running(struct task_struct *tsk)
>>>>>>>  {
>>>>>>>  }
>>>>>>> +static inline void set_userfault_flag_for_ioworker(void)
>>>>>>> +{
>>>>>>> +}
>>>>>>> +static inline void clear_userfault_flag_for_ioworker(void)
>>>>>>> +{
>>>>>>> +}
>>>>>>>  #endif
>>>>>>>
>>>>>>>  static inline bool io_wq_current_is_worker(void)
>>>>>>
>>>>>> This should go in include/linux/io_uring.h and then userfaultfd would
>>>>>> not have to include io_uring private headers.
>>>>>>
>>>>>> But that's beside the point, like I said we still need to get to the
>>>>>> bottom of what is going on here first, rather than try and paper around
>>>>>> it. So please don't post more versions of this before we have that
>>>>>> understanding.
>>>>>>
>>>>>> See previous emails on 6.8 and other kernel versions.
>>>>>>
>>>>>> --
>>>>>> Jens Axboe
>>>>> The issue did not involve creating new worker processes. Instead, the
>>>>> existing IOU worker kernel threads (about a dozen) associated with the VM
>>>>> process were fully utilizing CPU without writing data, caused by a fault
>>>>> while reading user data pages in the fault_in_iov_iter_readable function
>>>>> when pulling user memory into kernel space.
>>>>
>>>> OK that makes more sense, I can certainly reproduce a loop in this path:
>>>>
>>>> iou-wrk-726     729    36.910071:       9737 cycles:P:
>>>>         ffff800080456c44 handle_userfault+0x47c
>>>>         ffff800080381fc0 hugetlb_fault+0xb68
>>>>         ffff80008031fee4 handle_mm_fault+0x2fc
>>>>         ffff8000812ada6c do_page_fault+0x1e4
>>>>         ffff8000812ae024 do_translation_fault+0x9c
>>>>         ffff800080049a9c do_mem_abort+0x44
>>>>         ffff80008129bd78 el1_abort+0x38
>>>>         ffff80008129ceb4 el1h_64_sync_handler+0xd4
>>>>         ffff8000800112b4 el1h_64_sync+0x6c
>>>>         ffff80008030984c fault_in_readable+0x74
>>>>         ffff800080476f3c iomap_file_buffered_write+0x14c
>>>>         ffff8000809b1230 blkdev_write_iter+0x1a8
>>>>         ffff800080a1f378 io_write+0x188
>>>>         ffff800080a14f30 io_issue_sqe+0x68
>>>>         ffff800080a155d0 io_wq_submit_work+0xa8
>>>>         ffff800080a32afc io_worker_handle_work+0x1f4
>>>>         ffff800080a332b8 io_wq_worker+0x110
>>>>         ffff80008002dd38 ret_from_fork+0x10
>>>>
>>>> which seems to be expected, we'd continually try and fault in the
>>>> ranges, if the userfaultfd handler isn't filling them.
>>>>
>>>> I guess this is where I'm still confused, because I don't see how this
>>>> is different from if you have a normal write(2) syscall doing the same
>>>> thing - you'd get the same looping.
>>>>
>>>> ??
>>>>
>>>>> This issue occurs like during VM snapshot loading (which uses
>>>>> userfaultfd for on-demand memory loading), while the task in the guest is
>>>>> writing data to disk.
>>>>>
>>>>> Normally, the VM first triggers a user fault to fill the page table.
>>>>> So in the IOU worker thread, the page tables are already filled,
>>>>> and no fault occurs when faulting in memory pages
>>>>> in fault_in_iov_iter_readable.
>>>>>
>>>>> I suspect that during snapshot loading, a memory access in the
>>>>> VM triggers an async page fault handled by the kernel thread,
>>>>> while the IOU worker's async kernel thread is also running.
>>>>> Maybe the IOU worker's thread is scheduled first.
>>>>> I'm going to bed now.
>>>>
>>>> Ah ok, so what you're saying is that because we end up not sleeping
>>>> (because a signal is pending, it seems), then the fault will never get
>>>> filled and hence progress not made? And the signal is pending because
>>>> someone tried to create a new worker, and this work is not getting
>>>> processed.
>>>>
>>>> --
>>>> Jens Axboe
>>>         handle_userfault() {
>>>           hugetlb_vma_lock_read();
>>>           _raw_spin_lock_irq() {
>>>             __pv_queued_spin_lock_slowpath();
>>>           }
>>>           vma_mmu_pagesize() {
>>>             hugetlb_vm_op_pagesize();
>>>           }
>>>           huge_pte_offset();
>>>           hugetlb_vma_unlock_read();
>>>           up_read();
>>>           __wake_up() {
>>>             _raw_spin_lock_irqsave() {
>>>               __pv_queued_spin_lock_slowpath();
>>>             }
>>>             __wake_up_common();
>>>             _raw_spin_unlock_irqrestore();
>>>           }
>>>           schedule() {
>>>             io_wq_worker_sleeping() {
>>>               io_wq_dec_running();
>>>             }
>>>             rcu_note_context_switch();
>>>             raw_spin_rq_lock_nested() {
>>>               _raw_spin_lock();
>>>             }
>>>             update_rq_clock();
>>>             pick_next_task() {
>>>               pick_next_task_fair() {
>>>                 update_curr() {
>>>                   update_curr_se();
>>>                   __calc_delta.constprop.0();
>>>                   update_min_vruntime();
>>>                 }
>>>                 check_cfs_rq_runtime();
>>>                 pick_next_entity() {
>>>                   pick_eevdf();
>>>                 }
>>>                 update_curr() {
>>>                   update_curr_se();
>>>                   __calc_delta.constprop.0();
>>>                   update_min_vruntime();
>>>                 }
>>>                 check_cfs_rq_runtime();
>>>                 pick_next_entity() {
>>>                   pick_eevdf();
>>>                 }
>>>                 update_curr() {
>>>                   update_curr_se();
>>>                   update_min_vruntime();
>>>                   cpuacct_charge();
>>>                   __cgroup_account_cputime() {
>>>                     cgroup_rstat_updated();
>>>                   }
>>>                 }
>>>                 check_cfs_rq_runtime();
>>>                 pick_next_entity() {
>>>                   pick_eevdf();
>>>                 }
>>>               }
>>>             }
>>>             raw_spin_rq_unlock();
>>>             io_wq_worker_running();
>>>           }
>>>           _raw_spin_lock_irq() {
>>>             __pv_queued_spin_lock_slowpath();
>>>           }
>>>           userfaultfd_ctx_put();
>>>         }
>>>       }
>>> The execution flow above is the one that kept faulting
>>> repeatedly in the IOU worker during the issue. The entire fault path,
>>> including this final userfault handling code you're seeing, would be
>>> triggered in an infinite loop. That's why I traced and found that the
>>> io_wq_worker_running() function returns early, causing the flow to
>>> differ from a normal user fault, where it should be sleeping.
>>
>> io_wq_worker_running() is called when the task is scheduled back in.
>> There's no "returning early" here, it simply updates the accounting.
>> Which is part of why your patch makes very little sense to me, we
>> would've called both io_wq_worker_sleeping() and _running() from the
>> userfaultfd path. The latter doesn't really do much, it simply just
>> increments the running worker count, if the worker was previously marked
>> as sleeping.
>>
>> And I strongly suspect that the latter is the issue, not the marking of
>> running. The above loop is fine if we do go to sleep in schedule.
>> However, if there's task_work (either TWA_SIGNAL or TWA_NOTIFY_SIGNAL
>> based) pending, then schedule() will be a no-op and we're going to
>> repeatedly go through that loop. This is because the expectation here is
>> that the loop will be aborted if either of those is true, so that
>> task_work can get run (or a signal handled, whatever), and then the
>> operation retried.
>>
>>> However, your call stack appears to behave normally,
>>> which makes me curious about what's different about execution flow.
>>> Would you be able to share your test case code so I can study it
>>> and try to reproduce the behavior on my side?
>>
>> It behaves normally for the initial attempt - we end up sleeping in
>> schedule(). However, then a new worker gets created, or the ring
>> shutdown, in which case schedule() ends up being a no-op because
>> TWA_NOTIFY_SIGNAL is set, and then we just sit there in a loop running
>> the same code again and again to no avail. So I do think my test case
>> and your issue is the same, I just reproduce it by calling
>> io_uring_queue_exit(), but the exact same thing would happen if worker
>> creation is attempted while an io-wq worker is blocked
>> handle_userfault().
>>
>> This is why I want to fully understand the issue rather than paper
>> around it, as I don't think the fix is correct as-is. We really want to
>> abort the loop and allow the task to handle whatever signaling is
>> currently preventing proper sleeps.
>>
>> I'll dabble a bit more and send out the test case too, in case it'll
>> help on your end.
>>
>> --
>> Jens Axboe
> I'm really looking forward to your test case. Also, I'd like to
> emphasize one more point: the handle_userfault graph path I sent you,
> including the schedule function, is complete and unmodified. You can
> see that the schedule function is very, very short. I understand your
> point about signal handling, but in this very brief function graph, I
> haven't yet seen any functions related to signal handling.
> Additionally, there is no context switch here, nor is it the situation
> where the thread is being scheduled back in. Perhaps the scenario
> you've reproduced is still different from the one I've encountered in
> some subtle way?

Ask yourself, why would schedule() return immediately rather than
actually block? There's a few cases here:

1) The task state is set to TASK_RUNNING - either because it was never
   set to TASK_INTERRUPTIBLE/TASK_UNINTERRUPTIBLE, or because someone
   raced and woke up the task after the initial check on whether it
   should be sleeping or not.

2) Some kind of notification or signal is pending. This is usually when
   task_sigpending() returns true, or if TIF_NOTIFY_SIGNAL is set. Those
   need clearing, and that's generally done on return to userspace.

#1 isn't the case here, but #2 looks highly plausible. The io-wq workers
rely on running this kind of work manually, and retrying. If we loop
further down with these conditions being true, then we're just busy
looping at that point and will NEVER sleep. You don't see any functions
related to signal handling etc EXACTLY because of that, there's nowhere
it gets run.

> void io_wq_worker_running(struct task_struct *tsk)
> {
> 	struct io_worker *worker = tsk->worker_private;
> 
> 	if (!worker)
> 		return;
> 	if (!test_bit(IO_WORKER_F_FAULT, &worker->flags)) {
> 		if (!test_bit(IO_WORKER_F_UP, &worker->flags))
> 			return;
> 		if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
> 			return;
> 		set_bit(IO_WORKER_F_RUNNING, &worker->flags);
> 		io_wq_inc_running(worker);
> 	}
> }
> However, from my observation during the crash live memory analysis,
> when this happens in the IOU worker thread, the
> IO_WORKER_F_RUNNING flag is set. This is what I meant by "early return",
> rather than it being just a simple accounting function. I look forward
> to your deeper analysis and any corrections you may have.

It's set because of what I outlined above. If schedule() would actually
sleep, then io_wq_worker_sleeping() would've been called. The fact that
you're getting io_wq_worker_running() called without WORKER_F_RUNNING
cleared is because of that.

But you're too focused on the symptom here, not the underlying issue. It
doesn't matter at all that io_wq_worker_running() is called when the
task is already running, it'll just ignore that. It's explicitly tested.
Your patch won't make a single difference for this case because of that,
you're just wrapping what's essentially a no-op call with another no-op
call, as you've now nested RUNNING inside the FAULT flag. It won't
change your outcome at all.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-23 13:34           ` Jens Axboe
  2025-04-23 14:29             ` 姜智伟
@ 2025-04-23 15:55             ` Jens Axboe
  2025-04-23 16:07               ` 姜智伟
                                 ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Jens Axboe @ 2025-04-23 15:55 UTC (permalink / raw)
  To: 姜智伟
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

[-- Attachment #1: Type: text/plain, Size: 1904 bytes --]

Something like this, perhaps - it'll ensure that io-wq workers get a
chance to flush out pending work, which should prevent the looping. I've
attached a basic test case. It'll issue a write that will fault, and
then try and cancel that as a way to trigger the TIF_NOTIFY_SIGNAL based
looping.

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index d80f94346199..e18926dbf20a 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -32,6 +32,7 @@
 #include <linux/swapops.h>
 #include <linux/miscdevice.h>
 #include <linux/uio.h>
+#include <linux/io_uring.h>
 
 static int sysctl_unprivileged_userfaultfd __read_mostly;
 
@@ -376,6 +377,8 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	 */
 	if (current->flags & (PF_EXITING|PF_DUMPCORE))
 		goto out;
+	else if (current->flags & PF_IO_WORKER)
+		io_worker_fault();
 
 	assert_fault_locked(vmf);
 
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 85fe4e6b275c..d93dd7402a28 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -28,6 +28,7 @@ static inline void io_uring_free(struct task_struct *tsk)
 	if (tsk->io_uring)
 		__io_uring_free(tsk);
 }
+void io_worker_fault(void);
 #else
 static inline void io_uring_task_cancel(void)
 {
@@ -46,6 +47,9 @@ static inline bool io_is_uring_fops(struct file *file)
 {
 	return false;
 }
+static inline void io_worker_fault(void)
+{
+}
 #endif
 
 #endif
diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index d52069b1177b..f74bea028ec7 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -1438,3 +1438,13 @@ static __init int io_wq_init(void)
 	return 0;
 }
 subsys_initcall(io_wq_init);
+
+void io_worker_fault(void)
+{
+	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
+		clear_notify_signal();
+	if (test_thread_flag(TIF_NOTIFY_RESUME))
+		resume_user_mode_work(NULL);
+	if (task_work_pending(current))
+		task_work_run();
+}

-- 
Jens Axboe

[-- Attachment #2: ufd.c --]
[-- Type: text/x-csrc, Size: 3324 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/mman.h>
#include <sys/uio.h>
#include <liburing.h>
#include <pthread.h>
#include <linux/userfaultfd.h>

#define HP_SIZE		(2 * 1024 * 1024ULL)
#define NR_HUGEPAGES	(3000)

#ifndef NR_userfaultfd
#define NR_userfaultfd	282
#endif

struct thread_data {
	pthread_t thread;
	pthread_barrier_t barrier;
	int uffd;
};

static void *fault_handler(void *data)
{
	struct thread_data *td = data;
	struct uffd_msg msg;
	struct pollfd pfd;
	int ret, nready;

	pthread_barrier_wait(&td->barrier);

	do {
		pfd.fd = td->uffd;
		pfd.events = POLLIN;
		nready = poll(&pfd, 1, -1);
		if (nready < 0) {
			perror("poll");
			exit(1);
		}

		ret = read(td->uffd, &msg, sizeof(msg));
		if (ret < 0) {
			if (errno == EAGAIN)
				continue;
			perror("read");
			exit(1);
		}

		if (msg.event != UFFD_EVENT_PAGEFAULT) {
			printf("unexpected event: %x\n", msg.event);
			exit(1);
		}

		printf("Page fault\n");
		printf("flags = %lx; ", (long) msg.arg.pagefault.flags);
		printf("address = %lx\n", (long)msg.arg.pagefault.address);
	} while (1);

	return NULL;
}

static void do_io(struct io_uring *ring, void *buf, size_t len)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int fd, ret, i;

	fd = open("/dev/nvme0n1", O_RDWR);
	if (fd < 0) {
		perror("open create");
		return;
	}

	/* issue faulting write */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_write(sqe, fd, buf, len, 0);
	sqe->user_data = 1;
	io_uring_submit(ring);

	printf("blocking issued\n");
	sleep(1);

	/* cancel above write */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_cancel64(sqe, 1, IORING_ASYNC_CANCEL_USERDATA);
	sqe->user_data = 2;
	io_uring_submit(ring);

	printf("cancel issued\n");
	sleep(1);

	for (i = 0; i < 2; i++) {
again:
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0) {
			printf("wait: %d\n", ret);
			if (ret == -EINTR)
				goto again;
			break;
		}
		printf("got res %d, %ld\n", cqe->res, (long) cqe->user_data);
		io_uring_cqe_seen(ring, cqe);
	}
}

static void sig_usr1(int sig)
{
	printf("got USR1\n");
}

static int test(void)
{
	struct uffdio_api api = { };
	struct uffdio_register reg = { };
	struct io_uring ring;
	struct sigaction act = { };
	struct thread_data td = { };
	void *buf;

	act.sa_handler = sig_usr1;
	sigaction(SIGUSR1, &act, NULL);

	io_uring_queue_init(4, &ring, 0);

	buf = mmap(NULL, HP_SIZE, PROT_READ|PROT_WRITE,
			MAP_PRIVATE | MAP_HUGETLB | MAP_HUGE_2MB | MAP_ANONYMOUS,
			-1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("got buf %p\n", buf);

	td.uffd = syscall(NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (td.uffd < 0) {
		perror("userfaultfd");
		return 1;
	}

	api.api = UFFD_API;
	if (ioctl(td.uffd, UFFDIO_API, &api) < 0) {
		perror("ioctl UFFDIO_API");
		return 1;
	}

	reg.range.start = (unsigned long) buf;
	reg.range.len = HP_SIZE;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(td.uffd, UFFDIO_REGISTER, &reg) < 0) {
		perror("ioctl UFFDIO_REGISTER");
		return 1;
	}

	pthread_barrier_init(&td.barrier, NULL, 2);
	pthread_create(&td.thread, NULL, fault_handler, &td);

	pthread_barrier_wait(&td.barrier);

	do_io(&ring, buf, HP_SIZE);
	return 0;
}

int main(int argc, char *argv[])
{
	return test();
}

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-23 15:55             ` Jens Axboe
@ 2025-04-23 16:07               ` 姜智伟
  2025-04-23 16:17               ` Pavel Begunkov
  2025-04-23 22:57               ` Jens Axboe
  2 siblings, 0 replies; 25+ messages in thread
From: 姜智伟 @ 2025-04-23 16:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

Jens Axboe <axboe@kernel.dk> 于2025年4月23日周三 23:55写道:
>
> Something like this, perhaps - it'll ensure that io-wq workers get a
> chance to flush out pending work, which should prevent the looping. I've
> attached a basic test case. It'll issue a write that will fault, and
> then try and cancel that as a way to trigger the TIF_NOTIFY_SIGNAL based
> looping.
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index d80f94346199..e18926dbf20a 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -32,6 +32,7 @@
>  #include <linux/swapops.h>
>  #include <linux/miscdevice.h>
>  #include <linux/uio.h>
> +#include <linux/io_uring.h>
>
>  static int sysctl_unprivileged_userfaultfd __read_mostly;
>
> @@ -376,6 +377,8 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>          */
>         if (current->flags & (PF_EXITING|PF_DUMPCORE))
>                 goto out;
> +       else if (current->flags & PF_IO_WORKER)
> +               io_worker_fault();
>
>         assert_fault_locked(vmf);
>
> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> index 85fe4e6b275c..d93dd7402a28 100644
> --- a/include/linux/io_uring.h
> +++ b/include/linux/io_uring.h
> @@ -28,6 +28,7 @@ static inline void io_uring_free(struct task_struct *tsk)
>         if (tsk->io_uring)
>                 __io_uring_free(tsk);
>  }
> +void io_worker_fault(void);
>  #else
>  static inline void io_uring_task_cancel(void)
>  {
> @@ -46,6 +47,9 @@ static inline bool io_is_uring_fops(struct file *file)
>  {
>         return false;
>  }
> +static inline void io_worker_fault(void)
> +{
> +}
>  #endif
>
>  #endif
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index d52069b1177b..f74bea028ec7 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -1438,3 +1438,13 @@ static __init int io_wq_init(void)
>         return 0;
>  }
>  subsys_initcall(io_wq_init);
> +
> +void io_worker_fault(void)
> +{
> +       if (test_thread_flag(TIF_NOTIFY_SIGNAL))
> +               clear_notify_signal();
> +       if (test_thread_flag(TIF_NOTIFY_RESUME))
> +               resume_user_mode_work(NULL);
> +       if (task_work_pending(current))
> +               task_work_run();
> +}
>
> --
> Jens Axboe
I understand some of what you said. I was indeed lost in appearances.
It's very late here now, and I'll take a deeper look tomorrow.
Thank you very much.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-23 15:55             ` Jens Axboe
  2025-04-23 16:07               ` 姜智伟
@ 2025-04-23 16:17               ` Pavel Begunkov
  2025-04-23 16:23                 ` Jens Axboe
  2025-04-23 22:57               ` Jens Axboe
  2 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2025-04-23 16:17 UTC (permalink / raw)
  To: Jens Axboe, 姜智伟
  Cc: viro, brauner, jack, akpm, peterx, linux-fsdevel, linux-mm,
	linux-kernel, io-uring

On 4/23/25 16:55, Jens Axboe wrote:
> Something like this, perhaps - it'll ensure that io-wq workers get a
> chance to flush out pending work, which should prevent the looping. I've
> attached a basic test case. It'll issue a write that will fault, and
> then try and cancel that as a way to trigger the TIF_NOTIFY_SIGNAL based
> looping.
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index d80f94346199..e18926dbf20a 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -32,6 +32,7 @@
>   #include <linux/swapops.h>
>   #include <linux/miscdevice.h>
>   #include <linux/uio.h>
> +#include <linux/io_uring.h>
>   
>   static int sysctl_unprivileged_userfaultfd __read_mostly;
>   
> @@ -376,6 +377,8 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>   	 */
>   	if (current->flags & (PF_EXITING|PF_DUMPCORE))
>   		goto out;
> +	else if (current->flags & PF_IO_WORKER)
> +		io_worker_fault();
>   
>   	assert_fault_locked(vmf);
>   
> diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
> index 85fe4e6b275c..d93dd7402a28 100644
> --- a/include/linux/io_uring.h
> +++ b/include/linux/io_uring.h
> @@ -28,6 +28,7 @@ static inline void io_uring_free(struct task_struct *tsk)
>   	if (tsk->io_uring)
>   		__io_uring_free(tsk);
>   }
> +void io_worker_fault(void);
>   #else
>   static inline void io_uring_task_cancel(void)
>   {
> @@ -46,6 +47,9 @@ static inline bool io_is_uring_fops(struct file *file)
>   {
>   	return false;
>   }
> +static inline void io_worker_fault(void)
> +{
> +}
>   #endif
>   
>   #endif
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index d52069b1177b..f74bea028ec7 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -1438,3 +1438,13 @@ static __init int io_wq_init(void)
>   	return 0;
>   }
>   subsys_initcall(io_wq_init);
> +
> +void io_worker_fault(void)
> +{
> +	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
> +		clear_notify_signal();
> +	if (test_thread_flag(TIF_NOTIFY_RESUME))
> +		resume_user_mode_work(NULL);
> +	if (task_work_pending(current))
> +		task_work_run();

Looking at the stacktrace, that sounds dangerous

iou-wrk-44588  [kernel.kallsyms]  [k] io_wq_worker
iou-wrk-44588  [kernel.kallsyms]  [k] io_worker_handle_work
iou-wrk-44588  [kernel.kallsyms]  [k] io_wq_submit_work
iou-wrk-44588  [kernel.kallsyms]  [k] io_issue_sqe
iou-wrk-44588  [kernel.kallsyms]  [k] io_write
iou-wrk-44588  [kernel.kallsyms]  [k] blkdev_write_iter
iou-wrk-44588  [kernel.kallsyms]  [k] iomap_file_buffered_write
iou-wrk-44588  [kernel.kallsyms]  [k] iomap_write_iter
iou-wrk-44588  [kernel.kallsyms]  [k] fault_in_iov_iter_readable
iou-wrk-44588  [kernel.kallsyms]  [k] fault_in_readable
iou-wrk-44588  [kernel.kallsyms]  [k] asm_exc_page_fault
iou-wrk-44588  [kernel.kallsyms]  [k] exc_page_fault
iou-wrk-44588  [kernel.kallsyms]  [k] do_user_addr_fault
iou-wrk-44588  [kernel.kallsyms]  [k] handle_mm_fault
iou-wrk-44588  [kernel.kallsyms]  [k] hugetlb_fault
iou-wrk-44588  [kernel.kallsyms]  [k] hugetlb_no_page
iou-wrk-44588  [kernel.kallsyms]  [k] hugetlb_handle_userfault
iou-wrk-44588  [kernel.kallsyms]  [k] handle_userfault

It might be holding a good bunch of locks, and then it's trapped
in a page fault handler. Do normal / non-PF_IO_WORKER tasks run
task_work from handle_userfault?

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-23 16:17               ` Pavel Begunkov
@ 2025-04-23 16:23                 ` Jens Axboe
  0 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2025-04-23 16:23 UTC (permalink / raw)
  To: Pavel Begunkov, 姜智伟
  Cc: viro, brauner, jack, akpm, peterx, linux-fsdevel, linux-mm,
	linux-kernel, io-uring

On 4/23/25 10:17 AM, Pavel Begunkov wrote:
> On 4/23/25 16:55, Jens Axboe wrote:
>> Something like this, perhaps - it'll ensure that io-wq workers get a
>> chance to flush out pending work, which should prevent the looping. I've
>> attached a basic test case. It'll issue a write that will fault, and
>> then try and cancel that as a way to trigger the TIF_NOTIFY_SIGNAL based
>> looping.
>>
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index d80f94346199..e18926dbf20a 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -32,6 +32,7 @@
>>   #include <linux/swapops.h>
>>   #include <linux/miscdevice.h>
>>   #include <linux/uio.h>
>> +#include <linux/io_uring.h>
>>     static int sysctl_unprivileged_userfaultfd __read_mostly;
>>   @@ -376,6 +377,8 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>        */
>>       if (current->flags & (PF_EXITING|PF_DUMPCORE))
>>           goto out;
>> +    else if (current->flags & PF_IO_WORKER)
>> +        io_worker_fault();
>>         assert_fault_locked(vmf);
>>   diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
>> index 85fe4e6b275c..d93dd7402a28 100644
>> --- a/include/linux/io_uring.h
>> +++ b/include/linux/io_uring.h
>> @@ -28,6 +28,7 @@ static inline void io_uring_free(struct task_struct *tsk)
>>       if (tsk->io_uring)
>>           __io_uring_free(tsk);
>>   }
>> +void io_worker_fault(void);
>>   #else
>>   static inline void io_uring_task_cancel(void)
>>   {
>> @@ -46,6 +47,9 @@ static inline bool io_is_uring_fops(struct file *file)
>>   {
>>       return false;
>>   }
>> +static inline void io_worker_fault(void)
>> +{
>> +}
>>   #endif
>>     #endif
>> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
>> index d52069b1177b..f74bea028ec7 100644
>> --- a/io_uring/io-wq.c
>> +++ b/io_uring/io-wq.c
>> @@ -1438,3 +1438,13 @@ static __init int io_wq_init(void)
>>       return 0;
>>   }
>>   subsys_initcall(io_wq_init);
>> +
>> +void io_worker_fault(void)
>> +{
>> +    if (test_thread_flag(TIF_NOTIFY_SIGNAL))
>> +        clear_notify_signal();
>> +    if (test_thread_flag(TIF_NOTIFY_RESUME))
>> +        resume_user_mode_work(NULL);
>> +    if (task_work_pending(current))
>> +        task_work_run();
> 
> Looking at the stacktrace, that sounds dangerous
> 
> iou-wrk-44588  [kernel.kallsyms]  [k] io_wq_worker
> iou-wrk-44588  [kernel.kallsyms]  [k] io_worker_handle_work
> iou-wrk-44588  [kernel.kallsyms]  [k] io_wq_submit_work
> iou-wrk-44588  [kernel.kallsyms]  [k] io_issue_sqe
> iou-wrk-44588  [kernel.kallsyms]  [k] io_write
> iou-wrk-44588  [kernel.kallsyms]  [k] blkdev_write_iter
> iou-wrk-44588  [kernel.kallsyms]  [k] iomap_file_buffered_write
> iou-wrk-44588  [kernel.kallsyms]  [k] iomap_write_iter
> iou-wrk-44588  [kernel.kallsyms]  [k] fault_in_iov_iter_readable
> iou-wrk-44588  [kernel.kallsyms]  [k] fault_in_readable
> iou-wrk-44588  [kernel.kallsyms]  [k] asm_exc_page_fault
> iou-wrk-44588  [kernel.kallsyms]  [k] exc_page_fault
> iou-wrk-44588  [kernel.kallsyms]  [k] do_user_addr_fault
> iou-wrk-44588  [kernel.kallsyms]  [k] handle_mm_fault
> iou-wrk-44588  [kernel.kallsyms]  [k] hugetlb_fault
> iou-wrk-44588  [kernel.kallsyms]  [k] hugetlb_no_page
> iou-wrk-44588  [kernel.kallsyms]  [k] hugetlb_handle_userfault
> iou-wrk-44588  [kernel.kallsyms]  [k] handle_userfault
> 
> It might be holding a good bunch of locks, and then it's trapped
> in a page fault handler. Do normal / non-PF_IO_WORKER tasks run
> task_work from handle_userfault?

Yeah, it's really just a test patch. Ideally we want this to do the
usual thing, which is fall back and let it retry, where we can handle
all of this too.

-- 
Jens Axboe


* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-23 15:10               ` Jens Axboe
@ 2025-04-23 18:55                 ` Jens Axboe
  0 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2025-04-23 18:55 UTC (permalink / raw)
  To: 姜智伟
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

[-- Attachment #1: Type: text/plain, Size: 13943 bytes --]

On 4/23/25 9:10 AM, Jens Axboe wrote:
> On 4/23/25 8:29 AM, 姜智伟 wrote:
>> Jens Axboe <axboe@kernel.dk> wrote on Wed, Apr 23, 2025 at 21:34:
>>>
>>> On 4/22/25 8:49 PM, 姜智伟 wrote:
>>>> On Wed, Apr 23, 2025 at 1:33 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> On 4/22/25 11:04 AM, 姜智伟 wrote:
>>>>>> On Wed, Apr 23, 2025 at 12:32 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>
>>>>>>> On 4/22/25 10:29 AM, Zhiwei Jiang wrote:
>>>>>>>> diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
>>>>>>>> index d4fb2940e435..8567a9c819db 100644
>>>>>>>> --- a/io_uring/io-wq.h
>>>>>>>> +++ b/io_uring/io-wq.h
>>>>>>>> @@ -70,8 +70,10 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
>>>>>>>>                                       void *data, bool cancel_all);
>>>>>>>>
>>>>>>>>  #if defined(CONFIG_IO_WQ)
>>>>>>>> -extern void io_wq_worker_sleeping(struct task_struct *);
>>>>>>>> -extern void io_wq_worker_running(struct task_struct *);
>>>>>>>> +extern void io_wq_worker_sleeping(struct task_struct *tsk);
>>>>>>>> +extern void io_wq_worker_running(struct task_struct *tsk);
>>>>>>>> +extern void set_userfault_flag_for_ioworker(void);
>>>>>>>> +extern void clear_userfault_flag_for_ioworker(void);
>>>>>>>>  #else
>>>>>>>>  static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>>>>>>  {
>>>>>>>> @@ -79,6 +81,12 @@ static inline void io_wq_worker_sleeping(struct task_struct *tsk)
>>>>>>>>  static inline void io_wq_worker_running(struct task_struct *tsk)
>>>>>>>>  {
>>>>>>>>  }
>>>>>>>> +static inline void set_userfault_flag_for_ioworker(void)
>>>>>>>> +{
>>>>>>>> +}
>>>>>>>> +static inline void clear_userfault_flag_for_ioworker(void)
>>>>>>>> +{
>>>>>>>> +}
>>>>>>>>  #endif
>>>>>>>>
>>>>>>>>  static inline bool io_wq_current_is_worker(void)
>>>>>>>
>>>>>>> This should go in include/linux/io_uring.h and then userfaultfd would
>>>>>>> not have to include io_uring private headers.
>>>>>>>
>>>>>>> But that's beside the point, like I said we still need to get to the
>>>>>>> bottom of what is going on here first, rather than try and paper around
>>>>>>> it. So please don't post more versions of this before we have that
>>>>>>> understanding.
>>>>>>>
>>>>>>> See previous emails on 6.8 and other kernel versions.
>>>>>>>
>>>>>>> --
>>>>>>> Jens Axboe
>>>>>> The issue did not involve creating new worker processes. Instead, the
>>>>>> existing IOU worker kernel threads (about a dozen) associated with the VM
>>>>>> process were fully utilizing CPU without writing data, caused by a fault
>>>>>> while reading user data pages in the fault_in_iov_iter_readable function
>>>>>> when pulling user memory into kernel space.
>>>>>
>>>>> OK that makes more sense, I can certainly reproduce a loop in this path:
>>>>>
>>>>> iou-wrk-726     729    36.910071:       9737 cycles:P:
>>>>>         ffff800080456c44 handle_userfault+0x47c
>>>>>         ffff800080381fc0 hugetlb_fault+0xb68
>>>>>         ffff80008031fee4 handle_mm_fault+0x2fc
>>>>>         ffff8000812ada6c do_page_fault+0x1e4
>>>>>         ffff8000812ae024 do_translation_fault+0x9c
>>>>>         ffff800080049a9c do_mem_abort+0x44
>>>>>         ffff80008129bd78 el1_abort+0x38
>>>>>         ffff80008129ceb4 el1h_64_sync_handler+0xd4
>>>>>         ffff8000800112b4 el1h_64_sync+0x6c
>>>>>         ffff80008030984c fault_in_readable+0x74
>>>>>         ffff800080476f3c iomap_file_buffered_write+0x14c
>>>>>         ffff8000809b1230 blkdev_write_iter+0x1a8
>>>>>         ffff800080a1f378 io_write+0x188
>>>>>         ffff800080a14f30 io_issue_sqe+0x68
>>>>>         ffff800080a155d0 io_wq_submit_work+0xa8
>>>>>         ffff800080a32afc io_worker_handle_work+0x1f4
>>>>>         ffff800080a332b8 io_wq_worker+0x110
>>>>>         ffff80008002dd38 ret_from_fork+0x10
>>>>>
>>>>> which seems to be expected, we'd continually try and fault in the
>>>>> ranges, if the userfaultfd handler isn't filling them.
>>>>>
>>>>> I guess this is where I'm still confused, because I don't see how this
>>>>> is different from if you have a normal write(2) syscall doing the same
>>>>> thing - you'd get the same looping.
>>>>>
>>>>> ??
>>>>>
>>>>>> This issue occurs during VM snapshot loading (which uses
>>>>>> userfaultfd for on-demand memory loading), while the task in the guest is
>>>>>> writing data to disk.
>>>>>>
>>>>>> Normally, the VM first triggers a user fault to fill the page table.
>>>>>> So by the time the IOU worker thread runs, the page tables are
>>>>>> already filled, and no fault should occur when faulting in memory
>>>>>> pages in fault_in_iov_iter_readable.
>>>>>>
>>>>>> I suspect that during snapshot loading, a memory access in the
>>>>>> VM triggers an async page fault handled by the kernel thread,
>>>>>> while the IOU worker's async kernel thread is also running.
>>>>>> Perhaps the problem arises if the IOU worker's thread is scheduled first.
>>>>>> I'm going to bed now.
>>>>>
>>>>> Ah ok, so what you're saying is that because we end up not sleeping
>>>>> (because a signal is pending, it seems), then the fault will never get
>>>>> filled and hence progress is never made? And the signal is pending because
>>>>> someone tried to create a new worker, and this work is not getting
>>>>> processed.
>>>>>
>>>>> --
>>>>> Jens Axboe
>>>>         handle_userfault() {
>>>>           hugetlb_vma_lock_read();
>>>>           _raw_spin_lock_irq() {
>>>>             __pv_queued_spin_lock_slowpath();
>>>>           }
>>>>           vma_mmu_pagesize() {
>>>>             hugetlb_vm_op_pagesize();
>>>>           }
>>>>           huge_pte_offset();
>>>>           hugetlb_vma_unlock_read();
>>>>           up_read();
>>>>           __wake_up() {
>>>>             _raw_spin_lock_irqsave() {
>>>>               __pv_queued_spin_lock_slowpath();
>>>>             }
>>>>             __wake_up_common();
>>>>             _raw_spin_unlock_irqrestore();
>>>>           }
>>>>           schedule() {
>>>>             io_wq_worker_sleeping() {
>>>>               io_wq_dec_running();
>>>>             }
>>>>             rcu_note_context_switch();
>>>>             raw_spin_rq_lock_nested() {
>>>>               _raw_spin_lock();
>>>>             }
>>>>             update_rq_clock();
>>>>             pick_next_task() {
>>>>               pick_next_task_fair() {
>>>>                 update_curr() {
>>>>                   update_curr_se();
>>>>                   __calc_delta.constprop.0();
>>>>                   update_min_vruntime();
>>>>                 }
>>>>                 check_cfs_rq_runtime();
>>>>                 pick_next_entity() {
>>>>                   pick_eevdf();
>>>>                 }
>>>>                 update_curr() {
>>>>                   update_curr_se();
>>>>                   __calc_delta.constprop.0();
>>>>                   update_min_vruntime();
>>>>                 }
>>>>                 check_cfs_rq_runtime();
>>>>                 pick_next_entity() {
>>>>                   pick_eevdf();
>>>>                 }
>>>>                 update_curr() {
>>>>                   update_curr_se();
>>>>                   update_min_vruntime();
>>>>                   cpuacct_charge();
>>>>                   __cgroup_account_cputime() {
>>>>                     cgroup_rstat_updated();
>>>>                   }
>>>>                 }
>>>>                 check_cfs_rq_runtime();
>>>>                 pick_next_entity() {
>>>>                   pick_eevdf();
>>>>                 }
>>>>               }
>>>>             }
>>>>             raw_spin_rq_unlock();
>>>>             io_wq_worker_running();
>>>>           }
>>>>           _raw_spin_lock_irq() {
>>>>             __pv_queued_spin_lock_slowpath();
>>>>           }
>>>>           userfaultfd_ctx_put();
>>>>         }
>>>>       }
>>>> The execution flow above is the one that kept faulting
>>>> repeatedly in the IOU worker during the issue. The entire fault path,
>>>> including this final userfault handling code you're seeing, would be
>>>> triggered in an infinite loop. That's why I traced and found that the
>>>> io_wq_worker_running() function returns early, causing the flow to
>>>> differ from a normal user fault, where it should be sleeping.
>>>
>>> io_wq_worker_running() is called when the task is scheduled back in.
>>> There's no "returning early" here, it simply updates the accounting.
>>> Which is part of why your patch makes very little sense to me, we
>>> would've called both io_wq_worker_sleeping() and _running() from the
>>> userfaultfd path. The latter doesn't really do much, it simply just
>>> increments the running worker count, if the worker was previously marked
>>> as sleeping.
>>>
>>> And I strongly suspect that the latter is the issue, not the marking of
>>> running. The above loop is fine if we do go to sleep in schedule.
>>> However, if there's task_work (either TWA_SIGNAL or TWA_NOTIFY_SIGNAL
>>> based) pending, then schedule() will be a no-op and we're going to
>>> repeatedly go through that loop. This is because the expectation here is
>>> that the loop will be aborted if either of those is true, so that
>>> task_work can get run (or a signal handled, whatever), and then the
>>> operation retried.
>>>
>>>> However, your call stack appears to behave normally,
>>>> which makes me curious about what's different about execution flow.
>>>> Would you be able to share your test case code so I can study it
>>>> and try to reproduce the behavior on my side?
>>>
>>> It behaves normally for the initial attempt - we end up sleeping in
>>> schedule(). However, then a new worker gets created, or the ring
>>> shutdown, in which case schedule() ends up being a no-op because
>>> TWA_NOTIFY_SIGNAL is set, and then we just sit there in a loop running
>>> the same code again and again to no avail. So I do think my test case
>>> and your issue is the same, I just reproduce it by calling
>>> io_uring_queue_exit(), but the exact same thing would happen if worker
>>> creation is attempted while an io-wq worker is blocked
>>> handle_userfault().
>>>
>>> This is why I want to fully understand the issue rather than paper
>>> around it, as I don't think the fix is correct as-is. We really want to
>>> abort the loop and allow the task to handle whatever signaling is
>>> currently preventing proper sleeps.
>>>
>>> I'll dabble a bit more and send out the test case too, in case it'll
>>> help on your end.
>>>
>>> --
>>> Jens Axboe
>> I'm really looking forward to your test case. Also, I'd like to
>> emphasize one more point: the handle_userfault graph path I sent you,
>> including the schedule function, is complete and unmodified. You can
>> see that the schedule function is very, very short. I understand your
>> point about signal handling, but in this very brief function graph, I
>> haven't yet seen any functions related to signal handling.
>> Additionally, there is no context switch here, nor is it the situation
>> where the thread is being scheduled back in. Perhaps the scenario
>> you've reproduced is still different from the one I've encountered in
>> some subtle way?
> 
> Ask yourself, why would schedule() return immediately rather than
> actually block? There's a few cases here:
> 
> 1) The task state is set to TASK_RUNNING - either because it was never
>    set to TASK_INTERRUPTIBLE/TASK_UNINTERRUPTIBLE, or because someone
>    raced and woke up the task after the initial check on whether it
>    should be sleeping or not.
> 
> 2) Some kind of notification or signal is pending. This is usually when
>    task_sigpending() returns true, or if TIF_NOTIFY_SIGNAL is set. Those
>    need clearing, and that's generally done on return to userspace.
> 
> #1 isn't the case here, but #2 looks highly plausible. The io-wq workers
> rely on running this kind of work manually, and retrying. If we loop
> further down with these conditions being true, then we're just busy
> looping at that point and will NEVER sleep. You don't see any functions
> related to signal handling etc EXACTLY because of that, there's nowhere
> it gets run.
> 
>> void io_wq_worker_running(struct task_struct *tsk)
>> {
>> 	struct io_worker *worker = tsk->worker_private;
>>
>> 	if (!worker)
>> 		return;
>> 	if (!test_bit(IO_WORKER_F_FAULT, &worker->flags)) {
>> 		if (!test_bit(IO_WORKER_F_UP, &worker->flags))
>> 			return;
>> 		if (test_bit(IO_WORKER_F_RUNNING, &worker->flags))
>> 			return;
>> 		set_bit(IO_WORKER_F_RUNNING, &worker->flags);
>> 		io_wq_inc_running(worker);
>> 	}
>> }
>> However, from my observation during the crash live memory analysis,
>> when this happens in the IOU worker thread, the
>> IO_WORKER_F_RUNNING flag is set. This is what I meant by "early return",
>> rather than just a simple accounting function. I look forward to your
>> deeper analysis and any corrections you may have.
> 
> It's set because of what I outlined above. If schedule() would actually
> sleep, then io_wq_worker_sleeping() would've been called. The fact that
> you're getting io_wq_worker_running() called without WORKER_F_RUNNING
> cleared is because of that.
> 
> But you're too focused on the symptom here, not the underlying issue. It
> doesn't matter at all that io_wq_worker_running() is called when the
> task is already running, it'll just ignore that. It's explicitly tested.
> Your patch won't make a single difference for this case because of that,
> you're just wrapping what's essentially a no-op call with another no-op
> call, as you've now nested RUNNING inside the FAULT flag. It won't
> change your outcome at all.

BTW, same thing can be observed without using io_uring at all - just
have a normal task do a write(2) as in my test case, and have the parent
send it a signal. We'll loop in page fault handling if userfaultfd is
used and it doesn't fill the fault. Example attached.

IOW, this is a generic "problem". I use quotes here as it's not _really_
a problem, it'll just loop excessively if a signal is pending. It can
still very much get killed or terminated, but it's not going to make
progress as the page fault isn't filled.

-- 
Jens Axboe

[-- Attachment #2: tufd.c --]
[-- Type: text/x-csrc, Size: 2977 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <poll.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/wait.h>
#include <linux/mman.h>
#include <sys/uio.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <linux/userfaultfd.h>

#define HP_SIZE		(2 * 1024 * 1024ULL)

#ifndef NR_userfaultfd
#define NR_userfaultfd	282
#endif

struct thread_data {
	pthread_t thread;
	pthread_barrier_t barrier;
	int uffd;
};

static void *fault_handler(void *data)
{
	struct thread_data *td = data;
	struct uffd_msg msg;
	struct pollfd pfd;
	int ret, nready;

	pthread_barrier_wait(&td->barrier);

	do {
		pfd.fd = td->uffd;
		pfd.events = POLLIN;
		nready = poll(&pfd, 1, -1);
		if (nready < 0) {
			perror("poll");
			exit(1);
		}

		ret = read(td->uffd, &msg, sizeof(msg));
		if (ret < 0) {
			if (errno == EAGAIN)
				continue;
			perror("read");
			exit(1);
		}

		if (msg.event != UFFD_EVENT_PAGEFAULT) {
			printf("unexpected event: %x\n", msg.event);
			exit(1);
		}

		printf("Page fault\n");
		printf("flags = %lx; ", (long) msg.arg.pagefault.flags);
		printf("address = %lx\n", (long)msg.arg.pagefault.address);
	} while (1);

	return NULL;
}

static void *arm_fault_handler(struct thread_data *td, size_t len)
{
	struct uffdio_api api = { };
	struct uffdio_register reg = { };
	void *buf;

	buf = mmap(NULL, HP_SIZE, PROT_READ|PROT_WRITE,
			MAP_PRIVATE | MAP_HUGETLB | MAP_HUGE_2MB | MAP_ANONYMOUS,
			-1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return NULL;
	}
	printf("got buf %p\n", buf);

	td->uffd = syscall(NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (td->uffd < 0) {
		perror("userfaultfd");
		return NULL;
	}

	api.api = UFFD_API;
	if (ioctl(td->uffd, UFFDIO_API, &api) < 0) {
		perror("ioctl UFFDIO_API");
		return NULL;
	}

	reg.range.start = (unsigned long) buf;
	reg.range.len = HP_SIZE;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(td->uffd, UFFDIO_REGISTER, &reg) < 0) {
		perror("ioctl UFFDIO_REGISTER");
		return NULL;
	}

	return buf;
}

static void sig_usr1(int sig)
{
}

static void __do_io(int fd, void *buf, size_t len)
{
	struct sigaction act = { };
	int ret;

	act.sa_handler = sig_usr1;
	sigaction(SIGUSR1, &act, NULL);

	printf("child will write\n");
	ret = write(fd, buf, len);
	printf("ret=%d\n", ret);
}

static void do_io(struct thread_data *td, size_t len)
{
	void *buf;
	pid_t pid;
	int fd;

	fd = open("/dev/nvme0n1", O_RDWR);
	if (fd < 0) {
		perror("open create");
		return;
	}

	pid = fork();
	if (pid) {
		int wstat;

		sleep(1);
		kill(pid, SIGUSR1);
		printf("wait on child\n");
		waitpid(pid, &wstat, 0);
	} else {
		buf = arm_fault_handler(td, len);
		pthread_barrier_wait(&td->barrier);
		__do_io(fd, buf, len);
		exit(0);
	}
}

int main(int argc, char *argv[])
{
	struct thread_data td = { };

	pthread_barrier_init(&td.barrier, NULL, 2);
	pthread_create(&td.thread, NULL, fault_handler, &td);

	do_io(&td, HP_SIZE);
	return 0;
}


* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-23 15:55             ` Jens Axboe
  2025-04-23 16:07               ` 姜智伟
  2025-04-23 16:17               ` Pavel Begunkov
@ 2025-04-23 22:57               ` Jens Axboe
  2025-04-24 14:08                 ` 姜智伟
  2 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2025-04-23 22:57 UTC (permalink / raw)
  To: 姜智伟
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

On 4/23/25 9:55 AM, Jens Axboe wrote:
> Something like this, perhaps - it'll ensure that io-wq workers get a
> chance to flush out pending work, which should prevent the looping. I've
> attached a basic test case. It'll issue a write that will fault, and
> then try and cancel that as a way to trigger the TIF_NOTIFY_SIGNAL based
> looping.

Something that may actually work - use TASK_UNINTERRUPTIBLE IFF
signal_pending() is true AND the fault has already been tried once
before. If that's the case, rather than just call schedule() with
TASK_INTERRUPTIBLE, use TASK_UNINTERRUPTIBLE and schedule_timeout() with
a suitable timeout length that prevents the annoying parts busy looping.
I used HZ / 10.

I don't see how to fix userfaultfd for this case, either using io_uring
or normal write(2). Normal syscalls can pass back -ERESTARTSYS and get
it retried, but there's no way to do that from inside fault handling. So
I think we just have to be nicer about it.

Andrew, as the userfaultfd maintainer, what do you think?

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index d80f94346199..1016268c7b51 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -334,15 +334,29 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	return ret;
 }
 
-static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
+struct userfault_wait {
+	unsigned int task_state;
+	bool timeout;
+};
+
+static struct userfault_wait userfaultfd_get_blocking_state(unsigned int flags)
 {
+	/*
+	 * If the fault has already been tried AND there's a signal pending
+	 * for this task, use TASK_UNINTERRUPTIBLE with a small timeout.
+	 * This prevents busy looping where schedule() otherwise does nothing
+	 * for TASK_INTERRUPTIBLE when the task has a signal pending.
+	 */
+	if ((flags & FAULT_FLAG_TRIED) && signal_pending(current))
+		return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, true };
+
 	if (flags & FAULT_FLAG_INTERRUPTIBLE)
-		return TASK_INTERRUPTIBLE;
+		return (struct userfault_wait) { TASK_INTERRUPTIBLE, false };
 
 	if (flags & FAULT_FLAG_KILLABLE)
-		return TASK_KILLABLE;
+		return (struct userfault_wait) { TASK_KILLABLE, false };
 
-	return TASK_UNINTERRUPTIBLE;
+	return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, false };
 }
 
 /*
@@ -368,7 +382,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	struct userfaultfd_wait_queue uwq;
 	vm_fault_t ret = VM_FAULT_SIGBUS;
 	bool must_wait;
-	unsigned int blocking_state;
+	struct userfault_wait wait_mode;
 
 	/*
 	 * We don't do userfault handling for the final child pid update
@@ -466,7 +480,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	uwq.ctx = ctx;
 	uwq.waken = false;
 
-	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
+	wait_mode = userfaultfd_get_blocking_state(vmf->flags);
 
         /*
          * Take the vma lock now, in order to safely call
@@ -488,7 +502,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	 * following the spin_unlock to happen before the list_add in
 	 * __add_wait_queue.
 	 */
-	set_current_state(blocking_state);
+	set_current_state(wait_mode.task_state);
 	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
 
 	if (!is_vm_hugetlb_page(vma))
@@ -501,7 +515,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	if (likely(must_wait && !READ_ONCE(ctx->released))) {
 		wake_up_poll(&ctx->fd_wqh, EPOLLIN);
-		schedule();
+		/* See comment in userfaultfd_get_blocking_state() */
+		if (!wait_mode.timeout)
+			schedule();
+		else
+			schedule_timeout(HZ / 10);
 	}
 
 	__set_current_state(TASK_RUNNING);

-- 
Jens Axboe


* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-23 22:57               ` Jens Axboe
@ 2025-04-24 14:08                 ` 姜智伟
  2025-04-24 14:13                   ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: 姜智伟 @ 2025-04-24 14:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

Jens Axboe <axboe@kernel.dk> wrote on Thu, Apr 24, 2025 at 06:58:
>
> On 4/23/25 9:55 AM, Jens Axboe wrote:
> > Something like this, perhaps - it'll ensure that io-wq workers get a
> > chance to flush out pending work, which should prevent the looping. I've
> > attached a basic test case. It'll issue a write that will fault, and
> > then try and cancel that as a way to trigger the TIF_NOTIFY_SIGNAL based
> > looping.
>
> Something that may actually work - use TASK_UNINTERRUPTIBLE IFF
> signal_pending() is true AND the fault has already been tried once
> before. If that's the case, rather than just call schedule() with
> TASK_INTERRUPTIBLE, use TASK_UNINTERRUPTIBLE and schedule_timeout() with
> a suitable timeout length that prevents the annoying parts busy looping.
> I used HZ / 10.
>
> I don't see how to fix userfaultfd for this case, either using io_uring
> or normal write(2). Normal syscalls can pass back -ERESTARTSYS and get
> it retried, but there's no way to do that from inside fault handling. So
> I think we just have to be nicer about it.
>
> Andrew, as the userfaultfd maintainer, what do you think?
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index d80f94346199..1016268c7b51 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -334,15 +334,29 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
>         return ret;
>  }
>
> -static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
> +struct userfault_wait {
> +       unsigned int task_state;
> +       bool timeout;
> +};
> +
> +static struct userfault_wait userfaultfd_get_blocking_state(unsigned int flags)
>  {
> +       /*
> +        * If the fault has already been tried AND there's a signal pending
> +        * for this task, use TASK_UNINTERRUPTIBLE with a small timeout.
> +        * This prevents busy looping where schedule() otherwise does nothing
> +        * for TASK_INTERRUPTIBLE when the task has a signal pending.
> +        */
> +       if ((flags & FAULT_FLAG_TRIED) && signal_pending(current))
> +               return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, true };
> +
>         if (flags & FAULT_FLAG_INTERRUPTIBLE)
> -               return TASK_INTERRUPTIBLE;
> +               return (struct userfault_wait) { TASK_INTERRUPTIBLE, false };
>
>         if (flags & FAULT_FLAG_KILLABLE)
> -               return TASK_KILLABLE;
> +               return (struct userfault_wait) { TASK_KILLABLE, false };
>
> -       return TASK_UNINTERRUPTIBLE;
> +       return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, false };
>  }
>
>  /*
> @@ -368,7 +382,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>         struct userfaultfd_wait_queue uwq;
>         vm_fault_t ret = VM_FAULT_SIGBUS;
>         bool must_wait;
> -       unsigned int blocking_state;
> +       struct userfault_wait wait_mode;
>
>         /*
>          * We don't do userfault handling for the final child pid update
> @@ -466,7 +480,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>         uwq.ctx = ctx;
>         uwq.waken = false;
>
> -       blocking_state = userfaultfd_get_blocking_state(vmf->flags);
> +       wait_mode = userfaultfd_get_blocking_state(vmf->flags);
>
>          /*
>           * Take the vma lock now, in order to safely call
> @@ -488,7 +502,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>          * following the spin_unlock to happen before the list_add in
>          * __add_wait_queue.
>          */
> -       set_current_state(blocking_state);
> +       set_current_state(wait_mode.task_state);
>         spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>
>         if (!is_vm_hugetlb_page(vma))
> @@ -501,7 +515,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>
>         if (likely(must_wait && !READ_ONCE(ctx->released))) {
>                 wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> -               schedule();
> +               /* See comment in userfaultfd_get_blocking_state() */
> +               if (!wait_mode.timeout)
> +                       schedule();
> +               else
> +                       schedule_timeout(HZ / 10);
>         }
>
>         __set_current_state(TASK_RUNNING);
>
> --
> Jens Axboe
I guess the previous io_worker_fault patch might have already addressed the
issue sufficiently, so the later patch that adds a timeout for userfaultfd
might not be necessary. Wouldn't returning after a timeout just cause the same
fault to repeat indefinitely? Regardless of whether the thread is in the
UN or IN state, the expected behavior should be to wait until the page is
filled or the uffd resource is released and the task is then woken up,
which seems like the correct logic.


* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-24 14:08                 ` 姜智伟
@ 2025-04-24 14:13                   ` Jens Axboe
  2025-04-24 14:45                     ` 姜智伟
  0 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2025-04-24 14:13 UTC (permalink / raw)
  To: 姜智伟
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

On 4/24/25 8:08 AM, 姜智伟 wrote:
> Jens Axboe <axboe@kernel.dk> wrote on Thu, Apr 24, 2025, 06:58:
>>
>> On 4/23/25 9:55 AM, Jens Axboe wrote:
>>> Something like this, perhaps - it'll ensure that io-wq workers get a
>>> chance to flush out pending work, which should prevent the looping. I've
>>> attached a basic test case. It'll issue a write that will fault, and
>>> then try and cancel that as a way to trigger the TIF_NOTIFY_SIGNAL based
>>> looping.
>>
>> Something that may actually work - use TASK_UNINTERRUPTIBLE IFF
>> signal_pending() is true AND the fault has already been tried once
>> before. If that's the case, rather than just call schedule() with
>> TASK_INTERRUPTIBLE, use TASK_UNINTERRUPTIBLE and schedule_timeout() with
>> a suitable timeout length that prevents the annoying parts busy looping.
>> I used HZ / 10.
>>
>> I don't see how to fix userfaultfd for this case, either using io_uring
>> or normal write(2). Normal syscalls can pass back -ERESTARTSYS and get
>> it retried, but there's no way to do that from inside fault handling. So
>> I think we just have to be nicer about it.
>>
>> Andrew, as the userfaultfd maintainer, what do you think?
>>
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index d80f94346199..1016268c7b51 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -334,15 +334,29 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
>>         return ret;
>>  }
>>
>> -static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
>> +struct userfault_wait {
>> +       unsigned int task_state;
>> +       bool timeout;
>> +};
>> +
>> +static struct userfault_wait userfaultfd_get_blocking_state(unsigned int flags)
>>  {
>> +       /*
>> +        * If the fault has already been tried AND there's a signal pending
>> +        * for this task, use TASK_UNINTERRUPTIBLE with a small timeout.
>> +        * This prevents busy looping where schedule() otherwise does nothing
>> +        * for TASK_INTERRUPTIBLE when the task has a signal pending.
>> +        */
>> +       if ((flags & FAULT_FLAG_TRIED) && signal_pending(current))
>> +               return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, true };
>> +
>>         if (flags & FAULT_FLAG_INTERRUPTIBLE)
>> -               return TASK_INTERRUPTIBLE;
>> +               return (struct userfault_wait) { TASK_INTERRUPTIBLE, false };
>>
>>         if (flags & FAULT_FLAG_KILLABLE)
>> -               return TASK_KILLABLE;
>> +               return (struct userfault_wait) { TASK_KILLABLE, false };
>>
>> -       return TASK_UNINTERRUPTIBLE;
>> +       return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, false };
>>  }
>>
>>  /*
>> @@ -368,7 +382,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>         struct userfaultfd_wait_queue uwq;
>>         vm_fault_t ret = VM_FAULT_SIGBUS;
>>         bool must_wait;
>> -       unsigned int blocking_state;
>> +       struct userfault_wait wait_mode;
>>
>>         /*
>>          * We don't do userfault handling for the final child pid update
>> @@ -466,7 +480,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>         uwq.ctx = ctx;
>>         uwq.waken = false;
>>
>> -       blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>> +       wait_mode = userfaultfd_get_blocking_state(vmf->flags);
>>
>>          /*
>>           * Take the vma lock now, in order to safely call
>> @@ -488,7 +502,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>          * following the spin_unlock to happen before the list_add in
>>          * __add_wait_queue.
>>          */
>> -       set_current_state(blocking_state);
>> +       set_current_state(wait_mode.task_state);
>>         spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>>
>>         if (!is_vm_hugetlb_page(vma))
>> @@ -501,7 +515,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>
>>         if (likely(must_wait && !READ_ONCE(ctx->released))) {
>>                 wake_up_poll(&ctx->fd_wqh, EPOLLIN);
>> -               schedule();
>> +               /* See comment in userfaultfd_get_blocking_state() */
>> +               if (!wait_mode.timeout)
>> +                       schedule();
>> +               else
>> +                       schedule_timeout(HZ / 10);
>>         }
>>
>>         __set_current_state(TASK_RUNNING);
>>
>> --
>> Jens Axboe
> I guess the previous io_work_fault patch might have already addressed
> the issue sufficiently. The later patch that adds a timeout for
> userfaultfd might

That one isn't guaranteed to be safe, as it's not necessarily a safe
context to prune the conditions that lead to a busy loop rather than the
normal "schedule until the condition is resolved". Running task_work
should only be done at the outermost point in the kernel, where the task
state is known sane in terms of what locks etc are being held. For some
conditions the patch will work just fine, but it's not guaranteed to be
the case.

> not be necessary - wouldn't returning after a timeout just cause the
> same fault to repeat indefinitely again? Regardless of whether the
> thread is in UN or IN state, the expected behavior should be to wait
> until the page is filled or the uffd resource is released to be woken
> up, which seems like the correct logic.

Right, it'll just sleep for a bit and time out, so it's not a 100% busy loop.
That's unfortunately the best we can do for this case... The expected
behavior is indeed to schedule until we get woken, however that just
doesn't work if there are signals pending, or other conditions that lead
to TASK_INTERRUPTIBLE + schedule() being a no-op.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-24 14:13                   ` Jens Axboe
@ 2025-04-24 14:45                     ` 姜智伟
  2025-04-24 14:52                       ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: 姜智伟 @ 2025-04-24 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

Jens Axboe <axboe@kernel.dk> wrote on Thu, Apr 24, 2025, 22:13:
>
> On 4/24/25 8:08 AM, 姜智伟 wrote:
> > Jens Axboe <axboe@kernel.dk> wrote on Thu, Apr 24, 2025, 06:58:
> >>
> >> On 4/23/25 9:55 AM, Jens Axboe wrote:
> >>> Something like this, perhaps - it'll ensure that io-wq workers get a
> >>> chance to flush out pending work, which should prevent the looping. I've
> >>> attached a basic test case. It'll issue a write that will fault, and
> >>> then try and cancel that as a way to trigger the TIF_NOTIFY_SIGNAL based
> >>> looping.
> >>
> >> Something that may actually work - use TASK_UNINTERRUPTIBLE IFF
> >> signal_pending() is true AND the fault has already been tried once
> >> before. If that's the case, rather than just call schedule() with
> >> TASK_INTERRUPTIBLE, use TASK_UNINTERRUPTIBLE and schedule_timeout() with
> >> a suitable timeout length that prevents the annoying parts busy looping.
> >> I used HZ / 10.
> >>
> >> I don't see how to fix userfaultfd for this case, either using io_uring
> >> or normal write(2). Normal syscalls can pass back -ERESTARTSYS and get
> >> it retried, but there's no way to do that from inside fault handling. So
> >> I think we just have to be nicer about it.
> >>
> >> Andrew, as the userfaultfd maintainer, what do you think?
> >>
> >> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> >> index d80f94346199..1016268c7b51 100644
> >> --- a/fs/userfaultfd.c
> >> +++ b/fs/userfaultfd.c
> >> @@ -334,15 +334,29 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> >>         return ret;
> >>  }
> >>
> >> -static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
> >> +struct userfault_wait {
> >> +       unsigned int task_state;
> >> +       bool timeout;
> >> +};
> >> +
> >> +static struct userfault_wait userfaultfd_get_blocking_state(unsigned int flags)
> >>  {
> >> +       /*
> >> +        * If the fault has already been tried AND there's a signal pending
> >> +        * for this task, use TASK_UNINTERRUPTIBLE with a small timeout.
> >> +        * This prevents busy looping where schedule() otherwise does nothing
> >> +        * for TASK_INTERRUPTIBLE when the task has a signal pending.
> >> +        */
> >> +       if ((flags & FAULT_FLAG_TRIED) && signal_pending(current))
> >> +               return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, true };
> >> +
> >>         if (flags & FAULT_FLAG_INTERRUPTIBLE)
> >> -               return TASK_INTERRUPTIBLE;
> >> +               return (struct userfault_wait) { TASK_INTERRUPTIBLE, false };
> >>
> >>         if (flags & FAULT_FLAG_KILLABLE)
> >> -               return TASK_KILLABLE;
> >> +               return (struct userfault_wait) { TASK_KILLABLE, false };
> >>
> >> -       return TASK_UNINTERRUPTIBLE;
> >> +       return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, false };
> >>  }
> >>
> >>  /*
> >> @@ -368,7 +382,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >>         struct userfaultfd_wait_queue uwq;
> >>         vm_fault_t ret = VM_FAULT_SIGBUS;
> >>         bool must_wait;
> >> -       unsigned int blocking_state;
> >> +       struct userfault_wait wait_mode;
> >>
> >>         /*
> >>          * We don't do userfault handling for the final child pid update
> >> @@ -466,7 +480,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >>         uwq.ctx = ctx;
> >>         uwq.waken = false;
> >>
> >> -       blocking_state = userfaultfd_get_blocking_state(vmf->flags);
> >> +       wait_mode = userfaultfd_get_blocking_state(vmf->flags);
> >>
> >>          /*
> >>           * Take the vma lock now, in order to safely call
> >> @@ -488,7 +502,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >>          * following the spin_unlock to happen before the list_add in
> >>          * __add_wait_queue.
> >>          */
> >> -       set_current_state(blocking_state);
> >> +       set_current_state(wait_mode.task_state);
> >>         spin_unlock_irq(&ctx->fault_pending_wqh.lock);
> >>
> >>         if (!is_vm_hugetlb_page(vma))
> >> @@ -501,7 +515,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >>
> >>         if (likely(must_wait && !READ_ONCE(ctx->released))) {
> >>                 wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> >> -               schedule();
> >> +               /* See comment in userfaultfd_get_blocking_state() */
> >> +               if (!wait_mode.timeout)
> >> +                       schedule();
> >> +               else
> >> +                       schedule_timeout(HZ / 10);
> >>         }
> >>
> >>         __set_current_state(TASK_RUNNING);
> >>
> >> --
> >> Jens Axboe
> > I guess the previous io_work_fault patch might have already addressed
> > the issue sufficiently. The later patch that adds a timeout for
> > userfaultfd might
>
> That one isn't guaranteed to be safe, as it's not necessarily a safe
> context to prune the conditions that lead to a busy loop rather than the
> normal "schedule until the condition is resolved". Running task_work
> should only be done at the outermost point in the kernel, where the task
> state is known sane in terms of what locks etc are being held. For some
> conditions the patch will work just fine, but it's not guaranteed to be
> the case.
>
> > not be necessary - wouldn't returning after a timeout just cause the
> > same fault to repeat indefinitely again? Regardless of whether the
> > thread is in UN or IN state, the expected behavior should be to wait
> > until the page is filled or the uffd resource is released to be woken
> > up, which seems like the correct logic.
>
> Right, it'll just sleep timeout for a bit as not to be a 100% busy loop.
> That's unfortunately the best we can do for this case... The expected
> behavior is indeed to schedule until we get woken, however that just
> doesn't work if there are signals pending, or other conditions that lead
> to TASK_INTERRUPTIBLE + schedule() being a no-op.
>
> --
> Jens Axboe
In my testing, clearing the NOTIFY flag in the original io_work_fault
ensures that the next schedule() correctly waits. However, adding a
timeout brings the issue back: the same fault fires repeatedly again.
Also, after clearing the NOTIFY flag in handle_userfault, it's possible
that some task_work hasn't been executed yet. But if task_work_run()
isn't called again, won't that work get lost? It seems like there isn't
an appropriate place to add it back. So, do you suggest adjusting the
fault frequency in userfaultfd, making it more rhythmic, to alleviate
the issue?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-24 14:45                     ` 姜智伟
@ 2025-04-24 14:52                       ` Jens Axboe
  2025-04-24 15:12                         ` 姜智伟
  0 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2025-04-24 14:52 UTC (permalink / raw)
  To: 姜智伟
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

On 4/24/25 8:45 AM, 姜智伟 wrote:
> Jens Axboe <axboe@kernel.dk> wrote on Thu, Apr 24, 2025, 22:13:
>>
>> On 4/24/25 8:08 AM, 姜智伟 wrote:
>>> Jens Axboe <axboe@kernel.dk> wrote on Thu, Apr 24, 2025, 06:58:
>>>>
>>>> On 4/23/25 9:55 AM, Jens Axboe wrote:
>>>>> Something like this, perhaps - it'll ensure that io-wq workers get a
>>>>> chance to flush out pending work, which should prevent the looping. I've
>>>>> attached a basic test case. It'll issue a write that will fault, and
>>>>> then try and cancel that as a way to trigger the TIF_NOTIFY_SIGNAL based
>>>>> looping.
>>>>
>>>> Something that may actually work - use TASK_UNINTERRUPTIBLE IFF
>>>> signal_pending() is true AND the fault has already been tried once
>>>> before. If that's the case, rather than just call schedule() with
>>>> TASK_INTERRUPTIBLE, use TASK_UNINTERRUPTIBLE and schedule_timeout() with
>>>> a suitable timeout length that prevents the annoying parts busy looping.
>>>> I used HZ / 10.
>>>>
>>>> I don't see how to fix userfaultfd for this case, either using io_uring
>>>> or normal write(2). Normal syscalls can pass back -ERESTARTSYS and get
>>>> it retried, but there's no way to do that from inside fault handling. So
>>>> I think we just have to be nicer about it.
>>>>
>>>> Andrew, as the userfaultfd maintainer, what do you think?
>>>>
>>>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>>>> index d80f94346199..1016268c7b51 100644
>>>> --- a/fs/userfaultfd.c
>>>> +++ b/fs/userfaultfd.c
>>>> @@ -334,15 +334,29 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
>>>>         return ret;
>>>>  }
>>>>
>>>> -static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
>>>> +struct userfault_wait {
>>>> +       unsigned int task_state;
>>>> +       bool timeout;
>>>> +};
>>>> +
>>>> +static struct userfault_wait userfaultfd_get_blocking_state(unsigned int flags)
>>>>  {
>>>> +       /*
>>>> +        * If the fault has already been tried AND there's a signal pending
>>>> +        * for this task, use TASK_UNINTERRUPTIBLE with a small timeout.
>>>> +        * This prevents busy looping where schedule() otherwise does nothing
>>>> +        * for TASK_INTERRUPTIBLE when the task has a signal pending.
>>>> +        */
>>>> +       if ((flags & FAULT_FLAG_TRIED) && signal_pending(current))
>>>> +               return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, true };
>>>> +
>>>>         if (flags & FAULT_FLAG_INTERRUPTIBLE)
>>>> -               return TASK_INTERRUPTIBLE;
>>>> +               return (struct userfault_wait) { TASK_INTERRUPTIBLE, false };
>>>>
>>>>         if (flags & FAULT_FLAG_KILLABLE)
>>>> -               return TASK_KILLABLE;
>>>> +               return (struct userfault_wait) { TASK_KILLABLE, false };
>>>>
>>>> -       return TASK_UNINTERRUPTIBLE;
>>>> +       return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, false };
>>>>  }
>>>>
>>>>  /*
>>>> @@ -368,7 +382,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>>>         struct userfaultfd_wait_queue uwq;
>>>>         vm_fault_t ret = VM_FAULT_SIGBUS;
>>>>         bool must_wait;
>>>> -       unsigned int blocking_state;
>>>> +       struct userfault_wait wait_mode;
>>>>
>>>>         /*
>>>>          * We don't do userfault handling for the final child pid update
>>>> @@ -466,7 +480,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>>>         uwq.ctx = ctx;
>>>>         uwq.waken = false;
>>>>
>>>> -       blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>>>> +       wait_mode = userfaultfd_get_blocking_state(vmf->flags);
>>>>
>>>>          /*
>>>>           * Take the vma lock now, in order to safely call
>>>> @@ -488,7 +502,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>>>          * following the spin_unlock to happen before the list_add in
>>>>          * __add_wait_queue.
>>>>          */
>>>> -       set_current_state(blocking_state);
>>>> +       set_current_state(wait_mode.task_state);
>>>>         spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>>>>
>>>>         if (!is_vm_hugetlb_page(vma))
>>>> @@ -501,7 +515,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>>>
>>>>         if (likely(must_wait && !READ_ONCE(ctx->released))) {
>>>>                 wake_up_poll(&ctx->fd_wqh, EPOLLIN);
>>>> -               schedule();
>>>> +               /* See comment in userfaultfd_get_blocking_state() */
>>>> +               if (!wait_mode.timeout)
>>>> +                       schedule();
>>>> +               else
>>>> +                       schedule_timeout(HZ / 10);
>>>>         }
>>>>
>>>>         __set_current_state(TASK_RUNNING);
>>>>
>>>> --
>>>> Jens Axboe
>>> I guess the previous io_work_fault patch might have already addressed
>>> the issue sufficiently. The later patch that adds a timeout for
>>> userfaultfd might
>>
>> That one isn't guaranteed to be safe, as it's not necessarily a safe
>> context to prune the conditions that lead to a busy loop rather than the
>> normal "schedule until the condition is resolved". Running task_work
>> should only be done at the outermost point in the kernel, where the task
>> state is known sane in terms of what locks etc are being held. For some
>> conditions the patch will work just fine, but it's not guaranteed to be
>> the case.
>>
>>> not be necessary - wouldn't returning after a timeout just cause the
>>> same fault to repeat indefinitely again? Regardless of whether the
>>> thread is in UN or IN state, the expected behavior should be to wait
>>> until the page is filled or the uffd resource is released to be woken
>>> up, which seems like the correct logic.
>>
>> Right, it'll just sleep for a bit and time out, so it's not a 100% busy loop.
>> That's unfortunately the best we can do for this case... The expected
>> behavior is indeed to schedule until we get woken, however that just
>> doesn't work if there are signals pending, or other conditions that lead
>> to TASK_INTERRUPTIBLE + schedule() being a no-op.
>>
>> --
>> Jens Axboe
> In my testing, clearing the NOTIFY flag in the original io_work_fault
> ensures that the next schedule correctly waits. However, adding a

That's symptom fixing again - the NOTIFY flag is the thing that triggers
for io_uring, but any legitimate signal (or task_work added with
signaling) will cause the same issue.

> timeout causes the issue to return to multiple faults again.
> Also, after clearing the NOTIFY flag in handle_userfault,
> it's possible that some task work hasn't been executed.
> But if task_work_run isn't added back, tasks might get lost?
> It seems like there isn't an appropriate place to add it back.
> So, do you suggest adjusting the fault frequency in userfaultfd
> to make it more rhythmic to alleviate the issue?

The task_work is still there, you just removed the notification
mechanism that tells the kernel that there's task_work there. For this
particular case, you could re-set TIF_NOTIFY_SIGNAL at the end after
schedule(), but again it'd only fix that specific one case, not the
generic issue.

What's the objection to the sleep approach? If the task is woken by the
fault being filled, it'll still wake on time, no delay. If not, then it
prevents a busy loop, which is counterproductive.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-24 14:52                       ` Jens Axboe
@ 2025-04-24 15:12                         ` 姜智伟
  2025-04-24 15:21                           ` Jens Axboe
  0 siblings, 1 reply; 25+ messages in thread
From: 姜智伟 @ 2025-04-24 15:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

Jens Axboe <axboe@kernel.dk> wrote on Thu, Apr 24, 2025, 22:53:
>
> On 4/24/25 8:45 AM, 姜智伟 wrote:
> > Jens Axboe <axboe@kernel.dk> wrote on Thu, Apr 24, 2025, 22:13:
> >>
> >> On 4/24/25 8:08 AM, 姜智伟 wrote:
> >>> Jens Axboe <axboe@kernel.dk> wrote on Thu, Apr 24, 2025, 06:58:
> >>>>
> >>>> On 4/23/25 9:55 AM, Jens Axboe wrote:
> >>>>> Something like this, perhaps - it'll ensure that io-wq workers get a
> >>>>> chance to flush out pending work, which should prevent the looping. I've
> >>>>> attached a basic test case. It'll issue a write that will fault, and
> >>>>> then try and cancel that as a way to trigger the TIF_NOTIFY_SIGNAL based
> >>>>> looping.
> >>>>
> >>>> Something that may actually work - use TASK_UNINTERRUPTIBLE IFF
> >>>> signal_pending() is true AND the fault has already been tried once
> >>>> before. If that's the case, rather than just call schedule() with
> >>>> TASK_INTERRUPTIBLE, use TASK_UNINTERRUPTIBLE and schedule_timeout() with
> >>>> a suitable timeout length that prevents the annoying parts busy looping.
> >>>> I used HZ / 10.
> >>>>
> >>>> I don't see how to fix userfaultfd for this case, either using io_uring
> >>>> or normal write(2). Normal syscalls can pass back -ERESTARTSYS and get
> >>>> it retried, but there's no way to do that from inside fault handling. So
> >>>> I think we just have to be nicer about it.
> >>>>
> >>>> Andrew, as the userfaultfd maintainer, what do you think?
> >>>>
> >>>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> >>>> index d80f94346199..1016268c7b51 100644
> >>>> --- a/fs/userfaultfd.c
> >>>> +++ b/fs/userfaultfd.c
> >>>> @@ -334,15 +334,29 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> >>>>         return ret;
> >>>>  }
> >>>>
> >>>> -static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
> >>>> +struct userfault_wait {
> >>>> +       unsigned int task_state;
> >>>> +       bool timeout;
> >>>> +};
> >>>> +
> >>>> +static struct userfault_wait userfaultfd_get_blocking_state(unsigned int flags)
> >>>>  {
> >>>> +       /*
> >>>> +        * If the fault has already been tried AND there's a signal pending
> >>>> +        * for this task, use TASK_UNINTERRUPTIBLE with a small timeout.
> >>>> +        * This prevents busy looping where schedule() otherwise does nothing
> >>>> +        * for TASK_INTERRUPTIBLE when the task has a signal pending.
> >>>> +        */
> >>>> +       if ((flags & FAULT_FLAG_TRIED) && signal_pending(current))
> >>>> +               return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, true };
> >>>> +
> >>>>         if (flags & FAULT_FLAG_INTERRUPTIBLE)
> >>>> -               return TASK_INTERRUPTIBLE;
> >>>> +               return (struct userfault_wait) { TASK_INTERRUPTIBLE, false };
> >>>>
> >>>>         if (flags & FAULT_FLAG_KILLABLE)
> >>>> -               return TASK_KILLABLE;
> >>>> +               return (struct userfault_wait) { TASK_KILLABLE, false };
> >>>>
> >>>> -       return TASK_UNINTERRUPTIBLE;
> >>>> +       return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, false };
> >>>>  }
> >>>>
> >>>>  /*
> >>>> @@ -368,7 +382,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >>>>         struct userfaultfd_wait_queue uwq;
> >>>>         vm_fault_t ret = VM_FAULT_SIGBUS;
> >>>>         bool must_wait;
> >>>> -       unsigned int blocking_state;
> >>>> +       struct userfault_wait wait_mode;
> >>>>
> >>>>         /*
> >>>>          * We don't do userfault handling for the final child pid update
> >>>> @@ -466,7 +480,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >>>>         uwq.ctx = ctx;
> >>>>         uwq.waken = false;
> >>>>
> >>>> -       blocking_state = userfaultfd_get_blocking_state(vmf->flags);
> >>>> +       wait_mode = userfaultfd_get_blocking_state(vmf->flags);
> >>>>
> >>>>          /*
> >>>>           * Take the vma lock now, in order to safely call
> >>>> @@ -488,7 +502,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >>>>          * following the spin_unlock to happen before the list_add in
> >>>>          * __add_wait_queue.
> >>>>          */
> >>>> -       set_current_state(blocking_state);
> >>>> +       set_current_state(wait_mode.task_state);
> >>>>         spin_unlock_irq(&ctx->fault_pending_wqh.lock);
> >>>>
> >>>>         if (!is_vm_hugetlb_page(vma))
> >>>> @@ -501,7 +515,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >>>>
> >>>>         if (likely(must_wait && !READ_ONCE(ctx->released))) {
> >>>>                 wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> >>>> -               schedule();
> >>>> +               /* See comment in userfaultfd_get_blocking_state() */
> >>>> +               if (!wait_mode.timeout)
> >>>> +                       schedule();
> >>>> +               else
> >>>> +                       schedule_timeout(HZ / 10);
> >>>>         }
> >>>>
> >>>>         __set_current_state(TASK_RUNNING);
> >>>>
> >>>> --
> >>>> Jens Axboe
> >>> I guess the previous io_work_fault patch might have already addressed
> >>> the issue sufficiently. The later patch that adds a timeout for
> >>> userfaultfd might
> >>
> >> That one isn't guaranteed to be safe, as it's not necessarily a safe
> >> context to prune the conditions that lead to a busy loop rather than the
> >> normal "schedule until the condition is resolved". Running task_work
> >> should only be done at the outermost point in the kernel, where the task
> >> state is known sane in terms of what locks etc are being held. For some
> >> conditions the patch will work just fine, but it's not guaranteed to be
> >> the case.
> >>
> >>> not be necessary - wouldn't returning after a timeout just cause the
> >>> same fault to repeat indefinitely again? Regardless of whether the
> >>> thread is in UN or IN state, the expected behavior should be to wait
> >>> until the page is filled or the uffd resource is released to be woken
> >>> up, which seems like the correct logic.
> >>
> >> Right, it'll just sleep for a bit and time out, so it's not a 100% busy loop.
> >> That's unfortunately the best we can do for this case... The expected
> >> behavior is indeed to schedule until we get woken, however that just
> >> doesn't work if there are signals pending, or other conditions that lead
> >> to TASK_INTERRUPTIBLE + schedule() being a no-op.
> >>
> >> --
> >> Jens Axboe
> > In my testing, clearing the NOTIFY flag in the original io_work_fault
> > ensures that the next schedule correctly waits. However, adding a
>
> That's symptom fixing again - the NOTIFY flag is the thing that triggers
> for io_uring, but any legitimate signal (or task_work added with
> signaling) will cause the same issue.
>
> > timeout causes the issue to return to multiple faults again.
> > Also, after clearing the NOTIFY flag in handle_userfault,
> > it's possible that some task work hasn't been executed.
> > But if task_work_run isn't added back, tasks might get lost?
> > It seems like there isn't an appropriate place to add it back.
> > So, do you suggest adjusting the fault frequency in userfaultfd
> > to make it more rhythmic to alleviate the issue?
>
> The task_work is still there, you just removed the notification
> mechanism that tells the kernel that there's task_work there. For this
> particular case, you could re-set TIF_NOTIFY_SIGNAL at the end after
> schedule(), but again it'd only fix that specific one case, not the
> generic issue.
>
> What's the objection to the sleep approach? If the task is woken by the
> fault being filled, it'll still wake on time, no delay. If not, then it
> prevents a busy loop, which is counterproductive.
>
> --
> Jens Axboe
OK, thanks. I'm curious what exactly is meant by a 'specific one case'
and what qualifies as a 'generic issue' with re-setting
TIF_NOTIFY_SIGNAL. So, in your final opinion, do you think the io_uring
code is not suitable for modification, and that we should instead focus
on adjustments in userfaultfd to mitigate the issue?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-24 15:12                         ` 姜智伟
@ 2025-04-24 15:21                           ` Jens Axboe
  2025-04-24 15:51                             ` 姜智伟
  0 siblings, 1 reply; 25+ messages in thread
From: Jens Axboe @ 2025-04-24 15:21 UTC (permalink / raw)
  To: 姜智伟
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

On 4/24/25 9:12 AM, 姜智伟 wrote:
> Jens Axboe <axboe@kernel.dk> wrote on Thu, Apr 24, 2025, 22:53:
>>
>> On 4/24/25 8:45 AM, 姜智伟 wrote:
>>> Jens Axboe <axboe@kernel.dk> wrote on Thu, Apr 24, 2025, 22:13:
>>>>
>>>> On 4/24/25 8:08 AM, 姜智伟 wrote:
>>>>> Jens Axboe <axboe@kernel.dk> wrote on Thu, Apr 24, 2025, 06:58:
>>>>>>
>>>>>> On 4/23/25 9:55 AM, Jens Axboe wrote:
>>>>>>> Something like this, perhaps - it'll ensure that io-wq workers get a
>>>>>>> chance to flush out pending work, which should prevent the looping. I've
>>>>>>> attached a basic test case. It'll issue a write that will fault, and
>>>>>>> then try and cancel that as a way to trigger the TIF_NOTIFY_SIGNAL based
>>>>>>> looping.
>>>>>>
>>>>>> Something that may actually work - use TASK_UNINTERRUPTIBLE IFF
>>>>>> signal_pending() is true AND the fault has already been tried once
>>>>>> before. If that's the case, rather than just call schedule() with
>>>>>> TASK_INTERRUPTIBLE, use TASK_UNINTERRUPTIBLE and schedule_timeout() with
>>>>>> a suitable timeout length that prevents the annoying parts busy looping.
>>>>>> I used HZ / 10.
>>>>>>
>>>>>> I don't see how to fix userfaultfd for this case, either using io_uring
>>>>>> or normal write(2). Normal syscalls can pass back -ERESTARTSYS and get
>>>>>> it retried, but there's no way to do that from inside fault handling. So
>>>>>> I think we just have to be nicer about it.
>>>>>>
>>>>>> Andrew, as the userfaultfd maintainer, what do you think?
>>>>>>
>>>>>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>>>>>> index d80f94346199..1016268c7b51 100644
>>>>>> --- a/fs/userfaultfd.c
>>>>>> +++ b/fs/userfaultfd.c
>>>>>> @@ -334,15 +334,29 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
>>>>>>         return ret;
>>>>>>  }
>>>>>>
>>>>>> -static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
>>>>>> +struct userfault_wait {
>>>>>> +       unsigned int task_state;
>>>>>> +       bool timeout;
>>>>>> +};
>>>>>> +
>>>>>> +static struct userfault_wait userfaultfd_get_blocking_state(unsigned int flags)
>>>>>>  {
>>>>>> +       /*
>>>>>> +        * If the fault has already been tried AND there's a signal pending
>>>>>> +        * for this task, use TASK_UNINTERRUPTIBLE with a small timeout.
>>>>>> +        * This prevents busy looping where schedule() otherwise does nothing
>>>>>> +        * for TASK_INTERRUPTIBLE when the task has a signal pending.
>>>>>> +        */
>>>>>> +       if ((flags & FAULT_FLAG_TRIED) && signal_pending(current))
>>>>>> +               return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, true };
>>>>>> +
>>>>>>         if (flags & FAULT_FLAG_INTERRUPTIBLE)
>>>>>> -               return TASK_INTERRUPTIBLE;
>>>>>> +               return (struct userfault_wait) { TASK_INTERRUPTIBLE, false };
>>>>>>
>>>>>>         if (flags & FAULT_FLAG_KILLABLE)
>>>>>> -               return TASK_KILLABLE;
>>>>>> +               return (struct userfault_wait) { TASK_KILLABLE, false };
>>>>>>
>>>>>> -       return TASK_UNINTERRUPTIBLE;
>>>>>> +       return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, false };
>>>>>>  }
>>>>>>
>>>>>>  /*
>>>>>> @@ -368,7 +382,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>>>>>         struct userfaultfd_wait_queue uwq;
>>>>>>         vm_fault_t ret = VM_FAULT_SIGBUS;
>>>>>>         bool must_wait;
>>>>>> -       unsigned int blocking_state;
>>>>>> +       struct userfault_wait wait_mode;
>>>>>>
>>>>>>         /*
>>>>>>          * We don't do userfault handling for the final child pid update
>>>>>> @@ -466,7 +480,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>>>>>         uwq.ctx = ctx;
>>>>>>         uwq.waken = false;
>>>>>>
>>>>>> -       blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>>>>>> +       wait_mode = userfaultfd_get_blocking_state(vmf->flags);
>>>>>>
>>>>>>          /*
>>>>>>           * Take the vma lock now, in order to safely call
>>>>>> @@ -488,7 +502,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>>>>>          * following the spin_unlock to happen before the list_add in
>>>>>>          * __add_wait_queue.
>>>>>>          */
>>>>>> -       set_current_state(blocking_state);
>>>>>> +       set_current_state(wait_mode.task_state);
>>>>>>         spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>>>>>>
>>>>>>         if (!is_vm_hugetlb_page(vma))
>>>>>> @@ -501,7 +515,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>>>>>
>>>>>>         if (likely(must_wait && !READ_ONCE(ctx->released))) {
>>>>>>                 wake_up_poll(&ctx->fd_wqh, EPOLLIN);
>>>>>> -               schedule();
>>>>>> +               /* See comment in userfaultfd_get_blocking_state() */
>>>>>> +               if (!wait_mode.timeout)
>>>>>> +                       schedule();
>>>>>> +               else
>>>>>> +                       schedule_timeout(HZ / 10);
>>>>>>         }
>>>>>>
>>>>>>         __set_current_state(TASK_RUNNING);
>>>>>>
>>>>>> --
>>>>>> Jens Axboe
>>>>> I guess the previous io_work_fault patch might have already addressed
>>>>> the issue sufficiently. The later patch that adds a timeout for
>>>>> userfaultfd might
>>>>
>>>> That one isn't guaranteed to be safe, as it's not necessarily a safe
>>>> context to prune the conditions that lead to a busy loop rather than the
>>>> normal "schedule until the condition is resolved". Running task_work
>>>> should only be done at the outermost point in the kernel, where the task
>>>> state is known sane in terms of what locks etc are being held. For some
>>>> conditions the patch will work just fine, but it's not guaranteed to be
>>>> the case.
>>>>
>>>>> not be necessary. Wouldn't returning after a timeout just cause the
>>>>> same fault to repeat indefinitely again? Regardless of whether the
>>>>> thread is in the UN or IN state, the expected behavior should be to
>>>>> wait until the page is filled or the uffd resource is released, and
>>>>> only then be woken up, which seems like the correct logic.
>>>>
>>>> Right, it'll just sleep on a timeout for a bit so as not to be a 100% busy loop.
>>>> That's unfortunately the best we can do for this case... The expected
>>>> behavior is indeed to schedule until we get woken, however that just
>>>> doesn't work if there are signals pending, or other conditions that lead
>>>> to TASK_INTERRUPTIBLE + schedule() being a no-op.
>>>>
>>>> --
>>>> Jens Axboe
>>> In my testing, clearing the NOTIFY flag in the original io_work_fault
>>> ensures that the next schedule correctly waits. However, adding a
>>
>> That's symptom fixing again - the NOTIFY flag is the thing that triggers
>> for io_uring, but any legitimate signal (or task_work added with
>> signaling) will cause the same issue.
>>
>>> timeout causes the issue to return to multiple faults again.
>>> Also, after clearing the NOTIFY flag in handle_userfault,
>>> it's possible that some task_work hasn't been executed.
>>> But if task_work_run isn't added back, tasks might get lost?
>>> It seems like there isn't an appropriate place to add it back.
>>> So, do you suggest adjusting the fault frequency in userfaultfd
>>> to make it more rhythmic, to alleviate the issue?
>>
>> The task_work is still there, you just removed the notification
>> mechanism that tells the kernel that there's task_work there. For this
>> particular case, you could re-set TIF_NOTIFY_SIGNAL at the end after
>> schedule(), but again it'd only fix that specific one case, not the
>> generic issue.
>>
>> What's the objection to the sleep approach? If the task is woken by the
>> fault being filled, it'll still wake on time, no delay. If not, then it
>> prevents a busy loop, which is counterproductive.
>>
>> --
>> Jens Axboe
> OK, thanks. And I'm curious about what exactly is meant by a
> 'specific one case' and what qualifies as a 'generic issue' with
> re-setting TIF_NOTIFY_SIGNAL.

I already outlined that in earlier replies; find the email that lists
the various conditions that can cause schedule() with TASK_INTERRUPTIBLE
to return immediately rather than sleeping. TIF_NOTIFY_SIGNAL is _one_
such condition, not _all_ of them.

> So, in your final opinion, do you think the code in io_uring is not
> suitable for modification, and that we should focus on making
> adjustments in userfaultfd to mitigate the issue?

The problem isn't in io_uring in the first place, you just happened to
trip over it via that path. I even sent out a test case that
demonstrates how to trigger this without io_uring as well. I'm a bit
puzzled as to why all of this isn't clear already.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios
  2025-04-24 15:21                           ` Jens Axboe
@ 2025-04-24 15:51                             ` 姜智伟
  0 siblings, 0 replies; 25+ messages in thread
From: 姜智伟 @ 2025-04-24 15:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: viro, brauner, jack, akpm, peterx, asml.silence, linux-fsdevel,
	linux-mm, linux-kernel, io-uring

Jens Axboe <axboe@kernel.dk> wrote on Thu, Apr 24, 2025 at 23:21:
>
> On 4/24/25 9:12 AM, 姜智伟 wrote:
> > Jens Axboe <axboe@kernel.dk> ?2025?4?24??? 22:53???
> >>
> >> On 4/24/25 8:45 AM, ??? wrote:
> >>> Jens Axboe <axboe@kernel.dk> ?2025?4?24??? 22:13???
> >>>>
> >>>> On 4/24/25 8:08 AM, ??? wrote:
> >>>>> Jens Axboe <axboe@kernel.dk> ?2025?4?24??? 06:58???
> >>>>>>
> >>>>>> On 4/23/25 9:55 AM, Jens Axboe wrote:
> >>>>>>> Something like this, perhaps - it'll ensure that io-wq workers get a
> >>>>>>> chance to flush out pending work, which should prevent the looping. I've
> >>>>>>> attached a basic test case. It'll issue a write that will fault, and
> >>>>>>> then try and cancel that as a way to trigger the TIF_NOTIFY_SIGNAL based
> >>>>>>> looping.
> >>>>>>
> >>>>>> Something that may actually work - use TASK_UNINTERRUPTIBLE IFF
> >>>>>> signal_pending() is true AND the fault has already been tried once
> >>>>>> before. If that's the case, rather than just call schedule() with
> >>>>>> TASK_INTERRUPTIBLE, use TASK_UNINTERRUPTIBLE and schedule_timeout() with
> >>>>>> a suitable timeout length that prevents the annoying parts busy looping.
> >>>>>> I used HZ / 10.
> >>>>>>
> >>>>>> I don't see how to fix userfaultfd for this case, either using io_uring
> >>>>>> or normal write(2). Normal syscalls can pass back -ERESTARTSYS and get
> >>>>>> it retried, but there's no way to do that from inside fault handling. So
> >>>>>> I think we just have to be nicer about it.
> >>>>>>
> >>>>>> Andrew, as the userfaultfd maintainer, what do you think?
> >>>>>>
> >>>>>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> >>>>>> index d80f94346199..1016268c7b51 100644
> >>>>>> --- a/fs/userfaultfd.c
> >>>>>> +++ b/fs/userfaultfd.c
> >>>>>> @@ -334,15 +334,29 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> >>>>>>         return ret;
> >>>>>>  }
> >>>>>>
> >>>>>> -static inline unsigned int userfaultfd_get_blocking_state(unsigned int flags)
> >>>>>> +struct userfault_wait {
> >>>>>> +       unsigned int task_state;
> >>>>>> +       bool timeout;
> >>>>>> +};
> >>>>>> +
> >>>>>> +static struct userfault_wait userfaultfd_get_blocking_state(unsigned int flags)
> >>>>>>  {
> >>>>>> +       /*
> >>>>>> +        * If the fault has already been tried AND there's a signal pending
> >>>>>> +        * for this task, use TASK_UNINTERRUPTIBLE with a small timeout.
> >>>>>> +        * This prevents busy looping where schedule() otherwise does nothing
> >>>>>> +        * for TASK_INTERRUPTIBLE when the task has a signal pending.
> >>>>>> +        */
> >>>>>> +       if ((flags & FAULT_FLAG_TRIED) && signal_pending(current))
> >>>>>> +               return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, true };
> >>>>>> +
> >>>>>>         if (flags & FAULT_FLAG_INTERRUPTIBLE)
> >>>>>> -               return TASK_INTERRUPTIBLE;
> >>>>>> +               return (struct userfault_wait) { TASK_INTERRUPTIBLE, false };
> >>>>>>
> >>>>>>         if (flags & FAULT_FLAG_KILLABLE)
> >>>>>> -               return TASK_KILLABLE;
> >>>>>> +               return (struct userfault_wait) { TASK_KILLABLE, false };
> >>>>>>
> >>>>>> -       return TASK_UNINTERRUPTIBLE;
> >>>>>> +       return (struct userfault_wait) { TASK_UNINTERRUPTIBLE, false };
> >>>>>>  }
> >>>>>>
> >>>>>>  /*
> >>>>>> @@ -368,7 +382,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >>>>>>         struct userfaultfd_wait_queue uwq;
> >>>>>>         vm_fault_t ret = VM_FAULT_SIGBUS;
> >>>>>>         bool must_wait;
> >>>>>> -       unsigned int blocking_state;
> >>>>>> +       struct userfault_wait wait_mode;
> >>>>>>
> >>>>>>         /*
> >>>>>>          * We don't do userfault handling for the final child pid update
> >>>>>> @@ -466,7 +480,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >>>>>>         uwq.ctx = ctx;
> >>>>>>         uwq.waken = false;
> >>>>>>
> >>>>>> -       blocking_state = userfaultfd_get_blocking_state(vmf->flags);
> >>>>>> +       wait_mode = userfaultfd_get_blocking_state(vmf->flags);
> >>>>>>
> >>>>>>          /*
> >>>>>>           * Take the vma lock now, in order to safely call
> >>>>>> @@ -488,7 +502,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >>>>>>          * following the spin_unlock to happen before the list_add in
> >>>>>>          * __add_wait_queue.
> >>>>>>          */
> >>>>>> -       set_current_state(blocking_state);
> >>>>>> +       set_current_state(wait_mode.task_state);
> >>>>>>         spin_unlock_irq(&ctx->fault_pending_wqh.lock);
> >>>>>>
> >>>>>>         if (!is_vm_hugetlb_page(vma))
> >>>>>> @@ -501,7 +515,11 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >>>>>>
> >>>>>>         if (likely(must_wait && !READ_ONCE(ctx->released))) {
> >>>>>>                 wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> >>>>>> -               schedule();
> >>>>>> +               /* See comment in userfaultfd_get_blocking_state() */
> >>>>>> +               if (!wait_mode.timeout)
> >>>>>> +                       schedule();
> >>>>>> +               else
> >>>>>> +                       schedule_timeout(HZ / 10);
> >>>>>>         }
> >>>>>>
> >>>>>>         __set_current_state(TASK_RUNNING);
> >>>>>>
> >>>>>> --
> >>>>>> Jens Axboe
> >>>>> I guess the previous io_work_fault patch might have already addressed
> >>>>> the issue sufficiently. The later patch that adds a timeout for
> >>>>> userfaultfd might
> >>>>
> >>>> That one isn't guaranteed to be safe, as it's not necessarily a safe
> >>>> context to prune the conditions that lead to a busy loop rather than the
> >>>> normal "schedule until the condition is resolved". Running task_work
> >>>> should only be done at the outermost point in the kernel, where the task
> >>>> state is known sane in terms of what locks etc are being held. For some
> >>>> conditions the patch will work just fine, but it's not guaranteed to be
> >>>> the case.
> >>>>
> >>>>> not be necessary  wouldn?t returning after a timeout just cause the
> >>>>> same fault to repeat indefinitely again? Regardless of whether the
> >>>>> thread is in UN or IN state, the expected behavior should be to wait
> >>>>> until the page is filled or the uffd resource is released to be woken
> >>>>> up, which seems like the correct logic.
> >>>>
> >>>> Right, it'll just sleep timeout for a bit as not to be a 100% busy loop.
> >>>> That's unfortunately the best we can do for this case... The expected
> >>>> behavior is indeed to schedule until we get woken, however that just
> >>>> doesn't work if there are signals pending, or other conditions that lead
> >>>> to TASK_INTERRUPTIBLE + schedule() being a no-op.
> >>>>
> >>>> --
> >>>> Jens Axboe
> >>> In my testing, clearing the NOTIFY flag in the original io_work_fault
> >>> ensures that the next schedule correctly waits. However, adding a
> >>
> >> That's symptom fixing again - the NOTIFY flag is the thing that triggers
> >> for io_uring, but any legitimate signal (or task_work added with
> >> signaling) will cause the same issue.
> >>
> >>> timeout causes the issue to return to multiple faults again.
> >>> Also, after clearing the NOTIFY flag in handle_userfault,
> >>> it?s possible that some task work hasn?t been executed.
> >>> But if task_work_run isn?t added back, tasks might get lost?
> >>> It seems like there isn?t an appropriate place to add it back.
> >>> So, do you suggest adjusting the fault frequency in userfaultfd
> >>> to make it more rhythmic to alleviate the issue?
> >>
> >> The task_work is still there, you just removed the notification
> >> mechanism that tells the kernel that there's task_work there. For this
> >> particular case, you could re-set TIF_NOTIFY_SIGNAL at the end after
> >> schedule(), but again it'd only fix that specific one case, not the
> >> generic issue.
> >>
> >> What's the objection to the sleep approach? If the task is woken by the
> >> fault being filled, it'll still wake on time, no delay. If not, then it
> >> prevents a busy loop, which is counterproductive.
> >>
> >> --
> >> Jens Axboe
> > OK, thanks. And I'm curious about what exactly is meant by a
> > 'specific one case' and what qualifies as a 'generic issue' with
> > re-setting TIF_NOTIFY_SIGNAL.
>
> I already outlined that in earlier replies; find the email that lists
> the various conditions that can cause schedule() with TASK_INTERRUPTIBLE
> to return immediately rather than sleeping. TIF_NOTIFY_SIGNAL is _one_
> such condition, not _all_ of them.
>
> > So, in your final opinion, do you think the code in io_uring is not
> > suitable for modification, and that we should focus on making
> > adjustments in userfaultfd to mitigate the issue?
>
> The problem isn't in io_uring in the first place, you just happened to
> trip over it via that path. I even sent out a test case that
> demonstrates how to trigger this without io_uring as well. I'm a bit
> puzzled as to why all of this isn't clear already.
>
> --
> Jens Axboe
Thank you. I think the final solution for my scenario may be user-space
monitoring of the uffd service thread, making sure it can unregister the
uffd memory region. That lets the faulting process exit the loop in the
IOU worker via an error return, which in turn gets the VM process out of
its D state.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-04-24 15:51 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-22 16:29 [PATCH v2 0/2] Fix 100% CPU usage issue in IOU worker threads Zhiwei Jiang
2025-04-22 16:29 ` [PATCH v2 1/2] io_uring: Add new functions to handle user fault scenarios Zhiwei Jiang
2025-04-22 16:32   ` Jens Axboe
2025-04-22 17:04     ` 姜智伟
2025-04-22 17:33       ` Jens Axboe
2025-04-23  2:49         ` 姜智伟
2025-04-23  3:11           ` 姜智伟
2025-04-23  6:22             ` 姜智伟
2025-04-23 13:34           ` Jens Axboe
2025-04-23 14:29             ` 姜智伟
2025-04-23 15:10               ` Jens Axboe
2025-04-23 18:55                 ` Jens Axboe
2025-04-23 15:55             ` Jens Axboe
2025-04-23 16:07               ` 姜智伟
2025-04-23 16:17               ` Pavel Begunkov
2025-04-23 16:23                 ` Jens Axboe
2025-04-23 22:57               ` Jens Axboe
2025-04-24 14:08                 ` 姜智伟
2025-04-24 14:13                   ` Jens Axboe
2025-04-24 14:45                     ` 姜智伟
2025-04-24 14:52                       ` Jens Axboe
2025-04-24 15:12                         ` 姜智伟
2025-04-24 15:21                           ` Jens Axboe
2025-04-24 15:51                             ` 姜智伟
2025-04-22 16:29 ` [PATCH v2 2/2] userfaultfd: Set the corresponding flag in IOU worker context Zhiwei Jiang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox