public inbox for io-uring@vger.kernel.org
* [PATCH v1 0/2] io_uring/io-wq: let workers exit when unused
@ 2026-02-02 14:37 Li Chen
  2026-02-02 14:37 ` [PATCH v1 1/2] io-wq: add exit-on-idle mode Li Chen
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Li Chen @ 2026-02-02 14:37 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Pavel Begunkov, io-uring, linux-kernel

io_uring uses io-wq to offload regular file I/O. When that happens, the kernel
creates per-task iou-wrk-<tgid> workers (PF_IO_WORKER) via create_io_thread(),
so the worker is part of the process thread group and shows up under
/proc/<pid>/task/.
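
For example, a read submitted with IOSQE_ASYNC is punted straight to io-wq,
after which an iou-wrk thread shows up next to the submitting thread. A
minimal sketch using liburing (error handling omitted, build with -luring):

  #include <fcntl.h>
  #include <unistd.h>
  #include <liburing.h>

  int main(void)
  {
          struct io_uring ring;
          struct io_uring_sqe *sqe;
          struct io_uring_cqe *cqe;
          char buf[4096];
          int fd = open("/etc/hostname", O_RDONLY);

          io_uring_queue_init(8, &ring, 0);
          sqe = io_uring_get_sqe(&ring);
          io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
          sqe->flags |= IOSQE_ASYNC;      /* force punt to io-wq */
          io_uring_submit(&ring);
          io_uring_wait_cqe(&ring, &cqe);
          io_uring_cqe_seen(&ring, cqe);

          io_uring_queue_exit(&ring);     /* close the last ring */
          close(fd);
          /* without this series: iou-wrk-<tgid> stays in /proc/self/task */
          pause();
          return 0;
  }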

io-wq shrinks the pool on idle, but it intentionally keeps the last worker
around indefinitely as a keepalive to avoid churn. Combined with io_uring's
per-task context lifetime (tctx stays attached to the task until exit), a
process may permanently retain an idle iou-wrk thread even after it has closed
its last io_uring instance and has no active rings.

The keepalive behavior is a reasonable default: workloads may have bursty
I/O patterns, and always tearing down the last worker would add thread churn
and latency. Creating io-wq workers goes through create_io_thread()
(copy_process()), which is not cheap to do repeatedly.

However, CRIU currently doesn't cope well with such workers being part of the
checkpointed thread group. The iou-wrk thread is a kernel-managed worker
(PF_IO_WORKER) running io_wq_worker() on a kernel stack, rather than a normal
userspace thread executing application code.

Besides the resource overhead and the surprising userspace-visible thread,
this breaks checkpoint/restore: CRIU needs to freeze and dump all threads in
the thread group, and with a lingering iou-wrk thread we observed criu dump
hang even after the ring had been quiesced and the io_uring fd closed, e.g.:

  criu dump -t $PID -D images -o dump.log -v4 --shell-job
  ps -T -p $PID -o pid,tid,comm | grep iou-wrk

This series is a kernel-side enabler for checkpoint/restore in the current
reality where userspace needs to quiesce and close io_uring rings before dump.
It is not trying to make io_uring rings checkpointable, nor does it change what
CRIU can or cannot restore (e.g. in-flight SQEs/CQEs, SQPOLL, SQE128/CQE32,
registered resources). Even if userspace gains limited io_uring support,
this series only targets the specific "no active io_uring contexts left, but an
idle iou-wrk keepalive thread remains" case.

This series adds an explicit exit-on-idle mode to io-wq, and toggles it from
io_uring task context when the task has no active io_uring contexts
(xa_empty(&tctx->xa)). The mode is cleared on subsequent io_uring usage, so the
default behavior for active io_uring users is unchanged.
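
Concretely, the toggle points (excerpted from patch 2) are:

  /* __io_uring_add_tctx_node(): any new io_uring usage */
  if (tctx->io_wq)
          io_wq_set_exit_on_idle(tctx->io_wq, false);

  /* io_uring_del_tctx_node(): last ctx removed from this task */
  if (xa_empty(&tctx->xa) && tctx->io_wq)
          io_wq_set_exit_on_idle(tctx->io_wq, true);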

Tested on x86_64 with CRIU 4.2.
With this series applied, the iou-wrk thread exited within ~200ms of closing
the ring and criu dump completed.
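
The worker exit can also be observed from inside the process after
io_uring_queue_exit(); a rough sketch of such a check (hypothetical helper,
error handling omitted):

  #include <dirent.h>
  #include <stdbool.h>
  #include <unistd.h>

  /* true once /proc/self/task lists only the main thread */
  static bool io_workers_gone(void)
  {
          DIR *d = opendir("/proc/self/task");
          struct dirent *de;
          int nr_threads = 0;

          while ((de = readdir(d)) != NULL)
                  if (de->d_name[0] != '.')       /* skip "." and ".." */
                          nr_threads++;
          closedir(d);
          return nr_threads == 1;
  }

  /* after closing the last ring: */
  while (!io_workers_gone())
          usleep(10 * 1000);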

Li Chen (2):
  io-wq: add exit-on-idle mode
  io_uring: allow io-wq workers to exit when unused

 io_uring/io-wq.c | 31 +++++++++++++++++++++++++++++++
 io_uring/io-wq.h |  1 +
 io_uring/tctx.c  | 11 +++++++++++
 3 files changed, 43 insertions(+)

-- 
2.52.0

* [PATCH v1 1/2] io-wq: add exit-on-idle mode
  2026-02-02 14:37 [PATCH v1 0/2] io_uring/io-wq: let workers exit when unused Li Chen
@ 2026-02-02 14:37 ` Li Chen
  2026-02-02 14:52   ` Jens Axboe
  2026-02-02 15:14   ` Jens Axboe
  2026-02-02 14:37 ` [PATCH v1 2/2] io_uring: allow io-wq workers to exit when unused Li Chen
  2026-02-02 15:21 ` [PATCH v1 0/2] io_uring/io-wq: let workers " Jens Axboe
  2 siblings, 2 replies; 6+ messages in thread
From: Li Chen @ 2026-02-02 14:37 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring, linux-kernel; +Cc: Li Chen

io-wq uses an idle timeout to shrink the pool, but keeps the last worker
around indefinitely to avoid churn.

For tasks that have used io_uring for file I/O and then stop using it, this
can leave an iou-wrk-* thread behind even after all io_uring instances are
gone. This is unnecessary overhead and also gets in the way of process
checkpoint/restore.

Add an exit-on-idle mode that makes all io-wq workers exit as soon as they
become idle, and provide io_wq_set_exit_on_idle() to toggle it.

Signed-off-by: Li Chen <me@linux.beauty>
---
 io_uring/io-wq.c | 31 +++++++++++++++++++++++++++++++
 io_uring/io-wq.h |  1 +
 2 files changed, 32 insertions(+)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 5d0928f37471..97e7eb847c6e 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -35,6 +35,7 @@ enum {
 
 enum {
 	IO_WQ_BIT_EXIT		= 0,	/* wq exiting */
+	IO_WQ_BIT_EXIT_ON_IDLE	= 1,	/* allow all workers to exit on idle */
 };
 
 enum {
@@ -655,6 +656,18 @@ static int io_wq_worker(void *data)
 			io_worker_handle_work(acct, worker);
 
 		raw_spin_lock(&wq->lock);
+		/*
+		 * If wq is marked idle-exit, drop this worker as soon as it
+		 * becomes idle. This is used to avoid keeping io-wq worker
+		 * threads around for tasks that no longer have any active
+		 * io_uring instances.
+		 */
+		if (test_bit(IO_WQ_BIT_EXIT_ON_IDLE, &wq->state)) {
+			acct->nr_workers--;
+			raw_spin_unlock(&wq->lock);
+			__set_current_state(TASK_RUNNING);
+			break;
+		}
 		/*
 		 * Last sleep timed out. Exit if we're not the last worker,
 		 * or if someone modified our affinity.
@@ -894,6 +907,24 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data)
 	return false;
 }
 
+void io_wq_set_exit_on_idle(struct io_wq *wq, bool enable)
+{
+	if (!wq->task)
+		return;
+
+	if (!enable) {
+		clear_bit(IO_WQ_BIT_EXIT_ON_IDLE, &wq->state);
+		return;
+	}
+
+	if (test_and_set_bit(IO_WQ_BIT_EXIT_ON_IDLE, &wq->state))
+		return;
+
+	rcu_read_lock();
+	io_wq_for_each_worker(wq, io_wq_worker_wake, NULL);
+	rcu_read_unlock();
+}
+
 static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
 {
 	do {
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index b3b004a7b625..f7f17a23693e 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -46,6 +46,7 @@ struct io_wq_data {
 struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data);
 void io_wq_exit_start(struct io_wq *wq);
 void io_wq_put_and_exit(struct io_wq *wq);
+void io_wq_set_exit_on_idle(struct io_wq *wq, bool enable);
 
 void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work);
 void io_wq_hash_work(struct io_wq_work *work, void *val);
-- 
2.52.0


* [PATCH v1 2/2] io_uring: allow io-wq workers to exit when unused
  2026-02-02 14:37 [PATCH v1 0/2] io_uring/io-wq: let workers exit when unused Li Chen
  2026-02-02 14:37 ` [PATCH v1 1/2] io-wq: add exit-on-idle mode Li Chen
@ 2026-02-02 14:37 ` Li Chen
  2026-02-02 15:21 ` [PATCH v1 0/2] io_uring/io-wq: let workers " Jens Axboe
  2 siblings, 0 replies; 6+ messages in thread
From: Li Chen @ 2026-02-02 14:37 UTC (permalink / raw)
  To: Jens Axboe, Pavel Begunkov, io-uring, linux-kernel; +Cc: Li Chen

io_uring keeps a per-task io-wq around, even when the task no longer has
any io_uring instances.

If the task previously used io_uring for file I/O, this can leave an idle
iou-wrk-* worker thread behind after the last io_uring instance is gone.

When the last io_uring ctx is removed from the task context, mark the io-wq
as exit-on-idle so its workers can go away. Clear the flag again on
subsequent io_uring usage.

Signed-off-by: Li Chen <me@linux.beauty>
---
 io_uring/tctx.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index adc6e42c14df..8c6a4c56f5ec 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -124,6 +124,14 @@ int __io_uring_add_tctx_node(struct io_ring_ctx *ctx)
 				return ret;
 		}
 	}
+
+	/*
+	 * Re-activate io-wq keepalive on any new io_uring usage. The wq may have
+	 * been marked for idle-exit when the task temporarily had no active
+	 * io_uring instances.
+	 */
+	if (tctx->io_wq)
+		io_wq_set_exit_on_idle(tctx->io_wq, false);
 	if (!xa_load(&tctx->xa, (unsigned long)ctx)) {
 		node = kmalloc(sizeof(*node), GFP_KERNEL);
 		if (!node)
@@ -185,6 +193,9 @@ __cold void io_uring_del_tctx_node(unsigned long index)
 	if (tctx->last == node->ctx)
 		tctx->last = NULL;
 	kfree(node);
+
+	if (xa_empty(&tctx->xa) && tctx->io_wq)
+		io_wq_set_exit_on_idle(tctx->io_wq, true);
 }
 
 __cold void io_uring_clean_tctx(struct io_uring_task *tctx)
-- 
2.52.0


* Re: [PATCH v1 1/2] io-wq: add exit-on-idle mode
  2026-02-02 14:37 ` [PATCH v1 1/2] io-wq: add exit-on-idle mode Li Chen
@ 2026-02-02 14:52   ` Jens Axboe
  2026-02-02 15:14   ` Jens Axboe
  1 sibling, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-02-02 14:52 UTC (permalink / raw)
  To: Li Chen, Pavel Begunkov, io-uring, linux-kernel

On 2/2/26 7:37 AM, Li Chen wrote:
> io-wq uses an idle timeout to shrink the pool, but keeps the last worker
> around indefinitely to avoid churn.
> 
> For tasks that have used io_uring for file I/O and then stop using it, this
> can leave an iou-wrk-* thread behind even after all io_uring instances are
> gone. This is unnecessary overhead and also gets in the way of process
> checkpoint/restore.
> 
> Add an exit-on-idle mode that makes all io-wq workers exit as soon as they
> become idle, and provide io_wq_set_exit_on_idle() to toggle it.

Was going to say, rather than add a mode for this, why not just have the
idle single worker exit when the last ring is closed? But that is indeed
exactly what these two patches do. So I think this is fine, I just don't
think using the word "mode" for it is correct. "state" would be a lot
better - if we have all rings exited, then that's a state change in
terms of yeah let's just dump that idle worker.

With that in mind, I think these two patches look fine. I'll give them a
closer look. Could you perhaps write a test case for this?

-- 
Jens Axboe

* Re: [PATCH v1 1/2] io-wq: add exit-on-idle mode
  2026-02-02 14:37 ` [PATCH v1 1/2] io-wq: add exit-on-idle mode Li Chen
  2026-02-02 14:52   ` Jens Axboe
@ 2026-02-02 15:14   ` Jens Axboe
  1 sibling, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-02-02 15:14 UTC (permalink / raw)
  To: Li Chen, Pavel Begunkov, io-uring, linux-kernel

On 2/2/26 7:37 AM, Li Chen wrote:
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index 5d0928f37471..97e7eb847c6e 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -655,6 +656,18 @@ static int io_wq_worker(void *data)
>  			io_worker_handle_work(acct, worker);
>  
>  		raw_spin_lock(&wq->lock);
> +		/*
> +		 * If wq is marked idle-exit, drop this worker as soon as it
> +		 * becomes idle. This is used to avoid keeping io-wq worker
> +		 * threads around for tasks that no longer have any active
> +		 * io_uring instances.
> +		 */
> +		if (test_bit(IO_WQ_BIT_EXIT_ON_IDLE, &wq->state)) {
> +			acct->nr_workers--;
> +			raw_spin_unlock(&wq->lock);
> +			__set_current_state(TASK_RUNNING);
> +			break;
> +		}
>  		/*
>  		 * Last sleep timed out. Exit if we're not the last worker,
>  		 * or if someone modified our affinity.

One more note - just add this test_bit() to the check right below, then
you avoid duplicating all of that exit logic. They do the exact same
thing.
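
Something like the below, totally untested, using the condition names from
the current tree:

		if (test_bit(IO_WQ_BIT_EXIT_ON_IDLE, &wq->state) ||
		    (last_timeout && (exit_mask || acct->nr_workers > 1))) {
			acct->nr_workers--;
			raw_spin_unlock(&wq->lock);
			__set_current_state(TASK_RUNNING);
			break;
		}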

-- 
Jens Axboe

* Re: [PATCH v1 0/2] io_uring/io-wq: let workers exit when unused
  2026-02-02 14:37 [PATCH v1 0/2] io_uring/io-wq: let workers exit when unused Li Chen
  2026-02-02 14:37 ` [PATCH v1 1/2] io-wq: add exit-on-idle mode Li Chen
  2026-02-02 14:37 ` [PATCH v1 2/2] io_uring: allow io-wq workers to exit when unused Li Chen
@ 2026-02-02 15:21 ` Jens Axboe
  2 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-02-02 15:21 UTC (permalink / raw)
  To: Li Chen; +Cc: Pavel Begunkov, io-uring, linux-kernel

On 2/2/26 7:37 AM, Li Chen wrote:
> io_uring uses io-wq to offload regular file I/O. When that happens, the kernel
> creates per-task iou-wrk-<tgid> workers (PF_IO_WORKER) via create_io_thread(),
> so the worker is part of the process thread group and shows up under
> /proc/<pid>/task/.
> 
> io-wq shrinks the pool on idle, but it intentionally keeps the last worker
> around indefinitely as a keepalive to avoid churn. Combined with io_uring's
> per-task context lifetime (tctx stays attached to the task until exit), a
> process may permanently retain an idle iou-wrk thread even after it has closed
> its last io_uring instance and has no active rings.
> 
> The keepalive behavior is a reasonable default: workloads may have bursty
> I/O patterns, and always tearing down the last worker would add thread churn
> and latency. Creating io-wq workers goes through create_io_thread()
> (copy_process()), which is not cheap to do repeatedly.
> 
> However, CRIU currently doesn't cope well with such workers being part of the
> checkpointed thread group. The iou-wrk thread is a kernel-managed worker
> (PF_IO_WORKER) running io_wq_worker() on a kernel stack, rather than a normal
> userspace thread executing application code.
> 
> Besides the resource overhead and the surprising userspace-visible thread,
> this breaks checkpoint/restore: CRIU needs to freeze and dump all threads in
> the thread group, and with a lingering iou-wrk thread we observed criu dump
> hang even after the ring had been quiesced and the io_uring fd closed, e.g.:
> 
>   criu dump -t $PID -D images -o dump.log -v4 --shell-job
>   ps -T -p $PID -o pid,tid,comm | grep iou-wrk
> 
> This series is a kernel-side enabler for checkpoint/restore in the current
> reality where userspace needs to quiesce and close io_uring rings before dump.
> It is not trying to make io_uring rings checkpointable, nor does it change what
> CRIU can or cannot restore (e.g. in-flight SQEs/CQEs, SQPOLL, SQE128/CQE32,
> registered resources). Even if userspace gains limited io_uring support,
> this series only targets the specific "no active io_uring contexts left, but an
> idle iou-wrk keepalive thread remains" case.
> 
> This series adds an explicit exit-on-idle mode to io-wq, and toggles it from
> io_uring task context when the task has no active io_uring contexts
> (xa_empty(&tctx->xa)). The mode is cleared on subsequent io_uring usage, so the
> default behavior for active io_uring users is unchanged.
> 
> Tested on x86_64 with CRIU 4.2.
> With this series applied, the iou-wrk thread exited within ~200ms of closing
> the ring and criu dump completed.

Applied, with the commit message wording and the IO_WQ_BIT_EXIT_ON_IDLE test
placement fixed up as mentioned.

-- 
Jens Axboe

