* [PATCH v1 0/2] io_uring/io-wq: let workers exit when unused
@ 2026-02-02 14:37 Li Chen
2026-02-02 14:37 ` [PATCH v1 1/2] io-wq: add exit-on-idle mode Li Chen
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Li Chen @ 2026-02-02 14:37 UTC (permalink / raw)
To: Jens Axboe; +Cc: Pavel Begunkov, io-uring, linux-kernel
io_uring uses io-wq to offload regular file I/O. When that happens, the kernel
creates per-task iou-wrk-<tgid> workers (PF_IO_WORKER) via create_io_thread(),
so the worker is part of the process thread group and shows up under
/proc/<pid>/task/.

io-wq shrinks the pool on idle, but it intentionally keeps the last worker
around indefinitely as a keepalive to avoid churn. Combined with io_uring's
per-task context lifetime (tctx stays attached to the task until exit), a
process may permanently retain an idle iou-wrk thread even after it has closed
its last io_uring instance and has no active rings.

The keepalive behavior is arguably a reasonable default: workloads may have
bursty I/O patterns, and always tearing down the last worker would add thread
churn and latency. Creating io-wq workers goes through create_io_thread()
(copy_process), which is not cheap to do repeatedly.

However, CRIU currently doesn't cope well with such workers being part of the
checkpointed thread group. The iou-wrk thread is a kernel-managed worker
(PF_IO_WORKER) running io_wq_worker() on a kernel stack, rather than a normal
userspace thread executing application code. In our setup, if the iou-wrk
thread remains present after quiescing and closing the last io_uring instance,
criu dump may hang while trying to stop and dump the thread group.

Besides the resource overhead and surprising userspace-visible threads, this is
a problem for checkpoint/restore. CRIU needs to freeze and dump all threads in
the thread group. With a lingering iou-wrk thread, we observed that criu dump
can hang even after the ring has been quiesced and the io_uring fd closed, e.g.:

criu dump -t $PID -D images -o dump.log -v4 --shell-job
ps -T -p $PID -o pid,tid,comm | grep iou-wrk

This series is a kernel-side enabler for checkpoint/restore in the current
reality where userspace needs to quiesce and close io_uring rings before dump.
It is not trying to make io_uring rings checkpointable, nor does it change what
CRIU can or cannot restore (e.g. in-flight SQEs/CQEs, SQPOLL, SQE128/CQE32,
registered resources). Even if CRIU gains limited io_uring support in the
future, this series only targets the specific "no active io_uring contexts
left, but an idle iou-wrk keepalive thread remains" case.

This series adds an explicit exit-on-idle mode to io-wq, and toggles it from
io_uring task context when the task has no active io_uring contexts
(xa_empty(&tctx->xa)). The mode is cleared on subsequent io_uring usage, so the
default behavior for active io_uring users is unchanged.

Tested on x86_64 with CRIU 4.2. With this series applied, the iou-wrk thread
exited within ~200ms of closing the ring, and criu dump completed.
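
For reference, a minimal reproducer sketch (liburing-based; the temp file path
and the use of IOSQE_ASYNC to force io-wq offload are illustrative assumptions,
not something this series mandates):

#include <fcntl.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char buf[4096] = { 0 };
	int fd;

	if (io_uring_queue_init(8, &ring, 0))
		return 1;
	fd = open("/tmp/iowq-test", O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, sizeof(buf), 0);
	sqe->flags |= IOSQE_ASYNC;	/* punt to io-wq, spawning iou-wrk */
	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);	/* last ring goes away here */
	sleep(1);			/* give the idle worker time to exit */
	pause();			/* now run: ps -T -p <pid> | grep iou-wrk */
	return 0;
}

Without the series the grep keeps matching; with it, the worker should be gone
shortly after io_uring_queue_exit().
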
Li Chen (2):
  io-wq: add exit-on-idle mode
  io_uring: allow io-wq workers to exit when unused

 io_uring/io-wq.c | 31 +++++++++++++++++++++++++++++++
 io_uring/io-wq.h |  1 +
 io_uring/tctx.c  | 11 +++++++++++
 3 files changed, 43 insertions(+)
--
2.52.0
* [PATCH v1 1/2] io-wq: add exit-on-idle mode
2026-02-02 14:37 [PATCH v1 0/2] io_uring/io-wq: let workers exit when unused Li Chen
@ 2026-02-02 14:37 ` Li Chen
2026-02-02 14:52 ` Jens Axboe
2026-02-02 15:14 ` Jens Axboe
2026-02-02 14:37 ` [PATCH v1 2/2] io_uring: allow io-wq workers to exit when unused Li Chen
2026-02-02 15:21 ` [PATCH v1 0/2] io_uring/io-wq: let workers " Jens Axboe
2 siblings, 2 replies; 6+ messages in thread
From: Li Chen @ 2026-02-02 14:37 UTC (permalink / raw)
To: Jens Axboe, Pavel Begunkov, io-uring, linux-kernel; +Cc: Li Chen
io-wq uses an idle timeout to shrink the pool, but keeps the last worker
around indefinitely to avoid churn.

For tasks that use io_uring for file I/O and then stop using it,
this can leave an iou-wrk-* thread behind even after all io_uring instances
are gone. This is unnecessary overhead and also gets in the way of process
checkpoint/restore.

Add an exit-on-idle mode that makes all io-wq workers exit as soon as they
become idle, and provide io_wq_set_exit_on_idle() to toggle it.

Signed-off-by: Li Chen <me@linux.beauty>
---
 io_uring/io-wq.c | 31 +++++++++++++++++++++++++++++++
 io_uring/io-wq.h |  1 +
 2 files changed, 32 insertions(+)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 5d0928f37471..97e7eb847c6e 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -35,6 +35,7 @@ enum {
 
 enum {
 	IO_WQ_BIT_EXIT		= 0,	/* wq exiting */
+	IO_WQ_BIT_EXIT_ON_IDLE	= 1,	/* allow all workers to exit on idle */
 };
 
 enum {
@@ -655,6 +656,18 @@ static int io_wq_worker(void *data)
 			io_worker_handle_work(acct, worker);
 
 		raw_spin_lock(&wq->lock);
+		/*
+		 * If wq is marked idle-exit, drop this worker as soon as it
+		 * becomes idle. This is used to avoid keeping io-wq worker
+		 * threads around for tasks that no longer have any active
+		 * io_uring instances.
+		 */
+		if (test_bit(IO_WQ_BIT_EXIT_ON_IDLE, &wq->state)) {
+			acct->nr_workers--;
+			raw_spin_unlock(&wq->lock);
+			__set_current_state(TASK_RUNNING);
+			break;
+		}
 		/*
 		 * Last sleep timed out. Exit if we're not the last worker,
 		 * or if someone modified our affinity.
@@ -894,6 +907,24 @@ static bool io_wq_worker_wake(struct io_worker *worker, void *data)
 	return false;
 }
 
+void io_wq_set_exit_on_idle(struct io_wq *wq, bool enable)
+{
+	if (!wq->task)
+		return;
+
+	if (!enable) {
+		clear_bit(IO_WQ_BIT_EXIT_ON_IDLE, &wq->state);
+		return;
+	}
+
+	if (test_and_set_bit(IO_WQ_BIT_EXIT_ON_IDLE, &wq->state))
+		return;
+
+	rcu_read_lock();
+	io_wq_for_each_worker(wq, io_wq_worker_wake, NULL);
+	rcu_read_unlock();
+}
+
 static void io_run_cancel(struct io_wq_work *work, struct io_wq *wq)
 {
 	do {
diff --git a/io_uring/io-wq.h b/io_uring/io-wq.h
index b3b004a7b625..f7f17a23693e 100644
--- a/io_uring/io-wq.h
+++ b/io_uring/io-wq.h
@@ -46,6 +46,7 @@ struct io_wq_data {
 struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data);
 void io_wq_exit_start(struct io_wq *wq);
 void io_wq_put_and_exit(struct io_wq *wq);
+void io_wq_set_exit_on_idle(struct io_wq *wq, bool enable);
 
 void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work);
 void io_wq_hash_work(struct io_wq_work *work, void *val);
--
2.52.0
* [PATCH v1 2/2] io_uring: allow io-wq workers to exit when unused
2026-02-02 14:37 [PATCH v1 0/2] io_uring/io-wq: let workers exit when unused Li Chen
2026-02-02 14:37 ` [PATCH v1 1/2] io-wq: add exit-on-idle mode Li Chen
@ 2026-02-02 14:37 ` Li Chen
2026-02-02 15:21 ` [PATCH v1 0/2] io_uring/io-wq: let workers " Jens Axboe
2 siblings, 0 replies; 6+ messages in thread
From: Li Chen @ 2026-02-02 14:37 UTC (permalink / raw)
To: Jens Axboe, Pavel Begunkov, io-uring, linux-kernel; +Cc: Li Chen
io_uring keeps a per-task io-wq around, even when the task no longer has
any io_uring instances.

If the task previously used io_uring for file I/O, this can leave an
idle iou-wrk-* worker thread behind after the last io_uring instance
is gone.

When the last io_uring ctx is removed from the task context, mark the io-wq
exit-on-idle so workers can go away. Clear the flag on subsequent io_uring
usage.

Signed-off-by: Li Chen <me@linux.beauty>
---
 io_uring/tctx.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index adc6e42c14df..8c6a4c56f5ec 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -124,6 +124,14 @@ int __io_uring_add_tctx_node(struct io_ring_ctx *ctx)
 			return ret;
 		}
 	}
+
+	/*
+	 * Re-activate io-wq keepalive on any new io_uring usage. The wq may have
+	 * been marked for idle-exit when the task temporarily had no active
+	 * io_uring instances.
+	 */
+	if (tctx->io_wq)
+		io_wq_set_exit_on_idle(tctx->io_wq, false);
 	if (!xa_load(&tctx->xa, (unsigned long)ctx)) {
 		node = kmalloc(sizeof(*node), GFP_KERNEL);
 		if (!node)
@@ -185,6 +193,9 @@ __cold void io_uring_del_tctx_node(unsigned long index)
 	if (tctx->last == node->ctx)
 		tctx->last = NULL;
 	kfree(node);
+
+	if (xa_empty(&tctx->xa) && tctx->io_wq)
+		io_wq_set_exit_on_idle(tctx->io_wq, true);
 }
 
 __cold void io_uring_clean_tctx(struct io_uring_task *tctx)
--
2.52.0
* Re: [PATCH v1 1/2] io-wq: add exit-on-idle mode
2026-02-02 14:37 ` [PATCH v1 1/2] io-wq: add exit-on-idle mode Li Chen
@ 2026-02-02 14:52 ` Jens Axboe
2026-02-02 15:14 ` Jens Axboe
1 sibling, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-02-02 14:52 UTC (permalink / raw)
To: Li Chen, Pavel Begunkov, io-uring, linux-kernel
On 2/2/26 7:37 AM, Li Chen wrote:
> io-wq uses an idle timeout to shrink the pool, but keeps the last worker
> around indefinitely to avoid churn.
> 
> For tasks that use io_uring for file I/O and then stop using it,
> this can leave an iou-wrk-* thread behind even after all io_uring instances
> are gone. This is unnecessary overhead and also gets in the way of process
> checkpoint/restore.
> 
> Add an exit-on-idle mode that makes all io-wq workers exit as soon as they
> become idle, and provide io_wq_set_exit_on_idle() to toggle it.

Was going to say, rather than add a mode for this, why not just have the
idle single worker exit when the last ring is closed? But that is indeed
exactly what these two patches do. So I think this is fine, I just don't
think using the word "mode" for it is correct. "state" would be a lot
better - if we have all rings exited, then that's a state change in
terms of yeah let's just dump that idle worker.

With that in mind, I think these two patches look fine. I'll give them a
closer look. Could you perhaps write a test case for this?

--
Jens Axboe
* Re: [PATCH v1 1/2] io-wq: add exit-on-idle mode
2026-02-02 14:37 ` [PATCH v1 1/2] io-wq: add exit-on-idle mode Li Chen
2026-02-02 14:52 ` Jens Axboe
@ 2026-02-02 15:14 ` Jens Axboe
1 sibling, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-02-02 15:14 UTC (permalink / raw)
To: Li Chen, Pavel Begunkov, io-uring, linux-kernel
On 2/2/26 7:37 AM, Li Chen wrote:
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index 5d0928f37471..97e7eb847c6e 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -655,6 +656,18 @@ static int io_wq_worker(void *data)
>  			io_worker_handle_work(acct, worker);
>
>  		raw_spin_lock(&wq->lock);
> +		/*
> +		 * If wq is marked idle-exit, drop this worker as soon as it
> +		 * becomes idle. This is used to avoid keeping io-wq worker
> +		 * threads around for tasks that no longer have any active
> +		 * io_uring instances.
> +		 */
> +		if (test_bit(IO_WQ_BIT_EXIT_ON_IDLE, &wq->state)) {
> +			acct->nr_workers--;
> +			raw_spin_unlock(&wq->lock);
> +			__set_current_state(TASK_RUNNING);
> +			break;
> +		}
>  		/*
>  		 * Last sleep timed out. Exit if we're not the last worker,
>  		 * or if someone modified our affinity.

One more note - just add this test_bit() to the check right below, then
you avoid duplicating all of that exit logic. They do the exact same
thing.
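
I.e., something along these lines (an untested sketch, assuming the
last_timeout/exit_mask locals that io_wq_worker() has in the current tree):

	/* untested: fold exit-on-idle into the existing exit check */
	if (test_bit(IO_WQ_BIT_EXIT_ON_IDLE, &wq->state) ||
	    (last_timeout && (exit_mask || acct->nr_workers > 1))) {
		acct->nr_workers--;
		raw_spin_unlock(&wq->lock);
		__set_current_state(TASK_RUNNING);
		break;
	}

That keeps a single exit path for both cases.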
--
Jens Axboe
* Re: [PATCH v1 0/2] io_uring/io-wq: let workers exit when unused
2026-02-02 14:37 [PATCH v1 0/2] io_uring/io-wq: let workers exit when unused Li Chen
2026-02-02 14:37 ` [PATCH v1 1/2] io-wq: add exit-on-idle mode Li Chen
2026-02-02 14:37 ` [PATCH v1 2/2] io_uring: allow io-wq workers to exit when unused Li Chen
@ 2026-02-02 15:21 ` Jens Axboe
2 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-02-02 15:21 UTC (permalink / raw)
To: Li Chen; +Cc: Pavel Begunkov, io-uring, linux-kernel
On 2/2/26 7:37 AM, Li Chen wrote:
> io_uring uses io-wq to offload regular file I/O. When that happens, the kernel
> creates per-task iou-wrk-<tgid> workers (PF_IO_WORKER) via create_io_thread(),
> so the worker is part of the process thread group and shows up under
> /proc/<pid>/task/.
>
> io-wq shrinks the pool on idle, but it intentionally keeps the last worker
> around indefinitely as a keepalive to avoid churn. Combined with io_uring's
> per-task context lifetime (tctx stays attached to the task until exit), a
> process may permanently retain an idle iou-wrk thread even after it has closed
> its last io_uring instance and has no active rings.
>
> The keepalive behavior is arguably a reasonable default: workloads may have
> bursty I/O patterns, and always tearing down the last worker would add thread
> churn and latency. Creating io-wq workers goes through create_io_thread()
> (copy_process), which is not cheap to do repeatedly.
>
> However, CRIU currently doesn't cope well with such workers being part of the
> checkpointed thread group. The iou-wrk thread is a kernel-managed worker
> (PF_IO_WORKER) running io_wq_worker() on a kernel stack, rather than a normal
> userspace thread executing application code. In our setup, if the iou-wrk
> thread remains present after quiescing and closing the last io_uring instance,
> criu dump may hang while trying to stop and dump the thread group.
>
> Besides the resource overhead and surprising userspace-visible threads, this is
> a problem for checkpoint/restore. CRIU needs to freeze and dump all threads in
> the thread group. With a lingering iou-wrk thread, we observed that criu dump
> can hang even after the ring has been quiesced and the io_uring fd closed, e.g.:
>
> criu dump -t $PID -D images -o dump.log -v4 --shell-job
> ps -T -p $PID -o pid,tid,comm | grep iou-wrk
>
> This series is a kernel-side enabler for checkpoint/restore in the current
> reality where userspace needs to quiesce and close io_uring rings before dump.
> It is not trying to make io_uring rings checkpointable, nor does it change what
> CRIU can or cannot restore (e.g. in-flight SQEs/CQEs, SQPOLL, SQE128/CQE32,
> registered resources). Even if CRIU gains limited io_uring support in the
> future, this series only targets the specific "no active io_uring contexts
> left, but an idle iou-wrk keepalive thread remains" case.
>
> This series adds an explicit exit-on-idle mode to io-wq, and toggles it from
> io_uring task context when the task has no active io_uring contexts
> (xa_empty(&tctx->xa)). The mode is cleared on subsequent io_uring usage, so the
> default behavior for active io_uring users is unchanged.
>
> Tested on x86_64 with CRIU 4.2. With this series applied, the iou-wrk thread
> exited within ~200ms of closing the ring, and criu dump completed.

Applied with the mentioned commit message and IO_WQ_BIT_EXIT_ON_IDLE test
placement changes.

--
Jens Axboe