public inbox for io-uring@vger.kernel.org
* [PATCH 4/5] io_uring: wait for cancelations on final ring put
  2025-03-21 19:24 [PATCHSET RFC v2 0/5] Cancel and wait for all requests on exit Jens Axboe
@ 2025-03-21 19:24 ` Jens Axboe
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2025-03-21 19:24 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, Jens Axboe

We still offload the cancelation to a workqueue, so as not to introduce
dependencies between the exiting task waiting on cleanup and that task
needing to run task_work to complete the process.

This means that once the final ring put is done, any request that was
inflight and needed cancelation will be done as well. Notably, this
covers requests that hold references to files: once the ring fd close is
done, we will have dropped any of those references too.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |  2 ++
 io_uring/io_uring.c            | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index c17d2eedf478..79e223fd4733 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -450,6 +450,8 @@ struct io_ring_ctx {
 	struct io_mapped_region		param_region;
 	/* just one zcrx per ring for now, will move to io_zcrx_ifq eventually */
 	struct io_mapped_region		zcrx_region;
+
+	struct completion		*exit_comp;
 };
 
 /*
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 984db01f5184..d9b65a322ae1 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2894,6 +2894,7 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 	struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, exit_work);
 	unsigned long timeout = jiffies + HZ * 60 * 5;
 	unsigned long interval = HZ / 20;
+	struct completion *exit_comp;
 	struct io_tctx_exit exit;
 	struct io_tctx_node *node;
 	int ret;
@@ -2958,6 +2959,10 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 
 	io_kworker_tw_end();
 
+	exit_comp = READ_ONCE(ctx->exit_comp);
+	if (exit_comp)
+		complete(exit_comp);
+
 	init_completion(&exit.completion);
 	init_task_work(&exit.task_work, io_tctx_exit_cb);
 	exit.ctx = ctx;
@@ -3020,9 +3025,21 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 static int io_uring_release(struct inode *inode, struct file *file)
 {
 	struct io_ring_ctx *ctx = file->private_data;
+	DECLARE_COMPLETION_ONSTACK(exit_comp);
 
 	file->private_data = NULL;
+	WRITE_ONCE(ctx->exit_comp, &exit_comp);
 	io_ring_ctx_wait_and_kill(ctx);
+
+	/*
+	 * Wait for cancel to run before exiting task
+	 */
+	do {
+		if (current->io_uring)
+			io_fallback_tw(current->io_uring, false);
+		cond_resched();
+	} while (wait_for_completion_interruptible(&exit_comp));
+
 	return 0;
 }
 
-- 
2.49.0



* [PATCHSET v3 0/5] Cancel and wait for all requests on exit
@ 2025-04-09 13:35 Jens Axboe
  2025-04-09 13:35 ` [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK Jens Axboe
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Jens Axboe @ 2025-04-09 13:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, brauner, linux-fsdevel

Hi,

Currently, when a ring is being shut down, some cancelations may happen
out-of-line. This means that an application cannot rely on ring exit
implying that all of its IO has fully completed, nor can someone waiting
on the termination of an application (which has a ring with pending IO)
assume that all requests are done. This has also manifested itself as
various tests sometimes finding a mount point busy after a test has
exited, because it may take a brief period of time for things to quiesce
and be fully done.

This patchset makes the task wait on the cancelations, if any, when
the io_uring file fd is being put. That has the effect of ensuring that
pending IO has fully completed, and that files have been closed, before
the ring exit returns.
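
As a concrete illustration of that guarantee, here is a minimal userspace
sketch (assuming liburing; it is not part of the series, just an example
of the behavior described above):

#include <liburing.h>
#include <unistd.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        char buf[32];
        int fds[2];

        if (pipe(fds) < 0)
                return 1;
        io_uring_queue_init(8, &ring, 0);

        /* read from an empty pipe: stays inflight, pins a ref to fds[0] */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fds[0], buf, sizeof(buf), 0);
        io_uring_submit(&ring);

        /*
         * Closes the ring fd. With this series, the close waits for the
         * out-of-line cancelation to run, so the reference to fds[0] has
         * been dropped by the time this returns.
         */
        io_uring_queue_exit(&ring);
        return 0;
}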

io_uring runs cancelations off kthread-based fallback work, and patch 1
enables these kthreads to run task_work so we know that files can be put
inline, without needing to rely on deadlock-prone flushing of the delayed
fput work item. The rest of the patches are all io_uring based and pretty
straightforward. This doesn't fundamentally change how cancelations work;
it just waits on the out-of-line cancelations and request quiescing
before exiting.

The switch away from percpu reference counts is done mostly because
exiting those references costs us an RCU grace period. That would
noticeably slow down ring teardown, by anywhere from 10-100x.

The changes can also be found here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-exit-cancel.2

Since v2:
- Fix logic error in io_queue_iowq()

 fs/file_table.c                |  2 +-
 include/linux/io_uring_types.h |  4 +-
 include/linux/sched.h          |  2 +-
 io_uring/io_uring.c            | 79 +++++++++++++++++++++++-----------
 io_uring/io_uring.h            |  3 +-
 io_uring/msg_ring.c            |  4 +-
 io_uring/refs.h                | 43 ++++++++++++++++++
 io_uring/register.c            |  2 +-
 io_uring/rw.c                  |  2 +-
 io_uring/sqpoll.c              |  2 +-
 io_uring/zcrx.c                |  4 +-
 kernel/fork.c                  |  2 +-
 12 files changed, 111 insertions(+), 38 deletions(-)

-- 
Jens Axboe




* [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK
  2025-04-09 13:35 [PATCHSET v3 0/5] Cancel and wait for all requests on exit Jens Axboe
@ 2025-04-09 13:35 ` Jens Axboe
  2025-04-11 13:48   ` Christian Brauner
  2025-04-14 17:11   ` Mateusz Guzik
  2025-04-09 13:35 ` [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable Jens Axboe
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Jens Axboe @ 2025-04-09 13:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, brauner, linux-fsdevel, Jens Axboe

fput currently gates whether or not a task can run task_work on the
PF_KTHREAD flag, which excludes kernel threads since they don't usually
run task_work, as they never exit to userspace. This punts the final fput
done from a kthread to a delayed work item instead of using task_work.

It's perfectly viable to have the final fput done by the kthread itself,
as long as it will actually run the task_work. Add a PF_NO_TASKWORK flag
which is set by default for kernel threads, and gate the task_work fput
on that instead. This enables a kernel thread to clear this flag
temporarily while putting files, as long as it runs its task_work
manually.

This enables users like io_uring to ensure that, when the final fput of
a file is done as part of ring teardown, the local task_work is run, and
hence to know that all files have been properly put, without needing to
resort to workqueue flushing tricks which can deadlock.

No functional changes in this patch.

Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/file_table.c       | 2 +-
 include/linux/sched.h | 2 +-
 kernel/fork.c         | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index c04ed94cdc4b..e3c3dd1b820d 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -521,7 +521,7 @@ static void __fput_deferred(struct file *file)
 		return;
 	}
 
-	if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+	if (likely(!in_interrupt() && !(task->flags & PF_NO_TASKWORK))) {
 		init_task_work(&file->f_task_work, ____fput);
 		if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
 			return;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..349c993fc32b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1736,7 +1736,7 @@ extern struct pid *cad_pid;
 						 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
-#define PF__HOLE__00800000	0x00800000
+#define PF_NO_TASKWORK		0x00800000	/* task doesn't run task_work */
 #define PF__HOLE__01000000	0x01000000
 #define PF__HOLE__02000000	0x02000000
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd8998b..8dd0b8a5348d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2261,7 +2261,7 @@ __latent_entropy struct task_struct *copy_process(
 		goto fork_out;
 	p->flags &= ~PF_KTHREAD;
 	if (args->kthread)
-		p->flags |= PF_KTHREAD;
+		p->flags |= PF_KTHREAD | PF_NO_TASKWORK;
 	if (args->user_worker) {
 		/*
 		 * Mark us a user worker, and block any signal that isn't
-- 
2.49.0



* [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable
  2025-04-09 13:35 [PATCHSET v3 0/5] Cancel and wait for all requests on exit Jens Axboe
  2025-04-09 13:35 ` [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK Jens Axboe
@ 2025-04-09 13:35 ` Jens Axboe
  2025-04-11 13:55   ` Christian Brauner
  2025-04-09 13:35 ` [PATCH 3/5] io_uring: consider ring dead once the ref is marked dying Jens Axboe
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2025-04-09 13:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, brauner, linux-fsdevel, Jens Axboe

There are two types of work here:

1) Fallback work, if the task is exiting
2) The exit side cancelations

and both of them may do the final fput() of a file. When this happens,
fput() will schedule delayed work. This slows down exits when io_uring
needs to wait for that work to finish. It is possible to flush this via
flush_delayed_fput(), but that's a big hammer as other unrelated files
could be involved, and from other tasks as well.

Add two io_uring helpers to temporarily clear PF_NO_TASKWORK for the
worker threads, and run any queued task_work before setting the flag
again. Then we can ensure we only flush related items that received
their final fput as part of work cancelation and flushing.

For now these are io_uring private, but could obviously be made
generically available, should there be a need to do so.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/io_uring.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index c6209fe44cb1..bff99e185217 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -238,6 +238,20 @@ static inline void io_req_add_to_cache(struct io_kiocb *req, struct io_ring_ctx
 	wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list);
 }
 
+static __cold void io_kworker_tw_start(void)
+{
+	if (WARN_ON_ONCE(!(current->flags & PF_NO_TASKWORK)))
+		return;
+	current->flags &= ~PF_NO_TASKWORK;
+}
+
+static __cold void io_kworker_tw_end(void)
+{
+	while (task_work_pending(current))
+		task_work_run();
+	current->flags |= PF_NO_TASKWORK;
+}
+
 static __cold void io_ring_ctx_ref_free(struct percpu_ref *ref)
 {
 	struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
@@ -253,6 +267,8 @@ static __cold void io_fallback_req_func(struct work_struct *work)
 	struct io_kiocb *req, *tmp;
 	struct io_tw_state ts = {};
 
+	io_kworker_tw_start();
+
 	percpu_ref_get(&ctx->refs);
 	mutex_lock(&ctx->uring_lock);
 	llist_for_each_entry_safe(req, tmp, node, io_task_work.node)
@@ -260,6 +276,7 @@ static __cold void io_fallback_req_func(struct work_struct *work)
 	io_submit_flush_completions(ctx);
 	mutex_unlock(&ctx->uring_lock);
 	percpu_ref_put(&ctx->refs);
+	io_kworker_tw_end();
 }
 
 static int io_alloc_hash_table(struct io_hash_table *table, unsigned bits)
@@ -2876,6 +2893,8 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 	struct io_tctx_node *node;
 	int ret;
 
+	io_kworker_tw_start();
+
 	/*
 	 * If we're doing polled IO and end up having requests being
 	 * submitted async (out-of-line), then completions can come in while
@@ -2932,6 +2951,8 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 		 */
 	} while (!wait_for_completion_interruptible_timeout(&ctx->ref_comp, interval));
 
+	io_kworker_tw_end();
+
 	init_completion(&exit.completion);
 	init_task_work(&exit.task_work, io_tctx_exit_cb);
 	exit.ctx = ctx;
-- 
2.49.0



* [PATCH 3/5] io_uring: consider ring dead once the ref is marked dying
  2025-04-09 13:35 [PATCHSET v3 0/5] Cancel and wait for all requests on exit Jens Axboe
  2025-04-09 13:35 ` [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK Jens Axboe
  2025-04-09 13:35 ` [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable Jens Axboe
@ 2025-04-09 13:35 ` Jens Axboe
  2025-04-09 13:35 ` [PATCH 4/5] io_uring: wait for cancelations on final ring put Jens Axboe
  2025-04-09 13:35 ` [PATCH 5/5] io_uring: switch away from percpu refcounts Jens Axboe
  4 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2025-04-09 13:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, brauner, linux-fsdevel, Jens Axboe

For queueing work to io-wq or adding normal task_work, io_uring will
cancel the work items if the task is going away. If the ring is starting
to go through teardown, the ref is marked as dying. Use that as well for
the fallback/cancel mechanism.

For deferred task_work, this is done out-of-line as part of the exit
work handling. Hence it doesn't need any extra checks in the hot path.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/io_uring.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index bff99e185217..ce00b616e138 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -555,7 +555,8 @@ static void io_queue_iowq(struct io_kiocb *req)
 	 * procedure rather than attempt to run this request (or create a new
 	 * worker for it).
 	 */
-	if (WARN_ON_ONCE(!same_thread_group(tctx->task, current)))
+	if (WARN_ON_ONCE(!same_thread_group(tctx->task, current) ||
+			 percpu_ref_is_dying(&req->ctx->refs)))
 		atomic_or(IO_WQ_WORK_CANCEL, &req->work.flags);
 
 	trace_io_uring_queue_async_work(req, io_wq_is_hashed(&req->work));
@@ -1246,7 +1247,8 @@ static void io_req_normal_work_add(struct io_kiocb *req)
 		return;
 	}
 
-	if (likely(!task_work_add(tctx->task, &tctx->task_work, ctx->notify_method)))
+	if (!percpu_ref_is_dying(&ctx->refs) &&
+	    !task_work_add(tctx->task, &tctx->task_work, ctx->notify_method))
 		return;
 
 	io_fallback_tw(tctx, false);
-- 
2.49.0



* [PATCH 4/5] io_uring: wait for cancelations on final ring put
  2025-04-09 13:35 [PATCHSET v3 0/5] Cancel and wait for all requests on exit Jens Axboe
                   ` (2 preceding siblings ...)
  2025-04-09 13:35 ` [PATCH 3/5] io_uring: consider ring dead once the ref is marked dying Jens Axboe
@ 2025-04-09 13:35 ` Jens Axboe
  2025-04-09 13:35 ` [PATCH 5/5] io_uring: switch away from percpu refcounts Jens Axboe
  4 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2025-04-09 13:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, brauner, linux-fsdevel, Jens Axboe

We still offload the cancelation to a workqueue, so as not to introduce
dependencies between the exiting task waiting on cleanup and that task
needing to run task_work to complete the process.

This means that once the final ring put is done, any request that was
inflight and needed cancelation will be done as well. Notably, this
covers requests that hold references to files: once the ring fd close is
done, we will have dropped any of those references too.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |  2 ++
 io_uring/io_uring.c            | 17 +++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index b44d201520d8..4d26aef281fb 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -450,6 +450,8 @@ struct io_ring_ctx {
 	struct io_mapped_region		param_region;
 	/* just one zcrx per ring for now, will move to io_zcrx_ifq eventually */
 	struct io_mapped_region		zcrx_region;
+
+	struct completion		*exit_comp;
 };
 
 /*
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ce00b616e138..4b3e3ff774d6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2891,6 +2891,7 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 	struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, exit_work);
 	unsigned long timeout = jiffies + HZ * 60 * 5;
 	unsigned long interval = HZ / 20;
+	struct completion *exit_comp;
 	struct io_tctx_exit exit;
 	struct io_tctx_node *node;
 	int ret;
@@ -2955,6 +2956,10 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 
 	io_kworker_tw_end();
 
+	exit_comp = READ_ONCE(ctx->exit_comp);
+	if (exit_comp)
+		complete(exit_comp);
+
 	init_completion(&exit.completion);
 	init_task_work(&exit.task_work, io_tctx_exit_cb);
 	exit.ctx = ctx;
@@ -3017,9 +3022,21 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 static int io_uring_release(struct inode *inode, struct file *file)
 {
 	struct io_ring_ctx *ctx = file->private_data;
+	DECLARE_COMPLETION_ONSTACK(exit_comp);
 
 	file->private_data = NULL;
+	WRITE_ONCE(ctx->exit_comp, &exit_comp);
 	io_ring_ctx_wait_and_kill(ctx);
+
+	/*
+	 * Wait for cancel to run before exiting task
+	 */
+	do {
+		if (current->io_uring)
+			io_fallback_tw(current->io_uring, false);
+		cond_resched();
+	} while (wait_for_completion_interruptible(&exit_comp));
+
 	return 0;
 }
 
-- 
2.49.0



* [PATCH 5/5] io_uring: switch away from percpu refcounts
  2025-04-09 13:35 [PATCHSET v3 0/5] Cancel and wait for all requests on exit Jens Axboe
                   ` (3 preceding siblings ...)
  2025-04-09 13:35 ` [PATCH 4/5] io_uring: wait for cancelations on final ring put Jens Axboe
@ 2025-04-09 13:35 ` Jens Axboe
  4 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2025-04-09 13:35 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence, brauner, linux-fsdevel, Jens Axboe

For the common cases, the io_uring ref counts are all batched and hence
need not be a percpu reference. This saves some memory, but more
importantly it gets rid of needing a full RCU grace period when tearing
down the reference. With io_uring now waiting on cancelations and IO
during exit, that grace period slows down the teardown a lot, up to 100x
as slow.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |  2 +-
 io_uring/io_uring.c            | 47 ++++++++++++----------------------
 io_uring/io_uring.h            |  3 ++-
 io_uring/msg_ring.c            |  4 +--
 io_uring/refs.h                | 43 +++++++++++++++++++++++++++++++
 io_uring/register.c            |  2 +-
 io_uring/rw.c                  |  2 +-
 io_uring/sqpoll.c              |  2 +-
 io_uring/zcrx.c                |  4 +--
 9 files changed, 70 insertions(+), 39 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 4d26aef281fb..bcafd7cc8c26 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -256,7 +256,7 @@ struct io_ring_ctx {
 
 		struct task_struct	*submitter_task;
 		struct io_rings		*rings;
-		struct percpu_ref	refs;
+		atomic_long_t		refs;
 
 		clockid_t		clockid;
 		enum tk_offsets		clock_offset;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4b3e3ff774d6..8b2f8a081ef6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -252,13 +252,6 @@ static __cold void io_kworker_tw_end(void)
 	current->flags |= PF_NO_TASKWORK;
 }
 
-static __cold void io_ring_ctx_ref_free(struct percpu_ref *ref)
-{
-	struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
-
-	complete(&ctx->ref_comp);
-}
-
 static __cold void io_fallback_req_func(struct work_struct *work)
 {
 	struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx,
@@ -269,13 +262,13 @@ static __cold void io_fallback_req_func(struct work_struct *work)
 
 	io_kworker_tw_start();
 
-	percpu_ref_get(&ctx->refs);
+	io_ring_ref_get(ctx);
 	mutex_lock(&ctx->uring_lock);
 	llist_for_each_entry_safe(req, tmp, node, io_task_work.node)
 		req->io_task_work.func(req, ts);
 	io_submit_flush_completions(ctx);
 	mutex_unlock(&ctx->uring_lock);
-	percpu_ref_put(&ctx->refs);
+	io_ring_ref_put(ctx);
 	io_kworker_tw_end();
 }
 
@@ -333,10 +326,8 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	hash_bits = clamp(hash_bits, 1, 8);
 	if (io_alloc_hash_table(&ctx->cancel_table, hash_bits))
 		goto err;
-	if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free,
-			    0, GFP_KERNEL))
-		goto err;
 
+	io_ring_ref_init(ctx);
 	ctx->flags = p->flags;
 	ctx->hybrid_poll_time = LLONG_MAX;
 	atomic_set(&ctx->cq_wait_nr, IO_CQ_WAKE_INIT);
@@ -360,7 +351,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	ret |= io_futex_cache_init(ctx);
 	ret |= io_rsrc_cache_init(ctx);
 	if (ret)
-		goto free_ref;
+		goto err;
 	init_completion(&ctx->ref_comp);
 	xa_init_flags(&ctx->personalities, XA_FLAGS_ALLOC1);
 	mutex_init(&ctx->uring_lock);
@@ -386,9 +377,6 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	mutex_init(&ctx->mmap_lock);
 
 	return ctx;
-
-free_ref:
-	percpu_ref_exit(&ctx->refs);
 err:
 	io_free_alloc_caches(ctx);
 	kvfree(ctx->cancel_table.hbs);
@@ -556,7 +544,7 @@ static void io_queue_iowq(struct io_kiocb *req)
 	 * worker for it).
 	 */
 	if (WARN_ON_ONCE(!same_thread_group(tctx->task, current) ||
-			 percpu_ref_is_dying(&req->ctx->refs)))
+			 io_ring_ref_is_dying(req->ctx)))
 		atomic_or(IO_WQ_WORK_CANCEL, &req->work.flags);
 
 	trace_io_uring_queue_async_work(req, io_wq_is_hashed(&req->work));
@@ -991,7 +979,7 @@ __cold bool __io_alloc_req_refill(struct io_ring_ctx *ctx)
 		ret = 1;
 	}
 
-	percpu_ref_get_many(&ctx->refs, ret);
+	io_ring_ref_get_many(ctx, ret);
 	while (ret--) {
 		struct io_kiocb *req = reqs[ret];
 
@@ -1046,7 +1034,7 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, io_tw_token_t tw)
 
 	io_submit_flush_completions(ctx);
 	mutex_unlock(&ctx->uring_lock);
-	percpu_ref_put(&ctx->refs);
+	io_ring_ref_put(ctx);
 }
 
 /*
@@ -1070,7 +1058,7 @@ struct llist_node *io_handle_tw_list(struct llist_node *node,
 			ctx_flush_and_put(ctx, ts);
 			ctx = req->ctx;
 			mutex_lock(&ctx->uring_lock);
-			percpu_ref_get(&ctx->refs);
+			io_ring_ref_get(ctx);
 		}
 		INDIRECT_CALL_2(req->io_task_work.func,
 				io_poll_task_func, io_req_rw_complete,
@@ -1099,10 +1087,10 @@ static __cold void __io_fallback_tw(struct llist_node *node, bool sync)
 		if (sync && last_ctx != req->ctx) {
 			if (last_ctx) {
 				flush_delayed_work(&last_ctx->fallback_work);
-				percpu_ref_put(&last_ctx->refs);
+				io_ring_ref_put(last_ctx);
 			}
 			last_ctx = req->ctx;
-			percpu_ref_get(&last_ctx->refs);
+			io_ring_ref_get(last_ctx);
 		}
 		if (llist_add(&req->io_task_work.node,
 			      &req->ctx->fallback_llist))
@@ -1111,7 +1099,7 @@ static __cold void __io_fallback_tw(struct llist_node *node, bool sync)
 
 	if (last_ctx) {
 		flush_delayed_work(&last_ctx->fallback_work);
-		percpu_ref_put(&last_ctx->refs);
+		io_ring_ref_put(last_ctx);
 	}
 }
 
@@ -1247,7 +1235,7 @@ static void io_req_normal_work_add(struct io_kiocb *req)
 		return;
 	}
 
-	if (!percpu_ref_is_dying(&ctx->refs) &&
+	if (!io_ring_ref_is_dying(ctx) &&
 	    !task_work_add(tctx->task, &tctx->task_work, ctx->notify_method))
 		return;
 
@@ -2736,7 +2724,7 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
 		nr++;
 	}
 	if (nr)
-		percpu_ref_put_many(&ctx->refs, nr);
+		io_ring_ref_put_many(ctx, nr);
 	mutex_unlock(&ctx->uring_lock);
 }
 
@@ -2770,7 +2758,6 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
 		static_branch_dec(&io_key_has_sqarray);
 
-	percpu_ref_exit(&ctx->refs);
 	free_uid(ctx->user);
 	io_req_caches_free(ctx);
 	if (ctx->hash_map)
@@ -2795,7 +2782,7 @@ static __cold void io_activate_pollwq_cb(struct callback_head *cb)
 	 * might've been lost due to loose synchronisation.
 	 */
 	wake_up_all(&ctx->poll_wq);
-	percpu_ref_put(&ctx->refs);
+	io_ring_ref_put(ctx);
 }
 
 __cold void io_activate_pollwq(struct io_ring_ctx *ctx)
@@ -2813,9 +2800,9 @@ __cold void io_activate_pollwq(struct io_ring_ctx *ctx)
 	 * only need to sync with it, which is done by injecting a tw
 	 */
 	init_task_work(&ctx->poll_wq_task_work, io_activate_pollwq_cb);
-	percpu_ref_get(&ctx->refs);
+	io_ring_ref_get(ctx);
 	if (task_work_add(ctx->submitter_task, &ctx->poll_wq_task_work, TWA_SIGNAL))
-		percpu_ref_put(&ctx->refs);
+		io_ring_ref_put(ctx);
 out:
 	spin_unlock(&ctx->completion_lock);
 }
@@ -3002,7 +2989,7 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 	struct creds *creds;
 
 	mutex_lock(&ctx->uring_lock);
-	percpu_ref_kill(&ctx->refs);
+	io_ring_ref_kill(ctx);
 	xa_for_each(&ctx->personalities, index, creds)
 		io_unregister_personality(ctx, index);
 	mutex_unlock(&ctx->uring_lock);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index e4050b2d0821..f8500221dd82 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -13,6 +13,7 @@
 #include "slist.h"
 #include "filetable.h"
 #include "opdef.h"
+#include "refs.h"
 
 #ifndef CREATE_TRACE_POINTS
 #include <trace/events/io_uring.h>
@@ -142,7 +143,7 @@ static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
 		 * Not from an SQE, as those cannot be submitted, but via
 		 * updating tagged resources.
 		 */
-		if (!percpu_ref_is_dying(&ctx->refs))
+		if (!io_ring_ref_is_dying(ctx))
 			lockdep_assert(current == ctx->submitter_task);
 	}
 #endif
diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 50a958e9c921..00d6a9ed2431 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -83,7 +83,7 @@ static void io_msg_tw_complete(struct io_kiocb *req, io_tw_token_t tw)
 	}
 	if (req)
 		kmem_cache_free(req_cachep, req);
-	percpu_ref_put(&ctx->refs);
+	io_ring_ref_put(ctx);
 }
 
 static int io_msg_remote_post(struct io_ring_ctx *ctx, struct io_kiocb *req,
@@ -96,7 +96,7 @@ static int io_msg_remote_post(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	req->opcode = IORING_OP_NOP;
 	req->cqe.user_data = user_data;
 	io_req_set_res(req, res, cflags);
-	percpu_ref_get(&ctx->refs);
+	io_ring_ref_get(ctx);
 	req->ctx = ctx;
 	req->tctx = NULL;
 	req->io_task_work.func = io_msg_tw_complete;
diff --git a/io_uring/refs.h b/io_uring/refs.h
index 0d928d87c4ed..9cda88a0a0d5 100644
--- a/io_uring/refs.h
+++ b/io_uring/refs.h
@@ -59,4 +59,47 @@ static inline void io_req_set_refcount(struct io_kiocb *req)
 {
 	__io_req_set_refcount(req, 1);
 }
+
+#define IO_RING_REF_DEAD	(1UL << (BITS_PER_LONG - 1))
+#define IO_RING_REF_MASK	(~IO_RING_REF_DEAD)
+
+static inline bool io_ring_ref_is_dying(struct io_ring_ctx *ctx)
+{
+	return atomic_long_read(&ctx->refs) & IO_RING_REF_DEAD;
+}
+
+static inline void io_ring_ref_put_many(struct io_ring_ctx *ctx, int nr_refs)
+{
+	unsigned long refs;
+
+	refs = atomic_long_sub_return(nr_refs, &ctx->refs);
+	if (!(refs & IO_RING_REF_MASK))
+		complete(&ctx->ref_comp);
+}
+
+static inline void io_ring_ref_put(struct io_ring_ctx *ctx)
+{
+	io_ring_ref_put_many(ctx, 1);
+}
+
+static inline void io_ring_ref_kill(struct io_ring_ctx *ctx)
+{
+	atomic_long_xor(IO_RING_REF_DEAD, &ctx->refs);
+	io_ring_ref_put(ctx);
+}
+
+static inline void io_ring_ref_init(struct io_ring_ctx *ctx)
+{
+	atomic_long_set(&ctx->refs, 1);
+}
+
+static inline void io_ring_ref_get_many(struct io_ring_ctx *ctx, int nr_refs)
+{
+	atomic_long_add(nr_refs, &ctx->refs);
+}
+
+static inline void io_ring_ref_get(struct io_ring_ctx *ctx)
+{
+	atomic_long_inc(&ctx->refs);
+}
 #endif
diff --git a/io_uring/register.c b/io_uring/register.c
index cc23a4c205cd..54fe94a0101b 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -637,7 +637,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 	 * We don't quiesce the refs for register anymore and so it can't be
 	 * dying as we're holding a file ref here.
 	 */
-	if (WARN_ON_ONCE(percpu_ref_is_dying(&ctx->refs)))
+	if (WARN_ON_ONCE(io_ring_ref_is_dying(ctx)))
 		return -ENXIO;
 
 	if (ctx->submitter_task && ctx->submitter_task != current)
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 039e063f7091..e010d548edea 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -496,7 +496,7 @@ static bool io_rw_should_reissue(struct io_kiocb *req)
 	 * Don't attempt to reissue from that path, just let it fail with
 	 * -EAGAIN.
 	 */
-	if (percpu_ref_is_dying(&ctx->refs))
+	if (io_ring_ref_is_dying(ctx))
 		return false;
 
 	io_meta_restore(io, &rw->kiocb);
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index d037cc68e9d3..b71f8d52386e 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -184,7 +184,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
 		 * Don't submit if refs are dying, good for io_uring_register(),
 		 * but also it is relied upon by io_ring_exit_work()
 		 */
-		if (to_submit && likely(!percpu_ref_is_dying(&ctx->refs)) &&
+		if (to_submit && likely(!io_ring_ref_is_dying(ctx)) &&
 		    !(ctx->flags & IORING_SETUP_R_DISABLED))
 			ret = io_submit_sqes(ctx, to_submit);
 		mutex_unlock(&ctx->uring_lock);
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 0f46e0404c04..e8dbed7b8171 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -630,7 +630,7 @@ static int io_pp_zc_init(struct page_pool *pp)
 	if (pp->p.dma_dir != DMA_FROM_DEVICE)
 		return -EOPNOTSUPP;
 
-	percpu_ref_get(&ifq->ctx->refs);
+	io_ring_ref_get(ifq->ctx);
 	return 0;
 }
 
@@ -641,7 +641,7 @@ static void io_pp_zc_destroy(struct page_pool *pp)
 
 	if (WARN_ON_ONCE(area->free_count != area->nia.num_niovs))
 		return;
-	percpu_ref_put(&ifq->ctx->refs);
+	io_ring_ref_put(ifq->ctx);
 }
 
 static int io_pp_nl_fill(void *mp_priv, struct sk_buff *rsp,
-- 
2.49.0



* Re: [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK
  2025-04-09 13:35 ` [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK Jens Axboe
@ 2025-04-11 13:48   ` Christian Brauner
  2025-04-11 14:37     ` Jens Axboe
  2025-04-14 17:11   ` Mateusz Guzik
  1 sibling, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2025-04-11 13:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, asml.silence, linux-fsdevel

On Wed, Apr 09, 2025 at 07:35:19AM -0600, Jens Axboe wrote:
> fput currently gates whether or not a task can run task_work on the
> PF_KTHREAD flag, which excludes kernel threads as they don't usually run
> task_work as they never exit to userspace. This punts the final fput
> done from a kthread to a delayed work item instead of using task_work.
> 
> It's perfectly viable to have the final fput done by the kthread itself,
> as long as it will actually run the task_work. Add a PF_NO_TASKWORK flag
> which is set by default by a kernel thread, and gate the task_work fput
> on that instead. This enables a kernel thread to clear this flag
> temporarily while putting files, as long as it runs its task_work
> manually.
> 
> This enables users like io_uring to ensure that when the final fput of a
> file is done as part of ring teardown to run the local task_work and
> hence know that all files have been properly put, without needing to
> resort to workqueue flushing tricks which can deadlock.
> 
> No functional changes in this patch.
> 
> Cc: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---

Seems fine. Although it has some potential for abuse. So maybe a
VFS_WARN_ON_ONCE() that PF_NO_TASKWORK is only used with PF_KTHREAD
would make sense.

Acked-by: Christian Brauner <brauner@kernel.org>

>  fs/file_table.c       | 2 +-
>  include/linux/sched.h | 2 +-
>  kernel/fork.c         | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/file_table.c b/fs/file_table.c
> index c04ed94cdc4b..e3c3dd1b820d 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -521,7 +521,7 @@ static void __fput_deferred(struct file *file)
>  		return;
>  	}
>  
> -	if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> +	if (likely(!in_interrupt() && !(task->flags & PF_NO_TASKWORK))) {
>  		init_task_work(&file->f_task_work, ____fput);
>  		if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
>  			return;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac1982893..349c993fc32b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1736,7 +1736,7 @@ extern struct pid *cad_pid;
>  						 * I am cleaning dirty pages from some other bdi. */
>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
> -#define PF__HOLE__00800000	0x00800000
> +#define PF_NO_TASKWORK		0x00800000	/* task doesn't run task_work */
>  #define PF__HOLE__01000000	0x01000000
>  #define PF__HOLE__02000000	0x02000000
>  #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c4b26cd8998b..8dd0b8a5348d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2261,7 +2261,7 @@ __latent_entropy struct task_struct *copy_process(
>  		goto fork_out;
>  	p->flags &= ~PF_KTHREAD;
>  	if (args->kthread)
> -		p->flags |= PF_KTHREAD;
> +		p->flags |= PF_KTHREAD | PF_NO_TASKWORK;
>  	if (args->user_worker) {
>  		/*
>  		 * Mark us a user worker, and block any signal that isn't
> -- 
> 2.49.0
> 


* Re: [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable
  2025-04-09 13:35 ` [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable Jens Axboe
@ 2025-04-11 13:55   ` Christian Brauner
  2025-04-11 14:38     ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2025-04-11 13:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, asml.silence, linux-fsdevel

On Wed, Apr 09, 2025 at 07:35:20AM -0600, Jens Axboe wrote:
> There are two types of work here:
> 
> 1) Fallback work, if the task is exiting
> 2) The exit side cancelations
> 
> and both of them may do the final fput() of a file. When this happens,
> fput() will schedule delayed work. This slows down exits when io_uring

I was a bit surprised by this because it means that all those __fput()s
are done with kthread credentials, which is a bit unexpected (but
harmless afaict).

> needs to wait for that work to finish. It is possible to flush this via
> flush_delayed_fput(), but that's a big hammer as other unrelated files
> could be involved, and from other tasks as well.
> 
> Add two io_uring helpers to temporarily clear PF_NO_TASKWORK for the
> worker threads, and run any queued task_work before setting the flag
> again. Then we can ensure we only flush related items that received
> their final fput as part of work cancelation and flushing.

Ok, so the only change is that this isn't offloaded to the global
delayed fput workqueue but to the task work that you're running off of
your kthread helpers.

> 
> For now these are io_uring private, but could obviously be made
> generically available, should there be a need to do so.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  io_uring/io_uring.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index c6209fe44cb1..bff99e185217 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -238,6 +238,20 @@ static inline void io_req_add_to_cache(struct io_kiocb *req, struct io_ring_ctx
>  	wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list);
>  }
>  
> +static __cold void io_kworker_tw_start(void)
> +{
> +	if (WARN_ON_ONCE(!(current->flags & PF_NO_TASKWORK)))
> +		return;
> +	current->flags &= ~PF_NO_TASKWORK;
> +}
> +
> +static __cold void io_kworker_tw_end(void)
> +{
> +	while (task_work_pending(current))
> +		task_work_run();
> +	current->flags |= PF_NO_TASKWORK;
> +}
> +
>  static __cold void io_ring_ctx_ref_free(struct percpu_ref *ref)
>  {
>  	struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
> @@ -253,6 +267,8 @@ static __cold void io_fallback_req_func(struct work_struct *work)
>  	struct io_kiocb *req, *tmp;
>  	struct io_tw_state ts = {};
>  
> +	io_kworker_tw_start();
> +
>  	percpu_ref_get(&ctx->refs);
>  	mutex_lock(&ctx->uring_lock);
>  	llist_for_each_entry_safe(req, tmp, node, io_task_work.node)
> @@ -260,6 +276,7 @@ static __cold void io_fallback_req_func(struct work_struct *work)
>  	io_submit_flush_completions(ctx);
>  	mutex_unlock(&ctx->uring_lock);
>  	percpu_ref_put(&ctx->refs);
> +	io_kworker_tw_end();
>  }
>  
>  static int io_alloc_hash_table(struct io_hash_table *table, unsigned bits)
> @@ -2876,6 +2893,8 @@ static __cold void io_ring_exit_work(struct work_struct *work)
>  	struct io_tctx_node *node;
>  	int ret;
>  
> +	io_kworker_tw_start();
> +
>  	/*
>  	 * If we're doing polled IO and end up having requests being
>  	 * submitted async (out-of-line), then completions can come in while
> @@ -2932,6 +2951,8 @@ static __cold void io_ring_exit_work(struct work_struct *work)
>  		 */
>  	} while (!wait_for_completion_interruptible_timeout(&ctx->ref_comp, interval));
>  
> +	io_kworker_tw_end();
> +
>  	init_completion(&exit.completion);
>  	init_task_work(&exit.task_work, io_tctx_exit_cb);
>  	exit.ctx = ctx;
> -- 
> 2.49.0
> 


* Re: [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK
  2025-04-11 13:48   ` Christian Brauner
@ 2025-04-11 14:37     ` Jens Axboe
  2025-04-14 10:10       ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Jens Axboe @ 2025-04-11 14:37 UTC (permalink / raw)
  To: Christian Brauner; +Cc: io-uring, asml.silence, linux-fsdevel

On 4/11/25 7:48 AM, Christian Brauner wrote:
> Seems fine. Although it has some potential for abuse. So maybe a
> VFS_WARN_ON_ONCE() that PF_NO_TASKWORK is only used with PF_KTHREAD
> would make sense.

Can certainly add that. You'd want that before the check for
in_interrupt and PF_NO_TASKWORK? Something ala

	/* PF_NO_TASKWORK should only be used with PF_KTHREAD */
	VFS_WARN_ON_ONCE((task->flags & PF_NO_TASKWORK) && !(task->flags & PF_KTHREAD));

?

> Acked-by: Christian Brauner <brauner@kernel.org>

Thanks!

-- 
Jens Axboe


* Re: [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable
  2025-04-11 13:55   ` Christian Brauner
@ 2025-04-11 14:38     ` Jens Axboe
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2025-04-11 14:38 UTC (permalink / raw)
  To: Christian Brauner; +Cc: io-uring, asml.silence, linux-fsdevel

On 4/11/25 7:55 AM, Christian Brauner wrote:
> On Wed, Apr 09, 2025 at 07:35:20AM -0600, Jens Axboe wrote:
>> There are two types of work here:
>>
>> 1) Fallback work, if the task is exiting
>> 2) The exit side cancelations
>>
>> and both of them may do the final fput() of a file. When this happens,
>> fput() will schedule delayed work. This slows down exits when io_uring
> 
> I was a bit surprised by this because it means that all those __fput()s
> are done with kthread credentials which is a bit surprising (but
> harmless afaict).

Sure hope it is, because that's already what happens off the delayed
fput that it'd otherwise go through!

>> needs to wait for that work to finish. It is possible to flush this via
>> flush_delayed_fput(), but that's a big hammer as other unrelated files
>> could be involved, and from other tasks as well.
>>
>> Add two io_uring helpers to temporarily clear PF_NO_TASKWORK for the
>> worker threads, and run any queued task_work before setting the flag
>> again. Then we can ensure we only flush related items that received
>> their final fput as part of work cancelation and flushing.
> 
> Ok, so the only change is that this isn't offloaded to the global
> delayed fput workqueue but to the task work that you're running off of
> your kthread helpers.

Exactly

-- 
Jens Axboe



* Re: [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK
  2025-04-11 14:37     ` Jens Axboe
@ 2025-04-14 10:10       ` Christian Brauner
  2025-04-14 14:29         ` Jens Axboe
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2025-04-14 10:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, asml.silence, linux-fsdevel

On Fri, Apr 11, 2025 at 08:37:51AM -0600, Jens Axboe wrote:
> On 4/11/25 7:48 AM, Christian Brauner wrote:
> > Seems fine. Although it has some potential for abuse. So maybe a
> > VFS_WARN_ON_ONCE() that PF_NO_TASKWORK is only used with PF_KTHREAD
> > would make sense.
> 
> Can certainly add that. You'd want that before the check for
> in_interrupt and PF_NO_TASKWORK? Something ala
> 
> 	/* PF_NO_TASKWORK should only be used with PF_KTHREAD */
> 	VFS_WARN_ON_ONCE((task->flags & PF_NO_TASKWORK) && !(task->flags & PF_KTHREAD));
> 
> ?

Yeah, sounds good!


* Re: [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK
  2025-04-14 10:10       ` Christian Brauner
@ 2025-04-14 14:29         ` Jens Axboe
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2025-04-14 14:29 UTC (permalink / raw)
  To: Christian Brauner; +Cc: io-uring, asml.silence, linux-fsdevel

On 4/14/25 4:10 AM, Christian Brauner wrote:
> On Fri, Apr 11, 2025 at 08:37:51AM -0600, Jens Axboe wrote:
>> On 4/11/25 7:48 AM, Christian Brauner wrote:
>>> Seems fine. Although it has some potential for abuse. So maybe a
>>> VFS_WARN_ON_ONCE() that PF_NO_TASKWORK is only used with PF_KTHREAD
>>> would make sense.
>>
>> Can certainly add that. You'd want that before the check for
>> in_interrupt and PF_NO_TASKWORK? Something ala
>>
>> 	/* PF_NO_TASKWORK should only be used with PF_KTHREAD */
>> 	VFS_WARN_ON_ONCE((task->flags & PF_NO_TASKWORK) && !(task->flags & PF_KTHREAD));
>>
>> ?
> 
> Yeah, sounds good!

I used the usual XOR trick for this kind of test, but placed in the same
spot:

https://git.kernel.dk/cgit/linux/commit/?h=io_uring-exit-cancel.2&id=d5ab108781ccc2f0f013fe009a010a1f29a4785d


-- 
Jens Axboe



* Re: [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK
  2025-04-09 13:35 ` [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK Jens Axboe
  2025-04-11 13:48   ` Christian Brauner
@ 2025-04-14 17:11   ` Mateusz Guzik
  2025-04-14 19:35     ` Jens Axboe
  1 sibling, 1 reply; 15+ messages in thread
From: Mateusz Guzik @ 2025-04-14 17:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, asml.silence, brauner, linux-fsdevel

On Wed, Apr 09, 2025 at 07:35:19AM -0600, Jens Axboe wrote:
> fput currently gates whether or not a task can run task_work on the
> PF_KTHREAD flag, which excludes kernel threads as they don't usually run
> task_work as they never exit to userspace. This punts the final fput
> done from a kthread to a delayed work item instead of using task_work.
> 
> It's perfectly viable to have the final fput done by the kthread itself,
> as long as it will actually run the task_work. Add a PF_NO_TASKWORK flag
> which is set by default by a kernel thread, and gate the task_work fput
> on that instead. This enables a kernel thread to clear this flag
> temporarily while putting files, as long as it runs its task_work
> manually.
> 
> This enables users like io_uring to ensure that when the final fput of a
> file is done as part of ring teardown to run the local task_work and
> hence know that all files have been properly put, without needing to
> resort to workqueue flushing tricks which can deadlock.
> 
> No functional changes in this patch.
> 
> Cc: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/file_table.c       | 2 +-
>  include/linux/sched.h | 2 +-
>  kernel/fork.c         | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/file_table.c b/fs/file_table.c
> index c04ed94cdc4b..e3c3dd1b820d 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -521,7 +521,7 @@ static void __fput_deferred(struct file *file)
>  		return;
>  	}
>  
> -	if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
> +	if (likely(!in_interrupt() && !(task->flags & PF_NO_TASKWORK))) {
>  		init_task_work(&file->f_task_work, ____fput);
>  		if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
>  			return;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac1982893..349c993fc32b 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1736,7 +1736,7 @@ extern struct pid *cad_pid;
>  						 * I am cleaning dirty pages from some other bdi. */
>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
> -#define PF__HOLE__00800000	0x00800000
> +#define PF_NO_TASKWORK		0x00800000	/* task doesn't run task_work */
>  #define PF__HOLE__01000000	0x01000000
>  #define PF__HOLE__02000000	0x02000000
>  #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c4b26cd8998b..8dd0b8a5348d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2261,7 +2261,7 @@ __latent_entropy struct task_struct *copy_process(
>  		goto fork_out;
>  	p->flags &= ~PF_KTHREAD;
>  	if (args->kthread)
> -		p->flags |= PF_KTHREAD;
> +		p->flags |= PF_KTHREAD | PF_NO_TASKWORK;
>  	if (args->user_worker) {
>  		/*
>  		 * Mark us a user worker, and block any signal that isn't

I don't have comments on the semantics here, but I do have comments on
some future-proofing.

To my reading kthreads on the stock kernel never execute task_work.

This suggests it would be nice for task_work_add() to at least WARN_ON
when the target task is a kthread. After all, you don't want a
task_work_add consumer adding work which will never execute.

But then for your patch to not produce any splats there would have to be
a flag blessing select kthreads as legitimate task_work consumers.

So my suggestion would be to add the WARN_ON() in task_work_add() prior
to anything in this patchset, then this patch would be extended with a
flag (PF_KTHREAD_DOES_TASK_WORK?) and relevant io_uring threads would
get the flag.

Then the machinery which sets/unsets PF_NO_TASKWORK can assert that:
1. it operates on a kthread...
2. ...with the PF_KTHREAD_DOES_TASK_WORK flag

This is just a suggestion though.
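
For illustration, the check might look something like this at the top of
task_work_add() (just a sketch; PF_KTHREAD_DOES_TASK_WORK is only the
name proposed above, not an existing flag):

	/* kthreads never return to userspace, so queued work would never
	 * run unless the kthread is explicitly blessed to run task_work */
	WARN_ON_ONCE((task->flags & PF_KTHREAD) &&
		     !(task->flags & PF_KTHREAD_DOES_TASK_WORK));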


* Re: [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK
  2025-04-14 17:11   ` Mateusz Guzik
@ 2025-04-14 19:35     ` Jens Axboe
  0 siblings, 0 replies; 15+ messages in thread
From: Jens Axboe @ 2025-04-14 19:35 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: io-uring, asml.silence, brauner, linux-fsdevel

On 4/14/25 11:11 AM, Mateusz Guzik wrote:
> On Wed, Apr 09, 2025 at 07:35:19AM -0600, Jens Axboe wrote:
>> fput currently gates whether or not a task can run task_work on the
>> PF_KTHREAD flag, which excludes kernel threads as they don't usually run
>> task_work as they never exit to userspace. This punts the final fput
>> done from a kthread to a delayed work item instead of using task_work.
>>
>> It's perfectly viable to have the final fput done by the kthread itself,
>> as long as it will actually run the task_work. Add a PF_NO_TASKWORK flag
>> which is set by default by a kernel thread, and gate the task_work fput
>> on that instead. This enables a kernel thread to clear this flag
>> temporarily while putting files, as long as it runs its task_work
>> manually.
>>
>> This enables users like io_uring to ensure that when the final fput of a
>> file is done as part of ring teardown to run the local task_work and
>> hence know that all files have been properly put, without needing to
>> resort to workqueue flushing tricks which can deadlock.
>>
>> No functional changes in this patch.
>>
>> Cc: Christian Brauner <brauner@kernel.org>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/file_table.c       | 2 +-
>>  include/linux/sched.h | 2 +-
>>  kernel/fork.c         | 2 +-
>>  3 files changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/file_table.c b/fs/file_table.c
>> index c04ed94cdc4b..e3c3dd1b820d 100644
>> --- a/fs/file_table.c
>> +++ b/fs/file_table.c
>> @@ -521,7 +521,7 @@ static void __fput_deferred(struct file *file)
>>  		return;
>>  	}
>>  
>> -	if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
>> +	if (likely(!in_interrupt() && !(task->flags & PF_NO_TASKWORK))) {
>>  		init_task_work(&file->f_task_work, ____fput);
>>  		if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
>>  			return;
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index f96ac1982893..349c993fc32b 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1736,7 +1736,7 @@ extern struct pid *cad_pid;
>>  						 * I am cleaning dirty pages from some other bdi. */
>>  #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
>>  #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
>> -#define PF__HOLE__00800000	0x00800000
>> +#define PF_NO_TASKWORK		0x00800000	/* task doesn't run task_work */
>>  #define PF__HOLE__01000000	0x01000000
>>  #define PF__HOLE__02000000	0x02000000
>>  #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index c4b26cd8998b..8dd0b8a5348d 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -2261,7 +2261,7 @@ __latent_entropy struct task_struct *copy_process(
>>  		goto fork_out;
>>  	p->flags &= ~PF_KTHREAD;
>>  	if (args->kthread)
>> -		p->flags |= PF_KTHREAD;
>> +		p->flags |= PF_KTHREAD | PF_NO_TASKWORK;
>>  	if (args->user_worker) {
>>  		/*
>>  		 * Mark us a user worker, and block any signal that isn't
> 
> I don't have comments on the semantics here, I do have comments on some
> future-proofing.
> 
> To my reading kthreads on the stock kernel never execute task_work.

Correct

> This suggests it would be nice for task_work_add() to at least WARN_ON
> when executing with a kthread. After all you don't want a task_work_add
> consumer adding work which will never execute.

I don't think there's much need for that, as I'm not aware of any kernel
usage that had a bug due to that. And if there were one, you'd find it
pretty quickly during testing, as that work would just never execute.

> But then for your patch to not produce any splats there would have to be
> a flag blessing select kthreads as legitimate task_work consumers.

This patchset very much adds a specific flag for that, PF_NO_TASKWORK,
and kernel threads have it set by default. It just separates the "do I
run task_work" flag from PF_KTHREAD. So yes you could add:

WARN_ON_ONCE(task->flags & PF_NO_TASKWORK);

to task_work_add(), but I'm not really convinced it'd be super useful.

-- 
Jens Axboe

