* [PATCHSET RFC 0/5] Wait on cancelations at release time
@ 2024-06-04 19:01 Jens Axboe
2024-06-04 19:01 ` [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK Jens Axboe
` (4 more replies)
0 siblings, 5 replies; 13+ messages in thread
From: Jens Axboe @ 2024-06-04 19:01 UTC (permalink / raw)
To: io-uring
Hi,
I've posted this before, but did a bit more work on it and am sending it
out again. The idea is to ensure that we've done any fputs that we need
to by the time a task using a ring exits, so that we don't leave references
that only get put "shortly afterwards". Currently cancelations are done by
ring exit work, which is punted to a kworker. This means that after the
final ->release() on the io_uring fd has completed, there can still be
pending fputs. This can be observed by running the following script:
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/data
ITER=0
while true; do
	echo loop $ITER
	sudo mount $DEV $MNT
	fio --name=test --ioengine=io_uring --iodepth=2 --filename=$MNT/foo --size=1g --buffered=1 --overwrite=0 --numjobs=12 --minimal --rw=randread --thread=1 --output=/dev/null --eta=never &
	Y=$(($RANDOM % 3))
	X=$(($RANDOM % 10))
	VAL="$Y.$X"
	sleep $VAL
	FIO_PID=$(pidof fio)
	if [ -z "$FIO_PID" ]; then
		((ITER++))
		continue
	fi
	ps -e | grep fio > /dev/null 2>&1
	while [ $? -eq 0 ]; do
		kill -KILL $FIO_PID > /dev/null 2>&1
		echo will wait
		wait > /dev/null 2>&1
		echo done waiting
		ps -e | grep "fio " > /dev/null 2>&1
	done
	sudo umount /data
	if [ $? -ne 0 ]; then
		break
	fi
	((ITER++))
done
which just starts a fio job doing buffered reads, kills it, waits for the
task to exit, and then immediately tries to unmount the filesystem.
Currently that will at some point trigger:
[...]
loop 9
will wait(f=12)
done waiting
umount: /data: target is busy.
as the umount raced with the final fputs on the files being accessed
on the mount point.
There are a few parts to this:
1) Final fput is done via task_work, but for kernel threads, it's done
via a delayed work queue. Patches 1+2 allow kernel threads to use
task_work like other threads, so that we can quiesce the fputs for the
task in question rather than needing to flush a system-wide global
pending list that can hold pending final releases for any task or file.
2) Patch 3 moves away from percpu reference counts, as those require
an RCU sync on freeing. As the goal is to move to sync cancelations
on exit, this can add considerable latency. Outside of that, percpu
ref counts provide a lot of guarantees and features that io_uring
doesn't need, and the new approach is faster.
3) Finally, make the cancelations sync. They are still offloaded to
a kworker, but the task doing ->release() waits for them to finish.
With this, the above test case works fine, as expected.
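For reference, the release side then ends up looking roughly like the
below (patch 5 has the real version, which uses an interruptible wait
and flushes fallback task_work while waiting):

static int io_uring_release(struct inode *inode, struct file *file)
{
	struct io_ring_ctx *ctx = file->private_data;
	DECLARE_COMPLETION_ONSTACK(exit_comp);

	file->private_data = NULL;
	WRITE_ONCE(ctx->exit_comp, &exit_comp);
	/* marks the ring refs as dying and queues io_ring_exit_work() */
	io_ring_ctx_wait_and_kill(ctx);
	/* wait for the exit/cancel work to have completed before returning */
	wait_for_completion(&exit_comp);
	return 0;
}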
I'll send patches 1+2 separately, but wanted to get this out for review
and discussion first.
Patches are against current -git, with io_uring 6.10 and 6.11 pending
changes pulled in. You can also find the patches here:
https://git.kernel.dk/cgit/linux/log/?h=io_uring-exit-cancel
fs/file_table.c | 2 +-
include/linux/io_uring_types.h | 4 +-
include/linux/sched.h | 2 +-
io_uring/Makefile | 2 +-
io_uring/io_uring.c | 77 ++++++++++++++++++++++++----------
io_uring/io_uring.h | 3 +-
io_uring/refs.c | 58 +++++++++++++++++++++++++
io_uring/refs.h | 53 +++++++++++++++++++++++
io_uring/register.c | 3 +-
io_uring/rw.c | 3 +-
io_uring/sqpoll.c | 3 +-
kernel/fork.c | 2 +-
12 files changed, 182 insertions(+), 30 deletions(-)
--
Jens Axboe
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK
2024-06-04 19:01 [PATCHSET RFC 0/5] Wait on cancelations at release time Jens Axboe
@ 2024-06-04 19:01 ` Jens Axboe
2024-06-04 19:01 ` [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable Jens Axboe
` (3 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2024-06-04 19:01 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
Rather than hardwire this to kernel threads, add a task flag that tells
us whether the task in question runs task_work or not. At fork time,
this flag is set for kernel threads. This is in preparation for allowing
kernel threads to signal that they will run deferred task_work.
No functional changes in this patch.
Signed-off-by: Jens Axboe <[email protected]>
---
fs/file_table.c | 2 +-
include/linux/sched.h | 2 +-
kernel/fork.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/file_table.c b/fs/file_table.c
index 4f03beed4737..d7c6685afbcb 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -477,7 +477,7 @@ void fput(struct file *file)
file_free(file);
return;
}
- if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
+ if (likely(!in_interrupt() && !(task->flags & PF_NO_TASKWORK))) {
init_task_work(&file->f_task_work, ____fput);
if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
return;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac6eab6..1393d557f05e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1635,7 +1635,7 @@ extern struct pid *cad_pid;
#define PF_USED_MATH 0x00002000 /* If unset the fpu must be initialized before use */
#define PF_USER_WORKER 0x00004000 /* Kernel thread cloned from userspace thread */
#define PF_NOFREEZE 0x00008000 /* This thread should not be frozen */
-#define PF__HOLE__00010000 0x00010000
+#define PF_NO_TASKWORK 0x00010000 /* task doesn't run task_work */
#define PF_KSWAPD 0x00020000 /* I am kswapd */
#define PF_MEMALLOC_NOFS 0x00040000 /* All allocations inherit GFP_NOFS. See memalloc_nfs_save() */
#define PF_MEMALLOC_NOIO 0x00080000 /* All allocations inherit GFP_NOIO. See memalloc_noio_save() */
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d8..156bf8778d18 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2222,7 +2222,7 @@ __latent_entropy struct task_struct *copy_process(
goto fork_out;
p->flags &= ~PF_KTHREAD;
if (args->kthread)
- p->flags |= PF_KTHREAD;
+ p->flags |= PF_KTHREAD | PF_NO_TASKWORK;
if (args->user_worker) {
/*
* Mark us a user worker, and block any signal that isn't
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable
2024-06-04 19:01 [PATCHSET RFC 0/5] Wait on cancelations at release time Jens Axboe
2024-06-04 19:01 ` [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK Jens Axboe
@ 2024-06-04 19:01 ` Jens Axboe
2024-06-05 15:01 ` Pavel Begunkov
2024-06-04 19:01 ` [PATCH 3/5] io_uring: move to using private ring references Jens Axboe
` (2 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2024-06-04 19:01 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
There are two types of work here:
1) Fallback work, if the task is exiting
2) The exit side cancelations
and both of them may do the final fput() of a file. When this happens,
fput() will schedule delayed work. This slows down exits when io_uring
needs to wait for that work to finish. It is possible to flush this via
flush_delayed_fput(), but that's a big hammer as other unrelated files
could be involved, and from other tasks as well.
Add two io_uring helpers to temporarily clear PF_NO_TASKWORK for the
worker threads, and run any queued task_work before setting the flag
again. Then we can ensure we only flush related items that received
their final fput as part of work cancelation and flushing.
For now these are io_uring private, but could obviously be made
generically available, should there be a need to do so.
Signed-off-by: Jens Axboe <[email protected]>
---
io_uring/io_uring.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 96f6da0bf5cd..3ad915262a45 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -234,6 +234,20 @@ static inline void io_req_add_to_cache(struct io_kiocb *req, struct io_ring_ctx
wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list);
}
+static __cold void io_kworker_tw_start(void)
+{
+ if (WARN_ON_ONCE(!(current->flags & PF_NO_TASKWORK)))
+ return;
+ current->flags &= ~PF_NO_TASKWORK;
+}
+
+static __cold void io_kworker_tw_end(void)
+{
+ while (task_work_pending(current))
+ task_work_run();
+ current->flags |= PF_NO_TASKWORK;
+}
+
static __cold void io_ring_ctx_ref_free(struct percpu_ref *ref)
{
struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
@@ -249,6 +263,8 @@ static __cold void io_fallback_req_func(struct work_struct *work)
struct io_kiocb *req, *tmp;
struct io_tw_state ts = {};
+ io_kworker_tw_start();
+
percpu_ref_get(&ctx->refs);
mutex_lock(&ctx->uring_lock);
llist_for_each_entry_safe(req, tmp, node, io_task_work.node)
@@ -256,6 +272,7 @@ static __cold void io_fallback_req_func(struct work_struct *work)
io_submit_flush_completions(ctx);
mutex_unlock(&ctx->uring_lock);
percpu_ref_put(&ctx->refs);
+ io_kworker_tw_end();
}
static int io_alloc_hash_table(struct io_hash_table *table, unsigned bits)
@@ -2720,6 +2737,8 @@ static __cold void io_ring_exit_work(struct work_struct *work)
struct io_tctx_node *node;
int ret;
+ io_kworker_tw_start();
+
/*
* If we're doing polled IO and end up having requests being
* submitted async (out-of-line), then completions can come in while
@@ -2770,6 +2789,8 @@ static __cold void io_ring_exit_work(struct work_struct *work)
*/
} while (!wait_for_completion_interruptible_timeout(&ctx->ref_comp, interval));
+ io_kworker_tw_end();
+
init_completion(&exit.completion);
init_task_work(&exit.task_work, io_tctx_exit_cb);
exit.ctx = ctx;
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 3/5] io_uring: move to using private ring references
2024-06-04 19:01 [PATCHSET RFC 0/5] Wait on cancelations at release time Jens Axboe
2024-06-04 19:01 ` [PATCH 1/5] fs: gate final fput task_work on PF_NO_TASKWORK Jens Axboe
2024-06-04 19:01 ` [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable Jens Axboe
@ 2024-06-04 19:01 ` Jens Axboe
2024-06-05 15:11 ` Pavel Begunkov
2024-06-04 19:01 ` [PATCH 4/5] io_uring: consider ring dead once the ref is marked dying Jens Axboe
2024-06-04 19:01 ` [PATCH 5/5] io_uring: wait for cancelations on final ring put Jens Axboe
4 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2024-06-04 19:01 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
io_uring currently uses percpu refcounts for the ring reference. This
works fine, but exiting a ring requires an RCU grace period to lapse
and this slows down ring exit quite a lot.
Add a basic per-cpu counter for our references instead, and use that.
This is in preparation for doing a sync wait on any request (notably
file) references on ring exit. As we're going to be waiting on ctx refs
going away as well with that, the RCU grace period wait becomes a
noticeable slowdown.
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/io_uring_types.h | 2 +-
io_uring/Makefile | 2 +-
io_uring/io_uring.c | 39 ++++++++++-------------
io_uring/refs.c | 58 ++++++++++++++++++++++++++++++++++
io_uring/refs.h | 53 +++++++++++++++++++++++++++++++
io_uring/register.c | 3 +-
io_uring/rw.c | 3 +-
io_uring/sqpoll.c | 3 +-
8 files changed, 135 insertions(+), 28 deletions(-)
create mode 100644 io_uring/refs.c
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index a2227ab7fd16..fc1e0e65d474 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -238,7 +238,7 @@ struct io_ring_ctx {
struct task_struct *submitter_task;
struct io_rings *rings;
- struct percpu_ref refs;
+ unsigned long ref_ptr;
enum task_work_notify_mode notify_method;
unsigned sq_thread_idle;
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 61923e11c767..b167ab8930a9 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -4,7 +4,7 @@
obj-$(CONFIG_IO_URING) += io_uring.o opdef.o kbuf.o rsrc.o notif.o \
tctx.o filetable.o rw.o net.o poll.o \
- eventfd.o uring_cmd.o openclose.o \
+ eventfd.o refs.o uring_cmd.o openclose.o \
sqpoll.o xattr.o nop.o fs.o splice.o \
sync.o msg_ring.o advise.o openclose.o \
epoll.o statx.o timeout.o fdinfo.o \
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3ad915262a45..841a5dd6ba89 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -248,13 +248,6 @@ static __cold void io_kworker_tw_end(void)
current->flags |= PF_NO_TASKWORK;
}
-static __cold void io_ring_ctx_ref_free(struct percpu_ref *ref)
-{
- struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
-
- complete(&ctx->ref_comp);
-}
-
static __cold void io_fallback_req_func(struct work_struct *work)
{
struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx,
@@ -265,13 +258,13 @@ static __cold void io_fallback_req_func(struct work_struct *work)
io_kworker_tw_start();
- percpu_ref_get(&ctx->refs);
+ io_ring_ref_get(ctx);
mutex_lock(&ctx->uring_lock);
llist_for_each_entry_safe(req, tmp, node, io_task_work.node)
req->io_task_work.func(req, &ts);
io_submit_flush_completions(ctx);
mutex_unlock(&ctx->uring_lock);
- percpu_ref_put(&ctx->refs);
+ io_ring_ref_put(ctx);
io_kworker_tw_end();
}
@@ -312,8 +305,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
goto err;
if (io_alloc_hash_table(&ctx->cancel_table_locked, hash_bits))
goto err;
- if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free,
- 0, GFP_KERNEL))
+ if (io_ring_ref_init(ctx))
goto err;
ctx->flags = p->flags;
@@ -939,7 +931,7 @@ __cold bool __io_alloc_req_refill(struct io_ring_ctx *ctx)
ret = 1;
}
- percpu_ref_get_many(&ctx->refs, ret);
+ io_ring_ref_get_many(ctx, ret);
while (ret--) {
struct io_kiocb *req = reqs[ret];
@@ -994,7 +986,7 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, struct io_tw_state *ts)
io_submit_flush_completions(ctx);
mutex_unlock(&ctx->uring_lock);
- percpu_ref_put(&ctx->refs);
+ io_ring_ref_put(ctx);
}
/*
@@ -1018,7 +1010,7 @@ struct llist_node *io_handle_tw_list(struct llist_node *node,
ctx_flush_and_put(ctx, &ts);
ctx = req->ctx;
mutex_lock(&ctx->uring_lock);
- percpu_ref_get(&ctx->refs);
+ io_ring_ref_get(ctx);
}
INDIRECT_CALL_2(req->io_task_work.func,
io_poll_task_func, io_req_rw_complete,
@@ -1062,10 +1054,10 @@ static __cold void io_fallback_tw(struct io_uring_task *tctx, bool sync)
if (sync && last_ctx != req->ctx) {
if (last_ctx) {
flush_delayed_work(&last_ctx->fallback_work);
- percpu_ref_put(&last_ctx->refs);
+ io_ring_ref_put(last_ctx);
}
last_ctx = req->ctx;
- percpu_ref_get(&last_ctx->refs);
+ io_ring_ref_get(last_ctx);
}
if (llist_add(&req->io_task_work.node,
&req->ctx->fallback_llist))
@@ -1074,7 +1066,7 @@ static __cold void io_fallback_tw(struct io_uring_task *tctx, bool sync)
if (last_ctx) {
flush_delayed_work(&last_ctx->fallback_work);
- percpu_ref_put(&last_ctx->refs);
+ io_ring_ref_put(last_ctx);
}
}
@@ -2566,7 +2558,7 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
nr++;
}
if (nr)
- percpu_ref_put_many(&ctx->refs, nr);
+ io_ring_ref_put_many(ctx, nr);
mutex_unlock(&ctx->uring_lock);
}
@@ -2610,7 +2602,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
}
io_rings_free(ctx);
- percpu_ref_exit(&ctx->refs);
+ io_ring_ref_free(ctx);
free_uid(ctx->user);
io_req_caches_free(ctx);
if (ctx->hash_map)
@@ -2636,7 +2628,7 @@ static __cold void io_activate_pollwq_cb(struct callback_head *cb)
* might've been lost due to loose synchronisation.
*/
wake_up_all(&ctx->poll_wq);
- percpu_ref_put(&ctx->refs);
+ io_ring_ref_put(ctx);
}
__cold void io_activate_pollwq(struct io_ring_ctx *ctx)
@@ -2654,9 +2646,9 @@ __cold void io_activate_pollwq(struct io_ring_ctx *ctx)
* only need to sync with it, which is done by injecting a tw
*/
init_task_work(&ctx->poll_wq_task_work, io_activate_pollwq_cb);
- percpu_ref_get(&ctx->refs);
+ io_ring_ref_get(ctx);
if (task_work_add(ctx->submitter_task, &ctx->poll_wq_task_work, TWA_SIGNAL))
- percpu_ref_put(&ctx->refs);
+ io_ring_ref_put(ctx);
out:
spin_unlock(&ctx->completion_lock);
}
@@ -2833,7 +2825,7 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
struct creds *creds;
mutex_lock(&ctx->uring_lock);
- percpu_ref_kill(&ctx->refs);
+ io_ring_ref_kill(ctx);
xa_for_each(&ctx->personalities, index, creds)
io_unregister_personality(ctx, index);
mutex_unlock(&ctx->uring_lock);
@@ -2848,6 +2840,7 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
* over using system_wq.
*/
queue_work(iou_wq, &ctx->exit_work);
+ io_ring_ref_put(ctx);
}
static int io_uring_release(struct inode *inode, struct file *file)
diff --git a/io_uring/refs.c b/io_uring/refs.c
new file mode 100644
index 000000000000..af21f3937f09
--- /dev/null
+++ b/io_uring/refs.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/percpu.h>
+#include <linux/io_uring.h>
+
+#include "refs.h"
+
+int io_ring_ref_init(struct io_ring_ctx *ctx)
+{
+ size_t align = max_t(size_t, 1 << __PERCPU_REF_FLAG_BITS,
+ __alignof__(local_t));
+
+ ctx->ref_ptr = (unsigned long) __alloc_percpu(sizeof(local_t), align);
+ if (ctx->ref_ptr)
+ return 0;
+
+ return -ENOMEM;
+}
+
+void io_ring_ref_free(struct io_ring_ctx *ctx)
+{
+ local_t __percpu *refs = io_ring_ref(ctx);
+
+ free_percpu(refs);
+ ctx->ref_ptr = 0;
+}
+
+/*
+ * Checks if all references are gone, completes if so.
+ */
+void __cold io_ring_ref_maybe_done(struct io_ring_ctx *ctx)
+{
+ local_t __percpu *refs = io_ring_ref(ctx);
+ long sum = 0;
+ int cpu;
+
+ preempt_disable();
+ for_each_possible_cpu(cpu)
+ sum += local_read(per_cpu_ptr(refs, cpu));
+ preempt_enable();
+
+ if (!sum)
+ complete(&ctx->ref_comp);
+}
+
+/*
+ * Mark the reference killed. This grabs a reference which the caller must
+ * drop.
+ */
+void io_ring_ref_kill(struct io_ring_ctx *ctx)
+{
+ io_ring_ref_get(ctx);
+ set_bit(CTX_REF_DEAD_BIT, &ctx->ref_ptr);
+ io_ring_ref_maybe_done(ctx);
+}
diff --git a/io_uring/refs.h b/io_uring/refs.h
index 63982ead9f7d..a4d4d46d6290 100644
--- a/io_uring/refs.h
+++ b/io_uring/refs.h
@@ -2,6 +2,7 @@
#define IOU_REQ_REF_H
#include <linux/atomic.h>
+#include <asm/local.h>
#include <linux/io_uring_types.h>
/*
@@ -52,4 +53,56 @@ static inline void io_req_set_refcount(struct io_kiocb *req)
{
__io_req_set_refcount(req, 1);
}
+
+int io_ring_ref_init(struct io_ring_ctx *ctx);
+void io_ring_ref_free(struct io_ring_ctx *ctx);
+void __cold io_ring_ref_maybe_done(struct io_ring_ctx *ctx);
+void io_ring_ref_kill(struct io_ring_ctx *ctx);
+
+enum {
+ CTX_REF_DEAD_BIT = 0UL,
+ CTX_REF_DEAD_MASK = 1UL,
+};
+
+static inline local_t __percpu *io_ring_ref(struct io_ring_ctx *ctx)
+{
+ return (local_t __percpu *) (ctx->ref_ptr & ~CTX_REF_DEAD_MASK);
+}
+
+static inline bool io_ring_ref_is_dying(struct io_ring_ctx *ctx)
+{
+ return test_bit(CTX_REF_DEAD_BIT, &ctx->ref_ptr);
+}
+
+static inline void io_ring_ref_get_many(struct io_ring_ctx *ctx, int nr)
+{
+ local_t __percpu *refs = io_ring_ref(ctx);
+
+ preempt_disable();
+ local_add(nr, this_cpu_ptr(refs));
+ preempt_enable();
+}
+
+static inline void io_ring_ref_get(struct io_ring_ctx *ctx)
+{
+ io_ring_ref_get_many(ctx, 1);
+}
+
+static inline void io_ring_ref_put_many(struct io_ring_ctx *ctx, int nr)
+{
+ local_t __percpu *refs = io_ring_ref(ctx);
+
+ preempt_disable();
+ local_sub(nr, this_cpu_ptr(refs));
+ preempt_enable();
+
+ if (unlikely(io_ring_ref_is_dying(ctx)))
+ io_ring_ref_maybe_done(ctx);
+}
+
+static inline void io_ring_ref_put(struct io_ring_ctx *ctx)
+{
+ io_ring_ref_put_many(ctx, 1);
+}
+
#endif
diff --git a/io_uring/register.c b/io_uring/register.c
index f121e02f5e10..9c1984e5c2f2 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -28,6 +28,7 @@
#include "kbuf.h"
#include "napi.h"
#include "eventfd.h"
+#include "refs.h"
#define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
IORING_REGISTER_LAST + IORING_OP_LAST)
@@ -347,7 +348,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
* We don't quiesce the refs for register anymore and so it can't be
* dying as we're holding a file ref here.
*/
- if (WARN_ON_ONCE(percpu_ref_is_dying(&ctx->refs)))
+ if (WARN_ON_ONCE(io_ring_ref_is_dying(ctx)))
return -ENXIO;
if (ctx->submitter_task && ctx->submitter_task != current)
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 1a2128459cb4..1092a6d5cefc 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -21,6 +21,7 @@
#include "alloc_cache.h"
#include "rsrc.h"
#include "poll.h"
+#include "refs.h"
#include "rw.h"
struct io_rw {
@@ -419,7 +420,7 @@ static bool io_rw_should_reissue(struct io_kiocb *req)
* Don't attempt to reissue from that path, just let it fail with
* -EAGAIN.
*/
- if (percpu_ref_is_dying(&ctx->refs))
+ if (io_ring_ref_is_dying(ctx))
return false;
/*
* Play it safe and assume not safe to re-import and reissue if we're
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index b3722e5275e7..de003b6b06ce 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -16,6 +16,7 @@
#include "io_uring.h"
#include "napi.h"
+#include "refs.h"
#include "sqpoll.h"
#define IORING_SQPOLL_CAP_ENTRIES_VALUE 8
@@ -190,7 +191,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries)
* Don't submit if refs are dying, good for io_uring_register(),
* but also it is relied upon by io_ring_exit_work()
*/
- if (to_submit && likely(!percpu_ref_is_dying(&ctx->refs)) &&
+ if (to_submit && likely(!io_ring_ref_is_dying(ctx)) &&
!(ctx->flags & IORING_SETUP_R_DISABLED))
ret = io_submit_sqes(ctx, to_submit);
mutex_unlock(&ctx->uring_lock);
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 4/5] io_uring: consider ring dead once the ref is marked dying
2024-06-04 19:01 [PATCHSET RFC 0/5] Wait on cancelations at release time Jens Axboe
` (2 preceding siblings ...)
2024-06-04 19:01 ` [PATCH 3/5] io_uring: move to using private ring references Jens Axboe
@ 2024-06-04 19:01 ` Jens Axboe
2024-06-04 19:01 ` [PATCH 5/5] io_uring: wait for cancelations on final ring put Jens Axboe
4 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2024-06-04 19:01 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
Don't gate this on the task exiting. It's generally not a good idea to
rely on the task PF_EXITING flag for this anyway. Once the ring starts
going through teardown, the ring ref is marked as dying. Use that as our
fallback/cancel trigger instead.
Signed-off-by: Jens Axboe <[email protected]>
---
io_uring/io_uring.c | 9 +++++++--
io_uring/io_uring.h | 3 ++-
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 841a5dd6ba89..5a4699170136 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -528,7 +528,11 @@ static void io_queue_iowq(struct io_kiocb *req)
* procedure rather than attempt to run this request (or create a new
* worker for it).
*/
- if (WARN_ON_ONCE(!same_thread_group(req->task, current)))
+ WARN_ON_ONCE(!io_ring_ref_is_dying(req->ctx) &&
+ !same_thread_group(req->task, current));
+
+ if (!same_thread_group(req->task, current) ||
+ io_ring_ref_is_dying(req->ctx))
req->work.flags |= IO_WQ_WORK_CANCEL;
trace_io_uring_queue_async_work(req, io_wq_is_hashed(&req->work));
@@ -1196,7 +1200,8 @@ static void io_req_normal_work_add(struct io_kiocb *req)
return;
}
- if (likely(!task_work_add(req->task, &tctx->task_work, ctx->notify_method)))
+ if (!io_ring_ref_is_dying(ctx) &&
+ !task_work_add(req->task, &tctx->task_work, ctx->notify_method))
return;
io_fallback_tw(tctx, false);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index cd43924eed04..55eac07d5fe0 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -11,6 +11,7 @@
#include "io-wq.h"
#include "slist.h"
#include "filetable.h"
+#include "refs.h"
#ifndef CREATE_TRACE_POINTS
#include <trace/events/io_uring.h>
@@ -122,7 +123,7 @@ static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
* Not from an SQE, as those cannot be submitted, but via
* updating tagged resources.
*/
- if (ctx->submitter_task->flags & PF_EXITING)
+ if (io_ring_ref_is_dying(ctx))
lockdep_assert(current_work());
else
lockdep_assert(current == ctx->submitter_task);
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 5/5] io_uring: wait for cancelations on final ring put
2024-06-04 19:01 [PATCHSET RFC 0/5] Wait on cancelations at release time Jens Axboe
` (3 preceding siblings ...)
2024-06-04 19:01 ` [PATCH 4/5] io_uring: consider ring dead once the ref is marked dying Jens Axboe
@ 2024-06-04 19:01 ` Jens Axboe
4 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2024-06-04 19:01 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
We still offload the cancelation to a workqueue, so as not to introduce
dependencies between the exiting task waiting on cleanup, and that task
needing to run task_work to complete the process.
This means that once the final ring put is done, any request that was
inflight and needed cancelation will be done as well. Notably requests
that hold references to files - once the ring fd close is done, we will
have dropped any of those references too.
Signed-off-by: Jens Axboe <[email protected]>
---
include/linux/io_uring_types.h | 2 ++
io_uring/io_uring.c | 16 ++++++++++++++++
2 files changed, 18 insertions(+)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index fc1e0e65d474..a6b5f041423f 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -420,6 +420,8 @@ struct io_ring_ctx {
unsigned short n_sqe_pages;
struct page **ring_pages;
struct page **sqe_pages;
+
+ struct completion *exit_comp;
};
struct io_tw_state {
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 5a4699170136..3000a865baec 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2730,6 +2730,7 @@ static __cold void io_ring_exit_work(struct work_struct *work)
struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, exit_work);
unsigned long timeout = jiffies + HZ * 60 * 5;
unsigned long interval = HZ / 20;
+ struct completion *exit_comp;
struct io_tctx_exit exit;
struct io_tctx_node *node;
int ret;
@@ -2788,6 +2789,10 @@ static __cold void io_ring_exit_work(struct work_struct *work)
io_kworker_tw_end();
+ exit_comp = READ_ONCE(ctx->exit_comp);
+ if (exit_comp)
+ complete(exit_comp);
+
init_completion(&exit.completion);
init_task_work(&exit.task_work, io_tctx_exit_cb);
exit.ctx = ctx;
@@ -2851,9 +2856,20 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
static int io_uring_release(struct inode *inode, struct file *file)
{
struct io_ring_ctx *ctx = file->private_data;
+ DECLARE_COMPLETION_ONSTACK(exit_comp);
file->private_data = NULL;
+ WRITE_ONCE(ctx->exit_comp, &exit_comp);
io_ring_ctx_wait_and_kill(ctx);
+
+ /*
+ * Wait for cancel to run before exiting task
+ */
+ do {
+ if (current->io_uring)
+ io_fallback_tw(current->io_uring, false);
+ } while (wait_for_completion_interruptible(&exit_comp));
+
return 0;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable
2024-06-04 19:01 ` [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable Jens Axboe
@ 2024-06-05 15:01 ` Pavel Begunkov
2024-06-05 18:08 ` Jens Axboe
0 siblings, 1 reply; 13+ messages in thread
From: Pavel Begunkov @ 2024-06-05 15:01 UTC (permalink / raw)
To: Jens Axboe, io-uring
On 6/4/24 20:01, Jens Axboe wrote:
> There are two types of work here:
>
> 1) Fallback work, if the task is exiting
> 2) The exit side cancelations
>
> and both of them may do the final fput() of a file. When this happens,
> fput() will schedule delayed work. This slows down exits when io_uring
> needs to wait for that work to finish. It is possible to flush this via
> flush_delayed_fput(), but that's a big hammer as other unrelated files
> could be involved, and from other tasks as well.
>
> Add two io_uring helpers to temporarily clear PF_NO_TASKWORK for the
> worker threads, and run any queued task_work before setting the flag
> again. Then we can ensure we only flush related items that received
> their final fput as part of work cancelation and flushing.
>
> For now these are io_uring private, but could obviously be made
> generically available, should there be a need to do so.
>
> Signed-off-by: Jens Axboe <[email protected]>
> ---
> io_uring/io_uring.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 96f6da0bf5cd..3ad915262a45 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -234,6 +234,20 @@ static inline void io_req_add_to_cache(struct io_kiocb *req, struct io_ring_ctx
> wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list);
> }
>
> +static __cold void io_kworker_tw_start(void)
> +{
> + if (WARN_ON_ONCE(!(current->flags & PF_NO_TASKWORK)))
> + return;
> + current->flags &= ~PF_NO_TASKWORK;
> +}
> +
> +static __cold void io_kworker_tw_end(void)
> +{
> + while (task_work_pending(current))
> + task_work_run();
Clear TIF_NOTIFY_SIGNAL/RESUME? Maybe even retrying task_work_run()
after and looping around if there are items to execute.
> + current->flags |= PF_NO_TASKWORK;
> +}
> +
> static __cold void io_ring_ctx_ref_free(struct percpu_ref *ref)
> {
> struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
> @@ -249,6 +263,8 @@ static __cold void io_fallback_req_func(struct work_struct *work)
> struct io_kiocb *req, *tmp;
> struct io_tw_state ts = {};
>
> + io_kworker_tw_start();
> +
> percpu_ref_get(&ctx->refs);
> mutex_lock(&ctx->uring_lock);
> llist_for_each_entry_safe(req, tmp, node, io_task_work.node)
> @@ -256,6 +272,7 @@ static __cold void io_fallback_req_func(struct work_struct *work)
> io_submit_flush_completions(ctx);
> mutex_unlock(&ctx->uring_lock);
> percpu_ref_put(&ctx->refs);
> + io_kworker_tw_end();
> }
>
> static int io_alloc_hash_table(struct io_hash_table *table, unsigned bits)
> @@ -2720,6 +2737,8 @@ static __cold void io_ring_exit_work(struct work_struct *work)
> struct io_tctx_node *node;
> int ret;
>
> + io_kworker_tw_start();
> +
> /*
> * If we're doing polled IO and end up having requests being
> * submitted async (out-of-line), then completions can come in while
> @@ -2770,6 +2789,8 @@ static __cold void io_ring_exit_work(struct work_struct *work)
> */
> } while (!wait_for_completion_interruptible_timeout(&ctx->ref_comp, interval));
>
> + io_kworker_tw_end();
> +
> init_completion(&exit.completion);
> init_task_work(&exit.task_work, io_tctx_exit_cb);
> exit.ctx = ctx;
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 3/5] io_uring: move to using private ring references
2024-06-04 19:01 ` [PATCH 3/5] io_uring: move to using private ring references Jens Axboe
@ 2024-06-05 15:11 ` Pavel Begunkov
2024-06-05 16:31 ` Pavel Begunkov
0 siblings, 1 reply; 13+ messages in thread
From: Pavel Begunkov @ 2024-06-05 15:11 UTC (permalink / raw)
To: Jens Axboe, io-uring
On 6/4/24 20:01, Jens Axboe wrote:
> io_uring currently uses percpu refcounts for the ring reference. This
> works fine, but exiting a ring requires an RCU grace period to lapse
> and this slows down ring exit quite a lot.
>
> Add a basic per-cpu counter for our references instead, and use that.
All the synchronisation heavy lifting is done by RCU, what
makes it safe to read other CPUs counters in
io_ring_ref_maybe_done()?
Let's say you have 1 ref, then:
CPU1: fallback: get_ref();
CPU2: put_ref(); io_ring_ref_maybe_done();
There should be 1 ref left but without extra sync
io_ring_ref_maybe_done() can read the old value from CPU1
before the get => UAF.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 3/5] io_uring: move to using private ring references
2024-06-05 15:11 ` Pavel Begunkov
@ 2024-06-05 16:31 ` Pavel Begunkov
2024-06-05 19:13 ` Pavel Begunkov
0 siblings, 1 reply; 13+ messages in thread
From: Pavel Begunkov @ 2024-06-05 16:31 UTC (permalink / raw)
To: Jens Axboe, io-uring
On 6/5/24 16:11, Pavel Begunkov wrote:
> On 6/4/24 20:01, Jens Axboe wrote:
>> io_uring currently uses percpu refcounts for the ring reference. This
>> works fine, but exiting a ring requires an RCU grace period to lapse
>> and this slows down ring exit quite a lot.
>>
>> Add a basic per-cpu counter for our references instead, and use that.
>
> All the synchronisation heavy lifting is done by RCU, what
> makes it safe to read other CPUs counters in
> io_ring_ref_maybe_done()?
Other options are expedited RCU (Paul saying it's an order of
magnitude faster), or to switch to plain atomics since it's cached,
but it's only good if submitter and waiter are the same task. Paul
also mentioned more elaborate approaches like percpu (to reduce
contention) atomics.
> Let's say you have 1 ref, then:
>
> CPU1: fallback: get_ref();
> CPU2: put_ref(); io_ring_ref_maybe_done();
>
> There should be 1 ref left but without extra sync
> io_ring_ref_maybe_done() can read the old value from CPU1
> before the get => UAF.
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/5] io_uring: mark exit side kworkers as task_work capable
2024-06-05 15:01 ` Pavel Begunkov
@ 2024-06-05 18:08 ` Jens Axboe
0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2024-06-05 18:08 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
On 6/5/24 9:01 AM, Pavel Begunkov wrote:
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 96f6da0bf5cd..3ad915262a45 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -234,6 +234,20 @@ static inline void io_req_add_to_cache(struct io_kiocb *req, struct io_ring_ctx
>> wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list);
>> }
>> +static __cold void io_kworker_tw_start(void)
>> +{
>> + if (WARN_ON_ONCE(!(current->flags & PF_NO_TASKWORK)))
>> + return;
>> + current->flags &= ~PF_NO_TASKWORK;
>> +}
>> +
>> +static __cold void io_kworker_tw_end(void)
>> +{
>> + while (task_work_pending(current))
>> + task_work_run();
>
> Clear TIF_NOTIFY_SIGNAL/RESUME? Maybe even retrying task_work_run()
> after and looping around if there are items to execute.
Yeah good point, it should clear the notifiers too. Will make that
change.
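Something like the below is what I have in mind - completely untested,
and the exact helpers for clearing the notification flags may need
tweaking:

static __cold void io_kworker_tw_end(void)
{
	/*
	 * Run anything that's pending, and clear the notification flags
	 * that task_work_add() will have set - a kworker never returns to
	 * userspace, so nothing else will consume them. Loop in case
	 * running the work queues up more task_work.
	 */
	do {
		task_work_run();
		clear_notify_signal();
		clear_thread_flag(TIF_NOTIFY_RESUME);
	} while (task_work_pending(current));
	current->flags |= PF_NO_TASKWORK;
}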
--
Jens Axboe
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 3/5] io_uring: move to using private ring references
2024-06-05 16:31 ` Pavel Begunkov
@ 2024-06-05 19:13 ` Pavel Begunkov
2024-06-05 19:29 ` Jens Axboe
0 siblings, 1 reply; 13+ messages in thread
From: Pavel Begunkov @ 2024-06-05 19:13 UTC (permalink / raw)
To: Jens Axboe, io-uring
On 6/5/24 17:31, Pavel Begunkov wrote:
> On 6/5/24 16:11, Pavel Begunkov wrote:
>> On 6/4/24 20:01, Jens Axboe wrote:
>>> io_uring currently uses percpu refcounts for the ring reference. This
>>> works fine, but exiting a ring requires an RCU grace period to lapse
>>> and this slows down ring exit quite a lot.
>>>
>>> Add a basic per-cpu counter for our references instead, and use that.
>>
>> All the synchronisation heavy lifting is done by RCU, what
>> makes it safe to read other CPUs counters in
>> io_ring_ref_maybe_done()?
>
> Other options are expedited RCU (Paul saying it's an order of
> magnitude faster), or to switch to plain atomics since it's cached,
> but it's only good if submitter and waiter are the same task. Paul
I mixed it with task refs, ctx refs should be cached well
for any configuration as they're bound to requests (and req
caches).
> also mentioned more elaborate approaches like percpu (to reduce
> contention) atomics.
>
>> Let's say you have 1 ref, then:
>>
>> CPU1: fallback: get_ref();
>> CPU2: put_ref(); io_ring_ref_maybe_done();
>>
>> There should be 1 ref left but without extra sync
>> io_ring_ref_maybe_done() can read the old value from CPU1
>> before the get => UAF.
>>
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 3/5] io_uring: move to using private ring references
2024-06-05 19:13 ` Pavel Begunkov
@ 2024-06-05 19:29 ` Jens Axboe
2024-06-05 19:39 ` Jens Axboe
0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2024-06-05 19:29 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
On 6/5/24 1:13 PM, Pavel Begunkov wrote:
> On 6/5/24 17:31, Pavel Begunkov wrote:
>> On 6/5/24 16:11, Pavel Begunkov wrote:
>>> On 6/4/24 20:01, Jens Axboe wrote:
>>>> io_uring currently uses percpu refcounts for the ring reference. This
>>>> works fine, but exiting a ring requires an RCU grace period to lapse
>>>> and this slows down ring exit quite a lot.
>>>>
>>>> Add a basic per-cpu counter for our references instead, and use that.
>>>
>>> All the synchronisation heavy lifting is done by RCU, what
>>> makes it safe to read other CPUs counters in
>>> io_ring_ref_maybe_done()?
>>
>> Other options are expedited RCU (Paul saying it's an order of
>> magnitude faster), or to switch to plain atomics since it's cached,
>> but it's only good if submitter and waiter are the same task. Paul
>
> I mixed it with task refs, ctx refs should be cached well
> for any configuration as they're bound to requests (and req
> caches).
That's a good point, maybe even our current RCU approach is overkill
since we do the caching pretty well. Let me run a quick test, just
switching this to a basic atomic_t. The dead mask can just be the 31st
bit.
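Roughly what I'm thinking of, as a completely untested sketch - this
assumes ctx->refs becomes a plain atomic_t, and the naming/bit layout
is just a placeholder:

/* dead flag stashed in the top bit of the reference count */
#define CTX_REF_DEAD	(1U << 31)

static inline void io_ring_ref_get(struct io_ring_ctx *ctx)
{
	atomic_inc(&ctx->refs);
}

static inline bool io_ring_ref_is_dying(struct io_ring_ctx *ctx)
{
	return atomic_read(&ctx->refs) & CTX_REF_DEAD;
}

static inline void io_ring_ref_put(struct io_ring_ctx *ctx)
{
	/* last put after kill leaves just the dead bit - wake the exit side */
	if ((unsigned int)atomic_dec_return(&ctx->refs) == CTX_REF_DEAD)
		complete(&ctx->ref_comp);
}

static inline void io_ring_ref_kill(struct io_ring_ctx *ctx)
{
	/* caller holds a ref across this, like with the percpu variant */
	atomic_or(CTX_REF_DEAD, &ctx->refs);
}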
--
Jens Axboe
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 3/5] io_uring: move to using private ring references
2024-06-05 19:29 ` Jens Axboe
@ 2024-06-05 19:39 ` Jens Axboe
0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2024-06-05 19:39 UTC (permalink / raw)
To: Pavel Begunkov, io-uring
On 6/5/24 1:29 PM, Jens Axboe wrote:
> On 6/5/24 1:13 PM, Pavel Begunkov wrote:
>> On 6/5/24 17:31, Pavel Begunkov wrote:
>>> On 6/5/24 16:11, Pavel Begunkov wrote:
>>>> On 6/4/24 20:01, Jens Axboe wrote:
>>>>> io_uring currently uses percpu refcounts for the ring reference. This
>>>>> works fine, but exiting a ring requires an RCU grace period to lapse
>>>>> and this slows down ring exit quite a lot.
>>>>>
>>>>> Add a basic per-cpu counter for our references instead, and use that.
>>>>
>>>> All the synchronisation heavy lifting is done by RCU, what
>>>> makes it safe to read other CPUs counters in
>>>> io_ring_ref_maybe_done()?
>>>
>>> Other options are expedited RCU (Paul saying it's an order of
>>> magnitude faster), or to switch to plain atomics since it's cached,
>>> but it's only good if submitter and waiter are the same task. Paul
>>
>> I mixed it with task refs, ctx refs should be cached well
>> for any configuration as they're bound to requests (and req
>> caches).
>
> That's a good point, maybe even our current RCU approach is overkill
> since we do the caching pretty well. Let me run a quick test, just
> switching this to a basic atomic_t. The dead mask can just be the 31st
> bit.
Well, the exception is non-local task_work, where we still grab and put
a reference on the ctx for each context while iterating.
Outside of that, the request pre-alloc takes care of the rest.
--
Jens Axboe
^ permalink raw reply [flat|nested] 13+ messages in thread