public inbox for [email protected]
* [RFC 00/11] io_uring specific task_work infra
@ 2022-04-21 13:44 Pavel Begunkov
  2022-04-21 13:44 ` [RFC 01/11] io_uring: optimise io_req_task_work_add Pavel Begunkov
                   ` (11 more replies)
  0 siblings, 12 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-21 13:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

For experiments only. If it proves to be useful, it would need to be made
nicer on the non-io_uring side.

Patches 1-10 save 1 spin_lock/unlock_irq pair and 2 cmpxchg per batch.
11/11 additionally trades 1 spin_lock/unlock_irq per tw add and 2 spinlock
sections plus 2 cmpxchg per batch for 1 cmpxchg per tw add and 1 cmpxchg
per batch.

Pavel Begunkov (11):
  io_uring: optimise io_req_task_work_add
  io_uring: add io_should_fail_tw() helper
  io_uring: ban tw queue for exiting processes
  io_uring: don't take ctx refs in tctx_task_work()
  io_uring: add dummy io_uring_task_work_run()
  task_work: add helper for signalling a task
  io_uring: run io_uring task_works on TIF_NOTIFY_SIGNAL
  io_uring: wire io_uring specific task work
  io_uring: refactor io_run_task_work()
  io_uring: remove priority tw list
  io_uring: lock-free task_work stack

 fs/io-wq.c                |   1 +
 fs/io_uring.c             | 213 +++++++++++++++-----------------------
 include/linux/io_uring.h  |   4 +
 include/linux/task_work.h |   4 +
 kernel/entry/kvm.c        |   1 +
 kernel/signal.c           |   2 +
 kernel/task_work.c        |  33 +++---
 7 files changed, 115 insertions(+), 143 deletions(-)

-- 
2.36.0



* [RFC 01/11] io_uring: optimise io_req_task_work_add
  2022-04-21 13:44 [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
@ 2022-04-21 13:44 ` Pavel Begunkov
  2022-04-21 13:44 ` [RFC 02/11] io_uring: add io_should_fail_tw() helper Pavel Begunkov
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-21 13:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Deduplicate wq_list_add_tail() calls in io_req_task_work_add(), because,
apparently, some compilers fail to optimise it and generate a bunch of
extra instructions.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3905b3ec87b8..8011a61e6bd4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2580,6 +2580,7 @@ static void tctx_task_work(struct callback_head *cb)
 
 static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 {
+	struct io_wq_work_list *list;
 	struct task_struct *tsk = req->task;
 	struct io_uring_task *tctx = tsk->io_uring;
 	enum task_work_notify_mode notify;
@@ -2592,10 +2593,8 @@ static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 	io_drop_inflight_file(req);
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
-	if (priority)
-		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
-	else
-		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
+	list = priority ? &tctx->prior_task_list : &tctx->task_list;
+	wq_list_add_tail(&req->io_task_work.node, list);
 	running = tctx->task_running;
 	if (!running)
 		tctx->task_running = true;
-- 
2.36.0



* [RFC 02/11] io_uring: add io_should_fail_tw() helper
  2022-04-21 13:44 [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
  2022-04-21 13:44 ` [RFC 01/11] io_uring: optimise io_req_task_work_add Pavel Begunkov
@ 2022-04-21 13:44 ` Pavel Begunkov
  2022-04-21 13:44 ` [RFC 03/11] io_uring: ban tw queue for exiting processes Pavel Begunkov
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-21 13:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Add a simple helper telling a tw handler whether it should cancel
requests, i.e. when the owner task is exiting.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8011a61e6bd4..272a180ab7ee 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1304,6 +1304,11 @@ static inline void req_ref_get(struct io_kiocb *req)
 	atomic_inc(&req->refs);
 }
 
+static inline bool io_should_fail_tw(struct io_kiocb *req)
+{
+	return unlikely(req->task->flags & PF_EXITING);
+}
+
 static inline void io_submit_flush_completions(struct io_ring_ctx *ctx)
 {
 	if (!wq_list_empty(&ctx->submit_state.compl_reqs))
@@ -2641,8 +2646,8 @@ static void io_req_task_cancel(struct io_kiocb *req, bool *locked)
 static void io_req_task_submit(struct io_kiocb *req, bool *locked)
 {
 	io_tw_lock(req->ctx, locked);
-	/* req->task == current here, checking PF_EXITING is safe */
-	if (likely(!(req->task->flags & PF_EXITING)))
+
+	if (!io_should_fail_tw(req))
 		io_queue_sqe(req);
 	else
 		io_req_complete_failed(req, -EFAULT);
@@ -5867,8 +5872,7 @@ static int io_poll_check_events(struct io_kiocb *req, bool locked)
 	struct io_ring_ctx *ctx = req->ctx;
 	int v;
 
-	/* req->task == current here, checking PF_EXITING is safe */
-	if (unlikely(req->task->flags & PF_EXITING))
+	if (io_should_fail_tw(req))
 		return -ECANCELED;
 
 	do {
@@ -7418,7 +7422,7 @@ static void io_req_task_link_timeout(struct io_kiocb *req, bool *locked)
 	int ret = -ENOENT;
 
 	if (prev) {
-		if (!(req->task->flags & PF_EXITING))
+		if (!io_should_fail_tw(req))
 			ret = io_try_cancel_userdata(req, prev->cqe.user_data);
 		io_req_complete_post(req, ret ?: -ETIME, 0);
 		io_put_req(prev);
-- 
2.36.0



* [RFC 03/11] io_uring: ban tw queue for exiting processes
  2022-04-21 13:44 [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
  2022-04-21 13:44 ` [RFC 01/11] io_uring: optimise io_req_task_work_add Pavel Begunkov
  2022-04-21 13:44 ` [RFC 02/11] io_uring: add io_should_fail_tw() helper Pavel Begunkov
@ 2022-04-21 13:44 ` Pavel Begunkov
  2022-04-21 13:44 ` [RFC 04/11] io_uring: don't take ctx refs in tctx_task_work() Pavel Begunkov
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-21 13:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

We currently rely on PF_EXITING and the task_work infrastructure to
prevent adding new task_work items to a dying task, which is more
convoluted than desired.

Ban new tw items earlier, in io_uring_cancel_generic(), by relying on
->in_idle. io_req_task_work_add() will check the flag, set REQ_F_FAIL and
push requests to the fallback path. task_work handlers will find the flag
and cancel requests just as they did with PF_EXITING.
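
A rough userspace sketch of the pattern (pthreads, made-up names, not the
kernel code): the "dying" flag is flipped under the same lock that
protects queueing, so an add either completes before the flag is set or
observes it and reroutes everything to the fallback (cancel) path.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct work {
	struct work *next;
	bool failed;
};

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct work *list;	/* normal tw queue, protected by 'lock' */
static struct work *fallback;	/* cancel path */
static bool dying;		/* stands in for ->in_idle */

static void work_add(struct work *w)
{
	struct work *cancel = NULL;

	pthread_mutex_lock(&lock);
	w->next = list;
	list = w;
	if (dying) {
		/* too late: grab everything queued so far for cancellation */
		cancel = list;
		list = NULL;
	}
	pthread_mutex_unlock(&lock);

	while (cancel) {
		struct work *req = cancel;

		cancel = req->next;
		req->failed = true;	/* REQ_F_FAIL equivalent */
		req->next = fallback;	/* punt to the fallback path */
		fallback = req;
	}
}

static void start_cancel(void)
{
	/* same lock as work_add(), so the check above can't be missed */
	pthread_mutex_lock(&lock);
	dying = true;
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	struct work a = {}, b = {};

	work_add(&a);
	start_cancel();
	work_add(&b);	/* both end up failed on the fallback path */
	printf("a failed=%d, b failed=%d\n", a.failed, b.failed);
	return 0;
}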

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 272a180ab7ee..ec5fe55ab265 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1306,7 +1306,7 @@ static inline void req_ref_get(struct io_kiocb *req)
 
 static inline bool io_should_fail_tw(struct io_kiocb *req)
 {
-	return unlikely(req->task->flags & PF_EXITING);
+	return unlikely(req->flags & REQ_F_FAIL);
 }
 
 static inline void io_submit_flush_completions(struct io_ring_ctx *ctx)
@@ -2577,10 +2577,6 @@ static void tctx_task_work(struct callback_head *cb)
 	}
 
 	ctx_flush_and_put(ctx, &uring_locked);
-
-	/* relaxed read is enough as only the task itself sets ->in_idle */
-	if (unlikely(atomic_read(&tctx->in_idle)))
-		io_uring_drop_tctx_refs(current);
 }
 
 static void io_req_task_work_add(struct io_kiocb *req, bool priority)
@@ -2600,6 +2596,9 @@ static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 	spin_lock_irqsave(&tctx->task_lock, flags);
 	list = priority ? &tctx->prior_task_list : &tctx->task_list;
 	wq_list_add_tail(&req->io_task_work.node, list);
+	if (unlikely(atomic_read(&tctx->in_idle)))
+		goto cancel_locked;
+
 	running = tctx->task_running;
 	if (!running)
 		tctx->task_running = true;
@@ -2623,12 +2622,13 @@ static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 	}
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
-	tctx->task_running = false;
+cancel_locked:
 	node = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
 	spin_unlock_irqrestore(&tctx->task_lock, flags);
 
 	while (node) {
 		req = container_of(node, struct io_kiocb, io_task_work.node);
+		req_set_fail(req);
 		node = node->next;
 		if (llist_add(&req->io_task_work.fallback_node,
 			      &req->ctx->fallback_llist))
@@ -10352,7 +10352,10 @@ static __cold void io_uring_cancel_generic(bool cancel_all,
 	if (tctx->io_wq)
 		io_wq_exit_start(tctx->io_wq);
 
+	spin_lock_irq(&tctx->task_lock);
 	atomic_inc(&tctx->in_idle);
+	spin_unlock_irq(&tctx->task_lock);
+
 	do {
 		io_uring_drop_tctx_refs(current);
 		/* read completions before cancelations */
-- 
2.36.0



* [RFC 04/11] io_uring: don't take ctx refs in tctx_task_work()
  2022-04-21 13:44 [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
                   ` (2 preceding siblings ...)
  2022-04-21 13:44 ` [RFC 03/11] io_uring: ban tw queue for exiting processes Pavel Begunkov
@ 2022-04-21 13:44 ` Pavel Begunkov
  2022-04-21 13:44 ` [RFC 05/11] io_uring: add dummy io_uring_task_work_run() Pavel Begunkov
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-21 13:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Now we ban adding new req task_work items once task cancellation has
started. The tctx is removed from a ctx's list only during task
cancellation, and considering that it's being removed from the task
context, every ring tctx_task_work() is working with still has current
accounted and so will stay alive at least while it's running. Don't take
extra ctx refs.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index ec5fe55ab265..8d5aff1ecb4c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2475,7 +2475,6 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
 		mutex_unlock(&ctx->uring_lock);
 		*locked = false;
 	}
-	percpu_ref_put(&ctx->refs);
 }
 
 static inline void ctx_commit_and_unlock(struct io_ring_ctx *ctx)
@@ -2506,7 +2505,6 @@ static void handle_prev_tw_list(struct io_wq_work_node *node,
 			*ctx = req->ctx;
 			/* if not contended, grab and improve batching */
 			*uring_locked = mutex_trylock(&(*ctx)->uring_lock);
-			percpu_ref_get(&(*ctx)->refs);
 			if (unlikely(!*uring_locked))
 				spin_lock(&(*ctx)->completion_lock);
 		}
@@ -2537,7 +2535,6 @@ static void handle_tw_list(struct io_wq_work_node *node,
 			*ctx = req->ctx;
 			/* if not contended, grab and improve batching */
 			*locked = mutex_trylock(&(*ctx)->uring_lock);
-			percpu_ref_get(&(*ctx)->refs);
 		}
 		req->io_task_work.func(req, locked);
 		node = next;
-- 
2.36.0



* [RFC 05/11] io_uring: add dummy io_uring_task_work_run()
  2022-04-21 13:44 [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
                   ` (3 preceding siblings ...)
  2022-04-21 13:44 ` [RFC 04/11] io_uring: don't take ctx refs in tctx_task_work() Pavel Begunkov
@ 2022-04-21 13:44 ` Pavel Begunkov
  2022-04-21 13:44 ` [RFC 06/11] task_work: add helper for signalling a task Pavel Begunkov
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-21 13:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

In preparation for decoupling io_uring from task_work, add a new helper.
It's empty for now but will be used for running io_uring's task_work items.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 1814e698d861..e87ed946214f 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -5,6 +5,10 @@
 #include <linux/sched.h>
 #include <linux/xarray.h>
 
+static inline void io_uring_task_work_run(void)
+{
+}
+
 #if defined(CONFIG_IO_URING)
 struct sock *io_uring_get_socket(struct file *file);
 void __io_uring_cancel(bool cancel_all);
-- 
2.36.0



* [RFC 06/11] task_work: add helper for signalling a task
  2022-04-21 13:44 [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
                   ` (4 preceding siblings ...)
  2022-04-21 13:44 ` [RFC 05/11] io_uring: add dummy io_uring_task_work_run() Pavel Begunkov
@ 2022-04-21 13:44 ` Pavel Begunkov
  2022-04-21 13:44 ` [RFC 07/11] io_uring: run io_uring task_works on TIF_NOTIFY_SIGNAL Pavel Begunkov
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-21 13:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Extract a task_work_notify() helper from task_work_add(), so we can use
it in the future.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/task_work.h |  3 +++
 kernel/task_work.c        | 33 +++++++++++++++++++--------------
 2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 897494b597ba..0c5fc557ecd9 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -27,6 +27,9 @@ static inline bool task_work_pending(struct task_struct *task)
 int task_work_add(struct task_struct *task, struct callback_head *twork,
 			enum task_work_notify_mode mode);
 
+void task_work_notify(struct task_struct *task,
+			enum task_work_notify_mode notify);
+
 struct callback_head *task_work_cancel_match(struct task_struct *task,
 	bool (*match)(struct callback_head *, void *data), void *data);
 struct callback_head *task_work_cancel(struct task_struct *, task_work_func_t);
diff --git a/kernel/task_work.c b/kernel/task_work.c
index c59e1a49bc40..cce8a7ca228f 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -5,6 +5,24 @@
 
 static struct callback_head work_exited; /* all we need is ->next == NULL */
 
+void task_work_notify(struct task_struct *task,
+		     enum task_work_notify_mode notify)
+{
+	switch (notify) {
+	case TWA_NONE:
+		break;
+	case TWA_RESUME:
+		set_notify_resume(task);
+		break;
+	case TWA_SIGNAL:
+		set_notify_signal(task);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		break;
+	}
+}
+
 /**
  * task_work_add - ask the @task to execute @work->func()
  * @task: the task which should run the callback
@@ -44,20 +62,7 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 		work->next = head;
 	} while (cmpxchg(&task->task_works, head, work) != head);
 
-	switch (notify) {
-	case TWA_NONE:
-		break;
-	case TWA_RESUME:
-		set_notify_resume(task);
-		break;
-	case TWA_SIGNAL:
-		set_notify_signal(task);
-		break;
-	default:
-		WARN_ON_ONCE(1);
-		break;
-	}
-
+	task_work_notify(task, notify);
 	return 0;
 }
 
-- 
2.36.0



* [RFC 07/11] io_uring: run io_uring task_works on TIF_NOTIFY_SIGNAL
  2022-04-21 13:44 [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
                   ` (5 preceding siblings ...)
  2022-04-21 13:44 ` [RFC 06/11] task_work: add helper for signalling a task Pavel Begunkov
@ 2022-04-21 13:44 ` Pavel Begunkov
  2022-04-21 13:44 ` [RFC 08/11] io_uring: wire io_uring specific task work Pavel Begunkov
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-21 13:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Now TIF_NOTIFY_SIGNAL can mean not only normal task_work but also
io_uring specific work items. Make sure to run them when needed.

TODO: add a fast path for when there are no io_uring tw items

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io-wq.c                | 1 +
 fs/io_uring.c             | 1 +
 include/linux/task_work.h | 1 +
 kernel/entry/kvm.c        | 1 +
 kernel/signal.c           | 2 ++
 5 files changed, 6 insertions(+)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 32aeb2c581c5..35d8c2b46699 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -523,6 +523,7 @@ static bool io_flush_signals(void)
 	if (unlikely(test_thread_flag(TIF_NOTIFY_SIGNAL))) {
 		__set_current_state(TASK_RUNNING);
 		clear_notify_signal();
+		io_uring_task_work_run();
 		if (task_work_pending(current))
 			task_work_run();
 		return true;
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8d5aff1ecb4c..22dcd2fb9687 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2798,6 +2798,7 @@ static inline bool io_run_task_work(void)
 	if (test_thread_flag(TIF_NOTIFY_SIGNAL) || task_work_pending(current)) {
 		__set_current_state(TASK_RUNNING);
 		clear_notify_signal();
+		io_uring_task_work_run();
 		if (task_work_pending(current))
 			task_work_run();
 		return true;
diff --git a/include/linux/task_work.h b/include/linux/task_work.h
index 0c5fc557ecd9..66852f4a2ca0 100644
--- a/include/linux/task_work.h
+++ b/include/linux/task_work.h
@@ -4,6 +4,7 @@
 
 #include <linux/list.h>
 #include <linux/sched.h>
+#include <linux/io_uring.h>
 
 typedef void (*task_work_func_t)(struct callback_head *);
 
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 9d09f489b60e..46d7d23d3cc6 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -10,6 +10,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
 
 		if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) {
 			clear_notify_signal();
+			io_uring_task_work_run();
 			if (task_work_pending(current))
 				task_work_run();
 		}
diff --git a/kernel/signal.c b/kernel/signal.c
index 30cd1ca43bcd..8d46c4b63204 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2357,6 +2357,7 @@ int ptrace_notify(int exit_code, unsigned long message)
 	int signr;
 
 	BUG_ON((exit_code & (0x7f | ~0xffff)) != SIGTRAP);
+	io_uring_task_work_run();
 	if (unlikely(task_work_pending(current)))
 		task_work_run();
 
@@ -2637,6 +2638,7 @@ bool get_signal(struct ksignal *ksig)
 	int signr;
 
 	clear_notify_signal();
+	io_uring_task_work_run();
 	if (unlikely(task_work_pending(current)))
 		task_work_run();
 
-- 
2.36.0



* [RFC 08/11] io_uring: wire io_uring specific task work
  2022-04-21 13:44 [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
                   ` (6 preceding siblings ...)
  2022-04-21 13:44 ` [RFC 07/11] io_uring: run io_uring task_works on TIF_NOTIFY_SIGNAL Pavel Begunkov
@ 2022-04-21 13:44 ` Pavel Begunkov
  2022-04-21 13:44 ` [RFC 09/11] io_uring: refactor io_run_task_work() Pavel Begunkov
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-21 13:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Instead of relying on the generic task_work infrastructure, add an
io_uring specific path. This removes 2 cmpxchg and a
spin_[un]lock_irq pair per batch.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c            | 50 +++++++++++++++++++++-------------------
 include/linux/io_uring.h |  8 +++----
 2 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 22dcd2fb9687..ea7b0c71ca8b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -509,7 +509,6 @@ struct io_uring_task {
 	spinlock_t		task_lock;
 	struct io_wq_work_list	task_list;
 	struct io_wq_work_list	prior_task_list;
-	struct callback_head	task_work;
 	struct file		**registered_rings;
 	bool			task_running;
 };
@@ -2541,12 +2540,14 @@ static void handle_tw_list(struct io_wq_work_node *node,
 	} while (node);
 }
 
-static void tctx_task_work(struct callback_head *cb)
+void io_uring_task_work_run(void)
 {
-	bool uring_locked = false;
+	struct io_uring_task *tctx = current->io_uring;
 	struct io_ring_ctx *ctx = NULL;
-	struct io_uring_task *tctx = container_of(cb, struct io_uring_task,
-						  task_work);
+	bool uring_locked = false;
+
+	if (!tctx)
+		return;
 
 	while (1) {
 		struct io_wq_work_node *node1, *node2;
@@ -2581,7 +2582,6 @@ static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 	struct io_wq_work_list *list;
 	struct task_struct *tsk = req->task;
 	struct io_uring_task *tctx = tsk->io_uring;
-	enum task_work_notify_mode notify;
 	struct io_wq_work_node *node;
 	unsigned long flags;
 	bool running;
@@ -2602,21 +2602,19 @@ static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 	spin_unlock_irqrestore(&tctx->task_lock, flags);
 
 	/* task_work already pending, we're done */
-	if (running)
-		return;
-
-	/*
-	 * SQPOLL kernel thread doesn't need notification, just a wakeup. For
-	 * all other cases, use TWA_SIGNAL unconditionally to ensure we're
-	 * processing task_work. There's no reliable way to tell if TWA_RESUME
-	 * will do the job.
-	 */
-	notify = (req->ctx->flags & IORING_SETUP_SQPOLL) ? TWA_NONE : TWA_SIGNAL;
-	if (likely(!task_work_add(tsk, &tctx->task_work, notify))) {
-		if (notify == TWA_NONE)
+	if (!running) {
+		/*
+		 * SQPOLL kernel thread doesn't need notification, just a wakeup. For
+		 * all other cases, use TWA_SIGNAL unconditionally to ensure we're
+		 * processing task_work. There's no reliable way to tell if TWA_RESUME
+		 * will do the job.
+		 */
+		if (req->ctx->flags & IORING_SETUP_SQPOLL)
 			wake_up_process(tsk);
-		return;
+		else
+			task_work_notify(tsk, TWA_SIGNAL);
 	}
+	return;
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
 cancel_locked:
@@ -2795,7 +2793,9 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx)
 
 static inline bool io_run_task_work(void)
 {
-	if (test_thread_flag(TIF_NOTIFY_SIGNAL) || task_work_pending(current)) {
+	struct io_uring_task *tctx = current->io_uring;
+
+	if (test_thread_flag(TIF_NOTIFY_SIGNAL) || (tctx && tctx->task_running)) {
 		__set_current_state(TASK_RUNNING);
 		clear_notify_signal();
 		io_uring_task_work_run();
@@ -7988,6 +7988,7 @@ static bool io_sqd_handle_event(struct io_sq_data *sqd)
 
 static int io_sq_thread(void *data)
 {
+	struct io_uring_task *tctx = current->io_uring;
 	struct io_sq_data *sqd = data;
 	struct io_ring_ctx *ctx;
 	unsigned long timeout = 0;
@@ -8022,8 +8023,10 @@ static int io_sq_thread(void *data)
 			if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
 				sqt_spin = true;
 		}
-		if (io_run_task_work())
+		if (tctx->task_running) {
+			io_uring_task_work_run();
 			sqt_spin = true;
+		}
 
 		if (sqt_spin || !time_after(jiffies, timeout)) {
 			cond_resched();
@@ -8033,7 +8036,7 @@ static int io_sq_thread(void *data)
 		}
 
 		prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE);
-		if (!io_sqd_events_pending(sqd) && !task_work_pending(current)) {
+		if (!io_sqd_events_pending(sqd) && !tctx->task_running) {
 			bool needs_sched = true;
 
 			list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
@@ -8074,7 +8077,6 @@ static int io_sq_thread(void *data)
 	sqd->thread = NULL;
 	list_for_each_entry(ctx, &sqd->ctx_list, sqd_list)
 		io_ring_set_wakeup_flag(ctx);
-	io_run_task_work();
 	mutex_unlock(&sqd->lock);
 
 	audit_free(current);
@@ -9135,7 +9137,6 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
 	spin_lock_init(&tctx->task_lock);
 	INIT_WQ_LIST(&tctx->task_list);
 	INIT_WQ_LIST(&tctx->prior_task_list);
-	init_task_work(&tctx->task_work, tctx_task_work);
 	return 0;
 }
 
@@ -10356,6 +10357,7 @@ static __cold void io_uring_cancel_generic(bool cancel_all,
 
 	do {
 		io_uring_drop_tctx_refs(current);
+		io_run_task_work();
 		/* read completions before cancelations */
 		inflight = tctx_inflight(tctx, !cancel_all);
 		if (!inflight)
diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index e87ed946214f..331bea825ee6 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -5,15 +5,12 @@
 #include <linux/sched.h>
 #include <linux/xarray.h>
 
-static inline void io_uring_task_work_run(void)
-{
-}
-
 #if defined(CONFIG_IO_URING)
 struct sock *io_uring_get_socket(struct file *file);
 void __io_uring_cancel(bool cancel_all);
 void __io_uring_free(struct task_struct *tsk);
 void io_uring_unreg_ringfd(void);
+void io_uring_task_work_run(void);
 
 static inline void io_uring_files_cancel(void)
 {
@@ -46,6 +43,9 @@ static inline void io_uring_files_cancel(void)
 static inline void io_uring_free(struct task_struct *tsk)
 {
 }
+static inline void io_uring_task_work_run(void)
+{
+}
 #endif
 
 #endif
-- 
2.36.0



* [RFC 09/11] io_uring: refactor io_run_task_work()
  2022-04-21 13:44 [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
                   ` (7 preceding siblings ...)
  2022-04-21 13:44 ` [RFC 08/11] io_uring: wire io_uring specific task work Pavel Begunkov
@ 2022-04-21 13:44 ` Pavel Begunkov
  2022-04-21 13:44 ` [RFC 10/11] io_uring: remove priority tw list Pavel Begunkov
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-21 13:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

The second check in io_run_task_work() is there only for SQPOLL, for
which we use TWA_NONE. Remove the check and hand-code SQPOLL specific
task_work running where it makes sense.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index ea7b0c71ca8b..6397348748ad 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2793,9 +2793,7 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx)
 
 static inline bool io_run_task_work(void)
 {
-	struct io_uring_task *tctx = current->io_uring;
-
-	if (test_thread_flag(TIF_NOTIFY_SIGNAL) || (tctx && tctx->task_running)) {
+	if (test_thread_flag(TIF_NOTIFY_SIGNAL)) {
 		__set_current_state(TASK_RUNNING);
 		clear_notify_signal();
 		io_uring_task_work_run();
@@ -10382,6 +10380,10 @@ static __cold void io_uring_cancel_generic(bool cancel_all,
 
 		prepare_to_wait(&tctx->wait, &wait, TASK_INTERRUPTIBLE);
 		io_run_task_work();
+		if (tctx->task_running) {
+			__set_current_state(TASK_RUNNING);
+			io_uring_task_work_run();
+		}
 		io_uring_drop_tctx_refs(current);
 
 		/*
-- 
2.36.0



* [RFC 10/11] io_uring: remove priority tw list
  2022-04-21 13:44 [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
                   ` (8 preceding siblings ...)
  2022-04-21 13:44 ` [RFC 09/11] io_uring: refactor io_run_task_work() Pavel Begunkov
@ 2022-04-21 13:44 ` Pavel Begunkov
  2022-04-21 13:44 ` [RFC 11/11] io_uring: lock-free task_work stack Pavel Begunkov
  2022-04-21 14:50 ` [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
  11 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-21 13:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Not for upstreaming. Remove it for experimentation.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 61 +++++++--------------------------------------------
 1 file changed, 8 insertions(+), 53 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6397348748ad..51b6ee2b70f2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -508,7 +508,6 @@ struct io_uring_task {
 
 	spinlock_t		task_lock;
 	struct io_wq_work_list	task_list;
-	struct io_wq_work_list	prior_task_list;
 	struct file		**registered_rings;
 	bool			task_running;
 };
@@ -2483,42 +2482,6 @@ static inline void ctx_commit_and_unlock(struct io_ring_ctx *ctx)
 	io_cqring_ev_posted(ctx);
 }
 
-static void handle_prev_tw_list(struct io_wq_work_node *node,
-				struct io_ring_ctx **ctx, bool *uring_locked)
-{
-	if (*ctx && !*uring_locked)
-		spin_lock(&(*ctx)->completion_lock);
-
-	do {
-		struct io_wq_work_node *next = node->next;
-		struct io_kiocb *req = container_of(node, struct io_kiocb,
-						    io_task_work.node);
-
-		prefetch(container_of(next, struct io_kiocb, io_task_work.node));
-
-		if (req->ctx != *ctx) {
-			if (unlikely(!*uring_locked && *ctx))
-				ctx_commit_and_unlock(*ctx);
-
-			ctx_flush_and_put(*ctx, uring_locked);
-			*ctx = req->ctx;
-			/* if not contended, grab and improve batching */
-			*uring_locked = mutex_trylock(&(*ctx)->uring_lock);
-			if (unlikely(!*uring_locked))
-				spin_lock(&(*ctx)->completion_lock);
-		}
-		if (likely(*uring_locked))
-			req->io_task_work.func(req, uring_locked);
-		else
-			__io_req_complete_post(req, req->cqe.res,
-						io_put_kbuf_comp(req));
-		node = next;
-	} while (node);
-
-	if (unlikely(!*uring_locked))
-		ctx_commit_and_unlock(*ctx);
-}
-
 static void handle_tw_list(struct io_wq_work_node *node,
 			   struct io_ring_ctx **ctx, bool *locked)
 {
@@ -2550,27 +2513,21 @@ void io_uring_task_work_run(void)
 		return;
 
 	while (1) {
-		struct io_wq_work_node *node1, *node2;
+		struct io_wq_work_node *node2;
 
 		spin_lock_irq(&tctx->task_lock);
-		node1 = tctx->prior_task_list.first;
 		node2 = tctx->task_list.first;
 		INIT_WQ_LIST(&tctx->task_list);
-		INIT_WQ_LIST(&tctx->prior_task_list);
-		if (!node2 && !node1)
+		if (!node2)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
-		if (!node2 && !node1)
+		if (!node2)
 			break;
 
-		if (node1)
-			handle_prev_tw_list(node1, &ctx, &uring_locked);
-		if (node2)
-			handle_tw_list(node2, &ctx, &uring_locked);
+		handle_tw_list(node2, &ctx, &uring_locked);
 		cond_resched();
 
-		if (data_race(!tctx->task_list.first) &&
-		    data_race(!tctx->prior_task_list.first) && uring_locked)
+		if (data_race(!tctx->task_list.first) && uring_locked)
 			io_submit_flush_completions(ctx);
 	}
 
@@ -2579,7 +2536,6 @@ void io_uring_task_work_run(void)
 
 static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 {
-	struct io_wq_work_list *list;
 	struct task_struct *tsk = req->task;
 	struct io_uring_task *tctx = tsk->io_uring;
 	struct io_wq_work_node *node;
@@ -2591,8 +2547,7 @@ static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 	io_drop_inflight_file(req);
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
-	list = priority ? &tctx->prior_task_list : &tctx->task_list;
-	wq_list_add_tail(&req->io_task_work.node, list);
+	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
 	if (unlikely(atomic_read(&tctx->in_idle)))
 		goto cancel_locked;
 
@@ -2618,7 +2573,8 @@ static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
 cancel_locked:
-	node = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
+	node = tctx->task_list.first;
+	INIT_WQ_LIST(&tctx->task_list);
 	spin_unlock_irqrestore(&tctx->task_lock, flags);
 
 	while (node) {
@@ -9134,7 +9090,6 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
 	task->io_uring = tctx;
 	spin_lock_init(&tctx->task_lock);
 	INIT_WQ_LIST(&tctx->task_list);
-	INIT_WQ_LIST(&tctx->prior_task_list);
 	return 0;
 }
 
-- 
2.36.0



* [RFC 11/11] io_uring: lock-free task_work stack
  2022-04-21 13:44 [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
                   ` (9 preceding siblings ...)
  2022-04-21 13:44 ` [RFC 10/11] io_uring: remove priority tw list Pavel Begunkov
@ 2022-04-21 13:44 ` Pavel Begunkov
  2022-04-21 14:50 ` [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
  11 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-21 13:44 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Instead of keeping task_work items in a list, keep them in a lock-free
stack. However, we still would like to keep the ordering guarantees, so
reverse the list upon execution in io_uring_task_work_run().

First, for each tw add this replaces a spin_lock/unlock_irq() pair with a
single cmpxchg(). The same happens on the execution side, but per batch.
It also kills the final lock/unlock at the end of
io_uring_task_work_run().

The main downside is that we need to reverse the tw list on execution,
which is unfriendly to caches.
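
For reference, a minimal userspace sketch of the scheme using C11 atomics
(made-up names, not the kernel code): producers push with a single
compare-and-swap per add, the consumer grabs the whole stack with one
exchange per batch and reverses it to restore queueing order.

#include <stdatomic.h>
#include <stdio.h>

struct tw_item {
	struct tw_item *next;
	int seq;
};

static _Atomic(struct tw_item *) stack;

static void tw_push(struct tw_item *item)
{
	struct tw_item *head = atomic_load(&stack);

	do {
		item->next = head;	/* one cmpxchg per add, no locking */
	} while (!atomic_compare_exchange_weak(&stack, &head, item));
}

static struct tw_item *tw_pop_all(void)
{
	/* one atomic op per batch... */
	struct tw_item *node = atomic_exchange(&stack, (struct tw_item *)NULL);
	struct tw_item *prev = NULL;

	/* ...plus a reversal to keep FIFO ordering (the cache downside) */
	while (node) {
		struct tw_item *next = node->next;

		node->next = prev;
		prev = node;
		node = next;
	}
	return prev;
}

int main(void)
{
	struct tw_item items[3] = { {.seq = 0}, {.seq = 1}, {.seq = 2} };

	for (int i = 0; i < 3; i++)
		tw_push(&items[i]);
	for (struct tw_item *it = tw_pop_all(); it; it = it->next)
		printf("%d\n", it->seq);	/* FIFO order restored: 0, 1, 2 */
	return 0;
}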

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 120 +++++++++++++++++++++++---------------------------
 1 file changed, 56 insertions(+), 64 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 51b6ee2b70f2..97b5559bb660 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -506,10 +506,8 @@ struct io_uring_task {
 	struct percpu_counter	inflight;
 	atomic_t		in_idle;
 
-	spinlock_t		task_lock;
-	struct io_wq_work_list	task_list;
+	struct io_task_work	*task_list;
 	struct file		**registered_rings;
-	bool			task_running;
 };
 
 /*
@@ -860,7 +858,7 @@ typedef void (*io_req_tw_func_t)(struct io_kiocb *req, bool *locked);
 
 struct io_task_work {
 	union {
-		struct io_wq_work_node	node;
+		struct io_task_work	*next;
 		struct llist_node	fallback_node;
 	};
 	io_req_tw_func_t		func;
@@ -2482,15 +2480,29 @@ static inline void ctx_commit_and_unlock(struct io_ring_ctx *ctx)
 	io_cqring_ev_posted(ctx);
 }
 
-static void handle_tw_list(struct io_wq_work_node *node,
+static struct io_task_work tw_work_exited; /* all we need is ->next == NULL */
+
+static void handle_tw_list(struct io_task_work *node,
 			   struct io_ring_ctx **ctx, bool *locked)
 {
+	struct io_task_work *next;
+	struct io_task_work *prev = NULL;
+
+	/* reverse the list */
+	while (node->next) {
+		next = node->next;
+		node->next = prev;
+		prev = node;
+		node = next;
+	}
+	node->next = prev;
+
 	do {
-		struct io_wq_work_node *next = node->next;
 		struct io_kiocb *req = container_of(node, struct io_kiocb,
-						    io_task_work.node);
+						    io_task_work);
 
-		prefetch(container_of(next, struct io_kiocb, io_task_work.node));
+		next = node->next;
+		prefetch(container_of(next, struct io_kiocb, io_task_work));
 
 		if (req->ctx != *ctx) {
 			ctx_flush_and_put(*ctx, locked);
@@ -2511,25 +2523,27 @@ void io_uring_task_work_run(void)
 
 	if (!tctx)
 		return;
+	/*
+	 * The poison is only assigned from the task context we're currently in.
+	 * Nobody can set it while io_uring_task_work_run() is running
+	 */
+	if (READ_ONCE(tctx->task_list) == &tw_work_exited)
+		return;
 
-	while (1) {
-		struct io_wq_work_node *node2;
-
-		spin_lock_irq(&tctx->task_lock);
-		node2 = tctx->task_list.first;
-		INIT_WQ_LIST(&tctx->task_list);
-		if (!node2)
-			tctx->task_running = false;
-		spin_unlock_irq(&tctx->task_lock);
-		if (!node2)
+	do {
+		struct io_task_work *head = xchg(&tctx->task_list, NULL);
+
+		if (unlikely(!head))
 			break;
+		handle_tw_list(head, &ctx, &uring_locked);
 
-		handle_tw_list(node2, &ctx, &uring_locked);
 		cond_resched();
-
-		if (data_race(!tctx->task_list.first) && uring_locked)
-			io_submit_flush_completions(ctx);
-	}
+		if (READ_ONCE(tctx->task_list))
+			continue;
+		if (!uring_locked)
+			break;
+		io_submit_flush_completions(ctx);
+	} while (READ_ONCE(tctx->task_list));
 
 	ctx_flush_and_put(ctx, &uring_locked);
 }
@@ -2538,26 +2552,26 @@ static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 {
 	struct task_struct *tsk = req->task;
 	struct io_uring_task *tctx = tsk->io_uring;
-	struct io_wq_work_node *node;
-	unsigned long flags;
-	bool running;
+	struct io_task_work *head;
 
 	WARN_ON_ONCE(!tctx);
 
 	io_drop_inflight_file(req);
 
-	spin_lock_irqsave(&tctx->task_lock, flags);
-	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
-	if (unlikely(atomic_read(&tctx->in_idle)))
-		goto cancel_locked;
+	do {
+		head = READ_ONCE(tctx->task_list);
+		if (unlikely(head == &tw_work_exited)) {
+			req_set_fail(req);
+			if (llist_add(&req->io_task_work.fallback_node,
+				      &req->ctx->fallback_llist))
+				schedule_delayed_work(&req->ctx->fallback_work, 1);
+			return;
+		}
 
-	running = tctx->task_running;
-	if (!running)
-		tctx->task_running = true;
-	spin_unlock_irqrestore(&tctx->task_lock, flags);
+		req->io_task_work.next = head;
+	} while (cmpxchg(&tctx->task_list, head, &req->io_task_work) != head);
 
-	/* task_work already pending, we're done */
-	if (!running) {
+	if (!head) {
 		/*
 		 * SQPOLL kernel thread doesn't need notification, just a wakeup. For
 		 * all other cases, use TWA_SIGNAL unconditionally to ensure we're
@@ -2569,22 +2583,6 @@ static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 		else
 			task_work_notify(tsk, TWA_SIGNAL);
 	}
-	return;
-
-	spin_lock_irqsave(&tctx->task_lock, flags);
-cancel_locked:
-	node = tctx->task_list.first;
-	INIT_WQ_LIST(&tctx->task_list);
-	spin_unlock_irqrestore(&tctx->task_lock, flags);
-
-	while (node) {
-		req = container_of(node, struct io_kiocb, io_task_work.node);
-		req_set_fail(req);
-		node = node->next;
-		if (llist_add(&req->io_task_work.fallback_node,
-			      &req->ctx->fallback_llist))
-			schedule_delayed_work(&req->ctx->fallback_work, 1);
-	}
 }
 
 static void io_req_task_cancel(struct io_kiocb *req, bool *locked)
@@ -7977,7 +7975,7 @@ static int io_sq_thread(void *data)
 			if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
 				sqt_spin = true;
 		}
-		if (tctx->task_running) {
+		if (READ_ONCE(tctx->task_list)) {
 			io_uring_task_work_run();
 			sqt_spin = true;
 		}
@@ -7990,7 +7988,7 @@ static int io_sq_thread(void *data)
 		}
 
 		prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE);
-		if (!io_sqd_events_pending(sqd) && !tctx->task_running) {
+		if (!io_sqd_events_pending(sqd) && !READ_ONCE(tctx->task_list)) {
 			bool needs_sched = true;
 
 			list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
@@ -9088,8 +9086,6 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
 	init_waitqueue_head(&tctx->wait);
 	atomic_set(&tctx->in_idle, 0);
 	task->io_uring = tctx;
-	spin_lock_init(&tctx->task_lock);
-	INIT_WQ_LIST(&tctx->task_list);
 	return 0;
 }
 
@@ -10301,16 +10297,16 @@ static __cold void io_uring_cancel_generic(bool cancel_all,
 
 	if (!current->io_uring)
 		return;
+	if (WARN_ON_ONCE(READ_ONCE(tctx->task_list) == &tw_work_exited))
+		return;
 	if (tctx->io_wq)
 		io_wq_exit_start(tctx->io_wq);
+	while (cmpxchg(&tctx->task_list, NULL, &tw_work_exited) != NULL)
+		io_uring_task_work_run();
 
-	spin_lock_irq(&tctx->task_lock);
 	atomic_inc(&tctx->in_idle);
-	spin_unlock_irq(&tctx->task_lock);
-
 	do {
 		io_uring_drop_tctx_refs(current);
-		io_run_task_work();
 		/* read completions before cancelations */
 		inflight = tctx_inflight(tctx, !cancel_all);
 		if (!inflight)
@@ -10335,10 +10331,6 @@ static __cold void io_uring_cancel_generic(bool cancel_all,
 
 		prepare_to_wait(&tctx->wait, &wait, TASK_INTERRUPTIBLE);
 		io_run_task_work();
-		if (tctx->task_running) {
-			__set_current_state(TASK_RUNNING);
-			io_uring_task_work_run();
-		}
 		io_uring_drop_tctx_refs(current);
 
 		/*
-- 
2.36.0



* Re: [RFC 00/11] io_uring specific task_work infra
  2022-04-21 13:44 [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
                   ` (10 preceding siblings ...)
  2022-04-21 13:44 ` [RFC 11/11] io_uring: lock-free task_work stack Pavel Begunkov
@ 2022-04-21 14:50 ` Pavel Begunkov
  2022-04-22  8:45   ` Hao Xu
  11 siblings, 1 reply; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-21 14:50 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

On 4/21/22 14:44, Pavel Begunkov wrote:
> For experiments only. If proves to be useful would need to make it
> nicer on the non-io_uring side.
> 
> 0-10 save 1 spinlock/unlock_irq pair and 2 cmpxchg per batch. 11/11 in
> general trades 1 per tw add spin_lock/unlock_irq and 2 per batch spinlocking
> with 2 cmpxchg to 1 per tw add cmpxchg and 1 per batch cmpxchg.

null_blk irqmode=1 completion_nsec=0 submit_queues=32 poll_queues=32
echo -n 0 > /sys/block/nullb0/queue/iostats
echo -n 2 > /sys/block/nullb0/queue/nomerges
io_uring -d<QD> -s<QD> -c<QD> -p0 -B1 -F1 -b512 /dev/nullb0


      | base | 1-10         | 1-11
___________________________________________
QD1  | 1.88 | 2.15 (+14%)  | 2.19 (+16.4%)
QD4  | 2.8  | 3.06 (+9.2%) | 3.11 (+11%)
QD32 | 3.61 | 3.81 (+5.5%) | 3.96 (+9.6%)

The numbers are in MIOPS, (%) is relative diff with the baseline.
It gives more than I expected, but the testing is not super
consistent, so a part of it might be due to variance.


> Pavel Begunkov (11):
>    io_uring: optimise io_req_task_work_add
>    io_uring: add io_should_fail_tw() helper
>    io_uring: ban tw queue for exiting processes
>    io_uring: don't take ctx refs in tctx_task_work()
>    io_uring: add dummy io_uring_task_work_run()
>    task_work: add helper for signalling a task
>    io_uring: run io_uring task_works on TIF_NOTIFY_SIGNAL
>    io_uring: wire io_uring specific task work
>    io_uring: refactor io_run_task_work()
>    io_uring: remove priority tw list
>    io_uring: lock-free task_work stack
> 
>   fs/io-wq.c                |   1 +
>   fs/io_uring.c             | 213 +++++++++++++++-----------------------
>   include/linux/io_uring.h  |   4 +
>   include/linux/task_work.h |   4 +
>   kernel/entry/kvm.c        |   1 +
>   kernel/signal.c           |   2 +
>   kernel/task_work.c        |  33 +++---
>   7 files changed, 115 insertions(+), 143 deletions(-)
> 

-- 
Pavel Begunkov


* Re: [RFC 00/11] io_uring specific task_work infra
  2022-04-21 14:50 ` [RFC 00/11] io_uring specific task_work infra Pavel Begunkov
@ 2022-04-22  8:45   ` Hao Xu
  2022-04-22 11:54     ` Pavel Begunkov
  0 siblings, 1 reply; 15+ messages in thread
From: Hao Xu @ 2022-04-22  8:45 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Jens Axboe

Hi,
On 2022/4/21 10:50 PM, Pavel Begunkov wrote:
> On 4/21/22 14:44, Pavel Begunkov wrote:
>> For experiments only. If proves to be useful would need to make it
>> nicer on the non-io_uring side.
>>
>> 0-10 save 1 spinlock/unlock_irq pair and 2 cmpxchg per batch. 11/11 in
>> general trades 1 per tw add spin_lock/unlock_irq and 2 per batch 
>> spinlocking
>> with 2 cmpxchg to 1 per tw add cmpxchg and 1 per batch cmpxchg.
> 
> null_blk irqmode=1 completion_nsec=0 submit_queues=32 poll_queues=32
> echo -n 0 > /sys/block/nullb0/queue/iostats
> echo -n 2 > /sys/block/nullb0/queue/nomerges
> io_uring -d<QD> -s<QD> -c<QD> -p0 -B1 -F1 -b512 /dev/nullb0
This series looks good to me. By the way, what do -s and -c mean, and
what is the tested workload?

Regards,
Hao
> 
> 
>       | base | 1-10         | 1-11
> ___________________________________________
> QD1  | 1.88 | 2.15 (+14%)  | 2.19 (+16.4%)
> QD4  | 2.8  | 3.06 (+9.2%) | 3.11 (+11%)
> QD32 | 3.61 | 3.81 (+5.5%) | 3.96 (+9.6%)
> 
> The numbers are in MIOPS, (%) is relative diff with the baseline.
> It gives more than I expected, but the testing is not super
> consistent, so a part of it might be due to variance.
> 
> 
>> Pavel Begunkov (11):
>>    io_uring: optimise io_req_task_work_add
>>    io_uring: add io_should_fail_tw() helper
>>    io_uring: ban tw queue for exiting processes
>>    io_uring: don't take ctx refs in tctx_task_work()
>>    io_uring: add dummy io_uring_task_work_run()
>>    task_work: add helper for signalling a task
>>    io_uring: run io_uring task_works on TIF_NOTIFY_SIGNAL
>>    io_uring: wire io_uring specific task work
>>    io_uring: refactor io_run_task_work()
>>    io_uring: remove priority tw list
>>    io_uring: lock-free task_work stack
>>
>>   fs/io-wq.c                |   1 +
>>   fs/io_uring.c             | 213 +++++++++++++++-----------------------
>>   include/linux/io_uring.h  |   4 +
>>   include/linux/task_work.h |   4 +
>>   kernel/entry/kvm.c        |   1 +
>>   kernel/signal.c           |   2 +
>>   kernel/task_work.c        |  33 +++---
>>   7 files changed, 115 insertions(+), 143 deletions(-)
>>
> 



* Re: [RFC 00/11] io_uring specific task_work infra
  2022-04-22  8:45   ` Hao Xu
@ 2022-04-22 11:54     ` Pavel Begunkov
  0 siblings, 0 replies; 15+ messages in thread
From: Pavel Begunkov @ 2022-04-22 11:54 UTC (permalink / raw)
  To: Hao Xu, io-uring; +Cc: Jens Axboe

On 4/22/22 09:45, Hao Xu wrote:
> Hi,
> On 2022/4/21 10:50 PM, Pavel Begunkov wrote:
>> On 4/21/22 14:44, Pavel Begunkov wrote:
>>> For experiments only. If proves to be useful would need to make it
>>> nicer on the non-io_uring side.
>>>
>>> 0-10 save 1 spinlock/unlock_irq pair and 2 cmpxchg per batch. 11/11 in
>>> general trades 1 per tw add spin_lock/unlock_irq and 2 per batch spinlocking
>>> with 2 cmpxchg to 1 per tw add cmpxchg and 1 per batch cmpxchg.
>>
>> null_blk irqmode=1 completion_nsec=0 submit_queues=32 poll_queues=32
>> echo -n 0 > /sys/block/nullb0/queue/iostats
>> echo -n 2 > /sys/block/nullb0/queue/nomerges
>> io_uring -d<QD> -s<QD> -c<QD> -p0 -B1 -F1 -b512 /dev/nullb0
> This series looks good to me. By the way, what do -s and -c mean, and
> what is the tested workload?

It was a standard testing tool from fio/t/. It just does random reads
while keeping the QD. The -s and -c flags are submission and completion
batching respectively, i.e. how many requests it submits and waits for
per syscall.



> 
> Regards,
> Hao
>>
>>
>>       | base | 1-10         | 1-11
>> ___________________________________________
>> QD1  | 1.88 | 2.15 (+14%)  | 2.19 (+16.4%)
>> QD4  | 2.8  | 3.06 (+9.2%) | 3.11 (+11%)
>> QD32 | 3.61 | 3.81 (+5.5%) | 3.96 (+9.6%)
>>
>> The numbers are in MIOPS, (%) is relative diff with the baseline.
>> It gives more than I expected, but the testing is not super
>> consistent, so a part of it might be due to variance.
>>
>>
>>> Pavel Begunkov (11):
>>>    io_uring: optimise io_req_task_work_add
>>>    io_uring: add io_should_fail_tw() helper
>>>    io_uring: ban tw queue for exiting processes
>>>    io_uring: don't take ctx refs in tctx_task_work()
>>>    io_uring: add dummy io_uring_task_work_run()
>>>    task_work: add helper for signalling a task
>>>    io_uring: run io_uring task_works on TIF_NOTIFY_SIGNAL
>>>    io_uring: wire io_uring specific task work
>>>    io_uring: refactor io_run_task_work()
>>>    io_uring: remove priority tw list
>>>    io_uring: lock-free task_work stack
>>>
>>>   fs/io-wq.c                |   1 +
>>>   fs/io_uring.c             | 213 +++++++++++++++-----------------------
>>>   include/linux/io_uring.h  |   4 +
>>>   include/linux/task_work.h |   4 +
>>>   kernel/entry/kvm.c        |   1 +
>>>   kernel/signal.c           |   2 +
>>>   kernel/task_work.c        |  33 +++---
>>>   7 files changed, 115 insertions(+), 143 deletions(-)
>>>
>>
> 

-- 
Pavel Begunkov

