public inbox for [email protected]
* [PATCH v3 00/11] CQ waiting / task_work optimisations
@ 2023-01-09 14:46 Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 01/11] io_uring: move submitter_task out of cold cacheline Pavel Begunkov
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: Pavel Begunkov @ 2023-01-09 14:46 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

For DEFER_TASKRUN rings, replace CQ waitqueues with a custom implementation
based on the fact that only one task may be waiting for completions. Also,
improve deferred task_work running by removing one atomic in patch 11.

Benchmarking QD1 with simulated tw arrival right after we start waiting:
7.5 MIOPS -> 9.3 (+23%), where half of the CPU cycles go to syscall overhead.
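
For reference, the ring flavour these numbers apply to can be set up from
userspace with liburing roughly as below (an illustrative QD1 sketch, not
the actual benchmark):

	#include <liburing.h>

	int main(void)
	{
		struct io_uring_params p = { };
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;

		/* DEFER_TASKRUN requires a single submitter/waiter task */
		p.flags = IORING_SETUP_SINGLE_ISSUER |
			  IORING_SETUP_DEFER_TASKRUN;
		if (io_uring_queue_init_params(8, &ring, &p))
			return 1;

		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_nop(sqe);
		io_uring_submit(&ring);

		/* waiting for the CQE exercises the optimised wait path */
		if (!io_uring_wait_cqe(&ring, &cqe))
			io_uring_cqe_seen(&ring, cqe);
		io_uring_queue_exit(&ring);
		return 0;
	}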

v3: remove merged cleanups and add new ones
    add 11/11 removing one extra atomic
    a small sync adjustment in 10/10
    add extra comments

Pavel Begunkov (11):
  io_uring: move submitter_task out of cold cacheline
  io_uring: refactor io_wake_function
  io_uring: don't set TASK_RUNNING in local tw runner
  io_uring: mark io_run_local_work static
  io_uring: move io_run_local_work_locked
  io_uring: separate wq for ring polling
  io_uring: add lazy poll_wq activation
  io_uring: wake up optimisations
  io_uring: waitqueue-less cq waiting
  io_uring: add io_req_local_work_add wake fast path
  io_uring: optimise deferred tw execution

 include/linux/io_uring_types.h |  15 +--
 io_uring/io_uring.c            | 161 ++++++++++++++++++++++++++-------
 io_uring/io_uring.h            |  28 ++----
 3 files changed, 144 insertions(+), 60 deletions(-)

-- 
2.38.1



* [PATCH v3 01/11] io_uring: move submitter_task out of cold cacheline
  2023-01-09 14:46 [PATCH v3 00/11] CQ waiting / task_work optimisations Pavel Begunkov
@ 2023-01-09 14:46 ` Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 02/11] io_uring: refactor io_wake_function Pavel Begunkov
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2023-01-09 14:46 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

->submitter_task is used somewhat more frequently now than before, e.g.
for local tw enqueue and run, so move it from the end of ctx, which is
full of cold data, to the first cacheline, which holds mostly constants.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 128a67a40065..8dfb6c4a35d9 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -195,11 +195,7 @@ struct io_alloc_cache {
 struct io_ring_ctx {
 	/* const or read-mostly hot data */
 	struct {
-		struct percpu_ref	refs;
-
-		struct io_rings		*rings;
 		unsigned int		flags;
-		enum task_work_notify_mode	notify_method;
 		unsigned int		compat: 1;
 		unsigned int		drain_next: 1;
 		unsigned int		restricted: 1;
@@ -210,6 +206,11 @@ struct io_ring_ctx {
 		unsigned int		syscall_iopoll: 1;
 		/* all CQEs should be posted only by the submitter task */
 		unsigned int		task_complete: 1;
+
+		enum task_work_notify_mode	notify_method;
+		struct io_rings			*rings;
+		struct task_struct		*submitter_task;
+		struct percpu_ref		refs;
 	} ____cacheline_aligned_in_smp;
 
 	/* submission data */
@@ -320,7 +321,6 @@ struct io_ring_ctx {
 	/* Keep this last, we don't need it for the fast path */
 
 	struct io_restriction		restrictions;
-	struct task_struct		*submitter_task;
 
 	/* slow path rsrc auxiliary data, used by update/register */
 	struct io_rsrc_node		*rsrc_backup_node;
-- 
2.38.1



* [PATCH v3 02/11] io_uring: refactor io_wake_function
  2023-01-09 14:46 [PATCH v3 00/11] CQ waiting / task_work optimisations Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 01/11] io_uring: move submitter_task out of cold cacheline Pavel Begunkov
@ 2023-01-09 14:46 ` Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 03/11] io_uring: don't set TASK_RUNNING in local tw runner Pavel Begunkov
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2023-01-09 14:46 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Remove the local variable ctx from io_wake_function(); we don't need it
when io_should_wake() already triggers a wake up, and the remaining
io_has_work() check can use iowq->ctx directly.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 89dd2827a977..157e6ef6da7c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2441,15 +2441,13 @@ static inline bool io_should_wake(struct io_wait_queue *iowq)
 static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 			    int wake_flags, void *key)
 {
-	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
-							wq);
-	struct io_ring_ctx *ctx = iowq->ctx;
+	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue, wq);
 
 	/*
 	 * Cannot safely flush overflowed CQEs from here, ensure we wake up
 	 * the task, and the next invocation will do it.
 	 */
-	if (io_should_wake(iowq) || io_has_work(ctx))
+	if (io_should_wake(iowq) || io_has_work(iowq->ctx))
 		return autoremove_wake_function(curr, mode, wake_flags, key);
 	return -1;
 }
-- 
2.38.1



* [PATCH v3 03/11] io_uring: don't set TASK_RUNNING in local tw runner
  2023-01-09 14:46 [PATCH v3 00/11] CQ waiting / task_work optimisations Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 01/11] io_uring: move submitter_task out of cold cacheline Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 02/11] io_uring: refactor io_wake_function Pavel Begunkov
@ 2023-01-09 14:46 ` Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 04/11] io_uring: mark io_run_local_work static Pavel Begunkov
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2023-01-09 14:46 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

The CQ waiting loop sets TASK_RUNNING before trying to execute
task_work, so there is no need to repeat it in io_run_local_work().
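
As a simplified sketch of the resulting split (not the exact kernel
code), the waiter now does:

	set_current_state(TASK_INTERRUPTIBLE);
	if (!io_should_wake(iowq))
		schedule();
	__set_current_state(TASK_RUNNING);	/* done by the wait loop */
	if (!llist_empty(&ctx->work_llist))
		io_run_local_work(ctx);	/* no longer touches the task state */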

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 157e6ef6da7c..961ebc031992 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1339,11 +1339,9 @@ int __io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
 
 int io_run_local_work(struct io_ring_ctx *ctx)
 {
-	bool locked;
+	bool locked = mutex_trylock(&ctx->uring_lock);
 	int ret;
 
-	__set_current_state(TASK_RUNNING);
-	locked = mutex_trylock(&ctx->uring_lock);
 	ret = __io_run_local_work(ctx, &locked);
 	if (locked)
 		mutex_unlock(&ctx->uring_lock);
@@ -2455,6 +2453,7 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 int io_run_task_work_sig(struct io_ring_ctx *ctx)
 {
 	if (!llist_empty(&ctx->work_llist)) {
+		__set_current_state(TASK_RUNNING);
 		if (io_run_local_work(ctx) > 0)
 			return 1;
 	}
-- 
2.38.1



* [PATCH v3 04/11] io_uring: mark io_run_local_work static
  2023-01-09 14:46 [PATCH v3 00/11] CQ waiting / task_work optimisations Pavel Begunkov
                   ` (2 preceding siblings ...)
  2023-01-09 14:46 ` [PATCH v3 03/11] io_uring: don't set TASK_RUNNING in local tw runner Pavel Begunkov
@ 2023-01-09 14:46 ` Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 05/11] io_uring: move io_run_local_work_locked Pavel Begunkov
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2023-01-09 14:46 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

io_run_local_work() is only used inside io_uring.c, so there is no need
to export it.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 2 +-
 io_uring/io_uring.h | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 961ebc031992..5b62386413d9 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1337,7 +1337,7 @@ int __io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
 
 }
 
-int io_run_local_work(struct io_ring_ctx *ctx)
+static int io_run_local_work(struct io_ring_ctx *ctx)
 {
 	bool locked = mutex_trylock(&ctx->uring_lock);
 	int ret;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 011932935c36..78896d1f7916 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -29,7 +29,6 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow);
 bool io_req_cqe_overflow(struct io_kiocb *req);
 int io_run_task_work_sig(struct io_ring_ctx *ctx);
 int __io_run_local_work(struct io_ring_ctx *ctx, bool *locked);
-int io_run_local_work(struct io_ring_ctx *ctx);
 void io_req_defer_failed(struct io_kiocb *req, s32 res);
 void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags);
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
-- 
2.38.1



* [PATCH v3 05/11] io_uring: move io_run_local_work_locked
  2023-01-09 14:46 [PATCH v3 00/11] CQ waiting / task_work optimisations Pavel Begunkov
                   ` (3 preceding siblings ...)
  2023-01-09 14:46 ` [PATCH v3 04/11] io_uring: mark io_run_local_work static Pavel Begunkov
@ 2023-01-09 14:46 ` Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 06/11] io_uring: separate wq for ring polling Pavel Begunkov
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2023-01-09 14:46 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

io_run_local_work_locked() is only used in io_uring.c, so move it there.
With that we can also make __io_run_local_work() static.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 18 +++++++++++++++++-
 io_uring/io_uring.h | 17 -----------------
 2 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 5b62386413d9..c251acf0964e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1296,7 +1296,7 @@ static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 	}
 }
 
-int __io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
+static int __io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
 {
 	struct llist_node *node;
 	struct llist_node fake;
@@ -1337,6 +1337,22 @@ int __io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
 
 }
 
+static inline int io_run_local_work_locked(struct io_ring_ctx *ctx)
+{
+	bool locked;
+	int ret;
+
+	if (llist_empty(&ctx->work_llist))
+		return 0;
+
+	locked = true;
+	ret = __io_run_local_work(ctx, &locked);
+	/* shouldn't happen! */
+	if (WARN_ON_ONCE(!locked))
+		mutex_lock(&ctx->uring_lock);
+	return ret;
+}
+
 static int io_run_local_work(struct io_ring_ctx *ctx)
 {
 	bool locked = mutex_trylock(&ctx->uring_lock);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 78896d1f7916..b5975e353aa1 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -28,7 +28,6 @@ enum {
 struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx, bool overflow);
 bool io_req_cqe_overflow(struct io_kiocb *req);
 int io_run_task_work_sig(struct io_ring_ctx *ctx);
-int __io_run_local_work(struct io_ring_ctx *ctx, bool *locked);
 void io_req_defer_failed(struct io_kiocb *req, s32 res);
 void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags);
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
@@ -283,22 +282,6 @@ static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
 	return task_work_pending(current) || !wq_list_empty(&ctx->work_llist);
 }
 
-static inline int io_run_local_work_locked(struct io_ring_ctx *ctx)
-{
-	bool locked;
-	int ret;
-
-	if (llist_empty(&ctx->work_llist))
-		return 0;
-
-	locked = true;
-	ret = __io_run_local_work(ctx, &locked);
-	/* shouldn't happen! */
-	if (WARN_ON_ONCE(!locked))
-		mutex_lock(&ctx->uring_lock);
-	return ret;
-}
-
 static inline void io_tw_lock(struct io_ring_ctx *ctx, bool *locked)
 {
 	if (!*locked) {
-- 
2.38.1



* [PATCH v3 06/11] io_uring: separate wq for ring polling
  2023-01-09 14:46 [PATCH v3 00/11] CQ waiting / task_work optimisations Pavel Begunkov
                   ` (4 preceding siblings ...)
  2023-01-09 14:46 ` [PATCH v3 05/11] io_uring: move io_run_local_work_locked Pavel Begunkov
@ 2023-01-09 14:46 ` Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 07/11] io_uring: add lazy poll_wq activation Pavel Begunkov
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2023-01-09 14:46 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Don't use ->cq_wait for ring polling; add a separate wait queue for it
instead. We'll need it for the following patches.
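
For context, this is the waitqueue serving poll(2)/epoll on the ring fd
itself, i.e. (illustrative userspace sketch, where ring_fd stands for
the fd of an io_uring instance):

	#include <poll.h>

	struct pollfd pfd = {
		.fd	= ring_fd,	/* io_uring instance fd */
		.events	= POLLIN,	/* CQEs ready to be reaped */
	};
	poll(&pfd, 1, -1);		/* now woken via ctx->poll_wq */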

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 2 +-
 io_uring/io_uring.c            | 3 ++-
 io_uring/io_uring.h            | 9 +++++++++
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 8dfb6c4a35d9..0d94ee191c15 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -319,7 +319,7 @@ struct io_ring_ctx {
 	} ____cacheline_aligned_in_smp;
 
 	/* Keep this last, we don't need it for the fast path */
-
+	struct wait_queue_head		poll_wq;
 	struct io_restriction		restrictions;
 
 	/* slow path rsrc auxilary data, used by update/register */
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index c251acf0964e..8d19b0812f30 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -316,6 +316,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	xa_init_flags(&ctx->personalities, XA_FLAGS_ALLOC1);
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->cq_wait);
+	init_waitqueue_head(&ctx->poll_wq);
 	spin_lock_init(&ctx->completion_lock);
 	spin_lock_init(&ctx->timeout_lock);
 	INIT_WQ_LIST(&ctx->iopoll_list);
@@ -2788,7 +2789,7 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
 	struct io_ring_ctx *ctx = file->private_data;
 	__poll_t mask = 0;
 
-	poll_wait(file, &ctx->cq_wait, wait);
+	poll_wait(file, &ctx->poll_wq, wait);
 	/*
 	 * synchronizes with barrier from wq_has_sleeper call in
 	 * io_commit_cqring
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index b5975e353aa1..c75bbb94703c 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -220,9 +220,18 @@ static inline void io_commit_cqring(struct io_ring_ctx *ctx)
 	smp_store_release(&ctx->rings->cq.tail, ctx->cached_cq_tail);
 }
 
+static inline void io_poll_wq_wake(struct io_ring_ctx *ctx)
+{
+	if (waitqueue_active(&ctx->poll_wq))
+		__wake_up(&ctx->poll_wq, TASK_NORMAL, 0,
+				poll_to_key(EPOLL_URING_WAKE | EPOLLIN));
+}
+
 /* requires smp_mb() prior, see wq_has_sleeper() */
 static inline void __io_cqring_wake(struct io_ring_ctx *ctx)
 {
+	io_poll_wq_wake(ctx);
+
 	/*
 	 * Trigger waitqueue handler on all waiters on our waitqueue. This
 	 * won't necessarily wake up all the tasks, io_should_wake() will make
-- 
2.38.1



* [PATCH v3 07/11] io_uring: add lazy poll_wq activation
  2023-01-09 14:46 [PATCH v3 00/11] CQ waiting / task_work optimisations Pavel Begunkov
                   ` (5 preceding siblings ...)
  2023-01-09 14:46 ` [PATCH v3 06/11] io_uring: separate wq for ring polling Pavel Begunkov
@ 2023-01-09 14:46 ` Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 08/11] io_uring: wake up optimisations Pavel Begunkov
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2023-01-09 14:46 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Even though io_poll_wq_wake()'s waitqueue_active() check reuses a
barrier we do for another waitqueue, that's not going to be the case in
the future, so we want a fast path for it when the ring has never been
polled.

Move poll_wq wake ups into __io_commit_cqring_flush() using a new flag
called ->poll_activated. The idea behind the flag is to set it when the
ring is polled for the first time. This requires additional
synchronisation so as not to miss events; it's done here by injecting a
task_work for ->task_complete rings and by enabling the flag by default
for all other types of rings.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h |  2 ++
 io_uring/io_uring.c            | 60 +++++++++++++++++++++++++++++++++-
 io_uring/io_uring.h            |  7 ++--
 3 files changed, 64 insertions(+), 5 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 0d94ee191c15..7b5e90520278 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -206,6 +206,7 @@ struct io_ring_ctx {
 		unsigned int		syscall_iopoll: 1;
 		/* all CQEs should be posted only by the submitter task */
 		unsigned int		task_complete: 1;
+		unsigned int		poll_activated: 1;
 
 		enum task_work_notify_mode	notify_method;
 		struct io_rings			*rings;
@@ -357,6 +358,7 @@ struct io_ring_ctx {
 	u32				iowq_limits[2];
 	bool				iowq_limits_set;
 
+	struct callback_head		poll_wq_task_work;
 	struct list_head		defer_list;
 	unsigned			sq_thread_idle;
 	/* protected by ->completion_lock */
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8d19b0812f30..a6ed022c1356 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -573,6 +573,8 @@ static void io_eventfd_flush_signal(struct io_ring_ctx *ctx)
 
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx)
 {
+	if (ctx->poll_activated)
+		io_poll_wq_wake(ctx);
 	if (ctx->off_timeout_used)
 		io_flush_timeouts(ctx);
 	if (ctx->drain_active) {
@@ -2784,11 +2786,53 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	kfree(ctx);
 }
 
+static __cold void io_activate_pollwq_cb(struct callback_head *cb)
+{
+	struct io_ring_ctx *ctx = container_of(cb, struct io_ring_ctx,
+					       poll_wq_task_work);
+
+	mutex_lock(&ctx->uring_lock);
+	ctx->poll_activated = true;
+	mutex_unlock(&ctx->uring_lock);
+
+	/*
+	 * Wake ups for some events between start of polling and activation
+	 * might've been lost due to loose synchronisation.
+	 */
+	wake_up_all(&ctx->poll_wq);
+	percpu_ref_put(&ctx->refs);
+}
+
+static __cold void io_activate_pollwq(struct io_ring_ctx *ctx)
+{
+	spin_lock(&ctx->completion_lock);
+	/* already activated or in progress */
+	if (ctx->poll_activated || ctx->poll_wq_task_work.func)
+		goto out;
+	if (WARN_ON_ONCE(!ctx->task_complete))
+		goto out;
+	if (!ctx->submitter_task)
+		goto out;
+	/*
+	 * with ->submitter_task only the submitter task completes requests, we
+	 * only need to sync with it, which is done by injecting a tw
+	 */
+	init_task_work(&ctx->poll_wq_task_work, io_activate_pollwq_cb);
+	percpu_ref_get(&ctx->refs);
+	if (task_work_add(ctx->submitter_task, &ctx->poll_wq_task_work, TWA_SIGNAL))
+		percpu_ref_put(&ctx->refs);
+out:
+	spin_unlock(&ctx->completion_lock);
+}
+
 static __poll_t io_uring_poll(struct file *file, poll_table *wait)
 {
 	struct io_ring_ctx *ctx = file->private_data;
 	__poll_t mask = 0;
 
+	if (unlikely(!ctx->poll_activated))
+		io_activate_pollwq(ctx);
+
 	poll_wait(file, &ctx->poll_wq, wait);
 	/*
 	 * synchronizes with barrier from wq_has_sleeper call in
@@ -3595,6 +3639,13 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	    !(ctx->flags & IORING_SETUP_SQPOLL))
 		ctx->task_complete = true;
 
+	/*
+	 * lazy poll_wq activation relies on ->task_complete for synchronisation
+	 * purposes, see io_activate_pollwq()
+	 */
+	if (!ctx->task_complete)
+		ctx->poll_activated = true;
+
 	/*
 	 * When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, user
 	 * space applications don't need to do io completion events
@@ -3888,8 +3939,15 @@ static int io_register_enable_rings(struct io_ring_ctx *ctx)
 	if (!(ctx->flags & IORING_SETUP_R_DISABLED))
 		return -EBADFD;
 
-	if (ctx->flags & IORING_SETUP_SINGLE_ISSUER && !ctx->submitter_task)
+	if (ctx->flags & IORING_SETUP_SINGLE_ISSUER && !ctx->submitter_task) {
 		ctx->submitter_task = get_task_struct(current);
+		/*
+		 * Lazy activation attempts would fail if it was polled before
+		 * submitter_task is set.
+		 */
+		if (wq_has_sleeper(&ctx->poll_wq))
+			io_activate_pollwq(ctx);
+	}
 
 	if (ctx->restrictions.registered)
 		ctx->restricted = 1;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index c75bbb94703c..5113e0ddb01d 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -222,7 +222,7 @@ static inline void io_commit_cqring(struct io_ring_ctx *ctx)
 
 static inline void io_poll_wq_wake(struct io_ring_ctx *ctx)
 {
-	if (waitqueue_active(&ctx->poll_wq))
+	if (wq_has_sleeper(&ctx->poll_wq))
 		__wake_up(&ctx->poll_wq, TASK_NORMAL, 0,
 				poll_to_key(EPOLL_URING_WAKE | EPOLLIN));
 }
@@ -230,8 +230,6 @@ static inline void io_poll_wq_wake(struct io_ring_ctx *ctx)
 /* requires smp_mb() prior, see wq_has_sleeper() */
 static inline void __io_cqring_wake(struct io_ring_ctx *ctx)
 {
-	io_poll_wq_wake(ctx);
-
 	/*
 	 * Trigger waitqueue handler on all waiters on our waitqueue. This
 	 * won't necessarily wake up all the tasks, io_should_wake() will make
@@ -316,7 +314,8 @@ static inline void io_req_complete_defer(struct io_kiocb *req)
 
 static inline void io_commit_cqring_flush(struct io_ring_ctx *ctx)
 {
-	if (unlikely(ctx->off_timeout_used || ctx->drain_active || ctx->has_evfd))
+	if (unlikely(ctx->off_timeout_used || ctx->drain_active ||
+		     ctx->has_evfd || ctx->poll_activated))
 		__io_commit_cqring_flush(ctx);
 }
 
-- 
2.38.1



* [PATCH v3 08/11] io_uring: wake up optimisations
  2023-01-09 14:46 [PATCH v3 00/11] CQ waiting / task_work optimisations Pavel Begunkov
                   ` (6 preceding siblings ...)
  2023-01-09 14:46 ` [PATCH v3 07/11] io_uring: add lazy poll_wq activation Pavel Begunkov
@ 2023-01-09 14:46 ` Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 09/11] io_uring: waitqueue-less cq waiting Pavel Begunkov
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2023-01-09 14:46 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Flushing completions is done either from the submit syscall or by
task_work; both run in the context of the submitter task, and when it
comes to single-threaded rings, as implied by ->task_complete, there
can't be any waiters on ->cq_wait but the master task. That means no
tasks can be sleeping on cq_wait while we run
__io_submit_flush_completions(), so waking up can be skipped.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a6ed022c1356..0dabd0f3271f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -621,6 +621,25 @@ static inline void __io_cq_unlock_post(struct io_ring_ctx *ctx)
 	io_cqring_wake(ctx);
 }
 
+static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
+	__releases(ctx->completion_lock)
+{
+	io_commit_cqring(ctx);
+	__io_cq_unlock(ctx);
+	io_commit_cqring_flush(ctx);
+
+	/*
+	 * As ->task_complete implies that the ring is single tasked, cq_wait
+	 * may only be waited on by the current in io_cqring_wait(), but since
+	 * it will re-check the wakeup conditions once we return we can safely
+	 * skip waking it up.
+	 */
+	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
+		smp_mb();
+		__io_cqring_wake(ctx);
+	}
+}
+
 void io_cq_unlock_post(struct io_ring_ctx *ctx)
 	__releases(ctx->completion_lock)
 {
@@ -1480,7 +1499,7 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 			}
 		}
 	}
-	__io_cq_unlock_post(ctx);
+	__io_cq_unlock_post_flush(ctx);
 
 	if (!wq_list_empty(&ctx->submit_state.compl_reqs)) {
 		io_free_batch_list(ctx, state->compl_reqs.first);
-- 
2.38.1



* [PATCH v3 09/11] io_uring: waitqueue-less cq waiting
  2023-01-09 14:46 [PATCH v3 00/11] CQ waiting / task_work optimisations Pavel Begunkov
                   ` (7 preceding siblings ...)
  2023-01-09 14:46 ` [PATCH v3 08/11] io_uring: wake up optimisations Pavel Begunkov
@ 2023-01-09 14:46 ` Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 10/11] io_uring: add io_req_local_work_add wake fast path Pavel Begunkov
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2023-01-09 14:46 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

With DEFER_TASKRUN only ctx->submitter_task may be waiting for CQEs; we
can use this to optimise io_cqring_wait() by replacing the ->cq_wait
waitqueue with waking the task directly.

It works, but misses an important optimisation covered by the following
patch, so this patch alone, without the follow-ups, might hurt
performance.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0dabd0f3271f..62d879b14873 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1263,7 +1263,7 @@ static void io_req_local_work_add(struct io_kiocb *req)
 		percpu_ref_put(&ctx->refs);
 		return;
 	}
-	/* need it for the following io_cqring_wake() */
+	/* needed for the following wake up */
 	smp_mb__after_atomic();
 
 	if (unlikely(atomic_read(&req->task->io_uring->in_idle))) {
@@ -1274,10 +1274,9 @@ static void io_req_local_work_add(struct io_kiocb *req)
 
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 		atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
-
 	if (ctx->has_evfd)
 		io_eventfd_signal(ctx);
-	__io_cqring_wake(ctx);
+	wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
 	percpu_ref_put(&ctx->refs);
 }
 
@@ -2578,12 +2577,17 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	do {
 		unsigned long check_cq;
 
-		prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
-						TASK_INTERRUPTIBLE);
+		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+			set_current_state(TASK_INTERRUPTIBLE);
+		} else {
+			prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
+							TASK_INTERRUPTIBLE);
+		}
+
 		ret = io_cqring_wait_schedule(ctx, &iowq);
+		__set_current_state(TASK_RUNNING);
 		if (ret < 0)
 			break;
-		__set_current_state(TASK_RUNNING);
 		/*
 		 * Run task_work after scheduling and before io_should_wake().
 		 * If we got woken because of task_work being processed, run it
@@ -2611,7 +2615,8 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		cond_resched();
 	} while (1);
 
-	finish_wait(&ctx->cq_wait, &iowq.wq);
+	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+		finish_wait(&ctx->cq_wait, &iowq.wq);
 	restore_saved_sigmask_unless(ret == -EINTR);
 
 	return READ_ONCE(rings->cq.head) == READ_ONCE(rings->cq.tail) ? ret : 0;
-- 
2.38.1



* [PATCH v3 10/11] io_uring: add io_req_local_work_add wake fast path
  2023-01-09 14:46 [PATCH v3 00/11] CQ waiting / task_work optimisations Pavel Begunkov
                   ` (8 preceding siblings ...)
  2023-01-09 14:46 ` [PATCH v3 09/11] io_uring: waitqueue-less cq waiting Pavel Begunkov
@ 2023-01-09 14:46 ` Pavel Begunkov
  2023-01-09 14:46 ` [PATCH v3 11/11] io_uring: optimise deferred tw execution Pavel Begunkov
  2023-01-11 18:00 ` [PATCH v3 00/11] CQ waiting / task_work optimisations Jens Axboe
  11 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2023-01-09 14:46 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Don't wake the master task after queueing a deferred tw unless it's
currently waiting in io_cqring_wait().

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 1 +
 io_uring/io_uring.c            | 7 ++++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 7b5e90520278..cc0cf0705b8f 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -295,6 +295,7 @@ struct io_ring_ctx {
 		spinlock_t		completion_lock;
 
 		bool			poll_multi_queue;
+		bool			cq_waiting;
 
 		/*
 		 * ->iopoll_list is protected by the ctx->uring_lock for
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 62d879b14873..9c95ceb1a9f2 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1276,7 +1276,9 @@ static void io_req_local_work_add(struct io_kiocb *req)
 		atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
 	if (ctx->has_evfd)
 		io_eventfd_signal(ctx);
-	wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
+
+	if (READ_ONCE(ctx->cq_waiting))
+		wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
 	percpu_ref_put(&ctx->refs);
 }
 
@@ -2578,6 +2580,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		unsigned long check_cq;
 
 		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+			WRITE_ONCE(ctx->cq_waiting, 1);
 			set_current_state(TASK_INTERRUPTIBLE);
 		} else {
 			prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
@@ -2586,6 +2589,8 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 
 		ret = io_cqring_wait_schedule(ctx, &iowq);
 		__set_current_state(TASK_RUNNING);
+		WRITE_ONCE(ctx->cq_waiting, 0);
+
 		if (ret < 0)
 			break;
 		/*
-- 
2.38.1



* [PATCH v3 11/11] io_uring: optimise deferred tw execution
  2023-01-09 14:46 [PATCH v3 00/11] CQ waiting / task_work optimisations Pavel Begunkov
                   ` (9 preceding siblings ...)
  2023-01-09 14:46 ` [PATCH v3 10/11] io_uring: add io_req_local_work_add wake fast path Pavel Begunkov
@ 2023-01-09 14:46 ` Pavel Begunkov
  2023-01-11 18:00 ` [PATCH v3 00/11] CQ waiting / task_work optimisations Jens Axboe
  11 siblings, 0 replies; 13+ messages in thread
From: Pavel Begunkov @ 2023-01-09 14:46 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

We needed fake nodes in __io_run_local_work() to avoid unnecessary wake
ups while the task is already running task_works, but we don't need them
anymore since wake ups are now protected by cq_waiting, which is always
cleared by the time we're executing deferred task_work items.

Note that because of the loose synchronisation around clearing
cq_waiting, io_req_local_work_add() may wake the task more than once,
but that's fine and should be rare enough not to hurt performance.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 9c95ceb1a9f2..1dd0fc0412c8 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1322,17 +1322,16 @@ static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 static int __io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
 {
 	struct llist_node *node;
-	struct llist_node fake;
-	struct llist_node *current_final = NULL;
+	unsigned int loops = 0;
 	int ret = 0;
-	unsigned int loops = 1;
 
 	if (WARN_ON_ONCE(ctx->submitter_task != current))
 		return -EEXIST;
-
-	node = io_llist_xchg(&ctx->work_llist, &fake);
+	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
+		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
 again:
-	while (node != current_final) {
+	node = io_llist_xchg(&ctx->work_llist, NULL);
+	while (node) {
 		struct llist_node *next = node->next;
 		struct io_kiocb *req = container_of(node, struct io_kiocb,
 						    io_task_work.node);
@@ -1341,23 +1340,14 @@ static int __io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
 		ret++;
 		node = next;
 	}
+	loops++;
 
-	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
-		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
-
-	node = io_llist_cmpxchg(&ctx->work_llist, &fake, NULL);
-	if (node != &fake) {
-		loops++;
-		current_final = &fake;
-		node = io_llist_xchg(&ctx->work_llist, &fake);
+	if (!llist_empty(&ctx->work_llist))
 		goto again;
-	}
-
 	if (*locked)
 		io_submit_flush_completions(ctx);
 	trace_io_uring_local_work_run(ctx, ret, loops);
 	return ret;
-
 }
 
 static inline int io_run_local_work_locked(struct io_ring_ctx *ctx)
-- 
2.38.1



* Re: [PATCH v3 00/11] CQ waiting / task_work optimisations
  2023-01-09 14:46 [PATCH v3 00/11] CQ waiting / task_work optimisations Pavel Begunkov
                   ` (10 preceding siblings ...)
  2023-01-09 14:46 ` [PATCH v3 11/11] io_uring: optimise deferred tw execution Pavel Begunkov
@ 2023-01-11 18:00 ` Jens Axboe
  11 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2023-01-11 18:00 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov


On Mon, 09 Jan 2023 14:46:02 +0000, Pavel Begunkov wrote:
> For DEFER_TASKRUN rings, replace CQ waitqueues with a custom implementation
> based on the fact that only one task may be waiting for completions. Also,
> improve deferred task_work running by removing one atomic in patch 11.
> 
> Benchmarking QD1 with simulated tw arrival right after we start waiting:
> 7.5 MIOPS -> 9.3 (+23%), where half of the CPU cycles go to syscall overhead.
> 
> [...]

Applied, thanks!

[01/11] io_uring: move submitter_task out of cold cacheline
        commit: 8516c8b514839600b7e63090f2dce5b4d658fd68
[02/11] io_uring: refactor io_wake_function
        commit: 291f31bf963c0018a2b84a94388a0e7b535c3dae
[03/11] io_uring: don't set TASK_RUNNING in local tw runner
        commit: 5eb30c28823aed63946c9d2a222bf1158a3aecae
[04/11] io_uring: mark io_run_local_work static
        commit: 88d14c077c1a04555978c499acd12f5f55de51da
[05/11] io_uring: move io_run_local_work_locked
        commit: 78c37b460a63c866050d3e05d6d4bfddf654075e
[06/11] io_uring: separate wq for ring polling
        commit: 6b40f3c9a37b97e629a99a92d2c392d77ae20f60
[07/11] io_uring: add lazy poll_wq activation
        commit: e05f6f47bf8aed0e97d9ba1d52e2a10ea542609c
[08/11] io_uring: wake up optimisations
        commit: ef3ddc6ac629fc829ed6e08418e1c070332dde63
[09/11] io_uring: waitqueue-less cq waiting
        commit: 65ca9dd8ce5e3de42b100f0e7d2ae360e9b8d14e
[10/11] io_uring: add io_req_local_work_add wake fast path
        commit: 6cd16656e2ddc63ee7aae7c7f27edcab933a0e09
[11/11] io_uring: optimise deferred tw execution
        commit: 607947314b4c9f8c979f79c095da9156b41c82b8

Best regards,
-- 
Jens Axboe




