* [PATCH v2 1/8] io_uring: move pinning out of io_req_local_work_add
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
Move ctx pinning from io_req_local_work_add() to the caller; it looks
better and makes working with the code a bit easier.
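For illustration, a minimal userspace C sketch of the same refactor (hypothetical names, C11 atomics standing in for percpu_ref): the helper keeps its early returns without needing a "goto put_ref", while the single caller pins and unpins around the call.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int ctx_refs = 1;

/* callee: several early-return paths, none of which needs "goto put_ref" */
static void local_work_add(bool queued_first)
{
        if (!queued_first)
                return;
        puts("signal waiters / wake up task");
}

/* caller: pin once, unpin on the single exit */
static void task_work_add(bool queued_first)
{
        atomic_fetch_add(&ctx_refs, 1);
        local_work_add(queued_first);
        atomic_fetch_sub(&ctx_refs, 1);
}

int main(void)
{
        task_work_add(true);
        task_work_add(false);
        return 0;
}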
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ae90d2753e0d..29a0516ee5ce 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1306,17 +1306,15 @@ static void io_req_local_work_add(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
- percpu_ref_get(&ctx->refs);
-
if (!llist_add(&req->io_task_work.node, &ctx->work_llist))
- goto put_ref;
+ return;
/* needed for the following wake up */
smp_mb__after_atomic();
if (unlikely(atomic_read(&req->task->io_uring->in_cancel))) {
io_move_task_work_from_local(ctx);
- goto put_ref;
+ return;
}
if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
@@ -1326,9 +1324,6 @@ static void io_req_local_work_add(struct io_kiocb *req)
if (READ_ONCE(ctx->cq_waiting))
wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
-
-put_ref:
- percpu_ref_put(&ctx->refs);
}
void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
@@ -1337,7 +1332,9 @@ void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
struct io_ring_ctx *ctx = req->ctx;
if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+ percpu_ref_get(&ctx->refs);
io_req_local_work_add(req);
+ percpu_ref_put(&ctx->refs);
return;
}
--
2.40.0
* [PATCH v2 2/8] io_uring: optimise local tw add ctx pinning
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
We currently pin the ctx for io_req_local_work_add() with
percpu_ref_get/put, which imply two rcu_read_lock/unlock pairs and some
extra overhead on top in the fast path. Replace it with a plain RCU read
section and let io_ring_exit_work() synchronise against it with
synchronize_rcu().
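For illustration, a minimal userspace analogue of that pairing, assuming liburcu is available (the patch itself uses the in-kernel RCU API): readers run inside a cheap rcu_read_lock() section, and the teardown path calls synchronize_rcu() before freeing, so any reader that saw the object is guaranteed to have finished with it.

#include <stdio.h>
#include <stdlib.h>
#include <urcu.h>               /* userspace RCU; link with -lurcu */

struct ctx {
        int alive;
};

static void reader(struct ctx *ctx)
{
        rcu_read_lock();        /* no atomics in the fast path */
        printf("ctx alive=%d\n", ctx->alive);
        rcu_read_unlock();
}

int main(void)
{
        struct ctx *ctx = calloc(1, sizeof(*ctx));

        if (!ctx)
                return 1;
        ctx->alive = 1;
        rcu_register_thread();
        reader(ctx);

        /* teardown: pairs with the reader's RCU section above */
        synchronize_rcu();      /* wait for all in-flight readers */
        free(ctx);

        rcu_unregister_thread();
        return 0;
}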
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 29a0516ee5ce..fb7215b543cd 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1332,9 +1332,9 @@ void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
struct io_ring_ctx *ctx = req->ctx;
if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
- percpu_ref_get(&ctx->refs);
+ rcu_read_lock();
io_req_local_work_add(req);
- percpu_ref_put(&ctx->refs);
+ rcu_read_unlock();
return;
}
@@ -3052,6 +3052,10 @@ static __cold void io_ring_exit_work(struct work_struct *work)
spin_lock(&ctx->completion_lock);
spin_unlock(&ctx->completion_lock);
+ /* pairs with RCU read section in io_req_local_work_add() */
+ if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
+ synchronize_rcu();
+
io_ring_ctx_free(ctx);
}
--
2.40.0
* [PATCH v2 3/8] io_uring: refactor __io_cq_unlock_post_flush()
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
Instead of smp_mb() + __io_cqring_wake() in __io_cq_unlock_post_flush(),
use the equivalent io_cqring_wake(). With that we can clean it up further
and remove __io_cqring_wake().
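The smp_mb() can be folded away because wq_has_sleeper(), which io_cqring_wake() uses after this patch, already issues the full barrier the waker needs before checking for sleepers. A self-contained C11 sketch of that barrier pairing (illustrative flags, not the kernel helpers):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool cond_set;            /* "a CQE was posted" */
static atomic_bool have_waiter;         /* "someone sleeps on cq_wait" */

/* completion side: publish the condition, then look for sleepers */
static void post_and_maybe_wake(void)
{
        atomic_store_explicit(&cond_set, true, memory_order_relaxed);
        /* full fence before peeking at the waiter flag; wq_has_sleeper()
         * bundles exactly this barrier in the kernel */
        atomic_thread_fence(memory_order_seq_cst);
        if (atomic_load_explicit(&have_waiter, memory_order_relaxed))
                puts("wake the waiter");
}

/* wait side: announce ourselves, then re-check the condition */
static bool prepare_to_wait_and_check(void)
{
        atomic_store_explicit(&have_waiter, true, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        /* a concurrent waker either sees have_waiter or we see cond_set */
        return atomic_load_explicit(&cond_set, memory_order_relaxed);
}

int main(void)
{
        prepare_to_wait_and_check();
        post_and_maybe_wake();
        return 0;
}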
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 6 ++----
io_uring/io_uring.h | 11 ++---------
2 files changed, 4 insertions(+), 13 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index fb7215b543cd..d4ac62de2113 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -640,10 +640,8 @@ static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
* it will re-check the wakeup conditions once we return we can safely
* skip waking it up.
*/
- if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
- smp_mb();
- __io_cqring_wake(ctx);
- }
+ if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+ io_cqring_wake(ctx);
}
void io_cq_unlock_post(struct io_ring_ctx *ctx)
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 193b2db39fe8..24d8196bbca3 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -228,8 +228,7 @@ static inline void io_poll_wq_wake(struct io_ring_ctx *ctx)
poll_to_key(EPOLL_URING_WAKE | EPOLLIN));
}
-/* requires smb_mb() prior, see wq_has_sleeper() */
-static inline void __io_cqring_wake(struct io_ring_ctx *ctx)
+static inline void io_cqring_wake(struct io_ring_ctx *ctx)
{
/*
* Trigger waitqueue handler on all waiters on our waitqueue. This
@@ -241,17 +240,11 @@ static inline void __io_cqring_wake(struct io_ring_ctx *ctx)
* waitqueue handlers, we know we have a dependency between eventfd or
* epoll and should terminate multishot poll at that point.
*/
- if (waitqueue_active(&ctx->cq_wait))
+ if (wq_has_sleeper(&ctx->cq_wait))
__wake_up(&ctx->cq_wait, TASK_NORMAL, 0,
poll_to_key(EPOLL_URING_WAKE | EPOLLIN));
}
-static inline void io_cqring_wake(struct io_ring_ctx *ctx)
-{
- smp_mb();
- __io_cqring_wake(ctx);
-}
-
static inline bool io_sqring_full(struct io_ring_ctx *ctx)
{
struct io_rings *r = ctx->rings;
--
2.40.0
* Re: [PATCH v2 3/8] io_uring: refactor __io_cq_unlock_post_flush()
From: Pavel Begunkov @ 2023-04-06 14:23 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, linux-kernel
On 4/6/23 14:20, Pavel Begunkov wrote:
> Instead of smp_mb() + __io_cqring_wake() in __io_cq_unlock_post_flush()
> use equivalent io_cqring_wake(). With that we can clean it up further
> and remove __io_cqring_wake().
I didn't notice that patches 3 and 7 have the same subject. This one
would have been better called "refactor io_cqring_wake()".
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> io_uring/io_uring.c | 6 ++----
> io_uring/io_uring.h | 11 ++---------
> 2 files changed, 4 insertions(+), 13 deletions(-)
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index fb7215b543cd..d4ac62de2113 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -640,10 +640,8 @@ static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
> * it will re-check the wakeup conditions once we return we can safely
> * skip waking it up.
> */
> - if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
> - smp_mb();
> - __io_cqring_wake(ctx);
> - }
> + if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
> + io_cqring_wake(ctx);
> }
>
--
Pavel Begunkov
* [PATCH v2 4/8] io_uring: add tw add flags
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
We pass 'allow_local' into io_req_task_work_add() but will need more
flags. Replace the bool with a flags bit field and give the allow_local
behaviour a named flag.
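A tiny userspace sketch of the call-site change, with hypothetical names (the kernel's flag is IOU_F_TWQ_FORCE_NORMAL, the inverse of allow_local), showing how new behaviours slot in without churning every caller:

#include <stdio.h>

enum {
        TWQ_FORCE_NORMAL = 1 << 0,      /* don't use deferred task_work */
        /* later flags, e.g. a lazy-wake hint, are added here */
};

static void task_work_add(unsigned flags)
{
        if (!(flags & TWQ_FORCE_NORMAL))
                puts("deferred (local) task_work path");
        else
                puts("normal task_work path");
}

int main(void)
{
        task_work_add(0);                       /* old allow_local == true */
        task_work_add(TWQ_FORCE_NORMAL);        /* old allow_local == false */
        return 0;
}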
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 7 ++++---
io_uring/io_uring.h | 9 +++++++--
2 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index d4ac62de2113..6f175fe682e4 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1324,12 +1324,13 @@ static void io_req_local_work_add(struct io_kiocb *req)
wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
}
-void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
+void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
{
struct io_uring_task *tctx = req->task->io_uring;
struct io_ring_ctx *ctx = req->ctx;
- if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+ if (!(flags & IOU_F_TWQ_FORCE_NORMAL) &&
+ (ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
rcu_read_lock();
io_req_local_work_add(req);
rcu_read_unlock();
@@ -1359,7 +1360,7 @@ static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
io_task_work.node);
node = node->next;
- __io_req_task_work_add(req, false);
+ __io_req_task_work_add(req, IOU_F_TWQ_FORCE_NORMAL);
}
}
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 24d8196bbca3..cb4309a2acdc 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -15,6 +15,11 @@
#include <trace/events/io_uring.h>
#endif
+enum {
+ /* don't use deferred task_work */
+ IOU_F_TWQ_FORCE_NORMAL = 1,
+};
+
enum {
IOU_OK = 0,
IOU_ISSUE_SKIP_COMPLETE = -EIOCBQUEUED,
@@ -48,7 +53,7 @@ static inline bool io_req_ffs_set(struct io_kiocb *req)
return req->flags & REQ_F_FIXED_FILE;
}
-void __io_req_task_work_add(struct io_kiocb *req, bool allow_local);
+void __io_req_task_work_add(struct io_kiocb *req, unsigned flags);
bool io_is_uring_fops(struct file *file);
bool io_alloc_async_data(struct io_kiocb *req);
void io_req_task_queue(struct io_kiocb *req);
@@ -93,7 +98,7 @@ bool io_match_task_safe(struct io_kiocb *head, struct task_struct *task,
static inline void io_req_task_work_add(struct io_kiocb *req)
{
- __io_req_task_work_add(req, true);
+ __io_req_task_work_add(req, 0);
}
#define io_for_each_link(pos, head) \
--
2.40.0
* [PATCH v2 5/8] io_uring: inline llist_add()
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
We'll need to grab some information from the previous request in the tw
list, so inline llist_add() here; it'll be used in the following patch.
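The point of open-coding it is that the compare-exchange loop hands back the previous list head, which the next patch reads. A minimal userspace C11 analogue (illustrative types and names, not the kernel llist implementation):

#include <stdatomic.h>
#include <stdio.h>

struct node {
        struct node *next;
        unsigned nr_tw;         /* per-item hint used by the next patch */
};

static _Atomic(struct node *) list_head;

/* push a node and return the previous head, which llist_add() hides */
static struct node *push_get_prev(struct node *node)
{
        struct node *first = atomic_load_explicit(&list_head,
                                                  memory_order_relaxed);

        do {
                node->next = first;     /* link before publishing */
        } while (!atomic_compare_exchange_weak_explicit(&list_head, &first,
                                                        node,
                                                        memory_order_release,
                                                        memory_order_relaxed));
        return first;                   /* NULL: the list was empty */
}

int main(void)
{
        struct node a = { 0 }, b = { 0 };

        printf("prev=%p\n", (void *)push_get_prev(&a)); /* NULL */
        printf("prev=%p\n", (void *)push_get_prev(&b)); /* &a */
        return 0;
}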
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 6f175fe682e4..786ecfa01c54 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1303,8 +1303,15 @@ static __cold void io_fallback_tw(struct io_uring_task *tctx)
static void io_req_local_work_add(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
+ struct llist_node *first;
- if (!llist_add(&req->io_task_work.node, &ctx->work_llist))
+ first = READ_ONCE(ctx->work_llist.first);
+ do {
+ req->io_task_work.node.next = first;
+ } while (!try_cmpxchg(&ctx->work_llist.first, &first,
+ &req->io_task_work.node));
+
+ if (first)
return;
/* needed for the following wake up */
--
2.40.0
* [PATCH v2 6/8] io_uring: reduce scheduling due to tw
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
Every queued task_work item tries to wake up the task it will run in,
which causes excessive scheduling and additional overhead. For some tw
that's justified, but others won't do much more than post a single CQE.
When a task waits for multiple CQEs, every such task_work will wake it
up. Instead, the task may give a hint about how many CQEs it waits for;
io_req_local_work_add() will compare against it and skip wake-ups if
#CQEs + #tw is not enough to satisfy the waiting condition. task_work
that uses the optimisation should be simple enough to never post more
than one CQE. It's also ignored for non-DEFER_TASKRUN rings.
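The wake decision itself is small enough to sketch as a standalone helper (hypothetical names; in the kernel these values are read with atomics, and nr_tw is forced to -1U when the lazy hint is not set so the comparison always passes):

#include <stdbool.h>
#include <stdio.h>

static bool should_wake(unsigned nr_wait,       /* CQEs the waiter needs; 0: nobody waits */
                        unsigned nr_tw,         /* queued tw items including this one */
                        unsigned nr_tw_prev)    /* count before this add */
{
        if (!nr_wait)
                return false;           /* no one is waiting */
        if (nr_wait > nr_tw)
                return false;           /* not enough pending completions yet */
        if (nr_tw_prev >= nr_wait)
                return false;           /* a previous add already woke the task */
        return true;
}

int main(void)
{
        /* a waiter wanting 4 CQEs is only woken by the 4th queued item */
        for (unsigned n = 1; n <= 5; n++)
                printf("tw #%u -> wake=%d\n", n, should_wake(4, n, n - 1));
        return 0;
}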
Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 3 +-
io_uring/io_uring.c | 68 +++++++++++++++++++++++-----------
io_uring/io_uring.h | 9 +++++
io_uring/notif.c | 2 +-
io_uring/notif.h | 2 +-
io_uring/rw.c | 2 +-
6 files changed, 61 insertions(+), 25 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 4a6ce03a4903..fa621a508a01 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -296,7 +296,7 @@ struct io_ring_ctx {
spinlock_t completion_lock;
bool poll_multi_queue;
- bool cq_waiting;
+ atomic_t cq_wait_nr;
/*
* ->iopoll_list is protected by the ctx->uring_lock for
@@ -566,6 +566,7 @@ struct io_kiocb {
atomic_t refs;
atomic_t poll_refs;
struct io_task_work io_task_work;
+ unsigned nr_tw;
/* for polled requests, i.e. IORING_OP_POLL_ADD and async armed poll */
union {
struct hlist_node hash_node;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 786ecfa01c54..8a327a81beaf 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1300,35 +1300,59 @@ static __cold void io_fallback_tw(struct io_uring_task *tctx)
}
}
-static void io_req_local_work_add(struct io_kiocb *req)
+static void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
{
struct io_ring_ctx *ctx = req->ctx;
+ unsigned nr_wait, nr_tw, nr_tw_prev;
struct llist_node *first;
+ if (req->flags & (REQ_F_LINK | REQ_F_HARDLINK))
+ flags &= ~IOU_F_TWQ_LAZY_WAKE;
+
first = READ_ONCE(ctx->work_llist.first);
do {
+ nr_tw_prev = 0;
+ if (first) {
+ struct io_kiocb *first_req = container_of(first,
+ struct io_kiocb,
+ io_task_work.node);
+ /*
+ * Might be executed at any moment, rely on
+ * SLAB_TYPESAFE_BY_RCU to keep it alive.
+ */
+ nr_tw_prev = READ_ONCE(first_req->nr_tw);
+ }
+ nr_tw = nr_tw_prev + 1;
+ /* Large enough to fail the nr_wait comparison below */
+ if (!(flags & IOU_F_TWQ_LAZY_WAKE))
+ nr_tw = -1U;
+
+ req->nr_tw = nr_tw;
req->io_task_work.node.next = first;
} while (!try_cmpxchg(&ctx->work_llist.first, &first,
&req->io_task_work.node));
- if (first)
- return;
-
- /* needed for the following wake up */
- smp_mb__after_atomic();
-
- if (unlikely(atomic_read(&req->task->io_uring->in_cancel))) {
- io_move_task_work_from_local(ctx);
- return;
+ if (!first) {
+ if (unlikely(atomic_read(&req->task->io_uring->in_cancel))) {
+ io_move_task_work_from_local(ctx);
+ return;
+ }
+ if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
+ atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
+ if (ctx->has_evfd)
+ io_eventfd_signal(ctx);
}
- if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
- atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
- if (ctx->has_evfd)
- io_eventfd_signal(ctx);
-
- if (READ_ONCE(ctx->cq_waiting))
- wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
+ nr_wait = atomic_read(&ctx->cq_wait_nr);
+ /* no one is waiting */
+ if (!nr_wait)
+ return;
+ /* either not enough or the previous add has already woken it up */
+ if (nr_wait > nr_tw || nr_tw_prev >= nr_wait)
+ return;
+ /* pairs with set_current_state() in io_cqring_wait() */
+ smp_mb__after_atomic();
+ wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
}
void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
@@ -1339,7 +1363,7 @@ void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
if (!(flags & IOU_F_TWQ_FORCE_NORMAL) &&
(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
rcu_read_lock();
- io_req_local_work_add(req);
+ io_req_local_work_add(req, flags);
rcu_read_unlock();
return;
}
@@ -2625,7 +2649,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
unsigned long check_cq;
if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
- WRITE_ONCE(ctx->cq_waiting, 1);
+ int nr_wait = (int) iowq.cq_tail - READ_ONCE(ctx->rings->cq.tail);
+
+ atomic_set(&ctx->cq_wait_nr, nr_wait);
set_current_state(TASK_INTERRUPTIBLE);
} else {
prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
@@ -2634,7 +2660,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
ret = io_cqring_wait_schedule(ctx, &iowq);
__set_current_state(TASK_RUNNING);
- WRITE_ONCE(ctx->cq_waiting, 0);
+ atomic_set(&ctx->cq_wait_nr, 0);
if (ret < 0)
break;
@@ -4517,7 +4543,7 @@ static int __init io_uring_init(void)
io_uring_optable_init();
req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC |
- SLAB_ACCOUNT);
+ SLAB_ACCOUNT | SLAB_TYPESAFE_BY_RCU);
return 0;
};
__initcall(io_uring_init);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index cb4309a2acdc..ef449e43d493 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -18,6 +18,15 @@
enum {
/* don't use deferred task_work */
IOU_F_TWQ_FORCE_NORMAL = 1,
+
+ /*
+ * A hint to not wake right away but delay until there are enough of
+ * tw's queued to match the number of CQEs the task is waiting for.
+ *
+ * Must not be used with requests generating more than one CQE.
+ * It's also ignored unless IORING_SETUP_DEFER_TASKRUN is set.
+ */
+ IOU_F_TWQ_LAZY_WAKE = 2,
};
enum {
diff --git a/io_uring/notif.c b/io_uring/notif.c
index 172105eb347d..e1846a25dde1 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -31,7 +31,7 @@ static void io_tx_ubuf_callback(struct sk_buff *skb, struct ubuf_info *uarg,
struct io_kiocb *notif = cmd_to_io_kiocb(nd);
if (refcount_dec_and_test(&uarg->refcnt))
- io_req_task_work_add(notif);
+ __io_req_task_work_add(notif, IOU_F_TWQ_LAZY_WAKE);
}
static void io_tx_ubuf_callback_ext(struct sk_buff *skb, struct ubuf_info *uarg,
diff --git a/io_uring/notif.h b/io_uring/notif.h
index c88c800cd89d..6dd1b30a468f 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -33,7 +33,7 @@ static inline void io_notif_flush(struct io_kiocb *notif)
/* drop slot's master ref */
if (refcount_dec_and_test(&nd->uarg.refcnt))
- io_req_task_work_add(notif);
+ __io_req_task_work_add(notif, IOU_F_TWQ_LAZY_WAKE);
}
static inline int io_notif_account_mem(struct io_kiocb *notif, unsigned len)
diff --git a/io_uring/rw.c b/io_uring/rw.c
index f14868624f41..6c7d2654770e 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -304,7 +304,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
return;
io_req_set_res(req, io_fixup_rw_res(req, res), 0);
req->io_task_work.func = io_req_rw_complete;
- io_req_task_work_add(req);
+ __io_req_task_work_add(req, IOU_F_TWQ_LAZY_WAKE);
}
static void io_complete_rw_iopoll(struct kiocb *kiocb, long res)
--
2.40.0
* [PATCH v2 7/8] io_uring: refactor __io_cq_unlock_post_flush()
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
Separate the ->task_complete path in __io_cq_unlock_post_flush().
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8a327a81beaf..0ea50c46f27f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -627,21 +627,23 @@ static inline void __io_cq_unlock_post(struct io_ring_ctx *ctx)
io_cqring_wake(ctx);
}
-static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
+static void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
__releases(ctx->completion_lock)
{
io_commit_cqring(ctx);
- __io_cq_unlock(ctx);
- io_commit_cqring_flush(ctx);
- /*
- * As ->task_complete implies that the ring is single tasked, cq_wait
- * may only be waited on by the current in io_cqring_wait(), but since
- * it will re-check the wakeup conditions once we return we can safely
- * skip waking it up.
- */
- if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+ if (ctx->task_complete) {
+ /*
+ * ->task_complete implies that only current might be waiting
+ * for CQEs, and obviously, we currently don't. No one is
+ * waiting, wakeups are futile, skip them.
+ */
+ io_commit_cqring_flush(ctx);
+ } else {
+ __io_cq_unlock(ctx);
+ io_commit_cqring_flush(ctx);
io_cqring_wake(ctx);
+ }
}
void io_cq_unlock_post(struct io_ring_ctx *ctx)
--
2.40.0
* [PATCH v2 8/8] io_uring: optimise io_req_local_work_add
From: Pavel Begunkov @ 2023-04-06 13:20 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel
Chains of memory accesses are never good for performance.
The req->task->io_uring->in_cancel check in io_req_local_work_add() is
there so that when a task exiting via io_uring_try_cancel_requests()
starts waiting for completions, it gets woken up by every new task_work
item queued.
Do a little trick instead: announce waiting in
io_uring_try_cancel_requests(), so that io_req_local_work_add() wakes us
up. We also need to check for deferred tw items after
prepare_to_wait(TASK_INTERRUPTIBLE).
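A small C11 sketch of the announce-then-check pairing this relies on (illustrative names; in the kernel the announcement is atomic_set(&ctx->cq_wait_nr, 1) followed by smp_mb()): either the cancelling task sees the queued work when it re-checks, or the queueing side sees a non-zero waiter count and issues the wakeup.

#include <stdatomic.h>
#include <stdio.h>

static atomic_int cq_wait_nr;           /* 0: nobody waits; >0: wake on queue */
static atomic_int work_queued;          /* stand-in for the deferred tw list */

/* cancel/wait side */
static void announce_and_check(void)
{
        atomic_store_explicit(&cq_wait_nr, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);      /* pairs with fence below */
        if (atomic_load_explicit(&work_queued, memory_order_relaxed))
                puts("don't sleep: run the pending task_work first");
}

/* io_req_local_work_add() side */
static void queue_work(void)
{
        atomic_store_explicit(&work_queued, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        if (atomic_load_explicit(&cq_wait_nr, memory_order_relaxed))
                puts("wake the waiting task");
}

int main(void)
{
        announce_and_check();
        queue_work();
        return 0;
}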
Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0ea50c46f27f..9bbf58297a0e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1335,10 +1335,6 @@ static void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
&req->io_task_work.node));
if (!first) {
- if (unlikely(atomic_read(&req->task->io_uring->in_cancel))) {
- io_move_task_work_from_local(ctx);
- return;
- }
if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
if (ctx->has_evfd)
@@ -3205,6 +3201,12 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
enum io_wq_cancel cret;
bool ret = false;
+ /* set it so io_req_local_work_add() would wake us up */
+ if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+ atomic_set(&ctx->cq_wait_nr, 1);
+ smp_mb();
+ }
+
/* failed during ring init, it couldn't have issued any requests */
if (!ctx->rings)
return false;
@@ -3259,6 +3261,8 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
{
struct io_uring_task *tctx = current->io_uring;
struct io_ring_ctx *ctx;
+ struct io_tctx_node *node;
+ unsigned long index;
s64 inflight;
DEFINE_WAIT(wait);
@@ -3280,9 +3284,6 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
break;
if (!sqd) {
- struct io_tctx_node *node;
- unsigned long index;
-
xa_for_each(&tctx->xa, index, node) {
/* sqpoll task will cancel all its requests */
if (node->ctx->sq_data)
@@ -3305,7 +3306,13 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
prepare_to_wait(&tctx->wait, &wait, TASK_INTERRUPTIBLE);
io_run_task_work();
io_uring_drop_tctx_refs(current);
-
+ xa_for_each(&tctx->xa, index, node) {
+ if (!llist_empty(&node->ctx->work_llist)) {
+ WARN_ON_ONCE(node->ctx->submitter_task &&
+ node->ctx->submitter_task != current);
+ goto end_wait;
+ }
+ }
/*
* If we've seen completions, retry without waiting. This
* avoids a race where a completion comes in before we did
@@ -3313,6 +3320,7 @@ __cold void io_uring_cancel_generic(bool cancel_all, struct io_sq_data *sqd)
*/
if (inflight == tctx_inflight(tctx, !cancel_all))
schedule();
+end_wait:
finish_wait(&tctx->wait, &wait);
} while (1);
--
2.40.0
* Re: [PATCH v2 0/8] optimise rescheduling due to deferred tw
From: Jens Axboe @ 2023-04-12 1:53 UTC (permalink / raw)
To: io-uring, Pavel Begunkov; +Cc: linux-kernel
On Thu, 06 Apr 2023 14:20:06 +0100, Pavel Begunkov wrote:
> io_uring extensively uses task_work, but when a task is waiting
> every new queued task_work batch will try to wake it up and so
> cause lots of scheduling activity. This series optimises it,
> specifically applied for rw completions and send-zc notifications
> for now, and will be helpful for further optimisations.
>
> Quick testing shows results similar to v1; numbers from v1:
> For my zc net test, once in a while waiting for a portion of buffers,
> I've got a 10x decrease in the number of context switches and a 2x
> improvement in CPU util (17% vs 8%). In profiles, io_cqring_work()
> got down from 40-50% of CPU to ~13%.
>
> [...]
Applied, thanks!
[1/8] io_uring: move pinning out of io_req_local_work_add
commit: ab1c590f5c9b96d8d8843d351aed72469f8f2ef0
[2/8] io_uring: optimise local tw add ctx pinning
commit: d73a572df24661851465c821d33c03e70e4b68e5
[3/8] io_uring: refactor __io_cq_unlock_post_flush()
commit: c66ae3ec38f946edb1776d25c1c8cd63803b8ec3
[4/8] io_uring: add tw add flags
commit: 8501fe70ae9855076ffb03a3670e02a7b3437304
[5/8] io_uring: inline llist_add()
commit: 5150940079a3ce94d7474f6f5b0d6276569dc1de
[6/8] io_uring: reduce scheduling due to tw
commit: 8751d15426a31baaf40f7570263c27c3e5d1dc44
[7/8] io_uring: refactor __io_cq_unlock_post_flush()
commit: c66ae3ec38f946edb1776d25c1c8cd63803b8ec3
[8/8] io_uring: optimise io_req_local_work_add
commit: 360cd42c4e95ff06d8d7b0a54e42236c7e7c187f
Best regards,
--
Jens Axboe