public inbox for [email protected]
* [RFC v2 00/13] CQ waiting and wake up optimisations
@ 2023-01-03  3:03 Pavel Begunkov
  2023-01-03  3:03 ` [RFC v2 01/13] io_uring: rearrange defer list checks Pavel Begunkov
                   ` (14 more replies)
  0 siblings, 15 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

The series replaces waitqueues for CQ waiting with a custom waiting
loop and adds a couple more perf tweaks around it. Benchmarking is done
for QD1 with simulated tw arrival right after we start waiting; it
takes us from 7.5 MIOPS to 9.2, which is +22%, or double that number for
the in-kernel io_uring overhead alone (i.e. without syscall and userspace).
That matches profiles: wake_up() _without_ wake_up_state() was taking
12-14% and prepare_to_wait_exclusive() was around 4-6%.
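
In rough, heavily simplified terms (error handling, timeouts and the
check_cq handling omitted; see the actual patches below), the waiting
side changes like this:

  /* old: every waiter parks on ctx->cq_wait */
  prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq, TASK_INTERRUPTIBLE);
  ret = io_cqring_wait_schedule(ctx, &iowq, timeout);
  finish_wait(&ctx->cq_wait, &iowq.wq);

  /* new, DEFER_TASKRUN only: no waitqueue, the submitter task sets its
   * own state and the completion side wakes it directly via
   * wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE)
   */
  set_current_state(TASK_INTERRUPTIBLE);
  ret = io_cqring_wait_schedule(ctx, &iowq, timeout);
  __set_current_state(TASK_RUNNING);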

The extra 15% reported for v1 is no longer there, as it got optimised in
the meantime by 52ea806ad9834 ("io_uring: finish waiting before flushing
overflow entries"). So, compared to a couple of weeks ago, the perf of
this test case should have jumped more than 30% end-to-end (again, with
only half of the cycles spent in io_uring kernel code).
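
For context, the waitqueue-less waiting at the end of the series only
applies to DEFER_TASKRUN rings. A ring that exercises that path is set
up roughly as below (illustrative liburing snippet, not part of the
series; the helper name is made up):

  #include <liburing.h>

  /* DEFER_TASKRUN requires a single issuer, so both flags are set */
  static int setup_defer_taskrun_ring(struct io_uring *ring, unsigned entries)
  {
          struct io_uring_params p = { };

          p.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN;
          return io_uring_queue_init_params(entries, ring, &p);
  }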

Patches 1-8 are preparation patches and might be taken right away. The
rest needs more comments and maybe a little brushing up.

Pavel Begunkov (13):
  io_uring: rearrange defer list checks
  io_uring: don't iterate cq wait fast path
  io_uring: kill io_run_task_work_ctx
  io_uring: move defer tw task checks
  io_uring: parse check_cq out of wq waiting
  io_uring: mimimise io_cqring_wait_schedule
  io_uring: simplify io_has_work
  io_uring: set TASK_RUNNING right after schedule
  io_uring: separate wq for ring polling
  io_uring: add lazy poll_wq activation
  io_uring: wake up optimisations
  io_uring: waitqueue-less cq waiting
  io_uring: add io_req_local_work_add wake fast path

 include/linux/io_uring_types.h |   4 +
 io_uring/io_uring.c            | 194 +++++++++++++++++++++++----------
 io_uring/io_uring.h            |  35 +++---
 3 files changed, 155 insertions(+), 78 deletions(-)

-- 
2.38.1



* [RFC v2 01/13] io_uring: rearrange defer list checks
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
@ 2023-01-03  3:03 ` Pavel Begunkov
  2023-01-03  3:03 ` [RFC v2 02/13] io_uring: don't iterate cq wait fast path Pavel Begunkov
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

There should be nothing in ->work_llist for non-DEFER_TASKRUN rings, so
we can skip the flag check and test the list emptiness directly. Also
move the check out of io_run_local_work() so that it can be inlined.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 3 ---
 io_uring/io_uring.h | 2 +-
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 58ac13b69dc8..4f12619f9f21 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1338,9 +1338,6 @@ int io_run_local_work(struct io_ring_ctx *ctx)
 	bool locked;
 	int ret;
 
-	if (llist_empty(&ctx->work_llist))
-		return 0;
-
 	__set_current_state(TASK_RUNNING);
 	locked = mutex_trylock(&ctx->uring_lock);
 	ret = __io_run_local_work(ctx, &locked);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index e9f0d41ebb99..46c0f765a77a 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -274,7 +274,7 @@ static inline int io_run_task_work_ctx(struct io_ring_ctx *ctx)
 	int ret = 0;
 	int ret2;
 
-	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
+	if (!llist_empty(&ctx->work_llist))
 		ret = io_run_local_work(ctx);
 
 	/* want to run this after in case more is added */
-- 
2.38.1



* [RFC v2 02/13] io_uring: don't iterate cq wait fast path
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
  2023-01-03  3:03 ` [RFC v2 01/13] io_uring: rearrange defer list checks Pavel Begunkov
@ 2023-01-03  3:03 ` Pavel Begunkov
  2023-01-03  3:03 ` [RFC v2 03/13] io_uring: kill io_run_task_work_ctx Pavel Begunkov
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Task work runners keep running until all queued tw items are exhausted.
It's also rare for deferred tw to queue normal tw and vice versa. Taking
that into account, there is only a slim chance that iterating the
io_cqring_wait() fast path any further will get us anything, so we can
remove the loop there.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4f12619f9f21..d9a2cf061acc 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2507,18 +2507,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 
 	if (!io_allowed_run_tw(ctx))
 		return -EEXIST;
-
-	do {
-		/* always run at least 1 task work to process local work */
-		ret = io_run_task_work_ctx(ctx);
+	if (!llist_empty(&ctx->work_llist)) {
+		ret = io_run_local_work(ctx);
 		if (ret < 0)
 			return ret;
-		io_cqring_overflow_flush(ctx);
-
-		/* if user messes with these they will just get an early return */
-		if (__io_cqring_events_user(ctx) >= min_events)
-			return 0;
-	} while (ret > 0);
+	}
+	io_run_task_work();
+	io_cqring_overflow_flush(ctx);
+	/* if user messes with these they will just get an early return */
+	if (__io_cqring_events_user(ctx) >= min_events)
+		return 0;
 
 	if (sig) {
 #ifdef CONFIG_COMPAT
-- 
2.38.1



* [RFC v2 03/13] io_uring: kill io_run_task_work_ctx
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
  2023-01-03  3:03 ` [RFC v2 01/13] io_uring: rearrange defer list checks Pavel Begunkov
  2023-01-03  3:03 ` [RFC v2 02/13] io_uring: don't iterate cq wait fast path Pavel Begunkov
@ 2023-01-03  3:03 ` Pavel Begunkov
  2023-01-03  3:03 ` [RFC v2 04/13] io_uring: move defer tw task checks Pavel Begunkov
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

There is only one user of io_run_task_work_ctx(); inline it.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c |  6 +++++-
 io_uring/io_uring.h | 20 --------------------
 2 files changed, 5 insertions(+), 21 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index d9a2cf061acc..a22c6778a988 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2452,7 +2452,11 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
 
 int io_run_task_work_sig(struct io_ring_ctx *ctx)
 {
-	if (io_run_task_work_ctx(ctx) > 0)
+	if (!llist_empty(&ctx->work_llist)) {
+		if (io_run_local_work(ctx) > 0)
+			return 1;
+	}
+	if (io_run_task_work() > 0)
 		return 1;
 	if (task_sigpending(current))
 		return -EINTR;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 46c0f765a77a..8a5c3affd724 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -269,26 +269,6 @@ static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
 	return task_work_pending(current) || !wq_list_empty(&ctx->work_llist);
 }
 
-static inline int io_run_task_work_ctx(struct io_ring_ctx *ctx)
-{
-	int ret = 0;
-	int ret2;
-
-	if (!llist_empty(&ctx->work_llist))
-		ret = io_run_local_work(ctx);
-
-	/* want to run this after in case more is added */
-	ret2 = io_run_task_work();
-
-	/* Try propagate error in favour of if tasks were run,
-	 * but still make sure to run them if requested
-	 */
-	if (ret >= 0)
-		ret += ret2;
-
-	return ret;
-}
-
 static inline int io_run_local_work_locked(struct io_ring_ctx *ctx)
 {
 	bool locked;
-- 
2.38.1



* [RFC v2 04/13] io_uring: move defer tw task checks
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
                   ` (2 preceding siblings ...)
  2023-01-03  3:03 ` [RFC v2 03/13] io_uring: kill io_run_task_work_ctx Pavel Begunkov
@ 2023-01-03  3:03 ` Pavel Begunkov
  2023-01-03  3:03 ` [RFC v2 05/13] io_uring: parse check_cq out of wq waiting Pavel Begunkov
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Most places that want to run local tw explicitly check in advance
whether they are allowed to do so. Don't rely on a similar check in
__io_run_local_work(): leave it there only as a just-in-case warning and
make sure the callers check capabilities themselves.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 15 ++++++---------
 io_uring/io_uring.h |  5 +++++
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a22c6778a988..ff457e525e7c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1296,14 +1296,13 @@ int __io_run_local_work(struct io_ring_ctx *ctx, bool *locked)
 	struct llist_node *node;
 	struct llist_node fake;
 	struct llist_node *current_final = NULL;
-	int ret;
+	int ret = 0;
 	unsigned int loops = 1;
 
-	if (unlikely(ctx->submitter_task != current))
+	if (WARN_ON_ONCE(ctx->submitter_task != current))
 		return -EEXIST;
 
 	node = io_llist_xchg(&ctx->work_llist, &fake);
-	ret = 0;
 again:
 	while (node != current_final) {
 		struct llist_node *next = node->next;
@@ -2511,11 +2510,8 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 
 	if (!io_allowed_run_tw(ctx))
 		return -EEXIST;
-	if (!llist_empty(&ctx->work_llist)) {
-		ret = io_run_local_work(ctx);
-		if (ret < 0)
-			return ret;
-	}
+	if (!llist_empty(&ctx->work_llist))
+		io_run_local_work(ctx);
 	io_run_task_work();
 	io_cqring_overflow_flush(ctx);
 	/* if user messes with these they will just get an early return */
@@ -3052,7 +3048,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 		}
 	}
 
-	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
+	if ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) &&
+	    io_allowed_defer_tw_run(ctx))
 		ret |= io_run_local_work(ctx) > 0;
 	ret |= io_cancel_defer_files(ctx, task, cancel_all);
 	mutex_lock(&ctx->uring_lock);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 8a5c3affd724..9b7baeff5a1c 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -352,6 +352,11 @@ static inline struct io_kiocb *io_alloc_req(struct io_ring_ctx *ctx)
 	return container_of(node, struct io_kiocb, comp_list);
 }
 
+static inline bool io_allowed_defer_tw_run(struct io_ring_ctx *ctx)
+{
+	return likely(ctx->submitter_task == current);
+}
+
 static inline bool io_allowed_run_tw(struct io_ring_ctx *ctx)
 {
 	return likely(!(ctx->flags & IORING_SETUP_DEFER_TASKRUN) ||
-- 
2.38.1



* [RFC v2 05/13] io_uring: parse check_cq out of wq waiting
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
                   ` (3 preceding siblings ...)
  2023-01-03  3:03 ` [RFC v2 04/13] io_uring: move defer tw task checks Pavel Begunkov
@ 2023-01-03  3:03 ` Pavel Begunkov
  2023-01-03  3:03 ` [RFC v2 06/13] io_uring: mimimise io_cqring_wait_schedule Pavel Begunkov
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

We already avoid flushing overflows in io_cqring_wait_schedule() and
instead only return an error for the outer loop to handle. Minimise
io_cqring_wait_schedule() even further by moving all of the ->check_cq
parsing to the outer loop as well.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ff457e525e7c..e3c5de299baa 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2468,21 +2468,13 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 					  ktime_t timeout)
 {
 	int ret;
-	unsigned long check_cq;
 
+	if (unlikely(READ_ONCE(ctx->check_cq)))
+		return 1;
 	/* make sure we run task_work before checking for signals */
 	ret = io_run_task_work_sig(ctx);
 	if (ret || io_should_wake(iowq))
 		return ret;
-
-	check_cq = READ_ONCE(ctx->check_cq);
-	if (unlikely(check_cq)) {
-		/* let the caller flush overflows, retry */
-		if (check_cq & BIT(IO_CHECK_CQ_OVERFLOW_BIT))
-			return 1;
-		if (check_cq & BIT(IO_CHECK_CQ_DROPPED_BIT))
-			return -EBADR;
-	}
 	if (!schedule_hrtimeout(&timeout, HRTIMER_MODE_ABS))
 		return -ETIME;
 
@@ -2548,13 +2540,25 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 
 	trace_io_uring_cqring_wait(ctx, min_events);
 	do {
-		if (test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq)) {
-			finish_wait(&ctx->cq_wait, &iowq.wq);
-			io_cqring_do_overflow_flush(ctx);
-		}
+		unsigned long check_cq;
+
 		prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
 						TASK_INTERRUPTIBLE);
 		ret = io_cqring_wait_schedule(ctx, &iowq, timeout);
+
+		check_cq = READ_ONCE(ctx->check_cq);
+		if (unlikely(check_cq)) {
+			/* let the caller flush overflows, retry */
+			if (check_cq & BIT(IO_CHECK_CQ_OVERFLOW_BIT)) {
+				finish_wait(&ctx->cq_wait, &iowq.wq);
+				io_cqring_do_overflow_flush(ctx);
+			}
+			if (check_cq & BIT(IO_CHECK_CQ_DROPPED_BIT)) {
+				ret = -EBADR;
+				break;
+			}
+		}
+
 		if (__io_cqring_events_user(ctx) >= min_events)
 			break;
 		cond_resched();
-- 
2.38.1



* [RFC v2 06/13] io_uring: mimimise io_cqring_wait_schedule
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
                   ` (4 preceding siblings ...)
  2023-01-03  3:03 ` [RFC v2 05/13] io_uring: parse check_cq out of wq waiting Pavel Begunkov
@ 2023-01-03  3:03 ` Pavel Begunkov
  2023-01-03  3:03 ` [RFC v2 07/13] io_uring: simplify io_has_work Pavel Begunkov
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

io_cqring_wait_schedule() is called after we have started waiting on the
cq wq and set the state to TASK_INTERRUPTIBLE; for that reason we
constantly have to worry about whether the state has been returned to
running or not. Leave only quick checks in io_cqring_wait_schedule() and
move the rest, including running task work, to the callers. Note, we run
tw in the loop after the sched checks because of the fast path at the
beginning of the function.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 39 +++++++++++++++++++++++----------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e3c5de299baa..fc9604848bbb 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2467,24 +2467,19 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 					  struct io_wait_queue *iowq,
 					  ktime_t timeout)
 {
-	int ret;
-
 	if (unlikely(READ_ONCE(ctx->check_cq)))
 		return 1;
-	/* make sure we run task_work before checking for signals */
-	ret = io_run_task_work_sig(ctx);
-	if (ret || io_should_wake(iowq))
-		return ret;
+	if (unlikely(!llist_empty(&ctx->work_llist)))
+		return 1;
+	if (unlikely(test_thread_flag(TIF_NOTIFY_SIGNAL)))
+		return 1;
+	if (unlikely(task_sigpending(current)))
+		return -EINTR;
+	if (unlikely(io_should_wake(iowq)))
+		return 0;
 	if (!schedule_hrtimeout(&timeout, HRTIMER_MODE_ABS))
 		return -ETIME;
-
-	/*
-	 * Run task_work after scheduling. If we got woken because of
-	 * task_work being processed, run it now rather than let the caller
-	 * do another wait loop.
-	 */
-	ret = io_run_task_work_sig(ctx);
-	return ret < 0 ? ret : 1;
+	return 0;
 }
 
 /*
@@ -2545,6 +2540,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
 						TASK_INTERRUPTIBLE);
 		ret = io_cqring_wait_schedule(ctx, &iowq, timeout);
+		if (ret < 0)
+			break;
+		/*
+		 * Run task_work after scheduling and before io_should_wake().
+		 * If we got woken because of task_work being processed, run it
+		 * now rather than let the caller do another wait loop.
+		 */
+		io_run_task_work();
+		if (!llist_empty(&ctx->work_llist))
+			io_run_local_work(ctx);
 
 		check_cq = READ_ONCE(ctx->check_cq);
 		if (unlikely(check_cq)) {
@@ -2559,10 +2564,12 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 			}
 		}
 
-		if (__io_cqring_events_user(ctx) >= min_events)
+		if (io_should_wake(&iowq)) {
+			ret = 0;
 			break;
+		}
 		cond_resched();
-	} while (ret > 0);
+	} while (1);
 
 	finish_wait(&ctx->cq_wait, &iowq.wq);
 	restore_saved_sigmask_unless(ret == -EINTR);
-- 
2.38.1



* [RFC v2 07/13] io_uring: simplify io_has_work
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
                   ` (5 preceding siblings ...)
  2023-01-03  3:03 ` [RFC v2 06/13] io_uring: mimimise io_cqring_wait_schedule Pavel Begunkov
@ 2023-01-03  3:03 ` Pavel Begunkov
  2023-01-03  3:03 ` [RFC v2 08/13] io_uring: set TASK_RUNNING right after schedule Pavel Begunkov
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

->work_llist should never be non-empty for a non-DEFER_TASKRUN ring, so
we can safely skip checking the flag.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index fc9604848bbb..a8d3826f3d17 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2416,8 +2416,7 @@ struct io_wait_queue {
 static inline bool io_has_work(struct io_ring_ctx *ctx)
 {
 	return test_bit(IO_CHECK_CQ_OVERFLOW_BIT, &ctx->check_cq) ||
-	       ((ctx->flags & IORING_SETUP_DEFER_TASKRUN) &&
-		!llist_empty(&ctx->work_llist));
+	       !llist_empty(&ctx->work_llist);
 }
 
 static inline bool io_should_wake(struct io_wait_queue *iowq)
-- 
2.38.1



* [RFC v2 08/13] io_uring: set TASK_RUNNING right after schedule
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
                   ` (6 preceding siblings ...)
  2023-01-03  3:03 ` [RFC v2 07/13] io_uring: simplify io_has_work Pavel Begunkov
@ 2023-01-03  3:03 ` Pavel Begunkov
  2023-01-03  3:04 ` [RFC v2 09/13] io_uring: separate wq for ring polling Pavel Begunkov
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Instead of constantly checking that the task state is running before
executing tw or taking locks in io_cqring_wait(), switch it back to
TASK_RUNNING immediately after schedule.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a8d3826f3d17..682f4b086f09 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2541,6 +2541,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		ret = io_cqring_wait_schedule(ctx, &iowq, timeout);
 		if (ret < 0)
 			break;
+		__set_current_state(TASK_RUNNING);
 		/*
 		 * Run task_work after scheduling and before io_should_wake().
 		 * If we got woken because of task_work being processed, run it
@@ -2553,10 +2554,8 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		check_cq = READ_ONCE(ctx->check_cq);
 		if (unlikely(check_cq)) {
 			/* let the caller flush overflows, retry */
-			if (check_cq & BIT(IO_CHECK_CQ_OVERFLOW_BIT)) {
-				finish_wait(&ctx->cq_wait, &iowq.wq);
+			if (check_cq & BIT(IO_CHECK_CQ_OVERFLOW_BIT))
 				io_cqring_do_overflow_flush(ctx);
-			}
 			if (check_cq & BIT(IO_CHECK_CQ_DROPPED_BIT)) {
 				ret = -EBADR;
 				break;
-- 
2.38.1



* [RFC v2 09/13] io_uring: separate wq for ring polling
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
                   ` (7 preceding siblings ...)
  2023-01-03  3:03 ` [RFC v2 08/13] io_uring: set TASK_RUNNING right after schedule Pavel Begunkov
@ 2023-01-03  3:04 ` Pavel Begunkov
  2023-01-04 18:08   ` Jens Axboe
  2023-01-03  3:04 ` [RFC v2 10/13] io_uring: add lazy poll_wq activation Pavel Begunkov
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:04 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Don't use ->cq_wait for ring polling but add a separate wait queue for
it. We need it for following patches.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 1 +
 io_uring/io_uring.c            | 3 ++-
 io_uring/io_uring.h            | 9 +++++++++
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index dcd8a563ab52..cbcd3aaddd9d 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -286,6 +286,7 @@ struct io_ring_ctx {
 		unsigned		cq_entries;
 		struct io_ev_fd	__rcu	*io_ev_fd;
 		struct wait_queue_head	cq_wait;
+		struct wait_queue_head	poll_wq;
 		unsigned		cq_extra;
 	} ____cacheline_aligned_in_smp;
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 682f4b086f09..42f512c42099 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -316,6 +316,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	xa_init_flags(&ctx->personalities, XA_FLAGS_ALLOC1);
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->cq_wait);
+	init_waitqueue_head(&ctx->poll_wq);
 	spin_lock_init(&ctx->completion_lock);
 	spin_lock_init(&ctx->timeout_lock);
 	INIT_WQ_LIST(&ctx->iopoll_list);
@@ -2768,7 +2769,7 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
 	struct io_ring_ctx *ctx = file->private_data;
 	__poll_t mask = 0;
 
-	poll_wait(file, &ctx->cq_wait, wait);
+	poll_wait(file, &ctx->poll_wq, wait);
 	/*
 	 * synchronizes with barrier from wq_has_sleeper call in
 	 * io_commit_cqring
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 9b7baeff5a1c..645ace377d7e 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -207,9 +207,18 @@ static inline void io_commit_cqring(struct io_ring_ctx *ctx)
 	smp_store_release(&ctx->rings->cq.tail, ctx->cached_cq_tail);
 }
 
+static inline void io_poll_wq_wake(struct io_ring_ctx *ctx)
+{
+	if (waitqueue_active(&ctx->poll_wq))
+		__wake_up(&ctx->poll_wq, TASK_NORMAL, 0,
+				poll_to_key(EPOLL_URING_WAKE | EPOLLIN));
+}
+
 /* requires smb_mb() prior, see wq_has_sleeper() */
 static inline void __io_cqring_wake(struct io_ring_ctx *ctx)
 {
+	io_poll_wq_wake(ctx);
+
 	/*
 	 * Trigger waitqueue handler on all waiters on our waitqueue. This
 	 * won't necessarily wake up all the tasks, io_should_wake() will make
-- 
2.38.1



* [RFC v2 10/13] io_uring: add lazy poll_wq activation
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
                   ` (8 preceding siblings ...)
  2023-01-03  3:04 ` [RFC v2 09/13] io_uring: separate wq for ring polling Pavel Begunkov
@ 2023-01-03  3:04 ` Pavel Begunkov
  2023-01-03  3:04 ` [RFC v2 11/13] io_uring: wake up optimisations Pavel Begunkov
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:04 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Even though io_poll_wq_wake()'s waitqueue_active() reuses a barrier we do
for another waitqueue, that's not going to be the case in the future, so
we want to have a fast path for it when the ring has never been polled.

Move poll_wq wake-ups into __io_commit_cqring_flush() gated by a new flag
called ->poll_activated. The idea behind the flag is to set it when the
ring is polled for the first time. This requires additional
synchronisation so that we don't miss events; it is done here by using
task_work for ->task_complete rings, and by enabling the flag by default
for all other types of rings.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h |  2 ++
 io_uring/io_uring.c            | 40 ++++++++++++++++++++++++++++++++++
 io_uring/io_uring.h            |  7 +++---
 3 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index cbcd3aaddd9d..1452ff745e5c 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -210,6 +210,7 @@ struct io_ring_ctx {
 		unsigned int		syscall_iopoll: 1;
 		/* all CQEs should be posted only by the submitter task */
 		unsigned int		task_complete: 1;
+		unsigned int		poll_activated: 1;
 	} ____cacheline_aligned_in_smp;
 
 	/* submission data */
@@ -357,6 +358,7 @@ struct io_ring_ctx {
 	u32				iowq_limits[2];
 	bool				iowq_limits_set;
 
+	struct callback_head		poll_wq_task_work;
 	struct list_head		defer_list;
 	unsigned			sq_thread_idle;
 	/* protected by ->completion_lock */
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 42f512c42099..d2a3d9928ba3 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -573,6 +573,8 @@ static void io_eventfd_flush_signal(struct io_ring_ctx *ctx)
 
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx)
 {
+	if (ctx->poll_activated)
+		io_poll_wq_wake(ctx);
 	if (ctx->off_timeout_used)
 		io_flush_timeouts(ctx);
 	if (ctx->drain_active) {
@@ -2764,11 +2766,42 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	kfree(ctx);
 }
 
+static __cold void io_lazy_activate_poll(struct callback_head *cb)
+{
+	struct io_ring_ctx *ctx = container_of(cb, struct io_ring_ctx,
+					       poll_wq_task_work);
+
+	mutex_lock(&ctx->uring_lock);
+	ctx->poll_activated = true;
+	mutex_unlock(&ctx->uring_lock);
+
+	/*
+	 * Wake ups for some events between start of polling and activation
+	 * might've been lost due to loose synchronisation.
+	 */
+	io_poll_wq_wake(ctx);
+	percpu_ref_put(&ctx->refs);
+}
+
 static __poll_t io_uring_poll(struct file *file, poll_table *wait)
 {
 	struct io_ring_ctx *ctx = file->private_data;
 	__poll_t mask = 0;
 
+	if (unlikely(!ctx->poll_activated)) {
+		spin_lock(&ctx->completion_lock);
+		if (!ctx->poll_activated && !ctx->poll_wq_task_work.func &&
+		    ctx->submitter_task) {
+			init_task_work(&ctx->poll_wq_task_work, io_lazy_activate_poll);
+			percpu_ref_get(&ctx->refs);
+
+			if (task_work_add(ctx->submitter_task,
+					  &ctx->poll_wq_task_work, TWA_SIGNAL))
+				percpu_ref_put(&ctx->refs);
+		}
+		spin_unlock(&ctx->completion_lock);
+	}
+
 	poll_wait(file, &ctx->poll_wq, wait);
 	/*
 	 * synchronizes with barrier from wq_has_sleeper call in
@@ -3575,6 +3608,13 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	    !(ctx->flags & IORING_SETUP_SQPOLL))
 		ctx->task_complete = true;
 
+	/*
+	 * Lazy poll_wq activation requires sync with all potential completors,
+	 * ->task_complete guarantees a single completor
+	 */
+	if (!ctx->task_complete)
+		ctx->poll_activated = true;
+
 	/*
 	 * When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, user
 	 * space applications don't need to do io completion events
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 645ace377d7e..e9819872c186 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -209,7 +209,7 @@ static inline void io_commit_cqring(struct io_ring_ctx *ctx)
 
 static inline void io_poll_wq_wake(struct io_ring_ctx *ctx)
 {
-	if (waitqueue_active(&ctx->poll_wq))
+	if (wq_has_sleeper(&ctx->poll_wq))
 		__wake_up(&ctx->poll_wq, TASK_NORMAL, 0,
 				poll_to_key(EPOLL_URING_WAKE | EPOLLIN));
 }
@@ -217,8 +217,6 @@ static inline void io_poll_wq_wake(struct io_ring_ctx *ctx)
 /* requires smb_mb() prior, see wq_has_sleeper() */
 static inline void __io_cqring_wake(struct io_ring_ctx *ctx)
 {
-	io_poll_wq_wake(ctx);
-
 	/*
 	 * Trigger waitqueue handler on all waiters on our waitqueue. This
 	 * won't necessarily wake up all the tasks, io_should_wake() will make
@@ -319,7 +317,8 @@ static inline void io_req_complete_defer(struct io_kiocb *req)
 
 static inline void io_commit_cqring_flush(struct io_ring_ctx *ctx)
 {
-	if (unlikely(ctx->off_timeout_used || ctx->drain_active || ctx->has_evfd))
+	if (unlikely(ctx->off_timeout_used || ctx->drain_active ||
+		     ctx->has_evfd || ctx->poll_activated))
 		__io_commit_cqring_flush(ctx);
 }
 
-- 
2.38.1



* [RFC v2 11/13] io_uring: wake up optimisations
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
                   ` (9 preceding siblings ...)
  2023-01-03  3:04 ` [RFC v2 10/13] io_uring: add lazy poll_wq activation Pavel Begunkov
@ 2023-01-03  3:04 ` Pavel Begunkov
  2023-01-03  3:04 ` [RFC v2 12/13] io_uring: waitqueue-less cq waiting Pavel Begunkov
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:04 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Flushing completions is done either from the submit syscall or by
task_work, both in the context of the submitter task, and for single
threaded rings, as implied by ->task_complete, there can't be any waiters
on ->cq_wait other than the master task. That means no task can be
sleeping on cq_wait while we run __io_submit_flush_completions(), and so
waking it up can be skipped.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index d2a3d9928ba3..98d0d9e49be0 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -621,6 +621,25 @@ static inline void __io_cq_unlock_post(struct io_ring_ctx *ctx)
 	io_cqring_wake(ctx);
 }
 
+static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
+	__releases(ctx->completion_lock)
+{
+	io_commit_cqring(ctx);
+	__io_cq_unlock(ctx);
+	io_commit_cqring_flush(ctx);
+
+	/*
+	 * As ->task_complete implies that the ring is single tasked, cq_wait
+	 * may only be waited on by the current in io_cqring_wait(), but since
+	 * it will re-check the wakeup conditions once we return we can safely
+	 * skip waking it up.
+	 */
+	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
+		smp_mb();
+		__io_cqring_wake(ctx);
+	}
+}
+
 void io_cq_unlock_post(struct io_ring_ctx *ctx)
 	__releases(ctx->completion_lock)
 {
@@ -1461,7 +1480,7 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 			}
 		}
 	}
-	__io_cq_unlock_post(ctx);
+	__io_cq_unlock_post_flush(ctx);
 
 	if (!wq_list_empty(&ctx->submit_state.compl_reqs)) {
 		io_free_batch_list(ctx, state->compl_reqs.first);
-- 
2.38.1



* [RFC v2 12/13] io_uring: waitqueue-less cq waiting
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
                   ` (10 preceding siblings ...)
  2023-01-03  3:04 ` [RFC v2 11/13] io_uring: wake up optimisations Pavel Begunkov
@ 2023-01-03  3:04 ` Pavel Begunkov
  2023-01-03  3:04 ` [RFC v2 13/13] io_uring: add io_req_local_work_add wake fast path Pavel Begunkov
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:04 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

With DEFER_TASKRUN, only ctx->submitter_task might be waiting for CQEs;
we can use this to optimise io_cqring_wait(). Replace the ->cq_wait
waitqueue with waking the task directly.

It works, but misses an important optimisation covered by the following
patch, so this patch without the follow-ups might hurt performance.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 98d0d9e49be0..943032d2fd21 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1273,7 +1273,12 @@ static void io_req_local_work_add(struct io_kiocb *req)
 
 	if (ctx->has_evfd)
 		io_eventfd_signal(ctx);
-	__io_cqring_wake(ctx);
+
+	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+		wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
+	} else {
+		__io_cqring_wake(ctx);
+	}
 }
 
 void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
@@ -2558,12 +2563,17 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	do {
 		unsigned long check_cq;
 
-		prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
-						TASK_INTERRUPTIBLE);
+		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+			set_current_state(TASK_INTERRUPTIBLE);
+		} else {
+			prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
+							TASK_INTERRUPTIBLE);
+		}
+
 		ret = io_cqring_wait_schedule(ctx, &iowq, timeout);
+		__set_current_state(TASK_RUNNING);
 		if (ret < 0)
 			break;
-		__set_current_state(TASK_RUNNING);
 		/*
 		 * Run task_work after scheduling and before io_should_wake().
 		 * If we got woken because of task_work being processed, run it
@@ -2591,7 +2601,8 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		cond_resched();
 	} while (1);
 
-	finish_wait(&ctx->cq_wait, &iowq.wq);
+	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+		finish_wait(&ctx->cq_wait, &iowq.wq);
 	restore_saved_sigmask_unless(ret == -EINTR);
 
 	return READ_ONCE(rings->cq.head) == READ_ONCE(rings->cq.tail) ? ret : 0;
-- 
2.38.1



* [RFC v2 13/13] io_uring: add io_req_local_work_add wake fast path
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
                   ` (11 preceding siblings ...)
  2023-01-03  3:04 ` [RFC v2 12/13] io_uring: waitqueue-less cq waiting Pavel Begunkov
@ 2023-01-03  3:04 ` Pavel Begunkov
  2023-01-04 18:05 ` (subset) [RFC v2 00/13] CQ waiting and wake up optimisations Jens Axboe
  2023-01-04 20:25 ` Pavel Begunkov
  14 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-03  3:04 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence

Don't wake the master task after queueing a deferred tw unless it's
currently waiting in io_cqring_wait.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h | 1 +
 io_uring/io_uring.c            | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 1452ff745e5c..332a29cfe076 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -285,6 +285,7 @@ struct io_ring_ctx {
 
 		unsigned		cached_cq_tail;
 		unsigned		cq_entries;
+		bool			cq_waiting;
 		struct io_ev_fd	__rcu	*io_ev_fd;
 		struct wait_queue_head	cq_wait;
 		struct wait_queue_head	poll_wq;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 943032d2fd21..e436fe73becf 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1275,7 +1275,8 @@ static void io_req_local_work_add(struct io_kiocb *req)
 		io_eventfd_signal(ctx);
 
 	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
-		wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
+		if (READ_ONCE(ctx->cq_waiting))
+			wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
 	} else {
 		__io_cqring_wake(ctx);
 	}
@@ -2565,6 +2566,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 
 		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
 			set_current_state(TASK_INTERRUPTIBLE);
+			smp_store_mb(ctx->cq_waiting, 1);
 		} else {
 			prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
 							TASK_INTERRUPTIBLE);
@@ -2572,6 +2574,8 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 
 		ret = io_cqring_wait_schedule(ctx, &iowq, timeout);
 		__set_current_state(TASK_RUNNING);
+		WRITE_ONCE(ctx->cq_waiting, 0);
+
 		if (ret < 0)
 			break;
 		/*
-- 
2.38.1



* Re: (subset) [RFC v2 00/13] CQ waiting and wake up optimisations
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
                   ` (12 preceding siblings ...)
  2023-01-03  3:04 ` [RFC v2 13/13] io_uring: add io_req_local_work_add wake fast path Pavel Begunkov
@ 2023-01-04 18:05 ` Jens Axboe
  2023-01-04 20:25 ` Pavel Begunkov
  14 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2023-01-04 18:05 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov


On Tue, 03 Jan 2023 03:03:51 +0000, Pavel Begunkov wrote:
> The series replaces waitqueues for CQ waiting with a custom waiting
> loop and adds a couple more perf tweaks around it. Benchmarking is done
> for QD1 with simulated tw arrival right after we start waiting; it
> takes us from 7.5 MIOPS to 9.2, which is +22%, or double that number for
> the in-kernel io_uring overhead alone (i.e. without syscall and userspace).
> That matches profiles: wake_up() _without_ wake_up_state() was taking
> 12-14% and prepare_to_wait_exclusive() was around 4-6%.
> 
> [...]

Applied, thanks!

[01/13] io_uring: rearrange defer list checks
        commit: 9617404e5d86e9cfb2da4ac2b17e99a72836bf69
[02/13] io_uring: don't iterate cq wait fast path
        commit: 1329dc7e79da3570f6591d9997bd2fe3a7d17ca6
[03/13] io_uring: kill io_run_task_work_ctx
        commit: 90b8457304e25a137c1b8c89f7cae276b79d3273
[04/13] io_uring: move defer tw task checks
        commit: 1345a6b381b4d39b15a1e34c0a78be2ee2e452c6
[05/13] io_uring: parse check_cq out of wq waiting
        commit: b5be9ebe91246b67d4b0dee37e3071d73ba69119
[06/13] io_uring: mimimise io_cqring_wait_schedule
        commit: de254b5029fa37c4e0a6a16743fa2271fa524fc7
[07/13] io_uring: simplify io_has_work
        commit: 26736d171ec54487de677f09d682d144489957fa
[08/13] io_uring: set TASK_RUNNING right after schedule
        commit: 8214ccccf64f1335b34b98ed7deb2c6c29969c49

Best regards,
-- 
Jens Axboe




* Re: [RFC v2 09/13] io_uring: separate wq for ring polling
  2023-01-03  3:04 ` [RFC v2 09/13] io_uring: separate wq for ring polling Pavel Begunkov
@ 2023-01-04 18:08   ` Jens Axboe
  2023-01-04 20:28     ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2023-01-04 18:08 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 1/2/23 8:04 PM, Pavel Begunkov wrote:
> Don't use ->cq_wait for ring polling but add a separate wait queue for
> it. We need it for following patches.
> 
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
>  include/linux/io_uring_types.h | 1 +
>  io_uring/io_uring.c            | 3 ++-
>  io_uring/io_uring.h            | 9 +++++++++
>  3 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index dcd8a563ab52..cbcd3aaddd9d 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -286,6 +286,7 @@ struct io_ring_ctx {
>  		unsigned		cq_entries;
>  		struct io_ev_fd	__rcu	*io_ev_fd;
>  		struct wait_queue_head	cq_wait;
> +		struct wait_queue_head	poll_wq;
>  		unsigned		cq_extra;
>  	} ____cacheline_aligned_in_smp;
>  

Should we move poll_wq somewhere else, more out of the way? Would need to
gate the check behind a flag or something.

-- 
Jens Axboe




* Re: [RFC v2 00/13] CQ waiting and wake up optimisations
  2023-01-03  3:03 [RFC v2 00/13] CQ waiting and wake up optimisations Pavel Begunkov
                   ` (13 preceding siblings ...)
  2023-01-04 18:05 ` (subset) [RFC v2 00/13] CQ waiting and wake up optimisations Jens Axboe
@ 2023-01-04 20:25 ` Pavel Begunkov
  14 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-04 20:25 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

On 1/3/23 03:03, Pavel Begunkov wrote:
> The series replaces waitqueues for CQ waiting with a custom waiting
> loop and adds a couple more perf tweaks around it. Benchmarking is done
> for QD1 with simulated tw arrival right after we start waiting; it
> takes us from 7.5 MIOPS to 9.2, which is +22%, or double that number for
> the in-kernel io_uring overhead alone (i.e. without syscall and userspace).
> That matches profiles: wake_up() _without_ wake_up_state() was taking
> 12-14% and prepare_to_wait_exclusive() was around 4-6%.

The numbers are gathered with an in-kernel trick. Tried to quickly
measure without it:

modprobe null_blk no_sched=1 irqmode=2 completion_nsec=0
taskset -c 0 fio/t/io_uring -d1 -s1 -c1 -p0 -B1 -F1 -X -b512 -n4 /dev/nullb0

The important part here is using timer-backed null_blk and pinning
multiple workers to a single CPU. -n4 was enough for me to keep
the CPU 100% busy.

old:
IOPS=539.51K, BW=2.11GiB/s, IOS/call=1/1
IOPS=542.26K, BW=2.12GiB/s, IOS/call=1/1
IOPS=540.73K, BW=2.11GiB/s, IOS/call=1/1
IOPS=541.28K, BW=2.11GiB/s, IOS/call=0/0

new:
IOPS=561.85K, BW=2.19GiB/s, IOS/call=1/1
IOPS=561.58K, BW=2.19GiB/s, IOS/call=1/1
IOPS=561.56K, BW=2.19GiB/s, IOS/call=1/1
IOPS=559.94K, BW=2.19GiB/s, IOS/call=1/1

The difference is only ~3.5% because of the huge additional overhead
from nullb timers, block qos and other unnecessary bits.

P.S. Tested with an out-of-tree patch adding a flag to enable/disable
the feature, to remove variance between reboots.

-- 
Pavel Begunkov


* Re: [RFC v2 09/13] io_uring: separate wq for ring polling
  2023-01-04 18:08   ` Jens Axboe
@ 2023-01-04 20:28     ` Pavel Begunkov
  2023-01-04 20:34       ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-04 20:28 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 1/4/23 18:08, Jens Axboe wrote:
> On 1/2/23 8:04 PM, Pavel Begunkov wrote:
>> Don't use ->cq_wait for ring polling but add a separate wait queue for
>> it. We need it for following patches.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> ---
>>   include/linux/io_uring_types.h | 1 +
>>   io_uring/io_uring.c            | 3 ++-
>>   io_uring/io_uring.h            | 9 +++++++++
>>   3 files changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index dcd8a563ab52..cbcd3aaddd9d 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -286,6 +286,7 @@ struct io_ring_ctx {
>>   		unsigned		cq_entries;
>>   		struct io_ev_fd	__rcu	*io_ev_fd;
>>   		struct wait_queue_head	cq_wait;
>> +		struct wait_queue_head	poll_wq;
>>   		unsigned		cq_extra;
>>   	} ____cacheline_aligned_in_smp;
>>   
> 
> Should we move poll_wq somewhere else, more out of the way?

If we care about polling perf and cache collisions with
cq_wait, yeah we can. In any case it's a good idea to at
least move it after cq_extra.

> Would need to gate the check behind a flag or something.

Not sure I follow

-- 
Pavel Begunkov


* Re: [RFC v2 09/13] io_uring: separate wq for ring polling
  2023-01-04 20:28     ` Pavel Begunkov
@ 2023-01-04 20:34       ` Jens Axboe
  2023-01-04 20:45         ` Pavel Begunkov
  2023-01-04 20:52         ` Jens Axboe
  0 siblings, 2 replies; 22+ messages in thread
From: Jens Axboe @ 2023-01-04 20:34 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 1/4/23 1:28 PM, Pavel Begunkov wrote:
> On 1/4/23 18:08, Jens Axboe wrote:
>> On 1/2/23 8:04 PM, Pavel Begunkov wrote:
>>> Don't use ->cq_wait for ring polling but add a separate wait queue for
>>> it. We need it for following patches.
>>>
>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>> ---
>>>   include/linux/io_uring_types.h | 1 +
>>>   io_uring/io_uring.c            | 3 ++-
>>>   io_uring/io_uring.h            | 9 +++++++++
>>>   3 files changed, 12 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>> index dcd8a563ab52..cbcd3aaddd9d 100644
>>> --- a/include/linux/io_uring_types.h
>>> +++ b/include/linux/io_uring_types.h
>>> @@ -286,6 +286,7 @@ struct io_ring_ctx {
>>>           unsigned        cq_entries;
>>>           struct io_ev_fd    __rcu    *io_ev_fd;
>>>           struct wait_queue_head    cq_wait;
>>> +        struct wait_queue_head    poll_wq;
>>>           unsigned        cq_extra;
>>>       } ____cacheline_aligned_in_smp;
>>>   
>>
>> Should we move poll_wq somewhere else, more out of the way?
> 
> If we care about polling perf and cache collisions with
> cq_wait, yeah we can. In any case it's a good idea to at
> least move it after cq_extra.
> 
>> Would need to gate the check behind a flag or something.
> 
> Not sure I follow

I guess I could've been a bit more verbose... If we consider poll on the
io_uring rather uncommon, then moving the poll_wq outside of the hotter
cq_wait cacheline(s) would make sense. Each wait_queue_head is more than
a cacheline. Then we could have a flag in a spot that's hot anyway
whether to check it or not, eg in that same section as cq_wait.

Looking at the layout right now, we're at 116 bytes for that section, or
two cachelines with 12 bytes to spare. If we add poll_wq, then we'll be
at 196 bytes, which is 4 bytes over the next cacheline. So it'd
essentially double the size of that section. If we moved it outside of
the aligned sections, then it'd pack better.

-- 
Jens Axboe



* Re: [RFC v2 09/13] io_uring: separate wq for ring polling
  2023-01-04 20:34       ` Jens Axboe
@ 2023-01-04 20:45         ` Pavel Begunkov
  2023-01-04 20:53           ` Jens Axboe
  2023-01-04 20:52         ` Jens Axboe
  1 sibling, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2023-01-04 20:45 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 1/4/23 20:34, Jens Axboe wrote:
> On 1/4/23 1:28 PM, Pavel Begunkov wrote:
>> On 1/4/23 18:08, Jens Axboe wrote:
>>> On 1/2/23 8:04 PM, Pavel Begunkov wrote:
>>>> Don't use ->cq_wait for ring polling but add a separate wait queue for
>>>> it. We need it for following patches.
>>>>
>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>> ---
>>>>    include/linux/io_uring_types.h | 1 +
>>>>    io_uring/io_uring.c            | 3 ++-
>>>>    io_uring/io_uring.h            | 9 +++++++++
>>>>    3 files changed, 12 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>>> index dcd8a563ab52..cbcd3aaddd9d 100644
>>>> --- a/include/linux/io_uring_types.h
>>>> +++ b/include/linux/io_uring_types.h
>>>> @@ -286,6 +286,7 @@ struct io_ring_ctx {
>>>>            unsigned        cq_entries;
>>>>            struct io_ev_fd    __rcu    *io_ev_fd;
>>>>            struct wait_queue_head    cq_wait;
>>>> +        struct wait_queue_head    poll_wq;
>>>>            unsigned        cq_extra;
>>>>        } ____cacheline_aligned_in_smp;
>>>>    
>>>
>>> Should we move poll_wq somewhere else, more out of the way?
>>
>> If we care about polling perf and cache collisions with
>> cq_wait, yeah we can. In any case it's a good idea to at
>> least move it after cq_extra.
>>
>>> Would need to gate the check behind a flag or something.
>>
>> Not sure I follow
> 
> I guess I could've been a bit more verbose... If we consider poll on the
> io_uring rather uncommon, then moving the poll_wq outside of the hotter
> cq_wait cacheline(s) would make sense. Each wait_queue_head is more than
> a cacheline.

Looks like it's 24B, and wait_queue_entry is an uncomfortable 40B.

> Then we could have a flag in a spot that's hot anyway
> whether to check it or not, eg in that same section as cq_wait.
> Looking at the layout right now, we're at 116 bytes for that section, or
> two cachelines with 12 bytes to spare. If we add poll_wq, then we'll be
> at 196 bytes, which is 4 bytes over the next cacheline. So it'd
> essentially double the size of that section. If we moved it outside of
> the aligned sections, then it'd pack better.

Then it's not about hotness and caches but rather memory
consumption due to padding, which is still a good argument.

-- 
Pavel Begunkov


* Re: [RFC v2 09/13] io_uring: separate wq for ring polling
  2023-01-04 20:34       ` Jens Axboe
  2023-01-04 20:45         ` Pavel Begunkov
@ 2023-01-04 20:52         ` Jens Axboe
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2023-01-04 20:52 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 1/4/23 1:34 PM, Jens Axboe wrote:
> On 1/4/23 1:28 PM, Pavel Begunkov wrote:
>> On 1/4/23 18:08, Jens Axboe wrote:
>>> On 1/2/23 8:04 PM, Pavel Begunkov wrote:
>>>> Don't use ->cq_wait for ring polling but add a separate wait queue for
>>>> it. We need it for following patches.
>>>>
>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>> ---
>>>>   include/linux/io_uring_types.h | 1 +
>>>>   io_uring/io_uring.c            | 3 ++-
>>>>   io_uring/io_uring.h            | 9 +++++++++
>>>>   3 files changed, 12 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>>> index dcd8a563ab52..cbcd3aaddd9d 100644
>>>> --- a/include/linux/io_uring_types.h
>>>> +++ b/include/linux/io_uring_types.h
>>>> @@ -286,6 +286,7 @@ struct io_ring_ctx {
>>>>           unsigned        cq_entries;
>>>>           struct io_ev_fd    __rcu    *io_ev_fd;
>>>>           struct wait_queue_head    cq_wait;
>>>> +        struct wait_queue_head    poll_wq;
>>>>           unsigned        cq_extra;
>>>>       } ____cacheline_aligned_in_smp;
>>>>   
>>>
>>> Should we move poll_wq somewhere else, more out of the way?
>>
>> If we care about polling perf and cache collisions with
>> cq_wait, yeah we can. In any case it's a good idea to at
>> least move it after cq_extra.
>>
>>>> Would need to gate the check behind a flag or something.
>>
>> Not sure I follow
> 
> I guess I could've been a bit more verbose... If we consider poll on the
> io_uring rather uncommon, then moving the poll_wq outside of the hotter
> cq_wait cacheline(s) would make sense. Each wait_queue_head is more than
> a cacheline. Then we could have a flag in a spot that's hot anyway
> whether to check it or not, eg in that same section as cq_wait.
> 
> Looking at the layout right now, we're at 116 bytes for that section, or
> two cachelines with 12 bytes to spare. If we add poll_wq, then we'll be
> at 196 bytes, which is 4 bytes over the next cacheline. So it'd
> essentially double the size of that section. If we moved it outside of
> the aligned sections, then it'd pack better.

Just after writing this, I noticed that a spinlock took 64 bytes... In
other words, I have LOCKDEP enabled. The correct number is 24 bytes for
wait_queue_head which is obviously a lot more reasonable. It'd still
make that section one more cacheline since it's now at 60 bytes and
would grow to 84 bytes. But it's obviously not as big of a problem as I
had originally assumed.

-- 
Jens Axboe



* Re: [RFC v2 09/13] io_uring: separate wq for ring polling
  2023-01-04 20:45         ` Pavel Begunkov
@ 2023-01-04 20:53           ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2023-01-04 20:53 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 1/4/23 1:45 PM, Pavel Begunkov wrote:
> On 1/4/23 20:34, Jens Axboe wrote:
>> On 1/4/23 1:28 PM, Pavel Begunkov wrote:
>>> On 1/4/23 18:08, Jens Axboe wrote:
>>>> On 1/2/23 8:04 PM, Pavel Begunkov wrote:
>>>>> Don't use ->cq_wait for ring polling but add a separate wait queue for
>>>>> it. We need it for following patches.
>>>>>
>>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>>> ---
>>>>>    include/linux/io_uring_types.h | 1 +
>>>>>    io_uring/io_uring.c            | 3 ++-
>>>>>    io_uring/io_uring.h            | 9 +++++++++
>>>>>    3 files changed, 12 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>>>>> index dcd8a563ab52..cbcd3aaddd9d 100644
>>>>> --- a/include/linux/io_uring_types.h
>>>>> +++ b/include/linux/io_uring_types.h
>>>>> @@ -286,6 +286,7 @@ struct io_ring_ctx {
>>>>>            unsigned        cq_entries;
>>>>>            struct io_ev_fd    __rcu    *io_ev_fd;
>>>>>            struct wait_queue_head    cq_wait;
>>>>> +        struct wait_queue_head    poll_wq;
>>>>>            unsigned        cq_extra;
>>>>>        } ____cacheline_aligned_in_smp;
>>>>>    
>>>>
>>>> Should we move poll_wq somewhere else, more out of the way?
>>>
>>> If we care about polling perf and cache collisions with
>>> cq_wait, yeah we can. In any case it's a good idea to at
>>> least move it after cq_extra.
>>>
>>>> Would need to gate the check behind a flag or something.
>>>
>>> Not sure I follow
>>
>> I guess I could've been a bit more verbose... If we consider poll on the
>> io_uring rather uncommon, then moving the poll_wq outside of the hotter
>> cq_wait cacheline(s) would make sense. Each wait_queue_head is more than
>> a cacheline.
> 
> Looks like it's 24B, and wait_queue_entry is an uncomfortable 40B.

(also see followup email). Yes, it's only 24 bytes indeed.

>> Then we could have a flag in a spot that's hot anyway
>> whether to check it or not, eg in that same section as cq_wait.
>> Looking at the layout right now, we're at 116 bytes for that section, or
>> two cachelines with 12 bytes to spare. If we add poll_wq, then we'll be
>> at 196 bytes, which is 4 bytes over the next cacheline. So it'd
>> essentially double the size of that section. If we moved it outside of
>> the aligned sections, then it'd pack better.
> 
> Then it's not about hotness and caches but rather memory
> consumption due to padding, which is still a good argument.

Right, it's nice to not keep io_ring_ctx bigger than it needs to be. And
if moved out-of-line, then it'd pack better and we would not "waste"
another cacheline on adding this wait_queue_head for polling.
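
Just to sketch the shape of that split (purely illustrative, not a real
io_ring_ctx layout): keep the cheap ->poll_activated flag in the hot
aligned section and let the rarely used waitqueue pack into the cold
tail, e.g.

  #include <linux/cache.h>
  #include <linux/wait.h>

  /* hypothetical example struct, field placement only */
  struct example_ctx {
          struct {
                  struct wait_queue_head  cq_wait;   /* hot: CQ waiters */
                  unsigned                cq_extra;
                  unsigned int            poll_activated: 1; /* cheap check */
          } ____cacheline_aligned_in_smp;

          /* cold: only touched when the ring fd itself is polled */
          struct wait_queue_head          poll_wq;
  };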

-- 
Jens Axboe


