[RFC 0/2] optimise local-tw task resheduling

* [RFC 0/2] optimise local-tw task resheduling
@ 2023-03-10 19:04 Pavel Begunkov
  2023-03-10 19:04 ` [RFC 1/2] io_uring: add tw add flags Pavel Begunkov
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-03-10 19:04 UTC
  To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel

io_uring extensively uses task_work, but when a task is waiting
for multiple CQEs it causes lots of rescheduling. This series
is an attempt to optimise it and be a base for future improvements.

For some zc network tests that eventually wait for a portion of
buffers, I got a 10x decrease in the number of context switches,
which cut CPU consumption by more than half (17% -> 8%).
It also helps storage cases: running fio/t/io_uring against
a low-performance drive gave a 2x decrease in the number of context
switches for QD8 and ~4x for QD32.

Not for inclusion yet, I want to add an optimisation for when
waiting for 1 CQE.

Pavel Begunkov (2):
  io_uring: add tw add flags
  io_uring: reduce sheduling due to tw

 include/linux/io_uring_types.h |  2 +-
 io_uring/io_uring.c            | 48 ++++++++++++++++++++--------------
 io_uring/io_uring.h            | 10 +++++--
 io_uring/notif.h               |  2 +-
 io_uring/rw.c                  |  2 +-
 5 files changed, 40 insertions(+), 24 deletions(-)

-- 
2.39.1

* [RFC 1/2] io_uring: add tw add flags
  2023-03-10 19:04 [RFC 0/2] optimise local-tw task resheduling Pavel Begunkov
@ 2023-03-10 19:04 ` Pavel Begunkov
  2023-03-10 19:04 ` [RFC 2/2] io_uring: reduce sheduling due to tw Pavel Begunkov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-03-10 19:04 UTC
  To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel

We pass 'allow_local' into __io_req_task_work_add() but will need more
flags. Replace it with a flags bit field; the old allow_local == false
case is now expressed by the IOU_F_TWQ_FORCE_NORMAL flag.
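
A minimal userspace sketch of the resulting dispatch check, not the
kernel code itself. The SKETCH_* names and sketch_use_local_queue() are
hypothetical stand-ins for IOU_F_TWQ_FORCE_NORMAL,
IORING_SETUP_DEFER_TASKRUN and the condition in
__io_req_task_work_add(); the DEFER_TASKRUN bit value is a placeholder.

```c
#include <stdbool.h>

/* Hypothetical mirror of the new tw-add flag; value taken from the patch. */
enum {
	SKETCH_TWQ_FORCE_NORMAL = 1,
};

/* Placeholder bit standing in for IORING_SETUP_DEFER_TASKRUN. */
#define SKETCH_SETUP_DEFER_TASKRUN (1U << 0)

/*
 * Returns true when the request would go to the ring-local (deferred)
 * work list, false when it would fall through to normal task_work.
 */
static bool sketch_use_local_queue(unsigned ctx_flags, unsigned tw_flags)
{
	return !(tw_flags & SKETCH_TWQ_FORCE_NORMAL) &&
	       (ctx_flags & SKETCH_SETUP_DEFER_TASKRUN);
}
```

Passing 0 corresponds to the old allow_local == true, so the default
caller keeps its behaviour; only the caller that moves work off the
local list passes the force-normal bit.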

Signed-off-by: Pavel Begunkov <[email protected]>
---
 io_uring/io_uring.c | 7 ++++---
 io_uring/io_uring.h | 9 +++++++--
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 7625597b5227..42ada470845f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1308,12 +1308,13 @@ static void io_req_local_work_add(struct io_kiocb *req)
 	percpu_ref_put(&ctx->refs);
 }
 
-void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
+void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
 {
 	struct io_uring_task *tctx = req->task->io_uring;
 	struct io_ring_ctx *ctx = req->ctx;
 
-	if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+	if (!(flags & IOU_F_TWQ_FORCE_NORMAL) &&
+	    (ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
 		io_req_local_work_add(req);
 		return;
 	}
@@ -1341,7 +1342,7 @@ static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 						    io_task_work.node);
 
 		node = node->next;
-		__io_req_task_work_add(req, false);
+		__io_req_task_work_add(req, IOU_F_TWQ_FORCE_NORMAL);
 	}
 }
 
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 2711865f1e19..cd2e702f206c 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -15,6 +15,11 @@
 #include <trace/events/io_uring.h>
 #endif
 
+enum {
+	/* don't use deferred task_work */
+	IOU_F_TWQ_FORCE_NORMAL			= 1,
+};
+
 enum {
 	IOU_OK			= 0,
 	IOU_ISSUE_SKIP_COMPLETE	= -EIOCBQUEUED,
@@ -48,7 +53,7 @@ static inline bool io_req_ffs_set(struct io_kiocb *req)
 	return req->flags & REQ_F_FIXED_FILE;
 }
 
-void __io_req_task_work_add(struct io_kiocb *req, bool allow_local);
+void __io_req_task_work_add(struct io_kiocb *req, unsigned flags);
 bool io_is_uring_fops(struct file *file);
 bool io_alloc_async_data(struct io_kiocb *req);
 void io_req_task_queue(struct io_kiocb *req);
@@ -93,7 +98,7 @@ bool io_match_task_safe(struct io_kiocb *head, struct task_struct *task,
 
 static inline void io_req_task_work_add(struct io_kiocb *req)
 {
-	__io_req_task_work_add(req, true);
+	__io_req_task_work_add(req, 0);
 }
 
 #define io_for_each_link(pos, head) \
-- 
2.39.1


* [RFC 2/2] io_uring: reduce sheduling due to tw
  2023-03-10 19:04 [RFC 0/2] optimise local-tw task resheduling Pavel Begunkov
  2023-03-10 19:04 ` [RFC 1/2] io_uring: add tw add flags Pavel Begunkov
@ 2023-03-10 19:04 ` Pavel Begunkov
  2023-03-11 17:24 ` [RFC 0/2] optimise local-tw task resheduling Jens Axboe
  2023-03-15  2:35 ` Ming Lei
  3 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-03-10 19:04 UTC
  To: io-uring; +Cc: Jens Axboe, asml.silence, linux-kernel

Every task_work will try to wake the task to be executed, which causes
excessive scheduling with corresponding overhead. For some tw it's
justified, but others won't do much but post a single CQE.

When a task waits for multiple CQEs, every such task_work will wake it
up. Instead, the task may give a hint about how many CQEs it waits for;
io_req_local_work_add() will compare against it and skip wake-ups
if #cqes + #tw items is not enough to satisfy the task. The optimisation
is used only for simple enough tws; more complex and/or urgent items
will force a wake-up. It's also limited to DEFER_TASKRUN.

The trade-off is having extra atomics in io_req_local_work_add() but
saving more on rescheduling the task.
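
The wake-up decision described above can be sketched in userspace C11
roughly as follows. This is an illustration under assumptions, not the
kernel implementation: sketch_cq_wait_nr stands in for ctx->cq_wait_nr,
SKETCH_TWQ_FACILE for IOU_F_TWQ_FACILE, and sketch_should_wake() is a
hypothetical helper that only models the counter logic (the llist
handling, eventfd signalling and the actual wake_up_state() call are
left out).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical mirror of IOU_F_TWQ_FACILE; value taken from the patch. */
enum {
	SKETCH_TWQ_FACILE = 2,
};

/* Stand-in for ctx->cq_wait_nr: how many more CQEs the waiter needs. */
static atomic_int sketch_cq_wait_nr;

/*
 * Models the tail of io_req_local_work_add(): returns true if the
 * waiting task would be woken for this task_work item.
 */
static bool sketch_should_wake(unsigned flags)
{
	/* Nobody is waiting, or a wake-up was already issued. */
	if (atomic_load(&sketch_cq_wait_nr) <= 0)
		return false;

	/* Complex or urgent item: reset the hint and wake immediately. */
	if (!(flags & SKETCH_TWQ_FACILE)) {
		atomic_store(&sketch_cq_wait_nr, 0);
		return true;
	}

	/*
	 * Simple item: count it against the hint. atomic_fetch_sub()
	 * returns the old value, so subtract 1 to get the new one, which
	 * is what the kernel's atomic_dec_return() would return.
	 */
	return atomic_fetch_sub(&sketch_cq_wait_nr, 1) - 1 <= 0;
}
```

With the hint set to 3, the first two simple items return false and the
third returns true, so the waiter is woken once instead of three times.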

Signed-off-by: Pavel Begunkov <[email protected]>
---
 include/linux/io_uring_types.h |  2 +-
 io_uring/io_uring.c            | 41 +++++++++++++++++++++-------------
 io_uring/io_uring.h            |  1 +
 io_uring/notif.h               |  2 +-
 io_uring/rw.c                  |  2 +-
 5 files changed, 29 insertions(+), 19 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 00689c12f6ab..fdf0ae28023d 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -295,7 +295,7 @@ struct io_ring_ctx {
 		spinlock_t		completion_lock;
 
 		bool			poll_multi_queue;
-		bool			cq_waiting;
+		atomic_t		cq_wait_nr;
 
 		/*
 		 * ->iopoll_list is protected by the ctx->uring_lock for
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 42ada470845f..0fa4dee8dcf4 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1279,31 +1279,38 @@ static __cold void io_fallback_tw(struct io_uring_task *tctx)
 	}
 }
 
-static void io_req_local_work_add(struct io_kiocb *req)
+static void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
+	bool first;
 
 	percpu_ref_get(&ctx->refs);
 
-	if (!llist_add(&req->io_task_work.node, &ctx->work_llist))
-		goto put_ref;
-
+	first = llist_add(&req->io_task_work.node, &ctx->work_llist);
 	/* needed for the following wake up */
 	smp_mb__after_atomic();
 
-	if (unlikely(atomic_read(&req->task->io_uring->in_cancel))) {
-		io_move_task_work_from_local(ctx);
-		goto put_ref;
+	if (first) {
+		if (unlikely(atomic_read(&req->task->io_uring->in_cancel))) {
+			io_move_task_work_from_local(ctx);
+			goto put_ref;
+		}
+
+		if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
+			atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
+		if (ctx->has_evfd)
+			io_eventfd_signal(ctx);
 	}
 
-	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
-		atomic_or(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
-	if (ctx->has_evfd)
-		io_eventfd_signal(ctx);
+	if (atomic_read(&ctx->cq_wait_nr) <= 0)
+		goto put_ref;
 
-	if (READ_ONCE(ctx->cq_waiting))
-		wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
+	if (!(flags & IOU_F_TWQ_FACILE))
+		atomic_set(&ctx->cq_wait_nr, 0);
+	else if (atomic_dec_return(&ctx->cq_wait_nr) > 0)
+		goto put_ref;
 
+	wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
 put_ref:
 	percpu_ref_put(&ctx->refs);
 }
@@ -1315,7 +1322,7 @@ void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
 
 	if (!(flags & IOU_F_TWQ_FORCE_NORMAL) &&
 	    (ctx->flags & IORING_SETUP_DEFER_TASKRUN)) {
-		io_req_local_work_add(req);
+		io_req_local_work_add(req, flags);
 		return;
 	}
 
@@ -2601,7 +2608,9 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 		unsigned long check_cq;
 
 		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
-			WRITE_ONCE(ctx->cq_waiting, 1);
+			int to_wait = (int) iowq.cq_tail - READ_ONCE(ctx->rings->cq.tail);
+
+			atomic_set(&ctx->cq_wait_nr, to_wait);
 			set_current_state(TASK_INTERRUPTIBLE);
 		} else {
 			prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
@@ -2610,7 +2619,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 
 		ret = io_cqring_wait_schedule(ctx, &iowq);
 		__set_current_state(TASK_RUNNING);
-		WRITE_ONCE(ctx->cq_waiting, 0);
+		atomic_set(&ctx->cq_wait_nr, 0);
 
 		if (ret < 0)
 			break;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index cd2e702f206c..98ff9b71d498 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -18,6 +18,7 @@
 enum {
 	/* don't use deferred task_work */
 	IOU_F_TWQ_FORCE_NORMAL			= 1,
+	IOU_F_TWQ_FACILE			= 2,
 };
 
 enum {
diff --git a/io_uring/notif.h b/io_uring/notif.h
index c88c800cd89d..ec9998fb0be6 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -33,7 +33,7 @@ static inline void io_notif_flush(struct io_kiocb *notif)
 
 	/* drop slot's master ref */
 	if (refcount_dec_and_test(&nd->uarg.refcnt))
-		io_req_task_work_add(notif);
+		__io_req_task_work_add(notif, IOU_F_TWQ_FACILE);
 }
 
 static inline int io_notif_account_mem(struct io_kiocb *notif, unsigned len)
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 4c233910e200..a4578c120973 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -304,7 +304,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
 		return;
 	io_req_set_res(req, io_fixup_rw_res(req, res), 0);
 	req->io_task_work.func = io_req_rw_complete;
-	io_req_task_work_add(req);
+	__io_req_task_work_add(req, IOU_F_TWQ_FACILE);
 }
 
 static void io_complete_rw_iopoll(struct kiocb *kiocb, long res)
-- 
2.39.1


* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-10 19:04 [RFC 0/2] optimise local-tw task resheduling Pavel Begunkov
  2023-03-10 19:04 ` [RFC 1/2] io_uring: add tw add flags Pavel Begunkov
  2023-03-10 19:04 ` [RFC 2/2] io_uring: reduce sheduling due to tw Pavel Begunkov
@ 2023-03-11 17:24 ` Jens Axboe
  2023-03-11 20:45   ` Pavel Begunkov
  2023-03-16 12:25   ` Pavel Begunkov
  2023-03-15  2:35 ` Ming Lei
  3 siblings, 2 replies; 17+ messages in thread
From: Jens Axboe @ 2023-03-11 17:24 UTC
  To: Pavel Begunkov, io-uring; +Cc: linux-kernel

On 3/10/23 12:04 PM, Pavel Begunkov wrote:
> io_uring extensively uses task_work, but when a task is waiting
> for multiple CQEs it causes lots of rescheduling. This series
> is an attempt to optimise it and be a base for future improvements.
> 
> For some zc network tests that eventually wait for a portion of
> buffers, I got a 10x decrease in the number of context switches,
> which cut CPU consumption by more than half (17% -> 8%).
> It also helps storage cases: running fio/t/io_uring against
> a low-performance drive gave a 2x decrease in the number of context
> switches for QD8 and ~4x for QD32.
> 
> Not for inclusion yet, I want to add an optimisation for when
> waiting for 1 CQE.

Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for
that, and I see context rates of around 8.1-8.3M/sec with the current
kernel.

Applied the two patches, but didn't see much of a change? Performance is
about the same, and cx rate ditto. Confused... As you probably know,
this test waits for 32 ios at a time.

Didn't take a closer look just yet, but I grok the concept. One
immediate thing I'd want to change is the FACILE part of it. Let's call
it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT?
I can see this mostly being used for filling a CQE, so it could also be
named something like that. But could also be used for light work in the
same vein, so might not be a good idea to base the naming on that.

-- 
Jens Axboe

* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-11 17:24 ` [RFC 0/2] optimise local-tw task resheduling Jens Axboe
@ 2023-03-11 20:45   ` Pavel Begunkov
  2023-03-11 20:53     ` Pavel Begunkov
  2023-03-12 15:30     ` Jens Axboe
  2023-03-16 12:25   ` Pavel Begunkov
  1 sibling, 2 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-03-11 20:45 UTC
  To: Jens Axboe, io-uring; +Cc: linux-kernel

On 3/11/23 17:24, Jens Axboe wrote:
> On 3/10/23 12:04 PM, Pavel Begunkov wrote:
>> io_uring extensively uses task_work, but when a task is waiting
>> for multiple CQEs it causes lots of rescheduling. This series
>> is an attempt to optimise it and be a base for future improvements.
>>
>> For some zc network tests that eventually wait for a portion of
>> buffers, I got a 10x decrease in the number of context switches,
>> which cut CPU consumption by more than half (17% -> 8%).
>> It also helps storage cases: running fio/t/io_uring against
>> a low-performance drive gave a 2x decrease in the number of context
>> switches for QD8 and ~4x for QD32.
>>
>> Not for inclusion yet, I want to add an optimisation for when
>> waiting for 1 CQE.
> 
> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for
> that, and I see context rates of around 8.1-8.3M/sec with the current
> kernel.
> 
> Applied the two patches, but didn't see much of a change? Performance is
> about the same, and cx rate ditto. Confused... As you probably know,
> this test waits for 32 ios at a time.

If I had to guess, it already has perfect batching, in which case
the patch does nothing. Maybe it's due to SSD coalescing +
small ro I/O + the consistency and low latencies of Optanes,
or the scheduling and the kernel side might be slow
to react.

I was looking at trace_io_uring_local_work_run() while testing.
It should always be @loop=QD (i.e. 32) with the patch, but
my guess is it's also 32 with that setup without the patches.

> Didn't take a closer look just yet, but I grok the concept. One
> immediate thing I'd want to change is the FACILE part of it. Let's call
> it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT?

I don't really care, will change, but let me also ask why?
They're more or less synonyms, though facile is much less
popular. Is that your reasoning?

> I can see this mostly being used for filling a CQE, so it could also be
> named something like that. But could also be used for light work in the
> same vein, so might not be a good idea to base the naming on that.

-- 
Pavel Begunkov

* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-11 20:45   ` Pavel Begunkov
@ 2023-03-11 20:53     ` Pavel Begunkov
  2023-03-12 15:31       ` Jens Axboe
  2023-03-12 15:30     ` Jens Axboe
  1 sibling, 1 reply; 17+ messages in thread
From: Pavel Begunkov @ 2023-03-11 20:53 UTC
  To: Jens Axboe, io-uring; +Cc: linux-kernel

On 3/11/23 20:45, Pavel Begunkov wrote:
> On 3/11/23 17:24, Jens Axboe wrote:
>> On 3/10/23 12:04 PM, Pavel Begunkov wrote:
>>> io_uring extensively uses task_work, but when a task is waiting
>>> for multiple CQEs it causes lots of rescheduling. This series
>>> is an attempt to optimise it and be a base for future improvements.
>>>
>>> For some zc network tests that eventually wait for a portion of
>>> buffers, I got a 10x decrease in the number of context switches,
>>> which cut CPU consumption by more than half (17% -> 8%).
>>> It also helps storage cases: running fio/t/io_uring against
>>> a low-performance drive gave a 2x decrease in the number of context
>>> switches for QD8 and ~4x for QD32.
>>>
>>> Not for inclusion yet, I want to add an optimisation for when
>>> waiting for 1 CQE.
>>
>> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for
>> that, and I see context rates of around 8.1-8.3M/sec with the current
>> kernel.
>>
>> Applied the two patches, but didn't see much of a change? Performance is
>> about the same, and cx rate ditto. Confused... As you probably know,
>> this test waits for 32 ios at a time.
> 
> If I had to guess, it already has perfect batching, in which case
> the patch does nothing. Maybe it's due to SSD coalescing +
> small ro I/O + the consistency and low latencies of Optanes,
> or the scheduling and the kernel side might be slow
> to react.

And if that's the case, I have to note that it's quite a sterile
case; the last time I asked, the usual batching we're currently
getting for networking cases was 1-2.

-- 
Pavel Begunkov

* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-11 20:53     ` Pavel Begunkov
@ 2023-03-12 15:31       ` Jens Axboe
  2023-03-13  3:52         ` Pavel Begunkov
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2023-03-12 15:31 UTC
  To: Pavel Begunkov, io-uring; +Cc: linux-kernel

On 3/11/23 1:53 PM, Pavel Begunkov wrote:
> On 3/11/23 20:45, Pavel Begunkov wrote:
>> On 3/11/23 17:24, Jens Axboe wrote:
>>> On 3/10/23 12:04 PM, Pavel Begunkov wrote:
>>>> io_uring extensively uses task_work, but when a task is waiting
>>>> for multiple CQEs it causes lots of rescheduling. This series
>>>> is an attempt to optimise it and be a base for future improvements.
>>>>
>>>> For some zc network tests that eventually wait for a portion of
>>>> buffers, I got a 10x decrease in the number of context switches,
>>>> which cut CPU consumption by more than half (17% -> 8%).
>>>> It also helps storage cases: running fio/t/io_uring against
>>>> a low-performance drive gave a 2x decrease in the number of context
>>>> switches for QD8 and ~4x for QD32.
>>>>
>>>> Not for inclusion yet, I want to add an optimisation for when
>>>> waiting for 1 CQE.
>>>
>>> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for
>>> that, and I see context rates of around 8.1-8.3M/sec with the current
>>> kernel.
>>>
>>> Applied the two patches, but didn't see much of a change? Performance is
>>> about the same, and cx rate ditto. Confused... As you probably know,
>>> this test waits for 32 ios at a time.
>>
>> If I had to guess, it already has perfect batching, in which case
>> the patch does nothing. Maybe it's due to SSD coalescing +
>> small ro I/O + the consistency and low latencies of Optanes,
>> or the scheduling and the kernel side might be slow
>> to react.
> 
> And if that's the case, I have to note that it's quite a sterile
> case; the last time I asked, the usual batching we're currently
> getting for networking cases was 1-2.

I can definitely see this being very useful for the more
non-deterministic cases where "completions" come in more sporadically.
But for the networking case, if this is eg receives, you'd trigger the
wakeup anyway to do the actual receive? And then the cqe posting doesn't
trigger another wakeup.

-- 
Jens Axboe


* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-12 15:31       ` Jens Axboe
@ 2023-03-13  3:52         ` Pavel Begunkov
  0 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-03-13  3:52 UTC
  To: Jens Axboe, io-uring; +Cc: linux-kernel

On 3/12/23 15:31, Jens Axboe wrote:
> On 3/11/23 1:53 PM, Pavel Begunkov wrote:
>> On 3/11/23 20:45, Pavel Begunkov wrote:
>>> On 3/11/23 17:24, Jens Axboe wrote:
>>>> On 3/10/23 12:04 PM, Pavel Begunkov wrote:
>>>>> io_uring extensively uses task_work, but when a task is waiting
>>>>> for multiple CQEs it causes lots of rescheduling. This series
>>>>> is an attempt to optimise it and be a base for future improvements.
>>>>>
>>>>> For some zc network tests that eventually wait for a portion of
>>>>> buffers, I got a 10x decrease in the number of context switches,
>>>>> which cut CPU consumption by more than half (17% -> 8%).
>>>>> It also helps storage cases: running fio/t/io_uring against
>>>>> a low-performance drive gave a 2x decrease in the number of context
>>>>> switches for QD8 and ~4x for QD32.
>>>>>
>>>>> Not for inclusion yet, I want to add an optimisation for when
>>>>> waiting for 1 CQE.
>>>>
>>>> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for
>>>> that, and I see context rates of around 8.1-8.3M/sec with the current
>>>> kernel.
>>>>
>>>> Applied the two patches, but didn't see much of a change? Performance is
>>>> about the same, and cx rate ditto. Confused... As you probably know,
>>>> this test waits for 32 ios at a time.
>>>
>>> If I had to guess, it already has perfect batching, in which case
>>> the patch does nothing. Maybe it's due to SSD coalescing +
>>> small ro I/O + the consistency and low latencies of Optanes,
>>> or the scheduling and the kernel side might be slow
>>> to react.
>>
>> And if that's the case, I have to note that it's quite a sterile
>> case; the last time I asked, the usual batching we're currently
>> getting for networking cases was 1-2.
> 
> I can definitely see this being very useful for the more
> non-deterministic cases where "completions" come in more sporadically.
> But for the networking case, if this is eg receives, you'd trigger the
> wakeup anyway to do the actual receive? And then the cqe posting doesn't
> trigger another wakeup.

True, in my case zc send notifications were the culprit.

It's not in the series, but it might be better not to eagerly wake for
recv poll tw; that would give time to accumulate more data. I'm a bit
afraid of exhausting recv queues this way, so I don't think it's
applicable by default.

-- 
Pavel Begunkov

* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-11 20:45   ` Pavel Begunkov
  2023-03-11 20:53     ` Pavel Begunkov
@ 2023-03-12 15:30     ` Jens Axboe
  2023-03-13  3:45       ` Pavel Begunkov
  1 sibling, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2023-03-12 15:30 UTC
  To: Pavel Begunkov, io-uring; +Cc: linux-kernel

On 3/11/23 1:45 PM, Pavel Begunkov wrote:
> On 3/11/23 17:24, Jens Axboe wrote:
>> On 3/10/23 12:04 PM, Pavel Begunkov wrote:
>>> io_uring extensively uses task_work, but when a task is waiting
>>> for multiple CQEs it causes lots of rescheduling. This series
>>> is an attempt to optimise it and be a base for future improvements.
>>>
>>> For some zc network tests that eventually wait for a portion of
>>> buffers, I got a 10x decrease in the number of context switches,
>>> which cut CPU consumption by more than half (17% -> 8%).
>>> It also helps storage cases: running fio/t/io_uring against
>>> a low-performance drive gave a 2x decrease in the number of context
>>> switches for QD8 and ~4x for QD32.
>>>
>>> Not for inclusion yet, I want to add an optimisation for when
>>> waiting for 1 CQE.
>>
>> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for
>> that, and I see context rates of around 8.1-8.3M/sec with the current
>> kernel.
>>
>> Applied the two patches, but didn't see much of a change? Performance is
>> about the same, and cx rate ditto. Confused... As you probably know,
>> this test waits for 32 ios at a time.
> 
> If I had to guess, it already has perfect batching, in which case
> the patch does nothing. Maybe it's due to SSD coalescing +
> small ro I/O + the consistency and low latencies of Optanes,
> or the scheduling and the kernel side might be slow
> to react.
> 
> I was looking at trace_io_uring_local_work_run() while testing.
> It should always be @loop=QD (i.e. 32) with the patch, but
> my guess is it's also 32 with that setup without the patches.

It very well could be that it's just loaded enough that we get perfect
batching anyway. I'd need to reuse some of your tracing to know for
sure.

>> Didn't take a closer look just yet, but I grok the concept. One
>> immediate thing I'd want to change is the FACILE part of it. Let's call
>> it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT?
> 
> I don't really care, will change, but let me also ask why?
> They're more or less synonyms, though facile is much less
> popular. Is that your reasoning?

Yep, it's not very common and the name should be self-explanatory
immediately for most people.

-- 
Jens Axboe


* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-12 15:30     ` Jens Axboe
@ 2023-03-13  3:45       ` Pavel Begunkov
  2023-03-13 14:16         ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Pavel Begunkov @ 2023-03-13  3:45 UTC
  To: Jens Axboe, io-uring; +Cc: linux-kernel

On 3/12/23 15:30, Jens Axboe wrote:
> On 3/11/23 1:45 PM, Pavel Begunkov wrote:
>> On 3/11/23 17:24, Jens Axboe wrote:
>>> On 3/10/23 12:04 PM, Pavel Begunkov wrote:
>>>> io_uring extensively uses task_work, but when a task is waiting
>>>> for multiple CQEs it causes lots of rescheduling. This series
>>>> is an attempt to optimise it and be a base for future improvements.
>>>>
>>>> For some zc network tests that eventually wait for a portion of
>>>> buffers, I got a 10x decrease in the number of context switches,
>>>> which cut CPU consumption by more than half (17% -> 8%).
>>>> It also helps storage cases: running fio/t/io_uring against
>>>> a low-performance drive gave a 2x decrease in the number of context
>>>> switches for QD8 and ~4x for QD32.
>>>>
>>>> Not for inclusion yet, I want to add an optimisation for when
>>>> waiting for 1 CQE.
>>>
>>> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for
>>> that, and I see context rates of around 8.1-8.3M/sec with the current
>>> kernel.
>>>
>>> Applied the two patches, but didn't see much of a change? Performance is
>>> about the same, and cx rate ditto. Confused... As you probably know,
>>> this test waits for 32 ios at a time.
>>
>> If I had to guess, it already has perfect batching, in which case
>> the patch does nothing. Maybe it's due to SSD coalescing +
>> small ro I/O + the consistency and low latencies of Optanes,
>> or the scheduling and the kernel side might be slow
>> to react.
>>
>> I was looking at trace_io_uring_local_work_run() while testing.
>> It should always be @loop=QD (i.e. 32) with the patch, but
>> my guess is it's also 32 with that setup without the patches.
> 
> It very well could be that it's just loaded enough that we get perfect
> batching anyway. I'd need to reuse some of your tracing to know for
> sure.

I used existing trace points. If you see a pattern

trace_io_uring_local_work_run()
trace_io_uring_cqring_wait(@count=32)

trace_io_uring_local_work_run()
trace_io_uring_cqring_wait(@count=32)

...

that would mean perfect batching, even more so
if @loops=1.


>>> Didn't take a closer look just yet, but I grok the concept. One
>>> immediate thing I'd want to change is the FACILE part of it. Let's call
>>> it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT?
>>
>> I don't really care, will change, but let me also ask why?
>> They're more or less synonyms, though facile is much less
>> popular. Is that your reasoning?
> 
> Yep, it's not very common and the name should be self-explanatory
> immediately for most people.

That's exactly the problem. Someone will think that it's
like normal tw but "better" and blindly apply it. Same happened
before with priority tw lists.

-- 
Pavel Begunkov

* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-13  3:45       ` Pavel Begunkov
@ 2023-03-13 14:16         ` Jens Axboe
  2023-03-13 17:50           ` Pavel Begunkov
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2023-03-13 14:16 UTC
  To: Pavel Begunkov, io-uring; +Cc: linux-kernel

On 3/12/23 9:45 PM, Pavel Begunkov wrote:
>>>> Didn't take a closer look just yet, but I grok the concept. One
>>>> immediate thing I'd want to change is the FACILE part of it. Let's call
>>>> it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT?
>>>
>>> I don't really care, will change, but let me also ask why?
>>> They're more or less synonyms, though facile is much less
>>> popular. Is that your reasoning?
>>
>> Yep, it's not very common and the name should be self-explanatory
>> immediately for most people.
> 
> That's exactly the problem. Someone will think that it's
> like normal tw but "better" and blindly apply it. Same happened
> before with priority tw lists.

But the way to fix that is not through obscure naming, it's through
better and more frequent review. Naming is hard, but naming should be
basically self-explanatory in terms of why it differs from not setting
that flag. LIGHTWEIGHT and friends isn't great either, maybe it should
just be explicit in that this task_work just posts a CQE and hence it's
pointless to wake the task to run it unless it'll then meet the criteria
of having that task exit its wait loop as it now has enough CQEs
available. IO_UF_TWQ_CQE_POST or something like that. Then if it at some
point gets modified to also encompass different types of task_work that
should not cause wakes, then it can change again. Just tossing
suggestions out there...

-- 
Jens Axboe

* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-13 14:16         ` Jens Axboe
@ 2023-03-13 17:50           ` Pavel Begunkov
  2023-03-13 22:01             ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Pavel Begunkov @ 2023-03-13 17:50 UTC
  To: Jens Axboe, io-uring; +Cc: linux-kernel

On 3/13/23 14:16, Jens Axboe wrote:
> On 3/12/23 9:45 PM, Pavel Begunkov wrote:
>>>>> Didn't take a closer look just yet, but I grok the concept. One
>>>>> immediate thing I'd want to change is the FACILE part of it. Let's call
>>>>> it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT?
>>>>
>>>> I don't really care, will change, but let me also ask why?
>>>> They're more or less synonyms, though facile is much less
>>>> popular. Is that your reasoning?
>>
>>> Yep, it's not very common and the name should be self-explanatory
>>> immediately for most people.
>>
>> That's exactly the problem. Someone will think that it's
>> like normal tw but "better" and blindly apply it. Same happened
>> before with priority tw lists.
> 
> But the way to fix that is not through obscure naming, it's through
> better and more frequent review. Naming is hard, but naming should be
> basically self-explanatory in terms of why it differs from not setting
> that flag. LIGHTWEIGHT and friends isn't great either, maybe it should
> just be explicit in that this task_work just posts a CQE and hence it's
> pointless to wake the task to run it unless it'll then meet the criteria
> of having that task exit its wait loop as it now has enough CQEs
> available. IO_UF_TWQ_CQE_POST or something like that. Then if it at some

There are 2 expectations (will add a comment):
1) it posts no more than 1 CQE, 0 is fine

2) it's not urgent, including that it doesn't lock out scarce
[system wide] resources. DMA mappings come to mind as an example.

IIRC that is a problem even now with nvme passthrough and DEFER_TASKRUN.

> point gets modified to also encompass different types of task_work that
> should not cause wakes, then it can change again. Just tossing
> suggestions out there...

I honestly don't see how LIGHTWEIGHT is better. I think a proper
name would be _LAZY_WAKE or maybe _DEFERRED_WAKE. It doesn't tell
much about why you would want it, but at least sets expectations
what it does. Only needs a comment that multishot is not supported.

-- 
Pavel Begunkov


* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-13 17:50           ` Pavel Begunkov
@ 2023-03-13 22:01             ` Jens Axboe
  0 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2023-03-13 22:01 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: linux-kernel

On 3/13/23 11:50 AM, Pavel Begunkov wrote:
> On 3/13/23 14:16, Jens Axboe wrote:
>> On 3/12/23 9:45 PM, Pavel Begunkov wrote:
>>>>>> Didn't take a closer look just yet, but I grok the concept. One
>>>>>> immediate thing I'd want to change is the FACILE part of it. Let's call
>>>>>> it something a bit more straightforward, perhaps LIGHT? Or LIGHTWEIGHT?
>>>>>
>>>>> I don't really care, will change, but let me also ask why?
>>>>> They're more or less synonyms, though facile is much less
>>>>> popular. Is that your reasoning?
>>>
>>>> Yep, it's not very common and the name should be self-explanatory
>>>> immediately for most people.
>>>
>>> That's exactly the problem. Someone will think that it's
>>> like normal tw but "better" and blindly apply it. Same happened
>>> before with priority tw lists.
>>
>> But the way to fix that is not through obscure naming, it's through
>> better and more frequent review. Naming is hard, but naming should be
>> basically self-explanatory in terms of why it differs from not setting
>> that flag. LIGHTWEIGHT and friends isn't great either, maybe it should
>> just be explicit in that this task_work just posts a CQE and hence it's
>> pointless to wake the task to run it unless it'll then meet the criteria
>> of having that task exit its wait loop as it now has enough CQEs
>> available. IO_UF_TWQ_CQE_POST or something like that. Then if it at some
> 
> There are 2 expectations (will add a comment)
> 1) it posts no more than 1 CQE, 0 is fine
> 
> 2) it's not urgent, including that it doesn't lock out scarce
> [system wide] resources. DMA mappings come to mind as an example.
> 
> IIRC it is a problem even now with nvme passthrough and DEFER_TASKRUN

DMA mappings aren't really scarce, only on weird/crappy setups with a
very limited IOMMU space where an IOMMU is being used. So not a huge
deal I think.

>> point gets modified to also encompass different types of task_work that
>> should not cause wakes, then it can change again. Just tossing
>> suggestions out there...
> 
> I honestly don't see how LIGHTWEIGHT is better. I think a proper
> name would be _LAZY_WAKE or maybe _DEFERRED_WAKE. It doesn't tell
> much about why you would want it, but at least sets expectations
> what it does. Only needs a comment that multishot is not supported.

Agree, and this is what I said too, LIGHTWEIGHT isn't a great word
either. DEFERRED_WAKE seems like a good candidate, and it'd be great to
also include a code comment there on what it does. That'll help future
contributors.
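
As a rough illustration of the kind of code comment discussed above, here
is a minimal sketch. The flag name IOU_F_TWQ_DEFERRED_WAKE, its bit value
and the helper tw_wants_deferred_wake() are hypothetical and not taken
from the patches; the comment text only restates the expectations listed
in this thread.

```c
#include <stdbool.h>

enum {
	/*
	 * Hypothetical flag name and value, for illustration only.
	 *
	 * The task_work item may be left queued without waking the
	 * waiting task until enough items have accumulated to satisfy
	 * the number of CQEs the task is waiting for.
	 *
	 * Expectations for task_work queued with this flag:
	 * 1) it posts no more than 1 CQE (0 is fine), so multishot
	 *    requests are not supported;
	 * 2) it is not urgent, i.e. delaying it does not hold on to
	 *    scarce system wide resources.
	 */
	IOU_F_TWQ_DEFERRED_WAKE = 1u << 0,
};

/* Returns true if the caller asked for the deferred wake behaviour. */
static inline bool tw_wants_deferred_wake(unsigned int flags)
{
	return (flags & IOU_F_TWQ_DEFERRED_WAKE) != 0;
}
```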

-- 
Jens Axboe



* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-11 17:24 ` [RFC 0/2] optimise local-tw task resheduling Jens Axboe
  2023-03-11 20:45   ` Pavel Begunkov
@ 2023-03-16 12:25   ` Pavel Begunkov
  1 sibling, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2023-03-16 12:25 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-kernel

On 3/11/23 17:24, Jens Axboe wrote:
> On 3/10/23 12:04 PM, Pavel Begunkov wrote:
>> io_uring extensively uses task_work, but when a task is waiting
>> for multiple CQEs it causes lots of rescheduling. This series
>> is an attempt to optimise it and be a base for future improvements.
>>
>> For some zc network tests eventually waiting for a portion of
>> buffers I've got 10x decrease in the number of context switches,
>> which reduced the CPU consumption more than twice (17% -> 8%).
>> It also helps storage cases, while running fio/t/io_uring against
>> a low performant drive it got 2x decrease of the number of context
>> switches for QD8 and ~4 times for QD32.
>>
>> Not for inclusion yet, I want to add an optimisation for when
>> waiting for 1 CQE.
> 
> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M for
> that, and I see context rates of around 8.1-8.3M/sec with the current
> kernel.

Tried it out. There is no difference with bs=512: QD=4 is completed before
it gets to schedule() in io_cqring_wait(). With QD=32, the local tw run in
__io_run_local_work() spins 2 loops, and QD=8 is somewhere in the middle
with a rare extra sched.

For bs=4096 QD=8 I see a lot of:

io_cqring_wait() @min_events=8
schedule()
__io_run_local_work() nr=4
schedule()
__io_run_local_work() nr=4


And if we benchmark without and with the patch there is a nice
CPU util reduction.

CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
   0    1.18    0.00   19.24    0.00    0.00    0.00    0.00    0.00    0.00   79.57
   0    1.63    0.00   29.38    0.00    0.00    0.00    0.00    0.00    0.00   68.98
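
The bs=4096 QD=8 trace above can be mimicked with a toy model. This is
only a sketch, not kernel code: struct wake_stats and simulate() are
made-up names, and the model assumes every task_work item posts exactly
one CQE and that items arrive in fixed-size batches.

```c
#include <stdbool.h>

/* Toy model of a task waiting for CQEs; not kernel code. */
struct wake_stats {
	unsigned int wakeups;   /* times the waiting task was woken */
	unsigned int completed; /* CQEs posted so far */
};

/*
 * Queue `batches` batches of `per_batch` task_work items for a task
 * that waits for `min_events` CQEs. Each item posts exactly one CQE
 * when the task runs it.
 */
static struct wake_stats simulate(unsigned int min_events,
				  unsigned int per_batch,
				  unsigned int batches,
				  bool deferred_wake)
{
	struct wake_stats st = { 0, 0 };
	unsigned int queued = 0;

	for (unsigned int i = 0; i < batches; i++) {
		queued += per_batch;

		/*
		 * Eager mode wakes the task for every batch. Deferred
		 * mode wakes it only once the queued items are enough
		 * to satisfy the waiter.
		 */
		if (!deferred_wake || st.completed + queued >= min_events) {
			st.wakeups++;
			st.completed += queued;
			queued = 0;
		}
		if (st.completed >= min_events)
			break;
	}
	return st;
}
```

With min_events=8 and two batches of 4, simulate(8, 4, 2, false) wakes the
task twice, matching the two schedule() calls in the trace, while
simulate(8, 4, 2, true) wakes it once.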

-- 
Pavel Begunkov


* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-10 19:04 [RFC 0/2] optimise local-tw task resheduling Pavel Begunkov
                   ` (2 preceding siblings ...)
  2023-03-11 17:24 ` [RFC 0/2] optimise local-tw task resheduling Jens Axboe
@ 2023-03-15  2:35 ` Ming Lei
  2023-03-15 16:53   ` Pavel Begunkov
  3 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2023-03-15  2:35 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, Jens Axboe, linux-kernel, ming.lei

Hi Pavel

On Fri, Mar 10, 2023 at 07:04:14PM +0000, Pavel Begunkov wrote:
> io_uring extensively uses task_work, but when a task is waiting
> for multiple CQEs it causes lots of rescheduling. This series
> is an attempt to optimise it and be a base for future improvements.
> 
> For some zc network tests eventually waiting for a portion of 
> buffers I've got 10x decrease in the number of context switches,
> which reduced the CPU consumption more than twice (17% -> 8%).
> It also helps storage cases, while running fio/t/io_uring against
> a low performant drive it got 2x decrease of the number of context
> switches for QD8 and ~4 times for QD32.

ublk uses io_uring_cmd_complete_in_task() (io_req_task_work_add())
heavily, so I tried this patchset. I don't see an obvious change in
either IOPS or context switches when running 't/io_uring /dev/ublkb0'
on a null ublk target (ublk add -t null -z -u 1 -q 2); IOPS
is ~2.8M.

But ublk applies batch scheduling similar to io_uring's before calling
io_uring_cmd_complete_in_task().

thanks,
Ming



* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-15  2:35 ` Ming Lei
@ 2023-03-15 16:53   ` Pavel Begunkov
  2023-03-16  1:25     ` Ming Lei
  0 siblings, 1 reply; 17+ messages in thread
From: Pavel Begunkov @ 2023-03-15 16:53 UTC (permalink / raw)
  To: Ming Lei; +Cc: io-uring, Jens Axboe, linux-kernel

On 3/15/23 02:35, Ming Lei wrote:
> Hi Pavel
> 
> On Fri, Mar 10, 2023 at 07:04:14PM +0000, Pavel Begunkov wrote:
>> io_uring extensively uses task_work, but when a task is waiting
>> for multiple CQEs it causes lots of rescheduling. This series
>> is an attempt to optimise it and be a base for future improvements.
>>
>> For some zc network tests eventually waiting for a portion of
>> buffers I've got 10x decrease in the number of context switches,
>> which reduced the CPU consumption more than twice (17% -> 8%).
>> It also helps storage cases, while running fio/t/io_uring against
>> a low performant drive it got 2x decrease of the number of context
>> switches for QD8 and ~4 times for QD32.
> 
> ublk uses io_uring_cmd_complete_in_task() (io_req_task_work_add())
> heavily, so I tried this patchset. I don't see an obvious change in
> either IOPS or context switches when running 't/io_uring /dev/ublkb0'
> on a null ublk target (ublk add -t null -z -u 1 -q 2); IOPS
> is ~2.8M.

Hi Ming,

It's enabled for rw requests and send-zc notifications, but
io_uring_cmd_complete_in_task() is not covered. I'll be enabling
it for more cases, including pass through.

> But ublk applies batch scheduling similar to io_uring's before calling
> io_uring_cmd_complete_in_task().

The feature doesn't tolerate tw that produce multiple CQEs, so
it can't be applied to this batching and the task would get stuck
waiting.
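
One plausible way to picture the problem, as a toy model only: suppose the
waker counts queued tw items and assumes one CQE per item. The function
names and the counting rule below are assumptions made for illustration,
not the actual kernel logic.

```c
#include <stdbool.h>

/*
 * Toy wake decision: the waker only sees the number of queued tw
 * items and assumes each of them will post exactly one CQE.
 */
static bool should_wake(unsigned int queued_items,
			unsigned int cqes_ready,
			unsigned int min_events)
{
	return cqes_ready + queued_items >= min_events;
}

/*
 * Returns true if the waiter is left sleeping even though running the
 * queued items would post enough CQEs to satisfy it.
 */
static bool waiter_gets_stuck(unsigned int queued_items,
			      unsigned int cqes_per_item,
			      unsigned int cqes_ready,
			      unsigned int min_events)
{
	bool woken = should_wake(queued_items, cqes_ready, min_events);
	bool would_satisfy =
		cqes_ready + queued_items * cqes_per_item >= min_events;

	return !woken && would_satisfy;
}
```

In this model a single batched item that would post 4 CQEs for a task
waiting on 4 events is counted as one, so waiter_gets_stuck(1, 4, 0, 4)
is true, while four single-CQE items, waiter_gets_stuck(4, 1, 0, 4), wake
the task as expected.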

btw, from a quick look it appeared that ublk batching is there
to keep requests together but not to improve batching. And if so,
I think we can get rid of it, rely on io_uring batching and
let ublk gather its requests from the tw list, which sounds
cleaner. I'll elaborate on that later.

-- 
Pavel Begunkov


* Re: [RFC 0/2] optimise local-tw task resheduling
  2023-03-15 16:53   ` Pavel Begunkov
@ 2023-03-16  1:25     ` Ming Lei
  0 siblings, 0 replies; 17+ messages in thread
From: Ming Lei @ 2023-03-16  1:25 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring, Jens Axboe, linux-kernel, ming.lei

On Wed, Mar 15, 2023 at 04:53:09PM +0000, Pavel Begunkov wrote:
> On 3/15/23 02:35, Ming Lei wrote:
> > Hi Pavel
> > 
> > On Fri, Mar 10, 2023 at 07:04:14PM +0000, Pavel Begunkov wrote:
> > > io_uring extensively uses task_work, but when a task is waiting
> > > for multiple CQEs it causes lots of rescheduling. This series
> > > is an attempt to optimise it and be a base for future improvements.
> > > 
> > > For some zc network tests eventually waiting for a portion of
> > > buffers I've got 10x decrease in the number of context switches,
> > > which reduced the CPU consumption more than twice (17% -> 8%).
> > > It also helps storage cases, while running fio/t/io_uring against
> > > a low performant drive it got 2x decrease of the number of context
> > > switches for QD8 and ~4 times for QD32.
> > 
> > ublk uses io_uring_cmd_complete_in_task() (io_req_task_work_add())
> > heavily, so I tried this patchset. I don't see an obvious change in
> > either IOPS or context switches when running 't/io_uring /dev/ublkb0'
> > on a null ublk target (ublk add -t null -z -u 1 -q 2); IOPS
> > is ~2.8M.
> 
> Hi Ming,
> 
> It's enabled for rw requests and send-zc notifications, but
> io_uring_cmd_complete_in_task() is not covered. I'll be enabling
> it for more cases, including pass through.
> 
> > But ublk applies batch scheduling similar to io_uring's before calling
> > io_uring_cmd_complete_in_task().
> 
> The feature doesn't tolerate tw that produce multiple CQEs, so
> it can't be applied to this batching and the task would get stuck
> waiting.
> 
> btw, from a quick look it appeared that ublk batching is there
> to keep requests together but not to improve batching. And if so,
> I think we can get rid of it, rely on io_uring batching and
> let ublk gather its requests from the tw list, which sounds
> cleaner. I'll elaborate on that later.

Yeah, the ublk batching can be removed since __io_req_task_work_add
already does it. It is kept just as a micro-optimization to call
io_uring_cmd_complete_in_task() less often, but I think we can get a bigger
improvement with your tw optimization.


Thanks,
Ming



end of thread, other threads:[~2023-03-16 12:27 UTC | newest]

Thread overview: 17+ messages
2023-03-10 19:04 [RFC 0/2] optimise local-tw task resheduling Pavel Begunkov
2023-03-10 19:04 ` [RFC 1/2] io_uring: add tw add flags Pavel Begunkov
2023-03-10 19:04 ` [RFC 2/2] io_uring: reduce sheduling due to tw Pavel Begunkov
2023-03-11 17:24 ` [RFC 0/2] optimise local-tw task resheduling Jens Axboe
2023-03-11 20:45   ` Pavel Begunkov
2023-03-11 20:53     ` Pavel Begunkov
2023-03-12 15:31       ` Jens Axboe
2023-03-13  3:52         ` Pavel Begunkov
2023-03-12 15:30     ` Jens Axboe
2023-03-13  3:45       ` Pavel Begunkov
2023-03-13 14:16         ` Jens Axboe
2023-03-13 17:50           ` Pavel Begunkov
2023-03-13 22:01             ` Jens Axboe
2023-03-16 12:25   ` Pavel Begunkov
2023-03-15  2:35 ` Ming Lei
2023-03-15 16:53   ` Pavel Begunkov
2023-03-16  1:25     ` Ming Lei
