public inbox for [email protected]
* [PATCH for-5.16 v4 0/6] task work optimization
@ 2021-10-29 12:22 Hao Xu
  2021-10-29 12:22 ` [PATCH 1/6] io-wq: add helper to merge two wq_lists Hao Xu
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Hao Xu @ 2021-10-29 12:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

v3->v4
- remove 8/8 in v3
- remove nr_ctx
- insert a TW into the priority tw list only in sqpoll mode
- optimise the priority tw list handling logic to be compatible with
  the multiple-ctx case

Hao Xu (6):
  io-wq: add helper to merge two wq_lists
  io_uring: add a priority tw list for irq completion work
  io_uring: add helper for task work execution code
  io_uring: split io_req_complete_post() and add a helper
  io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
  io_uring: batch completion in prior_task_list

 fs/io-wq.h    |  21 +++++++
 fs/io_uring.c | 160 ++++++++++++++++++++++++++++++++++----------------
 2 files changed, 130 insertions(+), 51 deletions(-)

-- 
2.24.4



* [PATCH 1/6] io-wq: add helper to merge two wq_lists
  2021-10-29 12:22 [PATCH for-5.16 v4 0/6] task work optimization Hao Xu
@ 2021-10-29 12:22 ` Hao Xu
  2021-11-17 22:41   ` Pavel Begunkov
  2021-10-29 12:22 ` [PATCH 2/6] io_uring: add a priority tw list for irq completion work Hao Xu
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Hao Xu @ 2021-10-29 12:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Add a helper to merge two wq_lists; it will be useful in the next
patches.

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io-wq.h | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/fs/io-wq.h b/fs/io-wq.h
index 41bf37674a49..a7b0b505db9d 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -52,6 +52,27 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
 		list->last = node;
 }
 
+/**
+ * wq_list_merge - merge the second list to the first one.
+ * @list0: the first list
+ * @list1: the second list
+ * Return list0 if list1 is NULL, and vice versa.
+ * Otherwise after merge, list0 contains the merged list.
+ */
+static inline struct io_wq_work_list *wq_list_merge(struct io_wq_work_list *list0,
+						    struct io_wq_work_list *list1)
+{
+	if (!list1 || !list1->first)
+		return list0;
+
+	if (!list0 || !list0->first)
+		return list1;
+
+	list0->last->next = list1->first;
+	list0->last = list1->last;
+	return list0;
+}
+
 static inline void wq_list_add_tail(struct io_wq_work_node *node,
 				    struct io_wq_work_list *list)
 {
-- 
2.24.4



* [PATCH 2/6] io_uring: add a priority tw list for irq completion work
  2021-10-29 12:22 [PATCH for-5.16 v4 0/6] task work optimization Hao Xu
  2021-10-29 12:22 ` [PATCH 1/6] io-wq: add helper to merge two wq_lists Hao Xu
@ 2021-10-29 12:22 ` Hao Xu
  2021-11-17 23:03   ` Pavel Begunkov
  2021-10-29 12:22 ` [PATCH 3/6] io_uring: add helper for task work execution code Hao Xu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Hao Xu @ 2021-10-29 12:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Now we have a lot of task_work users, and some of them just complete a
req and generate a cqe. Let's put that work on a new tw list which has a
higher priority, so that it can be handled quickly; this reduces avg req
latency and lets users issue the next round of sqes earlier.
An explanatory case:

origin timeline:
    submit_sqe-->irq-->add completion task_work
    -->run heavy work0~n-->run completion task_work
now timeline:
    submit_sqe-->irq-->add completion task_work
    -->run completion task_work-->run heavy work0~n

Limitation: this optimization only applies when the submission and
reaping processes run in different threads. Otherwise we have to submit
new sqes only after returning to userspace anyway, so the order of TWs
doesn't matter.
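
(Illustrative only, not part of the patch: a minimal liburing setup for
this case, roughly what the fio job below exercises. With
IORING_SETUP_SQPOLL the sq kernel thread performs the actual submission
and runs these task works, while the application thread only reaps
completions. Error handling is trimmed and the fd/buffer arguments are
placeholders.)

#include <liburing.h>

int setup_sqpoll_ring(struct io_uring *ring)
{
	struct io_uring_params p = { };

	p.flags = IORING_SETUP_SQPOLL;
	p.sq_thread_idle = 2000;	/* ms before the sq thread goes idle */
	return io_uring_queue_init_params(256, ring, &p);
}

void queue_read(struct io_uring *ring, int fd, void *buf, unsigned len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_read(sqe, fd, buf, len, 0);
	/* with SQPOLL this mostly just wakes the sq thread if it went idle */
	io_uring_submit(ring);
}

void reap_one(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;

	/* the actual sqe submission happened in the sq kernel thread */
	if (io_uring_wait_cqe(ring, &cqe) == 0) {
		/* consume cqe->res ... */
		io_uring_cqe_seen(ring, cqe);
	}
}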

Tested this patch (and the following ones) by manually replacing
__io_queue_sqe() in io_queue_sqe() with io_req_task_queue() to construct
'heavy' task works, then tested with fio:

ioengine=io_uring
sqpoll=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1

Tried various iodepth values.
The peak IOPS with this patch is 710K, while the old one is 665K.
For avg latency, the difference shows as iodepth grows:
depth and avg latency (usec):
	depth      new          old
	 1        7.05         7.10
	 2        8.47         8.60
	 4        10.42        10.42
	 8        13.78        13.22
	 16       27.41        24.33
	 32       49.40        53.08
	 64       102.53       103.36
	 128      196.98       205.61
	 256      372.99       414.88
	 512      747.23       791.30
	 1024     1472.59      1538.72
	 2048     3153.49      3329.01
	 4096     6387.86      6682.54
	 8192     12150.25     12774.14
	 16384    23085.58     26044.71

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 17cb0e1b88f0..981794ee3f3f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -467,6 +467,7 @@ struct io_uring_task {
 
 	spinlock_t		task_lock;
 	struct io_wq_work_list	task_list;
+	struct io_wq_work_list	prior_task_list;
 	struct callback_head	task_work;
 	bool			task_running;
 };
@@ -2148,13 +2149,17 @@ static void tctx_task_work(struct callback_head *cb)
 
 	while (1) {
 		struct io_wq_work_node *node;
+		struct io_wq_work_list *merged_list;
 
-		if (!tctx->task_list.first && locked)
+		if (!tctx->prior_task_list.first &&
+		    !tctx->task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		node = tctx->task_list.first;
+		merged_list = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
+		node = merged_list->first;
 		INIT_WQ_LIST(&tctx->task_list);
+		INIT_WQ_LIST(&tctx->prior_task_list);
 		if (!node)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
@@ -2183,19 +2188,23 @@ static void tctx_task_work(struct callback_head *cb)
 	ctx_flush_and_put(ctx, &locked);
 }
 
-static void io_req_task_work_add(struct io_kiocb *req)
+static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 {
 	struct task_struct *tsk = req->task;
 	struct io_uring_task *tctx = tsk->io_uring;
 	enum task_work_notify_mode notify;
 	struct io_wq_work_node *node;
+	struct io_wq_work_list *merged_list;
 	unsigned long flags;
 	bool running;
 
 	WARN_ON_ONCE(!tctx);
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
-	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
+	if (priority)
+		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
+	else
+		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
 	running = tctx->task_running;
 	if (!running)
 		tctx->task_running = true;
@@ -2220,8 +2229,10 @@ static void io_req_task_work_add(struct io_kiocb *req)
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
 	tctx->task_running = false;
-	node = tctx->task_list.first;
+	merged_list = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
+	node = merged_list->first;
 	INIT_WQ_LIST(&tctx->task_list);
+	INIT_WQ_LIST(&tctx->prior_task_list);
 	spin_unlock_irqrestore(&tctx->task_lock, flags);
 
 	while (node) {
@@ -2258,19 +2269,19 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
 {
 	req->result = ret;
 	req->io_task_work.func = io_req_task_cancel;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static void io_req_task_queue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_req_task_submit;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static void io_req_task_queue_reissue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_queue_async_work;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static inline void io_queue_next(struct io_kiocb *req)
@@ -2375,7 +2386,7 @@ static inline void io_put_req_deferred(struct io_kiocb *req)
 {
 	if (req_ref_put_and_test(req)) {
 		req->io_task_work.func = io_free_req_work;
-		io_req_task_work_add(req);
+		io_req_task_work_add(req, false);
 	}
 }
 
@@ -2678,7 +2689,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 		return;
 	req->result = res;
 	req->io_task_work.func = io_req_task_complete;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
 }
 
 static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
@@ -5243,7 +5254,7 @@ static inline int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *pol
 	 * of executing it. We can't safely execute it anyway, as we may not
 	 * have the needed state needed for it anyway.
 	 */
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return 1;
 }
 
@@ -5923,7 +5934,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -6889,7 +6900,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_link_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -8593,6 +8604,7 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
 	task->io_uring = tctx;
 	spin_lock_init(&tctx->task_lock);
 	INIT_WQ_LIST(&tctx->task_list);
+	INIT_WQ_LIST(&tctx->prior_task_list);
 	init_task_work(&tctx->task_work, tctx_task_work);
 	return 0;
 }
-- 
2.24.4



* [PATCH 3/6] io_uring: add helper for task work execution code
  2021-10-29 12:22 [PATCH for-5.16 v4 0/6] task work optimization Hao Xu
  2021-10-29 12:22 ` [PATCH 1/6] io-wq: add helper to merge two wq_lists Hao Xu
  2021-10-29 12:22 ` [PATCH 2/6] io_uring: add a priority tw list for irq completion work Hao Xu
@ 2021-10-29 12:22 ` Hao Xu
  2021-10-29 12:22 ` [PATCH 4/6] io_uring: split io_req_complete_post() and add a helper Hao Xu
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Hao Xu @ 2021-10-29 12:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Add a helper for task work execution code. We will use it later.

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 981794ee3f3f..71bc589bbb4d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2140,6 +2140,25 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
 	percpu_ref_put(&ctx->refs);
 }
 
+static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
+{
+	do {
+		struct io_wq_work_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+
+		if (req->ctx != *ctx) {
+			ctx_flush_and_put(*ctx, locked);
+			*ctx = req->ctx;
+			/* if not contended, grab and improve batching */
+			*locked = mutex_trylock(&(*ctx)->uring_lock);
+			percpu_ref_get(&(*ctx)->refs);
+		}
+		req->io_task_work.func(req, locked);
+		node = next;
+	} while (node);
+}
+
 static void tctx_task_work(struct callback_head *cb)
 {
 	bool locked = false;
@@ -2166,22 +2185,7 @@ static void tctx_task_work(struct callback_head *cb)
 		if (!node)
 			break;
 
-		do {
-			struct io_wq_work_node *next = node->next;
-			struct io_kiocb *req = container_of(node, struct io_kiocb,
-							    io_task_work.node);
-
-			if (req->ctx != ctx) {
-				ctx_flush_and_put(ctx, &locked);
-				ctx = req->ctx;
-				/* if not contended, grab and improve batching */
-				locked = mutex_trylock(&ctx->uring_lock);
-				percpu_ref_get(&ctx->refs);
-			}
-			req->io_task_work.func(req, &locked);
-			node = next;
-		} while (node);
-
+		handle_tw_list(node, &ctx, &locked);
 		cond_resched();
 	}
 
-- 
2.24.4



* [PATCH 4/6] io_uring: split io_req_complete_post() and add a helper
  2021-10-29 12:22 [PATCH for-5.16 v4 0/6] task work optimization Hao Xu
                   ` (2 preceding siblings ...)
  2021-10-29 12:22 ` [PATCH 3/6] io_uring: add helper for task work execution code Hao Xu
@ 2021-10-29 12:22 ` Hao Xu
  2021-10-29 12:22 ` [PATCH 5/6] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
  2021-10-29 12:22 ` [PATCH 6/6] io_uring: batch completion in prior_task_list Hao Xu
  5 siblings, 0 replies; 14+ messages in thread
From: Hao Xu @ 2021-10-29 12:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Split io_req_complete_post() into a helper that expects completion_lock
to be held and a wrapper that takes the lock; this is a prep for the
next patch.

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 71bc589bbb4d..52f95db3cfdb 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1819,12 +1819,11 @@ static noinline bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data
 	return __io_cqring_fill_event(ctx, user_data, res, cflags);
 }
 
-static void io_req_complete_post(struct io_kiocb *req, s32 res,
+static void __io_req_complete_post(struct io_kiocb *req, s32 res,
 				 u32 cflags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
-	spin_lock(&ctx->completion_lock);
 	__io_cqring_fill_event(ctx, req->user_data, res, cflags);
 	/*
 	 * If we're the last reference to this request, add to our locked
@@ -1845,6 +1844,15 @@ static void io_req_complete_post(struct io_kiocb *req, s32 res,
 		wq_list_add_head(&req->comp_list, &ctx->locked_free_list);
 		ctx->locked_free_nr++;
 	}
+}
+
+static void io_req_complete_post(struct io_kiocb *req, long res,
+				 unsigned int cflags)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	spin_lock(&ctx->completion_lock);
+	__io_req_complete_post(req, res, cflags);
 	io_commit_cqring(ctx);
 	spin_unlock(&ctx->completion_lock);
 	io_cqring_ev_posted(ctx);
-- 
2.24.4



* [PATCH 5/6] io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
  2021-10-29 12:22 [PATCH for-5.16 v4 0/6] task work optimization Hao Xu
                   ` (3 preceding siblings ...)
  2021-10-29 12:22 ` [PATCH 4/6] io_uring: split io_req_complete_post() and add a helper Hao Xu
@ 2021-10-29 12:22 ` Hao Xu
  2021-10-29 12:22 ` [PATCH 6/6] io_uring: batch completion in prior_task_list Hao Xu
  5 siblings, 0 replies; 14+ messages in thread
From: Hao Xu @ 2021-10-29 12:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Move io_put_kbuf() and io_put_rw_kbuf() up to avoid forward
declarations. We will use them in later patches.
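
(For context only, illustrative and not part of this patch: the cflags
value that io_put_kbuf() builds is what userspace later decodes from the
CQE to learn which provided buffer was used, along these lines.)

#include <liburing.h>

void handle_cqe(struct io_uring_cqe *cqe)
{
	if (cqe->flags & IORING_CQE_F_BUFFER) {
		/* the buffer id sits in the upper bits of cqe->flags */
		unsigned int bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

		/* cqe->res bytes landed in provided buffer 'bid' */
	}
}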

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 52f95db3cfdb..694195c086f3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2148,6 +2148,24 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
 	percpu_ref_put(&ctx->refs);
 }
 
+static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf)
+{
+	unsigned int cflags;
+
+	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
+	cflags |= IORING_CQE_F_BUFFER;
+	req->flags &= ~REQ_F_BUFFER_SELECTED;
+	kfree(kbuf);
+	return cflags;
+}
+
+static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
+{
+	if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
+		return 0;
+	return io_put_kbuf(req, req->kbuf);
+}
+
 static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
 {
 	do {
@@ -2417,24 +2435,6 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx)
 	return smp_load_acquire(&rings->sq.tail) - ctx->cached_sq_head;
 }
 
-static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf)
-{
-	unsigned int cflags;
-
-	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
-	cflags |= IORING_CQE_F_BUFFER;
-	req->flags &= ~REQ_F_BUFFER_SELECTED;
-	kfree(kbuf);
-	return cflags;
-}
-
-static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
-{
-	if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
-		return 0;
-	return io_put_kbuf(req, req->kbuf);
-}
-
 static inline bool io_run_task_work(void)
 {
 	if (test_thread_flag(TIF_NOTIFY_SIGNAL) || current->task_works) {
-- 
2.24.4



* [PATCH 6/6] io_uring: batch completion in prior_task_list
  2021-10-29 12:22 [PATCH for-5.16 v4 0/6] task work optimization Hao Xu
                   ` (4 preceding siblings ...)
  2021-10-29 12:22 ` [PATCH 5/6] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
@ 2021-10-29 12:22 ` Hao Xu
  2021-11-17 22:55   ` Pavel Begunkov
  5 siblings, 1 reply; 14+ messages in thread
From: Hao Xu @ 2021-10-29 12:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

In previous patches, we have already gathered the tw items that use
io_req_task_complete() as their callback into prior_task_list; let's
complete them in batch regardless of the uring lock. For instance, when
we are doing simple direct reads, most task work will be
io_req_task_complete(), and with this patch we don't need to hold the
uring lock there for a long time.

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 52 ++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 694195c086f3..565cd0b34f18 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2166,6 +2166,37 @@ static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
 	return io_put_kbuf(req, req->kbuf);
 }
 
+static void handle_prior_tw_list(struct io_wq_work_node *node)
+{
+	struct io_ring_ctx *ctx = NULL;
+
+	do {
+		struct io_wq_work_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+		if (req->ctx != ctx) {
+			if (ctx) {
+				io_commit_cqring(ctx);
+				spin_unlock(&ctx->completion_lock);
+				io_cqring_ev_posted(ctx);
+				percpu_ref_put(&ctx->refs);
+			}
+			ctx = req->ctx;
+			percpu_ref_get(&ctx->refs);
+			spin_lock(&ctx->completion_lock);
+		}
+		__io_req_complete_post(req, req->result, io_put_rw_kbuf(req));
+		node = next;
+	} while (node);
+
+	if (ctx) {
+		io_commit_cqring(ctx);
+		spin_unlock(&ctx->completion_lock);
+		io_cqring_ev_posted(ctx);
+		percpu_ref_put(&ctx->refs);
+	}
+}
+
 static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
 {
 	do {
@@ -2193,25 +2224,28 @@ static void tctx_task_work(struct callback_head *cb)
 						  task_work);
 
 	while (1) {
-		struct io_wq_work_node *node;
-		struct io_wq_work_list *merged_list;
+		struct io_wq_work_node *node1, *node2;
 
-		if (!tctx->prior_task_list.first &&
-		    !tctx->task_list.first && locked)
+		if (!tctx->task_list.first &&
+		    !tctx->prior_task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		merged_list = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
-		node = merged_list->first;
+		node1 = tctx->prior_task_list.first;
+		node2 = tctx->task_list.first;
 		INIT_WQ_LIST(&tctx->task_list);
 		INIT_WQ_LIST(&tctx->prior_task_list);
-		if (!node)
+		if (!node2 && !node1)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
-		if (!node)
+		if (!node2 && !node1)
 			break;
 
-		handle_tw_list(node, &ctx, &locked);
+		if (node1)
+			handle_prior_tw_list(node1);
+
+		if (node2)
+			handle_tw_list(node2, &ctx, &locked);
 		cond_resched();
 	}
 
-- 
2.24.4



* Re: [PATCH 1/6] io-wq: add helper to merge two wq_lists
  2021-10-29 12:22 ` [PATCH 1/6] io-wq: add helper to merge two wq_lists Hao Xu
@ 2021-11-17 22:41   ` Pavel Begunkov
  0 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2021-11-17 22:41 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 10/29/21 13:22, Hao Xu wrote:
> add a helper to merge two wq_lists, it will be useful in the next
> patches.
> 
> Signed-off-by: Hao Xu <[email protected]>
> ---
>   fs/io-wq.h | 21 +++++++++++++++++++++
>   1 file changed, 21 insertions(+)
> 
> diff --git a/fs/io-wq.h b/fs/io-wq.h
> index 41bf37674a49..a7b0b505db9d 100644
> --- a/fs/io-wq.h
> +++ b/fs/io-wq.h
> @@ -52,6 +52,27 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
>   		list->last = node;
>   }
>   
> +/**
> + * wq_list_merge - merge the second list to the first one.
> + * @list0: the first list
> + * @list1: the second list
> + * Return list0 if list1 is NULL, and vice versa.
> + * Otherwise after merge, list0 contains the merged list.
> + */
> +static inline struct io_wq_work_list *wq_list_merge(struct io_wq_work_list *list0,
> +						    struct io_wq_work_list *list1)

might be easier if it'd be a splice or even initialising both lists
and returning a node ptr. E.g. (untested)

struct node* wq_list_splice(list0, list1) {
	struct node *ret;

	if (!list0->first) {
		ret = list1->first;
	} else {
		ret = list0->first;
		list0->last->next = list1->first;
	}

	init(list0);
	init(list1);
	return ret;
}

> +{
> +	if (!list1 || !list1->first)

Can also get rid of NULL checks, i.e. !list1

> +		return list0;
> +
> +	if (!list0 || !list0->first)
> +		return list1;
> +
> +	list0->last->next = list1->first;
> +	list0->last = list1->last;
> +	return list0;
> +}
> +
>   static inline void wq_list_add_tail(struct io_wq_work_node *node,
>   				    struct io_wq_work_list *list)
>   {
> 

-- 
Pavel Begunkov
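
(An untested, fleshed-out version of the splice idea above, written
against the INIT_WQ_LIST() helper that io-wq.h already provides; a
sketch only.)

static inline struct io_wq_work_node *wq_list_splice(struct io_wq_work_list *list0,
						     struct io_wq_work_list *list1)
{
	struct io_wq_work_node *ret;

	if (!list0->first) {
		ret = list1->first;
	} else {
		ret = list0->first;
		/* append list1 behind list0 */
		list0->last->next = list1->first;
	}
	/* both lists are left empty, the caller walks the returned nodes */
	INIT_WQ_LIST(list0);
	INIT_WQ_LIST(list1);
	return ret;
}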


* Re: [PATCH 6/6] io_uring: batch completion in prior_task_list
  2021-10-29 12:22 ` [PATCH 6/6] io_uring: batch completion in prior_task_list Hao Xu
@ 2021-11-17 22:55   ` Pavel Begunkov
  2021-11-18 10:39     ` Hao Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2021-11-17 22:55 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 10/29/21 13:22, Hao Xu wrote:
> In previous patches, we have already gathered some tw with
> io_req_task_complete() as callback in prior_task_list, let's complete
> them in batch regardless uring lock. For instance, we are doing simple
> direct read, most task work will be io_req_task_complete(), with this
> patch we don't need to hold uring lock there for long time.
> 
> Signed-off-by: Hao Xu <[email protected]>
> ---
>   fs/io_uring.c | 52 ++++++++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 43 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 694195c086f3..565cd0b34f18 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2166,6 +2166,37 @@ static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
>   	return io_put_kbuf(req, req->kbuf);
>   }
>   
> +static void handle_prior_tw_list(struct io_wq_work_node *node)
> +{
> +	struct io_ring_ctx *ctx = NULL;
> +
> +	do {
> +		struct io_wq_work_node *next = node->next;
> +		struct io_kiocb *req = container_of(node, struct io_kiocb,
> +						    io_task_work.node);
> +		if (req->ctx != ctx) {
> +			if (ctx) {
> +				io_commit_cqring(ctx);
> +				spin_unlock(&ctx->completion_lock);
> +				io_cqring_ev_posted(ctx);
> +				percpu_ref_put(&ctx->refs);
> +			}
> +			ctx = req->ctx;
> +			percpu_ref_get(&ctx->refs);
> +			spin_lock(&ctx->completion_lock);
> +		}
> +		__io_req_complete_post(req, req->result, io_put_rw_kbuf(req));
> +		node = next;
> +	} while (node);
> +
> +	if (ctx) {
> +		io_commit_cqring(ctx);
> +		spin_unlock(&ctx->completion_lock);
> +		io_cqring_ev_posted(ctx);
> +		percpu_ref_put(&ctx->refs);
> +	}
> +}
> +
>   static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
>   {
>   	do {
> @@ -2193,25 +2224,28 @@ static void tctx_task_work(struct callback_head *cb)
>   						  task_work);
>   
>   	while (1) {
> -		struct io_wq_work_node *node;
> -		struct io_wq_work_list *merged_list;
> +		struct io_wq_work_node *node1, *node2;
>   
> -		if (!tctx->prior_task_list.first &&
> -		    !tctx->task_list.first && locked)
> +		if (!tctx->task_list.first &&
> +		    !tctx->prior_task_list.first && locked)
>   			io_submit_flush_completions(ctx);
>   
>   		spin_lock_irq(&tctx->task_lock);
> -		merged_list = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
> -		node = merged_list->first;
> +		node1 = tctx->prior_task_list.first;
> +		node2 = tctx->task_list.first;
>   		INIT_WQ_LIST(&tctx->task_list);
>   		INIT_WQ_LIST(&tctx->prior_task_list);
> -		if (!node)
> +		if (!node2 && !node1)
>   			tctx->task_running = false;
>   		spin_unlock_irq(&tctx->task_lock);
> -		if (!node)
> +		if (!node2 && !node1)
>   			break;
>   
> -		handle_tw_list(node, &ctx, &locked);
> +		if (node1)
> +			handle_prior_tw_list(node1);

IIUC, it moves all IRQ rw completions to this new path even when we already
have the lock. One concern is that io_submit_flush_completions() is better
optimised. Should probably be visible for single-threaded apps and a bunch of
other cases.

How about a combined scheme? if we can grab the lock, go through the old
path, otherwise handle_prior_tw_list(). The rest looks good, will formally
review once we deal with this one.

> +
> +		if (node2)
> +			handle_tw_list(node2, &ctx, &locked);
>   		cond_resched();
>   	}
>   
> 

-- 
Pavel Begunkov
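
(A rough, untested sketch of the combined scheme suggested above.
use_locked_path() is a hypothetical helper, not something from the
series: take the locked handle_tw_list() path when the first request's
uring_lock is uncontended, and fall back to the completion_lock batching
in handle_prior_tw_list() otherwise.)

/* hypothetical helper: point ctx at the first request's ring and try to
 * take its uring_lock so the existing locked batching path can be reused */
static bool use_locked_path(struct io_wq_work_node *node,
			    struct io_ring_ctx **ctx, bool *locked)
{
	struct io_kiocb *req = container_of(node, struct io_kiocb,
					    io_task_work.node);

	if (req->ctx != *ctx) {
		ctx_flush_and_put(*ctx, locked);
		*ctx = req->ctx;
		percpu_ref_get(&(*ctx)->refs);
		*locked = mutex_trylock(&(*ctx)->uring_lock);
	}
	return *locked;
}

/* ... and in tctx_task_work():
 *
 *	if (node1) {
 *		if (use_locked_path(node1, &ctx, &locked))
 *			handle_tw_list(node1, &ctx, &locked);
 *		else
 *			handle_prior_tw_list(node1);
 *	}
 *	if (node2)
 *		handle_tw_list(node2, &ctx, &locked);
 */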


* Re: [PATCH 2/6] io_uring: add a priority tw list for irq completion work
  2021-10-29 12:22 ` [PATCH 2/6] io_uring: add a priority tw list for irq completion work Hao Xu
@ 2021-11-17 23:03   ` Pavel Begunkov
  2021-11-24  7:53     ` Hao Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2021-11-17 23:03 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 10/29/21 13:22, Hao Xu wrote:
> Now we have a lot of task_work users, some are just to complete a req
> and generate a cqe. Let's put the work to a new tw list which has a
> higher priority, so that it can be handled quickly and thus to reduce
> avg req latency and users can issue next round of sqes earlier.
> An explanatory case:
> 
> origin timeline:
>      submit_sqe-->irq-->add completion task_work
>      -->run heavy work0~n-->run completion task_work
> now timeline:
>      submit_sqe-->irq-->add completion task_work
>      -->run completion task_work-->run heavy work0~n
> 
> Limitation: this optimization is only for those that submission and
> reaping process are in different threads. Otherwise anyhow we have to
> submit new sqes after returning to userspace, then the order of TWs
> doesn't matter.
> 
> Tested this patch(and the following ones) by manually replace
> __io_queue_sqe() in io_queue_sqe() by io_req_task_queue() to construct
> 'heavy' task works. Then test with fio:
> 
> ioengine=io_uring
> sqpoll=1
> thread=1
> bs=4k
> direct=1
> rw=randread
> time_based=1
> runtime=600
> randrepeat=0
> group_reporting=1
> filename=/dev/nvme0n1
> 
> Tried various iodepth.
> The peak IOPS for this patch is 710K, while the old one is 665K.
> For avg latency, difference shows when iodepth grow:
> depth and avg latency(usec):
> 	depth      new          old
> 	 1        7.05         7.10
> 	 2        8.47         8.60
> 	 4        10.42        10.42
> 	 8        13.78        13.22
> 	 16       27.41        24.33
> 	 32       49.40        53.08
> 	 64       102.53       103.36
> 	 128      196.98       205.61
> 	 256      372.99       414.88
>           512      747.23       791.30
>           1024     1472.59      1538.72
>           2048     3153.49      3329.01
>           4096     6387.86      6682.54
>           8192     12150.25     12774.14
>           16384    23085.58     26044.71
> 
> Signed-off-by: Hao Xu <[email protected]>
> ---
>   fs/io_uring.c | 38 +++++++++++++++++++++++++-------------
>   1 file changed, 25 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 17cb0e1b88f0..981794ee3f3f 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -467,6 +467,7 @@ struct io_uring_task {
>   
>   	spinlock_t		task_lock;
>   	struct io_wq_work_list	task_list;
> +	struct io_wq_work_list	prior_task_list;
>   	struct callback_head	task_work;
>   	bool			task_running;
>   };
> @@ -2148,13 +2149,17 @@ static void tctx_task_work(struct callback_head *cb)
>   
>   	while (1) {
>   		struct io_wq_work_node *node;
> +		struct io_wq_work_list *merged_list;
>   
> -		if (!tctx->task_list.first && locked)
> +		if (!tctx->prior_task_list.first &&
> +		    !tctx->task_list.first && locked)
>   			io_submit_flush_completions(ctx);
>   
>   		spin_lock_irq(&tctx->task_lock);
> -		node = tctx->task_list.first;
> +		merged_list = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
> +		node = merged_list->first;
>   		INIT_WQ_LIST(&tctx->task_list);
> +		INIT_WQ_LIST(&tctx->prior_task_list);
>   		if (!node)
>   			tctx->task_running = false;
>   		spin_unlock_irq(&tctx->task_lock);
> @@ -2183,19 +2188,23 @@ static void tctx_task_work(struct callback_head *cb)
>   	ctx_flush_and_put(ctx, &locked);
>   }
>   
> -static void io_req_task_work_add(struct io_kiocb *req)
> +static void io_req_task_work_add(struct io_kiocb *req, bool priority)
>   {
>   	struct task_struct *tsk = req->task;
>   	struct io_uring_task *tctx = tsk->io_uring;
>   	enum task_work_notify_mode notify;
>   	struct io_wq_work_node *node;
> +	struct io_wq_work_list *merged_list;
>   	unsigned long flags;
>   	bool running;
>   
>   	WARN_ON_ONCE(!tctx);
>   
>   	spin_lock_irqsave(&tctx->task_lock, flags);
> -	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
> +	if (priority)
> +		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
> +	else
> +		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>   	running = tctx->task_running;
>   	if (!running)
>   		tctx->task_running = true;
> @@ -2220,8 +2229,10 @@ static void io_req_task_work_add(struct io_kiocb *req)
>   
>   	spin_lock_irqsave(&tctx->task_lock, flags);
>   	tctx->task_running = false;
> -	node = tctx->task_list.first;
> +	merged_list = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
> +	node = merged_list->first;
>   	INIT_WQ_LIST(&tctx->task_list);
> +	INIT_WQ_LIST(&tctx->prior_task_list);
>   	spin_unlock_irqrestore(&tctx->task_lock, flags);
>   
>   	while (node) {
> @@ -2258,19 +2269,19 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
>   {
>   	req->result = ret;
>   	req->io_task_work.func = io_req_task_cancel;
> -	io_req_task_work_add(req);
> +	io_req_task_work_add(req, false);
>   }
>   
>   static void io_req_task_queue(struct io_kiocb *req)
>   {
>   	req->io_task_work.func = io_req_task_submit;
> -	io_req_task_work_add(req);
> +	io_req_task_work_add(req, false);
>   }
>   
>   static void io_req_task_queue_reissue(struct io_kiocb *req)
>   {
>   	req->io_task_work.func = io_queue_async_work;
> -	io_req_task_work_add(req);
> +	io_req_task_work_add(req, false);
>   }
>   
>   static inline void io_queue_next(struct io_kiocb *req)
> @@ -2375,7 +2386,7 @@ static inline void io_put_req_deferred(struct io_kiocb *req)
>   {
>   	if (req_ref_put_and_test(req)) {
>   		req->io_task_work.func = io_free_req_work;
> -		io_req_task_work_add(req);
> +		io_req_task_work_add(req, false);
>   	}
>   }
>   
> @@ -2678,7 +2689,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
>   		return;
>   	req->result = res;
>   	req->io_task_work.func = io_req_task_complete;
> -	io_req_task_work_add(req);
> +	io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));

I'm not sure this special case makes sense. I remembered you mentioned
that you measured it, but what's the reason? Can it be related to my
comments on 6/6?

>   }
>   
>   static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
> @@ -5243,7 +5254,7 @@ static inline int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *pol
>   	 * of executing it. We can't safely execute it anyway, as we may not
>   	 * have the needed state needed for it anyway.
>   	 */
> -	io_req_task_work_add(req);
> +	io_req_task_work_add(req, false);
>   	return 1;
>   }
>   
> @@ -5923,7 +5934,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
>   	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
>   
>   	req->io_task_work.func = io_req_task_timeout;
> -	io_req_task_work_add(req);
> +	io_req_task_work_add(req, false);
>   	return HRTIMER_NORESTART;
>   }
>   
> @@ -6889,7 +6900,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
>   	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
>   
>   	req->io_task_work.func = io_req_task_link_timeout;
> -	io_req_task_work_add(req);
> +	io_req_task_work_add(req, false);
>   	return HRTIMER_NORESTART;
>   }
>   
> @@ -8593,6 +8604,7 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
>   	task->io_uring = tctx;
>   	spin_lock_init(&tctx->task_lock);
>   	INIT_WQ_LIST(&tctx->task_list);
> +	INIT_WQ_LIST(&tctx->prior_task_list);
>   	init_task_work(&tctx->task_work, tctx_task_work);
>   	return 0;
>   }
> 

-- 
Pavel Begunkov


* Re: [PATCH 6/6] io_uring: batch completion in prior_task_list
  2021-11-17 22:55   ` Pavel Begunkov
@ 2021-11-18 10:39     ` Hao Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Hao Xu @ 2021-11-18 10:39 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/11/18 6:55 AM, Pavel Begunkov wrote:
> On 10/29/21 13:22, Hao Xu wrote:
>> In previous patches, we have already gathered some tw with
>> io_req_task_complete() as callback in prior_task_list, let's complete
>> them in batch regardless uring lock. For instance, we are doing simple
>> direct read, most task work will be io_req_task_complete(), with this
>> patch we don't need to hold uring lock there for long time.
>>
>> Signed-off-by: Hao Xu <[email protected]>
>> ---
>>   fs/io_uring.c | 52 ++++++++++++++++++++++++++++++++++++++++++---------
>>   1 file changed, 43 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 694195c086f3..565cd0b34f18 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -2166,6 +2166,37 @@ static inline unsigned int 
>> io_put_rw_kbuf(struct io_kiocb *req)
>>       return io_put_kbuf(req, req->kbuf);
>>   }
>> +static void handle_prior_tw_list(struct io_wq_work_node *node)
>> +{
>> +    struct io_ring_ctx *ctx = NULL;
>> +
>> +    do {
>> +        struct io_wq_work_node *next = node->next;
>> +        struct io_kiocb *req = container_of(node, struct io_kiocb,
>> +                            io_task_work.node);
>> +        if (req->ctx != ctx) {
>> +            if (ctx) {
>> +                io_commit_cqring(ctx);
>> +                spin_unlock(&ctx->completion_lock);
>> +                io_cqring_ev_posted(ctx);
>> +                percpu_ref_put(&ctx->refs);
>> +            }
>> +            ctx = req->ctx;
>> +            percpu_ref_get(&ctx->refs);
>> +            spin_lock(&ctx->completion_lock);
>> +        }
>> +        __io_req_complete_post(req, req->result, io_put_rw_kbuf(req));
>> +        node = next;
>> +    } while (node);
>> +
>> +    if (ctx) {
>> +        io_commit_cqring(ctx);
>> +        spin_unlock(&ctx->completion_lock);
>> +        io_cqring_ev_posted(ctx);
>> +        percpu_ref_put(&ctx->refs);
>> +    }
>> +}
>> +
>>   static void handle_tw_list(struct io_wq_work_node *node, struct 
>> io_ring_ctx **ctx, bool *locked)
>>   {
>>       do {
>> @@ -2193,25 +2224,28 @@ static void tctx_task_work(struct 
>> callback_head *cb)
>>                             task_work);
>>       while (1) {
>> -        struct io_wq_work_node *node;
>> -        struct io_wq_work_list *merged_list;
>> +        struct io_wq_work_node *node1, *node2;
>> -        if (!tctx->prior_task_list.first &&
>> -            !tctx->task_list.first && locked)
>> +        if (!tctx->task_list.first &&
>> +            !tctx->prior_task_list.first && locked)
>>               io_submit_flush_completions(ctx);
>>           spin_lock_irq(&tctx->task_lock);
>> -        merged_list = wq_list_merge(&tctx->prior_task_list, 
>> &tctx->task_list);
>> -        node = merged_list->first;
>> +        node1 = tctx->prior_task_list.first;
>> +        node2 = tctx->task_list.first;
>>           INIT_WQ_LIST(&tctx->task_list);
>>           INIT_WQ_LIST(&tctx->prior_task_list);
>> -        if (!node)
>> +        if (!node2 && !node1)
>>               tctx->task_running = false;
>>           spin_unlock_irq(&tctx->task_lock);
>> -        if (!node)
>> +        if (!node2 && !node1)
>>               break;
>> -        handle_tw_list(node, &ctx, &locked);
>> +        if (node1)
>> +            handle_prior_tw_list(node1);
> 
> IIUC, it moves all IRQ rw completions to this new path even when we already
> have the lock. One concern is that io_submit_flush_completions() is better
> optimised. Should probably be visible for one threaded apps and a bunch of
> other cases.
> 
> How about a combined scheme? if we can grab the lock, go through the old
> path, otherwise handle_prior_tw_list(). The rest looks good, will formally
> review once we deal with this one.
Thanks Pavel, I'll look into this patchset soon after
finishing some tests on my io-wq patchset.
> 
>> +
>> +        if (node2)
>> +            handle_tw_list(node2, &ctx, &locked);
>>           cond_resched();
>>       }
>>
> 



* Re: [PATCH 2/6] io_uring: add a priority tw list for irq completion work
  2021-11-17 23:03   ` Pavel Begunkov
@ 2021-11-24  7:53     ` Hao Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Hao Xu @ 2021-11-24  7:53 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/11/18 7:03 AM, Pavel Begunkov wrote:
> On 10/29/21 13:22, Hao Xu wrote:
>> Now we have a lot of task_work users, some are just to complete a req
>> and generate a cqe. Let's put the work to a new tw list which has a
>> higher priority, so that it can be handled quickly and thus to reduce
>> avg req latency and users can issue next round of sqes earlier.
>> An explanatory case:
>>
>> origin timeline:
>>      submit_sqe-->irq-->add completion task_work
>>      -->run heavy work0~n-->run completion task_work
>> now timeline:
>>      submit_sqe-->irq-->add completion task_work
>>      -->run completion task_work-->run heavy work0~n
>>
>> Limitation: this optimization is only for those that submission and
>> reaping process are in different threads. Otherwise anyhow we have to
>> submit new sqes after returning to userspace, then the order of TWs
>> doesn't matter.
>>
>> Tested this patch(and the following ones) by manually replace
>> __io_queue_sqe() in io_queue_sqe() by io_req_task_queue() to construct
>> 'heavy' task works. Then test with fio:
>>
>> ioengine=io_uring
>> sqpoll=1
>> thread=1
>> bs=4k
>> direct=1
>> rw=randread
>> time_based=1
>> runtime=600
>> randrepeat=0
>> group_reporting=1
>> filename=/dev/nvme0n1
>>
>> Tried various iodepth.
>> The peak IOPS for this patch is 710K, while the old one is 665K.
>> For avg latency, difference shows when iodepth grow:
>> depth and avg latency(usec):
>>     depth      new          old
>>      1        7.05         7.10
>>      2        8.47         8.60
>>      4        10.42        10.42
>>      8        13.78        13.22
>>      16       27.41        24.33
>>      32       49.40        53.08
>>      64       102.53       103.36
>>      128      196.98       205.61
>>      256      372.99       414.88
>>           512      747.23       791.30
>>           1024     1472.59      1538.72
>>           2048     3153.49      3329.01
>>           4096     6387.86      6682.54
>>           8192     12150.25     12774.14
>>           16384    23085.58     26044.71
>>
>> Signed-off-by: Hao Xu <[email protected]>
>> ---
>>   fs/io_uring.c | 38 +++++++++++++++++++++++++-------------
>>   1 file changed, 25 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 17cb0e1b88f0..981794ee3f3f 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -467,6 +467,7 @@ struct io_uring_task {
>>       spinlock_t        task_lock;
>>       struct io_wq_work_list    task_list;
>> +    struct io_wq_work_list    prior_task_list;
>>       struct callback_head    task_work;
>>       bool            task_running;
>>   };
>> @@ -2148,13 +2149,17 @@ static void tctx_task_work(struct 
>> callback_head *cb)
>>       while (1) {
>>           struct io_wq_work_node *node;
>> +        struct io_wq_work_list *merged_list;
>> -        if (!tctx->task_list.first && locked)
>> +        if (!tctx->prior_task_list.first &&
>> +            !tctx->task_list.first && locked)
>>               io_submit_flush_completions(ctx);
>>           spin_lock_irq(&tctx->task_lock);
>> -        node = tctx->task_list.first;
>> +        merged_list = wq_list_merge(&tctx->prior_task_list, 
>> &tctx->task_list);
>> +        node = merged_list->first;
>>           INIT_WQ_LIST(&tctx->task_list);
>> +        INIT_WQ_LIST(&tctx->prior_task_list);
>>           if (!node)
>>               tctx->task_running = false;
>>           spin_unlock_irq(&tctx->task_lock);
>> @@ -2183,19 +2188,23 @@ static void tctx_task_work(struct 
>> callback_head *cb)
>>       ctx_flush_and_put(ctx, &locked);
>>   }
>> -static void io_req_task_work_add(struct io_kiocb *req)
>> +static void io_req_task_work_add(struct io_kiocb *req, bool priority)
>>   {
>>       struct task_struct *tsk = req->task;
>>       struct io_uring_task *tctx = tsk->io_uring;
>>       enum task_work_notify_mode notify;
>>       struct io_wq_work_node *node;
>> +    struct io_wq_work_list *merged_list;
>>       unsigned long flags;
>>       bool running;
>>       WARN_ON_ONCE(!tctx);
>>       spin_lock_irqsave(&tctx->task_lock, flags);
>> -    wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>> +    if (priority)
>> +        wq_list_add_tail(&req->io_task_work.node, 
>> &tctx->prior_task_list);
>> +    else
>> +        wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>>       running = tctx->task_running;
>>       if (!running)
>>           tctx->task_running = true;
>> @@ -2220,8 +2229,10 @@ static void io_req_task_work_add(struct 
>> io_kiocb *req)
>>       spin_lock_irqsave(&tctx->task_lock, flags);
>>       tctx->task_running = false;
>> -    node = tctx->task_list.first;
>> +    merged_list = wq_list_merge(&tctx->prior_task_list, 
>> &tctx->task_list);
>> +    node = merged_list->first;
>>       INIT_WQ_LIST(&tctx->task_list);
>> +    INIT_WQ_LIST(&tctx->prior_task_list);
>>       spin_unlock_irqrestore(&tctx->task_lock, flags);
>>       while (node) {
>> @@ -2258,19 +2269,19 @@ static void io_req_task_queue_fail(struct 
>> io_kiocb *req, int ret)
>>   {
>>       req->result = ret;
>>       req->io_task_work.func = io_req_task_cancel;
>> -    io_req_task_work_add(req);
>> +    io_req_task_work_add(req, false);
>>   }
>>   static void io_req_task_queue(struct io_kiocb *req)
>>   {
>>       req->io_task_work.func = io_req_task_submit;
>> -    io_req_task_work_add(req);
>> +    io_req_task_work_add(req, false);
>>   }
>>   static void io_req_task_queue_reissue(struct io_kiocb *req)
>>   {
>>       req->io_task_work.func = io_queue_async_work;
>> -    io_req_task_work_add(req);
>> +    io_req_task_work_add(req, false);
>>   }
>>   static inline void io_queue_next(struct io_kiocb *req)
>> @@ -2375,7 +2386,7 @@ static inline void io_put_req_deferred(struct 
>> io_kiocb *req)
>>   {
>>       if (req_ref_put_and_test(req)) {
>>           req->io_task_work.func = io_free_req_work;
>> -        io_req_task_work_add(req);
>> +        io_req_task_work_add(req, false);
>>       }
>>   }
>> @@ -2678,7 +2689,7 @@ static void io_complete_rw(struct kiocb *kiocb, 
>> long res, long res2)
>>           return;
>>       req->result = res;
>>       req->io_task_work.func = io_req_task_complete;
>> -    io_req_task_work_add(req);
>> +    io_req_task_work_add(req, !!(req->ctx->flags & 
>> IORING_SETUP_SQPOLL));
> 
> I'm not sure this special case makes sense. I remembered you mentioned
> that you measured it, but what's the reason? Can it be related to my
> comments on 6/6?
The discussion is here:

https://lore.kernel.org/io-uring/[email protected]/

https://lore.kernel.org/io-uring/[email protected]/


* [PATCH 2/6] io_uring: add a priority tw list for irq completion work
  2021-11-24 12:21 [PATCH v5 0/6] task work optimization Hao Xu
@ 2021-11-24 12:21 ` Hao Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Hao Xu @ 2021-11-24 12:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Now we have a lot of task_work users, and some of them just complete a
req and generate a cqe. Let's put that work on a new tw list which has a
higher priority, so that it can be handled quickly; this reduces avg req
latency and lets users issue the next round of sqes earlier.
An explanatory case:

origin timeline:
    submit_sqe-->irq-->add completion task_work
    -->run heavy work0~n-->run completion task_work
now timeline:
    submit_sqe-->irq-->add completion task_work
    -->run completion task_work-->run heavy work0~n

Limitation: this optimization only applies when the submission and
reaping processes run in different threads. Otherwise we have to submit
new sqes only after returning to userspace anyway, so the order of TWs
doesn't matter.

Tested this patch (and the following ones) by manually replacing
__io_queue_sqe() in io_queue_sqe() with io_req_task_queue() to construct
'heavy' task works, then tested with fio:

ioengine=io_uring
sqpoll=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1

Tried various iodepth values.
The peak IOPS with this patch is 710K, while the old one is 665K.
For avg latency, the difference shows as iodepth grows:
depth and avg latency (usec):
	depth      new          old
	 1        7.05         7.10
	 2        8.47         8.60
	 4        10.42        10.42
	 8        13.78        13.22
	 16       27.41        24.33
	 32       49.40        53.08
	 64       102.53       103.36
	 128      196.98       205.61
	 256      372.99       414.88
	 512      747.23       791.30
	 1024     1472.59      1538.72
	 2048     3153.49      3329.01
	 4096     6387.86      6682.54
	 8192     12150.25     12774.14
	 16384    23085.58     26044.71

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index c887e4e19e9e..aca68c4a7964 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -467,6 +467,7 @@ struct io_uring_task {
 
 	spinlock_t		task_lock;
 	struct io_wq_work_list	task_list;
+	struct io_wq_work_list	prior_task_list;
 	struct callback_head	task_work;
 	bool			task_running;
 };
@@ -2148,12 +2149,12 @@ static void tctx_task_work(struct callback_head *cb)
 	while (1) {
 		struct io_wq_work_node *node;
 
-		if (!tctx->task_list.first && locked)
+		if (!tctx->prior_task_list.first &&
+		    !tctx->task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		node = tctx->task_list.first;
-		INIT_WQ_LIST(&tctx->task_list);
+		node = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
 		if (!node)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
@@ -2182,7 +2183,7 @@ static void tctx_task_work(struct callback_head *cb)
 	ctx_flush_and_put(ctx, &locked);
 }
 
-static void io_req_task_work_add(struct io_kiocb *req)
+static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 {
 	struct task_struct *tsk = req->task;
 	struct io_uring_task *tctx = tsk->io_uring;
@@ -2194,7 +2195,10 @@ static void io_req_task_work_add(struct io_kiocb *req)
 	WARN_ON_ONCE(!tctx);
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
-	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
+	if (priority)
+		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
+	else
+		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
 	running = tctx->task_running;
 	if (!running)
 		tctx->task_running = true;
@@ -2219,8 +2223,7 @@ static void io_req_task_work_add(struct io_kiocb *req)
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
 	tctx->task_running = false;
-	node = tctx->task_list.first;
-	INIT_WQ_LIST(&tctx->task_list);
+	node = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
 	spin_unlock_irqrestore(&tctx->task_lock, flags);
 
 	while (node) {
@@ -2257,19 +2260,19 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
 {
 	req->result = ret;
 	req->io_task_work.func = io_req_task_cancel;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static void io_req_task_queue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_req_task_submit;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static void io_req_task_queue_reissue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_queue_async_work;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static inline void io_queue_next(struct io_kiocb *req)
@@ -2374,7 +2377,7 @@ static inline void io_put_req_deferred(struct io_kiocb *req)
 {
 	if (req_ref_put_and_test(req)) {
 		req->io_task_work.func = io_free_req_work;
-		io_req_task_work_add(req);
+		io_req_task_work_add(req, false);
 	}
 }
 
@@ -2677,7 +2680,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 		return;
 	req->result = res;
 	req->io_task_work.func = io_req_task_complete;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
 }
 
 static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
@@ -5248,7 +5251,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
 	 * of executing it. We can't safely execute it anyway, as we may not
 	 * have the needed state needed for it anyway.
 	 */
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return 1;
 }
 
@@ -5914,7 +5917,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -6880,7 +6883,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_link_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -8584,6 +8587,7 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
 	task->io_uring = tctx;
 	spin_lock_init(&tctx->task_lock);
 	INIT_WQ_LIST(&tctx->task_list);
+	INIT_WQ_LIST(&tctx->prior_task_list);
 	init_task_work(&tctx->task_work, tctx_task_work);
 	return 0;
 }
-- 
2.24.4



* [PATCH 2/6] io_uring: add a priority tw list for irq completion work
  2021-11-26 10:07 [PATCH v6 0/6] task work optimization Hao Xu
@ 2021-11-26 10:07 ` Hao Xu
  0 siblings, 0 replies; 14+ messages in thread
From: Hao Xu @ 2021-11-26 10:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Now we have a lot of task_work users, and some of them just complete a
req and generate a cqe. Let's put that work on a new tw list which has a
higher priority, so that it can be handled quickly; this reduces avg req
latency and lets users issue the next round of sqes earlier.
An explanatory case:

origin timeline:
    submit_sqe-->irq-->add completion task_work
    -->run heavy work0~n-->run completion task_work
now timeline:
    submit_sqe-->irq-->add completion task_work
    -->run completion task_work-->run heavy work0~n

Limitation: this optimization only applies when the submission and
reaping processes run in different threads. Otherwise we have to submit
new sqes only after returning to userspace anyway, so the order of TWs
doesn't matter.

Tested this patch (and the following ones) by manually replacing
__io_queue_sqe() in io_queue_sqe() with io_req_task_queue() to construct
'heavy' task works, then tested with fio:

ioengine=io_uring
sqpoll=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1

Tried various iodepth values.
The peak IOPS with this patch is 710K, while the old one is 665K.
For avg latency, the difference shows as iodepth grows:
depth and avg latency (usec):
	depth      new          old
	 1        7.05         7.10
	 2        8.47         8.60
	 4        10.42        10.42
	 8        13.78        13.22
	 16       27.41        24.33
	 32       49.40        53.08
	 64       102.53       103.36
	 128      196.98       205.61
	 256      372.99       414.88
	 512      747.23       791.30
	 1024     1472.59      1538.72
	 2048     3153.49      3329.01
	 4096     6387.86      6682.54
	 8192     12150.25     12774.14
	 16384    23085.58     26044.71

Signed-off-by: Hao Xu <[email protected]>
[pavel: rebase, fixup new *work_add call sites]
Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 08b1b3de9b3f..821aa0f8c643 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -474,6 +474,7 @@ struct io_uring_task {
 
 	spinlock_t		task_lock;
 	struct io_wq_work_list	task_list;
+	struct io_wq_work_list	prior_task_list;
 	struct callback_head	task_work;
 	bool			task_running;
 };
@@ -2206,12 +2207,12 @@ static void tctx_task_work(struct callback_head *cb)
 	while (1) {
 		struct io_wq_work_node *node;
 
-		if (!tctx->task_list.first && locked)
+		if (!tctx->prior_task_list.first &&
+		    !tctx->task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		node = tctx->task_list.first;
-		INIT_WQ_LIST(&tctx->task_list);
+		node = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
 		if (!node)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
@@ -2240,7 +2241,7 @@ static void tctx_task_work(struct callback_head *cb)
 	ctx_flush_and_put(ctx, &locked);
 }
 
-static void io_req_task_work_add(struct io_kiocb *req)
+static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 {
 	struct task_struct *tsk = req->task;
 	struct io_uring_task *tctx = tsk->io_uring;
@@ -2252,7 +2253,10 @@ static void io_req_task_work_add(struct io_kiocb *req)
 	WARN_ON_ONCE(!tctx);
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
-	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
+	if (priority)
+		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
+	else
+		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
 	running = tctx->task_running;
 	if (!running)
 		tctx->task_running = true;
@@ -2277,8 +2281,7 @@ static void io_req_task_work_add(struct io_kiocb *req)
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
 	tctx->task_running = false;
-	node = tctx->task_list.first;
-	INIT_WQ_LIST(&tctx->task_list);
+	node = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
 	spin_unlock_irqrestore(&tctx->task_lock, flags);
 
 	while (node) {
@@ -2315,19 +2318,19 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
 {
 	req->result = ret;
 	req->io_task_work.func = io_req_task_cancel;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static void io_req_task_queue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_req_task_submit;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static void io_req_task_queue_reissue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_queue_async_work;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static inline void io_queue_next(struct io_kiocb *req)
@@ -2437,7 +2440,7 @@ static inline void io_put_req_deferred(struct io_kiocb *req)
 {
 	if (req_ref_put_and_test(req)) {
 		req->io_task_work.func = io_free_req_work;
-		io_req_task_work_add(req);
+		io_req_task_work_add(req, false);
 	}
 }
 
@@ -2742,7 +2745,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
 		return;
 	req->result = res;
 	req->io_task_work.func = io_req_task_complete;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
 }
 
 static void io_complete_rw_iopoll(struct kiocb *kiocb, long res)
@@ -2984,7 +2987,7 @@ static void kiocb_done(struct io_kiocb *req, ssize_t ret,
 			req_set_fail(req);
 			req->result = ret;
 			req->io_task_work.func = io_req_task_complete;
-			io_req_task_work_add(req);
+			io_req_task_work_add(req, false);
 		}
 	}
 }
@@ -5317,7 +5320,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
 	 * of executing it. We can't safely execute it anyway, as we may not
 	 * have the needed state needed for it anyway.
 	 */
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return 1;
 }
 
@@ -5985,7 +5988,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -6960,7 +6963,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_link_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -8675,6 +8678,7 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
 	task->io_uring = tctx;
 	spin_lock_init(&tctx->task_lock);
 	INIT_WQ_LIST(&tctx->task_list);
+	INIT_WQ_LIST(&tctx->prior_task_list);
 	init_task_work(&tctx->task_work, tctx_task_work);
 	return 0;
 }
-- 
2.24.4



end of thread

Thread overview: 14+ messages
2021-10-29 12:22 [PATCH for-5.16 v4 0/6] task work optimization Hao Xu
2021-10-29 12:22 ` [PATCH 1/6] io-wq: add helper to merge two wq_lists Hao Xu
2021-11-17 22:41   ` Pavel Begunkov
2021-10-29 12:22 ` [PATCH 2/6] io_uring: add a priority tw list for irq completion work Hao Xu
2021-11-17 23:03   ` Pavel Begunkov
2021-11-24  7:53     ` Hao Xu
2021-10-29 12:22 ` [PATCH 3/6] io_uring: add helper for task work execution code Hao Xu
2021-10-29 12:22 ` [PATCH 4/6] io_uring: split io_req_complete_post() and add a helper Hao Xu
2021-10-29 12:22 ` [PATCH 5/6] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
2021-10-29 12:22 ` [PATCH 6/6] io_uring: batch completion in prior_task_list Hao Xu
2021-11-17 22:55   ` Pavel Begunkov
2021-11-18 10:39     ` Hao Xu
  -- strict thread matches above, loose matches on Subject: below --
2021-11-24 12:21 [PATCH v5 0/6] task work optimization Hao Xu
2021-11-24 12:21 ` [PATCH 2/6] io_uring: add a priority tw list for irq completion work Hao Xu
2021-11-26 10:07 [PATCH v6 0/6] task work optimization Hao Xu
2021-11-26 10:07 ` [PATCH 2/6] io_uring: add a priority tw list for irq completion work Hao Xu
