[PATCH 0/6] task_work optimization


* [PATCH 0/6] task_work optimization
@ 2021-09-27  6:17 Hao Xu
  2021-09-27  6:17 ` [PATCH 1/8] io-wq: code clean for io_wq_add_work_after() Hao Xu
                   ` (8 more replies)
  0 siblings, 9 replies; 24+ messages in thread
From: Hao Xu @ 2021-09-27  6:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

The main patches are 3/6 and 6/6. Patch 3/6 adds a new task list whose
task works are completed before the normal task works in the old task
list.

Patch 6/6 is an optimization that batches completion of the task works
in the new task list when they all share the same ctx, which is the
normal case. The benefit is that we now batch them regardless of
uring_lock.

I tested this patchset by manually replacing __io_queue_sqe() with
io_req_task_complete() to construct 'heavy' task works, then ran fio
with:

ioengine=io_uring
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1

I tried various iodepth values.
The peak IOPS with this patchset is 314K, versus 249K without it.
For average latency, the difference shows as iodepth grows.
Depth and average latency (usec):
	depth      new          old
	 1        22.80        23.77
	 2        23.48        24.54
	 4        24.26        25.57
	 8        29.21        32.89
	 16       53.61        63.50
	 32       106.29       131.34
	 64       217.21       256.33
	 128      421.59       513.87
	 256      815.15       1050.99

without this patchset
iodepth=1
clat percentiles (usec):
 |  1.00th=[    7],  5.00th=[    7], 10.00th=[    8], 20.00th=[    8],
 | 30.00th=[    8], 40.00th=[    8], 50.00th=[    8], 60.00th=[    8],
 | 70.00th=[    8], 80.00th=[    8], 90.00th=[   82], 95.00th=[   97],
 | 99.00th=[   99], 99.50th=[   99], 99.90th=[  100], 99.95th=[  101],
 | 99.99th=[  126]
iodepth=2
clat percentiles (usec):
 |  1.00th=[    7],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
 | 30.00th=[    8], 40.00th=[    8], 50.00th=[    8], 60.00th=[    9],
 | 70.00th=[   10], 80.00th=[   10], 90.00th=[   83], 95.00th=[   97],
 | 99.00th=[  100], 99.50th=[  102], 99.90th=[  126], 99.95th=[  145],
 | 99.99th=[  971]
iodepth=4
clat percentiles (usec):
 |  1.00th=[    7],  5.00th=[    7], 10.00th=[    8], 20.00th=[    8],
 | 30.00th=[    8], 40.00th=[    9], 50.00th=[    9], 60.00th=[   10],
 | 70.00th=[   11], 80.00th=[   13], 90.00th=[   86], 95.00th=[   98],
 | 99.00th=[  105], 99.50th=[  115], 99.90th=[  139], 99.95th=[  149],
 | 99.99th=[  169]
iodepth=8
clat percentiles (usec):
 |  1.00th=[    7],  5.00th=[    8], 10.00th=[    9], 20.00th=[   12],
 | 30.00th=[   13], 40.00th=[   16], 50.00th=[   18], 60.00th=[   20],
 | 70.00th=[   22], 80.00th=[   27], 90.00th=[   95], 95.00th=[  105],
 | 99.00th=[  121], 99.50th=[  131], 99.90th=[  157], 99.95th=[  167],
 | 99.99th=[  206]
iodepth=16
clat percentiles (usec):
 |  1.00th=[   25],  5.00th=[   33], 10.00th=[   37], 20.00th=[   41],
 | 30.00th=[   44], 40.00th=[   46], 50.00th=[   49], 60.00th=[   51],
 | 70.00th=[   55], 80.00th=[   63], 90.00th=[  125], 95.00th=[  137],
 | 99.00th=[  155], 99.50th=[  165], 99.90th=[  198], 99.95th=[  235],
 | 99.99th=[ 1844]
iodepth=32
clat percentiles (usec):
 |  1.00th=[   92],  5.00th=[   98], 10.00th=[  102], 20.00th=[  106],
 | 30.00th=[  110], 40.00th=[  112], 50.00th=[  116], 60.00th=[  120],
 | 70.00th=[  128], 80.00th=[  141], 90.00th=[  192], 95.00th=[  204],
 | 99.00th=[  227], 99.50th=[  235], 99.90th=[  260], 99.95th=[  273],
 | 99.99th=[  322]
iodepth=64
clat percentiles (usec):
 |  1.00th=[  221],  5.00th=[  227], 10.00th=[  231], 20.00th=[  233],
 | 30.00th=[  237], 40.00th=[  239], 50.00th=[  241], 60.00th=[  243],
 | 70.00th=[  247], 80.00th=[  253], 90.00th=[  318], 95.00th=[  330],
 | 99.00th=[  351], 99.50th=[  359], 99.90th=[  388], 99.95th=[  400],
 | 99.99th=[  529]
iodepth=128
clat percentiles (usec):
 |  1.00th=[  465],  5.00th=[  478], 10.00th=[  482], 20.00th=[  486],
 | 30.00th=[  490], 40.00th=[  490], 50.00th=[  494], 60.00th=[  498],
 | 70.00th=[  506], 80.00th=[  553], 90.00th=[  578], 95.00th=[  586],
 | 99.00th=[  635], 99.50th=[  652], 99.90th=[  676], 99.95th=[  717],
 | 99.99th=[ 2278]
iodepth=256
clat percentiles (usec):
 |  1.00th=[  979],  5.00th=[  988], 10.00th=[  996], 20.00th=[ 1012],
 | 30.00th=[ 1020], 40.00th=[ 1037], 50.00th=[ 1037], 60.00th=[ 1045],
 | 70.00th=[ 1057], 80.00th=[ 1090], 90.00th=[ 1123], 95.00th=[ 1139],
 | 99.00th=[ 1205], 99.50th=[ 1237], 99.90th=[ 1254], 99.95th=[ 1270],
 | 99.99th=[ 1385]

with this patchset
iodepth=1
clat percentiles (usec):
 |  1.00th=[    7],  5.00th=[    7], 10.00th=[    7], 20.00th=[    7],
 | 30.00th=[    7], 40.00th=[    7], 50.00th=[    8], 60.00th=[    8],
 | 70.00th=[    8], 80.00th=[    8], 90.00th=[   82], 95.00th=[   97],
 | 99.00th=[   99], 99.50th=[   99], 99.90th=[  100], 99.95th=[  101],
 | 99.99th=[  125]
iodepth=2
clat percentiles (usec):
 |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    7],
 | 30.00th=[    7], 40.00th=[    7], 50.00th=[    8], 60.00th=[    8],
 | 70.00th=[    9], 80.00th=[   11], 90.00th=[   83], 95.00th=[   97],
 | 99.00th=[  100], 99.50th=[  102], 99.90th=[  127], 99.95th=[  141],
 | 99.99th=[  668]
iodepth=4
clat percentiles (usec):
 |  1.00th=[    6],  5.00th=[    6], 10.00th=[    7], 20.00th=[    7],
 | 30.00th=[    7], 40.00th=[    8], 50.00th=[    8], 60.00th=[    9],
 | 70.00th=[   10], 80.00th=[   12], 90.00th=[   85], 95.00th=[   97],
 | 99.00th=[  104], 99.50th=[  115], 99.90th=[  141], 99.95th=[  149],
 | 99.99th=[  194]
iodepth=8
clat percentiles (usec):
 |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    9],
 | 30.00th=[   11], 40.00th=[   12], 50.00th=[   14], 60.00th=[   15],
 | 70.00th=[   18], 80.00th=[   22], 90.00th=[   93], 95.00th=[  103],
 | 99.00th=[  120], 99.50th=[  130], 99.90th=[  157], 99.95th=[  167],
 | 99.99th=[  208]
iodepth=16
clat percentiles (usec):
 |  1.00th=[   16],  5.00th=[   24], 10.00th=[   28], 20.00th=[   32],
 | 30.00th=[   34], 40.00th=[   37], 50.00th=[   39], 60.00th=[   41],
 | 70.00th=[   44], 80.00th=[   51], 90.00th=[  117], 95.00th=[  128],
 | 99.00th=[  147], 99.50th=[  159], 99.90th=[  194], 99.95th=[  235],
 | 99.99th=[ 1909]
iodepth=32
clat percentiles (usec):
 |  1.00th=[   72],  5.00th=[   78], 10.00th=[   81], 20.00th=[   84],
 | 30.00th=[   86], 40.00th=[   88], 50.00th=[   90], 60.00th=[   93],
 | 70.00th=[   96], 80.00th=[  114], 90.00th=[  169], 95.00th=[  182],
 | 99.00th=[  202], 99.50th=[  212], 99.90th=[  239], 99.95th=[  253],
 | 99.99th=[  302]
iodepth=64
clat percentiles (usec):
 |  1.00th=[  178],  5.00th=[  184], 10.00th=[  186], 20.00th=[  192],
 | 30.00th=[  196], 40.00th=[  200], 50.00th=[  204], 60.00th=[  206],
 | 70.00th=[  210], 80.00th=[  221], 90.00th=[  281], 95.00th=[  293],
 | 99.00th=[  318], 99.50th=[  330], 99.90th=[  355], 99.95th=[  367],
 | 99.99th=[  437]
iodepth=128
clat percentiles (usec):
 |  1.00th=[  379],  5.00th=[  388], 10.00th=[  392], 20.00th=[  396],
 | 30.00th=[  396], 40.00th=[  400], 50.00th=[  404], 60.00th=[  408],
 | 70.00th=[  424], 80.00th=[  437], 90.00th=[  482], 95.00th=[  498],
 | 99.00th=[  529], 99.50th=[  537], 99.90th=[  570], 99.95th=[  635],
 | 99.99th=[ 2311]
iodepth=256
clat percentiles (usec):
 |  1.00th=[  783],  5.00th=[  783], 10.00th=[  791], 20.00th=[  791],
 | 30.00th=[  791], 40.00th=[  799], 50.00th=[  799], 60.00th=[  799],
 | 70.00th=[  807], 80.00th=[  816], 90.00th=[  881], 95.00th=[  889],
 | 99.00th=[  914], 99.50th=[  930], 99.90th=[  979], 99.95th=[  996],
 | 99.99th=[ 1237]


Hao Xu (8):
  io-wq: code clean for io_wq_add_work_after()
  io-wq: add helper to merge two wq_lists
  io_uring: add a limited tw list for irq completion work
  io_uring: add helper for task work execution code
  io_uring: split io_req_complete_post() and add a helper
  io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
  io_uring: add tw_ctx for io_uring_task
  io_uring: batch completion in prior_task_list

 fs/io-wq.h    |  26 ++++++--
 fs/io_uring.c | 170 ++++++++++++++++++++++++++++++++++----------------
 2 files changed, 137 insertions(+), 59 deletions(-)

-- 
2.24.4



* [PATCH 1/8] io-wq: code clean for io_wq_add_work_after()
  2021-09-27  6:17 [PATCH 0/6] task_work optimization Hao Xu
@ 2021-09-27  6:17 ` Hao Xu
  2021-09-28 11:08   ` Pavel Begunkov
  2021-09-27  6:17 ` [PATCH 2/8] io-wq: add helper to merge two wq_lists Hao Xu
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Hao Xu @ 2021-09-27  6:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Remove a local variable.

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io-wq.h | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/io-wq.h b/fs/io-wq.h
index bf5c4c533760..8369a51b65c0 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -33,11 +33,9 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
 				     struct io_wq_work_node *pos,
 				     struct io_wq_work_list *list)
 {
-	struct io_wq_work_node *next = pos->next;
-
+	node->next = pos->next;
 	pos->next = node;
-	node->next = next;
-	if (!next)
+	if (!node->next)
 		list->last = node;
 }
 
-- 
2.24.4



* [PATCH 2/8] io-wq: add helper to merge two wq_lists
  2021-09-27  6:17 [PATCH 0/6] task_work optimization Hao Xu
  2021-09-27  6:17 ` [PATCH 1/8] io-wq: code clean for io_wq_add_work_after() Hao Xu
@ 2021-09-27  6:17 ` Hao Xu
  2021-09-27 10:17   ` Hao Xu
  2021-09-28 11:10   ` Pavel Begunkov
  2021-09-27  6:17 ` [PATCH 3/8] io_uring: add a limited tw list for irq completion work Hao Xu
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 24+ messages in thread
From: Hao Xu @ 2021-09-27  6:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Add a helper to merge two wq_lists; it will be useful in later
patches.
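
As a rough illustration (not part of this patch), a merge of two
singly linked lists with first/last pointers can be sketched in
userspace C as below. The names struct node, struct list, list_init(),
list_add_tail() and list_merge() are hypothetical stand-ins for the
kernel types. Unlike the helper in the diff, which tests the list
pointers themselves against NULL, this sketch assumes that an empty
list is one whose ->first is NULL.

```c
#include <stddef.h>

/* Userspace stand-ins for io_wq_work_node / io_wq_work_list (assumed shape). */
struct node {
	struct node *next;
};

struct list {
	struct node *first;
	struct node *last;
};

static void list_init(struct list *l)
{
	l->first = NULL;
	l->last = NULL;
}

static void list_add_tail(struct node *n, struct list *l)
{
	n->next = NULL;
	if (!l->first)
		l->first = n;
	else
		l->last->next = n;
	l->last = n;
}

/*
 * Append every node of src to dst and leave src empty.
 * Emptiness is decided by ->first, not by the list pointer.
 */
static void list_merge(struct list *dst, struct list *src)
{
	if (!src->first)
		return;				/* nothing to append */

	if (!dst->first)
		dst->first = src->first;	/* dst was empty: adopt src's head */
	else
		dst->last->next = src->first;	/* splice src after dst's tail */
	dst->last = src->last;
	list_init(src);
}
```

Resetting src inside the merge is a design choice of this sketch; the
patches in this series reinitialize both lists with INIT_WQ_LIST() at
the call site instead.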

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io-wq.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/fs/io-wq.h b/fs/io-wq.h
index 8369a51b65c0..7510b05d4a86 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -39,6 +39,26 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
 		list->last = node;
 }
 
+/**
+ * wq_list_merge - merge the second list to the first one.
+ * @list0: the first list
+ * @list1: the second list
+ * after merge, list0 contains the merged list.
+ */
+static inline void wq_list_merge(struct io_wq_work_list *list0,
+				     struct io_wq_work_list *list1)
+{
+	if (!list1)
+		return;
+
+	if (!list0) {
+		list0 = list1;
+		return;
+	}
+	list0->last->next = list1->first;
+	list0->last = list1->last;
+}
+
 static inline void wq_list_add_tail(struct io_wq_work_node *node,
 				    struct io_wq_work_list *list)
 {
-- 
2.24.4



* [PATCH 3/8] io_uring: add a limited tw list for irq completion work
  2021-09-27  6:17 [PATCH 0/6] task_work optimization Hao Xu
  2021-09-27  6:17 ` [PATCH 1/8] io-wq: code clean for io_wq_add_work_after() Hao Xu
  2021-09-27  6:17 ` [PATCH 2/8] io-wq: add helper to merge two wq_lists Hao Xu
@ 2021-09-27  6:17 ` Hao Xu
  2021-09-28 11:29   ` Pavel Begunkov
  2021-09-27  6:17 ` [PATCH 4/8] io_uring: add helper for task work execution code Hao Xu
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 24+ messages in thread
From: Hao Xu @ 2021-09-27  6:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

We now have a lot of task_work users, and some of them just complete a
req and generate a cqe. Let's put such work on a new tw list that has a
higher priority, so that it is handled quickly and average req latency
is reduced. An illustrative case:

original timeline:
    submit_sqe-->irq-->add completion task_work
    -->run heavy work0~n-->run completion task_work
new timeline:
    submit_sqe-->irq-->add completion task_work
    -->run completion task_work-->run heavy work0~n

One thing to watch out for: sometimes irq completion TWs arrive in
overwhelming numbers, which makes the new tw list grow fast and starves
the TWs in the old list. So we have to limit the length of the new tw
list. A practical value is 1/3:
    len of new tw list < 1/3 * (len of new + old tw list)

In this way, the new tw list has a limited length and normal task works
get their chance to run.
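
The admission rule can be modelled in userspace as below. This is only
a sketch: struct tw_counters and tw_enqueue() are hypothetical names,
and the condition is copied from the check in the diff
(prior_nr * MAX_EMERGENCY_TW_RATIO < nr), evaluated before either
counter is incremented.

```c
#include <stdbool.h>

#define MAX_EMERGENCY_TW_RATIO 3

/* Hypothetical model of the two counters kept in io_uring_task. */
struct tw_counters {
	unsigned int nr;	/* items queued on both lists together */
	unsigned int prior_nr;	/* items queued on the priority list */
};

/* Returns true if the item would be placed on the priority list. */
static bool tw_enqueue(struct tw_counters *c, bool emergency)
{
	bool prior = emergency &&
		     c->prior_nr * MAX_EMERGENCY_TW_RATIO < c->nr;

	if (prior)
		c->prior_nr++;
	c->nr++;
	return prior;
}
```

In this model, a stream consisting only of emergency items puts about
one item in three on the priority list, and the very first item goes to
the normal list because both counters start at zero.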

I tested this patch (and the following ones) by manually replacing
__io_queue_sqe() with io_req_task_complete() to construct 'heavy' task
works, then ran fio with:

ioengine=io_uring
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1

I tried various iodepth values.
The peak IOPS with this patch is 314K, versus 249K without it.
For average latency, the difference shows as iodepth grows.
Depth and average latency (usec):
	depth      new          old
	 1        22.80        23.77
	 2        23.48        24.54
	 4        24.26        25.57
	 8        29.21        32.89
	 16       53.61        63.50
	 32       106.29       131.34
	 64       217.21       256.33
	 128      421.59       513.87
	 256      815.15       1050.99

More data (95th and 99th percentiles, etc.) is in the cover letter.

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 44 +++++++++++++++++++++++++++++++-------------
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8317c360f7a4..9272b2cfcfb7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -461,6 +461,7 @@ struct io_ring_ctx {
 	};
 };
 
+#define MAX_EMERGENCY_TW_RATIO	3
 struct io_uring_task {
 	/* submission side */
 	int			cached_refs;
@@ -475,6 +476,9 @@ struct io_uring_task {
 	spinlock_t		task_lock;
 	struct io_wq_work_list	task_list;
 	struct callback_head	task_work;
+	struct io_wq_work_list	prior_task_list;
+	unsigned int		nr;
+	unsigned int		prior_nr;
 	bool			task_running;
 };
 
@@ -2132,12 +2136,16 @@ static void tctx_task_work(struct callback_head *cb)
 	while (1) {
 		struct io_wq_work_node *node;
 
-		if (!tctx->task_list.first && locked)
+		if (!tctx->prior_task_list.first &&
+		    !tctx->task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		node = tctx->task_list.first;
+		wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
+		node = tctx->prior_task_list.first;
 		INIT_WQ_LIST(&tctx->task_list);
+		INIT_WQ_LIST(&tctx->prior_task_list);
+		tctx->nr = tctx->prior_nr = 0;
 		if (!node)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
@@ -2166,7 +2174,7 @@ static void tctx_task_work(struct callback_head *cb)
 	ctx_flush_and_put(ctx, &locked);
 }
 
-static void io_req_task_work_add(struct io_kiocb *req)
+static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
 {
 	struct task_struct *tsk = req->task;
 	struct io_uring_task *tctx = tsk->io_uring;
@@ -2178,7 +2186,13 @@ static void io_req_task_work_add(struct io_kiocb *req)
 	WARN_ON_ONCE(!tctx);
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
-	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
+	if (emergency && tctx->prior_nr * MAX_EMERGENCY_TW_RATIO < tctx->nr) {
+		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
+		tctx->prior_nr++;
+	} else {
+		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
+	}
+	tctx->nr++;
 	running = tctx->task_running;
 	if (!running)
 		tctx->task_running = true;
@@ -2202,9 +2216,12 @@ static void io_req_task_work_add(struct io_kiocb *req)
 	}
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
+	tctx->nr = tctx->prior_nr = 0;
 	tctx->task_running = false;
-	node = tctx->task_list.first;
+	wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
+	node = tctx->prior_task_list.first;
 	INIT_WQ_LIST(&tctx->task_list);
+	INIT_WQ_LIST(&tctx->prior_task_list);
 	spin_unlock_irqrestore(&tctx->task_lock, flags);
 
 	while (node) {
@@ -2241,19 +2258,19 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
 {
 	req->result = ret;
 	req->io_task_work.func = io_req_task_cancel;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, true);
 }
 
 static void io_req_task_queue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_req_task_submit;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static void io_req_task_queue_reissue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_queue_async_work;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static inline void io_queue_next(struct io_kiocb *req)
@@ -2373,7 +2390,7 @@ static inline void io_put_req_deferred(struct io_kiocb *req)
 {
 	if (req_ref_put_and_test(req)) {
 		req->io_task_work.func = io_free_req_work;
-		io_req_task_work_add(req);
+		io_req_task_work_add(req, false);
 	}
 }
 
@@ -2693,7 +2710,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 		return;
 	req->result = res;
 	req->io_task_work.func = io_req_task_complete;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, true);
 }
 
 static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
@@ -5256,7 +5273,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
 	 * of executing it. We can't safely execute it anyway, as we may not
 	 * have the needed state needed for it anyway.
 	 */
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return 1;
 }
 
@@ -5934,7 +5951,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -6916,7 +6933,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_link_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -8543,6 +8560,7 @@ static int io_uring_alloc_task_context(struct task_struct *task,
 	task->io_uring = tctx;
 	spin_lock_init(&tctx->task_lock);
 	INIT_WQ_LIST(&tctx->task_list);
+	INIT_WQ_LIST(&tctx->prior_task_list);
 	init_task_work(&tctx->task_work, tctx_task_work);
 	return 0;
 }
-- 
2.24.4



* [PATCH 4/8] io_uring: add helper for task work execution code
  2021-09-27  6:17 [PATCH 0/6] task_work optimization Hao Xu
                   ` (2 preceding siblings ...)
  2021-09-27  6:17 ` [PATCH 3/8] io_uring: add a limited tw list for irq completion work Hao Xu
@ 2021-09-27  6:17 ` Hao Xu
  2021-09-27  6:17 ` [PATCH 5/8] io_uring: split io_req_complete_post() and add a helper Hao Xu
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Hao Xu @ 2021-09-27  6:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Add a helper for task work execution code. We will use it later.
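
For readers unfamiliar with the pattern, the loop being factored out
walks a list of embedded nodes, recovers the enclosing request with
container_of(), and calls the request's callback. A minimal userspace
sketch follows; struct request, run_list() and add_id() are
hypothetical, and the per-ctx locking and refcounting done by the real
helper are left out.

```c
#include <stddef.h>

struct node {
	struct node *next;
};

/* Hypothetical request type with an embedded list node. */
struct request {
	int id;
	void (*func)(struct request *req, int *sum);
	struct node tw_node;
};

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Run every callback on the list, in list order. */
static void run_list(struct node *node, int *sum)
{
	while (node) {
		/* Read ->next first: the callback may free or requeue the request. */
		struct node *next = node->next;
		struct request *req = container_of(node, struct request, tw_node);

		req->func(req, sum);
		node = next;
	}
}

/* Example callback: accumulate the request id. */
static void add_id(struct request *req, int *sum)
{
	*sum += req->id;
}
```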

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9272b2cfcfb7..58ce58e7c65d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2126,6 +2126,25 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
 	percpu_ref_put(&ctx->refs);
 }
 
+static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
+{
+	do {
+		struct io_wq_work_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+
+		if (req->ctx != *ctx) {
+			ctx_flush_and_put(*ctx, locked);
+			*ctx = req->ctx;
+			/* if not contended, grab and improve batching */
+			*locked = mutex_trylock(&(*ctx)->uring_lock);
+			percpu_ref_get(&(*ctx)->refs);
+		}
+		req->io_task_work.func(req, locked);
+		node = next;
+	} while (node);
+}
+
 static void tctx_task_work(struct callback_head *cb)
 {
 	bool locked = false;
@@ -2152,22 +2171,7 @@ static void tctx_task_work(struct callback_head *cb)
 		if (!node)
 			break;
 
-		do {
-			struct io_wq_work_node *next = node->next;
-			struct io_kiocb *req = container_of(node, struct io_kiocb,
-							    io_task_work.node);
-
-			if (req->ctx != ctx) {
-				ctx_flush_and_put(ctx, &locked);
-				ctx = req->ctx;
-				/* if not contended, grab and improve batching */
-				locked = mutex_trylock(&ctx->uring_lock);
-				percpu_ref_get(&ctx->refs);
-			}
-			req->io_task_work.func(req, &locked);
-			node = next;
-		} while (node);
-
+		handle_tw_list(node, &ctx, &locked);
 		cond_resched();
 	}
 
-- 
2.24.4



* [PATCH 5/8] io_uring: split io_req_complete_post() and add a helper
  2021-09-27  6:17 [PATCH 0/6] task_work optimization Hao Xu
                   ` (3 preceding siblings ...)
  2021-09-27  6:17 ` [PATCH 4/8] io_uring: add helper for task work execution code Hao Xu
@ 2021-09-27  6:17 ` Hao Xu
  2021-09-27  6:17 ` [PATCH 6/8] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Hao Xu @ 2021-09-27  6:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Split io_req_complete_post(); this is preparation for the next patch.

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 58ce58e7c65d..4ee5bbe36e3b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1793,12 +1793,11 @@ static noinline bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data
 	return __io_cqring_fill_event(ctx, user_data, res, cflags);
 }
 
-static void io_req_complete_post(struct io_kiocb *req, long res,
+static void __io_req_complete_post(struct io_kiocb *req, long res,
 				 unsigned int cflags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
-	spin_lock(&ctx->completion_lock);
 	__io_cqring_fill_event(ctx, req->user_data, res, cflags);
 	/*
 	 * If we're the last reference to this request, add to our locked
@@ -1819,6 +1818,15 @@ static void io_req_complete_post(struct io_kiocb *req, long res,
 		ctx->locked_free_nr++;
 		percpu_ref_put(&ctx->refs);
 	}
+}
+
+static void io_req_complete_post(struct io_kiocb *req, long res,
+				 unsigned int cflags)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	spin_lock(&ctx->completion_lock);
+	__io_req_complete_post(req, res, cflags);
 	io_commit_cqring(ctx);
 	spin_unlock(&ctx->completion_lock);
 	io_cqring_ev_posted(ctx);
-- 
2.24.4



* [PATCH 6/8] io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
  2021-09-27  6:17 [PATCH 0/6] task_work optimization Hao Xu
                   ` (4 preceding siblings ...)
  2021-09-27  6:17 ` [PATCH 5/8] io_uring: split io_req_complete_post() and add a helper Hao Xu
@ 2021-09-27  6:17 ` Hao Xu
  2021-09-27  6:17 ` [PATCH 7/8] io_uring: add tw_ctx for io_uring_task Hao Xu
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 24+ messages in thread
From: Hao Xu @ 2021-09-27  6:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Move them up to avoid explicit declaration. We will use them in later
patches.

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4ee5bbe36e3b..48387ea47c15 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2134,6 +2134,27 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
 	percpu_ref_put(&ctx->refs);
 }
 
+static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf)
+{
+	unsigned int cflags;
+
+	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
+	cflags |= IORING_CQE_F_BUFFER;
+	req->flags &= ~REQ_F_BUFFER_SELECTED;
+	kfree(kbuf);
+	return cflags;
+}
+
+static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
+{
+	struct io_buffer *kbuf;
+
+	if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
+		return 0;
+	kbuf = (struct io_buffer *) (unsigned long) req->rw.addr;
+	return io_put_kbuf(req, kbuf);
+}
+
 static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
 {
 	do {
@@ -2421,27 +2442,6 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx)
 	return smp_load_acquire(&rings->sq.tail) - ctx->cached_sq_head;
 }
 
-static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf)
-{
-	unsigned int cflags;
-
-	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
-	cflags |= IORING_CQE_F_BUFFER;
-	req->flags &= ~REQ_F_BUFFER_SELECTED;
-	kfree(kbuf);
-	return cflags;
-}
-
-static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
-{
-	struct io_buffer *kbuf;
-
-	if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
-		return 0;
-	kbuf = (struct io_buffer *) (unsigned long) req->rw.addr;
-	return io_put_kbuf(req, kbuf);
-}
-
 static inline bool io_run_task_work(void)
 {
 	if (test_thread_flag(TIF_NOTIFY_SIGNAL) || current->task_works) {
-- 
2.24.4



* [PATCH 7/8] io_uring: add tw_ctx for io_uring_task
  2021-09-27  6:17 [PATCH 0/6] task_work optimization Hao Xu
                   ` (5 preceding siblings ...)
  2021-09-27  6:17 ` [PATCH 6/8] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
@ 2021-09-27  6:17 ` Hao Xu
  2021-09-27  6:17 ` [PATCH 8/8] io_uring: batch completion in prior_task_list Hao Xu
  2021-09-27  6:21 ` [PATCH 0/6] task_work optimization Hao Xu
  8 siblings, 0 replies; 24+ messages in thread
From: Hao Xu @ 2021-09-27  6:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Add tw_ctx to record whether all requests in prior_task_list belong to
a single ctx. This is used in the next patch.
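
The tracking logic can be sketched in userspace as below. The names
struct ctx, struct prior_state, prior_add() and prior_reset() are
hypothetical; the update rule follows the diff: the first queued item
sets tw_ctx, and any later item from a different ctx clears it until
the next reset.

```c
#include <stddef.h>

struct ctx {
	int id;
};

/* Hypothetical model of the priority-list bookkeeping. */
struct prior_state {
	unsigned int prior_nr;
	const struct ctx *tw_ctx;	/* non-NULL only while all items share one ctx */
};

/* Record one item queued on the priority list on behalf of ctx c. */
static void prior_add(struct prior_state *s, const struct ctx *c)
{
	s->prior_nr++;
	if (s->prior_nr == 1)
		s->tw_ctx = c;		/* first item decides the candidate ctx */
	else if (s->tw_ctx && s->tw_ctx != c)
		s->tw_ctx = NULL;	/* mixed ctxs: stays NULL until reset */
}

/* Called when the lists are drained. */
static void prior_reset(struct prior_state *s)
{
	s->prior_nr = 0;
	s->tw_ctx = NULL;
}
```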

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 48387ea47c15..596e9e885362 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -477,6 +477,7 @@ struct io_uring_task {
 	struct io_wq_work_list	task_list;
 	struct callback_head	task_work;
 	struct io_wq_work_list	prior_task_list;
+	struct io_ring_ctx	*tw_ctx;
 	unsigned int		nr;
 	unsigned int		prior_nr;
 	bool			task_running;
@@ -2222,6 +2223,10 @@ static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
 	if (emergency && tctx->prior_nr * MAX_EMERGENCY_TW_RATIO < tctx->nr) {
 		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
 		tctx->prior_nr++;
+		if (tctx->prior_nr == 1)
+			tctx->tw_ctx = req->ctx;
+		else if (tctx->tw_ctx && req->ctx != tctx->tw_ctx)
+			tctx->tw_ctx = NULL;
 	} else {
 		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
 	}
@@ -2250,6 +2255,7 @@ static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
 	tctx->nr = tctx->prior_nr = 0;
+	tctx->tw_ctx = NULL;
 	tctx->task_running = false;
 	wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
 	node = tctx->prior_task_list.first;
-- 
2.24.4



* [PATCH 8/8] io_uring: batch completion in prior_task_list
  2021-09-27  6:17 [PATCH 0/6] task_work optimization Hao Xu
                   ` (6 preceding siblings ...)
  2021-09-27  6:17 ` [PATCH 7/8] io_uring: add tw_ctx for io_uring_task Hao Xu
@ 2021-09-27  6:17 ` Hao Xu
  2021-09-27  6:21 ` [PATCH 0/6] task_work optimization Hao Xu
  8 siblings, 0 replies; 24+ messages in thread
From: Hao Xu @ 2021-09-27  6:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

The previous patches gathered some tw with io_req_task_complete() as
their callback into prior_task_list; let's complete them in a batch.
This is better than before in cases where !locked.
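
The gain from batching can be illustrated with a userspace model that
counts lock acquisitions and commits. Everything here is a sketch:
struct cq, struct item and the helper names are hypothetical, and the
lock is only a counter, not a real spinlock.

```c
#include <stddef.h>

/* Hypothetical completion queue with simple instrumentation. */
struct cq {
	unsigned int lock_acquisitions;
	unsigned int commits;
	unsigned int posted;
	long sum;			/* sum of posted results */
};

struct item {
	struct item *next;
	long result;
};

static void cq_lock(struct cq *cq)
{
	cq->lock_acquisitions++;	/* stands in for taking completion_lock */
}

static void cq_unlock(struct cq *cq)
{
	(void)cq;			/* nothing to release in this model */
}

/* Post one completion; the caller must hold the lock. */
static void post_one_locked(struct cq *cq, const struct item *it)
{
	cq->posted++;
	cq->sum += it->result;
}

/* Unbatched path: lock, post, commit, unlock for every item. */
static void complete_one(struct cq *cq, const struct item *it)
{
	cq_lock(cq);
	post_one_locked(cq, it);
	cq->commits++;
	cq_unlock(cq);
}

/* Batched path: one lock and one commit for the whole list. */
static void complete_batch(struct cq *cq, const struct item *it)
{
	if (!it)
		return;
	cq_lock(cq);
	for (; it; it = it->next)
		post_one_locked(cq, it);
	cq->commits++;
	cq_unlock(cq);
}
```

For a list of n items, the unbatched path takes the lock n times and
the batched path takes it once, which is the effect this patch aims for
when all items in prior_task_list share one ctx.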

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 38 +++++++++++++++++++++++++++++++-------
 1 file changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 596e9e885362..138bf8477c9b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2156,6 +2156,23 @@ static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
 	return io_put_kbuf(req, kbuf);
 }
 
+static void handle_prior_tw_list(struct io_wq_work_node *node, struct io_ring_ctx *ctx)
+{
+	spin_lock(&ctx->completion_lock);
+	do {
+		struct io_wq_work_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+
+		__io_req_complete_post(req, req->result, io_put_rw_kbuf(req));
+		node = next;
+	} while (node);
+
+	io_commit_cqring(ctx);
+	spin_unlock(&ctx->completion_lock);
+	io_cqring_ev_posted(ctx);
+}
+
 static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
 {
 	do {
@@ -2178,30 +2195,37 @@ static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ct
 static void tctx_task_work(struct callback_head *cb)
 {
 	bool locked = false;
-	struct io_ring_ctx *ctx = NULL;
+	struct io_ring_ctx *ctx = NULL, *tw_ctx;
 	struct io_uring_task *tctx = container_of(cb, struct io_uring_task,
 						  task_work);
 
 	while (1) {
-		struct io_wq_work_node *node;
+		struct io_wq_work_node *node1, *node2;
 
 		if (!tctx->prior_task_list.first &&
 		    !tctx->task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
-		node = tctx->prior_task_list.first;
+		node1 = tctx->prior_task_list.first;
+		node2 = tctx->task_list.first;
+		tw_ctx = tctx->tw_ctx;
 		INIT_WQ_LIST(&tctx->task_list);
 		INIT_WQ_LIST(&tctx->prior_task_list);
 		tctx->nr = tctx->prior_nr = 0;
-		if (!node)
+		tctx->tw_ctx = NULL;
+		if (!node1 && !node2)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
-		if (!node)
+		if (!node1 && !node2)
 			break;
 
-		handle_tw_list(node, &ctx, &locked);
+		if (tw_ctx)
+			handle_prior_tw_list(node1, tw_ctx);
+		else if (node1)
+			handle_tw_list(node1, &ctx, &locked);
+
+		handle_tw_list(node2, &ctx, &locked);
 		cond_resched();
 	}
 
-- 
2.24.4



* Re: [PATCH 0/6] task_work optimization
  2021-09-27  6:17 [PATCH 0/6] task_work optimization Hao Xu
                   ` (7 preceding siblings ...)
  2021-09-27  6:17 ` [PATCH 8/8] io_uring: batch completion in prior_task_list Hao Xu
@ 2021-09-27  6:21 ` Hao Xu
  8 siblings, 0 replies; 24+ messages in thread
From: Hao Xu @ 2021-09-27  6:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Apologies for the subject; it should be 0/8. I forgot to update it.

On 2021/9/27 2:17 PM, Hao Xu wrote:
> The main patches are 3/6 and 6/6. 3/6 is to set a new task list and
                         ^ 3/8  ^ 8/8  ^ 3/8
> complete its task works prior to the normal task works in old task list.
> 
> 6/6 is an optimization of batching completion of task works in the new
    ^ 8/8
> task list if they all have same ctx which is the normal situation, the
> benefit is we now batch them regardless uring_lock.
> 
> Tested this patchset by manually replace __io_queue_sqe() to
> io_req_task_complete() to construct 'heavy' task works. Then test with
> fio:
> 
> ioengine=io_uring
> thread=1
> bs=4k
> direct=1
> rw=randread
> time_based=1
> runtime=600
> randrepeat=0
> group_reporting=1
> filename=/dev/nvme0n1
> 
> Tried various iodepth.
> The peak IOPS for this patch is 314K, while the old one is 249K.
> For avg latency, difference shows when iodepth grow:
> depth and avg latency(usec):
> 	depth      new          old
> 	 1        22.80        23.77
> 	 2        23.48        24.54
> 	 4        24.26        25.57
> 	 8        29.21        32.89
> 	 16       53.61        63.50
> 	 32       106.29       131.34
> 	 64       217.21       256.33
> 	 128      421.59       513.87
>   	 256      815.15       1050.99
> 
> without this patchset
> iodepth=1
> clat percentiles (usec):
>   |  1.00th=[    7],  5.00th=[    7], 10.00th=[    8], 20.00th=[    8],
>   | 30.00th=[    8], 40.00th=[    8], 50.00th=[    8], 60.00th=[    8],
>   | 70.00th=[    8], 80.00th=[    8], 90.00th=[   82], 95.00th=[   97],
>   | 99.00th=[   99], 99.50th=[   99], 99.90th=[  100], 99.95th=[  101],
>   | 99.99th=[  126]
> iodepth=2
> clat percentiles (usec):
>   |  1.00th=[    7],  5.00th=[    7], 10.00th=[    7], 20.00th=[    8],
>   | 30.00th=[    8], 40.00th=[    8], 50.00th=[    8], 60.00th=[    9],
>   | 70.00th=[   10], 80.00th=[   10], 90.00th=[   83], 95.00th=[   97],
>   | 99.00th=[  100], 99.50th=[  102], 99.90th=[  126], 99.95th=[  145],
>   | 99.99th=[  971]
> iodepth=4
> clat percentiles (usec):
>   |  1.00th=[    7],  5.00th=[    7], 10.00th=[    8], 20.00th=[    8],
>   | 30.00th=[    8], 40.00th=[    9], 50.00th=[    9], 60.00th=[   10],
>   | 70.00th=[   11], 80.00th=[   13], 90.00th=[   86], 95.00th=[   98],
>   | 99.00th=[  105], 99.50th=[  115], 99.90th=[  139], 99.95th=[  149],
>   | 99.99th=[  169]
> iodepth=8
> clat percentiles (usec):
>   |  1.00th=[    7],  5.00th=[    8], 10.00th=[    9], 20.00th=[   12],
>   | 30.00th=[   13], 40.00th=[   16], 50.00th=[   18], 60.00th=[   20],
>   | 70.00th=[   22], 80.00th=[   27], 90.00th=[   95], 95.00th=[  105],
>   | 99.00th=[  121], 99.50th=[  131], 99.90th=[  157], 99.95th=[  167],
>   | 99.99th=[  206]
> iodepth=16
> clat percentiles (usec):
>   |  1.00th=[   25],  5.00th=[   33], 10.00th=[   37], 20.00th=[   41],
>   | 30.00th=[   44], 40.00th=[   46], 50.00th=[   49], 60.00th=[   51],
>   | 70.00th=[   55], 80.00th=[   63], 90.00th=[  125], 95.00th=[  137],
>   | 99.00th=[  155], 99.50th=[  165], 99.90th=[  198], 99.95th=[  235],
>   | 99.99th=[ 1844]
> iodepth=32
> clat percentiles (usec):
>   |  1.00th=[   92],  5.00th=[   98], 10.00th=[  102], 20.00th=[  106],
>   | 30.00th=[  110], 40.00th=[  112], 50.00th=[  116], 60.00th=[  120],
>   | 70.00th=[  128], 80.00th=[  141], 90.00th=[  192], 95.00th=[  204],
>   | 99.00th=[  227], 99.50th=[  235], 99.90th=[  260], 99.95th=[  273],
>   | 99.99th=[  322]
> iodepth=64
> clat percentiles (usec):
>   |  1.00th=[  221],  5.00th=[  227], 10.00th=[  231], 20.00th=[  233],
>   | 30.00th=[  237], 40.00th=[  239], 50.00th=[  241], 60.00th=[  243],
>   | 70.00th=[  247], 80.00th=[  253], 90.00th=[  318], 95.00th=[  330],
>   | 99.00th=[  351], 99.50th=[  359], 99.90th=[  388], 99.95th=[  400],
>   | 99.99th=[  529]
> iodepth=128
> clat percentiles (usec):
>   |  1.00th=[  465],  5.00th=[  478], 10.00th=[  482], 20.00th=[  486],
>   | 30.00th=[  490], 40.00th=[  490], 50.00th=[  494], 60.00th=[  498],
>   | 70.00th=[  506], 80.00th=[  553], 90.00th=[  578], 95.00th=[  586],
>   | 99.00th=[  635], 99.50th=[  652], 99.90th=[  676], 99.95th=[  717],
>   | 99.99th=[ 2278]
> iodepth=256
> clat percentiles (usec):
>   |  1.00th=[  979],  5.00th=[  988], 10.00th=[  996], 20.00th=[ 1012],
>   | 30.00th=[ 1020], 40.00th=[ 1037], 50.00th=[ 1037], 60.00th=[ 1045],
>   | 70.00th=[ 1057], 80.00th=[ 1090], 90.00th=[ 1123], 95.00th=[ 1139],
>   | 99.00th=[ 1205], 99.50th=[ 1237], 99.90th=[ 1254], 99.95th=[ 1270],
>   | 99.99th=[ 1385]
> 
> with this patchset
> iodepth=1
> clat percentiles (usec):
>   |  1.00th=[    7],  5.00th=[    7], 10.00th=[    7], 20.00th=[    7],
>   | 30.00th=[    7], 40.00th=[    7], 50.00th=[    8], 60.00th=[    8],
>   | 70.00th=[    8], 80.00th=[    8], 90.00th=[   82], 95.00th=[   97],
>   | 99.00th=[   99], 99.50th=[   99], 99.90th=[  100], 99.95th=[  101],
>   | 99.99th=[  125]
> iodepth=2
> clat percentiles (usec):
>   |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    7],
>   | 30.00th=[    7], 40.00th=[    7], 50.00th=[    8], 60.00th=[    8],
>   | 70.00th=[    9], 80.00th=[   11], 90.00th=[   83], 95.00th=[   97],
>   | 99.00th=[  100], 99.50th=[  102], 99.90th=[  127], 99.95th=[  141],
>   | 99.99th=[  668]
> iodepth=4
> clat percentiles (usec):
>   |  1.00th=[    6],  5.00th=[    6], 10.00th=[    7], 20.00th=[    7],
>   | 30.00th=[    7], 40.00th=[    8], 50.00th=[    8], 60.00th=[    9],
>   | 70.00th=[   10], 80.00th=[   12], 90.00th=[   85], 95.00th=[   97],
>   | 99.00th=[  104], 99.50th=[  115], 99.90th=[  141], 99.95th=[  149],
>   | 99.99th=[  194]
> iodepth=8
> clat percentiles (usec):
>   |  1.00th=[    6],  5.00th=[    7], 10.00th=[    7], 20.00th=[    9],
>   | 30.00th=[   11], 40.00th=[   12], 50.00th=[   14], 60.00th=[   15],
>   | 70.00th=[   18], 80.00th=[   22], 90.00th=[   93], 95.00th=[  103],
>   | 99.00th=[  120], 99.50th=[  130], 99.90th=[  157], 99.95th=[  167],
>   | 99.99th=[  208]
> iodepth=16
> clat percentiles (usec):
>   |  1.00th=[   16],  5.00th=[   24], 10.00th=[   28], 20.00th=[   32],
>   | 30.00th=[   34], 40.00th=[   37], 50.00th=[   39], 60.00th=[   41],
>   | 70.00th=[   44], 80.00th=[   51], 90.00th=[  117], 95.00th=[  128],
>   | 99.00th=[  147], 99.50th=[  159], 99.90th=[  194], 99.95th=[  235],
>   | 99.99th=[ 1909]
> iodepth=32
> clat percentiles (usec):
>   |  1.00th=[   72],  5.00th=[   78], 10.00th=[   81], 20.00th=[   84],
>   | 30.00th=[   86], 40.00th=[   88], 50.00th=[   90], 60.00th=[   93],
>   | 70.00th=[   96], 80.00th=[  114], 90.00th=[  169], 95.00th=[  182],
>   | 99.00th=[  202], 99.50th=[  212], 99.90th=[  239], 99.95th=[  253],
>   | 99.99th=[  302]
> iodepth=64
> clat percentiles (usec):
>   |  1.00th=[  178],  5.00th=[  184], 10.00th=[  186], 20.00th=[  192],
>   | 30.00th=[  196], 40.00th=[  200], 50.00th=[  204], 60.00th=[  206],
>   | 70.00th=[  210], 80.00th=[  221], 90.00th=[  281], 95.00th=[  293],
>   | 99.00th=[  318], 99.50th=[  330], 99.90th=[  355], 99.95th=[  367],
>   | 99.99th=[  437]
> iodepth=128
> clat percentiles (usec):
>   |  1.00th=[  379],  5.00th=[  388], 10.00th=[  392], 20.00th=[  396],
>   | 30.00th=[  396], 40.00th=[  400], 50.00th=[  404], 60.00th=[  408],
>   | 70.00th=[  424], 80.00th=[  437], 90.00th=[  482], 95.00th=[  498],
>   | 99.00th=[  529], 99.50th=[  537], 99.90th=[  570], 99.95th=[  635],
>   | 99.99th=[ 2311]
> iodepth=256
> clat percentiles (usec):
>   |  1.00th=[  783],  5.00th=[  783], 10.00th=[  791], 20.00th=[  791],
>   | 30.00th=[  791], 40.00th=[  799], 50.00th=[  799], 60.00th=[  799],
>   | 70.00th=[  807], 80.00th=[  816], 90.00th=[  881], 95.00th=[  889],
>   | 99.00th=[  914], 99.50th=[  930], 99.90th=[  979], 99.95th=[  996],
>   | 99.99th=[ 1237]
> 
> 
> Hao Xu (8):
>    io-wq: code clean for io_wq_add_work_after()
>    io-wq: add helper to merge two wq_lists
>    io_uring: add a limited tw list for irq completion work
>    io_uring: add helper for task work execution code
>    io_uring: split io_req_complete_post() and add a helper
>    io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
>    io_uring: add tw_ctx for io_uring_task
>    io_uring: batch completion in prior_task_list
> 
>   fs/io-wq.h    |  26 ++++++--
>   fs/io_uring.c | 170 ++++++++++++++++++++++++++++++++++----------------
>   2 files changed, 137 insertions(+), 59 deletions(-)
> 



* Re: [PATCH 2/8] io-wq: add helper to merge two wq_lists
  2021-09-27  6:17 ` [PATCH 2/8] io-wq: add helper to merge two wq_lists Hao Xu
@ 2021-09-27 10:17   ` Hao Xu
  2021-09-28 11:10   ` Pavel Begunkov
  1 sibling, 0 replies; 24+ messages in thread
From: Hao Xu @ 2021-09-27 10:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

On 2021/9/27 2:17 PM, Hao Xu wrote:
> add a helper to merge two wq_lists, it will be useful in the next
> patches.
> 
> Signed-off-by: Hao Xu <[email protected]>
> ---
>   fs/io-wq.h | 20 ++++++++++++++++++++
>   1 file changed, 20 insertions(+)
> 
> diff --git a/fs/io-wq.h b/fs/io-wq.h
> index 8369a51b65c0..7510b05d4a86 100644
> --- a/fs/io-wq.h
> +++ b/fs/io-wq.h
> @@ -39,6 +39,26 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
>   		list->last = node;
>   }
>   
> +/**
> + * wq_list_merge - merge the second list to the first one.
> + * @list0: the first list
> + * @list1: the second list
> + * after merge, list0 contains the merged list.
> + */
> +static inline void wq_list_merge(struct io_wq_work_list *list0,
> +				     struct io_wq_work_list *list1)
> +{
> +	if (!list1)
> +		return;
> +
> +	if (!list0) {
> +		list0 = list1;
> +		return;
> +	}
> +	list0->last->next = list1->first;
> +	list0->last = list1->last;
> +}
This needs some tweaking; I will send v2 soon.
> +
>   static inline void wq_list_add_tail(struct io_wq_work_node *node,
>   				    struct io_wq_work_list *list)
>   {
> 



* [PATCH 4/8] io_uring: add helper for task work execution code
  2021-09-27 10:51 [PATCH v2 0/8] " Hao Xu
@ 2021-09-27 10:51 ` Hao Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Hao Xu @ 2021-09-27 10:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Add a helper for task work execution code. We will use it later.

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 582ef7f55a35..af3811f1ef2e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2126,6 +2126,25 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
 	percpu_ref_put(&ctx->refs);
 }
 
+static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
+{
+	do {
+		struct io_wq_work_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+
+		if (req->ctx != *ctx) {
+			ctx_flush_and_put(*ctx, locked);
+			*ctx = req->ctx;
+			/* if not contended, grab and improve batching */
+			*locked = mutex_trylock(&(*ctx)->uring_lock);
+			percpu_ref_get(&(*ctx)->refs);
+		}
+		req->io_task_work.func(req, locked);
+		node = next;
+	} while (node);
+}
+
 static void tctx_task_work(struct callback_head *cb)
 {
 	bool locked = false;
@@ -2153,22 +2172,7 @@ static void tctx_task_work(struct callback_head *cb)
 		if (!node)
 			break;
 
-		do {
-			struct io_wq_work_node *next = node->next;
-			struct io_kiocb *req = container_of(node, struct io_kiocb,
-							    io_task_work.node);
-
-			if (req->ctx != ctx) {
-				ctx_flush_and_put(ctx, &locked);
-				ctx = req->ctx;
-				/* if not contended, grab and improve batching */
-				locked = mutex_trylock(&ctx->uring_lock);
-				percpu_ref_get(&ctx->refs);
-			}
-			req->io_task_work.func(req, &locked);
-			node = next;
-		} while (node);
-
+		handle_tw_list(node, &ctx, &locked);
 		cond_resched();
 	}
 
-- 
2.24.4



* Re: [PATCH 1/8] io-wq: code clean for io_wq_add_work_after()
  2021-09-27  6:17 ` [PATCH 1/8] io-wq: code clean for io_wq_add_work_after() Hao Xu
@ 2021-09-28 11:08   ` Pavel Begunkov
  2021-09-29  7:36     ` Hao Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Pavel Begunkov @ 2021-09-28 11:08 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 9/27/21 7:17 AM, Hao Xu wrote:
> Remove a local variable.

It's there to help alias analysis, which usually can't do anything
with pointer-heavy logic. Compare the assembly below, before and after
respectively:
	testq	%rax, %rax	# next

replaced with
	cmpq	$0, (%rdi)	#, node_2(D)->next

The result is one extra memory dereference and a bigger binary.


=====================================================

wq_list_add_after:
# fs/io_uring.c:271: 	struct io_wq_work_node *next = pos->next;
	movq	(%rsi), %rax	# pos_3(D)->next, next
# fs/io_uring.c:273: 	pos->next = node;
	movq	%rdi, (%rsi)	# node, pos_3(D)->next
# fs/io_uring.c:275: 	if (!next)
	testq	%rax, %rax	# next
# fs/io_uring.c:274: 	node->next = next;
	movq	%rax, (%rdi)	# next, node_5(D)->next
# fs/io_uring.c:275: 	if (!next)
	je	.L5927	#,
	ret	
.L5927:
# fs/io_uring.c:276: 		list->last = node;
	movq	%rdi, 8(%rdx)	# node, list_8(D)->last
	ret	

=====================================================

wq_list_add_after:
# fs/io-wq.h:48: 	node->next = pos->next;
	movq	(%rsi), %rax	# pos_3(D)->next, _5
# fs/io-wq.h:48: 	node->next = pos->next;
	movq	%rax, (%rdi)	# _5, node_2(D)->next
# fs/io-wq.h:49: 	pos->next = node;
	movq	%rdi, (%rsi)	# node, pos_3(D)->next
# fs/io-wq.h:50: 	if (!node->next)
	cmpq	$0, (%rdi)	#, node_2(D)->next
	je	.L5924	#,
	ret	
.L5924:
# fs/io-wq.h:51: 		list->last = node;
	movq	%rdi, 8(%rdx)	# node, list_4(D)->last
	ret	


> 
> Signed-off-by: Hao Xu <[email protected]>
> ---
>  fs/io-wq.h | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/io-wq.h b/fs/io-wq.h
> index bf5c4c533760..8369a51b65c0 100644
> --- a/fs/io-wq.h
> +++ b/fs/io-wq.h
> @@ -33,11 +33,9 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
>  				     struct io_wq_work_node *pos,
>  				     struct io_wq_work_list *list)
>  {
> -	struct io_wq_work_node *next = pos->next;
> -
> +	node->next = pos->next;
>  	pos->next = node;
> -	node->next = next;
> -	if (!next)
> +	if (!node->next)
>  		list->last = node;
>  }
>  
> 

-- 
Pavel Begunkov


* Re: [PATCH 2/8] io-wq: add helper to merge two wq_lists
  2021-09-27  6:17 ` [PATCH 2/8] io-wq: add helper to merge two wq_lists Hao Xu
  2021-09-27 10:17   ` Hao Xu
@ 2021-09-28 11:10   ` Pavel Begunkov
  2021-09-28 16:48     ` Hao Xu
  1 sibling, 1 reply; 24+ messages in thread
From: Pavel Begunkov @ 2021-09-28 11:10 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 9/27/21 7:17 AM, Hao Xu wrote:
> add a helper to merge two wq_lists, it will be useful in the next
> patches.
> 
> Signed-off-by: Hao Xu <[email protected]>
> ---
>  fs/io-wq.h | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/fs/io-wq.h b/fs/io-wq.h
> index 8369a51b65c0..7510b05d4a86 100644
> --- a/fs/io-wq.h
> +++ b/fs/io-wq.h
> @@ -39,6 +39,26 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
>  		list->last = node;
>  }
>  
> +/**
> + * wq_list_merge - merge the second list to the first one.
> + * @list0: the first list
> + * @list1: the second list
> + * after merge, list0 contains the merged list.
> + */
> +static inline void wq_list_merge(struct io_wq_work_list *list0,
> +				     struct io_wq_work_list *list1)
> +{
> +	if (!list1)
> +		return;
> +
> +	if (!list0) {
> +		list0 = list1;

It assigns to a local variable and returns, so the assignment will be
compiled out. Something is wrong here.

> +		return;
> +	}
> +	list0->last->next = list1->first;
> +	list0->last = list1->last;
> +}
> +
>  static inline void wq_list_add_tail(struct io_wq_work_node *node,
>  				    struct io_wq_work_list *list)
>  {
> 
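
The pointer-assignment problem noted above can be reproduced in user
space. What follows is only a rough sketch: struct node, struct list,
assign_param() and merge_checked() are hypothetical stand-ins, not the
kernel definitions and not the actual v2 of this patch. It assumes the
callers always pass valid list pointers, so emptiness is tested through
->first rather than through the pointer itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for io_wq_work_node / io_wq_work_list. */
struct node {
	struct node *next;
};

struct list {
	struct node *first;
	struct node *last;
};

/* Mirrors the "list0 = list1" line: only the callee's copy of the
 * pointer changes, so the caller's list is left untouched. */
static void assign_param(struct list *list0, struct list *list1)
{
	list0 = list1;
	(void)list0;
}

/* One possible shape of a fix (an assumption, not the posted v2):
 * check whether the lists are empty and copy head/tail by value. */
static void merge_checked(struct list *list0, struct list *list1)
{
	if (!list1->first)
		return;				/* nothing to append */

	if (!list0->first) {
		*list0 = *list1;		/* take over first and last */
	} else {
		list0->last->next = list1->first;
		list0->last = list1->last;
	}
	/* Leave the source list empty so no node is reachable twice. */
	list1->first = list1->last = NULL;
}
```

Resetting list1 inside the helper is a design choice of this sketch; in
the posted series the caller reinitialises both lists with INIT_WQ_LIST()
after the merge.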

-- 
Pavel Begunkov


* Re: [PATCH 3/8] io_uring: add a limited tw list for irq completion work
  2021-09-27  6:17 ` [PATCH 3/8] io_uring: add a limited tw list for irq completion work Hao Xu
@ 2021-09-28 11:29   ` Pavel Begunkov
  2021-09-28 16:55     ` Hao Xu
                       ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Pavel Begunkov @ 2021-09-28 11:29 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 9/27/21 7:17 AM, Hao Xu wrote:
> Now we have a lot of task_work users, some are just to complete a req
> and generate a cqe. Let's put the work to a new tw list which has a
> higher priority, so that it can be handled quickly and thus to reduce
> avg req latency. an explanatory case:
> 
> origin timeline:
>     submit_sqe-->irq-->add completion task_work
>     -->run heavy work0~n-->run completion task_work
> now timeline:
>     submit_sqe-->irq-->add completion task_work
>     -->run completion task_work-->run heavy work0~n
> 
> One thing to watch out is sometimes irq completion TWs comes
> overwhelmingly, which makes the new tw list grows fast, and TWs in
> the old list are starved. So we have to limit the length of the new
> tw list. A practical value is 1/3:
>     len of new tw list < 1/3 * (len of new + old tw list)
> 
> In this way, the new tw list has a limited length and normal task get
> there chance to run.
> 
> Tested this patch(and the following ones) by manually replace
> __io_queue_sqe() to io_req_task_complete() to construct 'heavy' task
> works. Then test with fio:
> 
> ioengine=io_uring
> thread=1
> bs=4k
> direct=1
> rw=randread
> time_based=1
> runtime=600
> randrepeat=0
> group_reporting=1
> filename=/dev/nvme0n1
> 
> Tried various iodepth.
> The peak IOPS for this patch is 314K, while the old one is 249K.
> For avg latency, difference shows when iodepth grow:
> depth and avg latency(usec):
> 	depth      new          old
> 	 1        22.80        23.77
> 	 2        23.48        24.54
> 	 4        24.26        25.57
> 	 8        29.21        32.89
> 	 16       53.61        63.50
> 	 32       106.29       131.34
> 	 64       217.21       256.33
> 	 128      421.59       513.87
> 	 256      815.15       1050.99
> 
> 95%, 99% etc more data in cover letter.
> 
> Signed-off-by: Hao Xu <[email protected]>
> ---
>  fs/io_uring.c | 44 +++++++++++++++++++++++++++++++-------------
>  1 file changed, 31 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 8317c360f7a4..9272b2cfcfb7 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -461,6 +461,7 @@ struct io_ring_ctx {
>  	};
>  };
>  
> +#define MAX_EMERGENCY_TW_RATIO	3
>  struct io_uring_task {
>  	/* submission side */
>  	int			cached_refs;
> @@ -475,6 +476,9 @@ struct io_uring_task {
>  	spinlock_t		task_lock;
>  	struct io_wq_work_list	task_list;
>  	struct callback_head	task_work;
> +	struct io_wq_work_list	prior_task_list;
> +	unsigned int		nr;
> +	unsigned int		prior_nr;
>  	bool			task_running;
>  };
>  
> @@ -2132,12 +2136,16 @@ static void tctx_task_work(struct callback_head *cb)
>  	while (1) {
>  		struct io_wq_work_node *node;
>  
> -		if (!tctx->task_list.first && locked)
> +		if (!tctx->prior_task_list.first &&
> +		    !tctx->task_list.first && locked)
>  			io_submit_flush_completions(ctx);
>  
>  		spin_lock_irq(&tctx->task_lock);
> -		node = tctx->task_list.first;
> +		wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
> +		node = tctx->prior_task_list.first;

I find all this accounting expensive; I'm sure I'll see it in my BPF tests.

How about
1) remove MAX_EMERGENCY_TW_RATIO and all the counters,
prior_nr and others.

2) rely solely on list merging

So, when it enters an iteration of the loop, it finds a set of requests
to run; it first executes all priority ones of that set and then the
rest (simply because you merged the lists and execute everything from
them).

It solves the problem of total starvation of non-prio requests, e.g.
if new completions are coming in as fast as you complete previous ones.
One downside is that prio requests arriving while we execute a previous
batch will be executed only after that batch's non-prio requests. I
don't think it's much of a problem, but it would be interesting to see
numbers.
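
The merge-only scheme described above can be sketched in user space as
follows. This is an illustration under assumptions, not the io_uring
code: struct node, struct list, list_add_tail(), take_batch() and
run_batch() are hypothetical names, locking is omitted, and "running" a
work item just records its id.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; 'id' only exists to observe execution order. */
struct node {
	struct node *next;
	int id;
};

struct list {
	struct node *first;
	struct node *last;
};

static void list_add_tail(struct node *n, struct list *l)
{
	n->next = NULL;
	if (!l->first)
		l->first = n;
	else
		l->last->next = n;
	l->last = n;
}

/* Splice 'normal' behind 'prio' and hand back the combined batch.
 * Both lists are left empty, so later arrivals form the next batch.
 * No counters or ratio are needed. */
static struct node *take_batch(struct list *prio, struct list *normal)
{
	struct node *head;

	if (!prio->first) {
		head = normal->first;
	} else {
		head = prio->first;
		prio->last->next = normal->first;
	}
	prio->first = prio->last = NULL;
	normal->first = normal->last = NULL;
	return head;
}

/* Walk one batch front to back, recording ids; returns the count. */
static int run_batch(struct node *node, int *order, int max)
{
	int n = 0;

	while (node && n < max) {
		struct node *next = node->next;

		order[n++] = node->id;
		node = next;
	}
	return n;
}
```

Within one batch every priority item runs before any normal item, and a
priority item queued after take_batch() waits for the whole batch, which
is the downside mentioned above.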


>  		INIT_WQ_LIST(&tctx->task_list);
> +		INIT_WQ_LIST(&tctx->prior_task_list);
> +		tctx->nr = tctx->prior_nr = 0;
>  		if (!node)
>  			tctx->task_running = false;
>  		spin_unlock_irq(&tctx->task_lock);
> @@ -2166,7 +2174,7 @@ static void tctx_task_work(struct callback_head *cb)
>  	ctx_flush_and_put(ctx, &locked);
>  }
>  
> -static void io_req_task_work_add(struct io_kiocb *req)
> +static void io_req_task_work_add(struct io_kiocb *req, bool emergency)

I think "priority" instead of "emergency" would be more accurate.

>  {
>  	struct task_struct *tsk = req->task;
>  	struct io_uring_task *tctx = tsk->io_uring;
> @@ -2178,7 +2186,13 @@ static void io_req_task_work_add(struct io_kiocb *req)
>  	WARN_ON_ONCE(!tctx);
>  
>  	spin_lock_irqsave(&tctx->task_lock, flags);
> -	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
> +	if (emergency && tctx->prior_nr * MAX_EMERGENCY_TW_RATIO < tctx->nr) {
> +		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
> +		tctx->prior_nr++;
> +	} else {
> +		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
> +	}
> +	tctx->nr++;
>  	running = tctx->task_running;
>  	if (!running)
>  		tctx->task_running = true;



-- 
Pavel Begunkov


* Re: [PATCH 2/8] io-wq: add helper to merge two wq_lists
  2021-09-28 11:10   ` Pavel Begunkov
@ 2021-09-28 16:48     ` Hao Xu
  2021-09-29 11:23       ` Pavel Begunkov
  0 siblings, 1 reply; 24+ messages in thread
From: Hao Xu @ 2021-09-28 16:48 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/9/28 7:10 PM, Pavel Begunkov wrote:
> On 9/27/21 7:17 AM, Hao Xu wrote:
>> add a helper to merge two wq_lists, it will be useful in the next
>> patches.
>>
>> Signed-off-by: Hao Xu <[email protected]>
>> ---
>>   fs/io-wq.h | 20 ++++++++++++++++++++
>>   1 file changed, 20 insertions(+)
>>
>> diff --git a/fs/io-wq.h b/fs/io-wq.h
>> index 8369a51b65c0..7510b05d4a86 100644
>> --- a/fs/io-wq.h
>> +++ b/fs/io-wq.h
>> @@ -39,6 +39,26 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
>>   		list->last = node;
>>   }
>>   
>> +/**
>> + * wq_list_merge - merge the second list to the first one.
>> + * @list0: the first list
>> + * @list1: the second list
>> + * after merge, list0 contains the merged list.
>> + */
>> +static inline void wq_list_merge(struct io_wq_work_list *list0,
>> +				     struct io_wq_work_list *list1)
>> +{
>> +	if (!list1)
>> +		return;
>> +
>> +	if (!list0) {
>> +		list0 = list1;
> 
> It assigns a local var and returns, the assignment will be compiled
> out, something is wrong
True, I've corrected it in v2.
> 
>> +		return;
>> +	}
>> +	list0->last->next = list1->first;
>> +	list0->last = list1->last;
>> +}
>> +
>>   static inline void wq_list_add_tail(struct io_wq_work_node *node,
>>   				    struct io_wq_work_list *list)
>>   {
>>
> 



* Re: [PATCH 3/8] io_uring: add a limited tw list for irq completion work
  2021-09-28 11:29   ` Pavel Begunkov
@ 2021-09-28 16:55     ` Hao Xu
  2021-09-29 11:25       ` Pavel Begunkov
  2021-09-29 11:38     ` Hao Xu
  2021-09-30  3:21     ` Hao Xu
  2 siblings, 1 reply; 24+ messages in thread
From: Hao Xu @ 2021-09-28 16:55 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/9/28 7:29 PM, Pavel Begunkov wrote:
> On 9/27/21 7:17 AM, Hao Xu wrote:
>> Now we have a lot of task_work users, some are just to complete a req
>> and generate a cqe. Let's put the work to a new tw list which has a
>> higher priority, so that it can be handled quickly and thus to reduce
>> avg req latency. an explanatory case:
>>
>> origin timeline:
>>      submit_sqe-->irq-->add completion task_work
>>      -->run heavy work0~n-->run completion task_work
>> now timeline:
>>      submit_sqe-->irq-->add completion task_work
>>      -->run completion task_work-->run heavy work0~n
>>
>> One thing to watch out is sometimes irq completion TWs comes
>> overwhelmingly, which makes the new tw list grows fast, and TWs in
>> the old list are starved. So we have to limit the length of the new
>> tw list. A practical value is 1/3:
>>      len of new tw list < 1/3 * (len of new + old tw list)
>>
>> In this way, the new tw list has a limited length and normal task get
>> there chance to run.
>>
>> Tested this patch(and the following ones) by manually replace
>> __io_queue_sqe() to io_req_task_complete() to construct 'heavy' task
>> works. Then test with fio:
>>
>> ioengine=io_uring
>> thread=1
>> bs=4k
>> direct=1
>> rw=randread
>> time_based=1
>> runtime=600
>> randrepeat=0
>> group_reporting=1
>> filename=/dev/nvme0n1
>>
>> Tried various iodepth.
>> The peak IOPS for this patch is 314K, while the old one is 249K.
>> For avg latency, difference shows when iodepth grow:
>> depth and avg latency(usec):
>> 	depth      new          old
>> 	 1        22.80        23.77
>> 	 2        23.48        24.54
>> 	 4        24.26        25.57
>> 	 8        29.21        32.89
>> 	 16       53.61        63.50
>> 	 32       106.29       131.34
>> 	 64       217.21       256.33
>> 	 128      421.59       513.87
>> 	 256      815.15       1050.99
>>
>> 95%, 99% etc more data in cover letter.
>>
>> Signed-off-by: Hao Xu <[email protected]>
>> ---
>>   fs/io_uring.c | 44 +++++++++++++++++++++++++++++++-------------
>>   1 file changed, 31 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 8317c360f7a4..9272b2cfcfb7 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -461,6 +461,7 @@ struct io_ring_ctx {
>>   	};
>>   };
>>   
>> +#define MAX_EMERGENCY_TW_RATIO	3
>>   struct io_uring_task {
>>   	/* submission side */
>>   	int			cached_refs;
>> @@ -475,6 +476,9 @@ struct io_uring_task {
>>   	spinlock_t		task_lock;
>>   	struct io_wq_work_list	task_list;
>>   	struct callback_head	task_work;
>> +	struct io_wq_work_list	prior_task_list;
>> +	unsigned int		nr;
>> +	unsigned int		prior_nr;
>>   	bool			task_running;
>>   };
>>   
>> @@ -2132,12 +2136,16 @@ static void tctx_task_work(struct callback_head *cb)
>>   	while (1) {
>>   		struct io_wq_work_node *node;
>>   
>> -		if (!tctx->task_list.first && locked)
>> +		if (!tctx->prior_task_list.first &&
>> +		    !tctx->task_list.first && locked)
>>   			io_submit_flush_completions(ctx);
>>   
>>   		spin_lock_irq(&tctx->task_lock);
>> -		node = tctx->task_list.first;
>> +		wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
>> +		node = tctx->prior_task_list.first;
> 
> I find all this accounting expensive, sure I'll see it for my BPF tests.
> 
> How about
> 1) remove MAX_EMERGENCY_TW_RATIO and all the counters,
> prior_nr and others.
> 
> 2) rely solely on list merging
> 
> So, when it enters an iteration of the loop it finds a set of requests
> to run, it first executes all priority ones of that set and then the
> rest (just by the fact that you merged the lists and execute all from
> them).
> 
> It solves the problem of total starvation of non-prio requests, e.g.
> if new completions coming as fast as you complete previous ones. One
> downside is that prio requests coming while we execute a previous
> batch will be executed only after a previous batch of non-prio
> requests, I don't think it's much of a problem but interesting to
> see numbers.
Actually, this was one of my earlier implementations. I split it into two
lists explicitly, mostly for the convenience of 8/8, which batches the
task works in the prior list. I'll evaluate the overhead tomorrow.
> 
> 
>>   		INIT_WQ_LIST(&tctx->task_list);
>> +		INIT_WQ_LIST(&tctx->prior_task_list);
>> +		tctx->nr = tctx->prior_nr = 0;
>>   		if (!node)
>>   			tctx->task_running = false;
>>   		spin_unlock_irq(&tctx->task_lock);
>> @@ -2166,7 +2174,7 @@ static void tctx_task_work(struct callback_head *cb)
>>   	ctx_flush_and_put(ctx, &locked);
>>   }
>>   
>> -static void io_req_task_work_add(struct io_kiocb *req)
>> +static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
> 
> It think "priority" instead of "emergency" will be more accurate
> 
>>   {
>>   	struct task_struct *tsk = req->task;
>>   	struct io_uring_task *tctx = tsk->io_uring;
>> @@ -2178,7 +2186,13 @@ static void io_req_task_work_add(struct io_kiocb *req)
>>   	WARN_ON_ONCE(!tctx);
>>   
>>   	spin_lock_irqsave(&tctx->task_lock, flags);
>> -	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>> +	if (emergency && tctx->prior_nr * MAX_EMERGENCY_TW_RATIO < tctx->nr) {
>> +		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
>> +		tctx->prior_nr++;
>> +	} else {
>> +		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>> +	}
>> +	tctx->nr++;
>>   	running = tctx->task_running;
>>   	if (!running)
>>   		tctx->task_running = true;
> 
> 
> 



* Re: [PATCH 1/8] io-wq: code clean for io_wq_add_work_after()
  2021-09-28 11:08   ` Pavel Begunkov
@ 2021-09-29  7:36     ` Hao Xu
  2021-09-29 11:23       ` Pavel Begunkov
  0 siblings, 1 reply; 24+ messages in thread
From: Hao Xu @ 2021-09-29  7:36 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/9/28 7:08 PM, Pavel Begunkov wrote:
> On 9/27/21 7:17 AM, Hao Xu wrote:
>> Remove a local variable.
> 
> It's there to help alias analysis, which usually can't do anything
> with pointer heavy logic. Compare ASMs below, before and after
> respectively:
> 	testq	%rax, %rax	# next
> 
> replaced with
> 	cmpq	$0, (%rdi)	#, node_2(D)->next
> 
> One extra memory dereference and a bigger binary
> 
> 
> =====================================================
> 
> wq_list_add_after:
> # fs/io_uring.c:271: 	struct io_wq_work_node *next = pos->next;
> 	movq	(%rsi), %rax	# pos_3(D)->next, next
> # fs/io_uring.c:273: 	pos->next = node;
> 	movq	%rdi, (%rsi)	# node, pos_3(D)->next
> # fs/io_uring.c:275: 	if (!next)
> 	testq	%rax, %rax	# next
> # fs/io_uring.c:274: 	node->next = next;
> 	movq	%rax, (%rdi)	# next, node_5(D)->next
> # fs/io_uring.c:275: 	if (!next)
> 	je	.L5927	#,
> 	ret	
> .L5927:
> # fs/io_uring.c:276: 		list->last = node;
> 	movq	%rdi, 8(%rdx)	# node, list_8(D)->last
> 	ret	
> 
> =====================================================
> 
> wq_list_add_after:
> # fs/io-wq.h:48: 	node->next = pos->next;
> 	movq	(%rsi), %rax	# pos_3(D)->next, _5
> # fs/io-wq.h:48: 	node->next = pos->next;
> 	movq	%rax, (%rdi)	# _5, node_2(D)->next
> # fs/io-wq.h:49: 	pos->next = node;
> 	movq	%rdi, (%rsi)	# node, pos_3(D)->next
> # fs/io-wq.h:50: 	if (!node->next)
> 	cmpq	$0, (%rdi)	#, node_2(D)->next
Hmm, this is definitely not good. I'm not sure why this is not optimised
to cmpq $0, %rax (I haven't touched assembly for a long time).
> 	je	.L5924	#,
> 	ret	
> .L5924:
> # fs/io-wq.h:51: 		list->last = node;
> 	movq	%rdi, 8(%rdx)	# node, list_4(D)->last
> 	ret	
> 
> 
>>
>> Signed-off-by: Hao Xu <[email protected]>
>> ---
>>   fs/io-wq.h | 6 ++----
>>   1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/io-wq.h b/fs/io-wq.h
>> index bf5c4c533760..8369a51b65c0 100644
>> --- a/fs/io-wq.h
>> +++ b/fs/io-wq.h
>> @@ -33,11 +33,9 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
>>   				     struct io_wq_work_node *pos,
>>   				     struct io_wq_work_list *list)
>>   {
>> -	struct io_wq_work_node *next = pos->next;
>> -
>> +	node->next = pos->next;
>>   	pos->next = node;
>> -	node->next = next;
>> -	if (!next)
>> +	if (!node->next)
>>   		list->last = node;
>>   }
>>   
>>
> 



* Re: [PATCH 1/8] io-wq: code clean for io_wq_add_work_after()
  2021-09-29  7:36     ` Hao Xu
@ 2021-09-29 11:23       ` Pavel Begunkov
  0 siblings, 0 replies; 24+ messages in thread
From: Pavel Begunkov @ 2021-09-29 11:23 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 9/29/21 8:36 AM, Hao Xu wrote:
> On 2021/9/28 7:08 PM, Pavel Begunkov wrote:
>> On 9/27/21 7:17 AM, Hao Xu wrote:
>>> Remove a local variable.
>>
>> It's there to help alias analysis, which usually can't do anything
>> with pointer heavy logic. Compare ASMs below, before and after
>> respectively:
>>     testq    %rax, %rax    # next
>>
>> replaced with
>>     cmpq    $0, (%rdi)    #, node_2(D)->next
>>
>> One extra memory dereference and a bigger binary

>> wq_list_add_after:
>> # fs/io-wq.h:48:     node->next = pos->next;
>>     movq    (%rsi), %rax    # pos_3(D)->next, _5
>> # fs/io-wq.h:48:     node->next = pos->next;
>>     movq    %rax, (%rdi)    # _5, node_2(D)->next
>> # fs/io-wq.h:49:     pos->next = node;
>>     movq    %rdi, (%rsi)    # node, pos_3(D)->next
>> # fs/io-wq.h:50:     if (!node->next)
>>     cmpq    $0, (%rdi)    #, node_2(D)->next
> hmm, this is definitely not good, not sure why this is not optimised to
> cmpq $0, %rax (haven't touched assembly for a long time..)

Nothing strange, it's alias analysis: the compiler can't infer that the
pointers don't point to overlapping memory, so after the store to
pos->next it can do nothing but reload node->next.

The __restrict__ C keyword would've helped, but it's not used in
the kernel.
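
To make the aliasing point concrete, here is a minimal userspace sketch. The struct and function names are simplified stand-ins invented for illustration, not the io-wq definitions, and what a given compiler emits for each variant is an assumption to verify with -S rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the io-wq node and list types (hypothetical). */
struct node { struct node *next; };
struct list { struct node *first, *last; };

/* Local copy: the NULL test can use the value already held in a register. */
static inline void add_after_local(struct node *node, struct node *pos,
				   struct list *list)
{
	struct node *next = pos->next;

	pos->next = node;
	node->next = next;
	if (!next)
		list->last = node;
}

/*
 * No local copy: if node == pos, the store to pos->next changes node->next,
 * so the compiler has to reload node->next before testing it.
 */
static inline void add_after_reload(struct node *node, struct node *pos,
				    struct list *list)
{
	node->next = pos->next;
	pos->next = node;
	if (!node->next)
		list->last = node;
}

/* restrict promises that node and pos never alias, so no reload is needed. */
static inline void add_after_restrict(struct node *restrict node,
				      struct node *restrict pos,
				      struct list *list)
{
	assert(node != pos);	/* the promise made by restrict */
	node->next = pos->next;
	pos->next = node;
	if (!node->next)
		list->last = node;
}
```

All three behave the same as long as node and pos are distinct; they differ only in the aliasing case, which is exactly why the compiler cannot turn the second form into the first on its own.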

-- 
Pavel Begunkov


* Re: [PATCH 2/8] io-wq: add helper to merge two wq_lists
  2021-09-28 16:48     ` Hao Xu
@ 2021-09-29 11:23       ` Pavel Begunkov
  0 siblings, 0 replies; 24+ messages in thread
From: Pavel Begunkov @ 2021-09-29 11:23 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 9/28/21 5:48 PM, Hao Xu wrote:
> On 2021/9/28 7:10 PM, Pavel Begunkov wrote:
>> On 9/27/21 7:17 AM, Hao Xu wrote:
>>> add a helper to merge two wq_lists, it will be useful in the next
>>> patches.
>>>
>>> Signed-off-by: Hao Xu <[email protected]>
>>> ---
>>>   fs/io-wq.h | 20 ++++++++++++++++++++
>>>   1 file changed, 20 insertions(+)
>>>
>>> diff --git a/fs/io-wq.h b/fs/io-wq.h
>>> index 8369a51b65c0..7510b05d4a86 100644
>>> --- a/fs/io-wq.h
>>> +++ b/fs/io-wq.h
>>> @@ -39,6 +39,26 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
>>>           list->last = node;
>>>   }
>>>   +/**
>>> + * wq_list_merge - merge the second list to the first one.
>>> + * @list0: the first list
>>> + * @list1: the second list
>>> + * after merge, list0 contains the merged list.
>>> + */
>>> +static inline void wq_list_merge(struct io_wq_work_list *list0,
>>> +                     struct io_wq_work_list *list1)
>>> +{
>>> +    if (!list1)
>>> +        return;
>>> +
>>> +    if (!list0) {
>>> +        list0 = list1;
>>
>> It assigns a local var and returns, the assignment will be compiled
>> out, something is wrong
> True, I've corrected it in v2.

I was looking at the wrong version then; I need to look through v2.
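
For reference, one way to avoid the dead assignment is to write through the pointer instead of reassigning the parameter. The sketch below is a userspace illustration with hypothetical names (struct node, struct list, list_merge); it is not the v2 patch, which is not part of this message.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the io-wq node and list types (hypothetical). */
struct node { struct node *next; };
struct list { struct node *first, *last; };

/* Append list1 to list0 and leave list1 empty. */
static inline void list_merge(struct list *list0, struct list *list1)
{
	assert(list0 && list1);

	if (!list1->first)
		return;			/* nothing to append */

	if (!list0->first) {
		/* Copy the head and tail through the pointer; "list0 = list1"
		 * would only change the local parameter. */
		*list0 = *list1;
	} else {
		list0->last->next = list1->first;
		list0->last = list1->last;
	}
	list1->first = list1->last = NULL;
}
```

Note that the sketch tests for an empty list via ->first rather than testing the list pointer itself, since callers pass the address of an embedded list head.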

>>
>>> +        return;
>>> +    }
>>> +    list0->last->next = list1->first;
>>> +    list0->last = list1->last;
>>> +}
>>> +
>>>   static inline void wq_list_add_tail(struct io_wq_work_node *node,
>>>                       struct io_wq_work_list *list)
>>>   {
>>>
>>
> 

-- 
Pavel Begunkov


* Re: [PATCH 3/8] io_uring: add a limited tw list for irq completion work
  2021-09-28 16:55     ` Hao Xu
@ 2021-09-29 11:25       ` Pavel Begunkov
  0 siblings, 0 replies; 24+ messages in thread
From: Pavel Begunkov @ 2021-09-29 11:25 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 9/28/21 5:55 PM, Hao Xu wrote:
> On 2021/9/28 7:29 PM, Pavel Begunkov wrote:
[...]
>> It solves the problem of total starvation of non-prio requests, e.g.
>> if new completions are coming as fast as you complete previous ones. One
>> downside is that prio requests coming while we execute a previous
>> batch will be executed only after a previous batch of non-prio
>> requests, I don't think it's much of a problem but interesting to
>> see numbers.
> Actually this was one of my implementations. I split it into two lists
> explicitly, mostly for the convenience of 8/8, which batches the TWs in
> the prior list. I'll evaluate the overhead tomorrow.

I guess so, because it more closely resembles v1 but without inverting
the order of the IRQ sublist.


>>
>>>           INIT_WQ_LIST(&tctx->task_list);
>>> +        INIT_WQ_LIST(&tctx->prior_task_list);
>>> +        tctx->nr = tctx->prior_nr = 0;
>>>           if (!node)
>>>               tctx->task_running = false;
>>>           spin_unlock_irq(&tctx->task_lock);
>>> @@ -2166,7 +2174,7 @@ static void tctx_task_work(struct callback_head *cb)
>>>       ctx_flush_and_put(ctx, &locked);
>>>   }
>>>   -static void io_req_task_work_add(struct io_kiocb *req)
>>> +static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
>>
>> I think "priority" instead of "emergency" will be more accurate
>>
>>>   {
>>>       struct task_struct *tsk = req->task;
>>>       struct io_uring_task *tctx = tsk->io_uring;
>>> @@ -2178,7 +2186,13 @@ static void io_req_task_work_add(struct io_kiocb *req)
>>>       WARN_ON_ONCE(!tctx);
>>>         spin_lock_irqsave(&tctx->task_lock, flags);
>>> -    wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>>> +    if (emergency && tctx->prior_nr * MAX_EMERGENCY_TW_RATIO < tctx->nr) {
>>> +        wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
>>> +        tctx->prior_nr++;
>>> +    } else {
>>> +        wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>>> +    }
>>> +    tctx->nr++;
>>>       running = tctx->task_running;
>>>       if (!running)
>>>           tctx->task_running = true;

-- 
Pavel Begunkov


* Re: [PATCH 3/8] io_uring: add a limited tw list for irq completion work
  2021-09-28 11:29   ` Pavel Begunkov
  2021-09-28 16:55     ` Hao Xu
@ 2021-09-29 11:38     ` Hao Xu
  2021-09-30  9:02       ` Pavel Begunkov
  2021-09-30  3:21     ` Hao Xu
  2 siblings, 1 reply; 24+ messages in thread
From: Hao Xu @ 2021-09-29 11:38 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/9/28 7:29 PM, Pavel Begunkov wrote:
> On 9/27/21 7:17 AM, Hao Xu wrote:
>> We now have a lot of task_work users, and some of them only complete a
>> req and generate a cqe. Let's put that work on a new tw list with a
>> higher priority, so that it is handled quickly and avg req latency
>> drops. An illustrative case:
>>
>> origin timeline:
>>      submit_sqe-->irq-->add completion task_work
>>      -->run heavy work0~n-->run completion task_work
>> now timeline:
>>      submit_sqe-->irq-->add completion task_work
>>      -->run completion task_work-->run heavy work0~n
>>
>> One thing to watch out for is that irq completion TWs sometimes arrive
>> in overwhelming numbers, which makes the new tw list grow fast and
>> starves the TWs in the old list. So we have to limit the length of the
>> new tw list. A practical value is 1/3:
>>      len of new tw list < 1/3 * (len of new + old tw list)
>>
>> In this way, the new tw list has a limited length and normal tasks get
>> their chance to run.
>>
>> Tested this patch (and the following ones) by manually replacing
>> __io_queue_sqe() with io_req_task_complete() to construct 'heavy' task
>> works. Then tested with fio:
>>
>> ioengine=io_uring
>> thread=1
>> bs=4k
>> direct=1
>> rw=randread
>> time_based=1
>> runtime=600
>> randrepeat=0
>> group_reporting=1
>> filename=/dev/nvme0n1
>>
>> Tried various iodepths.
>> The peak IOPS with this patch is 314K, while the old one is 249K.
>> For avg latency, the difference shows as iodepth grows:
>> depth and avg latency (usec):
>> 	depth      new          old
>> 	 1        22.80        23.77
>> 	 2        23.48        24.54
>> 	 4        24.26        25.57
>> 	 8        29.21        32.89
>> 	 16       53.61        63.50
>> 	 32       106.29       131.34
>> 	 64       217.21       256.33
>> 	 128      421.59       513.87
>> 	 256      815.15       1050.99
>>
>> More data (95%, 99% etc.) is in the cover letter.
>>
>> Signed-off-by: Hao Xu <[email protected]>
>> ---
>>   fs/io_uring.c | 44 +++++++++++++++++++++++++++++++-------------
>>   1 file changed, 31 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 8317c360f7a4..9272b2cfcfb7 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -461,6 +461,7 @@ struct io_ring_ctx {
>>   	};
>>   };
>>   
>> +#define MAX_EMERGENCY_TW_RATIO	3
>>   struct io_uring_task {
>>   	/* submission side */
>>   	int			cached_refs;
>> @@ -475,6 +476,9 @@ struct io_uring_task {
>>   	spinlock_t		task_lock;
>>   	struct io_wq_work_list	task_list;
>>   	struct callback_head	task_work;
>> +	struct io_wq_work_list	prior_task_list;
>> +	unsigned int		nr;
>> +	unsigned int		prior_nr;
>>   	bool			task_running;
>>   };
>>   
>> @@ -2132,12 +2136,16 @@ static void tctx_task_work(struct callback_head *cb)
>>   	while (1) {
>>   		struct io_wq_work_node *node;
>>   
>> -		if (!tctx->task_list.first && locked)
>> +		if (!tctx->prior_task_list.first &&
>> +		    !tctx->task_list.first && locked)
>>   			io_submit_flush_completions(ctx);
>>   
>>   		spin_lock_irq(&tctx->task_lock);
>> -		node = tctx->task_list.first;
>> +		wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
>> +		node = tctx->prior_task_list.first;
> 
> I find all this accounting expensive, sure I'll see it for my BPF tests.
May I ask how you evaluate the overhead with BPF here?
> 
> How about
> 1) remove MAX_EMERGENCY_TW_RATIO and all the counters,
> prior_nr and others.
> 
> 2) rely solely on list merging
> 
> So, when it enters an iteration of the loop it finds a set of requests
> to run, it first executes all priority ones of that set and then the
> rest (just by the fact that you merged the lists and execute all from
> them).
> 
> It solves the problem of total starvation of non-prio requests, e.g.
> if new completions are coming as fast as you complete previous ones. One
> downside is that prio requests coming while we execute a previous
> batch will be executed only after a previous batch of non-prio
> requests, I don't think it's much of a problem but interesting to
> see numbers.
Hmm, this probably doesn't solve the starvation, since a large number of
priority TWs may sit ahead of the non-prio TWs in one iteration, e.g.
when many sqes are submitted in one io_submit_sqes() call. That's why I
cap the priority TWs at 1/3 there.
> 
> 
>>   		INIT_WQ_LIST(&tctx->task_list);
>> +		INIT_WQ_LIST(&tctx->prior_task_list);
>> +		tctx->nr = tctx->prior_nr = 0;
>>   		if (!node)
>>   			tctx->task_running = false;
>>   		spin_unlock_irq(&tctx->task_lock);
>> @@ -2166,7 +2174,7 @@ static void tctx_task_work(struct callback_head *cb)
>>   	ctx_flush_and_put(ctx, &locked);
>>   }
>>   
>> -static void io_req_task_work_add(struct io_kiocb *req)
>> +static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
> 
> I think "priority" instead of "emergency" will be more accurate
> 
>>   {
>>   	struct task_struct *tsk = req->task;
>>   	struct io_uring_task *tctx = tsk->io_uring;
>> @@ -2178,7 +2186,13 @@ static void io_req_task_work_add(struct io_kiocb *req)
>>   	WARN_ON_ONCE(!tctx);
>>   
>>   	spin_lock_irqsave(&tctx->task_lock, flags);
>> -	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>> +	if (emergency && tctx->prior_nr * MAX_EMERGENCY_TW_RATIO < tctx->nr) {
>> +		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
>> +		tctx->prior_nr++;
>> +	} else {
>> +		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>> +	}
>> +	tctx->nr++;
>>   	running = tctx->task_running;
>>   	if (!running)
>>   		tctx->task_running = true;
> 
> 
> 



* Re: [PATCH 3/8] io_uring: add a limited tw list for irq completion work
  2021-09-28 11:29   ` Pavel Begunkov
  2021-09-28 16:55     ` Hao Xu
  2021-09-29 11:38     ` Hao Xu
@ 2021-09-30  3:21     ` Hao Xu
  2 siblings, 0 replies; 24+ messages in thread
From: Hao Xu @ 2021-09-30  3:21 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/9/28 7:29 PM, Pavel Begunkov wrote:
> On 9/27/21 7:17 AM, Hao Xu wrote:
>> We now have a lot of task_work users, and some of them only complete a
>> req and generate a cqe. Let's put that work on a new tw list with a
>> higher priority, so that it is handled quickly and avg req latency
>> drops. An illustrative case:
>>
>> origin timeline:
>>      submit_sqe-->irq-->add completion task_work
>>      -->run heavy work0~n-->run completion task_work
>> now timeline:
>>      submit_sqe-->irq-->add completion task_work
>>      -->run completion task_work-->run heavy work0~n
>>
>> One thing to watch out for is that irq completion TWs sometimes arrive
>> in overwhelming numbers, which makes the new tw list grow fast and
>> starves the TWs in the old list. So we have to limit the length of the
>> new tw list. A practical value is 1/3:
>>      len of new tw list < 1/3 * (len of new + old tw list)
>>
>> In this way, the new tw list has a limited length and normal tasks get
>> their chance to run.
>>
>> Tested this patch (and the following ones) by manually replacing
>> __io_queue_sqe() with io_req_task_complete() to construct 'heavy' task
>> works. Then tested with fio:
>>
>> ioengine=io_uring
>> thread=1
>> bs=4k
>> direct=1
>> rw=randread
>> time_based=1
>> runtime=600
>> randrepeat=0
>> group_reporting=1
>> filename=/dev/nvme0n1
>>
>> Tried various iodepths.
>> The peak IOPS with this patch is 314K, while the old one is 249K.
>> For avg latency, the difference shows as iodepth grows:
>> depth and avg latency (usec):
>> 	depth      new          old
>> 	 1        22.80        23.77
>> 	 2        23.48        24.54
>> 	 4        24.26        25.57
>> 	 8        29.21        32.89
>> 	 16       53.61        63.50
>> 	 32       106.29       131.34
>> 	 64       217.21       256.33
>> 	 128      421.59       513.87
>> 	 256      815.15       1050.99
>>
>> More data (95%, 99% etc.) is in the cover letter.
>>
>> Signed-off-by: Hao Xu <[email protected]>
>> ---
>>   fs/io_uring.c | 44 +++++++++++++++++++++++++++++++-------------
>>   1 file changed, 31 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 8317c360f7a4..9272b2cfcfb7 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -461,6 +461,7 @@ struct io_ring_ctx {
>>   	};
>>   };
>>   
>> +#define MAX_EMERGENCY_TW_RATIO	3
>>   struct io_uring_task {
>>   	/* submission side */
>>   	int			cached_refs;
>> @@ -475,6 +476,9 @@ struct io_uring_task {
>>   	spinlock_t		task_lock;
>>   	struct io_wq_work_list	task_list;
>>   	struct callback_head	task_work;
>> +	struct io_wq_work_list	prior_task_list;
>> +	unsigned int		nr;
>> +	unsigned int		prior_nr;
>>   	bool			task_running;
>>   };
>>   
>> @@ -2132,12 +2136,16 @@ static void tctx_task_work(struct callback_head *cb)
>>   	while (1) {
>>   		struct io_wq_work_node *node;
>>   
>> -		if (!tctx->task_list.first && locked)
>> +		if (!tctx->prior_task_list.first &&
>> +		    !tctx->task_list.first && locked)
>>   			io_submit_flush_completions(ctx);
>>   
>>   		spin_lock_irq(&tctx->task_lock);
>> -		node = tctx->task_list.first;
>> +		wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
>> +		node = tctx->prior_task_list.first;
> 
> I find all this accounting expensive, sure I'll see it for my BPF tests.
> 
> How about
> 1) remove MAX_EMERGENCY_TW_RATIO and all the counters,
> prior_nr and others.
> 
> 2) rely solely on list merging
> 
> So, when it enters an iteration of the loop it finds a set of requests
> to run, it first executes all priority ones of that set and then the
> rest (just by the fact that you merged the lists and execute all from
> them).
> 
> It solves the problem of total starvation of non-prio requests, e.g.
> if new completions are coming as fast as you complete previous ones. One
> downside is that prio requests coming while we execute a previous
> batch will be executed only after a previous batch of non-prio
> requests, I don't think it's much of a problem but interesting to
> see numbers.
Hi Pavel,
How about doing it with one counter?
Say MAX_EMERGENCY_TW_RATIO is k, the number of TWs in the priority list
is x, and the number in the non-priority list is y. Then a TW can be
inserted into the priority list under the condition:
             x <= 1/k * (x + y)
           =>k * x <= x + y
           =>(1 - k) * x + y >= 0

So we just need a variable z = (1 - k) * x + y. Every time a new TW
comes,
     if z >= 0, we add it to the prio list, and z += (1 - k)
     if z < 0, we add it to the non-prio list, and z++

So we need just one extra operation, and we can simplify the check to:
        if (emergency && z >= 0) add to prio list;
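
A minimal userspace sketch of that single-counter bookkeeping follows. The names (TW_RATIO, struct tw_acct, tw_account) are hypothetical, and x and y are kept only so the invariant z = (1 - k) * x + y can be checked; a real version would carry z alone and reset it wherever the lists are spliced, as the patch does for nr and prior_nr.

```c
#include <assert.h>
#include <stdbool.h>

#define TW_RATIO 3	/* k, mirrors MAX_EMERGENCY_TW_RATIO in the patch */

struct tw_acct {
	int z;		/* z = (1 - k) * x + y */
	unsigned int x;	/* TWs put on the priority list (checking only) */
	unsigned int y;	/* TWs put on the normal list (checking only) */
};

/* Returns true if the new TW should go on the priority list. */
static bool tw_account(struct tw_acct *a, bool priority)
{
	bool prio = priority && a->z >= 0;

	if (prio) {
		a->z += 1 - TW_RATIO;
		a->x++;
	} else {
		a->z++;
		a->y++;
	}
	/* The single counter must always agree with the two-counter form. */
	assert(a->z == (1 - TW_RATIO) * (int)a->x + (int)a->y);
	return prio;
}
```

With k = 3 and only priority TWs arriving, this admits every third one to the priority list and diverts the rest to the normal list.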

Regards,
Hao

> 
> 
>>   		INIT_WQ_LIST(&tctx->task_list);
>> +		INIT_WQ_LIST(&tctx->prior_task_list);
>> +		tctx->nr = tctx->prior_nr = 0;
>>   		if (!node)
>>   			tctx->task_running = false;
>>   		spin_unlock_irq(&tctx->task_lock);
>> @@ -2166,7 +2174,7 @@ static void tctx_task_work(struct callback_head *cb)
>>   	ctx_flush_and_put(ctx, &locked);
>>   }
>>   
>> -static void io_req_task_work_add(struct io_kiocb *req)
>> +static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
> 
> I think "priority" instead of "emergency" will be more accurate
> 
>>   {
>>   	struct task_struct *tsk = req->task;
>>   	struct io_uring_task *tctx = tsk->io_uring;
>> @@ -2178,7 +2186,13 @@ static void io_req_task_work_add(struct io_kiocb *req)
>>   	WARN_ON_ONCE(!tctx);
>>   
>>   	spin_lock_irqsave(&tctx->task_lock, flags);
>> -	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>> +	if (emergency && tctx->prior_nr * MAX_EMERGENCY_TW_RATIO < tctx->nr) {
>> +		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
>> +		tctx->prior_nr++;
>> +	} else {
>> +		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>> +	}
>> +	tctx->nr++;
>>   	running = tctx->task_running;
>>   	if (!running)
>>   		tctx->task_running = true;
> 
> 
> 



* Re: [PATCH 3/8] io_uring: add a limited tw list for irq completion work
  2021-09-29 11:38     ` Hao Xu
@ 2021-09-30  9:02       ` Pavel Begunkov
  0 siblings, 0 replies; 24+ messages in thread
From: Pavel Begunkov @ 2021-09-30  9:02 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 9/29/21 12:38 PM, Hao Xu wrote:
> On 2021/9/28 7:29 PM, Pavel Begunkov wrote:
[...]
>>>   @@ -2132,12 +2136,16 @@ static void tctx_task_work(struct callback_head *cb)
>>>       while (1) {
>>>           struct io_wq_work_node *node;
>>>   -        if (!tctx->task_list.first && locked)
>>> +        if (!tctx->prior_task_list.first &&
>>> +            !tctx->task_list.first && locked)
>>>               io_submit_flush_completions(ctx);
>>>             spin_lock_irq(&tctx->task_lock);
>>> -        node = tctx->task_list.first;
>>> +        wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
>>> +        node = tctx->prior_task_list.first;
>>
>> I find all this accounting expensive, sure I'll see it for my BPF tests.
> May I ask how do you evaluate the overhead with BPF here?

It's a custom branch, and apparently it would need some thinking on how
to apply your stuff on top, because of yet another list in [1]. In
short, the case in mind spins inside of tctx_task_work() doing one
request at a time.
I think it would be easier if I try it out myself.

[1] https://github.com/isilence/linux/commit/d6285a9817eb26aa52ad54a79584512d7efa82fd

>>
>> How about
>> 1) remove MAX_EMERGENCY_TW_RATIO and all the counters,
>> prior_nr and others.
>>
>> 2) rely solely on list merging
>>
>> So, when it enters an iteration of the loop it finds a set of requests
>> to run, it first executes all priority ones of that set and then the
>> rest (just by the fact that you merged the lists and execute all from
>> them).
>>
>> It solves the problem of total starvation of non-prio requests, e.g.
>> if new completions are coming as fast as you complete previous ones. One
>> downside is that prio requests coming while we execute a previous
>> batch will be executed only after a previous batch of non-prio
>> requests, I don't think it's much of a problem but interesting to
>> see numbers.
> hmm, this probably doesn't solve the starvation, since there may be
> a number of priority TWs ahead of non-prio TWs in one iteration, in the
> case of submitting many sqes in one io_submit_sqes. That's why I keep
> just 1/3 priority TWs there.

I don't think it's a problem: they should be fast enough, and we have
a forward progress guarantee for non-prio. IMHO that should be enough.
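
The counter-free variant can be sketched roughly as below, in userspace C with hypothetical names (struct node, struct list, list_add_tail, take_batch). It only illustrates the ordering argument and is not the io_uring code: each loop iteration detaches everything queued so far, priority items first, so non-prio items of a batch always run before any priority item that arrives later.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the io-wq node and list types (hypothetical). */
struct node { struct node *next; int id; };
struct list { struct node *first, *last; };

static void list_add_tail(struct node *n, struct list *l)
{
	n->next = NULL;
	if (l->last)
		l->last->next = n;
	else
		l->first = n;
	l->last = n;
}

/*
 * Detach everything queued so far as one batch: priority items first,
 * then the normal ones. Both lists are left empty, so anything added
 * while the batch runs waits for the next call.
 */
static struct node *take_batch(struct list *prio, struct list *normal)
{
	struct node *head;

	assert(prio && normal);
	if (prio->first) {
		prio->last->next = normal->first;
		head = prio->first;
	} else {
		head = normal->first;
	}
	prio->first = prio->last = NULL;
	normal->first = normal->last = NULL;
	return head;
}
```

In the kernel the detach step would sit under tctx->task_lock, like the existing list snapshot in tctx_task_work(); locking is left out of the sketch.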


>>
>>
>>>           INIT_WQ_LIST(&tctx->task_list);
>>> +        INIT_WQ_LIST(&tctx->prior_task_list);
>>> +        tctx->nr = tctx->prior_nr = 0;
>>>           if (!node)
>>>               tctx->task_running = false;
>>>           spin_unlock_irq(&tctx->task_lock);
>>> @@ -2166,7 +2174,7 @@ static void tctx_task_work(struct callback_head *cb)
>>>       ctx_flush_and_put(ctx, &locked);
>>>   }
>>>   -static void io_req_task_work_add(struct io_kiocb *req)
>>> +static void io_req_task_work_add(struct io_kiocb *req, bool emergency)
>>
>> I think "priority" instead of "emergency" will be more accurate
>>
>>>   {
>>>       struct task_struct *tsk = req->task;
>>>       struct io_uring_task *tctx = tsk->io_uring;
>>> @@ -2178,7 +2186,13 @@ static void io_req_task_work_add(struct io_kiocb *req)
>>>       WARN_ON_ONCE(!tctx);
>>>         spin_lock_irqsave(&tctx->task_lock, flags);
>>> -    wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>>> +    if (emergency && tctx->prior_nr * MAX_EMERGENCY_TW_RATIO < tctx->nr) {
>>> +        wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
>>> +        tctx->prior_nr++;
>>> +    } else {
>>> +        wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
>>> +    }
>>> +    tctx->nr++;
>>>       running = tctx->task_running;
>>>       if (!running)
>>>           tctx->task_running = true;
>>
>>
>>
> 

-- 
Pavel Begunkov


Thread overview: 24+ messages
2021-09-27  6:17 [PATCH 0/6] task_work optimization Hao Xu
2021-09-27  6:17 ` [PATCH 1/8] io-wq: code clean for io_wq_add_work_after() Hao Xu
2021-09-28 11:08   ` Pavel Begunkov
2021-09-29  7:36     ` Hao Xu
2021-09-29 11:23       ` Pavel Begunkov
2021-09-27  6:17 ` [PATCH 2/8] io-wq: add helper to merge two wq_lists Hao Xu
2021-09-27 10:17   ` Hao Xu
2021-09-28 11:10   ` Pavel Begunkov
2021-09-28 16:48     ` Hao Xu
2021-09-29 11:23       ` Pavel Begunkov
2021-09-27  6:17 ` [PATCH 3/8] io_uring: add a limited tw list for irq completion work Hao Xu
2021-09-28 11:29   ` Pavel Begunkov
2021-09-28 16:55     ` Hao Xu
2021-09-29 11:25       ` Pavel Begunkov
2021-09-29 11:38     ` Hao Xu
2021-09-30  9:02       ` Pavel Begunkov
2021-09-30  3:21     ` Hao Xu
2021-09-27  6:17 ` [PATCH 4/8] io_uring: add helper for task work execution code Hao Xu
2021-09-27  6:17 ` [PATCH 5/8] io_uring: split io_req_complete_post() and add a helper Hao Xu
2021-09-27  6:17 ` [PATCH 6/8] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
2021-09-27  6:17 ` [PATCH 7/8] io_uring: add tw_ctx for io_uring_task Hao Xu
2021-09-27  6:17 ` [PATCH 8/8] io_uring: batch completion in prior_task_list Hao Xu
2021-09-27  6:21 ` [PATCH 0/6] task_work optimization Hao Xu
  -- strict thread matches above, loose matches on Subject: below --
2021-09-27 10:51 [PATCH v2 0/8] " Hao Xu
2021-09-27 10:51 ` [PATCH 4/8] io_uring: add helper for task work execution code Hao Xu