public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCH v6 0/6] task work optimization
@ 2021-11-26 10:07 Hao Xu
  2021-11-26 10:07 ` [PATCH 1/6] io-wq: add helper to merge two wq_lists Hao Xu
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Hao Xu @ 2021-11-26 10:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

v4->v5
- change the implementation of merge_wq_list

v5->v6
- change the logic of handling the prior task list to (see the sketch below):
  1) grabbed uring_lock: leverage the inline completion infra
  2) otherwise: batch __req_complete_post() calls to save
     completion_lock operations.
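
A minimal sketch of the per-request handling described above (function and
field names are taken from the patches in this series; this is an
illustrative outline, not the exact code):

	/* sketch only -- see patch 6 for the real handle_prior_tw_list() */
	if (*locked) {
		/* 1) uring_lock grabbed: reuse the inline completion infra */
		io_req_complete_state(req, req->result, io_put_rw_kbuf(req));
		io_req_add_compl_list(req);
	} else {
		/* 2) no uring_lock: CQEs are filled under a single
		 * completion_lock hold and committed once per batch
		 */
		__io_req_complete_post(req, req->result, io_put_rw_kbuf(req));
	}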


Hao Xu (6):
  io-wq: add helper to merge two wq_lists
  io_uring: add a priority tw list for irq completion work
  io_uring: add helper for task work execution code
  io_uring: split io_req_complete_post() and add a helper
  io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
  io_uring: batch completion in prior_task_list

 fs/io-wq.h    |  22 +++++++
 fs/io_uring.c | 168 ++++++++++++++++++++++++++++++++++----------------
 2 files changed, 136 insertions(+), 54 deletions(-)

-- 
2.24.4


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/6] io-wq: add helper to merge two wq_lists
  2021-11-26 10:07 [PATCH v6 0/6] task work optimization Hao Xu
@ 2021-11-26 10:07 ` Hao Xu
  2021-11-26 10:07 ` [PATCH 2/6] io_uring: add a priority tw list for irq completion work Hao Xu
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Hao Xu @ 2021-11-26 10:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Add a helper to merge two wq_lists; it will be useful in the following
patches.
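
For illustration, a hedged sketch of how the helper is meant to be used
(it mirrors the call sites added later in this series; the surrounding
code is simplified):

	struct io_wq_work_node *node;

	spin_lock_irq(&tctx->task_lock);
	/* splice both lists into one chain, prior entries first */
	node = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
	spin_unlock_irq(&tctx->task_lock);

	while (node) {
		struct io_wq_work_node *next = node->next;

		/* handle the request embedding this node */
		node = next;
	}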

Reviewed-by: Pavel Begunkov <[email protected]>
Signed-off-by: Hao Xu <[email protected]>
---
 fs/io-wq.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/fs/io-wq.h b/fs/io-wq.h
index 41bf37674a49..3709b7c5ec98 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -52,6 +52,28 @@ static inline void wq_list_add_after(struct io_wq_work_node *node,
 		list->last = node;
 }
 
+/**
+ * wq_list_merge - merge the second list into the first one.
+ * @list0: the first list
+ * @list1: the second list
+ * Return the first node after the merge.
+ */
+static inline struct io_wq_work_node *wq_list_merge(struct io_wq_work_list *list0,
+						    struct io_wq_work_list *list1)
+{
+	struct io_wq_work_node *ret;
+
+	if (!list0->first) {
+		ret = list1->first;
+	} else {
+		ret = list0->first;
+		list0->last->next = list1->first;
+	}
+	INIT_WQ_LIST(list0);
+	INIT_WQ_LIST(list1);
+	return ret;
+}
+
 static inline void wq_list_add_tail(struct io_wq_work_node *node,
 				    struct io_wq_work_list *list)
 {
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/6] io_uring: add a priority tw list for irq completion work
  2021-11-26 10:07 [PATCH v6 0/6] task work optimization Hao Xu
  2021-11-26 10:07 ` [PATCH 1/6] io-wq: add helper to merge two wq_lists Hao Xu
@ 2021-11-26 10:07 ` Hao Xu
  2021-11-26 10:07 ` [PATCH 3/6] io_uring: add helper for task work execution code Hao Xu
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Hao Xu @ 2021-11-26 10:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Now we have a lot of task_work users, and some of them exist only to
complete a req and generate a cqe. Let's put such work on a new tw list
with higher priority, so that it can be handled quickly, reducing the
average req latency and letting users issue the next round of sqes earlier.
An explanatory case:

origin timeline:
    submit_sqe-->irq-->add completion task_work
    -->run heavy work0~n-->run completion task_work
now timeline:
    submit_sqe-->irq-->add completion task_work
    -->run completion task_work-->run heavy work0~n

Limitation: this optimization only applies when submission and reaping
are done in different threads. Otherwise we anyhow have to submit new
sqes after returning to userspace, so the order of TWs doesn't matter.
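
As a concrete example (taken from the io_complete_rw() hunk below, shown
here only for illustration), an IRQ completion that merely fills a CQE is
queued to the priority list like this:

	req->result = res;
	req->io_task_work.func = io_req_task_complete;
	/* in this patch, prioritised only for SQPOLL-driven rings */
	io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));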

Tested this patch (and the following ones) by manually replacing
__io_queue_sqe() in io_queue_sqe() with io_req_task_queue() to construct
'heavy' task works, then tested with fio:

ioengine=io_uring
sqpoll=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=600
randrepeat=0
group_reporting=1
filename=/dev/nvme0n1

Tried various iodepths.
The peak IOPS with this patch is 710K, while the old one is 665K.
For avg latency, the difference shows as iodepth grows.
depth and avg latency (usec):
	depth      new          old
	1          7.05         7.10
	2          8.47         8.60
	4          10.42        10.42
	8          13.78        13.22
	16         27.41        24.33
	32         49.40        53.08
	64         102.53       103.36
	128        196.98       205.61
	256        372.99       414.88
	512        747.23       791.30
	1024       1472.59      1538.72
	2048       3153.49      3329.01
	4096       6387.86      6682.54
	8192       12150.25     12774.14
	16384      23085.58     26044.71

Signed-off-by: Hao Xu <[email protected]>
[pavel: rebase, fixup new *work_add call sites]
Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 08b1b3de9b3f..821aa0f8c643 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -474,6 +474,7 @@ struct io_uring_task {
 
 	spinlock_t		task_lock;
 	struct io_wq_work_list	task_list;
+	struct io_wq_work_list	prior_task_list;
 	struct callback_head	task_work;
 	bool			task_running;
 };
@@ -2206,12 +2207,12 @@ static void tctx_task_work(struct callback_head *cb)
 	while (1) {
 		struct io_wq_work_node *node;
 
-		if (!tctx->task_list.first && locked)
+		if (!tctx->prior_task_list.first &&
+		    !tctx->task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		node = tctx->task_list.first;
-		INIT_WQ_LIST(&tctx->task_list);
+		node= wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
 		if (!node)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
@@ -2240,7 +2241,7 @@ static void tctx_task_work(struct callback_head *cb)
 	ctx_flush_and_put(ctx, &locked);
 }
 
-static void io_req_task_work_add(struct io_kiocb *req)
+static void io_req_task_work_add(struct io_kiocb *req, bool priority)
 {
 	struct task_struct *tsk = req->task;
 	struct io_uring_task *tctx = tsk->io_uring;
@@ -2252,7 +2253,10 @@ static void io_req_task_work_add(struct io_kiocb *req)
 	WARN_ON_ONCE(!tctx);
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
-	wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
+	if (priority)
+		wq_list_add_tail(&req->io_task_work.node, &tctx->prior_task_list);
+	else
+		wq_list_add_tail(&req->io_task_work.node, &tctx->task_list);
 	running = tctx->task_running;
 	if (!running)
 		tctx->task_running = true;
@@ -2277,8 +2281,7 @@ static void io_req_task_work_add(struct io_kiocb *req)
 
 	spin_lock_irqsave(&tctx->task_lock, flags);
 	tctx->task_running = false;
-	node = tctx->task_list.first;
-	INIT_WQ_LIST(&tctx->task_list);
+	node = wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
 	spin_unlock_irqrestore(&tctx->task_lock, flags);
 
 	while (node) {
@@ -2315,19 +2318,19 @@ static void io_req_task_queue_fail(struct io_kiocb *req, int ret)
 {
 	req->result = ret;
 	req->io_task_work.func = io_req_task_cancel;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static void io_req_task_queue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_req_task_submit;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static void io_req_task_queue_reissue(struct io_kiocb *req)
 {
 	req->io_task_work.func = io_queue_async_work;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 }
 
 static inline void io_queue_next(struct io_kiocb *req)
@@ -2437,7 +2440,7 @@ static inline void io_put_req_deferred(struct io_kiocb *req)
 {
 	if (req_ref_put_and_test(req)) {
 		req->io_task_work.func = io_free_req_work;
-		io_req_task_work_add(req);
+		io_req_task_work_add(req, false);
 	}
 }
 
@@ -2742,7 +2745,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
 		return;
 	req->result = res;
 	req->io_task_work.func = io_req_task_complete;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
 }
 
 static void io_complete_rw_iopoll(struct kiocb *kiocb, long res)
@@ -2984,7 +2987,7 @@ static void kiocb_done(struct io_kiocb *req, ssize_t ret,
 			req_set_fail(req);
 			req->result = ret;
 			req->io_task_work.func = io_req_task_complete;
-			io_req_task_work_add(req);
+			io_req_task_work_add(req, false);
 		}
 	}
 }
@@ -5317,7 +5320,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
 	 * of executing it. We can't safely execute it anyway, as we may not
 	 * have the needed state needed for it anyway.
 	 */
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return 1;
 }
 
@@ -5985,7 +5988,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -6960,7 +6963,7 @@ static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer)
 	spin_unlock_irqrestore(&ctx->timeout_lock, flags);
 
 	req->io_task_work.func = io_req_task_link_timeout;
-	io_req_task_work_add(req);
+	io_req_task_work_add(req, false);
 	return HRTIMER_NORESTART;
 }
 
@@ -8675,6 +8678,7 @@ static __cold int io_uring_alloc_task_context(struct task_struct *task,
 	task->io_uring = tctx;
 	spin_lock_init(&tctx->task_lock);
 	INIT_WQ_LIST(&tctx->task_list);
+	INIT_WQ_LIST(&tctx->prior_task_list);
 	init_task_work(&tctx->task_work, tctx_task_work);
 	return 0;
 }
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/6] io_uring: add helper for task work execution code
  2021-11-26 10:07 [PATCH v6 0/6] task work optimization Hao Xu
  2021-11-26 10:07 ` [PATCH 1/6] io-wq: add helper to merge two wq_lists Hao Xu
  2021-11-26 10:07 ` [PATCH 2/6] io_uring: add a priority tw list for irq completion work Hao Xu
@ 2021-11-26 10:07 ` Hao Xu
  2021-11-26 10:07 ` [PATCH 4/6] io_uring: split io_req_complete_post() and add a helper Hao Xu
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Hao Xu @ 2021-11-26 10:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Add a helper for task work execution code. We will use it later.

Reviewed-by: Pavel Begunkov <[email protected]>
Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 821aa0f8c643..c00363d761f4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2197,6 +2197,25 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
 	percpu_ref_put(&ctx->refs);
 }
 
+static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
+{
+	do {
+		struct io_wq_work_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+
+		if (req->ctx != *ctx) {
+			ctx_flush_and_put(*ctx, locked);
+			*ctx = req->ctx;
+			/* if not contended, grab and improve batching */
+			*locked = mutex_trylock(&(*ctx)->uring_lock);
+			percpu_ref_get(&(*ctx)->refs);
+		}
+		req->io_task_work.func(req, locked);
+		node = next;
+	} while (node);
+}
+
 static void tctx_task_work(struct callback_head *cb)
 {
 	bool locked = false;
@@ -2219,22 +2238,7 @@ static void tctx_task_work(struct callback_head *cb)
 		if (!node)
 			break;
 
-		do {
-			struct io_wq_work_node *next = node->next;
-			struct io_kiocb *req = container_of(node, struct io_kiocb,
-							    io_task_work.node);
-
-			if (req->ctx != ctx) {
-				ctx_flush_and_put(ctx, &locked);
-				ctx = req->ctx;
-				/* if not contended, grab and improve batching */
-				locked = mutex_trylock(&ctx->uring_lock);
-				percpu_ref_get(&ctx->refs);
-			}
-			req->io_task_work.func(req, &locked);
-			node = next;
-		} while (node);
-
+		handle_tw_list(node, &ctx, &locked);
 		cond_resched();
 	}
 
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/6] io_uring: split io_req_complete_post() and add a helper
  2021-11-26 10:07 [PATCH v6 0/6] task work optimization Hao Xu
                   ` (2 preceding siblings ...)
  2021-11-26 10:07 ` [PATCH 3/6] io_uring: add helper for task work execution code Hao Xu
@ 2021-11-26 10:07 ` Hao Xu
  2021-11-26 10:07 ` [PATCH 5/6] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Hao Xu @ 2021-11-26 10:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Split io_req_complete_post(): factor out a helper that runs with
completion_lock already held. This is a prep for the next patch.

Signed-off-by: Hao Xu <[email protected]>
[pavel: hand rebase]
Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index c00363d761f4..e7508ea4cd68 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1871,12 +1871,11 @@ static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data,
 	return __io_fill_cqe(ctx, user_data, res, cflags);
 }
 
-static void io_req_complete_post(struct io_kiocb *req, s32 res,
-				 u32 cflags)
+static void __io_req_complete_post(struct io_kiocb *req, s32 res,
+				   u32 cflags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
-	spin_lock(&ctx->completion_lock);
 	if (!(req->flags & REQ_F_CQE_SKIP))
 		__io_fill_cqe(ctx, req->user_data, res, cflags);
 	/*
@@ -1898,6 +1897,15 @@ static void io_req_complete_post(struct io_kiocb *req, s32 res,
 		wq_list_add_head(&req->comp_list, &ctx->locked_free_list);
 		ctx->locked_free_nr++;
 	}
+}
+
+static void io_req_complete_post(struct io_kiocb *req, s32 res,
+				 u32 cflags)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	spin_lock(&ctx->completion_lock);
+	__io_req_complete_post(req, res, cflags);
 	io_commit_cqring(ctx);
 	spin_unlock(&ctx->completion_lock);
 	io_cqring_ev_posted(ctx);
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/6] io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
  2021-11-26 10:07 [PATCH v6 0/6] task work optimization Hao Xu
                   ` (3 preceding siblings ...)
  2021-11-26 10:07 ` [PATCH 4/6] io_uring: split io_req_complete_post() and add a helper Hao Xu
@ 2021-11-26 10:07 ` Hao Xu
  2021-11-26 10:07 ` [PATCH 6/6] io_uring: batch completion in prior_task_list Hao Xu
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Hao Xu @ 2021-11-26 10:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Move them up to avoid forward declarations; we will use them in later
patches.

Reviewed-by: Pavel Begunkov <[email protected]>
Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 36 ++++++++++++++++++------------------
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e7508ea4cd68..e9c67f19d585 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2205,6 +2205,24 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, bool *locked)
 	percpu_ref_put(&ctx->refs);
 }
 
+static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf)
+{
+	unsigned int cflags;
+
+	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
+	cflags |= IORING_CQE_F_BUFFER;
+	req->flags &= ~REQ_F_BUFFER_SELECTED;
+	kfree(kbuf);
+	return cflags;
+}
+
+static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
+{
+	if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
+		return 0;
+	return io_put_kbuf(req, req->kbuf);
+}
+
 static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
 {
 	do {
@@ -2471,24 +2489,6 @@ static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx)
 	return smp_load_acquire(&rings->sq.tail) - ctx->cached_sq_head;
 }
 
-static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf)
-{
-	unsigned int cflags;
-
-	cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
-	cflags |= IORING_CQE_F_BUFFER;
-	req->flags &= ~REQ_F_BUFFER_SELECTED;
-	kfree(kbuf);
-	return cflags;
-}
-
-static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
-{
-	if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
-		return 0;
-	return io_put_kbuf(req, req->kbuf);
-}
-
 static inline bool io_run_task_work(void)
 {
 	if (test_thread_flag(TIF_NOTIFY_SIGNAL) || current->task_works) {
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/6] io_uring: batch completion in prior_task_list
  2021-11-26 10:07 [PATCH v6 0/6] task work optimization Hao Xu
                   ` (4 preceding siblings ...)
  2021-11-26 10:07 ` [PATCH 5/6] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
@ 2021-11-26 10:07 ` Hao Xu
  2021-11-26 12:56   ` Hao Xu
  2021-11-26 13:37 ` [PATCH RESEND " Hao Xu
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Hao Xu @ 2021-11-26 10:07 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

In previous patches, we have already gathered in prior_task_list some task
works with io_req_task_complete() as the callback. Let's complete them in
batch when we cannot grab the uring lock; this way we batch the
req_complete_post path.
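
For reference, a simplified sketch of the batched fallback path, assuming
(for brevity) that every request on the list belongs to the same ctx; the
real handle_prior_tw_list() below also handles ctx switches and the case
where uring_lock can be trylock'ed:

	spin_lock(&ctx->completion_lock);
	do {
		struct io_wq_work_node *next = node->next;
		struct io_kiocb *req = container_of(node, struct io_kiocb,
						    io_task_work.node);

		/* fill the CQE; no per-request lock/commit/wakeup */
		__io_req_complete_post(req, req->result, io_put_rw_kbuf(req));
		node = next;
	} while (node);
	io_commit_cqring(ctx);
	spin_unlock(&ctx->completion_lock);
	io_cqring_ev_posted(ctx);	/* one wakeup for the whole batch */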

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 58 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 51 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e9c67f19d585..b474340c5ea5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2223,6 +2223,43 @@ static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
 	return io_put_kbuf(req, req->kbuf);
 }
 
+static inline void ctx_commit_and_unlock(struct io_ring_ctx *ctx)
+{
+	io_commit_cqring(ctx);
+	spin_unlock(&ctx->completion_lock);
+	io_cqring_ev_posted(ctx);
+}
+
+static void handle_prior_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx,
+				 bool *locked)
+{
+	do {
+		struct io_wq_work_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+
+		if (req->ctx != *ctx) {
+			if (unlikely(!*locked) && *ctx)
+				ctx_commit_and_unlock(*ctx);
+			ctx_flush_and_put(*ctx, locked);
+			*ctx = req->ctx;
+			/* if not contended, grab and improve batching */
+			*locked = mutex_trylock(&(*ctx)->uring_lock);
+			percpu_ref_get(&(*ctx)->refs);
+			if (unlikely(*locked))
+				spin_lock(&(*ctx)->completion_lock);
+		}
+		if (likely(*locked))
+			req->io_task_work.func(req, locked);
+		else
+			__io_req_complete_post(req, req->result, io_put_rw_kbuf(req));
+		node = next;
+	} while (node);
+
+	if (unlikely(!*locked) && *ctx)
+		ctx_commit_and_unlock(*ctx);
+}
+
 static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
 {
 	do {
@@ -2250,21 +2287,28 @@ static void tctx_task_work(struct callback_head *cb)
 						  task_work);
 
 	while (1) {
-		struct io_wq_work_node *node;
+		struct io_wq_work_node *node1, *node2;
 
-		if (!tctx->prior_task_list.first &&
-		    !tctx->task_list.first && locked)
+		if (!tctx->task_list.first &&
+		    !tctx->prior_task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		node= wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
-		if (!node)
+		node1 = tctx->prior_task_list.first;
+		node2 = tctx->task_list.first;
+		INIT_WQ_LIST(&tctx->task_list);
+		INIT_WQ_LIST(&tctx->prior_task_list);
+		if (!node2 && !node1)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
-		if (!node)
+		if (!node2 && !node1)
 			break;
 
-		handle_tw_list(node, &ctx, &locked);
+		if (node1)
+			handle_prior_tw_list(node1, &ctx, &locked);
+
+		if (node2)
+			handle_tw_list(node2, &ctx, &locked);
 		cond_resched();
 	}
 
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/6] io_uring: batch completion in prior_task_list
  2021-11-26 10:07 ` [PATCH 6/6] io_uring: batch completion in prior_task_list Hao Xu
@ 2021-11-26 12:56   ` Hao Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Hao Xu @ 2021-11-26 12:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

On 2021/11/26 6:07 PM, Hao Xu wrote:
> In previous patches, we have already gathered some tw with
> io_req_task_complete() as callback in prior_task_list, let's complete
> them in batch while we cannot grab uring lock. In this way, we batch
> the req_complete_post path.
> 
> Signed-off-by: Hao Xu <[email protected]>
Something is wrong with this one; I will update it and resend it later.
> ---
>   fs/io_uring.c | 58 ++++++++++++++++++++++++++++++++++++++++++++-------
>   1 file changed, 51 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index e9c67f19d585..b474340c5ea5 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2223,6 +2223,43 @@ static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
>   	return io_put_kbuf(req, req->kbuf);
>   }
>   
> +static inline void ctx_commit_and_unlock(struct io_ring_ctx *ctx)
> +{
> +	io_commit_cqring(ctx);
> +	spin_unlock(&ctx->completion_lock);
> +	io_cqring_ev_posted(ctx);
> +}
> +
> +static void handle_prior_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx,
> +				 bool *locked)
> +{
> +	do {
> +		struct io_wq_work_node *next = node->next;
> +		struct io_kiocb *req = container_of(node, struct io_kiocb,
> +						    io_task_work.node);
> +
> +		if (req->ctx != *ctx) {
> +			if (unlikely(!*locked) && *ctx)
> +				ctx_commit_and_unlock(*ctx);
> +			ctx_flush_and_put(*ctx, locked);
> +			*ctx = req->ctx;
> +			/* if not contended, grab and improve batching */
> +			*locked = mutex_trylock(&(*ctx)->uring_lock);
> +			percpu_ref_get(&(*ctx)->refs);
> +			if (unlikely(*locked))
> +				spin_lock(&(*ctx)->completion_lock);
> +		}
> +		if (likely(*locked))
> +			req->io_task_work.func(req, locked);
> +		else
> +			__io_req_complete_post(req, req->result, io_put_rw_kbuf(req));
> +		node = next;
> +	} while (node);
> +
> +	if (unlikely(!*locked) && *ctx)
> +		ctx_commit_and_unlock(*ctx);
> +}
> +
>   static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
>   {
>   	do {
> @@ -2250,21 +2287,28 @@ static void tctx_task_work(struct callback_head *cb)
>   						  task_work);
>   
>   	while (1) {
> -		struct io_wq_work_node *node;
> +		struct io_wq_work_node *node1, *node2;
>   
> -		if (!tctx->prior_task_list.first &&
> -		    !tctx->task_list.first && locked)
> +		if (!tctx->task_list.first &&
> +		    !tctx->prior_task_list.first && locked)
>   			io_submit_flush_completions(ctx);
>   
>   		spin_lock_irq(&tctx->task_lock);
> -		node= wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
> -		if (!node)
> +		node1 = tctx->prior_task_list.first;
> +		node2 = tctx->task_list.first;
> +		INIT_WQ_LIST(&tctx->task_list);
> +		INIT_WQ_LIST(&tctx->prior_task_list);
> +		if (!node2 && !node1)
>   			tctx->task_running = false;
>   		spin_unlock_irq(&tctx->task_lock);
> -		if (!node)
> +		if (!node2 && !node1)
>   			break;
>   
> -		handle_tw_list(node, &ctx, &locked);
> +		if (node1)
> +			handle_prior_tw_list(node1, &ctx, &locked);
> +
> +		if (node2)
> +			handle_tw_list(node2, &ctx, &locked);
>   		cond_resched();
>   	}
>   
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH RESEND 6/6] io_uring: batch completion in prior_task_list
  2021-11-26 10:07 [PATCH v6 0/6] task work optimization Hao Xu
                   ` (5 preceding siblings ...)
  2021-11-26 10:07 ` [PATCH 6/6] io_uring: batch completion in prior_task_list Hao Xu
@ 2021-11-26 13:37 ` Hao Xu
  2021-11-27 15:24   ` [PATCH v7] " Hao Xu
  2021-12-03  1:39 ` [PATCH v6 0/6] task work optimization Pavel Begunkov
  2021-12-04 20:58 ` Pavel Begunkov
  8 siblings, 1 reply; 22+ messages in thread
From: Hao Xu @ 2021-11-26 13:37 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

In previous patches, we have already gathered in prior_task_list some task
works with io_req_task_complete() as the callback. Let's complete them in
batch when we cannot grab the uring lock; this way we batch the
req_complete_post path.

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io_uring.c | 58 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 51 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e9c67f19d585..b7515de45c00 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2223,6 +2223,43 @@ static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
 	return io_put_kbuf(req, req->kbuf);
 }
 
+static inline void ctx_commit_and_unlock(struct io_ring_ctx *ctx)
+{
+	io_commit_cqring(ctx);
+	spin_unlock(&ctx->completion_lock);
+	io_cqring_ev_posted(ctx);
+}
+
+static void handle_prior_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx,
+				 bool *locked)
+{
+	do {
+		struct io_wq_work_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+
+		if (req->ctx != *ctx) {
+			if (unlikely(!*locked) && *ctx)
+				ctx_commit_and_unlock(*ctx);
+			ctx_flush_and_put(*ctx, locked);
+			*ctx = req->ctx;
+			/* if not contended, grab and improve batching */
+			*locked = mutex_trylock(&(*ctx)->uring_lock);
+			percpu_ref_get(&(*ctx)->refs);
+			if (unlikely(!*locked))
+				spin_lock(&(*ctx)->completion_lock);
+		}
+		if (likely(*locked))
+			req->io_task_work.func(req, locked);
+		else
+			__io_req_complete_post(req, req->result, io_put_rw_kbuf(req));
+		node = next;
+	} while (node);
+
+	if (unlikely(!*locked) && *ctx)
+		ctx_commit_and_unlock(*ctx);
+}
+
 static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
 {
 	do {
@@ -2250,21 +2287,28 @@ static void tctx_task_work(struct callback_head *cb)
 						  task_work);
 
 	while (1) {
-		struct io_wq_work_node *node;
+		struct io_wq_work_node *node1, *node2;
 
-		if (!tctx->prior_task_list.first &&
-		    !tctx->task_list.first && locked)
+		if (!tctx->task_list.first &&
+		    !tctx->prior_task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		node= wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
-		if (!node)
+		node1 = tctx->prior_task_list.first;
+		node2 = tctx->task_list.first;
+		INIT_WQ_LIST(&tctx->task_list);
+		INIT_WQ_LIST(&tctx->prior_task_list);
+		if (!node2 && !node1)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
-		if (!node)
+		if (!node2 && !node1)
 			break;
 
-		handle_tw_list(node, &ctx, &locked);
+		if (node1)
+			handle_prior_tw_list(node1, &ctx, &locked);
+
+		if (node2)
+			handle_tw_list(node2, &ctx, &locked);
 		cond_resched();
 	}
 
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v7] io_uring: batch completion in prior_task_list
  2021-11-26 13:37 ` [PATCH RESEND " Hao Xu
@ 2021-11-27 15:24   ` Hao Xu
  2021-11-28 15:28     ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Hao Xu @ 2021-11-27 15:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

In previous patches, we have already gathered in prior_task_list some task
works with io_req_task_complete() as the callback. Let's complete them in
batch when we cannot grab the uring lock; this way we batch the
req_complete_post path.

Signed-off-by: Hao Xu <[email protected]>
---

v6->v7
- use a function pointer to avoid the if check every time a task work
  is run in handle_prior_tw_list()

 fs/io_uring.c | 68 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e9c67f19d585..2c0ff1fc6974 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2223,6 +2223,53 @@ static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
 	return io_put_kbuf(req, req->kbuf);
 }
 
+static inline void ctx_commit_and_unlock(struct io_ring_ctx *ctx)
+{
+	io_commit_cqring(ctx);
+	spin_unlock(&ctx->completion_lock);
+	io_cqring_ev_posted(ctx);
+}
+
+static inline void io_req_complete_inline(struct io_kiocb *req, s32 res,
+					  u32 cflags)
+{
+		io_req_complete_state(req, res, cflags);
+		io_req_add_compl_list(req);
+}
+
+static void handle_prior_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx,
+				 bool *locked)
+{
+	void (*io_req_complete_func)(struct io_kiocb *, s32, u32);
+
+	do {
+		struct io_wq_work_node *next = node->next;
+		struct io_kiocb *req = container_of(node, struct io_kiocb,
+						    io_task_work.node);
+
+		if (req->ctx != *ctx) {
+			if (unlikely(!*locked) && *ctx)
+				ctx_commit_and_unlock(*ctx);
+			ctx_flush_and_put(*ctx, locked);
+			*ctx = req->ctx;
+			/* if not contended, grab and improve batching */
+			*locked = mutex_trylock(&(*ctx)->uring_lock);
+			percpu_ref_get(&(*ctx)->refs);
+			if (unlikely(!*locked)) {
+				spin_lock(&(*ctx)->completion_lock);
+				io_req_complete_func = __io_req_complete_post;
+			} else {
+				io_req_complete_func = io_req_complete_inline;
+			}
+		}
+		io_req_complete_func(req, req->result, io_put_rw_kbuf(req));
+		node = next;
+	} while (node);
+
+	if (unlikely(!*locked) && *ctx)
+		ctx_commit_and_unlock(*ctx);
+}
+
 static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
 {
 	do {
@@ -2250,21 +2297,28 @@ static void tctx_task_work(struct callback_head *cb)
 						  task_work);
 
 	while (1) {
-		struct io_wq_work_node *node;
+		struct io_wq_work_node *node1, *node2;
 
-		if (!tctx->prior_task_list.first &&
-		    !tctx->task_list.first && locked)
+		if (!tctx->task_list.first &&
+		    !tctx->prior_task_list.first && locked)
 			io_submit_flush_completions(ctx);
 
 		spin_lock_irq(&tctx->task_lock);
-		node= wq_list_merge(&tctx->prior_task_list, &tctx->task_list);
-		if (!node)
+		node1 = tctx->prior_task_list.first;
+		node2 = tctx->task_list.first;
+		INIT_WQ_LIST(&tctx->task_list);
+		INIT_WQ_LIST(&tctx->prior_task_list);
+		if (!node2 && !node1)
 			tctx->task_running = false;
 		spin_unlock_irq(&tctx->task_lock);
-		if (!node)
+		if (!node2 && !node1)
 			break;
 
-		handle_tw_list(node, &ctx, &locked);
+		if (node1)
+			handle_prior_tw_list(node1, &ctx, &locked);
+
+		if (node2)
+			handle_tw_list(node2, &ctx, &locked);
 		cond_resched();
 	}
 
-- 
2.24.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v7] io_uring: batch completion in prior_task_list
  2021-11-27 15:24   ` [PATCH v7] " Hao Xu
@ 2021-11-28 15:28     ` Pavel Begunkov
  0 siblings, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2021-11-28 15:28 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 11/27/21 15:24, Hao Xu wrote:
> In previous patches, we have already gathered some tw with
> io_req_task_complete() as callback in prior_task_list, let's complete
> them in batch while we cannot grab uring lock. In this way, we batch
> the req_complete_post path.
> 
> Signed-off-by: Hao Xu <[email protected]>
> ---
> 
> v6->v7
> - use function pointer to reduce the if check everytime running a task
> work in handle_prior_tw_list()

Ifs are not a big problem, but retpolines may screw performance;
let's stay with v6.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/6] task work optimization
  2021-11-26 10:07 [PATCH v6 0/6] task work optimization Hao Xu
                   ` (6 preceding siblings ...)
  2021-11-26 13:37 ` [PATCH RESEND " Hao Xu
@ 2021-12-03  1:39 ` Pavel Begunkov
  2021-12-03  2:01   ` Pavel Begunkov
  2021-12-03  3:24   ` Hao Xu
  2021-12-04 20:58 ` Pavel Begunkov
  8 siblings, 2 replies; 22+ messages in thread
From: Pavel Begunkov @ 2021-12-03  1:39 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 11/26/21 10:07, Hao Xu wrote:
> v4->v5
> - change the implementation of merge_wq_list
> 
> v5->v6
> - change the logic of handling prior task list to:
>    1) grabbed uring_lock: leverage the inline completion infra
>    2) otherwise: batch __req_complete_post() calls to save
>       completion_lock operations.

Some testing for v6: first with only the first 5 patches applied (1-5),
then with all 6 (see 1-6).

modprobe null_blk no_sched=1 irqmode=1 completion_nsec=0 submit_queues=16 poll_queues=32 hw_queue_depth=128
echo 2 | sudo tee /sys/block/nullb0/queue/nomerges
echo 0 | sudo tee /sys/block/nullb0/queue/iostats
mitigations=off

added this to test non-sqpoll:

@@ -2840,7 +2840,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
                 return;
         req->result = res;
         req->io_task_work.func = io_req_task_complete;
-       io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
+       io_req_task_work_add(req, true);
  }

# 1-5, sqpoll=0
nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
IOPS=3238688, IOS/call=32/32, inflight=32 (32)
IOPS=3299776, IOS/call=32/32, inflight=32 (32)
IOPS=3328416, IOS/call=32/32, inflight=32 (32)
IOPS=3291488, IOS/call=32/32, inflight=32 (32)
IOPS=3284480, IOS/call=32/32, inflight=32 (32)
IOPS=3305248, IOS/call=32/32, inflight=32 (32)
IOPS=3275392, IOS/call=32/32, inflight=32 (32)
IOPS=3301376, IOS/call=32/32, inflight=32 (32)
IOPS=3287392, IOS/call=32/32, inflight=32 (32)

# 1-5, sqpoll=1
nice -n -20  ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
IOPS=2730752, IOS/call=2730752/2730752, inflight=32 (32)
IOPS=2822432, IOS/call=-1/-1, inflight=0 (32)
IOPS=2818464, IOS/call=-1/-1, inflight=32 (32)
IOPS=2802880, IOS/call=-1/-1, inflight=0 (32)
IOPS=2773440, IOS/call=-1/-1, inflight=32 (32)
IOPS=2827296, IOS/call=-1/-1, inflight=32 (32)
IOPS=2808320, IOS/call=-1/-1, inflight=32 (32)
IOPS=2793120, IOS/call=-1/-1, inflight=32 (32)
IOPS=2769632, IOS/call=-1/-1, inflight=32 (32)
IOPS=2752896, IOS/call=-1/-1, inflight=32 (32)

# 1-6, sqpoll=0
nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
IOPS=3219552, IOS/call=32/32, inflight=32 (32)
IOPS=3284128, IOS/call=32/32, inflight=32 (32)
IOPS=3305024, IOS/call=32/32, inflight=32 (32)
IOPS=3301920, IOS/call=32/32, inflight=32 (32)
IOPS=3330592, IOS/call=32/32, inflight=32 (32)
IOPS=3286496, IOS/call=32/32, inflight=32 (32)
IOPS=3236160, IOS/call=32/32, inflight=32 (32)
IOPS=3307552, IOS/call=32/32, inflight=32 (32)

# 1-6, sqpoll=1
nice -n -20  ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
IOPS=2777152, IOS/call=2777152/2777152, inflight=32 (32)
IOPS=2822080, IOS/call=-1/-1, inflight=32 (32)
IOPS=2785472, IOS/call=-1/-1, inflight=0 (32)
IOPS=2763360, IOS/call=-1/-1, inflight=0 (32)
IOPS=2789856, IOS/call=-1/-1, inflight=32 (32)
IOPS=2783296, IOS/call=-1/-1, inflight=32 (32)
IOPS=2786016, IOS/call=-1/-1, inflight=0 (32)
IOPS=2773760, IOS/call=-1/-1, inflight=32 (32)
IOPS=2745408, IOS/call=-1/-1, inflight=32 (32)
IOPS=2764352, IOS/call=-1/-1, inflight=32 (32)
IOPS=2766912, IOS/call=-1/-1, inflight=32 (32)
IOPS=2757216, IOS/call=-1/-1, inflight=32 (32)

So, no difference here, as expected: it just takes uring_lock as per
the v6 changes and goes through the old path. Then I added this to
compare the old vs new paths:

@@ -2283,7 +2283,7 @@ static void handle_prior_tw_list(struct io_wq_work_node *node, struct io_ring_ct
                         ctx_flush_and_put(*ctx, locked);
                         *ctx = req->ctx;
                         /* if not contended, grab and improve batching */
-                       *locked = mutex_trylock(&(*ctx)->uring_lock);
+                       // *locked = mutex_trylock(&(*ctx)->uring_lock);
                         percpu_ref_get(&(*ctx)->refs);
                         if (unlikely(!*locked))
                                 spin_lock(&(*ctx)->completion_lock);


# 1-6 + no trylock, sqpoll=0
nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
IOPS=3239040, IOS/call=32/32, inflight=32 (32)
IOPS=3244800, IOS/call=32/32, inflight=32 (32)
IOPS=3208544, IOS/call=32/32, inflight=32 (32)
IOPS=3264384, IOS/call=32/32, inflight=32 (32)
IOPS=3264000, IOS/call=32/32, inflight=32 (32)
IOPS=3296960, IOS/call=32/32, inflight=32 (32)
IOPS=3283424, IOS/call=32/32, inflight=32 (32)
IOPS=3284064, IOS/call=32/32, inflight=32 (32)
IOPS=3275232, IOS/call=32/32, inflight=32 (32)
IOPS=3261248, IOS/call=32/32, inflight=32 (32)
IOPS=3273792, IOS/call=32/32, inflight=32 (32)

#1-6 + no trylock, sqpoll=1
nice -n -20  ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
IOPS=2676736, IOS/call=2676736/2676736, inflight=32 (32)
IOPS=2639776, IOS/call=-1/-1, inflight=32 (32)
IOPS=2660000, IOS/call=-1/-1, inflight=32 (32)
IOPS=2639584, IOS/call=-1/-1, inflight=32 (32)
IOPS=2634592, IOS/call=-1/-1, inflight=0 (32)
IOPS=2611488, IOS/call=-1/-1, inflight=32 (32)
IOPS=2647360, IOS/call=-1/-1, inflight=32 (32)
IOPS=2630720, IOS/call=-1/-1, inflight=32 (32)
IOPS=2663200, IOS/call=-1/-1, inflight=32 (32)
IOPS=2694240, IOS/call=-1/-1, inflight=32 (32)
IOPS=2674592, IOS/call=-1/-1, inflight=32 (32)

Seems it goes down a little bit, but not much. Considering that
it's an optimisation for cases where there is no batching at all,
that's good.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/6] task work optimization
  2021-12-03  1:39 ` [PATCH v6 0/6] task work optimization Pavel Begunkov
@ 2021-12-03  2:01   ` Pavel Begunkov
  2021-12-03  7:30     ` Hao Xu
  2021-12-03  3:24   ` Hao Xu
  1 sibling, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2021-12-03  2:01 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 12/3/21 01:39, Pavel Begunkov wrote:
> On 11/26/21 10:07, Hao Xu wrote:
>> v4->v5
>> - change the implementation of merge_wq_list
>>
>> v5->v6
>> - change the logic of handling prior task list to:
>>    1) grabbed uring_lock: leverage the inline completion infra
>>    2) otherwise: batch __req_complete_post() calls to save
>>       completion_lock operations.
> 
> some testing for v6, first is taking first 5 patches (1-5), and
> then all 6 (see 1-6).
> 
> modprobe null_blk no_sched=1 irqmode=1 completion_nsec=0 submit_queues=16 poll_queues=32 hw_queue_depth=128
> echo 2 | sudo tee /sys/block/nullb0/queue/nomerges
> echo 0 | sudo tee /sys/block/nullb0/queue/iostats
> mitigations=off
> 
> added this to test non-sqpoll:
> 
> @@ -2840,7 +2840,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
>                  return;
>          req->result = res;
>          req->io_task_work.func = io_req_task_complete;
> -       io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
> +       io_req_task_work_add(req, true);
>   }
> 
> # 1-5, sqpoll=0
> nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
> IOPS=3238688, IOS/call=32/32, inflight=32 (32)
> IOPS=3299776, IOS/call=32/32, inflight=32 (32)
> IOPS=3328416, IOS/call=32/32, inflight=32 (32)
> IOPS=3291488, IOS/call=32/32, inflight=32 (32)
> IOPS=3284480, IOS/call=32/32, inflight=32 (32)
> IOPS=3305248, IOS/call=32/32, inflight=32 (32)
> IOPS=3275392, IOS/call=32/32, inflight=32 (32)
> IOPS=3301376, IOS/call=32/32, inflight=32 (32)
> IOPS=3287392, IOS/call=32/32, inflight=32 (32)
> 
> # 1-5, sqpoll=1
> nice -n -20  ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
> IOPS=2730752, IOS/call=2730752/2730752, inflight=32 (32)
> IOPS=2822432, IOS/call=-1/-1, inflight=0 (32)
> IOPS=2818464, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2802880, IOS/call=-1/-1, inflight=0 (32)
> IOPS=2773440, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2827296, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2808320, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2793120, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2769632, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2752896, IOS/call=-1/-1, inflight=32 (32)
> 
> # 1-6, sqpoll=0
> nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
> IOPS=3219552, IOS/call=32/32, inflight=32 (32)
> IOPS=3284128, IOS/call=32/32, inflight=32 (32)
> IOPS=3305024, IOS/call=32/32, inflight=32 (32)
> IOPS=3301920, IOS/call=32/32, inflight=32 (32)
> IOPS=3330592, IOS/call=32/32, inflight=32 (32)
> IOPS=3286496, IOS/call=32/32, inflight=32 (32)
> IOPS=3236160, IOS/call=32/32, inflight=32 (32)
> IOPS=3307552, IOS/call=32/32, inflight=32 (32)
> 
> # 1-6, sqpoll=1
> nice -n -20  ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
> IOPS=2777152, IOS/call=2777152/2777152, inflight=32 (32)
> IOPS=2822080, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2785472, IOS/call=-1/-1, inflight=0 (32)
> IOPS=2763360, IOS/call=-1/-1, inflight=0 (32)
> IOPS=2789856, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2783296, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2786016, IOS/call=-1/-1, inflight=0 (32)
> IOPS=2773760, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2745408, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2764352, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2766912, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2757216, IOS/call=-1/-1, inflight=32 (32)
> 
> So, no difference here as expected, it just takes uring_lock
> as per v6 changes and goes through the old path. Than I added
> this to compare old vs new paths:
> 
> @@ -2283,7 +2283,7 @@ static void handle_prior_tw_list(struct io_wq_work_node *node, struct io_ring_ct
>                          ctx_flush_and_put(*ctx, locked);
>                          *ctx = req->ctx;
>                          /* if not contended, grab and improve batching */
> -                       *locked = mutex_trylock(&(*ctx)->uring_lock);
> +                       // *locked = mutex_trylock(&(*ctx)->uring_lock);
>                          percpu_ref_get(&(*ctx)->refs);
>                          if (unlikely(!*locked))
>                                  spin_lock(&(*ctx)->completion_lock);
> 
> 
> # 1-6 + no trylock, sqpoll=0
> nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
> IOPS=3239040, IOS/call=32/32, inflight=32 (32)
> IOPS=3244800, IOS/call=32/32, inflight=32 (32)
> IOPS=3208544, IOS/call=32/32, inflight=32 (32)
> IOPS=3264384, IOS/call=32/32, inflight=32 (32)
> IOPS=3264000, IOS/call=32/32, inflight=32 (32)
> IOPS=3296960, IOS/call=32/32, inflight=32 (32)
> IOPS=3283424, IOS/call=32/32, inflight=32 (32)
> IOPS=3284064, IOS/call=32/32, inflight=32 (32)
> IOPS=3275232, IOS/call=32/32, inflight=32 (32)
> IOPS=3261248, IOS/call=32/32, inflight=32 (32)
> IOPS=3273792, IOS/call=32/32, inflight=32 (32)
> 
> #1-6 + no trylock, sqpoll=1
> nice -n -20  ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
> IOPS=2676736, IOS/call=2676736/2676736, inflight=32 (32)
> IOPS=2639776, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2660000, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2639584, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2634592, IOS/call=-1/-1, inflight=0 (32)
> IOPS=2611488, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2647360, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2630720, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2663200, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2694240, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2674592, IOS/call=-1/-1, inflight=32 (32)
> 
> Seems it goes a little bit down, but not much. Considering that
> it's an optimisation for cases where there is no batching at all,
> that's good.

But when running the liburing tests I'm getting the stuff below;
e.g. cq-overflow hits it every time. I double-checked that
I took the [RESEND] version of 6/6.

[   30.360370] BUG: scheduling while atomic: cq-overflow/2082/0x00000000
[   30.360520] Call Trace:
[   30.360523]  <TASK>
[   30.360527]  dump_stack_lvl+0x4c/0x63
[   30.360536]  dump_stack+0x10/0x12
[   30.360540]  __schedule_bug.cold+0x50/0x5e
[   30.360545]  __schedule+0x754/0x900
[   30.360551]  ? __io_cqring_overflow_flush+0xb6/0x200
[   30.360558]  schedule+0x55/0xd0
[   30.360563]  schedule_timeout+0xf8/0x140
[   30.360567]  ? prepare_to_wait_exclusive+0x58/0xa0
[   30.360573]  __x64_sys_io_uring_enter+0x69c/0x8e0
[   30.360578]  ? io_rsrc_buf_put+0x30/0x30
[   30.360582]  do_syscall_64+0x3b/0x80
[   30.360588]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   30.360592] RIP: 0033:0x7f9f9680118d
[   30.360618]  </TASK>
[   30.362295] BUG: scheduling while atomic: cq-overflow/2082/0x7ffffffe
[   30.362396] Call Trace:
[   30.362397]  <TASK>
[   30.362399]  dump_stack_lvl+0x4c/0x63
[   30.362406]  dump_stack+0x10/0x12
[   30.362409]  __schedule_bug.cold+0x50/0x5e
[   30.362413]  __schedule+0x754/0x900
[   30.362419]  schedule+0x55/0xd0
[   30.362423]  schedule_timeout+0xf8/0x140
[   30.362427]  ? prepare_to_wait_exclusive+0x58/0xa0
[   30.362431]  __x64_sys_io_uring_enter+0x69c/0x8e0
[   30.362437]  ? io_rsrc_buf_put+0x30/0x30
[   30.362440]  do_syscall_64+0x3b/0x80
[   30.362445]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   30.362449] RIP: 0033:0x7f9f9680118d
[   30.362470]  </TASK>
<repeated>


-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/6] task work optimization
  2021-12-03  1:39 ` [PATCH v6 0/6] task work optimization Pavel Begunkov
  2021-12-03  2:01   ` Pavel Begunkov
@ 2021-12-03  3:24   ` Hao Xu
  1 sibling, 0 replies; 22+ messages in thread
From: Hao Xu @ 2021-12-03  3:24 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/12/3 9:39 AM, Pavel Begunkov wrote:
> On 11/26/21 10:07, Hao Xu wrote:
>> v4->v5
>> - change the implementation of merge_wq_list
>>
>> v5->v6
>> - change the logic of handling prior task list to:
>>    1) grabbed uring_lock: leverage the inline completion infra
>>    2) otherwise: batch __req_complete_post() calls to save
>>       completion_lock operations.
> 
> some testing for v6, first is taking first 5 patches (1-5), and
> then all 6 (see 1-6).
> 
> modprobe null_blk no_sched=1 irqmode=1 completion_nsec=0 
> submit_queues=16 poll_queues=32 hw_queue_depth=128
> echo 2 | sudo tee /sys/block/nullb0/queue/nomerges
> echo 0 | sudo tee /sys/block/nullb0/queue/iostats
> mitigations=off
> 
> added this to test non-sqpoll:
> 
> @@ -2840,7 +2840,7 @@ static void io_complete_rw(struct kiocb *kiocb, 
> long res)
>                  return;
>          req->result = res;
>          req->io_task_work.func = io_req_task_complete;
> -       io_req_task_work_add(req, !!(req->ctx->flags & 
> IORING_SETUP_SQPOLL));
> +       io_req_task_work_add(req, true);
>   }
> 
> # 1-5, sqpoll=0
> nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 
> /dev/nullb0
> IOPS=3238688, IOS/call=32/32, inflight=32 (32)
> IOPS=3299776, IOS/call=32/32, inflight=32 (32)
> IOPS=3328416, IOS/call=32/32, inflight=32 (32)
> IOPS=3291488, IOS/call=32/32, inflight=32 (32)
> IOPS=3284480, IOS/call=32/32, inflight=32 (32)
> IOPS=3305248, IOS/call=32/32, inflight=32 (32)
> IOPS=3275392, IOS/call=32/32, inflight=32 (32)
> IOPS=3301376, IOS/call=32/32, inflight=32 (32)
> IOPS=3287392, IOS/call=32/32, inflight=32 (32)
> 
> # 1-5, sqpoll=1
> nice -n -20  ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
> IOPS=2730752, IOS/call=2730752/2730752, inflight=32 (32)
> IOPS=2822432, IOS/call=-1/-1, inflight=0 (32)
> IOPS=2818464, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2802880, IOS/call=-1/-1, inflight=0 (32)
> IOPS=2773440, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2827296, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2808320, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2793120, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2769632, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2752896, IOS/call=-1/-1, inflight=32 (32)
> 
> # 1-6, sqpoll=0
> nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 
> /dev/nullb0
> IOPS=3219552, IOS/call=32/32, inflight=32 (32)
> IOPS=3284128, IOS/call=32/32, inflight=32 (32)
> IOPS=3305024, IOS/call=32/32, inflight=32 (32)
> IOPS=3301920, IOS/call=32/32, inflight=32 (32)
> IOPS=3330592, IOS/call=32/32, inflight=32 (32)
> IOPS=3286496, IOS/call=32/32, inflight=32 (32)
> IOPS=3236160, IOS/call=32/32, inflight=32 (32)
> IOPS=3307552, IOS/call=32/32, inflight=32 (32)
> 
> # 1-6, sqpoll=1
> nice -n -20  ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
> IOPS=2777152, IOS/call=2777152/2777152, inflight=32 (32)
> IOPS=2822080, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2785472, IOS/call=-1/-1, inflight=0 (32)
> IOPS=2763360, IOS/call=-1/-1, inflight=0 (32)
> IOPS=2789856, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2783296, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2786016, IOS/call=-1/-1, inflight=0 (32)
> IOPS=2773760, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2745408, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2764352, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2766912, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2757216, IOS/call=-1/-1, inflight=32 (32)
> 
> So, no difference here as expected, it just takes uring_lock
> as per v6 changes and goes through the old path. Than I added
> this to compare old vs new paths:
> 
> @@ -2283,7 +2283,7 @@ static void handle_prior_tw_list(struct 
> io_wq_work_node *node, struct io_ring_ct
>                          ctx_flush_and_put(*ctx, locked);
>                          *ctx = req->ctx;
>                          /* if not contended, grab and improve batching */
> -                       *locked = mutex_trylock(&(*ctx)->uring_lock);
> +                       // *locked = mutex_trylock(&(*ctx)->uring_lock);
>                          percpu_ref_get(&(*ctx)->refs);
>                          if (unlikely(!*locked))
>                                  spin_lock(&(*ctx)->completion_lock);
> 
> 
> # 1-6 + no trylock, sqpoll=0
> nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 
> /dev/nullb0
> IOPS=3239040, IOS/call=32/32, inflight=32 (32)
> IOPS=3244800, IOS/call=32/32, inflight=32 (32)
> IOPS=3208544, IOS/call=32/32, inflight=32 (32)
> IOPS=3264384, IOS/call=32/32, inflight=32 (32)
> IOPS=3264000, IOS/call=32/32, inflight=32 (32)
> IOPS=3296960, IOS/call=32/32, inflight=32 (32)
> IOPS=3283424, IOS/call=32/32, inflight=32 (32)
> IOPS=3284064, IOS/call=32/32, inflight=32 (32)
> IOPS=3275232, IOS/call=32/32, inflight=32 (32)
> IOPS=3261248, IOS/call=32/32, inflight=32 (32)
> IOPS=3273792, IOS/call=32/32, inflight=32 (32)
> 
> #1-6 + no trylock, sqpoll=1
> nice -n -20  ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
> IOPS=2676736, IOS/call=2676736/2676736, inflight=32 (32)
> IOPS=2639776, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2660000, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2639584, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2634592, IOS/call=-1/-1, inflight=0 (32)
> IOPS=2611488, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2647360, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2630720, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2663200, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2694240, IOS/call=-1/-1, inflight=32 (32)
> IOPS=2674592, IOS/call=-1/-1, inflight=32 (32)
> 
> Seems it goes a little bit down, but not much. Considering that
> it's an optimisation for cases where there is no batching at all,
> that's good.
Nice, thanks for testing this; now it's clear that the inline completion
path is faster.

Regards,
Hao
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/6] task work optimization
  2021-12-03  2:01   ` Pavel Begunkov
@ 2021-12-03  7:30     ` Hao Xu
  2021-12-03 14:21       ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Hao Xu @ 2021-12-03  7:30 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/12/3 10:01 AM, Pavel Begunkov wrote:
> On 12/3/21 01:39, Pavel Begunkov wrote:
>> On 11/26/21 10:07, Hao Xu wrote:
>>> v4->v5
>>> - change the implementation of merge_wq_list
>>>
>>> v5->v6
>>> - change the logic of handling prior task list to:
>>>    1) grabbed uring_lock: leverage the inline completion infra
>>>    2) otherwise: batch __req_complete_post() calls to save
>>>       completion_lock operations.
>>
>> some testing for v6, first is taking first 5 patches (1-5), and
>> then all 6 (see 1-6).
>>
>> modprobe null_blk no_sched=1 irqmode=1 completion_nsec=0 
>> submit_queues=16 poll_queues=32 hw_queue_depth=128
>> echo 2 | sudo tee /sys/block/nullb0/queue/nomerges
>> echo 0 | sudo tee /sys/block/nullb0/queue/iostats
>> mitigations=off
>>
>> added this to test non-sqpoll:
>>
>> @@ -2840,7 +2840,7 @@ static void io_complete_rw(struct kiocb *kiocb, 
>> long res)
>>                  return;
>>          req->result = res;
>>          req->io_task_work.func = io_req_task_complete;
>> -       io_req_task_work_add(req, !!(req->ctx->flags & 
>> IORING_SETUP_SQPOLL));
>> +       io_req_task_work_add(req, true);
>>   }
>>
>> # 1-5, sqpoll=0
>> nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 
>> /dev/nullb0
>> IOPS=3238688, IOS/call=32/32, inflight=32 (32)
>> IOPS=3299776, IOS/call=32/32, inflight=32 (32)
>> IOPS=3328416, IOS/call=32/32, inflight=32 (32)
>> IOPS=3291488, IOS/call=32/32, inflight=32 (32)
>> IOPS=3284480, IOS/call=32/32, inflight=32 (32)
>> IOPS=3305248, IOS/call=32/32, inflight=32 (32)
>> IOPS=3275392, IOS/call=32/32, inflight=32 (32)
>> IOPS=3301376, IOS/call=32/32, inflight=32 (32)
>> IOPS=3287392, IOS/call=32/32, inflight=32 (32)
>>
>> # 1-5, sqpoll=1
>> nice -n -20  ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
>> IOPS=2730752, IOS/call=2730752/2730752, inflight=32 (32)
>> IOPS=2822432, IOS/call=-1/-1, inflight=0 (32)
>> IOPS=2818464, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2802880, IOS/call=-1/-1, inflight=0 (32)
>> IOPS=2773440, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2827296, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2808320, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2793120, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2769632, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2752896, IOS/call=-1/-1, inflight=32 (32)
>>
>> # 1-6, sqpoll=0
>> nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 
>> /dev/nullb0
>> IOPS=3219552, IOS/call=32/32, inflight=32 (32)
>> IOPS=3284128, IOS/call=32/32, inflight=32 (32)
>> IOPS=3305024, IOS/call=32/32, inflight=32 (32)
>> IOPS=3301920, IOS/call=32/32, inflight=32 (32)
>> IOPS=3330592, IOS/call=32/32, inflight=32 (32)
>> IOPS=3286496, IOS/call=32/32, inflight=32 (32)
>> IOPS=3236160, IOS/call=32/32, inflight=32 (32)
>> IOPS=3307552, IOS/call=32/32, inflight=32 (32)
>>
>> # 1-6, sqpoll=1
>> nice -n -20  ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
>> IOPS=2777152, IOS/call=2777152/2777152, inflight=32 (32)
>> IOPS=2822080, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2785472, IOS/call=-1/-1, inflight=0 (32)
>> IOPS=2763360, IOS/call=-1/-1, inflight=0 (32)
>> IOPS=2789856, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2783296, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2786016, IOS/call=-1/-1, inflight=0 (32)
>> IOPS=2773760, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2745408, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2764352, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2766912, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2757216, IOS/call=-1/-1, inflight=32 (32)
>>
>> So, no difference here as expected, it just takes uring_lock
>> as per v6 changes and goes through the old path. Than I added
>> this to compare old vs new paths:
>>
>> @@ -2283,7 +2283,7 @@ static void handle_prior_tw_list(struct 
>> io_wq_work_node *node, struct io_ring_ct
>>                          ctx_flush_and_put(*ctx, locked);
>>                          *ctx = req->ctx;
>>                          /* if not contended, grab and improve 
>> batching */
>> -                       *locked = mutex_trylock(&(*ctx)->uring_lock);
>> +                       // *locked = mutex_trylock(&(*ctx)->uring_lock);
>>                          percpu_ref_get(&(*ctx)->refs);
>>                          if (unlikely(!*locked))
>>                                  spin_lock(&(*ctx)->completion_lock);
>>
>>
>> # 1-6 + no trylock, sqpoll=0
>> nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 
>> /dev/nullb0
>> IOPS=3239040, IOS/call=32/32, inflight=32 (32)
>> IOPS=3244800, IOS/call=32/32, inflight=32 (32)
>> IOPS=3208544, IOS/call=32/32, inflight=32 (32)
>> IOPS=3264384, IOS/call=32/32, inflight=32 (32)
>> IOPS=3264000, IOS/call=32/32, inflight=32 (32)
>> IOPS=3296960, IOS/call=32/32, inflight=32 (32)
>> IOPS=3283424, IOS/call=32/32, inflight=32 (32)
>> IOPS=3284064, IOS/call=32/32, inflight=32 (32)
>> IOPS=3275232, IOS/call=32/32, inflight=32 (32)
>> IOPS=3261248, IOS/call=32/32, inflight=32 (32)
>> IOPS=3273792, IOS/call=32/32, inflight=32 (32)
>>
>> #1-6 + no trylock, sqpoll=1
>> nice -n -20  ./io_uring -d32 -s32 -c32 -p0 -B1 -F1 -b512 /dev/nullb0
>> IOPS=2676736, IOS/call=2676736/2676736, inflight=32 (32)
>> IOPS=2639776, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2660000, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2639584, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2634592, IOS/call=-1/-1, inflight=0 (32)
>> IOPS=2611488, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2647360, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2630720, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2663200, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2694240, IOS/call=-1/-1, inflight=32 (32)
>> IOPS=2674592, IOS/call=-1/-1, inflight=32 (32)
>>
>> Seems it goes a little bit down, but not much. Considering that
>> it's an optimisation for cases where there is no batching at all,
>> that's good.
> 
> But testing with liburing tests I'm getting the stuff below,
> e.g. cq-overflow hits it every time. Double checked that
> I took [RESEND] version of 6/6.
> 
> [   30.360370] BUG: scheduling while atomic: cq-overflow/2082/0x00000000
> [   30.360520] Call Trace:
> [   30.360523]  <TASK>
> [   30.360527]  dump_stack_lvl+0x4c/0x63
> [   30.360536]  dump_stack+0x10/0x12
> [   30.360540]  __schedule_bug.cold+0x50/0x5e
> [   30.360545]  __schedule+0x754/0x900
> [   30.360551]  ? __io_cqring_overflow_flush+0xb6/0x200
> [   30.360558]  schedule+0x55/0xd0
> [   30.360563]  schedule_timeout+0xf8/0x140
> [   30.360567]  ? prepare_to_wait_exclusive+0x58/0xa0
> [   30.360573]  __x64_sys_io_uring_enter+0x69c/0x8e0
> [   30.360578]  ? io_rsrc_buf_put+0x30/0x30
> [   30.360582]  do_syscall_64+0x3b/0x80
> [   30.360588]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   30.360592] RIP: 0033:0x7f9f9680118d
> [   30.360618]  </TASK>
> [   30.362295] BUG: scheduling while atomic: cq-overflow/2082/0x7ffffffe
> [   30.362396] Call Trace:
> [   30.362397]  <TASK>
> [   30.362399]  dump_stack_lvl+0x4c/0x63
> [   30.362406]  dump_stack+0x10/0x12
> [   30.362409]  __schedule_bug.cold+0x50/0x5e
> [   30.362413]  __schedule+0x754/0x900
> [   30.362419]  schedule+0x55/0xd0
> [   30.362423]  schedule_timeout+0xf8/0x140
> [   30.362427]  ? prepare_to_wait_exclusive+0x58/0xa0
> [   30.362431]  __x64_sys_io_uring_enter+0x69c/0x8e0
> [   30.362437]  ? io_rsrc_buf_put+0x30/0x30
> [   30.362440]  do_syscall_64+0x3b/0x80
> [   30.362445]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   30.362449] RIP: 0033:0x7f9f9680118d
> [   30.362470]  </TASK>
> <repeated>
> 
I cannot repro this; all the liburing tests work well on my side.
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/6] task work optimization
  2021-12-03  7:30     ` Hao Xu
@ 2021-12-03 14:21       ` Pavel Begunkov
  2021-12-05 15:02         ` Hao Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2021-12-03 14:21 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 12/3/21 07:30, Hao Xu wrote:
> On 2021/12/3 10:01 AM, Pavel Begunkov wrote:
>> On 12/3/21 01:39, Pavel Begunkov wrote:
>>> On 11/26/21 10:07, Hao Xu wrote:
>>>> v4->v5
>>>> - change the implementation of merge_wq_list
>>>>
[...]
>> But testing with liburing tests I'm getting the stuff below,
>> e.g. cq-overflow hits it every time. Double checked that
>> I took [RESEND] version of 6/6.
>>
>> [   30.360370] BUG: scheduling while atomic: cq-overflow/2082/0x00000000
>> [   30.360520] Call Trace:
>> [   30.360523]  <TASK>
>> [   30.360527]  dump_stack_lvl+0x4c/0x63
>> [   30.360536]  dump_stack+0x10/0x12
>> [   30.360540]  __schedule_bug.cold+0x50/0x5e
>> [   30.360545]  __schedule+0x754/0x900
>> [   30.360551]  ? __io_cqring_overflow_flush+0xb6/0x200
>> [   30.360558]  schedule+0x55/0xd0
>> [   30.360563]  schedule_timeout+0xf8/0x140
>> [   30.360567]  ? prepare_to_wait_exclusive+0x58/0xa0
>> [   30.360573]  __x64_sys_io_uring_enter+0x69c/0x8e0
>> [   30.360578]  ? io_rsrc_buf_put+0x30/0x30
>> [   30.360582]  do_syscall_64+0x3b/0x80
>> [   30.360588]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [   30.360592] RIP: 0033:0x7f9f9680118d
>> [   30.360618]  </TASK>
>> [   30.362295] BUG: scheduling while atomic: cq-overflow/2082/0x7ffffffe
>> [   30.362396] Call Trace:
>> [   30.362397]  <TASK>
>> [   30.362399]  dump_stack_lvl+0x4c/0x63
>> [   30.362406]  dump_stack+0x10/0x12
>> [   30.362409]  __schedule_bug.cold+0x50/0x5e
>> [   30.362413]  __schedule+0x754/0x900
>> [   30.362419]  schedule+0x55/0xd0
>> [   30.362423]  schedule_timeout+0xf8/0x140
>> [   30.362427]  ? prepare_to_wait_exclusive+0x58/0xa0
>> [   30.362431]  __x64_sys_io_uring_enter+0x69c/0x8e0
>> [   30.362437]  ? io_rsrc_buf_put+0x30/0x30
>> [   30.362440]  do_syscall_64+0x3b/0x80
>> [   30.362445]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [   30.362449] RIP: 0033:0x7f9f9680118d
>> [   30.362470]  </TASK>
>> <repeated>
>>
> cannot repro this, all the liburing tests work well on my side..

One problem: when on the first iteration tctx_task_work doesn't
have anything in prior_task_list, it goes to handle_tw_list(),
which sets up @ctx but leaves @locked=false (say there is
contention). Then on the second iteration it goes to
handle_prior_tw_list() with a non-NULL @ctx and @locked=false,
and tries to unlock a spinlock that was never locked.
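
Roughly, the sequence I have in mind (a minimal sketch against the
v6 helpers; node1/node2 are the two lists popped in tctx_task_work(),
and this only illustrates the suspicion, not the exact trace):

bool locked = false;
struct io_ring_ctx *ctx = NULL;

/* iteration 1: prior_task_list is empty */
handle_tw_list(node2, &ctx, &locked);
/*
 * sets ctx = req->ctx; mutex_trylock() on uring_lock fails under
 * contention, so locked stays false and ctx->completion_lock is
 * never taken
 */

/* iteration 2: prior_task_list now has entries */
handle_prior_tw_list(node1, &ctx, &locked);
/*
 * treats "!*locked && *ctx" as "we hold this ctx's completion_lock"
 * and calls ctx_commit_and_unlock(*ctx), i.e. spin_unlock() on a
 * completion_lock that was never locked, which would leave the
 * preempt count unbalanced
 */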

Not sure that's exactly the problem from the traces, but at
least a quick hack resetting the ctx at the beginning of
handle_prior_tw_list() heals it.

note: apart from the quick fix the diff below includes
a couple of lines to force it to go through the new path.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 66d119ac4424..3868123eef87 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2272,6 +2272,9 @@ static inline void ctx_commit_and_unlock(struct io_ring_ctx *ctx)
  static void handle_prior_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx,
                                  bool *locked)
  {
+       ctx_flush_and_put(*ctx, locked);
+       *ctx = NULL;
+
         do {
                 struct io_wq_work_node *next = node->next;
                 struct io_kiocb *req = container_of(node, struct io_kiocb,
@@ -2283,7 +2286,8 @@ static void handle_prior_tw_list(struct io_wq_work_node *node, struct io_ring_ct
                         ctx_flush_and_put(*ctx, locked);
                         *ctx = req->ctx;
                         /* if not contended, grab and improve batching */
-                       *locked = mutex_trylock(&(*ctx)->uring_lock);
+                       *locked = false;
+                       // *locked = mutex_trylock(&(*ctx)->uring_lock);
                         percpu_ref_get(&(*ctx)->refs);
                         if (unlikely(!*locked))
                                 spin_lock(&(*ctx)->completion_lock);
@@ -2840,7 +2844,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
                 return;
         req->result = res;
         req->io_task_work.func = io_req_task_complete;
-       io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
+       io_req_task_work_add(req, true);
  }



-- 
Pavel Begunkov

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/6] task work optimization
  2021-11-26 10:07 [PATCH v6 0/6] task work optimization Hao Xu
                   ` (7 preceding siblings ...)
  2021-12-03  1:39 ` [PATCH v6 0/6] task work optimization Pavel Begunkov
@ 2021-12-04 20:58 ` Pavel Begunkov
  2021-12-05 15:11   ` Hao Xu
  8 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2021-12-04 20:58 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 11/26/21 10:07, Hao Xu wrote:
> v4->v5
> - change the implementation of merge_wq_list
> 
> v5->v6
> - change the logic of handling prior task list to:
>    1) grabbed uring_lock: leverage the inline completion infra
>    2) otherwise: batch __req_complete_post() calls to save
>       completion_lock operations.

FYI, took 5/6 into another patchset to avoid conflicts.

Also, you can remove "[pavel: ...]" from the patches; I was just
leaving them as a hint that the patches were very slightly
modified. I'll retest/etc. once you fix 6/6. Hopefully this will
be merged soon; the patches have already been bouncing around
for too long.


> Hao Xu (6):
>    io-wq: add helper to merge two wq_lists
>    io_uring: add a priority tw list for irq completion work
>    io_uring: add helper for task work execution code
>    io_uring: split io_req_complete_post() and add a helper
>    io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
>    io_uring: batch completion in prior_task_list
> 
>   fs/io-wq.h    |  22 +++++++
>   fs/io_uring.c | 168 ++++++++++++++++++++++++++++++++++----------------
>   2 files changed, 136 insertions(+), 54 deletions(-)
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/6] task work optimization
  2021-12-03 14:21       ` Pavel Begunkov
@ 2021-12-05 15:02         ` Hao Xu
  2021-12-05 15:42           ` Pavel Begunkov
  0 siblings, 1 reply; 22+ messages in thread
From: Hao Xu @ 2021-12-05 15:02 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/12/3 10:21 PM, Pavel Begunkov wrote:
> On 12/3/21 07:30, Hao Xu wrote:
>> On 2021/12/3 10:01 AM, Pavel Begunkov wrote:
>>> On 12/3/21 01:39, Pavel Begunkov wrote:
>>>> On 11/26/21 10:07, Hao Xu wrote:
>>>>> v4->v5
>>>>> - change the implementation of merge_wq_list
>>>>>
> [...]
>>> But testing with liburing tests I'm getting the stuff below,
>>> e.g. cq-overflow hits it every time. Double checked that
>>> I took [RESEND] version of 6/6.
>>>
>>> [   30.360370] BUG: scheduling while atomic: cq-overflow/2082/0x00000000
>>> [   30.360520] Call Trace:
>>> [   30.360523]  <TASK>
>>> [   30.360527]  dump_stack_lvl+0x4c/0x63
>>> [   30.360536]  dump_stack+0x10/0x12
>>> [   30.360540]  __schedule_bug.cold+0x50/0x5e
>>> [   30.360545]  __schedule+0x754/0x900
>>> [   30.360551]  ? __io_cqring_overflow_flush+0xb6/0x200
>>> [   30.360558]  schedule+0x55/0xd0
>>> [   30.360563]  schedule_timeout+0xf8/0x140
>>> [   30.360567]  ? prepare_to_wait_exclusive+0x58/0xa0
>>> [   30.360573]  __x64_sys_io_uring_enter+0x69c/0x8e0
>>> [   30.360578]  ? io_rsrc_buf_put+0x30/0x30
>>> [   30.360582]  do_syscall_64+0x3b/0x80
>>> [   30.360588]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [   30.360592] RIP: 0033:0x7f9f9680118d
>>> [   30.360618]  </TASK>
>>> [   30.362295] BUG: scheduling while atomic: cq-overflow/2082/0x7ffffffe
>>> [   30.362396] Call Trace:
>>> [   30.362397]  <TASK>
>>> [   30.362399]  dump_stack_lvl+0x4c/0x63
>>> [   30.362406]  dump_stack+0x10/0x12
>>> [   30.362409]  __schedule_bug.cold+0x50/0x5e
>>> [   30.362413]  __schedule+0x754/0x900
>>> [   30.362419]  schedule+0x55/0xd0
>>> [   30.362423]  schedule_timeout+0xf8/0x140
>>> [   30.362427]  ? prepare_to_wait_exclusive+0x58/0xa0
>>> [   30.362431]  __x64_sys_io_uring_enter+0x69c/0x8e0
>>> [   30.362437]  ? io_rsrc_buf_put+0x30/0x30
>>> [   30.362440]  do_syscall_64+0x3b/0x80
>>> [   30.362445]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [   30.362449] RIP: 0033:0x7f9f9680118d
>>> [   30.362470]  </TASK>
>>> <repeated>
>>>
>> cannot repro this, all the liburing tests work well on my side..
> 
> One problem is when on the first iteration tctx_task_work doen't
> have anything in prior_task_list, it goes to handle_tw_list(),
> which sets up @ctx but leaves @locked=false (say there is
> contention). And then on the second iteration it goes to
> handle_prior_tw_list() with non-NULL @ctx and @locked=false,
> and tries to unlock not locked spin.
> 
> Not sure that's the exactly the problem from traces, but at
> least a quick hack resetting the ctx at the beginning of
> handle_prior_tw_list() heals it.
Good catch, thanks.
> 
> note: apart from the quick fix the diff below includes
> a couple of lines to force it to go through the new path.
> 
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 66d119ac4424..3868123eef87 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2272,6 +2272,9 @@ static inline void ctx_commit_and_unlock(struct 
> io_ring_ctx *ctx)
>   static void handle_prior_tw_list(struct io_wq_work_node *node, struct 
> io_ring_ctx **ctx,
>                                   bool *locked)
>   {
> +       ctx_flush_and_put(*ctx, locked);
> +       *ctx = NULL;
> +
>          do {
>                  struct io_wq_work_node *next = node->next;
>                  struct io_kiocb *req = container_of(node, struct io_kiocb,
> @@ -2283,7 +2286,8 @@ static void handle_prior_tw_list(struct 
> io_wq_work_node *node, struct io_ring_ct
>                          ctx_flush_and_put(*ctx, locked);
>                          *ctx = req->ctx;
>                          /* if not contended, grab and improve batching */
> -                       *locked = mutex_trylock(&(*ctx)->uring_lock);
> +                       *locked = false;
> +                       // *locked = mutex_trylock(&(*ctx)->uring_lock);
I believe this one is your debug code which I shouldn't take, should I?
>                          percpu_ref_get(&(*ctx)->refs);
>                          if (unlikely(!*locked))
>                                  spin_lock(&(*ctx)->completion_lock);
> @@ -2840,7 +2844,7 @@ static void io_complete_rw(struct kiocb *kiocb, 
> long res)
>                  return;
>          req->result = res;
>          req->io_task_work.func = io_req_task_complete;
> -       io_req_task_work_add(req, !!(req->ctx->flags & 
> IORING_SETUP_SQPOLL));
> +       io_req_task_work_add(req, true);
>   }
> 
> 
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/6] task work optimization
  2021-12-04 20:58 ` Pavel Begunkov
@ 2021-12-05 15:11   ` Hao Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Hao Xu @ 2021-12-05 15:11 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/12/5 4:58 AM, Pavel Begunkov wrote:
> On 11/26/21 10:07, Hao Xu wrote:
>> v4->v5
>> - change the implementation of merge_wq_list
>>
>> v5->v6
>> - change the logic of handling prior task list to:
>>    1) grabbed uring_lock: leverage the inline completion infra
>>    2) otherwise: batch __req_complete_post() calls to save
>>       completion_lock operations.
> 
> FYI, took 5/6 into another patchset to avoid conflicts.
> 
> also, you can remove "[pavel: ...]" from patches, I was just
> leaving them as a hint that the patches were very slightly
> modified. I'll retest/etc. once you fix 6/6. Hopefully, will
> be merged soon, the patches already had been bouncing around
> for too long.
Gotcha.
> 
> 
>> Hao Xu (6):
>>    io-wq: add helper to merge two wq_lists
>>    io_uring: add a priority tw list for irq completion work
>>    io_uring: add helper for task work execution code
>>    io_uring: split io_req_complete_post() and add a helper
>>    io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
>>    io_uring: batch completion in prior_task_list
>>
>>   fs/io-wq.h    |  22 +++++++
>>   fs/io_uring.c | 168 ++++++++++++++++++++++++++++++++++----------------
>>   2 files changed, 136 insertions(+), 54 deletions(-)
>>
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/6] task work optimization
  2021-12-05 15:02         ` Hao Xu
@ 2021-12-05 15:42           ` Pavel Begunkov
  2021-12-06  8:35             ` Hao Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Pavel Begunkov @ 2021-12-05 15:42 UTC (permalink / raw)
  To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi

On 12/5/21 15:02, Hao Xu wrote:
> On 2021/12/3 10:21 PM, Pavel Begunkov wrote:
>> On 12/3/21 07:30, Hao Xu wrote:
>>> On 2021/12/3 10:01 AM, Pavel Begunkov wrote:
>>>> On 12/3/21 01:39, Pavel Begunkov wrote:
>>>>> On 11/26/21 10:07, Hao Xu wrote:
>>>>>> v4->v5
>>>>>> - change the implementation of merge_wq_list
>>>>>>
>> [...]
>>>> But testing with liburing tests I'm getting the stuff below,
>>>> e.g. cq-overflow hits it every time. Double checked that
>>>> I took [RESEND] version of 6/6.
>>>>
>>>> [   30.360370] BUG: scheduling while atomic: cq-overflow/2082/0x00000000
>>>> [   30.360520] Call Trace:
>>>> [   30.360523]  <TASK>
>>>> [   30.360527]  dump_stack_lvl+0x4c/0x63
>>>> [   30.360536]  dump_stack+0x10/0x12
>>>> [   30.360540]  __schedule_bug.cold+0x50/0x5e
>>>> [   30.360545]  __schedule+0x754/0x900
>>>> [   30.360551]  ? __io_cqring_overflow_flush+0xb6/0x200
>>>> [   30.360558]  schedule+0x55/0xd0
>>>> [   30.360563]  schedule_timeout+0xf8/0x140
>>>> [   30.360567]  ? prepare_to_wait_exclusive+0x58/0xa0
>>>> [   30.360573]  __x64_sys_io_uring_enter+0x69c/0x8e0
>>>> [   30.360578]  ? io_rsrc_buf_put+0x30/0x30
>>>> [   30.360582]  do_syscall_64+0x3b/0x80
>>>> [   30.360588]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> [   30.360592] RIP: 0033:0x7f9f9680118d
>>>> [   30.360618]  </TASK>
>>>> [   30.362295] BUG: scheduling while atomic: cq-overflow/2082/0x7ffffffe
>>>> [   30.362396] Call Trace:
>>>> [   30.362397]  <TASK>
>>>> [   30.362399]  dump_stack_lvl+0x4c/0x63
>>>> [   30.362406]  dump_stack+0x10/0x12
>>>> [   30.362409]  __schedule_bug.cold+0x50/0x5e
>>>> [   30.362413]  __schedule+0x754/0x900
>>>> [   30.362419]  schedule+0x55/0xd0
>>>> [   30.362423]  schedule_timeout+0xf8/0x140
>>>> [   30.362427]  ? prepare_to_wait_exclusive+0x58/0xa0
>>>> [   30.362431]  __x64_sys_io_uring_enter+0x69c/0x8e0
>>>> [   30.362437]  ? io_rsrc_buf_put+0x30/0x30
>>>> [   30.362440]  do_syscall_64+0x3b/0x80
>>>> [   30.362445]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> [   30.362449] RIP: 0033:0x7f9f9680118d
>>>> [   30.362470]  </TASK>
>>>> <repeated>
>>>>
>>> cannot repro this, all the liburing tests work well on my side..
>>
>> One problem is when on the first iteration tctx_task_work doen't
>> have anything in prior_task_list, it goes to handle_tw_list(),
>> which sets up @ctx but leaves @locked=false (say there is
>> contention). And then on the second iteration it goes to
>> handle_prior_tw_list() with non-NULL @ctx and @locked=false,
>> and tries to unlock not locked spin.
>>
>> Not sure that's the exactly the problem from traces, but at
>> least a quick hack resetting the ctx at the beginning of
>> handle_prior_tw_list() heals it.
> Good catch, thanks.
>>
>> note: apart from the quick fix the diff below includes
>> a couple of lines to force it to go through the new path.
>>
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 66d119ac4424..3868123eef87 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -2272,6 +2272,9 @@ static inline void ctx_commit_and_unlock(struct io_ring_ctx *ctx)
>>   static void handle_prior_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx,
>>                                   bool *locked)
>>   {
>> +       ctx_flush_and_put(*ctx, locked);
>> +       *ctx = NULL;
>> +
>>          do {
>>                  struct io_wq_work_node *next = node->next;
>>                  struct io_kiocb *req = container_of(node, struct io_kiocb,
>> @@ -2283,7 +2286,8 @@ static void handle_prior_tw_list(struct io_wq_work_node *node, struct io_ring_ct
>>                          ctx_flush_and_put(*ctx, locked);
>>                          *ctx = req->ctx;
>>                          /* if not contended, grab and improve batching */
>> -                       *locked = mutex_trylock(&(*ctx)->uring_lock);
>> +                       *locked = false;
>> +                       // *locked = mutex_trylock(&(*ctx)->uring_lock);
> I believe this one is your debug code which I shouldn't take, should I?

Right, that's just for debug; it helped to catch the issue. FWIW,
ctx_flush_and_put() doesn't seem like a good solution, but it was
good enough to verify my assumptions.

>>                          percpu_ref_get(&(*ctx)->refs);
>>                          if (unlikely(!*locked))
>>                                  spin_lock(&(*ctx)->completion_lock);
>> @@ -2840,7 +2844,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
>>                  return;
>>          req->result = res;
>>          req->io_task_work.func = io_req_task_complete;
>> -       io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));
>> +       io_req_task_work_add(req, true);
>>   }
>>
>>
>>
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/6] task work optimization
  2021-12-05 15:42           ` Pavel Begunkov
@ 2021-12-06  8:35             ` Hao Xu
  2021-12-06  9:48               ` Hao Xu
  0 siblings, 1 reply; 22+ messages in thread
From: Hao Xu @ 2021-12-06  8:35 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/12/5 11:42 PM, Pavel Begunkov wrote:
> On 12/5/21 15:02, Hao Xu wrote:
>> On 2021/12/3 10:21 PM, Pavel Begunkov wrote:
>>> On 12/3/21 07:30, Hao Xu wrote:
>>>> On 2021/12/3 10:01 AM, Pavel Begunkov wrote:
>>>>> On 12/3/21 01:39, Pavel Begunkov wrote:
>>>>>> On 11/26/21 10:07, Hao Xu wrote:
>>>>>>> v4->v5
>>>>>>> - change the implementation of merge_wq_list
>>>>>>>
>>> [...]
>>>>> But testing with liburing tests I'm getting the stuff below,
>>>>> e.g. cq-overflow hits it every time. Double checked that
>>>>> I took [RESEND] version of 6/6.
>>>>>
>>>>> [   30.360370] BUG: scheduling while atomic: 
>>>>> cq-overflow/2082/0x00000000
>>>>> [   30.360520] Call Trace:
>>>>> [   30.360523]  <TASK>
>>>>> [   30.360527]  dump_stack_lvl+0x4c/0x63
>>>>> [   30.360536]  dump_stack+0x10/0x12
>>>>> [   30.360540]  __schedule_bug.cold+0x50/0x5e
>>>>> [   30.360545]  __schedule+0x754/0x900
>>>>> [   30.360551]  ? __io_cqring_overflow_flush+0xb6/0x200
>>>>> [   30.360558]  schedule+0x55/0xd0
>>>>> [   30.360563]  schedule_timeout+0xf8/0x140
>>>>> [   30.360567]  ? prepare_to_wait_exclusive+0x58/0xa0
>>>>> [   30.360573]  __x64_sys_io_uring_enter+0x69c/0x8e0
>>>>> [   30.360578]  ? io_rsrc_buf_put+0x30/0x30
>>>>> [   30.360582]  do_syscall_64+0x3b/0x80
>>>>> [   30.360588]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> [   30.360592] RIP: 0033:0x7f9f9680118d
>>>>> [   30.360618]  </TASK>
>>>>> [   30.362295] BUG: scheduling while atomic: 
>>>>> cq-overflow/2082/0x7ffffffe
>>>>> [   30.362396] Call Trace:
>>>>> [   30.362397]  <TASK>
>>>>> [   30.362399]  dump_stack_lvl+0x4c/0x63
>>>>> [   30.362406]  dump_stack+0x10/0x12
>>>>> [   30.362409]  __schedule_bug.cold+0x50/0x5e
>>>>> [   30.362413]  __schedule+0x754/0x900
>>>>> [   30.362419]  schedule+0x55/0xd0
>>>>> [   30.362423]  schedule_timeout+0xf8/0x140
>>>>> [   30.362427]  ? prepare_to_wait_exclusive+0x58/0xa0
>>>>> [   30.362431]  __x64_sys_io_uring_enter+0x69c/0x8e0
>>>>> [   30.362437]  ? io_rsrc_buf_put+0x30/0x30
>>>>> [   30.362440]  do_syscall_64+0x3b/0x80
>>>>> [   30.362445]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> [   30.362449] RIP: 0033:0x7f9f9680118d
>>>>> [   30.362470]  </TASK>
>>>>> <repeated>
>>>>>
>>>> cannot repro this, all the liburing tests work well on my side..
>>>
>>> One problem is when on the first iteration tctx_task_work doen't
>>> have anything in prior_task_list, it goes to handle_tw_list(),
>>> which sets up @ctx but leaves @locked=false (say there is
>>> contention). And then on the second iteration it goes to
>>> handle_prior_tw_list() with non-NULL @ctx and @locked=false,
>>> and tries to unlock not locked spin.
>>>
>>> Not sure that's the exactly the problem from traces, but at
>>> least a quick hack resetting the ctx at the beginning of
>>> handle_prior_tw_list() heals it.
>> Good catch, thanks.
>>>
>>> note: apart from the quick fix the diff below includes
>>> a couple of lines to force it to go through the new path.
>>>
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 66d119ac4424..3868123eef87 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -2272,6 +2272,9 @@ static inline void ctx_commit_and_unlock(struct 
>>> io_ring_ctx *ctx)
>>>   static void handle_prior_tw_list(struct io_wq_work_node *node, 
>>> struct io_ring_ctx **ctx,
>>>                                   bool *locked)
>>>   {
>>> +       ctx_flush_and_put(*ctx, locked);
>>> +       *ctx = NULL;
>>> +
>>>          do {
>>>                  struct io_wq_work_node *next = node->next;
>>>                  struct io_kiocb *req = container_of(node, struct 
>>> io_kiocb,
>>> @@ -2283,7 +2286,8 @@ static void handle_prior_tw_list(struct 
>>> io_wq_work_node *node, struct io_ring_ct
>>>                          ctx_flush_and_put(*ctx, locked);
>>>                          *ctx = req->ctx;
>>>                          /* if not contended, grab and improve 
>>> batching */
>>> -                       *locked = mutex_trylock(&(*ctx)->uring_lock);
>>> +                       *locked = false;
>>> +                       // *locked = mutex_trylock(&(*ctx)->uring_lock);
>> I believe this one is your debug code which I shouldn't take, should I?
> 
> Right, just for debug, helped to catch the issue. FWIW, it doesn't seem
> ctx_flush_and_put() is a good solution but was good enough to verify
> my assumptions.
How about a new compl_lock variable to indicate the completion_lock
state, which will make the complete_post() batching as large as possible.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5771489a980d..e17892183f82 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2231,36 +2231,34 @@ static inline void ctx_commit_and_unlock(struct io_ring_ctx *ctx)
  }

  static void handle_prior_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx,
-                                bool *locked)
+                                bool *uring_locked, bool *compl_locked)
  {
-       ctx_flush_and_put(*ctx, locked);
-       *ctx = NULL;
-
         do {
                 struct io_wq_work_node *next = node->next;
                 struct io_kiocb *req = container_of(node, struct io_kiocb,
                                                    io_task_work.node);

                 if (req->ctx != *ctx) {
-                       if (unlikely(!*locked) && *ctx)
+                       if (unlikely(*compl_locked)) {
                                 ctx_commit_and_unlock(*ctx);
-                       ctx_flush_and_put(*ctx, locked);
+                               *compl_locked = false;
+                       }
+                       ctx_flush_and_put(*ctx, uring_locked);
                         *ctx = req->ctx;
                         /* if not contended, grab and improve batching */
-                       *locked = mutex_trylock(&(*ctx)->uring_lock);
+                       *uring_locked = mutex_trylock(&(*ctx)->uring_lock);
                         percpu_ref_get(&(*ctx)->refs);
-                       if (unlikely(!*locked))
+                       if (unlikely(!*uring_locked)) {
                                 spin_lock(&(*ctx)->completion_lock);
+                               *compl_locked = true;
+                       }
                 }
-               if (likely(*locked))
-                       req->io_task_work.func(req, locked);
+               if (likely(*uring_locked))
+                       req->io_task_work.func(req, uring_locked);
                 else
                        __io_req_complete_post(req, req->result, io_put_rw_kbuf(req));
                 node = next;
         } while (node);
-
-       if (unlikely(!*locked) && *ctx)
-               ctx_commit_and_unlock(*ctx);
  }

  static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ctx, bool *locked)
@@ -2284,7 +2282,7 @@ static void handle_tw_list(struct io_wq_work_node *node, struct io_ring_ctx **ct

  static void tctx_task_work(struct callback_head *cb)
  {
-       bool locked = false;
+       bool uring_locked = false, compl_locked = false;
         struct io_ring_ctx *ctx = NULL;
         struct io_uring_task *tctx = container_of(cb, struct io_uring_task,
                                                  task_work);
@@ -2293,7 +2291,7 @@ static void tctx_task_work(struct callback_head *cb)
                 struct io_wq_work_node *node1, *node2;

                 if (!tctx->task_list.first &&
-                   !tctx->prior_task_list.first && locked)
+                   !tctx->prior_task_list.first && uring_locked)
                         io_submit_flush_completions(ctx);

                 spin_lock_irq(&tctx->task_lock);
@@ -2308,14 +2306,18 @@ static void tctx_task_work(struct callback_head *cb)
                         break;

                 if (node1)
-                       handle_prior_tw_list(node1, &ctx, &locked);
+                       handle_prior_tw_list(node1, &ctx, &uring_locked, &compl_locked);

                 if (node2)
-                       handle_tw_list(node2, &ctx, &locked);
+                       handle_tw_list(node2, &ctx, &uring_locked);
                 cond_resched();
         }

-       ctx_flush_and_put(ctx, &locked);
+       if (unlikely(compl_locked)) {
+               ctx_commit_and_unlock(ctx);
+               compl_locked = false; // this may not be needed
+       }
+       ctx_flush_and_put(ctx, &uring_locked);
  }

  static void io_req_task_work_add(struct io_kiocb *req, bool priority)
@@ -2804,7 +2806,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
                 return;
         req->result = res;
         req->io_task_work.func = io_req_task_complete;
-       io_req_task_work_add(req, true);
+       io_req_task_work_add(req, !!(req->ctx->flags & 
IORING_SETUP_SQPOLL));
  }

  static void io_complete_rw_iopoll(struct kiocb *kiocb, long res)

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 0/6] task work optimization
  2021-12-06  8:35             ` Hao Xu
@ 2021-12-06  9:48               ` Hao Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Hao Xu @ 2021-12-06  9:48 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi

On 2021/12/6 4:35 PM, Hao Xu wrote:
> On 2021/12/5 11:42 PM, Pavel Begunkov wrote:
>> On 12/5/21 15:02, Hao Xu wrote:
>>> On 2021/12/3 10:21 PM, Pavel Begunkov wrote:
>>>> On 12/3/21 07:30, Hao Xu wrote:
>>>>> On 2021/12/3 10:01 AM, Pavel Begunkov wrote:
>>>>>> On 12/3/21 01:39, Pavel Begunkov wrote:
>>>>>>> On 11/26/21 10:07, Hao Xu wrote:
>>>>>>>> v4->v5
>>>>>>>> - change the implementation of merge_wq_list
>>>>>>>>
>>>> [...]
>>>>>> But testing with liburing tests I'm getting the stuff below,
>>>>>> e.g. cq-overflow hits it every time. Double checked that
>>>>>> I took [RESEND] version of 6/6.
>>>>>>
>>>>>> [   30.360370] BUG: scheduling while atomic: 
>>>>>> cq-overflow/2082/0x00000000
>>>>>> [   30.360520] Call Trace:
>>>>>> [   30.360523]  <TASK>
>>>>>> [   30.360527]  dump_stack_lvl+0x4c/0x63
>>>>>> [   30.360536]  dump_stack+0x10/0x12
>>>>>> [   30.360540]  __schedule_bug.cold+0x50/0x5e
>>>>>> [   30.360545]  __schedule+0x754/0x900
>>>>>> [   30.360551]  ? __io_cqring_overflow_flush+0xb6/0x200
>>>>>> [   30.360558]  schedule+0x55/0xd0
>>>>>> [   30.360563]  schedule_timeout+0xf8/0x140
>>>>>> [   30.360567]  ? prepare_to_wait_exclusive+0x58/0xa0
>>>>>> [   30.360573]  __x64_sys_io_uring_enter+0x69c/0x8e0
>>>>>> [   30.360578]  ? io_rsrc_buf_put+0x30/0x30
>>>>>> [   30.360582]  do_syscall_64+0x3b/0x80
>>>>>> [   30.360588]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>> [   30.360592] RIP: 0033:0x7f9f9680118d
>>>>>> [   30.360618]  </TASK>
>>>>>> [   30.362295] BUG: scheduling while atomic: 
>>>>>> cq-overflow/2082/0x7ffffffe
>>>>>> [   30.362396] Call Trace:
>>>>>> [   30.362397]  <TASK>
>>>>>> [   30.362399]  dump_stack_lvl+0x4c/0x63
>>>>>> [   30.362406]  dump_stack+0x10/0x12
>>>>>> [   30.362409]  __schedule_bug.cold+0x50/0x5e
>>>>>> [   30.362413]  __schedule+0x754/0x900
>>>>>> [   30.362419]  schedule+0x55/0xd0
>>>>>> [   30.362423]  schedule_timeout+0xf8/0x140
>>>>>> [   30.362427]  ? prepare_to_wait_exclusive+0x58/0xa0
>>>>>> [   30.362431]  __x64_sys_io_uring_enter+0x69c/0x8e0
>>>>>> [   30.362437]  ? io_rsrc_buf_put+0x30/0x30
>>>>>> [   30.362440]  do_syscall_64+0x3b/0x80
>>>>>> [   30.362445]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>> [   30.362449] RIP: 0033:0x7f9f9680118d
>>>>>> [   30.362470]  </TASK>
>>>>>> <repeated>
>>>>>>
>>>>> cannot repro this, all the liburing tests work well on my side..
>>>>
>>>> One problem is when on the first iteration tctx_task_work doen't
>>>> have anything in prior_task_list, it goes to handle_tw_list(),
>>>> which sets up @ctx but leaves @locked=false (say there is
>>>> contention). And then on the second iteration it goes to
>>>> handle_prior_tw_list() with non-NULL @ctx and @locked=false,
>>>> and tries to unlock not locked spin.
>>>>
>>>> Not sure that's the exactly the problem from traces, but at
>>>> least a quick hack resetting the ctx at the beginning of
>>>> handle_prior_tw_list() heals it.
>>> Good catch, thanks.
>>>>
>>>> note: apart from the quick fix the diff below includes
>>>> a couple of lines to force it to go through the new path.
>>>>
>>>>
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index 66d119ac4424..3868123eef87 100644
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -2272,6 +2272,9 @@ static inline void 
>>>> ctx_commit_and_unlock(struct io_ring_ctx *ctx)
>>>>   static void handle_prior_tw_list(struct io_wq_work_node *node, 
>>>> struct io_ring_ctx **ctx,
>>>>                                   bool *locked)
>>>>   {
>>>> +       ctx_flush_and_put(*ctx, locked);
>>>> +       *ctx = NULL;
>>>> +
>>>>          do {
>>>>                  struct io_wq_work_node *next = node->next;
>>>>                  struct io_kiocb *req = container_of(node, struct 
>>>> io_kiocb,
>>>> @@ -2283,7 +2286,8 @@ static void handle_prior_tw_list(struct 
>>>> io_wq_work_node *node, struct io_ring_ct
>>>>                          ctx_flush_and_put(*ctx, locked);
>>>>                          *ctx = req->ctx;
>>>>                          /* if not contended, grab and improve 
>>>> batching */
>>>> -                       *locked = mutex_trylock(&(*ctx)->uring_lock);
>>>> +                       *locked = false;
>>>> +                       // *locked = 
>>>> mutex_trylock(&(*ctx)->uring_lock);
>>> I believe this one is your debug code which I shouldn't take, should I?
>>
>> Right, just for debug, helped to catch the issue. FWIW, it doesn't seem
>> ctx_flush_and_put() is a good solution but was good enough to verify
>> my assumptions.
> How about a new compl_lock variable to indicate the completion_lock
> state, which will make the complete_post() batching as large as possible.
> 
Forgot to add the compl_locked handling in handle_tw_list(), but
anyway I now think it may not be a good idea to let completion_lock
cross handle_prior_tw_list() and handle_tw_list(), since that may
delay committing the completions even though it scales up the
batching.
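
To spell out the delay I mean, a rough sketch of one tctx_task_work()
round as it would look with the diff above applied (illustrative
only, reusing its names):

/* inside the while loop: */
if (node1)
        handle_prior_tw_list(node1, &ctx, &uring_locked, &compl_locked);
        /* CQEs queued under ->completion_lock; with compl_locked
         * set, the lock is still held when we return */
if (node2)
        handle_tw_list(node2, &ctx, &uring_locked);
        /* all the normal tw items run before the lock is dropped */
cond_resched();

/* ...after the loop: */
if (unlikely(compl_locked))
        ctx_commit_and_unlock(ctx);
        /* the prior completions are only committed here, i.e. later
         * than if handle_prior_tw_list() had committed them itself */
ctx_flush_and_put(ctx, &uring_locked);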


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-12-06  9:48 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2021-11-26 10:07 [PATCH v6 0/6] task work optimization Hao Xu
2021-11-26 10:07 ` [PATCH 1/6] io-wq: add helper to merge two wq_lists Hao Xu
2021-11-26 10:07 ` [PATCH 2/6] io_uring: add a priority tw list for irq completion work Hao Xu
2021-11-26 10:07 ` [PATCH 3/6] io_uring: add helper for task work execution code Hao Xu
2021-11-26 10:07 ` [PATCH 4/6] io_uring: split io_req_complete_post() and add a helper Hao Xu
2021-11-26 10:07 ` [PATCH 5/6] io_uring: move up io_put_kbuf() and io_put_rw_kbuf() Hao Xu
2021-11-26 10:07 ` [PATCH 6/6] io_uring: batch completion in prior_task_list Hao Xu
2021-11-26 12:56   ` Hao Xu
2021-11-26 13:37 ` [PATCH RESEND " Hao Xu
2021-11-27 15:24   ` [PATCH v7] " Hao Xu
2021-11-28 15:28     ` Pavel Begunkov
2021-12-03  1:39 ` [PATCH v6 0/6] task work optimization Pavel Begunkov
2021-12-03  2:01   ` Pavel Begunkov
2021-12-03  7:30     ` Hao Xu
2021-12-03 14:21       ` Pavel Begunkov
2021-12-05 15:02         ` Hao Xu
2021-12-05 15:42           ` Pavel Begunkov
2021-12-06  8:35             ` Hao Xu
2021-12-06  9:48               ` Hao Xu
2021-12-03  3:24   ` Hao Xu
2021-12-04 20:58 ` Pavel Begunkov
2021-12-05 15:11   ` Hao Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox