public inbox for [email protected]
* [PATCH 0/3] decouple work_list protection from the big wqe->lock
@ 2022-02-06  9:52 Hao Xu
  2022-02-06  9:52 ` [PATCH 1/3] io-wq: " Hao Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Hao Xu @ 2022-02-06  9:52 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Joseph Qi

wqe->lock is overloaded: it currently protects acct->work_list, the
hash state, nr_workers, wqe->free_list and so on. Let's first get the
work_list out of the wqe->lock mess by introducing a dedicated lock for
the work list. This is the first step towards solving the heavy
contention between work insertion and work consumption.
Benefits:
  - split locking for the bound and unbound work lists
  - reduced contention between work_list access and the workers' free_list.

As for the hash state: a work item for a given file never sits in both
the bound and the unbound work list at once, so the two lists never
touch the same hash entry, and it works well to protect the hash state
with the new per-list lock too.
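
To make the end state concrete, here is a minimal sketch of the split
(the struct below mirrors fs/io-wq.c after patch 1; sketch_enqueue() is
an illustrative reduction of io_wqe_enqueue(), not a function from the
patches):

struct io_wqe_acct {
        unsigned max_workers;
        int index;
        atomic_t nr_running;
        raw_spinlock_t lock;    /* new: protects only this acct's work_list */
        struct io_wq_work_list work_list;
        unsigned long flags;
};

/*
 * Bound and unbound inserts now take different acct->lock instances, so
 * they contend neither with each other nor with wqe->lock users
 * (free_list, nr_workers).
 */
static void sketch_enqueue(struct io_wqe_acct *acct, struct io_wq_work *work)
{
        raw_spin_lock(&acct->lock);
        wq_list_add_tail(&work->list, &acct->work_list);
        clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
        raw_spin_unlock(&acct->lock);
}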

Results:
set max_unbound_worker = 4, test with echo-server:
nice -n -15 ./io_uring_echo_server -p 8081 -f -n 1000 -l 16
(-n connection, -l workload)
before this patch:
Samples: 2M of event 'cycles:ppp', Event count (approx.): 1239982111074
Overhead  Command          Shared Object         Symbol
  28.59%  iou-wrk-10021    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
   8.89%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
   6.20%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock
   2.45%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
   2.36%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
   2.29%  iou-wrk-10021    [kernel.vmlinux]      [k] io_worker_handle_work
   1.29%  io_uring_echo_s  [kernel.vmlinux]      [k] io_wqe_enqueue
   1.06%  iou-wrk-10021    [kernel.vmlinux]      [k] io_wqe_worker
   1.06%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
   1.03%  iou-wrk-10021    [kernel.vmlinux]      [k] __schedule
   0.99%  iou-wrk-10021    [kernel.vmlinux]      [k] tcp_sendmsg_locked

with this patch:
Samples: 1M of event 'cycles:ppp', Event count (approx.): 708446691943
Overhead  Command          Shared Object         Symbol
  16.86%  iou-wrk-10893    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
   9.10%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock
   4.53%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
   2.87%  iou-wrk-10893    [kernel.vmlinux]      [k] io_worker_handle_work
   2.57%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
   2.56%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
   1.82%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
   1.33%  iou-wrk-10893    [kernel.vmlinux]      [k] io_wqe_worker
   1.26%  io_uring_echo_s  [kernel.vmlinux]      [k] try_to_wake_up

spin_lock slowpath overhead drops from 28.59% + 8.89% = 37.48% to 16.86% + 4.53% = 21.39%
TPS is similar, while CPU usage drops from almost 400% to about 350%


Hao Xu (3):
  io-wq: decouple work_list protection from the big wqe->lock
  io-wq: reduce acct->lock crossing functions lock/unlock
  io-wq: use IO_WQ_ACCT_NR rather than hardcoded number

 fs/io-wq.c | 114 ++++++++++++++++++++++++++---------------------------
 1 file changed, 57 insertions(+), 57 deletions(-)


base-commit: f6133fbd373811066c8441737e65f384c8f31974
-- 
2.25.1



* [PATCH 1/3] io-wq: decouple work_list protection from the big wqe->lock
  2022-02-06  9:52 [PATCH 0/3] decouple work_list protection from the big wqe->lock Hao Xu
@ 2022-02-06  9:52 ` Hao Xu
  2022-02-06  9:52 ` [PATCH 2/3] io-wq: reduce acct->lock crossing functions lock/unlock Hao Xu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Hao Xu @ 2022-02-06  9:52 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Joseph Qi

wqe->lock is overloaded: it currently protects acct->work_list, the
hash state, nr_workers, wqe->free_list and so on. Let's first get the
work_list out of the wqe->lock mess by introducing a dedicated lock for
the work list. This is the first step towards solving the heavy
contention between work insertion and work consumption.
Benefits:
  - split locking for the bound and unbound work lists
  - reduced contention between work_list access and the workers' free_list.

As for the hash state: a work item for a given file never sits in both
the bound and the unbound work list at once, so the two lists never
touch the same hash entry, and it works well to protect the hash state
with the new per-list lock too.

Results:
set max_unbound_worker = 4, test with echo-server:
nice -n -15 ./io_uring_echo_server -p 8081 -f -n 1000 -l 16
(-n connection, -l workload)
before this patch:
Samples: 2M of event 'cycles:ppp', Event count (approx.): 1239982111074
Overhead  Command          Shared Object         Symbol
  28.59%  iou-wrk-10021    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
   8.89%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
   6.20%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock
   2.45%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
   2.36%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
   2.29%  iou-wrk-10021    [kernel.vmlinux]      [k] io_worker_handle_work
   1.29%  io_uring_echo_s  [kernel.vmlinux]      [k] io_wqe_enqueue
   1.06%  iou-wrk-10021    [kernel.vmlinux]      [k] io_wqe_worker
   1.06%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
   1.03%  iou-wrk-10021    [kernel.vmlinux]      [k] __schedule
   0.99%  iou-wrk-10021    [kernel.vmlinux]      [k] tcp_sendmsg_locked

with this patch:
Samples: 1M of event 'cycles:ppp', Event count (approx.): 708446691943
Overhead  Command          Shared Object         Symbol
  16.86%  iou-wrk-10893    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
   9.10%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock
   4.53%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
   2.87%  iou-wrk-10893    [kernel.vmlinux]      [k] io_worker_handle_work
   2.57%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
   2.56%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
   1.82%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
   1.33%  iou-wrk-10893    [kernel.vmlinux]      [k] io_wqe_worker
   1.26%  io_uring_echo_s  [kernel.vmlinux]      [k] try_to_wake_up

spin_lock slowpath overhead drops from 28.59% + 8.89% = 37.48% to 16.86% + 4.53% = 21.39%
TPS is similar, while CPU usage drops from almost 400% to about 350%

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io-wq.c | 96 +++++++++++++++++++++++++++++-------------------------
 1 file changed, 52 insertions(+), 44 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index bb7f161bb19c..9595616ccaa3 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -76,6 +76,7 @@ struct io_wqe_acct {
 	unsigned max_workers;
 	int index;
 	atomic_t nr_running;
+	raw_spinlock_t lock;
 	struct io_wq_work_list work_list;
 	unsigned long flags;
 };
@@ -224,12 +225,12 @@ static void io_worker_exit(struct io_worker *worker)
 	if (worker->flags & IO_WORKER_F_FREE)
 		hlist_nulls_del_rcu(&worker->nulls_node);
 	list_del_rcu(&worker->all_list);
-	preempt_disable();
+	raw_spin_unlock(&wqe->lock);
 	io_wqe_dec_running(worker);
 	worker->flags = 0;
+	preempt_disable();
 	current->flags &= ~PF_IO_WORKER;
 	preempt_enable();
-	raw_spin_unlock(&wqe->lock);
 
 	kfree_rcu(worker, rcu);
 	io_worker_ref_put(wqe->wq);
@@ -385,7 +386,6 @@ static bool io_queue_worker_create(struct io_worker *worker,
 }
 
 static void io_wqe_dec_running(struct io_worker *worker)
-	__must_hold(wqe->lock)
 {
 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 	struct io_wqe *wqe = worker->wqe;
@@ -393,13 +393,18 @@ static void io_wqe_dec_running(struct io_worker *worker)
 	if (!(worker->flags & IO_WORKER_F_UP))
 		return;
 
-	if (atomic_dec_and_test(&acct->nr_running) && io_acct_run_queue(acct)) {
-		atomic_inc(&acct->nr_running);
-		atomic_inc(&wqe->wq->worker_refs);
+	if (!atomic_dec_and_test(&acct->nr_running))
+		return;
+	raw_spin_lock(&wqe->lock);
+	if (!io_acct_run_queue(acct)) {
 		raw_spin_unlock(&wqe->lock);
-		io_queue_worker_create(worker, acct, create_worker_cb);
-		raw_spin_lock(&wqe->lock);
+		return;
 	}
+
+	raw_spin_unlock(&wqe->lock);
+	atomic_inc(&acct->nr_running);
+	atomic_inc(&wqe->wq->worker_refs);
+	io_queue_worker_create(worker, acct, create_worker_cb);
 }
 
 /*
@@ -407,11 +412,12 @@ static void io_wqe_dec_running(struct io_worker *worker)
  * it's currently on the freelist
  */
 static void __io_worker_busy(struct io_wqe *wqe, struct io_worker *worker)
-	__must_hold(wqe->lock)
 {
 	if (worker->flags & IO_WORKER_F_FREE) {
 		worker->flags &= ~IO_WORKER_F_FREE;
+		raw_spin_lock(&wqe->lock);
 		hlist_nulls_del_init_rcu(&worker->nulls_node);
+		raw_spin_unlock(&wqe->lock);
 	}
 }
 
@@ -456,7 +462,7 @@ static bool io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
 
 static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
 					   struct io_worker *worker)
-	__must_hold(wqe->lock)
+	__must_hold(acct->lock)
 {
 	struct io_wq_work_node *node, *prev;
 	struct io_wq_work *work, *tail;
@@ -498,9 +504,9 @@ static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
 		 * work being added and clearing the stalled bit.
 		 */
 		set_bit(IO_ACCT_STALLED_BIT, &acct->flags);
-		raw_spin_unlock(&wqe->lock);
+		raw_spin_unlock(&acct->lock);
 		unstalled = io_wait_on_hash(wqe, stall_hash);
-		raw_spin_lock(&wqe->lock);
+		raw_spin_lock(&acct->lock);
 		if (unstalled) {
 			clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 			if (wq_has_sleeper(&wqe->wq->hash->wait))
@@ -538,7 +544,7 @@ static void io_assign_current_work(struct io_worker *worker,
 static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work);
 
 static void io_worker_handle_work(struct io_worker *worker)
-	__releases(wqe->lock)
+	__releases(acct->lock)
 {
 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 	struct io_wqe *wqe = worker->wqe;
@@ -556,6 +562,7 @@ static void io_worker_handle_work(struct io_worker *worker)
 		 * clear the stalled flag.
 		 */
 		work = io_get_next_work(acct, worker);
+		raw_spin_unlock(&acct->lock);
 		if (work) {
 			__io_worker_busy(wqe, worker);
 
@@ -569,10 +576,9 @@ static void io_worker_handle_work(struct io_worker *worker)
 			raw_spin_lock(&worker->lock);
 			worker->next_work = work;
 			raw_spin_unlock(&worker->lock);
-		}
-		raw_spin_unlock(&wqe->lock);
-		if (!work)
+		} else {
 			break;
+		}
 		io_assign_current_work(worker, work);
 		__set_current_state(TASK_RUNNING);
 
@@ -609,7 +615,7 @@ static void io_worker_handle_work(struct io_worker *worker)
 			}
 		} while (work);
 
-		raw_spin_lock(&wqe->lock);
+		raw_spin_lock(&acct->lock);
 	} while (1);
 }
 
@@ -634,11 +640,14 @@ static int io_wqe_worker(void *data)
 
 		set_current_state(TASK_INTERRUPTIBLE);
 loop:
-		raw_spin_lock(&wqe->lock);
+		raw_spin_lock(&acct->lock);
 		if (io_acct_run_queue(acct)) {
 			io_worker_handle_work(worker);
 			goto loop;
+		} else {
+			raw_spin_unlock(&acct->lock);
 		}
+		raw_spin_lock(&wqe->lock);
 		/* timed out, exit unless we're the last worker */
 		if (last_timeout && acct->nr_workers > 1) {
 			acct->nr_workers--;
@@ -663,7 +672,7 @@ static int io_wqe_worker(void *data)
 	}
 
 	if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
-		raw_spin_lock(&wqe->lock);
+		raw_spin_lock(&acct->lock);
 		io_worker_handle_work(worker);
 	}
 
@@ -705,10 +714,7 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
 		return;
 
 	worker->flags &= ~IO_WORKER_F_RUNNING;
-
-	raw_spin_lock(&worker->wqe->lock);
 	io_wqe_dec_running(worker);
-	raw_spin_unlock(&worker->wqe->lock);
 }
 
 static void io_init_new_worker(struct io_wqe *wqe, struct io_worker *worker,
@@ -778,10 +784,12 @@ static void create_worker_cont(struct callback_head *cb)
 				.cancel_all	= true,
 			};
 
+			raw_spin_unlock(&wqe->lock);
 			while (io_acct_cancel_pending_work(wqe, acct, &match))
-				raw_spin_lock(&wqe->lock);
+				;
+		} else {
+			raw_spin_unlock(&wqe->lock);
 		}
-		raw_spin_unlock(&wqe->lock);
 		io_worker_ref_put(wqe->wq);
 		kfree(worker);
 		return;
@@ -914,6 +922,7 @@ static bool io_wq_work_match_item(struct io_wq_work *work, void *data)
 static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
 {
 	struct io_wqe_acct *acct = io_work_get_acct(wqe, work);
+	struct io_cb_cancel_data match;
 	unsigned work_flags = work->flags;
 	bool do_create;
 
@@ -927,10 +936,12 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
 		return;
 	}
 
-	raw_spin_lock(&wqe->lock);
+	raw_spin_lock(&acct->lock);
 	io_wqe_insert_work(wqe, work);
 	clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
+	raw_spin_unlock(&acct->lock);
 
+	raw_spin_lock(&wqe->lock);
 	rcu_read_lock();
 	do_create = !io_wqe_activate_free_worker(wqe, acct);
 	rcu_read_unlock();
@@ -946,18 +957,18 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
 			return;
 
 		raw_spin_lock(&wqe->lock);
-		/* fatal condition, failed to create the first worker */
-		if (!acct->nr_workers) {
-			struct io_cb_cancel_data match = {
-				.fn		= io_wq_work_match_item,
-				.data		= work,
-				.cancel_all	= false,
-			};
-
-			if (io_acct_cancel_pending_work(wqe, acct, &match))
-				raw_spin_lock(&wqe->lock);
+		if (acct->nr_workers) {
+			raw_spin_unlock(&wqe->lock);
+			return;
 		}
 		raw_spin_unlock(&wqe->lock);
+
+		/* fatal condition, failed to create the first worker */
+		match.fn		= io_wq_work_match_item,
+		match.data		= work,
+		match.cancel_all	= false,
+
+		io_acct_cancel_pending_work(wqe, acct, &match);
 	}
 }
 
@@ -1032,22 +1043,23 @@ static inline void io_wqe_remove_pending(struct io_wqe *wqe,
 static bool io_acct_cancel_pending_work(struct io_wqe *wqe,
 					struct io_wqe_acct *acct,
 					struct io_cb_cancel_data *match)
-	__releases(wqe->lock)
 {
 	struct io_wq_work_node *node, *prev;
 	struct io_wq_work *work;
 
+	raw_spin_lock(&acct->lock);
 	wq_list_for_each(node, prev, &acct->work_list) {
 		work = container_of(node, struct io_wq_work, list);
 		if (!match->fn(work, match->data))
 			continue;
 		io_wqe_remove_pending(wqe, work, prev);
-		raw_spin_unlock(&wqe->lock);
+		raw_spin_unlock(&acct->lock);
 		io_run_cancel(work, wqe);
 		match->nr_pending++;
 		/* not safe to continue after unlock */
 		return true;
 	}
+	raw_spin_unlock(&acct->lock);
 
 	return false;
 }
@@ -1061,7 +1073,6 @@ static void io_wqe_cancel_pending_work(struct io_wqe *wqe,
 		struct io_wqe_acct *acct = io_get_acct(wqe, i == 0);
 
 		if (io_acct_cancel_pending_work(wqe, acct, match)) {
-			raw_spin_lock(&wqe->lock);
 			if (match->cancel_all)
 				goto retry;
 			break;
@@ -1103,13 +1114,11 @@ enum io_wq_cancel io_wq_cancel_cb(struct io_wq *wq, work_cancel_fn *cancel,
 	for_each_node(node) {
 		struct io_wqe *wqe = wq->wqes[node];
 
-		raw_spin_lock(&wqe->lock);
 		io_wqe_cancel_pending_work(wqe, &match);
-		if (match.nr_pending && !match.cancel_all) {
-			raw_spin_unlock(&wqe->lock);
+		if (match.nr_pending && !match.cancel_all)
 			return IO_WQ_CANCEL_OK;
-		}
 
+		raw_spin_lock(&wqe->lock);
 		io_wqe_cancel_running_work(wqe, &match);
 		raw_spin_unlock(&wqe->lock);
 		if (match.nr_running && !match.cancel_all)
@@ -1190,6 +1199,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
 			acct->index = i;
 			atomic_set(&acct->nr_running, 0);
 			INIT_WQ_LIST(&acct->work_list);
+			raw_spin_lock_init(&acct->lock);
 		}
 		wqe->wq = wq;
 		raw_spin_lock_init(&wqe->lock);
@@ -1282,9 +1292,7 @@ static void io_wq_destroy(struct io_wq *wq)
 			.fn		= io_wq_work_match_all,
 			.cancel_all	= true,
 		};
-		raw_spin_lock(&wqe->lock);
 		io_wqe_cancel_pending_work(wqe, &match);
-		raw_spin_unlock(&wqe->lock);
 		free_cpumask_var(wqe->cpu_mask);
 		kfree(wqe);
 	}
-- 
2.25.1



* [PATCH 2/3] io-wq: reduce acct->lock crossing functions lock/unlock
  2022-02-06  9:52 [PATCH 0/3] decouple work_list protection from the big wqe->lock Hao Xu
  2022-02-06  9:52 ` [PATCH 1/3] io-wq: " Hao Xu
@ 2022-02-06  9:52 ` Hao Xu
  2022-02-06  9:52 ` [PATCH 3/3] io-wq: use IO_WQ_ACCT_NR rather than hardcoded number Hao Xu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Hao Xu @ 2022-02-06  9:52 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Joseph Qi

Reduce the places where acct->lock is taken in one function and
released in another; keeping each lock/unlock pair within a single
function makes the code clearer.
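
The main instance is io_acct_run_queue(): after patch 1, one call site
ran it under wqe->lock and another under acct->lock; with this patch
the helper takes and releases acct->lock itself. A simplified
before/after sketch (bodies condensed from fs/io-wq.c):

/* before: the caller must already hold a lock covering the work_list */
static inline bool io_acct_run_queue(struct io_wqe_acct *acct)
{
        return !wq_list_empty(&acct->work_list) &&
               !test_bit(IO_ACCT_STALLED_BIT, &acct->flags);
}

/* after: locking is internal, callers never touch acct->lock for this */
static inline bool io_acct_run_queue(struct io_wqe_acct *acct)
{
        bool ret;

        raw_spin_lock(&acct->lock);
        ret = !wq_list_empty(&acct->work_list) &&
              !test_bit(IO_ACCT_STALLED_BIT, &acct->flags);
        raw_spin_unlock(&acct->lock);
        return ret;
}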

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io-wq.c | 32 ++++++++++++--------------------
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 9595616ccaa3..f7b7fa396faf 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -239,10 +239,15 @@ static void io_worker_exit(struct io_worker *worker)
 
 static inline bool io_acct_run_queue(struct io_wqe_acct *acct)
 {
+	bool ret = false;
+
+	raw_spin_lock(&acct->lock);
 	if (!wq_list_empty(&acct->work_list) &&
 	    !test_bit(IO_ACCT_STALLED_BIT, &acct->flags))
-		return true;
-	return false;
+		ret = true;
+	raw_spin_unlock(&acct->lock);
+
+	return ret;
 }
 
 /*
@@ -395,13 +400,9 @@ static void io_wqe_dec_running(struct io_worker *worker)
 
 	if (!atomic_dec_and_test(&acct->nr_running))
 		return;
-	raw_spin_lock(&wqe->lock);
-	if (!io_acct_run_queue(acct)) {
-		raw_spin_unlock(&wqe->lock);
+	if (!io_acct_run_queue(acct))
 		return;
-	}
 
-	raw_spin_unlock(&wqe->lock);
 	atomic_inc(&acct->nr_running);
 	atomic_inc(&wqe->wq->worker_refs);
 	io_queue_worker_create(worker, acct, create_worker_cb);
@@ -544,7 +545,6 @@ static void io_assign_current_work(struct io_worker *worker,
 static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work);
 
 static void io_worker_handle_work(struct io_worker *worker)
-	__releases(acct->lock)
 {
 	struct io_wqe_acct *acct = io_wqe_get_acct(worker);
 	struct io_wqe *wqe = worker->wqe;
@@ -561,6 +561,7 @@ static void io_worker_handle_work(struct io_worker *worker)
 		 * can't make progress, any work completion or insertion will
 		 * clear the stalled flag.
 		 */
+		raw_spin_lock(&acct->lock);
 		work = io_get_next_work(acct, worker);
 		raw_spin_unlock(&acct->lock);
 		if (work) {
@@ -614,8 +615,6 @@ static void io_worker_handle_work(struct io_worker *worker)
 					wake_up(&wq->hash->wait);
 			}
 		} while (work);
-
-		raw_spin_lock(&acct->lock);
 	} while (1);
 }
 
@@ -639,14 +638,9 @@ static int io_wqe_worker(void *data)
 		long ret;
 
 		set_current_state(TASK_INTERRUPTIBLE);
-loop:
-		raw_spin_lock(&acct->lock);
-		if (io_acct_run_queue(acct)) {
+		while (io_acct_run_queue(acct))
 			io_worker_handle_work(worker);
-			goto loop;
-		} else {
-			raw_spin_unlock(&acct->lock);
-		}
+
 		raw_spin_lock(&wqe->lock);
 		/* timed out, exit unless we're the last worker */
 		if (last_timeout && acct->nr_workers > 1) {
@@ -671,10 +665,8 @@ static int io_wqe_worker(void *data)
 		last_timeout = !ret;
 	}
 
-	if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
-		raw_spin_lock(&acct->lock);
+	if (test_bit(IO_WQ_BIT_EXIT, &wq->state))
 		io_worker_handle_work(worker);
-	}
 
 	audit_free(current);
 	io_worker_exit(worker);
-- 
2.25.1



* [PATCH 3/3] io-wq: use IO_WQ_ACCT_NR rather than hardcoded number
  2022-02-06  9:52 [PATCH 0/3] decouple work_list protection from the big wqe->lock Hao Xu
  2022-02-06  9:52 ` [PATCH 1/3] io-wq: " Hao Xu
  2022-02-06  9:52 ` [PATCH 2/3] io-wq: reduce acct->lock crossing functions lock/unlock Hao Xu
@ 2022-02-06  9:52 ` Hao Xu
  2022-02-11 16:55 ` [PATCH 0/3] decouple work_list protection from the big wqe->lock Jens Axboe
  2022-02-11 16:56 ` Jens Axboe
  4 siblings, 0 replies; 7+ messages in thread
From: Hao Xu @ 2022-02-06  9:52 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Pavel Begunkov, Joseph Qi

It's better to use the existing enum constant IO_WQ_ACCT_NR than a
hardcoded number to size the acct array (and the loop over it).
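
For reference, the pattern looks like this (the enum is the one already
defined in fs/io-wq.c; the comments are annotation only):

enum {
        IO_WQ_ACCT_BOUND,
        IO_WQ_ACCT_UNBOUND,
        IO_WQ_ACCT_NR,          /* number of entries above */
};

/* sizing the array from the enum keeps the two in sync automatically */
struct io_wqe_acct acct[IO_WQ_ACCT_NR];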

Signed-off-by: Hao Xu <[email protected]>
---
 fs/io-wq.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index f7b7fa396faf..5b93fa67d346 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -92,7 +92,7 @@ enum {
  */
 struct io_wqe {
 	raw_spinlock_t lock;
-	struct io_wqe_acct acct[2];
+	struct io_wqe_acct acct[IO_WQ_ACCT_NR];
 
 	int node;
 
@@ -1376,7 +1376,7 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
 	BUILD_BUG_ON((int) IO_WQ_ACCT_UNBOUND != (int) IO_WQ_UNBOUND);
 	BUILD_BUG_ON((int) IO_WQ_ACCT_NR      != 2);
 
-	for (i = 0; i < 2; i++) {
+	for (i = 0; i < IO_WQ_ACCT_NR; i++) {
 		if (new_count[i] > task_rlimit(current, RLIMIT_NPROC))
 			new_count[i] = task_rlimit(current, RLIMIT_NPROC);
 	}
-- 
2.25.1



* Re: [PATCH 0/3] decouple work_list protection from the big wqe->lock
  2022-02-06  9:52 [PATCH 0/3] decouple work_list protection from the big wqe->lock Hao Xu
                   ` (2 preceding siblings ...)
  2022-02-06  9:52 ` [PATCH 3/3] io-wq: use IO_WQ_ACCT_NR rather than hardcoded number Hao Xu
@ 2022-02-11 16:55 ` Jens Axboe
  2022-02-14 17:18   ` Hao Xu
  2022-02-11 16:56 ` Jens Axboe
  4 siblings, 1 reply; 7+ messages in thread
From: Jens Axboe @ 2022-02-11 16:55 UTC (permalink / raw)
  To: Hao Xu, io-uring; +Cc: Pavel Begunkov, Joseph Qi

On 2/6/22 2:52 AM, Hao Xu wrote:
> wqe->lock is overloaded: it currently protects acct->work_list, the
> hash state, nr_workers, wqe->free_list and so on. Let's first get the
> work_list out of the wqe->lock mess by introducing a dedicated lock for
> the work list. This is the first step towards solving the heavy
> contention between work insertion and work consumption.
> Benefits:
>   - split locking for the bound and unbound work lists
>   - reduced contention between work_list access and the workers' free_list.
> 
> As for the hash state: a work item for a given file never sits in both
> the bound and the unbound work list at once, so the two lists never
> touch the same hash entry, and it works well to protect the hash state
> with the new per-list lock too.
> 
> Results:
> set max_unbound_worker = 4, test with echo-server:
> nice -n -15 ./io_uring_echo_server -p 8081 -f -n 1000 -l 16
> (-n connection, -l workload)
> before this patch:
> Samples: 2M of event 'cycles:ppp', Event count (approx.): 1239982111074
> Overhead  Command          Shared Object         Symbol
>   28.59%  iou-wrk-10021    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>    8.89%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>    6.20%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock
>    2.45%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
>    2.36%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
>    2.29%  iou-wrk-10021    [kernel.vmlinux]      [k] io_worker_handle_work
>    1.29%  io_uring_echo_s  [kernel.vmlinux]      [k] io_wqe_enqueue
>    1.06%  iou-wrk-10021    [kernel.vmlinux]      [k] io_wqe_worker
>    1.06%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
>    1.03%  iou-wrk-10021    [kernel.vmlinux]      [k] __schedule
>    0.99%  iou-wrk-10021    [kernel.vmlinux]      [k] tcp_sendmsg_locked
> 
> with this patch:
> Samples: 1M of event 'cycles:ppp', Event count (approx.): 708446691943
> Overhead  Command          Shared Object         Symbol
>   16.86%  iou-wrk-10893    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>    9.10%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock
>    4.53%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>    2.87%  iou-wrk-10893    [kernel.vmlinux]      [k] io_worker_handle_work
>    2.57%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
>    2.56%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
>    1.82%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
>    1.33%  iou-wrk-10893    [kernel.vmlinux]      [k] io_wqe_worker
>    1.26%  io_uring_echo_s  [kernel.vmlinux]      [k] try_to_wake_up
> 
> spin_lock slowpath overhead drops from 28.59% + 8.89% = 37.48% to 16.86% + 4.53% = 21.39%
> TPS is similar, while CPU usage drops from almost 400% to about 350%

I think this looks like a good start to improving the io-wq locking. I
didn't spot anything immediately wrong with the series; my only worry
was worker->flags protection, but I _think_ that looks OK in terms of
the worker itself doing the manipulations.

Let's queue this up for 5.18 testing, thanks!

-- 
Jens Axboe



* Re: [PATCH 0/3] decouple work_list protection from the big wqe->lock
  2022-02-06  9:52 [PATCH 0/3] decouple work_list protection from the big wqe->lock Hao Xu
                   ` (3 preceding siblings ...)
  2022-02-11 16:55 ` [PATCH 0/3] decouple work_list protection from the big wqe->lock Jens Axboe
@ 2022-02-11 16:56 ` Jens Axboe
  4 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2022-02-11 16:56 UTC (permalink / raw)
  To: Hao Xu, io-uring; +Cc: Joseph Qi, Pavel Begunkov

On Sun, 6 Feb 2022 17:52:38 +0800, Hao Xu wrote:
> wqe->lock is overloaded: it currently protects acct->work_list, the
> hash state, nr_workers, wqe->free_list and so on. Let's first get the
> work_list out of the wqe->lock mess by introducing a dedicated lock for
> the work list. This is the first step towards solving the heavy
> contention between work insertion and work consumption.
> Benefits:
>   - split locking for the bound and unbound work lists
>   - reduced contention between work_list access and the workers' free_list.
> 
> [...]

Applied, thanks!

[1/3] io-wq: decouple work_list protection from the big wqe->lock
      (no commit info)
[2/3] io-wq: reduce acct->lock crossing functions lock/unlock
      (no commit info)
[3/3] io-wq: use IO_WQ_ACCT_NR rather than hardcoded number
      (no commit info)

Best regards,
-- 
Jens Axboe




* Re: [PATCH 0/3] decouple work_list protection from the big wqe->lock
  2022-02-11 16:55 ` [PATCH 0/3] decouple work_list protection from the big wqe->lock Jens Axboe
@ 2022-02-14 17:18   ` Hao Xu
  0 siblings, 0 replies; 7+ messages in thread
From: Hao Xu @ 2022-02-14 17:18 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: Pavel Begunkov, Joseph Qi


On 2/12/22 00:55, Jens Axboe wrote:
> On 2/6/22 2:52 AM, Hao Xu wrote:
>> wqe->lock is overloaded: it currently protects acct->work_list, the
>> hash state, nr_workers, wqe->free_list and so on. Let's first get the
>> work_list out of the wqe->lock mess by introducing a dedicated lock for
>> the work list. This is the first step towards solving the heavy
>> contention between work insertion and work consumption.
>> Benefits:
>>    - split locking for the bound and unbound work lists
>>    - reduced contention between work_list access and the workers' free_list.
>>
>> As for the hash state: a work item for a given file never sits in both
>> the bound and the unbound work list at once, so the two lists never
>> touch the same hash entry, and it works well to protect the hash state
>> with the new per-list lock too.
>>
>> Results:
>> set max_unbound_worker = 4, test with echo-server:
>> nice -n -15 ./io_uring_echo_server -p 8081 -f -n 1000 -l 16
>> (-n connection, -l workload)
>> before this patch:
>> Samples: 2M of event 'cycles:ppp', Event count (approx.): 1239982111074
>> Overhead  Command          Shared Object         Symbol
>>    28.59%  iou-wrk-10021    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>     8.89%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>     6.20%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock
>>     2.45%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
>>     2.36%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
>>     2.29%  iou-wrk-10021    [kernel.vmlinux]      [k] io_worker_handle_work
>>     1.29%  io_uring_echo_s  [kernel.vmlinux]      [k] io_wqe_enqueue
>>     1.06%  iou-wrk-10021    [kernel.vmlinux]      [k] io_wqe_worker
>>     1.06%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
>>     1.03%  iou-wrk-10021    [kernel.vmlinux]      [k] __schedule
>>     0.99%  iou-wrk-10021    [kernel.vmlinux]      [k] tcp_sendmsg_locked
>>
>> with this patch:
>> Samples: 1M of event 'cycles:ppp', Event count (approx.): 708446691943
>> Overhead  Command          Shared Object         Symbol
>>    16.86%  iou-wrk-10893    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>     9.10%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock
>>     4.53%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>     2.87%  iou-wrk-10893    [kernel.vmlinux]      [k] io_worker_handle_work
>>     2.57%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
>>     2.56%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
>>     1.82%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
>>     1.33%  iou-wrk-10893    [kernel.vmlinux]      [k] io_wqe_worker
>>     1.26%  io_uring_echo_s  [kernel.vmlinux]      [k] try_to_wake_up
>>
>> spin_lock slowpath overhead drops from 28.59% + 8.89% = 37.48% to 16.86% + 4.53% = 21.39%
>> TPS is similar, while CPU usage drops from almost 400% to about 350%
> I think this looks like a good start to improving the io-wq locking. I
> didnt spot anything immediately wrong with the series, my only worker
> was worker->flags protection, but I _think_ that looks OK to in terms of
> the worker itself doing the manipulations.

Yes, while writing the patches I went over the code to make sure
worker->flags is only ever manipulated by the worker itself.
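
Schematically, the invariant relied on is ownership rather than locking
(a sketch only, not code from the series):

/*
 * worker->flags is only ever written from the worker's own task
 * context (including the sched-in/sched-out hooks, which run on behalf
 * of that task), so the unlocked read-modify-write below cannot race
 * with another writer.
 */
static void sketch_mark_not_free(struct io_worker *worker)
{
        if (worker->flags & IO_WORKER_F_FREE)
                worker->flags &= ~IO_WORKER_F_FREE;
}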


Regards,

Hao

>
> Let's queue this up for 5.18 testing, thanks!
>


end of thread

Thread overview: 7+ messages
2022-02-06  9:52 [PATCH 0/3] decouple work_list protection from the big wqe->lock Hao Xu
2022-02-06  9:52 ` [PATCH 1/3] io-wq: " Hao Xu
2022-02-06  9:52 ` [PATCH 2/3] io-wq: reduce acct->lock crossing functions lock/unlock Hao Xu
2022-02-06  9:52 ` [PATCH 3/3] io-wq: use IO_WQ_ACCT_NR rather than hardcoded number Hao Xu
2022-02-11 16:55 ` [PATCH 0/3] decouple work_list protection from the big wqe->lock Jens Axboe
2022-02-14 17:18   ` Hao Xu
2022-02-11 16:56 ` Jens Axboe
