public inbox for io-uring@vger.kernel.org
* [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying
@ 2025-05-22  9:09 Fengnan Chang
  2025-05-22 11:34 ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Fengnan Chang @ 2025-05-22  9:09 UTC
  To: axboe, asml.silence; +Cc: io-uring, Fengnan Chang, Diangang Li

When running fio with buffered I/O and stable IOPS, I observed that some
io_worker threads are constantly being created and destroyed.
This can be reproduced with:
fio --ioengine=io_uring --rw=randrw --bs=4k --direct=0 --size=100G
--iodepth=256 --filename=/data03/fio-rand-read --name=test
With ps -L -p <pid> you can see about 256 io_worker threads, and the
thread ids keep changing.
Some debugging shows that most worker creation happens in
create_worker_cb. In create_worker_cb, if all workers have gone to sleep
and more work is pending, we try to create a new worker (call it worker
B) to handle it. Meanwhile, when new work arrives, io_wq_enqueue will
activate a free worker (call it worker A) or create a new one. Workers A
and B can then end up competing for the same work item. Since buffered
writes are hashed work, and buffered writes to a given file are
serialized, only one worker gets the work in the end; the other goes
back to sleep. Repeated many times, this creates a lot of io_worker
threads that handle only a few work items, or none at all, and then exit.
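
Roughly, the sequence is (an illustrative timeline, not exact code paths):

  a worker        blocks on a hashed buffered write; the scheduler calls
                  io_wq_dec_running(), which sees pending work and queues
                  create_worker_cb() -> a new worker B is created
  io_wq_enqueue() new work arrives and a free worker A is activated
  workers A + B   both look for work, but the hashed writes to that file
                  are serialized, so only one of them gets the work item
  the loser       has nothing to run, goes idle and eventually exits
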
There are several solutions:
1. Since all work is inserted via io_wq_enqueue, and io_wq_enqueue can
create workers too, the worker creation in create_worker_cb could simply
be removed. That might hurt performance, though.
2. When the wq->hash->map bit is already set and a hashed work item is
inserted, put the new work only on wq->hash_tail instead of also linking
it into work_list; io_worker_handle_work would then need to check
hash_tail after finishing a whole dependent chain. io_acct_run_queue
would return false when such work is inserted, so no new thread would be
created in io_wq_dec_running either.
3. In io_wq_dec_running, check whether the pending work is a single hash
chain. If it is, don't create a worker; io_wq_enqueue will handle it.

I chose option 3 to avoid this problem. In my tests there is no
performance degradation.

Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Diangang Li <lidiangang@bytedance.com>
---
 io_uring/io-wq.c | 73 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 54 insertions(+), 19 deletions(-)

diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index 04a75d666195..37723a785204 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -412,25 +412,6 @@ static bool io_queue_worker_create(struct io_worker *worker,
 	return false;
 }
 
-static void io_wq_dec_running(struct io_worker *worker)
-{
-	struct io_wq_acct *acct = io_wq_get_acct(worker);
-	struct io_wq *wq = worker->wq;
-
-	if (!test_bit(IO_WORKER_F_UP, &worker->flags))
-		return;
-
-	if (!atomic_dec_and_test(&acct->nr_running))
-		return;
-	if (!io_acct_run_queue(acct))
-		return;
-
-	raw_spin_unlock(&acct->lock);
-	atomic_inc(&acct->nr_running);
-	atomic_inc(&wq->worker_refs);
-	io_queue_worker_create(worker, acct, create_worker_cb);
-}
-
 /*
  * Worker will start processing some work. Move it to the busy list, if
  * it's currently on the freelist
@@ -484,6 +465,60 @@ static bool io_wait_on_hash(struct io_wq *wq, unsigned int hash)
 	return ret;
 }
 
+static bool only_one_hashed_work(struct io_wq_acct *acct,
+				struct io_wq *wq)
+	__must_hold(acct->lock)
+{
+	struct io_wq_work_node *node, *prev;
+	struct io_wq_work *work, *tail;
+	unsigned int len = 0;
+
+	wq_list_for_each(node, prev, &acct->work_list) {
+		unsigned int work_flags;
+		unsigned int hash;
+
+		work = container_of(node, struct io_wq_work, list);
+		len += 1;
+		if (len > 1)
+			return false;
+		/* not hashed, can run anytime */
+		work_flags = atomic_read(&work->flags);
+		if (!__io_wq_is_hashed(work_flags))
+			return false;
+
+		hash = __io_get_work_hash(work_flags);
+		/* all items with this hash lie in [work, tail] */
+		tail = wq->hash_tail[hash];
+		/* fast forward to a next hash, for-each will fix up @prev */
+		node = &tail->list;
+	}
+
+	return true;
+}
+
+static void io_wq_dec_running(struct io_worker *worker)
+{
+	struct io_wq_acct *acct = io_wq_get_acct(worker);
+	struct io_wq *wq = worker->wq;
+
+	if (!test_bit(IO_WORKER_F_UP, &worker->flags))
+		return;
+
+	if (!atomic_dec_and_test(&acct->nr_running))
+		return;
+	if (!io_acct_run_queue(acct))
+		return;
+	if (only_one_hashed_work(acct, wq)) {
+		raw_spin_unlock(&acct->lock);
+		return;
+	}
+
+	raw_spin_unlock(&acct->lock);
+	atomic_inc(&acct->nr_running);
+	atomic_inc(&wq->worker_refs);
+	io_queue_worker_create(worker, acct, create_worker_cb);
+}
+
 static struct io_wq_work *io_get_next_work(struct io_wq_acct *acct,
 					   struct io_wq *wq)
 	__must_hold(acct->lock)
-- 
2.20.1



* Re: [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying
  2025-05-22  9:09 [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying Fengnan Chang
@ 2025-05-22 11:34 ` Jens Axboe
  2025-05-22 12:01   ` [External] " Fengnan Chang
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2025-05-22 11:34 UTC
  To: Fengnan Chang, asml.silence; +Cc: io-uring, Diangang Li

On 5/22/25 3:09 AM, Fengnan Chang wrote:
> When running fio with buffer io and stable iops, I observed that
> part of io_worker threads keeps creating and destroying.
> Using this command can reproduce:
> fio --ioengine=io_uring --rw=randrw --bs=4k --direct=0 --size=100G
> --iodepth=256 --filename=/data03/fio-rand-read --name=test
> ps -L -p pid, you can see about 256 io_worker threads, and thread
> id keeps changing.
> And I do some debugging, most workers create happen in
> create_worker_cb. In create_worker_cb, if all workers have gone to
> sleep, and we have more work, we try to create new worker (let's
> call it worker B) to handle it.  And when new work comes,
> io_wq_enqueue will activate free worker (let's call it worker A) or
> create new one. It may cause worker A and B compete for one work.
> Since buffered write is hashed work, buffered write to a given file
> is serialized, only one worker gets the work in the end, the other
> worker goes to sleep. After repeating it many times, a lot of
> io_worker threads created, handles a few works or even no work to
> handle,and exit.
> There are several solutions:
> 1. Since all work is insert in io_wq_enqueue, io_wq_enqueue will
> create worker too, remove create worker action in create_worker_cb
> is fine, maybe affect performance?
> 2. When wq->hash->map bit is set, insert hashed work item, new work
> only put in wq->hash_tail, not link to work_list,
> io_worker_handle_work need to check hash_tail after a whole dependent
> link, io_acct_run_queue will return false when new work insert, no
> new thread will be created either in io_wqe_dec_running.
> 3. Check is there only one hash bucket in io_wqe_dec_running. If only
> one hash bucket, don't create worker, io_wq_enqueue will handle it.

Nice catch on this! Does indeed look like a problem. Not a huge fan of
approach 3. Without having really looked into this yet, my initial idea
would've been to do some variant of solution 1 above. io_wq_enqueue()
checks if we need to create a worker, which basically boils down to "do
we have a free worker right now". If we do not, we create one. But the
question is really "do we need a new worker for this?", and if we're
inserting hashed work and we have existing hashed work for the SAME
hash and it's busy, then the answer should be "no" as it'd be pointless
to create that worker.

Would it be feasible to augment the check in there such that
io_wq_enqueue() doesn't create a new worker for that case? And I guess a
followup question is: would that even be enough, or do we always need to
cover the io_wq_dec_running() case as well, since io_acct_run_queue()
will return true there too, as it doesn't know about this either?
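
Something like the below is what I had in mind - completely untested,
just to illustrate the idea ('work_flags' and 'do_create' stand in for
whatever io_wq_enqueue() actually has at that point):

	/*
	 * Hashed work for a hash that already has pending work queued
	 * (see wq->hash_tail) can't be run by an extra worker anyway,
	 * so don't bother creating one.
	 */
	if (__io_wq_is_hashed(work_flags) &&
	    wq->hash_tail[__io_get_work_hash(work_flags)])
		do_create = false;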

I'll take a closer look at this later today, but figured I'd shoot some
questions your way first...

-- 
Jens Axboe


* Re: [External] Re: [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying
  2025-05-22 11:34 ` Jens Axboe
@ 2025-05-22 12:01   ` Fengnan Chang
  2025-05-22 14:29     ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Fengnan Chang @ 2025-05-22 12:01 UTC
  To: Jens Axboe; +Cc: asml.silence, io-uring, Diangang Li

Jens Axboe <axboe@kernel.dk> wrote on Thursday, May 22, 2025 at 19:35:
>
> On 5/22/25 3:09 AM, Fengnan Chang wrote:
> > When running fio with buffer io and stable iops, I observed that
> > part of io_worker threads keeps creating and destroying.
> > Using this command can reproduce:
> > fio --ioengine=io_uring --rw=randrw --bs=4k --direct=0 --size=100G
> > --iodepth=256 --filename=/data03/fio-rand-read --name=test
> > ps -L -p pid, you can see about 256 io_worker threads, and thread
> > id keeps changing.
> > And I do some debugging, most workers create happen in
> > create_worker_cb. In create_worker_cb, if all workers have gone to
> > sleep, and we have more work, we try to create new worker (let's
> > call it worker B) to handle it.  And when new work comes,
> > io_wq_enqueue will activate free worker (let's call it worker A) or
> > create new one. It may cause worker A and B compete for one work.
> > Since buffered write is hashed work, buffered write to a given file
> > is serialized, only one worker gets the work in the end, the other
> > worker goes to sleep. After repeating it many times, a lot of
> > io_worker threads created, handles a few works or even no work to
> > handle,and exit.
> > There are several solutions:
> > 1. Since all work is insert in io_wq_enqueue, io_wq_enqueue will
> > create worker too, remove create worker action in create_worker_cb
> > is fine, maybe affect performance?
> > 2. When wq->hash->map bit is set, insert hashed work item, new work
> > only put in wq->hash_tail, not link to work_list,
> > io_worker_handle_work need to check hash_tail after a whole dependent
> > link, io_acct_run_queue will return false when new work insert, no
> > new thread will be created either in io_wqe_dec_running.
> > 3. Check is there only one hash bucket in io_wqe_dec_running. If only
> > one hash bucket, don't create worker, io_wq_enqueue will handle it.
>
> Nice catch on this! Does indeed look like a problem. Not a huge fan of
> approach 3. Without having really looked into this yet, my initial idea
> would've been to do some variant of solution 1 above. io_wq_enqueue()
> checks if we need to create a worker, which basically boils down to "do
> we have a free worker right now". If we do not, we create one. But the
> question is really "do we need a new worker for this?", and if we're
> inserting hashed worked and we have existing hashed work for the SAME
> hash and it's busy, then the answer should be "no" as it'd be pointless
> to create that worker.

Agree

>
> Would it be feasible to augment the check in there such that
> io_wq_enqueue() doesn't create a new worker for that case? And I guess a
> followup question is, would that even be enough, do we always need to
> cover the io_wq_dec_running() running case as well as
> io_acct_run_queue() will return true as well since it doesn't know about
> this either?
Yes, it is feasible to avoid creating a worker by adding some checks in
io_wq_enqueue. But what I have observed so far is that most workers are
created in io_wq_dec_running (why are no workers created in
io_wq_enqueue? I haven't figured that out yet), so it doesn't seem
necessary to add the check in io_wq_enqueue. Covering io_wq_dec_running
is necessary, though.

>
> I'll take a closer look at this later today, but figured I'd shoot some
> questions your way first...
>
> --
> Jens Axboe


* Re: [External] Re: [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying
  2025-05-22 12:01   ` [External] " Fengnan Chang
@ 2025-05-22 14:29     ` Jens Axboe
  2025-05-23  7:57       ` Fengnan Chang
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2025-05-22 14:29 UTC
  To: Fengnan Chang; +Cc: asml.silence, io-uring, Diangang Li

On 5/22/25 6:01 AM, Fengnan Chang wrote:
> Jens Axboe <axboe@kernel.dk> wrote on Thursday, May 22, 2025 at 19:35:
>>
>> On 5/22/25 3:09 AM, Fengnan Chang wrote:
>>> When running fio with buffer io and stable iops, I observed that
>>> part of io_worker threads keeps creating and destroying.
>>> Using this command can reproduce:
>>> fio --ioengine=io_uring --rw=randrw --bs=4k --direct=0 --size=100G
>>> --iodepth=256 --filename=/data03/fio-rand-read --name=test
>>> ps -L -p pid, you can see about 256 io_worker threads, and thread
>>> id keeps changing.
>>> And I do some debugging, most workers create happen in
>>> create_worker_cb. In create_worker_cb, if all workers have gone to
>>> sleep, and we have more work, we try to create new worker (let's
>>> call it worker B) to handle it.  And when new work comes,
>>> io_wq_enqueue will activate free worker (let's call it worker A) or
>>> create new one. It may cause worker A and B compete for one work.
>>> Since buffered write is hashed work, buffered write to a given file
>>> is serialized, only one worker gets the work in the end, the other
>>> worker goes to sleep. After repeating it many times, a lot of
>>> io_worker threads created, handles a few works or even no work to
>>> handle,and exit.
>>> There are several solutions:
>>> 1. Since all work is insert in io_wq_enqueue, io_wq_enqueue will
>>> create worker too, remove create worker action in create_worker_cb
>>> is fine, maybe affect performance?
>>> 2. When wq->hash->map bit is set, insert hashed work item, new work
>>> only put in wq->hash_tail, not link to work_list,
>>> io_worker_handle_work need to check hash_tail after a whole dependent
>>> link, io_acct_run_queue will return false when new work insert, no
>>> new thread will be created either in io_wqe_dec_running.
>>> 3. Check is there only one hash bucket in io_wqe_dec_running. If only
>>> one hash bucket, don't create worker, io_wq_enqueue will handle it.
>>
>> Nice catch on this! Does indeed look like a problem. Not a huge fan of
>> approach 3. Without having really looked into this yet, my initial idea
>> would've been to do some variant of solution 1 above. io_wq_enqueue()
>> checks if we need to create a worker, which basically boils down to "do
>> we have a free worker right now". If we do not, we create one. But the
>> question is really "do we need a new worker for this?", and if we're
>> inserting hashed worked and we have existing hashed work for the SAME
>> hash and it's busy, then the answer should be "no" as it'd be pointless
>> to create that worker.
> 
> Agree
> 
>>
>> Would it be feasible to augment the check in there such that
>> io_wq_enqueue() doesn't create a new worker for that case? And I guess a
>> followup question is, would that even be enough, do we always need to
>> cover the io_wq_dec_running() running case as well as
>> io_acct_run_queue() will return true as well since it doesn't know about
>> this either?
> Yes, it is feasible to avoid creating a worker by adding some checks in
> io_wq_enqueue. But what I have observed so far is most workers are
> created in io_wq_dec_running (why no worker create in io_wq_enqueue?
> I didn't figure it out now), it seems no need to check this
> in io_wq_enqueue.  And cover io_wq_dec_running is necessary.

The general concept for io-wq is that it's always assumed that a worker
won't block, and if it does AND more work is available, at that point a
new worker is created. io_wq_dec_running() is called by the scheduler
when a worker is scheduled out, eg blocking, and then an extra worker is
created at that point, if necessary.
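
For reference, the path there is roughly:

  schedule()                            /* voluntary schedule only */
    -> sched_submit_work(tsk)
         -> io_wq_worker_sleeping(tsk)  /* PF_IO_WORKER tasks */
              -> io_wq_dec_running(worker)

with io_wq_worker_running() undoing the accounting once the worker is
scheduled back in.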

I wonder if we can get away with something like the below? Basically two
things in there:

1) If a worker goes to sleep AND it doesn't have a current work
   assigned, just ignore it. Really a separate change, but seems to
   conceptually make sense - a new worker should only be created off
   that path, if it's currently handling a work item and goes to sleep.

2) If there is current work, defer if it's hashed and the next work item
   in that list is also hashed and of the same value.


diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
index d52069b1177b..cd1fcb115739 100644
--- a/io_uring/io-wq.c
+++ b/io_uring/io-wq.c
@@ -150,6 +150,16 @@ static bool io_acct_cancel_pending_work(struct io_wq *wq,
 static void create_worker_cb(struct callback_head *cb);
 static void io_wq_cancel_tw_create(struct io_wq *wq);
 
+static inline unsigned int __io_get_work_hash(unsigned int work_flags)
+{
+	return work_flags >> IO_WQ_HASH_SHIFT;
+}
+
+static inline unsigned int io_get_work_hash(struct io_wq_work *work)
+{
+	return __io_get_work_hash(atomic_read(&work->flags));
+}
+
 static bool io_worker_get(struct io_worker *worker)
 {
 	return refcount_inc_not_zero(&worker->ref);
@@ -409,6 +419,30 @@ static bool io_queue_worker_create(struct io_worker *worker,
 	return false;
 }
 
+/* Defer if current and next work are both hashed to the same chain */
+static bool io_wq_hash_defer(struct io_wq_work *work, struct io_wq_acct *acct)
+{
+	unsigned int hash, work_flags;
+	struct io_wq_work *next;
+
+	lockdep_assert_held(&acct->lock);
+
+	work_flags = atomic_read(&work->flags);
+	if (!__io_wq_is_hashed(work_flags))
+		return false;
+
+	/* should not happen, io_acct_run_queue() said we had work */
+	if (wq_list_empty(&acct->work_list))
+		return true;
+
+	hash = __io_get_work_hash(work_flags);
+	next = container_of(acct->work_list.first, struct io_wq_work, list);
+	work_flags = atomic_read(&next->flags);
+	if (!__io_wq_is_hashed(work_flags))
+		return false;
+	return hash == __io_get_work_hash(work_flags);
+}
+
 static void io_wq_dec_running(struct io_worker *worker)
 {
 	struct io_wq_acct *acct = io_wq_get_acct(worker);
@@ -419,8 +453,14 @@ static void io_wq_dec_running(struct io_worker *worker)
 
 	if (!atomic_dec_and_test(&acct->nr_running))
 		return;
+	if (!worker->cur_work)
+		return;
 	if (!io_acct_run_queue(acct))
 		return;
+	if (io_wq_hash_defer(worker->cur_work, acct)) {
+		raw_spin_unlock(&acct->lock);
+		return;
+	}
 
 	raw_spin_unlock(&acct->lock);
 	atomic_inc(&acct->nr_running);
@@ -454,16 +494,6 @@ static void __io_worker_idle(struct io_wq_acct *acct, struct io_worker *worker)
 	}
 }
 
-static inline unsigned int __io_get_work_hash(unsigned int work_flags)
-{
-	return work_flags >> IO_WQ_HASH_SHIFT;
-}
-
-static inline unsigned int io_get_work_hash(struct io_wq_work *work)
-{
-	return __io_get_work_hash(atomic_read(&work->flags));
-}
-
 static bool io_wait_on_hash(struct io_wq *wq, unsigned int hash)
 {
 	bool ret = false;

-- 
Jens Axboe


* Re: [External] Re: [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying
  2025-05-22 14:29     ` Jens Axboe
@ 2025-05-23  7:57       ` Fengnan Chang
  2025-05-23 15:20         ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Fengnan Chang @ 2025-05-23  7:57 UTC
  To: Jens Axboe; +Cc: asml.silence, io-uring, Diangang Li

Jens Axboe <axboe@kernel.dk> wrote on Thursday, May 22, 2025 at 22:29:
>
> On 5/22/25 6:01 AM, Fengnan Chang wrote:
> > Jens Axboe <axboe@kernel.dk> wrote on Thursday, May 22, 2025 at 19:35:
> >>
> >> On 5/22/25 3:09 AM, Fengnan Chang wrote:
> >>> When running fio with buffer io and stable iops, I observed that
> >>> part of io_worker threads keeps creating and destroying.
> >>> Using this command can reproduce:
> >>> fio --ioengine=io_uring --rw=randrw --bs=4k --direct=0 --size=100G
> >>> --iodepth=256 --filename=/data03/fio-rand-read --name=test
> >>> ps -L -p pid, you can see about 256 io_worker threads, and thread
> >>> id keeps changing.
> >>> And I do some debugging, most workers create happen in
> >>> create_worker_cb. In create_worker_cb, if all workers have gone to
> >>> sleep, and we have more work, we try to create new worker (let's
> >>> call it worker B) to handle it.  And when new work comes,
> >>> io_wq_enqueue will activate free worker (let's call it worker A) or
> >>> create new one. It may cause worker A and B compete for one work.
> >>> Since buffered write is hashed work, buffered write to a given file
> >>> is serialized, only one worker gets the work in the end, the other
> >>> worker goes to sleep. After repeating it many times, a lot of
> >>> io_worker threads created, handles a few works or even no work to
> >>> handle,and exit.
> >>> There are several solutions:
> >>> 1. Since all work is insert in io_wq_enqueue, io_wq_enqueue will
> >>> create worker too, remove create worker action in create_worker_cb
> >>> is fine, maybe affect performance?
> >>> 2. When wq->hash->map bit is set, insert hashed work item, new work
> >>> only put in wq->hash_tail, not link to work_list,
> >>> io_worker_handle_work need to check hash_tail after a whole dependent
> >>> link, io_acct_run_queue will return false when new work insert, no
> >>> new thread will be created either in io_wqe_dec_running.
> >>> 3. Check is there only one hash bucket in io_wqe_dec_running. If only
> >>> one hash bucket, don't create worker, io_wq_enqueue will handle it.
> >>
> >> Nice catch on this! Does indeed look like a problem. Not a huge fan of
> >> approach 3. Without having really looked into this yet, my initial idea
> >> would've been to do some variant of solution 1 above. io_wq_enqueue()
> >> checks if we need to create a worker, which basically boils down to "do
> >> we have a free worker right now". If we do not, we create one. But the
> >> question is really "do we need a new worker for this?", and if we're
> >> inserting hashed worked and we have existing hashed work for the SAME
> >> hash and it's busy, then the answer should be "no" as it'd be pointless
> >> to create that worker.
> >
> > Agree
> >
> >>
> >> Would it be feasible to augment the check in there such that
> >> io_wq_enqueue() doesn't create a new worker for that case? And I guess a
> >> followup question is, would that even be enough, do we always need to
> >> cover the io_wq_dec_running() running case as well as
> >> io_acct_run_queue() will return true as well since it doesn't know about
> >> this either?
> > Yes, it is feasible to avoid creating a worker by adding some checks in
> > io_wq_enqueue. But what I have observed so far is most workers are
> > created in io_wq_dec_running (why no worker create in io_wq_enqueue?
> > I didn't figure it out now), it seems no need to check this
> > in io_wq_enqueue.  And cover io_wq_dec_running is necessary.
>
> The general concept for io-wq is that it's always assumed that a worker
> won't block, and if it does AND more work is available, at that point a
> new worker is created. io_wq_dec_running() is called by the scheduler
> when a worker is scheduled out, eg blocking, and then an extra worker is
> created at that point, if necessary.
>
> I wonder if we can get away with something like the below? Basically two
> things in there:
>
> 1) If a worker goes to sleep AND it doesn't have a current work
>    assigned, just ignore it. Really a separate change, but seems to
>    conceptually make sense - a new worker should only be created off
>    that path, if it's currenly handling a work item and goes to sleep.
>
> 2) If there is current work, defer if it's hashed and the next work item
>    in that list is also hashed and of the same value.
I like this change; it makes the logic clearer. This patch looks good,
I'll do more tests next week.

>
>
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index d52069b1177b..cd1fcb115739 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -150,6 +150,16 @@ static bool io_acct_cancel_pending_work(struct io_wq *wq,
>  static void create_worker_cb(struct callback_head *cb);
>  static void io_wq_cancel_tw_create(struct io_wq *wq);
>
> +static inline unsigned int __io_get_work_hash(unsigned int work_flags)
> +{
> +       return work_flags >> IO_WQ_HASH_SHIFT;
> +}
> +
> +static inline unsigned int io_get_work_hash(struct io_wq_work *work)
> +{
> +       return __io_get_work_hash(atomic_read(&work->flags));
> +}
> +
>  static bool io_worker_get(struct io_worker *worker)
>  {
>         return refcount_inc_not_zero(&worker->ref);
> @@ -409,6 +419,30 @@ static bool io_queue_worker_create(struct io_worker *worker,
>         return false;
>  }
>
> +/* Defer if current and next work are both hashed to the same chain */
> +static bool io_wq_hash_defer(struct io_wq_work *work, struct io_wq_acct *acct)
> +{
> +       unsigned int hash, work_flags;
> +       struct io_wq_work *next;
> +
> +       lockdep_assert_held(&acct->lock);
> +
> +       work_flags = atomic_read(&work->flags);
> +       if (!__io_wq_is_hashed(work_flags))
> +               return false;
> +
> +       /* should not happen, io_acct_run_queue() said we had work */
> +       if (wq_list_empty(&acct->work_list))
> +               return true;
> +
> +       hash = __io_get_work_hash(work_flags);
> +       next = container_of(acct->work_list.first, struct io_wq_work, list);
> +       work_flags = atomic_read(&next->flags);
> +       if (!__io_wq_is_hashed(work_flags))
> +               return false;
> +       return hash == __io_get_work_hash(work_flags);
> +}
> +
>  static void io_wq_dec_running(struct io_worker *worker)
>  {
>         struct io_wq_acct *acct = io_wq_get_acct(worker);
> @@ -419,8 +453,14 @@ static void io_wq_dec_running(struct io_worker *worker)
>
>         if (!atomic_dec_and_test(&acct->nr_running))
>                 return;
> +       if (!worker->cur_work)
> +               return;
>         if (!io_acct_run_queue(acct))
>                 return;
> +       if (io_wq_hash_defer(worker->cur_work, acct)) {
> +               raw_spin_unlock(&acct->lock);
> +               return;
> +       }
>
>         raw_spin_unlock(&acct->lock);
>         atomic_inc(&acct->nr_running);
> @@ -454,16 +494,6 @@ static void __io_worker_idle(struct io_wq_acct *acct, struct io_worker *worker)
>         }
>  }
>
> -static inline unsigned int __io_get_work_hash(unsigned int work_flags)
> -{
> -       return work_flags >> IO_WQ_HASH_SHIFT;
> -}
> -
> -static inline unsigned int io_get_work_hash(struct io_wq_work *work)
> -{
> -       return __io_get_work_hash(atomic_read(&work->flags));
> -}
> -
>  static bool io_wait_on_hash(struct io_wq *wq, unsigned int hash)
>  {
>         bool ret = false;
>
> --
> Jens Axboe


* Re: [External] Re: [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying
  2025-05-23  7:57       ` Fengnan Chang
@ 2025-05-23 15:20         ` Jens Axboe
  2025-05-26 11:14           ` Fengnan Chang
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2025-05-23 15:20 UTC
  To: Fengnan Chang; +Cc: asml.silence, io-uring, Diangang Li

On 5/23/25 1:57 AM, Fengnan Chang wrote:
> Jens Axboe <axboe@kernel.dk> wrote on Thursday, May 22, 2025 at 22:29:
>>
>> On 5/22/25 6:01 AM, Fengnan Chang wrote:
>>> Jens Axboe <axboe@kernel.dk> wrote on Thursday, May 22, 2025 at 19:35:
>>>>
>>>> On 5/22/25 3:09 AM, Fengnan Chang wrote:
>>>>> When running fio with buffer io and stable iops, I observed that
>>>>> part of io_worker threads keeps creating and destroying.
>>>>> Using this command can reproduce:
>>>>> fio --ioengine=io_uring --rw=randrw --bs=4k --direct=0 --size=100G
>>>>> --iodepth=256 --filename=/data03/fio-rand-read --name=test
>>>>> ps -L -p pid, you can see about 256 io_worker threads, and thread
>>>>> id keeps changing.
>>>>> And I do some debugging, most workers create happen in
>>>>> create_worker_cb. In create_worker_cb, if all workers have gone to
>>>>> sleep, and we have more work, we try to create new worker (let's
>>>>> call it worker B) to handle it.  And when new work comes,
>>>>> io_wq_enqueue will activate free worker (let's call it worker A) or
>>>>> create new one. It may cause worker A and B compete for one work.
>>>>> Since buffered write is hashed work, buffered write to a given file
>>>>> is serialized, only one worker gets the work in the end, the other
>>>>> worker goes to sleep. After repeating it many times, a lot of
>>>>> io_worker threads created, handles a few works or even no work to
>>>>> handle,and exit.
>>>>> There are several solutions:
>>>>> 1. Since all work is insert in io_wq_enqueue, io_wq_enqueue will
>>>>> create worker too, remove create worker action in create_worker_cb
>>>>> is fine, maybe affect performance?
>>>>> 2. When wq->hash->map bit is set, insert hashed work item, new work
>>>>> only put in wq->hash_tail, not link to work_list,
>>>>> io_worker_handle_work need to check hash_tail after a whole dependent
>>>>> link, io_acct_run_queue will return false when new work insert, no
>>>>> new thread will be created either in io_wqe_dec_running.
>>>>> 3. Check is there only one hash bucket in io_wqe_dec_running. If only
>>>>> one hash bucket, don't create worker, io_wq_enqueue will handle it.
>>>>
>>>> Nice catch on this! Does indeed look like a problem. Not a huge fan of
>>>> approach 3. Without having really looked into this yet, my initial idea
>>>> would've been to do some variant of solution 1 above. io_wq_enqueue()
>>>> checks if we need to create a worker, which basically boils down to "do
>>>> we have a free worker right now". If we do not, we create one. But the
>>>> question is really "do we need a new worker for this?", and if we're
>>>> inserting hashed worked and we have existing hashed work for the SAME
>>>> hash and it's busy, then the answer should be "no" as it'd be pointless
>>>> to create that worker.
>>>
>>> Agree
>>>
>>>>
>>>> Would it be feasible to augment the check in there such that
>>>> io_wq_enqueue() doesn't create a new worker for that case? And I guess a
>>>> followup question is, would that even be enough, do we always need to
>>>> cover the io_wq_dec_running() running case as well as
>>>> io_acct_run_queue() will return true as well since it doesn't know about
>>>> this either?
>>> Yes, it is feasible to avoid creating a worker by adding some checks in
>>> io_wq_enqueue. But what I have observed so far is most workers are
>>> created in io_wq_dec_running (why no worker create in io_wq_enqueue?
>>> I didn't figure it out now), it seems no need to check this
>>> in io_wq_enqueue.  And cover io_wq_dec_running is necessary.
>>
>> The general concept for io-wq is that it's always assumed that a worker
>> won't block, and if it does AND more work is available, at that point a
>> new worker is created. io_wq_dec_running() is called by the scheduler
>> when a worker is scheduled out, eg blocking, and then an extra worker is
>> created at that point, if necessary.
>>
>> I wonder if we can get away with something like the below? Basically two
>> things in there:
>>
>> 1) If a worker goes to sleep AND it doesn't have a current work
>>    assigned, just ignore it. Really a separate change, but seems to
>>    conceptually make sense - a new worker should only be created off
>>    that path, if it's currenly handling a work item and goes to sleep.
>>
>> 2) If there is current work, defer if it's hashed and the next work item
>>    in that list is also hashed and of the same value.
> I like this change, this makes the logic clearer. This patch looks good,
> I'll do more tests next week.

Thanks for taking a look - I've posted it as a 3 patch series, as 1+2
above are really two separate things that need sorting imho. I've queued
it up for the next kernel release, so please do test next week when you
have time.

-- 
Jens Axboe


* Re: [External] Re: [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying
  2025-05-23 15:20         ` Jens Axboe
@ 2025-05-26 11:14           ` Fengnan Chang
  2025-05-28 13:39             ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Fengnan Chang @ 2025-05-26 11:14 UTC
  To: Jens Axboe; +Cc: asml.silence, io-uring, Diangang Li

Jens Axboe <axboe@kernel.dk> wrote on Friday, May 23, 2025 at 23:20:
>
> On 5/23/25 1:57 AM, Fengnan Chang wrote:
> >> Jens Axboe <axboe@kernel.dk> wrote on Thursday, May 22, 2025 at 22:29:
> >>
> >> On 5/22/25 6:01 AM, Fengnan Chang wrote:
> >>> Jens Axboe <axboe@kernel.dk> wrote on Thursday, May 22, 2025 at 19:35:
> >>>>
> >>>> On 5/22/25 3:09 AM, Fengnan Chang wrote:
> >>>>> When running fio with buffer io and stable iops, I observed that
> >>>>> part of io_worker threads keeps creating and destroying.
> >>>>> Using this command can reproduce:
> >>>>> fio --ioengine=io_uring --rw=randrw --bs=4k --direct=0 --size=100G
> >>>>> --iodepth=256 --filename=/data03/fio-rand-read --name=test
> >>>>> ps -L -p pid, you can see about 256 io_worker threads, and thread
> >>>>> id keeps changing.
> >>>>> And I do some debugging, most workers create happen in
> >>>>> create_worker_cb. In create_worker_cb, if all workers have gone to
> >>>>> sleep, and we have more work, we try to create new worker (let's
> >>>>> call it worker B) to handle it.  And when new work comes,
> >>>>> io_wq_enqueue will activate free worker (let's call it worker A) or
> >>>>> create new one. It may cause worker A and B compete for one work.
> >>>>> Since buffered write is hashed work, buffered write to a given file
> >>>>> is serialized, only one worker gets the work in the end, the other
> >>>>> worker goes to sleep. After repeating it many times, a lot of
> >>>>> io_worker threads created, handles a few works or even no work to
> >>>>> handle,and exit.
> >>>>> There are several solutions:
> >>>>> 1. Since all work is insert in io_wq_enqueue, io_wq_enqueue will
> >>>>> create worker too, remove create worker action in create_worker_cb
> >>>>> is fine, maybe affect performance?
> >>>>> 2. When wq->hash->map bit is set, insert hashed work item, new work
> >>>>> only put in wq->hash_tail, not link to work_list,
> >>>>> io_worker_handle_work need to check hash_tail after a whole dependent
> >>>>> link, io_acct_run_queue will return false when new work insert, no
> >>>>> new thread will be created either in io_wqe_dec_running.
> >>>>> 3. Check is there only one hash bucket in io_wqe_dec_running. If only
> >>>>> one hash bucket, don't create worker, io_wq_enqueue will handle it.
> >>>>
> >>>> Nice catch on this! Does indeed look like a problem. Not a huge fan of
> >>>> approach 3. Without having really looked into this yet, my initial idea
> >>>> would've been to do some variant of solution 1 above. io_wq_enqueue()
> >>>> checks if we need to create a worker, which basically boils down to "do
> >>>> we have a free worker right now". If we do not, we create one. But the
> >>>> question is really "do we need a new worker for this?", and if we're
> >>>> inserting hashed worked and we have existing hashed work for the SAME
> >>>> hash and it's busy, then the answer should be "no" as it'd be pointless
> >>>> to create that worker.
> >>>
> >>> Agree
> >>>
> >>>>
> >>>> Would it be feasible to augment the check in there such that
> >>>> io_wq_enqueue() doesn't create a new worker for that case? And I guess a
> >>>> followup question is, would that even be enough, do we always need to
> >>>> cover the io_wq_dec_running() running case as well as
> >>>> io_acct_run_queue() will return true as well since it doesn't know about
> >>>> this either?
> >>> Yes, it is feasible to avoid creating a worker by adding some checks in
> >>> io_wq_enqueue. But what I have observed so far is most workers are
> >>> created in io_wq_dec_running (why no worker create in io_wq_enqueue?
> >>> I didn't figure it out now), it seems no need to check this
> >>> in io_wq_enqueue.  And cover io_wq_dec_running is necessary.
> >>
> >> The general concept for io-wq is that it's always assumed that a worker
> >> won't block, and if it does AND more work is available, at that point a
> >> new worker is created. io_wq_dec_running() is called by the scheduler
> >> when a worker is scheduled out, eg blocking, and then an extra worker is
> >> created at that point, if necessary.
> >>
> >> I wonder if we can get away with something like the below? Basically two
> >> things in there:
> >>
> >> 1) If a worker goes to sleep AND it doesn't have a current work
> >>    assigned, just ignore it. Really a separate change, but seems to
> >>    conceptually make sense - a new worker should only be created off
> >>    that path, if it's currenly handling a work item and goes to sleep.
> >>
> >> 2) If there is current work, defer if it's hashed and the next work item
> >>    in that list is also hashed and of the same value.
> > I like this change, this makes the logic clearer. This patch looks good,
> > I'll do more tests next week.
>
> Thanks for taking a look - I've posted it as a 3 patch series, as 1+2
> above are really two separate things that need sorting imho. I've queued
> it up for the next kernel release, so please do test next week when you
> have time.

I have completed the tests and the results are good.
But I still have a concern. When one uring queue is used to do buffered
writes to multiple files, there used to be multiple workers running; with
this change only one worker may end up running, which reduces concurrency
and may cause performance degradation.
I didn't see that in the actual tests, though, so my worry may be
unnecessary.

>
> --
> Jens Axboe


* Re: [External] Re: [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying
  2025-05-26 11:14           ` Fengnan Chang
@ 2025-05-28 13:39             ` Jens Axboe
  2025-05-28 13:56               ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2025-05-28 13:39 UTC
  To: Fengnan Chang; +Cc: asml.silence, io-uring, Diangang Li

On 5/26/25 5:14 AM, Fengnan Chang wrote:
> Jens Axboe <axboe@kernel.dk> wrote on Friday, May 23, 2025 at 23:20:
>>
>> On 5/23/25 1:57 AM, Fengnan Chang wrote:
>>> Jens Axboe <axboe@kernel.dk> wrote on Thursday, May 22, 2025 at 22:29:
>>>>
>>>> On 5/22/25 6:01 AM, Fengnan Chang wrote:
>>>>> Jens Axboe <axboe@kernel.dk> wrote on Thursday, May 22, 2025 at 19:35:
>>>>>>
>>>>>> On 5/22/25 3:09 AM, Fengnan Chang wrote:
>>>>>>> When running fio with buffer io and stable iops, I observed that
>>>>>>> part of io_worker threads keeps creating and destroying.
>>>>>>> Using this command can reproduce:
>>>>>>> fio --ioengine=io_uring --rw=randrw --bs=4k --direct=0 --size=100G
>>>>>>> --iodepth=256 --filename=/data03/fio-rand-read --name=test
>>>>>>> ps -L -p pid, you can see about 256 io_worker threads, and thread
>>>>>>> id keeps changing.
>>>>>>> And I do some debugging, most workers create happen in
>>>>>>> create_worker_cb. In create_worker_cb, if all workers have gone to
>>>>>>> sleep, and we have more work, we try to create new worker (let's
>>>>>>> call it worker B) to handle it.  And when new work comes,
>>>>>>> io_wq_enqueue will activate free worker (let's call it worker A) or
>>>>>>> create new one. It may cause worker A and B compete for one work.
>>>>>>> Since buffered write is hashed work, buffered write to a given file
>>>>>>> is serialized, only one worker gets the work in the end, the other
>>>>>>> worker goes to sleep. After repeating it many times, a lot of
>>>>>>> io_worker threads created, handles a few works or even no work to
>>>>>>> handle,and exit.
>>>>>>> There are several solutions:
>>>>>>> 1. Since all work is insert in io_wq_enqueue, io_wq_enqueue will
>>>>>>> create worker too, remove create worker action in create_worker_cb
>>>>>>> is fine, maybe affect performance?
>>>>>>> 2. When wq->hash->map bit is set, insert hashed work item, new work
>>>>>>> only put in wq->hash_tail, not link to work_list,
>>>>>>> io_worker_handle_work need to check hash_tail after a whole dependent
>>>>>>> link, io_acct_run_queue will return false when new work insert, no
>>>>>>> new thread will be created either in io_wqe_dec_running.
>>>>>>> 3. Check is there only one hash bucket in io_wqe_dec_running. If only
>>>>>>> one hash bucket, don't create worker, io_wq_enqueue will handle it.
>>>>>>
>>>>>> Nice catch on this! Does indeed look like a problem. Not a huge fan of
>>>>>> approach 3. Without having really looked into this yet, my initial idea
>>>>>> would've been to do some variant of solution 1 above. io_wq_enqueue()
>>>>>> checks if we need to create a worker, which basically boils down to "do
>>>>>> we have a free worker right now". If we do not, we create one. But the
>>>>>> question is really "do we need a new worker for this?", and if we're
>>>>>> inserting hashed worked and we have existing hashed work for the SAME
>>>>>> hash and it's busy, then the answer should be "no" as it'd be pointless
>>>>>> to create that worker.
>>>>>
>>>>> Agree
>>>>>
>>>>>>
>>>>>> Would it be feasible to augment the check in there such that
>>>>>> io_wq_enqueue() doesn't create a new worker for that case? And I guess a
>>>>>> followup question is, would that even be enough, do we always need to
>>>>>> cover the io_wq_dec_running() running case as well as
>>>>>> io_acct_run_queue() will return true as well since it doesn't know about
>>>>>> this either?
>>>>> Yes, it is feasible to avoid creating a worker by adding some checks in
>>>>> io_wq_enqueue. But what I have observed so far is most workers are
>>>>> created in io_wq_dec_running (why no worker create in io_wq_enqueue?
>>>>> I didn't figure it out now), it seems no need to check this
>>>>> in io_wq_enqueue.  And cover io_wq_dec_running is necessary.
>>>>
>>>> The general concept for io-wq is that it's always assumed that a worker
>>>> won't block, and if it does AND more work is available, at that point a
>>>> new worker is created. io_wq_dec_running() is called by the scheduler
>>>> when a worker is scheduled out, eg blocking, and then an extra worker is
>>>> created at that point, if necessary.
>>>>
>>>> I wonder if we can get away with something like the below? Basically two
>>>> things in there:
>>>>
>>>> 1) If a worker goes to sleep AND it doesn't have a current work
>>>>    assigned, just ignore it. Really a separate change, but seems to
>>>>    conceptually make sense - a new worker should only be created off
>>>>    that path, if it's currenly handling a work item and goes to sleep.
>>>>
>>>> 2) If there is current work, defer if it's hashed and the next work item
>>>>    in that list is also hashed and of the same value.
>>> I like this change, this makes the logic clearer. This patch looks good,
>>> I'll do more tests next week.
>>
>> Thanks for taking a look - I've posted it as a 3 patch series, as 1+2
>> above are really two separate things that need sorting imho. I've queued
>> it up for the next kernel release, so please do test next week when you
>> have time.
> 
> I have completed the test and the results are good.

Thanks for re-testing!

> But I still have a concern. When using one uring queue to buffer write
> multiple files, previously there were multiple workers working, this
> change will make only one worker working, which will reduce some
> concurrency and may cause performance degradation. But I didn't find
> it in the actual test, so my worry may be unnecessary.

Could be one of two things:

1) None of the workers _actually_ end up blocking, in which case it's
   working as-designed.

2) We're now missing cases where we should indeed create a worker. This
   is a bug.

I'll run some specific testing for io-wq here to see if it's 1 or 2.

-- 
Jens Axboe


* Re: [External] Re: [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying
  2025-05-28 13:39             ` Jens Axboe
@ 2025-05-28 13:56               ` Jens Axboe
  0 siblings, 0 replies; 9+ messages in thread
From: Jens Axboe @ 2025-05-28 13:56 UTC
  To: Fengnan Chang; +Cc: asml.silence, io-uring, Diangang Li

On 5/28/25 7:39 AM, Jens Axboe wrote:
> On 5/26/25 5:14 AM, Fengnan Chang wrote:
>> Jens Axboe <axboe@kernel.dk> wrote on Friday, May 23, 2025 at 23:20:
>>>
>>> On 5/23/25 1:57 AM, Fengnan Chang wrote:
>>>> Jens Axboe <axboe@kernel.dk> wrote on Thursday, May 22, 2025 at 22:29:
>>>>>
>>>>> On 5/22/25 6:01 AM, Fengnan Chang wrote:
>>>>>> Jens Axboe <axboe@kernel.dk> wrote on Thursday, May 22, 2025 at 19:35:
>>>>>>>
>>>>>>> On 5/22/25 3:09 AM, Fengnan Chang wrote:
>>>>>>>> When running fio with buffer io and stable iops, I observed that
>>>>>>>> part of io_worker threads keeps creating and destroying.
>>>>>>>> Using this command can reproduce:
>>>>>>>> fio --ioengine=io_uring --rw=randrw --bs=4k --direct=0 --size=100G
>>>>>>>> --iodepth=256 --filename=/data03/fio-rand-read --name=test
>>>>>>>> ps -L -p pid, you can see about 256 io_worker threads, and thread
>>>>>>>> id keeps changing.
>>>>>>>> And I do some debugging, most workers create happen in
>>>>>>>> create_worker_cb. In create_worker_cb, if all workers have gone to
>>>>>>>> sleep, and we have more work, we try to create new worker (let's
>>>>>>>> call it worker B) to handle it.  And when new work comes,
>>>>>>>> io_wq_enqueue will activate free worker (let's call it worker A) or
>>>>>>>> create new one. It may cause worker A and B compete for one work.
>>>>>>>> Since buffered write is hashed work, buffered write to a given file
>>>>>>>> is serialized, only one worker gets the work in the end, the other
>>>>>>>> worker goes to sleep. After repeating it many times, a lot of
>>>>>>>> io_worker threads created, handles a few works or even no work to
>>>>>>>> handle,and exit.
>>>>>>>> There are several solutions:
>>>>>>>> 1. Since all work is insert in io_wq_enqueue, io_wq_enqueue will
>>>>>>>> create worker too, remove create worker action in create_worker_cb
>>>>>>>> is fine, maybe affect performance?
>>>>>>>> 2. When wq->hash->map bit is set, insert hashed work item, new work
>>>>>>>> only put in wq->hash_tail, not link to work_list,
>>>>>>>> io_worker_handle_work need to check hash_tail after a whole dependent
>>>>>>>> link, io_acct_run_queue will return false when new work insert, no
>>>>>>>> new thread will be created either in io_wqe_dec_running.
>>>>>>>> 3. Check is there only one hash bucket in io_wqe_dec_running. If only
>>>>>>>> one hash bucket, don't create worker, io_wq_enqueue will handle it.
>>>>>>>
>>>>>>> Nice catch on this! Does indeed look like a problem. Not a huge fan of
>>>>>>> approach 3. Without having really looked into this yet, my initial idea
>>>>>>> would've been to do some variant of solution 1 above. io_wq_enqueue()
>>>>>>> checks if we need to create a worker, which basically boils down to "do
>>>>>>> we have a free worker right now". If we do not, we create one. But the
>>>>>>> question is really "do we need a new worker for this?", and if we're
>>>>>>> inserting hashed worked and we have existing hashed work for the SAME
>>>>>>> hash and it's busy, then the answer should be "no" as it'd be pointless
>>>>>>> to create that worker.
>>>>>>
>>>>>> Agree
>>>>>>
>>>>>>>
>>>>>>> Would it be feasible to augment the check in there such that
>>>>>>> io_wq_enqueue() doesn't create a new worker for that case? And I guess a
>>>>>>> followup question is, would that even be enough, do we always need to
>>>>>>> cover the io_wq_dec_running() running case as well as
>>>>>>> io_acct_run_queue() will return true as well since it doesn't know about
>>>>>>> this either?
>>>>>> Yes, it is feasible to avoid creating a worker by adding some checks in
>>>>>> io_wq_enqueue. But what I have observed so far is most workers are
>>>>>> created in io_wq_dec_running (why no worker create in io_wq_enqueue?
>>>>>> I didn't figure it out now), it seems no need to check this
>>>>>> in io_wq_enqueue.  And cover io_wq_dec_running is necessary.
>>>>>
>>>>> The general concept for io-wq is that it's always assumed that a worker
>>>>> won't block, and if it does AND more work is available, at that point a
>>>>> new worker is created. io_wq_dec_running() is called by the scheduler
>>>>> when a worker is scheduled out, eg blocking, and then an extra worker is
>>>>> created at that point, if necessary.
>>>>>
>>>>> I wonder if we can get away with something like the below? Basically two
>>>>> things in there:
>>>>>
>>>>> 1) If a worker goes to sleep AND it doesn't have a current work
>>>>>    assigned, just ignore it. Really a separate change, but seems to
>>>>>    conceptually make sense - a new worker should only be created off
>>>>>    that path, if it's currenly handling a work item and goes to sleep.
>>>>>
>>>>> 2) If there is current work, defer if it's hashed and the next work item
>>>>>    in that list is also hashed and of the same value.
>>>> I like this change, this makes the logic clearer. This patch looks good,
>>>> I'll do more tests next week.
>>>
>>> Thanks for taking a look - I've posted it as a 3 patch series, as 1+2
>>> above are really two separate things that need sorting imho. I've queued
>>> it up for the next kernel release, so please do test next week when you
>>> have time.
>>
>> I have completed the test and the results are good.
> 
> Thanks for re-testing!
> 
>> But I still have a concern. When using one uring queue to buffer write
>> multiple files, previously there were multiple workers working, this
>> change will make only one worker working, which will reduce some
>> concurrency and may cause performance degradation. But I didn't find
>> it in the actual test, so my worry may be unnecessary.
> 
> Could be one of two things:
> 
> 1) None of the workers _actually_ end up blocking, in which case it's
>    working as-designed.
> 
> 2) We're now missing cases where we should indeed create a worker. This
>    is a bug.
> 
> I'll run some specific testing for io-wq here to see if it's 1 or 2.

Ran a very basic test case, which is using btrfs and O_DIRECT reads.
btrfs has this "bug" where anything dio bigger than 4k will always
return -EAGAIN if RWF_NOWAIT is set, which makes it a good use case to
test this as any directly issued IO will return -EAGAIN, and you know
that a dio read will always end up blocking if issued sync.

Doing:

fio --name=foo --filename=/data/foo --bs=8k --direct=1 \
--ioengine=io_uring --iodepth=8

this should yield 8 iou-wrk workers running, which it does seem to do.
If I bump that to iodepth=32, then I should see about 32 workers on
average, which also looks to be true.

Hence it doesn't immediately look like we're not creating workers when
we should, and we're not over-creating either.

This is not to say that the current code is perfect. If you have a case
you think is wrong, then please do send it along. But I suspect that
your writing case ends up largely not blocking, which is where io-wq
would otherwise create a new worker if one is needed. There's no
checking in io-wq for "this worker is not blocking but is running for a
long time" that will create a new worker. There probably could/should
be, but that's not how it's been done so far. New worker creation is
strictly driven by an existing worker going to sleep AND there being
other work items to start.

Note: this is _blocking_, not being scheduled out. If an io-wq worker is
being preempted right now, then that will not count as an event that
warrants creating a new worker. We could potentially have it called for
preemption too, and let that be the driver of "long running work". That
might be an improvement, at least for certain workloads.
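
To illustrate the distinction, the flow is roughly this (heavily
simplified, not the exact scheduler code):

	/* kernel/sched/core.c, roughly */
	void schedule(void)
	{
		sched_submit_work(current);	/* -> io_wq_worker_sleeping() */
		__schedule(...);
		sched_update_worker(current);	/* -> io_wq_worker_running() */
	}

	void preempt_schedule(void)
	{
		/* no sched_submit_work() here, io-wq never sees the event */
		__schedule(...);
	}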

-- 
Jens Axboe

