From: Jens Axboe <axboe@kernel.dk>
To: Fengnan Chang <changfengnan@bytedance.com>
Cc: asml.silence@gmail.com, io-uring@vger.kernel.org,
Diangang Li <lidiangang@bytedance.com>
Subject: Re: [External] Re: [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying
Date: Wed, 28 May 2025 07:39:36 -0600
Message-ID: <84ac5b93-7f1c-4092-80a8-9f9813cdbc1b@kernel.dk>
In-Reply-To: <CAPFOzZvajLPeCk7OOWoww8XdtA3mSkT+hkuMomBt=5pqMZ29SQ@mail.gmail.com>
On 5/26/25 5:14 AM, Fengnan Chang wrote:
> On Fri, May 23, 2025 at 11:20 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 5/23/25 1:57 AM, Fengnan Chang wrote:
>>> On Thu, May 22, 2025 at 10:29 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 5/22/25 6:01 AM, Fengnan Chang wrote:
>>>>> On Thu, May 22, 2025 at 7:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 5/22/25 3:09 AM, Fengnan Chang wrote:
>>>>>>> When running fio with buffered IO at steady IOPS, I observed that
>>>>>>> some io_worker threads keep getting created and destroyed.
>>>>>>> This can be reproduced with:
>>>>>>> fio --ioengine=io_uring --rw=randrw --bs=4k --direct=0 --size=100G
>>>>>>> --iodepth=256 --filename=/data03/fio-rand-read --name=test
>>>>>>> With ps -L -p <pid> you can see about 256 io_worker threads, and
>>>>>>> the thread ids keep changing.
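For reference, a stand-alone liburing sketch along the same lines, illustrative only and untested; the path, sizes, and loop structure are made up rather than taken from the report above:

/*
 * Hypothetical reproducer sketch: buffered 4k random writes at high
 * queue depth, roughly matching the fio job above. Build with
 * -luring and watch workers with "ps -L -p <pid>"; runs until killed.
 */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define QD	256
#define BS	4096

int main(void)
{
	struct io_uring ring;
	static char buf[BS];
	int fd, i;

	fd = open("/tmp/io-wq-repro", O_RDWR | O_CREAT, 0644);
	if (fd < 0 || io_uring_queue_init(QD, &ring, 0) < 0) {
		perror("setup");
		return 1;
	}
	memset(buf, 0xaa, sizeof(buf));

	for (;;) {
		struct io_uring_cqe *cqe;

		for (i = 0; i < QD; i++) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

			if (!sqe)
				break;
			/* buffered (no O_DIRECT) 4k write at a random 1G offset */
			io_uring_prep_write(sqe, fd, buf, BS,
					    (off_t)(rand() % (1 << 18)) * BS);
		}
		io_uring_submit(&ring);
		while (i--) {
			if (io_uring_wait_cqe(&ring, &cqe))
				break;
			io_uring_cqe_seen(&ring, cqe);
		}
	}
	return 0;
}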
>>>>>>> I did some debugging; most worker creation happens in
>>>>>>> create_worker_cb. In create_worker_cb, if all workers have gone to
>>>>>>> sleep and we have more work, we try to create a new worker (call
>>>>>>> it worker B) to handle it. And when new work comes in,
>>>>>>> io_wq_enqueue will activate a free worker (call it worker A) or
>>>>>>> create a new one. This can make workers A and B compete for one
>>>>>>> work item. Since buffered writes are hashed work, and buffered
>>>>>>> writes to a given file are serialized, only one worker gets the
>>>>>>> work in the end and the other goes back to sleep. After this
>>>>>>> repeats many times, a lot of io_worker threads have been created,
>>>>>>> each handling a few work items or even none before exiting.
>>>>>>> There are several possible solutions:
>>>>>>> 1. Since all work is inserted via io_wq_enqueue, and io_wq_enqueue
>>>>>>> can create workers too, removing the worker-creation action in
>>>>>>> create_worker_cb would be fine, but it might affect performance?
>>>>>>> 2. When the wq->hash->map bit is set and a hashed work item is
>>>>>>> inserted, put the new work only on wq->hash_tail instead of
>>>>>>> linking it into work_list; io_worker_handle_work would need to
>>>>>>> check hash_tail after finishing a whole dependent chain.
>>>>>>> io_acct_run_queue would then return false when new work is
>>>>>>> inserted, so no new thread would be created in io_wqe_dec_running
>>>>>>> either.
>>>>>>> 3. Check whether there is only one hash bucket in
>>>>>>> io_wqe_dec_running. If there is only one, don't create a worker;
>>>>>>> io_wq_enqueue will handle it.
>>>>>>
>>>>>> Nice catch on this! Does indeed look like a problem. Not a huge fan of
>>>>>> approach 3. Without having really looked into this yet, my initial idea
>>>>>> would've been to do some variant of solution 1 above. io_wq_enqueue()
>>>>>> checks if we need to create a worker, which basically boils down to "do
>>>>>> we have a free worker right now". If we do not, we create one. But the
>>>>>> question is really "do we need a new worker for this?", and if we're
>>>>>> inserting hashed work and we have existing hashed work for the SAME
>>>>>> hash and it's busy, then the answer should be "no" as it'd be pointless
>>>>>> to create that worker.
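As a rough sketch of that check; the helper name is made up and the fields are approximate, so this is not the actual io-wq code:

/*
 * Sketch: skip creating a worker when the incoming work hashes to a
 * bucket that is already being processed, since it will serialize
 * behind the in-flight work anyway.
 */
static bool io_wq_need_new_worker(struct io_wq *wq, struct io_wq_work *work)
{
	if (!io_wq_is_hashed(work))
		return true;	/* unhashed work can always use another worker */
	if (test_bit(io_get_work_hash(work), &wq->hash->map))
		return false;	/* same-hash work in flight, this must wait */
	return true;
}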
>>>>>
>>>>> Agree
>>>>>
>>>>>>
>>>>>> Would it be feasible to augment the check in there so that
>>>>>> io_wq_enqueue() doesn't create a new worker for that case? And I
>>>>>> guess a followup question is, would that even be enough? Do we
>>>>>> also need to cover the io_wq_dec_running() case, since
>>>>>> io_acct_run_queue() will return true as well, as it doesn't know
>>>>>> about this either?
>>>>> Yes, it is feasible to avoid creating a worker by adding some
>>>>> checks in io_wq_enqueue. But what I have observed so far is that
>>>>> most workers are created in io_wq_dec_running (why are no workers
>>>>> created in io_wq_enqueue? I haven't figured that out yet), so it
>>>>> seems unnecessary to check this in io_wq_enqueue, while covering
>>>>> io_wq_dec_running is necessary.
>>>>
>>>> The general concept for io-wq is that it's always assumed that a worker
>>>> won't block, and if it does AND more work is available, at that point a
>>>> new worker is created. io_wq_dec_running() is called by the scheduler
>>>> when a worker is scheduled out, eg blocking, and then an extra worker is
>>>> created at that point, if necessary.
>>>>
>>>> I wonder if we can get away with something like the below? Basically two
>>>> things in there:
>>>>
>>>> 1) If a worker goes to sleep AND it doesn't have a current work
>>>> assigned, just ignore it. Really a separate change, but seems to
>>>> conceptually make sense - a new worker should only be created off
>>>> that path, if it's currently handling a work item and goes to sleep.
>>>>
>>>> 2) If there is current work, defer if it's hashed and the next work item
>>>> in that list is also hashed and of the same value.
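In rough pseudo-C, with approximate names and not the patch as posted, the idea is:

/*
 * Sketch of the two rules above, as evaluated in the sched-out path.
 */
static bool io_wq_worker_needs_backup(struct io_worker *worker)
{
	struct io_wq_work *cur = worker->cur_work;
	struct io_wq_work *next = worker->next_work;

	/* 1) sleeping with no work assigned: nothing is blocked on us */
	if (!cur)
		return false;
	/* 2) next work hashes to the same bucket: it must wait for us anyway */
	if (next && io_wq_is_hashed(cur) && io_wq_is_hashed(next) &&
	    io_get_work_hash(cur) == io_get_work_hash(next))
		return false;
	/* otherwise a new worker could make progress on the queued work */
	return true;
}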
>>> I like this change; it makes the logic clearer. This patch looks
>>> good, I'll do more tests next week.
>>
>> Thanks for taking a look - I've posted it as a 3 patch series, as 1+2
>> above are really two separate things that need sorting imho. I've queued
>> it up for the next kernel release, so please do test next week when you
>> have time.
>
> I have completed the test and the results are good.
Thanks for re-testing!
> But I still have a concern. When using one uring queue to do buffered
> writes to multiple files, there previously were multiple workers
> running; this change leaves only one worker running, which reduces
> concurrency and may cause performance degradation. But I didn't see
> that in actual testing, so my worry may be unnecessary.
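For what it's worth, hashed work is keyed per target file; paraphrasing the async prep side (approximate, not verbatim):

	/* regular-file I/O is hashed on the target inode */
	if (req->flags & REQ_F_ISREG)
		io_wq_hash_work(&req->work, file_inode(req->file));

so buffered writes to different files land in different buckets and can still run on separate workers.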
Could be one of two things:
1) None of the workers _actually_ end up blocking, in which case it's
working as-designed.
2) We're now missing cases where we should indeed create a worker. This
is a bug.
I'll run some specific testing for io-wq here to see if it's 1 or 2.
--
Jens Axboe
Thread overview: 9+ messages
2025-05-22 9:09 [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying Fengnan Chang
2025-05-22 11:34 ` Jens Axboe
2025-05-22 12:01 ` [External] " Fengnan Chang
2025-05-22 14:29 ` Jens Axboe
2025-05-23 7:57 ` Fengnan Chang
2025-05-23 15:20 ` Jens Axboe
2025-05-26 11:14 ` Fengnan Chang
2025-05-28 13:39 ` Jens Axboe [this message]
2025-05-28 13:56 ` Jens Axboe