Re: [External] Re: [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying

public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed

From: Fengnan Chang <changfengnan@bytedance.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: asml.silence@gmail.com, io-uring@vger.kernel.org,
	 Diangang Li <lidiangang@bytedance.com>
Subject: Re: [External] Re: [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying
Date: Fri, 23 May 2025 15:57:34 +0800	[thread overview]
Message-ID: <CAPFOzZtxXOQvC0wcNLaj-hZUOf2PWqon0uEvbQh7if7a7DdX=g@mail.gmail.com> (raw)
In-Reply-To: <356c5068-bd97-419a-884c-bcdb04ad6820@kernel.dk>

Jens Axboe <axboe@kernel.dk> 于2025年5月22日周四 22:29写道：
>
> On 5/22/25 6:01 AM, Fengnan Chang wrote:
> > Jens Axboe <axboe@kernel.dk> ?2025?5?22??? 19:35???
> >>
> >> On 5/22/25 3:09 AM, Fengnan Chang wrote:
> >>> When running fio with buffer io and stable iops, I observed that
> >>> part of io_worker threads keeps creating and destroying.
> >>> Using this command can reproduce:
> >>> fio --ioengine=io_uring --rw=randrw --bs=4k --direct=0 --size=100G
> >>> --iodepth=256 --filename=/data03/fio-rand-read --name=test
> >>> ps -L -p pid, you can see about 256 io_worker threads, and thread
> >>> id keeps changing.
> >>> And I do some debugging, most workers create happen in
> >>> create_worker_cb. In create_worker_cb, if all workers have gone to
> >>> sleep, and we have more work, we try to create new worker (let's
> >>> call it worker B) to handle it.  And when new work comes,
> >>> io_wq_enqueue will activate free worker (let's call it worker A) or
> >>> create new one. It may cause worker A and B compete for one work.
> >>> Since buffered write is hashed work, buffered write to a given file
> >>> is serialized, only one worker gets the work in the end, the other
> >>> worker goes to sleep. After repeating it many times, a lot of
> >>> io_worker threads created, handles a few works or even no work to
> >>> handle,and exit.
> >>> There are several solutions:
> >>> 1. Since all work is insert in io_wq_enqueue, io_wq_enqueue will
> >>> create worker too, remove create worker action in create_worker_cb
> >>> is fine, maybe affect performance?
> >>> 2. When wq->hash->map bit is set, insert hashed work item, new work
> >>> only put in wq->hash_tail, not link to work_list,
> >>> io_worker_handle_work need to check hash_tail after a whole dependent
> >>> link, io_acct_run_queue will return false when new work insert, no
> >>> new thread will be created either in io_wqe_dec_running.
> >>> 3. Check is there only one hash bucket in io_wqe_dec_running. If only
> >>> one hash bucket, don't create worker, io_wq_enqueue will handle it.
> >>
> >> Nice catch on this! Does indeed look like a problem. Not a huge fan of
> >> approach 3. Without having really looked into this yet, my initial idea
> >> would've been to do some variant of solution 1 above. io_wq_enqueue()
> >> checks if we need to create a worker, which basically boils down to "do
> >> we have a free worker right now". If we do not, we create one. But the
> >> question is really "do we need a new worker for this?", and if we're
> >> inserting hashed worked and we have existing hashed work for the SAME
> >> hash and it's busy, then the answer should be "no" as it'd be pointless
> >> to create that worker.
> >
> > Agree
> >
> >>
> >> Would it be feasible to augment the check in there such that
> >> io_wq_enqueue() doesn't create a new worker for that case? And I guess a
> >> followup question is, would that even be enough, do we always need to
> >> cover the io_wq_dec_running() running case as well as
> >> io_acct_run_queue() will return true as well since it doesn't know about
> >> this either?
> > Yes?It is feasible to avoid creating a worker by adding some checks in
> > io_wq_enqueue. But what I have observed so far is most workers are
> > created in io_wq_dec_running (why no worker create in io_wq_enqueue?
> > I didn't figure it out now), it seems no need to check this
> > in io_wq_enqueue.  And cover io_wq_dec_running is necessary.
>
> The general concept for io-wq is that it's always assumed that a worker
> won't block, and if it does AND more work is available, at that point a
> new worker is created. io_wq_dec_running() is called by the scheduler
> when a worker is scheduled out, eg blocking, and then an extra worker is
> created at that point, if necessary.
>
> I wonder if we can get away with something like the below? Basically two
> things in there:
>
> 1) If a worker goes to sleep AND it doesn't have a current work
>    assigned, just ignore it. Really a separate change, but seems to
>    conceptually make sense - a new worker should only be created off
>    that path, if it's currenly handling a work item and goes to sleep.
>
> 2) If there is current work, defer if it's hashed and the next work item
>    in that list is also hashed and of the same value.
I like this change, this makes the logic clearer. This patch looks good,
I'll do more tests next week.

>
>
> diff --git a/io_uring/io-wq.c b/io_uring/io-wq.c
> index d52069b1177b..cd1fcb115739 100644
> --- a/io_uring/io-wq.c
> +++ b/io_uring/io-wq.c
> @@ -150,6 +150,16 @@ static bool io_acct_cancel_pending_work(struct io_wq *wq,
>  static void create_worker_cb(struct callback_head *cb);
>  static void io_wq_cancel_tw_create(struct io_wq *wq);
>
> +static inline unsigned int __io_get_work_hash(unsigned int work_flags)
> +{
> +       return work_flags >> IO_WQ_HASH_SHIFT;
> +}
> +
> +static inline unsigned int io_get_work_hash(struct io_wq_work *work)
> +{
> +       return __io_get_work_hash(atomic_read(&work->flags));
> +}
> +
>  static bool io_worker_get(struct io_worker *worker)
>  {
>         return refcount_inc_not_zero(&worker->ref);
> @@ -409,6 +419,30 @@ static bool io_queue_worker_create(struct io_worker *worker,
>         return false;
>  }
>
> +/* Defer if current and next work are both hashed to the same chain */
> +static bool io_wq_hash_defer(struct io_wq_work *work, struct io_wq_acct *acct)
> +{
> +       unsigned int hash, work_flags;
> +       struct io_wq_work *next;
> +
> +       lockdep_assert_held(&acct->lock);
> +
> +       work_flags = atomic_read(&work->flags);
> +       if (!__io_wq_is_hashed(work_flags))
> +               return false;
> +
> +       /* should not happen, io_acct_run_queue() said we had work */
> +       if (wq_list_empty(&acct->work_list))
> +               return true;
> +
> +       hash = __io_get_work_hash(work_flags);
> +       next = container_of(acct->work_list.first, struct io_wq_work, list);
> +       work_flags = atomic_read(&next->flags);
> +       if (!__io_wq_is_hashed(work_flags))
> +               return false;
> +       return hash == __io_get_work_hash(work_flags);
> +}
> +
>  static void io_wq_dec_running(struct io_worker *worker)
>  {
>         struct io_wq_acct *acct = io_wq_get_acct(worker);
> @@ -419,8 +453,14 @@ static void io_wq_dec_running(struct io_worker *worker)
>
>         if (!atomic_dec_and_test(&acct->nr_running))
>                 return;
> +       if (!worker->cur_work)
> +               return;
>         if (!io_acct_run_queue(acct))
>                 return;
> +       if (io_wq_hash_defer(worker->cur_work, acct)) {
> +               raw_spin_unlock(&acct->lock);
> +               return;
> +       }
>
>         raw_spin_unlock(&acct->lock);
>         atomic_inc(&acct->nr_running);
> @@ -454,16 +494,6 @@ static void __io_worker_idle(struct io_wq_acct *acct, struct io_worker *worker)
>         }
>  }
>
> -static inline unsigned int __io_get_work_hash(unsigned int work_flags)
> -{
> -       return work_flags >> IO_WQ_HASH_SHIFT;
> -}
> -
> -static inline unsigned int io_get_work_hash(struct io_wq_work *work)
> -{
> -       return __io_get_work_hash(atomic_read(&work->flags));
> -}
> -
>  static bool io_wait_on_hash(struct io_wq *wq, unsigned int hash)
>  {
>         bool ret = false;
>
> --
> Jens Axboe

next prev parent reply	other threads:[~2025-05-23  7:57 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-22  9:09 [RFC PATCH] io_uring: fix io worker thread that keeps creating and destroying Fengnan Chang
2025-05-22 11:34 ` Jens Axboe
2025-05-22 12:01   ` [External] " Fengnan Chang
2025-05-22 14:29     ` Jens Axboe
2025-05-23  7:57       ` Fengnan Chang [this message]
2025-05-23 15:20         ` Jens Axboe
2025-05-26 11:14           ` Fengnan Chang
2025-05-28 13:39             ` Jens Axboe
2025-05-28 13:56               ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAPFOzZtxXOQvC0wcNLaj-hZUOf2PWqon0uEvbQh7if7a7DdX=g@mail.gmail.com' \
    --to=changfengnan@bytedance.com \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=io-uring@vger.kernel.org \
    --cc=lidiangang@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox