From: Nadav Amit <[email protected]>
To: Jens Axboe <[email protected]>
Cc: [email protected], Hao Xu <[email protected]>
Subject: Re: Race between io_wqe_worker() and io_wqe_wake_worker()
Date: Tue, 3 Aug 2021 14:16:51 -0700
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
> On Aug 3, 2021, at 12:53 PM, Jens Axboe <[email protected]> wrote:
>
> How about this? I think this largely stems from the fact that we only
> do a partial running decrement on exit. Left the previous checks in
> place as well, as it will reduce the amount of times that we do need
> to hit that case.
The patch did not apply cleanly on my 5.13, but after I resolved the
conflicts, my test still got stuck (more frequently than with your
previous solution). I do not see how the problem is related to the
partial running decrement.
Thinking about it, the problem might even occur when multiple calls to
io_wqe_activate_free_worker() wake the same worker without realizing
that they race (since __io_worker_busy() has not yet been called by
io_worker_handle_work()).
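Roughly, the interleaving I have in mind is (a sketch, assuming two
wakers observe the same worker on the free_list before it marks
itself busy):

	waker A                             waker B
	-------                             -------
	io_wqe_activate_free_worker()
	  finds worker W on free_list
	  wake_up_process(W->task)
	                                    io_wqe_activate_free_worker()
	                                      still finds W on free_list
	                                      wake_up_process(W->task)

Both callers believe they activated a worker, but only one worker
actually runs, so one of the queued work items can be left behind.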
Anyhow, I think there are a few problems in the patch you sent. Once I
addressed a couple of them, my test passes, but I am not sure you
actually want the final result, and I am not sure it is robust/correct.
See my comments below for the changes I made and the other questions I
have (answer only if you have time).
>
>
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index cf086b01c6c6..f072995d382b 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -35,12 +35,17 @@ enum {
> IO_WQE_FLAG_STALLED = 1, /* stalled on hash */
> };
>
> +enum {
> + IO_WORKER_EXITING = 0, /* worker is exiting */
> +};
> +
> /*
> * One for each thread in a wqe pool
> */
> struct io_worker {
> refcount_t ref;
> unsigned flags;
> + unsigned long state;
> struct hlist_nulls_node nulls_node;
> struct list_head all_list;
> struct task_struct *task;
> @@ -130,6 +135,7 @@ struct io_cb_cancel_data {
> };
>
> static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index);
> +static void io_wqe_dec_running(struct io_worker *worker);
>
> static bool io_worker_get(struct io_worker *worker)
> {
> @@ -168,26 +174,21 @@ static void io_worker_exit(struct io_worker *worker)
> {
> struct io_wqe *wqe = worker->wqe;
> struct io_wqe_acct *acct = io_wqe_get_acct(worker);
> - unsigned flags;
>
> if (refcount_dec_and_test(&worker->ref))
> complete(&worker->ref_done);
> wait_for_completion(&worker->ref_done);
>
> - preempt_disable();
> - current->flags &= ~PF_IO_WORKER;
> - flags = worker->flags;
> - worker->flags = 0;
> - if (flags & IO_WORKER_F_RUNNING)
> - atomic_dec(&acct->nr_running);
> - worker->flags = 0;
> - preempt_enable();
> -
> raw_spin_lock_irq(&wqe->lock);
> - if (flags & IO_WORKER_F_FREE)
> + if (worker->flags & IO_WORKER_F_FREE)
> hlist_nulls_del_rcu(&worker->nulls_node);
> list_del_rcu(&worker->all_list);
> acct->nr_workers--;
> + preempt_disable();
> + io_wqe_dec_running(worker);
IIUC, in the scenario I encountered, acct->nr_running might be non-zero,
yet a new worker would still be needed. So the check in
io_wqe_dec_running() is insufficient to spawn a new worker at this
point, no?
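For reference, io_wqe_dec_running() in my 5.13 tree looks roughly like
the following (quoting from memory, so the details may be off):

	static void io_wqe_dec_running(struct io_worker *worker)
		__must_hold(wqe->lock)
	{
		struct io_wqe_acct *acct = io_wqe_get_acct(worker);
		struct io_wqe *wqe = worker->wqe;

		if (!(worker->flags & IO_WORKER_F_UP))
			return;

		/* only consider a new worker when the last runner leaves */
		if (atomic_dec_and_test(&acct->nr_running) && io_wqe_run_queue(wqe))
			io_wqe_wake_worker(wqe, acct);
	}

If another worker is still accounted in nr_running,
atomic_dec_and_test() returns false and no new worker is spawned, even
when that other "running" worker is itself on its way out.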
> + worker->flags = 0;
> + current->flags &= ~PF_IO_WORKER;
> + preempt_enable();
> raw_spin_unlock_irq(&wqe->lock);
>
> kfree_rcu(worker, rcu);
> @@ -214,15 +215,20 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
> struct hlist_nulls_node *n;
> struct io_worker *worker;
>
> - n = rcu_dereference(hlist_nulls_first_rcu(&wqe->free_list));
> - if (is_a_nulls(n))
> - return false;
> -
> - worker = hlist_nulls_entry(n, struct io_worker, nulls_node);
> - if (io_worker_get(worker)) {
> - wake_up_process(worker->task);
> + /*
> + * Iterate free_list and see if we can find an idle worker to
> + * activate. If a given worker is on the free_list but in the process
> + * of exiting, keep trying.
> + */
> + hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
> + if (!io_worker_get(worker))
> + continue;
Presumably you want to rely on the ordering between io_worker_get(),
i.e., the refcount_inc_not_zero(), and the test_bit(). I guess no
memory barrier is needed here (since refcount_inc_not_zero() returns a
value), but a comment documenting the intended ordering would help.
Anyhow, I do not see how this check helps.
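If the intent is what I assume it is, a comment along these lines
would capture it (this is my reading, not something the patch states):

	/*
	 * Pairs with the exit path in io_wqe_worker():
	 *
	 *   waker                          exiting worker
	 *   -----                          --------------
	 *   io_worker_get()                set_bit(IO_WORKER_EXITING)
	 *   test_bit(IO_WORKER_EXITING)    refcount_read(&worker->ref)
	 *
	 * At least one side must observe the other's first step, so
	 * either the waker skips the exiting worker or the worker sees
	 * the elevated refcount and loops once more.
	 */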
> + if (!test_bit(IO_WORKER_EXITING, &worker->state)) {
> + wake_up_process(worker->task);
So this might be the main problem. The worker might be between waking
up and setting IO_WORKER_EXITING. One option (that I tried, and which
works, at least in my limited testing) is to check whether the process
was actually woken, based on the return value of wake_up_process(),
and to skip workers that were not actually woken.
So I changed it to:
	if (wake_up_process(worker->task)) {
		io_worker_release(worker);
		return true;
	}
> + io_worker_release(worker);
The refcount is decreased here, so the refcount_read() in
io_wqe_worker() would not see the elevated refcount. No?
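In other words, the interleaving I am worried about is (a sketch):

	waker                               worker
	-----                               ------
	io_worker_get()      /* ref == 2 */
	test_bit() -> not set
	wake_up_process()    /* returns 0: already awake */
	io_worker_release()  /* ref == 1 */
	                                    times out, not FIXED
	                                    refcount_read() == 1
	                                    set_bit(IO_WORKER_EXITING)
	                                    break -> exits

The waker thinks it activated a worker, yet the worker never saw an
elevated refcount and exits anyway.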
> + return true;
> + }
> io_worker_release(worker);
> - return true;
> }
>
> return false;
> @@ -560,8 +566,17 @@ static int io_wqe_worker(void *data)
> if (ret)
> continue;
> /* timed out, exit unless we're the fixed worker */
> - if (!(worker->flags & IO_WORKER_F_FIXED))
> + if (!(worker->flags & IO_WORKER_F_FIXED)) {
> + /*
> + * Someone elevated our refs, which could be trying
> + * to re-activate for work. Loop one more time for
> + * that case.
> + */
> + if (refcount_read(&worker->ref) != 1)
> + continue;
I am not sure what this check buys us, as the refcount is decreased in
io_wqe_activate_free_worker() right after wake_up_process().
Anyhow, presumably you need an smp_mb__before_atomic() before the
set_bit() below, no? I added one. Yet, without checking the return
value of wake_up_process() this still seems broken.
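Concretely, what I ended up with is (the barrier placement reflects my
understanding, so take it as a guess):

	/*
	 * Order the refcount_read() above before the set_bit(), so
	 * that we do not miss a waker that elevated our refcount.
	 */
	smp_mb__before_atomic();
	set_bit(IO_WORKER_EXITING, &worker->state);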
> + set_bit(IO_WORKER_EXITING, &worker->state);
> break;
> + }
> }