From: Jens Axboe <[email protected]>
To: Nadav Amit <[email protected]>
Cc: [email protected], Hao Xu <[email protected]>
Subject: Re: Race between io_wqe_worker() and io_wqe_wake_worker()
Date: Tue, 3 Aug 2021 15:25:23 -0600 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 8/3/21 3:16 PM, Nadav Amit wrote:
>
>
>> On Aug 3, 2021, at 12:53 PM, Jens Axboe <[email protected]> wrote:
>>
>> How about this? I think this largely stems from the fact that we only
>> do a partial running decrement on exit. Left the previous checks in
>> place as well, as it will reduce the amount of times that we do need
>> to hit that case.
>
> It did not apply cleanly on my 5.13, but after I cleaned it up, it
> still got stuck (more frequently than with your previous solution).
>
> I do not see the problem related to the partial running decrement.
> Thinking of it, I think that the problem might even happen if
> multiple calls to io_wqe_activate_free_worker() wake up the same worker,
> not realizing that they race (since __io_worker_busy() was still not
> called by io_worker_handle_work()).
That's actually by design for io-wq in general: we assume that the work
won't block, and in that case we only want to activate the one worker.
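To make that design point concrete, here is a tiny userspace model (invented names, C11 atomics instead of the kernel primitives; a sketch, not the real io-wq code): activating consumes at most one idle worker, and only an empty free list forces the caller to create a new one.

```c
#include <stdatomic.h>

/*
 * Userspace sketch of the "activate one worker" design (invented names;
 * not the real io-wq code). Work is assumed not to block, so one
 * running worker per queued item is enough.
 */
static atomic_int nr_free;     /* idle workers on the free list */
static atomic_int nr_running;  /* workers actively handling work */

/* Returns 1 if an idle worker was activated, 0 if a new one is needed. */
static int activate_one_worker(void)
{
	int idle = atomic_load(&nr_free);

	while (idle > 0) {
		/* claim exactly one idle worker; retry on contention */
		if (atomic_compare_exchange_weak(&nr_free, &idle, idle - 1)) {
			atomic_fetch_add(&nr_running, 1);
			return 1;
		}
	}
	return 0;	/* free list empty: caller must create a worker */
}
```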
> Anyhow, I think there are a few problems in the patch you sent. Once I
> addressed a couple of problems, my test passes, but I am not sure you
> actually want the final result, and I am not sure it is robust/correct.
>
> See my comments below for the changes I added and other questions I
> have (you can answer only if you have time).
>
>>
>>
>> diff --git a/fs/io-wq.c b/fs/io-wq.c
>> index cf086b01c6c6..f072995d382b 100644
>> --- a/fs/io-wq.c
>> +++ b/fs/io-wq.c
>> @@ -35,12 +35,17 @@ enum {
>> IO_WQE_FLAG_STALLED = 1, /* stalled on hash */
>> };
>>
>> +enum {
>> + IO_WORKER_EXITING = 0, /* worker is exiting */
>> +};
>> +
>> /*
>> * One for each thread in a wqe pool
>> */
>> struct io_worker {
>> refcount_t ref;
>> unsigned flags;
>> + unsigned long state;
>> struct hlist_nulls_node nulls_node;
>> struct list_head all_list;
>> struct task_struct *task;
>> @@ -130,6 +135,7 @@ struct io_cb_cancel_data {
>> };
>>
>> static void create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index);
>> +static void io_wqe_dec_running(struct io_worker *worker);
>>
>> static bool io_worker_get(struct io_worker *worker)
>> {
>> @@ -168,26 +174,21 @@ static void io_worker_exit(struct io_worker *worker)
>> {
>> struct io_wqe *wqe = worker->wqe;
>> struct io_wqe_acct *acct = io_wqe_get_acct(worker);
>> - unsigned flags;
>>
>> if (refcount_dec_and_test(&worker->ref))
>> complete(&worker->ref_done);
>> wait_for_completion(&worker->ref_done);
>>
>> - preempt_disable();
>> - current->flags &= ~PF_IO_WORKER;
>> - flags = worker->flags;
>> - worker->flags = 0;
>> - if (flags & IO_WORKER_F_RUNNING)
>> - atomic_dec(&acct->nr_running);
>> - worker->flags = 0;
>> - preempt_enable();
>> -
>> raw_spin_lock_irq(&wqe->lock);
>> - if (flags & IO_WORKER_F_FREE)
>> + if (worker->flags & IO_WORKER_F_FREE)
>> hlist_nulls_del_rcu(&worker->nulls_node);
>> list_del_rcu(&worker->all_list);
>> acct->nr_workers--;
>> + preempt_disable();
>> + io_wqe_dec_running(worker);
>
> IIUC, in the scenario I encountered, acct->nr_running might be non-zero,
> but still a new worker would be needed. So the check in io_wqe_dec_running()
> is insufficient to spawn a new worker at this point, no?
If nr_running != 0, then we have active workers. They will either
complete the work they have without blocking, or if they block, then
we'll create a new one. So it really should be enough, I'm a bit
puzzled...
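That invariant can be modeled in plain userspace C (invented names, C11 atomics; a sketch of the accounting being discussed, not the actual io-wq code): the last runner to drop nr_running while work is still pending is the one responsible for creating a replacement, so nr_running != 0 at exit means some other worker is still on the hook.

```c
#include <stdatomic.h>

/*
 * Userspace sketch of the nr_running accounting (invented names; not the
 * real io-wq code). A worker about to block or exit drops nr_running;
 * if it was the last runner and work is still queued, it must arrange
 * for a new worker to be created.
 */
static atomic_int nr_running;
static atomic_int pending_work;
static atomic_int workers_created;

static void dec_running(void)
{
	/* atomic_fetch_sub() returns the old value */
	if (atomic_fetch_sub(&nr_running, 1) == 1 &&
	    atomic_load(&pending_work) > 0) {
		/* last runner left with work queued: spawn a replacement */
		atomic_fetch_add(&workers_created, 1);
		atomic_fetch_add(&nr_running, 1);
	}
}
```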
>> + worker->flags = 0;
>> + current->flags &= ~PF_IO_WORKER;
>> + preempt_enable();
>> raw_spin_unlock_irq(&wqe->lock);
>>
>> kfree_rcu(worker, rcu);
>> @@ -214,15 +215,20 @@ static bool io_wqe_activate_free_worker(struct io_wqe *wqe)
>> struct hlist_nulls_node *n;
>> struct io_worker *worker;
>>
>> - n = rcu_dereference(hlist_nulls_first_rcu(&wqe->free_list));
>> - if (is_a_nulls(n))
>> - return false;
>> -
>> - worker = hlist_nulls_entry(n, struct io_worker, nulls_node);
>> - if (io_worker_get(worker)) {
>> - wake_up_process(worker->task);
>> + /*
>> + * Iterate free_list and see if we can find an idle worker to
>> + * activate. If a given worker is on the free_list but in the process
>> + * of exiting, keep trying.
>> + */
>> + hlist_nulls_for_each_entry_rcu(worker, n, &wqe->free_list, nulls_node) {
>> + if (!io_worker_get(worker))
>> + continue;
>
> Presumably you want to rely on the order between io_worker_get(), i.e.
> the refcount_inc_not_zero() and the test_bit(). I guess no memory-barrier
> is needed here (since refcount_inc_not_zero() returns a value) but
> documentation would help. Anyhow, I do not see how it helps.
Right, no extra barriers needed.
>> + if (!test_bit(IO_WORKER_EXITING, &worker->state)) {
>> + wake_up_process(worker->task);
>
> So this might be the main problem. The worker might be in between waking
> and setting IO_WORKER_EXITING. One option (that I tried and works, at
> least in limited testing), is to look whether the process was actually
> woken according to the return value of wake_up_process() and not to
> use workers that were not actually woken.
>
> So I changed it to:
> if (wake_up_process(worker->task)) {
> io_worker_release(worker);
> return true;
> }
>
>
>> + io_worker_release(worker);
>
> The refcount is decreased, so the refcount_read in io_wqe_worker()
> would not see the elevated refcount. No?
That's probably not a bad idea, though not quite sure that'd always be
safe. I'm going to need to look deeper, because we really should not
have a lot of concurrent activity here in terms of multiple issuers
looking up free workers and activating them.
Can you share a bit about what the workload looks like? That might help
create a reproducer, which would be handy going forward as well.
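For reference, the return-value check Nadav suggests above maps onto a small userspace model (invented names; in the kernel, wake_up_process() returns non-zero only if it actually moved the task to runnable). A "wake" that finds the worker already running, e.g. one that woke on its own and is on its way out, does not count, and the caller keeps scanning the free list.

```c
#include <stdatomic.h>

enum { W_WAITING, W_RUNNING };

/*
 * Userspace sketch of checking the wake's return value (invented names;
 * not the real scheduler code). The wake succeeds only if we are the
 * ones who transitioned the worker from waiting to running.
 */
static atomic_int demo_worker;	/* zero-initialized: W_WAITING */

static int try_wake(atomic_int *state)
{
	int expected = W_WAITING;

	/* mirrors relying on wake_up_process()'s return value */
	return atomic_compare_exchange_strong(state, &expected, W_RUNNING);
}
```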
>> + return true;
>> + }
>> io_worker_release(worker);
>> - return true;
>> }
>>
>> return false;
>> @@ -560,8 +566,17 @@ static int io_wqe_worker(void *data)
>> if (ret)
>> continue;
>> /* timed out, exit unless we're the fixed worker */
>> - if (!(worker->flags & IO_WORKER_F_FIXED))
>> + if (!(worker->flags & IO_WORKER_F_FIXED)) {
>> + /*
>> + * Someone elevated our refs, which could be trying
>> + * to re-activate for work. Loop one more time for
>> + * that case.
>> + */
>> + if (refcount_read(&worker->ref) != 1)
>> + continue;
>
> I am not sure what it serves, as the refcount is decreased in
> io_wqe_activate_free_worker() right after wake_up_process().
It should just go away I think, it'd be better to cut the patch down to
the functional part.
> Anyhow, presumably you need smp_mb__before_atomic() here, no? I added
> one. Yet, without the check in the wake_up_process() this still seems
> broken.
Yes, it would need that.
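Sketched with C11 atomics (invented names; atomic_thread_fence() standing in for smp_mb__before_atomic(), since set_bit() by itself implies no ordering), the exit path would look roughly like:

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace sketch of the ordering in question (invented names; not the
 * real io-wq code). In the kernel, set_bit() is a non-value-returning
 * atomic with no implied ordering, so the exiting worker needs
 * smp_mb__before_atomic() so earlier accesses (e.g. the refcount check)
 * cannot be reordered past setting IO_WORKER_EXITING.
 */
static atomic_uint worker_state;
#define WORKER_EXITING	(1u << 0)

static void mark_exiting(void)
{
	/* counterpart of smp_mb__before_atomic() + set_bit() */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_fetch_or_explicit(&worker_state, WORKER_EXITING,
				 memory_order_relaxed);
}

static bool is_exiting(void)
{
	return atomic_load(&worker_state) & WORKER_EXITING;
}
```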
--
Jens Axboe
Thread overview: 11+ messages
2021-08-03 1:05 Race between io_wqe_worker() and io_wqe_wake_worker() Nadav Amit
2021-08-03 13:22 ` Jens Axboe
2021-08-03 14:37 ` Jens Axboe
2021-08-03 17:25 ` Hao Xu
2021-08-03 18:04 ` Nadav Amit
2021-08-03 18:14 ` Jens Axboe
2021-08-03 19:20 ` Nadav Amit
2021-08-03 19:24 ` Jens Axboe
2021-08-03 19:53 ` Jens Axboe
2021-08-03 21:16 ` Nadav Amit
2021-08-03 21:25 ` Jens Axboe [this message]