public inbox for [email protected]
 help / color / mirror / Atom feed
From: Pavel Begunkov <[email protected]>
To: Ming Lei <[email protected]>
Cc: Jens Axboe <[email protected]>,
	[email protected], [email protected],
	David Howells <[email protected]>,
	Chengming Zhou <[email protected]>
Subject: Re: [PATCH V2] io_uring: fix IO hang in io_wq_put_and_exit from do_exit()
Date: Fri, 8 Sep 2023 17:01:31 +0100	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <ZPq9mY51e7++cbpC@fedora>

On 9/8/23 07:22, Ming Lei wrote:
> On Fri, Sep 08, 2023 at 02:03:11AM +0100, Pavel Begunkov wrote:
>> On 9/7/23 16:36, Jens Axboe wrote:
>>> On 9/1/23 9:13 AM, Jens Axboe wrote:
>>>>
>>>> On Fri, 01 Sep 2023 21:49:16 +0800, Ming Lei wrote:
>>>>> io_wq_put_and_exit() is called from do_exit(), but all FIXED_FILE requests
>>>>> in io_wq aren't canceled in io_uring_cancel_generic() called from do_exit().
>>>>> Meantime io_wq IO code path may share resource with normal iopoll code
>>>>> path.
>>>>>
>>>>> So if any HIPRI request is submittd via io_wq, this request may not get resouce
>>>>> for moving on, given iopoll isn't possible in io_wq_put_and_exit().
>>>>>>> [...]
>>>>
>>>> Applied, thanks!
>>>>
>>>> [1/1] io_uring: fix IO hang in io_wq_put_and_exit from do_exit()
>>>>         commit: b484a40dc1f16edb58e5430105a021e1916e6f27
>>>
>>> This causes a regression with the test/thread-exit.t test case, as it's
>>> canceling requests from other tasks as well. I will drop this patch for
>>> now.
>>
>> And this one has never hit my mailbox... Anyway, I'm confused with
>> the issue:
>>

Thanks CC'ing while resending, I don't know what's up with lore

>> 1) request tracking is done only for io_uring polling io_uring, which
> 
> request tracking isn't done on FIXED_FILE IO, which is used by t/io_uring.
> 
>> shouldn't be the case for t/io_uring, so it's probably unrelated?
> 
> So io_uring_try_cancel_requests() won't be called because
> tctx_inflight() returns zero.

That's true for fixed files, but it also holds for non fixed files
as well apart from a narrow case t/io_uring doesn't exercise

>> 2) In case of iopoll, io-wq only submits a request but doesn't wait/poll
>> for it. If io_issue_sqe() returned -EAGAIN or an error, the request is
>> considered not yet submitted to block and can be just cancelled normally
>> without any dancing like io_iopoll_try_reap_events().
> 
> io_issue_sqe() may never return since IO_URING_F_NONBLOCK isn't set
> for iopoll, and recently polled IO doesn't imply nowait in commit

missing nowait semantics should be fine, io_uring_clean_tctx()
sends a signal to workers, which should break them out of waiting.

...
>> We set the flag, interrupt workers (TIF_NOTIFY_SIGNAL + wake up), and
>> wait for them. Workers are woken up (or just return), see
>> the flag and return. At least that's how it was intended to work.
>>
>> What's missing? Racing for IO_WQ_BIT_EXIT? Not breaking on IO_WQ_BIT_EXIT
>> correctly? Not honouring / clearing TIF_NOTIFY_SIGNAL?
>>
>> I'll try to repro later
> 
> After applying your patch of 9256b9371204 ("io_uring: break out of iowq
> iopoll on teardown") and the above patch, the issue still can be triggered,

Yeah, it doesn't appear that it'd help, it was rather something
randomly spotted than targeting your issue.

> and the worker is keeping to call io_issue_sqe() for ever, and
> io_wq_worker_stopped() returns false. So do_exit() isn't called on
> t/io_uring pthread yet, meantime I guess either iopoll is terminated or not
> get chance to run.

That prior to task exit then, right? I wonder if we're missing
signal checking in that loop like below or just need limit the
number of re-submission attempts.

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 6cce8948bddf..c6f566200d04 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1944,6 +1944,8 @@ void io_wq_submit_work(struct io_wq_work *work)
  				break;
  			if (io_wq_worker_stopped())
  				break;
+			if (signal_pending(current))
+				break;
  			cond_resched();
  			continue;
  		}


-- 
Pavel Begunkov

      reply	other threads:[~2023-09-08 16:02 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-01 13:49 [PATCH V2] io_uring: fix IO hang in io_wq_put_and_exit from do_exit() Ming Lei
2023-09-01 14:47 ` Jens Axboe
2023-09-01 15:12   ` Ming Lei
2023-09-01 15:13     ` Jens Axboe
2023-09-01 15:13 ` Jens Axboe
2023-09-07 15:36   ` Jens Axboe
2023-09-08  1:03     ` Pavel Begunkov
2023-09-08  6:22       ` Ming Lei
2023-09-08 16:01         ` Pavel Begunkov [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox