public inbox for [email protected]
 help / color / mirror / Atom feed
From: Ming Lei <[email protected]>
To: Pavel Begunkov <[email protected]>
Cc: Jens Axboe <[email protected]>,
	[email protected], [email protected],
	David Howells <[email protected]>,
	Chengming Zhou <[email protected]>
Subject: Re: [PATCH V2] io_uring: fix IO hang in io_wq_put_and_exit from do_exit()
Date: Fri, 8 Sep 2023 14:22:17 +0800	[thread overview]
Message-ID: <ZPq9mY51e7++cbpC@fedora> (raw)
In-Reply-To: <[email protected]>

On Fri, Sep 08, 2023 at 02:03:11AM +0100, Pavel Begunkov wrote:
> On 9/7/23 16:36, Jens Axboe wrote:
> > On 9/1/23 9:13 AM, Jens Axboe wrote:
> > > 
> > > On Fri, 01 Sep 2023 21:49:16 +0800, Ming Lei wrote:
> > > > io_wq_put_and_exit() is called from do_exit(), but all FIXED_FILE requests
> > > > in io_wq aren't canceled in io_uring_cancel_generic() called from do_exit().
> > > > Meantime io_wq IO code path may share resource with normal iopoll code
> > > > path.
> > > > 
> > > > So if any HIPRI request is submittd via io_wq, this request may not get resouce
> > > > for moving on, given iopoll isn't possible in io_wq_put_and_exit().
> > > > > > [...]
> > > 
> > > Applied, thanks!
> > > 
> > > [1/1] io_uring: fix IO hang in io_wq_put_and_exit from do_exit()
> > >        commit: b484a40dc1f16edb58e5430105a021e1916e6f27
> > 
> > This causes a regression with the test/thread-exit.t test case, as it's
> > canceling requests from other tasks as well. I will drop this patch for
> > now.
> 
> And this one has never hit my mailbox... Anyway, I'm confused with
> the issue:
> 
> 1) request tracking is done only for io_uring polling io_uring, which

request tracking isn't done on FIXED_FILE IO, which is used by t/io_uring.

> shouldn't be the case for t/io_uring, so it's probably unrelated?

So io_uring_try_cancel_requests() won't be called because
tctx_inflight() returns zero.

> 
> 2) In case of iopoll, io-wq only submits a request but doesn't wait/poll
> for it. If io_issue_sqe() returned -EAGAIN or an error, the request is
> considered not yet submitted to block and can be just cancelled normally
> without any dancing like io_iopoll_try_reap_events().

io_issue_sqe() may never return since IO_URING_F_NONBLOCK isn't set
for iopoll, and recently polled IO doesn't imply nowait in commit
2bc057692599 ("block: don't make REQ_POLLED imply REQ_NOWAIT"), this
commit is actually fragile, cause it is easy to cause io hang if
iopoll & submission is done in same context.

This one should be easy to address, either the following change
or revert 2bc057692599 ("block: don't make REQ_POLLED imply
REQ_NOWAIT").

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ad636954abae..d8419689ad3e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1930,6 +1930,9 @@ void io_wq_submit_work(struct io_wq_work *work)
                }
        }

+       if ((req->ctx->flags & IORING_SETUP_IOPOLL) && def->iopoll_queue)
+               issue_flags |= IO_URING_F_NONBLOCK;
+
        do {
                ret = io_issue_sqe(req, issue_flags);
                if (ret != -EAGAIN)

> 
> 
> 3) If we condense the code it sounds like it effectively will be
> like this:
> 
> void io_wq_exit_start(struct io_wq *wq)
> {
> 	set_bit(IO_WQ_BIT_EXIT, &wq->state);
> }
> 
> io_uring_cancel_generic()
> {
> 	if (tctx->io_wq)
> 		io_wq_exit_start(tctx->io_wq);
> 	io_uring_clean_tctx(tctx);
> 	...
> }
> 
> We set the flag, interrupt workers (TIF_NOTIFY_SIGNAL + wake up), and
> wait for them. Workers are woken up (or just return), see
> the flag and return. At least that's how it was intended to work.
> 
> What's missing? Racing for IO_WQ_BIT_EXIT? Not breaking on IO_WQ_BIT_EXIT
> correctly? Not honouring / clearing TIF_NOTIFY_SIGNAL?
> 
> I'll try to repro later

After applying your patch of 9256b9371204 ("io_uring: break out of iowq
iopoll on teardown") and the above patch, the issue still can be triggered,
and the worker is keeping to call io_issue_sqe() for ever, and
io_wq_worker_stopped() returns false. So do_exit() isn't called on
t/io_uring pthread yet, meantime I guess either iopoll is terminated or not
get chance to run.


Thanks,
Ming


  reply	other threads:[~2023-09-08  6:23 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-01 13:49 [PATCH V2] io_uring: fix IO hang in io_wq_put_and_exit from do_exit() Ming Lei
2023-09-01 14:47 ` Jens Axboe
2023-09-01 15:12   ` Ming Lei
2023-09-01 15:13     ` Jens Axboe
2023-09-01 15:13 ` Jens Axboe
2023-09-07 15:36   ` Jens Axboe
2023-09-08  1:03     ` Pavel Begunkov
2023-09-08  6:22       ` Ming Lei [this message]
2023-09-08 16:01         ` Pavel Begunkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZPq9mY51e7++cbpC@fedora \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox