From: Dave Chinner <[email protected]>
To: Jens Axboe <[email protected]>
Cc: io-uring <[email protected]>,
linux-fsdevel <[email protected]>
Subject: Re: [5.15-rc1 regression] io_uring: fsstress hangs in do_coredump() on exit
Date: Wed, 22 Sep 2021 07:35:52 +1000
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>

On Tue, Sep 21, 2021 at 08:19:53AM -0600, Jens Axboe wrote:
> On 9/21/21 7:25 AM, Jens Axboe wrote:
> > On 9/21/21 12:40 AM, Dave Chinner wrote:
> >> Hi Jens,
> >>
> >> I updated all my trees from 5.14 to 5.15-rc2 this morning and
> >> immediately had problems running the recoveryloop fstest group on
> >> them. These tests have a typical pattern of "run load in the
> >> background, shutdown the filesystem, kill load, unmount and test
> >> recovery".
> >>
> >> When the load includes fsstress, and it gets killed after shutdown,
> >> it hangs on exit like so:
> >>
> >> # echo w > /proc/sysrq-trigger
> >> [ 370.669482] sysrq: Show Blocked State
> >> [ 370.671732] task:fsstress state:D stack:11088 pid: 9619 ppid: 9615 flags:0x00000000
> >> [ 370.675870] Call Trace:
> >> [ 370.677067] __schedule+0x310/0x9f0
> >> [ 370.678564] schedule+0x67/0xe0
> >> [ 370.679545] schedule_timeout+0x114/0x160
> >> [ 370.682002] __wait_for_common+0xc0/0x160
> >> [ 370.684274] wait_for_completion+0x24/0x30
> >> [ 370.685471] do_coredump+0x202/0x1150
> >> [ 370.690270] get_signal+0x4c2/0x900
> >> [ 370.691305] arch_do_signal_or_restart+0x106/0x7a0
> >> [ 370.693888] exit_to_user_mode_prepare+0xfb/0x1d0
> >> [ 370.695241] syscall_exit_to_user_mode+0x17/0x40
> >> [ 370.696572] do_syscall_64+0x42/0x80
> >> [ 370.697620] entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>
> >> It's 100% reproducible on one of my test machines, but only one of
> >> them. That one machine is running fstests on pmem, so it has
> >> synchronous storage. Every other test machine is using normal async
> >> storage (nvme, iscsi, etc) and none of them are hanging.
> >>
> >> A quick troll of the commit history between 5.14 and 5.15-rc2
> >> indicates a couple of potential candidates. The 5th kernel build
> >> (instead of ~16 for a bisect) told me that commit 15e20db2e0ce
> >> ("io-wq: only exit on fatal signals") is the cause of the
> >> regression. I've confirmed that this is the first commit where the
> >> problem shows up.
> >
> > Thanks for the report Dave, I'll take a look. Can you elaborate on
> > exactly what is being run? And when killed, it's a non-fatal signal?

It's whatever kill/killall sends by default. Typical behaviour that
causes a hang is something like:

	$FSSTRESS_PROG -n10000000 -p $PROCS -d $load_dir >> $seqres.full 2>&1 &
	....
	sleep 5
	_scratch_shutdown
	$KILLALL_PROG -q $FSSTRESS_PROG
	wait

_scratch_shutdown is typically just an 'xfs_io -rx -c "shutdown"
/mnt/scratch' command that shuts down the filesystem. Other tests in
the recoveryloop group use DM targets to fail IO and so trigger a
shutdown, others inject errors that trigger shutdowns, etc. But the
result is that they all hang waiting for fsstress processes that have
been using io_uring to exit.

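For anyone who wants to see the shape of that sequence without an
fstests setup, here's a minimal stand-alone sketch. It's only a model:
"sleep" stands in for fsstress, the shutdown step is a hypothetical
no-op placeholder, and nothing in it touches io_uring, so it will not
reproduce the hang.

```shell
#!/bin/sh
# Model of the recoveryloop pattern. Assumptions: "sleep 1000" stands in
# for the fsstress load and shutdown_fs is a hypothetical no-op in place
# of _scratch_shutdown. No filesystem or io_uring is involved.

shutdown_fs() {
	:	# real tests would shut down the scratch filesystem here
}

sleep 1000 &		# run load in the background
load_pid=$!

sleep 1			# let the load run for a bit
shutdown_fs		# shut down the filesystem
kill "$load_pid"	# kill load with the default signal (SIGTERM)

# Collect the load's exit status. Shells such as bash and dash report
# death by signal as 128 + the signal number.
status=0
wait "$load_pid" || status=$?
echo "load exited with status $status"
```

In the failing case it's that final wait that never returns, because
the killed fsstress is stuck in D state in do_coredump().
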
Just run fstests with "./check -g recoveryloop" - there's only a
handful of tests and it only takes about 5 minutes to run them all
on a fake DRAM based pmem device.

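If you don't already have fstests configured, a minimal local.config
would look something like the sketch below. The variable names are the
standard fstests ones; the filesystem type, the device paths and the
test mount point are placeholders made up for illustration, so
substitute your own.

```shell
# local.config sketch for running the recoveryloop group on pmem.
# Assumption: xfs, /dev/pmem0, /dev/pmem1 and /mnt/test are placeholder
# values; only /mnt/scratch appears in the description above.
export FSTYP=xfs
export TEST_DEV=/dev/pmem0
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/pmem1
export SCRATCH_MNT=/mnt/scratch

# Then, from the top of the fstests tree:
#   ./check -g recoveryloop
```
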
> Can you try with this patch?
>
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index b5fd015268d7..1e55a0a2a217 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -586,7 +586,8 @@ static int io_wqe_worker(void *data)
>
>  			if (!get_signal(&ksig))
>  				continue;
> -			if (fatal_signal_pending(current))
> +			if (fatal_signal_pending(current) ||
> +			    signal_group_exit(current->signal)) {
>  				break;
>  			continue;
>  		}

Cleaned up so it compiles, and with it the tests run properly again.
But playing whack-a-mole with signals seems kinda fragile. Another dev
on #xfs who saw the same hangs pointed me overnight to this patchset,
which also fixed the hang:

https://lore.kernel.org/lkml/[email protected]/

It was posted about a month ago and I don't see any response to it
on the lists...

Cheers,

Dave.
--
Dave Chinner
[email protected]