public inbox for [email protected]
From: Jens Axboe <[email protected]>
To: Dave Chinner <[email protected]>, Andres Freund <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>,
	"Thorsten Leemhuis" <[email protected]>,
	"Shreeya Patel" <[email protected]>,
	[email protected],
	"Ricardo Cañuelo" <[email protected]>,
	[email protected], [email protected],
	[email protected],
	"Linux regressions mailing list" <[email protected]>,
	[email protected]
Subject: Re: task hung in ext4_fallocate #2
Date: Tue, 24 Oct 2023 08:30:02 -0600
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>

On 10/23/23 7:12 PM, Dave Chinner wrote:
> [cc Jens, io-uring]

Thanks!

> On Tue, Oct 17, 2023 at 07:50:09PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2023-10-17 20:43:35 -0400, Theodore Ts'o wrote:
>>> On Mon, Oct 16, 2023 at 08:37:25PM -0700, Andres Freund wrote:
>>>> I was just able to reproduce the issue after upgrading to 6.6-rc6 - this time
>>>> it took ~55min of high load (an io_uring-using branch of postgres, running a
>>>> write-heavy transactional workload concurrently with a bulk data load) to
>>>> trigger the issue.
>>>>
>>>> For now I have left the system running, in case there's something you would
>>>> like me to check while the system is hung.
>>>>
>>>> The first hanging task that I observed:
>>>>
>>>> cat /proc/57606/stack
>>>> [<0>] inode_dio_wait+0xd5/0x100
>>>> [<0>] ext4_fallocate+0x12f/0x1040
>>>> [<0>] vfs_fallocate+0x135/0x360
>>>> [<0>] __x64_sys_fallocate+0x42/0x70
>>>> [<0>] do_syscall_64+0x38/0x80
>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>
>>> This stack trace is from some process (presumably postgres) trying to
>>> do a fallocate() system call:
>>
>> Yes, it's indeed postgres.
>>
>>
>>>> [ 3194.579297] INFO: task iou-wrk-58004:58874 blocked for more than 122 seconds.
>>>> [ 3194.579304]       Not tainted 6.6.0-rc6-andres-00001-g01edcfe38260 #77
>>>> [ 3194.579310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> [ 3194.579314] task:iou-wrk-58004   state:D stack:0     pid:58874 ppid:52606  flags:0x00004000
>>>> [ 3194.579325] Call Trace:
>>>> [ 3194.579329]  <TASK>
>>>> [ 3194.579334]  __schedule+0x388/0x13e0
>>>> [ 3194.579349]  schedule+0x5f/0xe0
>>>> [ 3194.579361]  schedule_preempt_disabled+0x15/0x20
>>>> [ 3194.579374]  rwsem_down_read_slowpath+0x26e/0x4c0
>>>> [ 3194.579385]  down_read+0x44/0xa0
>>>> [ 3194.579393]  ext4_file_write_iter+0x432/0xa80
>>>> [ 3194.579407]  io_write+0x129/0x420
>>>
>>> This could potentially be an interesting stack trace; but this is where
>>> we really need to map the stack address to line numbers.  Is that
>>> something you could do?
>>
>> Converting the above with scripts/decode_stacktrace.sh yields:
>>
>> __schedule (kernel/sched/core.c:5382 kernel/sched/core.c:6695)
>> schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 13) kernel/sched/core.c:6772 (discriminator 13))
>> schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 (discriminator 10) kernel/sched/core.c:6831 (discriminator 10))
>> rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073 (discriminator 4))
>> down_read (./arch/x86/include/asm/preempt.h:95 kernel/locking/rwsem.c:1257 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
>> ext4_file_write_iter (./include/linux/fs.h:1073 fs/ext4/file.c:57 fs/ext4/file.c:564 fs/ext4/file.c:715)
>> io_write (./include/linux/fs.h:1956 io_uring/rw.c:926)
>>
>> But I'm not sure that that stack trace is quite right (e.g. what is
>> ./include/linux/fs.h:1073 doing in there?). With an optimized build
>> it's a bit hard to tell what's an optimization artifact and what's
>> genuinely off in the stack trace...
>>
>>
>> I had the idea to look at the stacks (via /proc/$pid/stack) for all postgres
>> processes and the associated io-uring threads, and then to deduplicate them.
>>
>> 22x:
>> ext4_file_write_iter (./include/linux/fs.h:1073 fs/ext4/file.c:57 fs/ext4/file.c:564 fs/ext4/file.c:715)
>> io_write (./include/linux/fs.h:1956 io_uring/rw.c:926)
>> io_issue_sqe (io_uring/io_uring.c:1878)
>> io_wq_submit_work (io_uring/io_uring.c:1960)
>> io_worker_handle_work (io_uring/io-wq.c:596)
>> io_wq_worker (io_uring/io-wq.c:258 io_uring/io-wq.c:648)
>> ret_from_fork (arch/x86/kernel/process.c:147)
>> ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
> 
> io_uring does some interesting stuff with IO completion and iomap
> now - IIRC IOCB_DIO_CALLER_COMP is new 6.6-rc1 functionality. This
> flag is set by io_write() to tell the iomap IO completion handler
> that the caller will run the IO completion, avoiding a context switch
> to run the completion in a kworker thread.
> 
> It's not until the caller runs iocb->dio_complete() that
> inode_dio_end() is called to drop the i_dio_count. This happens when
> io_uring gets completion notification from iomap via
> iocb->ki_complete(iocb, 0);
> 
> It then requires the io_uring layer to process the completion
> and call iocb->dio_complete() before the inode->i_dio_count is
> dropped and inode_dio_wait() will complete.
> 
> So what I suspect here is that we have a situation where the worker
> that would run the completion is blocked on the inode rwsem because
> this isn't an IOCB_NOWAIT IO and the fallocate call holds the rwsem
> exclusive and is waiting on inode_dio_wait() to return.
> 
> Cc Jens, because I'm not sure this is actually an ext4 problem - I
> can't see why XFS wouldn't have the same issue w.r.t.
> truncate/fallocate and IOCB_DIO_CALLER_COMP delaying the
> inode_dio_end() until whenever the io_uring code can get around to
> processing the delayed completion....
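
To make the ordering concrete, here's a rough pseudo-code sketch of the
deferred-completion flow described above (not literal kernel source; the
helper name iomap_deferred_complete is simplified/made up):

/* 1) Submission: io_write() marks the iocb for caller completion */
iocb->ki_flags |= IOCB_DIO_CALLER_COMP;

/* 2) iomap dio completion (bio end_io context): rather than punting
 *    the completion to a kworker, stash it for the caller and tell
 *    io_uring the request is done */
iocb->private = dio;
iocb->dio_complete = iomap_deferred_complete;	/* simplified name */
iocb->ki_complete(iocb, 0);			/* queues io_uring task_work */

/* 3) Later, in the submitting task's task_work, io_uring calls
 *    iocb->dio_complete(iocb->private), which is what finally runs
 *    inode_dio_end() and drops inode->i_dio_count. */

/* Meanwhile, ext4_fallocate() in that same task:
 *    inode_lock(inode);		takes i_rwsem exclusive
 *    inode_dio_wait(inode);		sleeps until i_dio_count == 0
 *    i_dio_count only drops in step 3, which can't run while the task
 *    that has to run it is sleeping in inode_dio_wait(). */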

I don't think this is related to the io-wq workers doing non-blocking
IO. The callback is eventually executed by the task that originally
submitted the IO, which is the owner and not the async workers. But if
that original task is blocked in, e.g., fallocate, then I can see how
that could be an issue.

I'll take a closer look.
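
For reference, the kind of userspace interleaving I'd expect could trigger
this looks roughly like the sketch below. It's purely illustrative and
liburing-based - not a confirmed reproducer of the postgres workload, and
whether it actually hangs depends on the write going through the deferred
(IOCB_DIO_CALLER_COMP) completion path:

/* Illustrative only: O_DIRECT write via io_uring, then fallocate() on
 * the same file before the completion has been reaped/processed.
 * Error checking omitted for brevity. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;
	void *buf;
	int fd;

	io_uring_queue_init(8, &ring, 0);
	fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
	posix_memalign(&buf, 4096, 4096);

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_write(sqe, fd, buf, 4096, 0);
	io_uring_submit(&ring);

	/* If the write completed via the deferred path, inode_dio_end()
	 * has not run yet: it only runs from this task's task_work.
	 * Blocking here in fallocate() -> inode_dio_wait() means that
	 * task_work never gets a chance to execute. */
	fallocate(fd, 0, 0, 1 << 20);

	io_uring_wait_cqe(&ring, &cqe);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}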

-- 
Jens Axboe

