public inbox for [email protected]
* Re: task hung in ext4_fallocate #2
       [not found]   ` <[email protected]>
@ 2023-10-24  1:12     ` Dave Chinner
  2023-10-24  1:36       ` Andres Freund
  2023-10-24 14:30       ` Jens Axboe
  0 siblings, 2 replies; 12+ messages in thread
From: Dave Chinner @ 2023-10-24  1:12 UTC (permalink / raw)
  To: Andres Freund
  Cc: Theodore Ts'o, Thorsten Leemhuis, Shreeya Patel, linux-ext4,
	Ricardo Cañuelo, gustavo.padovan, zsm, garrick,
	Linux regressions mailing list, Jens Axboe, io-uring

[cc Jens, io-uring]

On Tue, Oct 17, 2023 at 07:50:09PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2023-10-17 20:43:35 -0400, Theodore Ts'o wrote:
> > On Mon, Oct 16, 2023 at 08:37:25PM -0700, Andres Freund wrote:
> > > I just was able to reproduce the issue, after upgrading to 6.6-rc6 - this time
> > > it took ~55min of high load (io_uring using branch of postgres, running a
> > > write heavy transactional workload concurrently with concurrent bulk data
> > > load) to trigger the issue.
> > >
> > > For now I have left the system running, in case there's something you would
> > > like me to check while the system is hung.
> > >
> > > The first hanging task that I observed:
> > >
> > > cat /proc/57606/stack
> > > [<0>] inode_dio_wait+0xd5/0x100
> > > [<0>] ext4_fallocate+0x12f/0x1040
> > > [<0>] vfs_fallocate+0x135/0x360
> > > [<0>] __x64_sys_fallocate+0x42/0x70
> > > [<0>] do_syscall_64+0x38/0x80
> > > [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >
> > This stack trace is from some process (presumably postgres) trying to
> > do a fallocate() system call:
> 
> Yes, it's indeed postgres.
> 
> 
> > > [ 3194.579297] INFO: task iou-wrk-58004:58874 blocked for more than 122 seconds.
> > > [ 3194.579304]       Not tainted 6.6.0-rc6-andres-00001-g01edcfe38260 #77
> > > [ 3194.579310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [ 3194.579314] task:iou-wrk-58004   state:D stack:0     pid:58874 ppid:52606  flags:0x00004000
> > > [ 3194.579325] Call Trace:
> > > [ 3194.579329]  <TASK>
> > > [ 3194.579334]  __schedule+0x388/0x13e0
> > > [ 3194.579349]  schedule+0x5f/0xe0
> > > [ 3194.579361]  schedule_preempt_disabled+0x15/0x20
> > > [ 3194.579374]  rwsem_down_read_slowpath+0x26e/0x4c0
> > > [ 3194.579385]  down_read+0x44/0xa0
> > > [ 3194.579393]  ext4_file_write_iter+0x432/0xa80
> > > [ 3194.579407]  io_write+0x129/0x420
> >
> > This could potentially be an interesting stack trace; but this is where
> > we really need to map the stack address to line numbers.  Is that
> > something you could do?
> 
> Converting the above with scripts/decode_stacktrace.sh yields:
> 
> __schedule (kernel/sched/core.c:5382 kernel/sched/core.c:6695)
> schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 13) kernel/sched/core.c:6772 (discriminator 13))
> schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 (discriminator 10) kernel/sched/core.c:6831 (discriminator 10))
> rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073 (discriminator 4))
> down_read (./arch/x86/include/asm/preempt.h:95 kernel/locking/rwsem.c:1257 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
> ext4_file_write_iter (./include/linux/fs.h:1073 fs/ext4/file.c:57 fs/ext4/file.c:564 fs/ext4/file.c:715)
> io_write (./include/linux/fs.h:1956 io_uring/rw.c:926)
> 
> But I'm not sure that that stacktrace is quite right (e.g. what's
> ./include/linux/fs.h:1073 doing in this stacktrace?). But with an optimized
> build it's a bit hard to tell what's an optimization artifact and what's an
> off stack trace...
> 
> 
> I had the idea to look at the stacks (via /proc/$pid/stack) for all postgres
> processes and the associated io-uring threads, and then to deduplicate them.
> 
> 22x:
> ext4_file_write_iter (./include/linux/fs.h:1073 fs/ext4/file.c:57 fs/ext4/file.c:564 fs/ext4/file.c:715)
> io_write (./include/linux/fs.h:1956 io_uring/rw.c:926)
> io_issue_sqe (io_uring/io_uring.c:1878)
> io_wq_submit_work (io_uring/io_uring.c:1960)
> io_worker_handle_work (io_uring/io-wq.c:596)
> io_wq_worker (io_uring/io-wq.c:258 io_uring/io-wq.c:648)
> ret_from_fork (arch/x86/kernel/process.c:147)
> ret_from_fork_asm (arch/x86/entry/entry_64.S:312)

io_uring does some interesting stuff with IO completion and iomap
now - IIRC IOCB_DIO_CALLER_COMP is new 6.6-rc1 functionality. This
flag is set by io_write() to tell the iomap IO completion handler
that the caller will run the IO completion, avoiding a context switch
to run the completion in a kworker thread.

It's not until the caller runs iocb->dio_complete() that
inode_dio_end() is called to drop the i_dio_count. This happens when
io_uring gets completion notification from iomap via
iocb->ki_complete(iocb, 0);

It then requires the io_uring layer to process the completion
and call iocb->dio_complete() before the inode->i_dio_count is
dropped and inode_dio_wait() will complete.

So what I suspect here is that we have a situation where the worker
that would run the completion is blocked on the inode rwsem because
this isn't an IOCB_NOWAIT IO and the fallocate call holds the rwsem
exclusive and is waiting on inode_dio_wait() to return.

Cc Jens, because I'm not sure this is actually an ext4 problem - I
can't see why XFS won't have the same issue w.r.t.
truncate/fallocate and IOCB_DIO_CALLER_COMP delaying the
inode_dio_end() until whenever the io_uring code can get around to
processing the delayed completion....
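
To make the ordering concrete, here is a rough userspace analogy of
that sequence (purely illustrative - dio_count, submit_dio_caller_comp()
and friends are made-up stand-ins, not kernel or liburing APIs): the
count is raised at submission, the work that would drop it is handed
back to the submitting task, and that task then blocks waiting for the
count to reach zero before it ever gets a chance to run the deferred
work:

#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

/* stand-in for inode->i_dio_count */
static atomic_int dio_count;
/* stand-in for iocb->dio_complete, handed back to the submitter */
static void (*deferred_completion)(void);

static void dio_complete(void)
{
	/* inode_dio_end(): drop the count, allowing waiters to proceed */
	atomic_fetch_sub(&dio_count, 1);
}

static void submit_dio_caller_comp(void)
{
	/* inode_dio_begin(): raise the count for the in-flight DIO */
	atomic_fetch_add(&dio_count, 1);
	/* IOCB_DIO_CALLER_COMP: completion deferred to this same task */
	deferred_completion = dio_complete;
}

static void fallocate_like_op(void)
{
	/* inode_dio_wait(): block until the count drops to zero */
	while (atomic_load(&dio_count))
		sleep(1);
}

int main(void)
{
	submit_dio_caller_comp();
	fallocate_like_op();	/* hangs: the only thing that would   */
	deferred_completion();	/* drop the count runs after the wait */
	printf("never reached\n");
	return 0;
}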

-Dave.

> 
> 8x:
> __do_sys_io_uring_enter (io_uring/io_uring.c:2553 (discriminator 1) io_uring/io_uring.c:2622 (discriminator 1) io_uring/io_uring.c:3706 (discriminator 1))
> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
> 
> 2x:
> io_wq_worker (./arch/x86/include/asm/current.h:41 (discriminator 5) io_uring/io-wq.c:668 (discriminator 5))
> ret_from_fork (arch/x86/kernel/process.c:147)
> ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
> 
> 2x:
> futex_wait_queue (./arch/x86/include/asm/current.h:41 (discriminator 5) kernel/futex/waitwake.c:357 (discriminator 5))
> futex_wait (kernel/futex/waitwake.c:660)
> do_futex (kernel/futex/syscalls.c:106 (discriminator 1))
> __x64_sys_futex (kernel/futex/syscalls.c:183 kernel/futex/syscalls.c:164 kernel/futex/syscalls.c:164)
> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
> 
> 1x:
> do_epoll_wait (fs/eventpoll.c:1921 (discriminator 1) fs/eventpoll.c:2318 (discriminator 1))
> __x64_sys_epoll_wait (fs/eventpoll.c:2325)
> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
> 
> 1x:
> inode_dio_wait (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:444 ./include/linux/atomic/atomic-instrumented.h:33 fs/inode.c:2429 fs/inode.c:2446)
> ext4_fallocate (fs/ext4/extents.c:4752)
> vfs_fallocate (fs/open.c:324)
> __x64_sys_fallocate (./include/linux/file.h:45 fs/open.c:348 fs/open.c:355 fs/open.c:353 fs/open.c:353)
> do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
> 
> 
> 
> 
> > > Once I hear that you don't want me to test something out on the running
> > > system, I think a sensible next step could be to compile with lockdep and see
> > > if that finds a problem?
> >
> > That's certainly a possibility.  But also please make sure that you can
> > compile with debugging information enabled so that we can get
> > reliable line numbers.  I use:
> >
> > CONFIG_DEBUG_INFO=y
> > CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
> > CONFIG_DEBUG_INFO_REDUCED=y
> 
> The kernel from above had debug info enabled, albeit forced to dwarf4 (IIRC I
> ran into issues with pahole not understanding all of dwarf5):
> 
> $ zgrep DEBUG_INFO /proc/config.gz
> CONFIG_DEBUG_INFO=y
> # CONFIG_DEBUG_INFO_NONE is not set
> # CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT is not set
> CONFIG_DEBUG_INFO_DWARF4=y
> # CONFIG_DEBUG_INFO_DWARF5 is not set
> # CONFIG_DEBUG_INFO_REDUCED is not set
> CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
> # CONFIG_DEBUG_INFO_COMPRESSED_ZLIB is not set
> # CONFIG_DEBUG_INFO_COMPRESSED_ZSTD is not set
> # CONFIG_DEBUG_INFO_SPLIT is not set
> CONFIG_DEBUG_INFO_BTF=y
> CONFIG_DEBUG_INFO_BTF_MODULES=y
> 
> I switched it to CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y and am rebuilding
> with lockdep enabled. Curious to see how long it'll take to trigger the
> problem with it enabled...
> 
> Greetings,
> 
> Andres Freund
> 

-- 
Dave Chinner
[email protected]


* Re: task hung in ext4_fallocate #2
  2023-10-24  1:12     ` task hung in ext4_fallocate #2 Dave Chinner
@ 2023-10-24  1:36       ` Andres Freund
  2023-10-24 14:30       ` Jens Axboe
  1 sibling, 0 replies; 12+ messages in thread
From: Andres Freund @ 2023-10-24  1:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Ts'o, Thorsten Leemhuis, Shreeya Patel, linux-ext4,
	Ricardo Cañuelo, gustavo.padovan, zsm, garrick,
	Linux regressions mailing list, Jens Axboe, io-uring

Hi,

On 2023-10-24 12:12:23 +1100, Dave Chinner wrote:
> [cc Jens, io-uring]

Yes, good call, had been meaning to do that but forgot.

> On Tue, Oct 17, 2023 at 07:50:09PM -0700, Andres Freund wrote:
> > I had the idea to look at the stacks (via /proc/$pid/stack) for all postgres
> > processes and the associated io-uring threads, and then to deduplicate them.
> > 
> > 22x:
> > ext4_file_write_iter (./include/linux/fs.h:1073 fs/ext4/file.c:57 fs/ext4/file.c:564 fs/ext4/file.c:715)
> > io_write (./include/linux/fs.h:1956 io_uring/rw.c:926)
> > io_issue_sqe (io_uring/io_uring.c:1878)
> > io_wq_submit_work (io_uring/io_uring.c:1960)
> > io_worker_handle_work (io_uring/io-wq.c:596)
> > io_wq_worker (io_uring/io-wq.c:258 io_uring/io-wq.c:648)
> > ret_from_fork (arch/x86/kernel/process.c:147)
> > ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
> 
> io_uring does some interesting stuff with IO completion and iomap
> now - IIRC IOCB_DIO_CALLER_COMP is new 6.6-rc1 functionality. This
> flag is set by io_write() to tell the iomap IO completion handler
> that the caller will run the IO completion, avoiding a context switch
> to run the completion in a kworker thread.
> 
> It's not until the caller runs iocb->dio_complete() that
> inode_dio_end() is called to drop the i_dio_count. This happens when
> io_uring gets completion notification from iomap via
> iocb->ki_complete(iocb, 0);
> 
> It then requires the io_uring layer to process the completion
> and call iocb->dio_complete() before the inode->i_dio_count is
> dropped and inode_dio_wait() will complete.
> 
> So what I suspect here is that we have a situation where the worker
> that would run the completion is blocked on the inode rwsem because
> this isn't an IOCB_NOWAIT IO and the fallocate call holds the rwsem
> exclusive and is waiting on inode_dio_wait() to return.
> 
> Cc Jens, because I'm not sure this is actually an ext4 problem - I
> can't see why XFS won't have the same issue w.r.t.
> truncate/fallocate and IOCB_DIO_CALLER_COMP delaying the
> inode_dio_end() until whenever the io_uring code can get around to
> processing the delayed completion....

The absence of a reproducer obviously is not proof of an absence of
problems. With that said, I have run the workload on xfs for far longer on
both bigger and smaller machines, without triggering the same symptoms.

Greetings,

Andres Freund


* Re: task hung in ext4_fallocate #2
  2023-10-24  1:12     ` task hung in ext4_fallocate #2 Dave Chinner
  2023-10-24  1:36       ` Andres Freund
@ 2023-10-24 14:30       ` Jens Axboe
  2023-10-24 18:35         ` Jens Axboe
  1 sibling, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2023-10-24 14:30 UTC (permalink / raw)
  To: Dave Chinner, Andres Freund
  Cc: Theodore Ts'o, Thorsten Leemhuis, Shreeya Patel, linux-ext4,
	Ricardo Cañuelo, gustavo.padovan, zsm, garrick,
	Linux regressions mailing list, io-uring

On 10/23/23 7:12 PM, Dave Chinner wrote:
> [cc Jens, io-uring]

Thanks!

> On Tue, Oct 17, 2023 at 07:50:09PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2023-10-17 20:43:35 -0400, Theodore Ts'o wrote:
>>> On Mon, Oct 16, 2023 at 08:37:25PM -0700, Andres Freund wrote:
>>>> I just was able to reproduce the issue, after upgrading to 6.6-rc6 - this time
>>>> it took ~55min of high load (io_uring using branch of postgres, running a
>>>> write heavy transactional workload concurrently with concurrent bulk data
>>>> load) to trigger the issue.
>>>>
>>>> For now I have left the system running, in case there's something you would
>>>> like me to check while the system is hung.
>>>>
>>>> The first hanging task that I observed:
>>>>
>>>> cat /proc/57606/stack
>>>> [<0>] inode_dio_wait+0xd5/0x100
>>>> [<0>] ext4_fallocate+0x12f/0x1040
>>>> [<0>] vfs_fallocate+0x135/0x360
>>>> [<0>] __x64_sys_fallocate+0x42/0x70
>>>> [<0>] do_syscall_64+0x38/0x80
>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>>
>>> This stack trace is from some process (presumably postgres) trying to
>>> do a fallocate() system call:
>>
>> Yes, it's indeed postgres.
>>
>>
>>>> [ 3194.579297] INFO: task iou-wrk-58004:58874 blocked for more than 122 seconds.
>>>> [ 3194.579304]       Not tainted 6.6.0-rc6-andres-00001-g01edcfe38260 #77
>>>> [ 3194.579310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> [ 3194.579314] task:iou-wrk-58004   state:D stack:0     pid:58874 ppid:52606  flags:0x00004000
>>>> [ 3194.579325] Call Trace:
>>>> [ 3194.579329]  <TASK>
>>>> [ 3194.579334]  __schedule+0x388/0x13e0
>>>> [ 3194.579349]  schedule+0x5f/0xe0
>>>> [ 3194.579361]  schedule_preempt_disabled+0x15/0x20
>>>> [ 3194.579374]  rwsem_down_read_slowpath+0x26e/0x4c0
>>>> [ 3194.579385]  down_read+0x44/0xa0
>>>> [ 3194.579393]  ext4_file_write_iter+0x432/0xa80
>>>> [ 3194.579407]  io_write+0x129/0x420
>>>
>>> This could potentially be an interesting stack trace; but this is where
>>> we really need to map the stack address to line numbers.  Is that
>>> something you could do?
>>
>> Converting the above with scripts/decode_stacktrace.sh yields:
>>
>> __schedule (kernel/sched/core.c:5382 kernel/sched/core.c:6695)
>> schedule (./arch/x86/include/asm/preempt.h:85 (discriminator 13) kernel/sched/core.c:6772 (discriminator 13))
>> schedule_preempt_disabled (./arch/x86/include/asm/preempt.h:80 (discriminator 10) kernel/sched/core.c:6831 (discriminator 10))
>> rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073 (discriminator 4))
>> down_read (./arch/x86/include/asm/preempt.h:95 kernel/locking/rwsem.c:1257 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
>> ext4_file_write_iter (./include/linux/fs.h:1073 fs/ext4/file.c:57 fs/ext4/file.c:564 fs/ext4/file.c:715)
>> io_write (./include/linux/fs.h:1956 io_uring/rw.c:926)
>>
>> But I'm not sure that that stacktrace is quite right (e.g. what's
>> ./include/linux/fs.h:1073 doing in this stacktrace?). But with an optimized
>> build it's a bit hard to tell what's an optimization artifact and what's an
>> off stack trace...
>>
>>
>> I had the idea to look at the stacks (via /proc/$pid/stack) for all postgres
>> processes and the associated io-uring threads, and then to deduplicate them.
>>
>> 22x:
>> ext4_file_write_iter (./include/linux/fs.h:1073 fs/ext4/file.c:57 fs/ext4/file.c:564 fs/ext4/file.c:715)
>> io_write (./include/linux/fs.h:1956 io_uring/rw.c:926)
>> io_issue_sqe (io_uring/io_uring.c:1878)
>> io_wq_submit_work (io_uring/io_uring.c:1960)
>> io_worker_handle_work (io_uring/io-wq.c:596)
>> io_wq_worker (io_uring/io-wq.c:258 io_uring/io-wq.c:648)
>> ret_from_fork (arch/x86/kernel/process.c:147)
>> ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
> 
> io_uring does some interesting stuff with IO completion and iomap
> now - IIRC IOCB_DIO_CALLER_COMP is new 6.6-rc1 functionality. This
> flag is set by io_write() to tell the iomap IO completion handler
> that the caller will run the IO completion, avoiding a context switch
> to run the completion in a kworker thread.
> 
> It's not until the caller runs iocb->dio_complete() that
> inode_dio_end() is called to drop the i_dio_count. This happens when
> io_uring gets completion notification from iomap via
> iocb->ki_complete(iocb, 0);
> 
> It then requires the io_uring layer to process the completion
> and call iocb->dio_complete() before the inode->i_dio_count is
> dropped and inode_dio_wait() will complete.
> 
> So what I suspect here is that we have a situation where the worker
> that would run the completion is blocked on the inode rwsem because
> this isn't an IOCB_NOWAIT IO and the fallocate call holds the rwsem
> exclusive and is waiting on inode_dio_wait() to return.
> 
> Cc Jens, because I'm not sure this is actually an ext4 problem - I
> can't see why XFS won't have the same issue w.r.t.
> truncate/fallocate and IOCB_DIO_CALLER_COMP delaying the
> inode_dio_end() until whenever the io_uring code can get around to
> processing the delayed completion....

I don't think this is related to the io-wq workers doing non-blocking
IO. The callback is eventually executed by the task that originally
submitted the IO, which is the owner and not the async workers. But...
If that original task is blocked in eg fallocate, then I can see how
that would potentially be an issue.

I'll take a closer look.

-- 
Jens Axboe



* Re: task hung in ext4_fallocate #2
  2023-10-24 14:30       ` Jens Axboe
@ 2023-10-24 18:35         ` Jens Axboe
  2023-10-25  0:06           ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2023-10-24 18:35 UTC (permalink / raw)
  To: Dave Chinner, Andres Freund
  Cc: Theodore Ts'o, Thorsten Leemhuis, Shreeya Patel, linux-ext4,
	Ricardo Cañuelo, gustavo.padovan, zsm, garrick,
	Linux regressions mailing list, io-uring

On 10/24/23 8:30 AM, Jens Axboe wrote:
> I don't think this is related to the io-wq workers doing non-blocking
> IO. The callback is eventually executed by the task that originally
> submitted the IO, which is the owner and not the async workers. But...
> If that original task is blocked in eg fallocate, then I can see how
> that would potentially be an issue.
> 
> I'll take a closer look.

I think the best way to fix this is likely to have inode_dio_wait() be
interruptible, and return -ERESTARTSYS if it should be restarted. Now
the below is obviously not a full patch, but I suspect it'll make ext4
and xfs tick, because they should both be affected.

Andres, any chance you can throw this into the testing mix?


diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 202c76996b62..0d946b6d36fe 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4747,7 +4747,9 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	}
 
 	/* Wait all existing dio workers, newcomers will block on i_rwsem */
-	inode_dio_wait(inode);
+	ret = inode_dio_wait(inode);
+	if (ret)
+		goto out;
 
 	ret = file_modified(file);
 	if (ret)
diff --git a/fs/inode.c b/fs/inode.c
index 84bc3c76e5cc..c4eca812b16b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2417,17 +2417,24 @@ EXPORT_SYMBOL(inode_owner_or_capable);
 /*
  * Direct i/o helper functions
  */
-static void __inode_dio_wait(struct inode *inode)
+static int __inode_dio_wait(struct inode *inode)
 {
 	wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP);
 	DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP);
+	int ret = 0;
 
 	do {
-		prepare_to_wait(wq, &q.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (atomic_read(&inode->i_dio_count))
-			schedule();
+		prepare_to_wait(wq, &q.wq_entry, TASK_INTERRUPTIBLE);
+		if (!atomic_read(&inode->i_dio_count))
+			break;
+		schedule();
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
 	} while (atomic_read(&inode->i_dio_count));
 	finish_wait(wq, &q.wq_entry);
+	return ret;
 }
 
 /**
@@ -2440,10 +2447,11 @@ static void __inode_dio_wait(struct inode *inode)
  * Must be called under a lock that serializes taking new references
  * to i_dio_count, usually by inode->i_mutex.
  */
-void inode_dio_wait(struct inode *inode)
+int inode_dio_wait(struct inode *inode)
 {
 	if (atomic_read(&inode->i_dio_count))
-		__inode_dio_wait(inode);
+		return __inode_dio_wait(inode);
+	return 0;
 }
 EXPORT_SYMBOL(inode_dio_wait);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 203700278ddb..8ea0c414b173 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -936,7 +936,9 @@ xfs_file_fallocate(
 	 * the on disk and in memory inode sizes, and the operations that follow
 	 * require the in-memory size to be fully up-to-date.
 	 */
-	inode_dio_wait(inode);
+	error = inode_dio_wait(inode);
+	if (error)
+		goto out_unlock;
 
 	/*
 	 * Now AIO and DIO has drained we flush and (if necessary) invalidate
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4a40823c3c67..7dff3167cb0c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2971,7 +2971,7 @@ static inline ssize_t blockdev_direct_IO(struct kiocb *iocb,
 }
 #endif
 
-void inode_dio_wait(struct inode *inode);
+int inode_dio_wait(struct inode *inode);
 
 /**
  * inode_dio_begin - signal start of a direct I/O requests

-- 
Jens Axboe



* Re: task hung in ext4_fallocate #2
  2023-10-24 18:35         ` Jens Axboe
@ 2023-10-25  0:06           ` Dave Chinner
  2023-10-25  0:34             ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2023-10-25  0:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andres Freund, Theodore Ts'o, Thorsten Leemhuis,
	Shreeya Patel, linux-ext4, Ricardo Cañuelo, gustavo.padovan,
	zsm, garrick, Linux regressions mailing list, io-uring

On Tue, Oct 24, 2023 at 12:35:26PM -0600, Jens Axboe wrote:
> On 10/24/23 8:30 AM, Jens Axboe wrote:
> > I don't think this is related to the io-wq workers doing non-blocking
> > IO.

The io-wq worker that has deadlocked _must_ be doing blocking IO. If
it was doing non-blocking IO (i.e. IOCB_NOWAIT) then it would have
done a trylock and returned -EAGAIN to the worker for it to try
again later. I'm not sure that would avoid the issue, however - it
seems to me like it might just turn it into a livelock rather than a
deadlock....

> > The callback is eventually executed by the task that originally
> > submitted the IO, which is the owner and not the async workers. But...
> > If that original task is blocked in eg fallocate, then I can see how
> > that would potentially be an issue.
> > 
> > I'll take a closer look.
> 
> I think the best way to fix this is likely to have inode_dio_wait() be
> interruptible, and return -ERESTARTSYS if it should be restarted. Now
> the below is obviously not a full patch, but I suspect it'll make ext4
> and xfs tick, because they should both be affected.

How does that solve the problem? Nothing will issue a signal to the
process that is waiting in inode_dio_wait() except userspace, so I
can't see how this does anything to solve the problem at hand...

I'm also very leery of adding new error handling complexity to paths
like truncate, extent cloning, fallocate, etc which expect to block
on locks until they can perform the operation safely.

On further thinking, this could be a self deadlock with
just async direct IO submission - submit an async DIO with
IOCB_CALLER_COMP, then run an unaligned async DIO that attempts to
drain in-flight DIO before continuing. Then the thread waits in
inode_dio_wait() because it can't run the completion that will drop
the i_dio_count to zero.

Hence it appears to me that we've missed some critical constraints
around nesting IO submission and completion when using
IOCB_CALLER_COMP. Further, it really isn't clear to me how deep the
scope of this problem is yet, let alone what the solution might be.

With all this in mind, and how late this is in the 6.6 cycle, can we
just revert the IOCB_CALLER_COMP changes for now?

-Dave.
-- 
Dave Chinner
[email protected]


* Re: task hung in ext4_fallocate #2
  2023-10-25  0:06           ` Dave Chinner
@ 2023-10-25  0:34             ` Jens Axboe
  2023-10-25 15:31               ` Andres Freund
  2023-10-25 22:28               ` Dave Chinner
  0 siblings, 2 replies; 12+ messages in thread
From: Jens Axboe @ 2023-10-25  0:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andres Freund, Theodore Ts'o, Thorsten Leemhuis,
	Shreeya Patel, linux-ext4, Ricardo Cañuelo, gustavo.padovan,
	zsm, garrick, Linux regressions mailing list, io-uring

On 10/24/23 6:06 PM, Dave Chinner wrote:
> On Tue, Oct 24, 2023 at 12:35:26PM -0600, Jens Axboe wrote:
>> On 10/24/23 8:30 AM, Jens Axboe wrote:
>>> I don't think this is related to the io-wq workers doing non-blocking
>>> IO.
> 
> The io-wq worker that has deadlocked _must_ be doing blocking IO. If
> it was doing non-blocking IO (i.e. IOCB_NOWAIT) then it would have
> done a trylock and returned -EAGAIN to the worker for it to try
> again later. I'm not sure that would avoid the issue, however - it
> seems to me like it might just turn it into a livelock rather than a
> deadlock....

Sorry typo, yes they are doing blocking IO, that's all they ever do. My
point is that it's not related to the issue.

>>> The callback is eventually executed by the task that originally
>>> submitted the IO, which is the owner and not the async workers. But...
>>> If that original task is blocked in eg fallocate, then I can see how
>>> that would potentially be an issue.
>>>
>>> I'll take a closer look.
>>
>> I think the best way to fix this is likely to have inode_dio_wait() be
>> interruptible, and return -ERESTARTSYS if it should be restarted. Now
>> the below is obviously not a full patch, but I suspect it'll make ext4
>> and xfs tick, because they should both be affected.
> 
> How does that solve the problem? Nothing will issue a signal to the
> process that is waiting in inode_dio_wait() except userspace, so I
> can't see how this does anything to solve the problem at hand...

Except task_work, which when it completes, will drop the i_dio
count again. This is the whole point of the half assed patch I sent out.

> I'm also very leary of adding new error handling complexity to paths
> like truncate, extent cloning, fallocate, etc which expect to block
> on locks until they can perform the operation safely.

I actually looked at all of them, ext4 and xfs specifically. It really
doesn't seem too bad.

> On further thinking, this could be a self deadlock with
> just async direct IO submission - submit an async DIO with
> IOCB_CALLER_COMP, then run an unaligned async DIO that attempts to
> drain in-flight DIO before continuing. Then the thread waits in
> inode_dio_wait() because it can't run the completion that will drop
> the i_dio_count to zero.

No, because those will be non-blocking. Any blocking IO will go via
io-wq, and that won't then hit the deadlock. If you're doing
inode_dio_wait() from the task itself for a non-blocking issue, then
that would surely be an issue. But we should not be doing that, and we
are checking for it.

> Hence it appears to me that we've missed some critical constraints
> around nesting IO submission and completion when using
> IOCB_CALLER_COMP. Further, it really isn't clear to me how deep the
> scope of this problem is yet, let alone what the solution might be.

I think you're missing exactly what the deadlock is.

> With all this in mind, and how late this is in the 6.6 cycle, can we
> just revert the IOCB_CALLER_COMP changes for now?

Yeah I'm going to do a revert of the io_uring side, which effectively
disables it. Then a revised series can be done, and when done, we could
bring it back.

-- 
Jens Axboe



* Re: task hung in ext4_fallocate #2
  2023-10-25  0:34             ` Jens Axboe
@ 2023-10-25 15:31               ` Andres Freund
  2023-10-25 15:36                 ` Jens Axboe
  2023-10-25 22:28               ` Dave Chinner
  1 sibling, 1 reply; 12+ messages in thread
From: Andres Freund @ 2023-10-25 15:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dave Chinner, Theodore Ts'o, Thorsten Leemhuis, Shreeya Patel,
	linux-ext4, Ricardo Cañuelo, gustavo.padovan, zsm, garrick,
	Linux regressions mailing list, io-uring

Hi,

On 2023-10-24 18:34:05 -0600, Jens Axboe wrote:
> Yeah I'm going to do a revert of the io_uring side, which effectively
> disables it. Then a revised series can be done, and when done, we could
> bring it back.

I'm queueing a test to confirm that the revert actually fixes things. Is there
still benefit in testing your other patch in addition upstream?

Greetings,

Andres Freund


* Re: task hung in ext4_fallocate #2
  2023-10-25 15:31               ` Andres Freund
@ 2023-10-25 15:36                 ` Jens Axboe
  2023-10-25 16:14                   ` Andres Freund
  2023-10-25 19:55                   ` Theodore Ts'o
  0 siblings, 2 replies; 12+ messages in thread
From: Jens Axboe @ 2023-10-25 15:36 UTC (permalink / raw)
  To: Andres Freund
  Cc: Dave Chinner, Theodore Ts'o, Thorsten Leemhuis, Shreeya Patel,
	linux-ext4, Ricardo Cañuelo, gustavo.padovan, zsm, garrick,
	Linux regressions mailing list, io-uring

On 10/25/23 9:31 AM, Andres Freund wrote:
> Hi,
> 
> On 2023-10-24 18:34:05 -0600, Jens Axboe wrote:
>> Yeah I'm going to do a revert of the io_uring side, which effectively
>> disables it. Then a revised series can be done, and when done, we could
>> bring it back.
> 
> I'm queueing a test to confirm that the revert actually fixes things.
> Is there still benefit in testing your other patch in addition
> upstream?

Don't think there's much point to testing the quick hack, I believe it
should work. So testing the most recent revert is useful, though I also
fully expect that to work. And then we can test the re-enable once that
is sent out, I did prepare a series. But timing is obviously unfortunate
for that, as it'll miss 6.6 and now also 6.7 due to the report timing.

FWIW, I wrote a small test case which does seem to trigger it very fast,
as expected:
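
/*
 * To build (assuming liburing is installed):
 *   gcc -O2 -o fallocate-hang fallocate-hang.c -luring
 * then run it against a file on the affected ext4 filesystem.
 */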

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <liburing.h>

#define BS	4096
#define FSIZE	(128 * 1024 * 1024UL)

static int set_file_size(int fd, off_t file_size)
{
	off_t this_size;
	char buf[BS];
	int ret;

	memset(buf, 0, BS);
	this_size = 0;
	while (this_size < file_size) {
		ret = write(fd, buf, BS);
		if (ret != BS) {
			fprintf(stderr, "write ret %d\n", ret);
			return 1;
		}
		this_size += BS;
	}
	fsync(fd);
	posix_fadvise(fd, 0, file_size, POSIX_FADV_DONTNEED);
	return 0;
}

int main(int argc, char *argv[])
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct io_uring ring;
	off_t off, foff;
	int fd, i, ret;
	void *buf;

	if (argc < 2) {
		fprintf(stderr, "%s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (set_file_size(fd, FSIZE))
		return 1;

	if (posix_memalign(&buf, 4096, BS))
		return 1;

	io_uring_queue_init(8, &ring, 0);

	i = 0;
	off = 0;
	foff = FSIZE + BS;
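	/*
	 * Each iteration queues one O_DIRECT write via io_uring, then calls
	 * posix_fallocate() on the same file *before* reaping the CQE.
	 * fallocate() takes the inode lock and sits in inode_dio_wait(),
	 * but with IOCB_DIO_CALLER_COMP the completion that drops
	 * i_dio_count is deferred back to this task and only runs once it
	 * processes the CQE - which it can't do while stuck in fallocate.
	 */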
	do {
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_write(sqe, fd, buf, BS, off);
		off += BS;
		if (off == FSIZE)
			off = 0;

		io_uring_submit(&ring);

		ret = posix_fallocate(fd, 0, foff);
		if (ret) {
			fprintf(stderr, "fallocate: %s\n", strerror(ret));
			return 1;
		}
		foff += BS;

		ret = io_uring_wait_cqe(&ring, &cqe);
		if (ret) {
			fprintf(stderr, "wait cqe %d\n", ret);
			return 1;
		}

		io_uring_cqe_seen(&ring, cqe);
		i++;
		if (!(i & 1023))
			fprintf(stdout, "Loop iteration %d\n", i);
	} while (1);

	return 0;
}

-- 
Jens Axboe



* Re: task hung in ext4_fallocate #2
  2023-10-25 15:36                 ` Jens Axboe
@ 2023-10-25 16:14                   ` Andres Freund
  2023-10-26  2:48                     ` Andres Freund
  2023-10-25 19:55                   ` Theodore Ts'o
  1 sibling, 1 reply; 12+ messages in thread
From: Andres Freund @ 2023-10-25 16:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dave Chinner, Theodore Ts'o, Thorsten Leemhuis, Shreeya Patel,
	linux-ext4, Ricardo Cañuelo, gustavo.padovan, zsm, garrick,
	Linux regressions mailing list, io-uring

Hi,

On 2023-10-25 09:36:01 -0600, Jens Axboe wrote:
> On 10/25/23 9:31 AM, Andres Freund wrote:
> > Hi,
> > 
> > On 2023-10-24 18:34:05 -0600, Jens Axboe wrote:
> >> Yeah I'm going to do a revert of the io_uring side, which effectively
> >> disables it. Then a revised series can be done, and when done, we could
> >> bring it back.
> > 
> > I'm queueing a test to confirm that the revert actually fixes things.
> > Is there still benefit in testing your other patch in addition
> > upstream?
> 
> Don't think there's much point to testing the quick hack, I believe it
> should work. So testing the most recent revert is useful, though I also
> fully expect that to work.

I'll leave it running for a few hours, just to be sure.


> And then we can test the re-enable once that is sent out, I did prepare a
> series. But timing is obviously unfortunate for that, as it'll miss 6.6 and
> now also 6.7 due to the report timing.

Yea, it's too bad. I'm somewhat lucky to have hit it at all, it was just due
to having procrastinated on talk preparation and having to test things on my
laptop instead of my workstation (where the same workload never triggered the
problem), while travelling.


> FWIW, I wrote a small test case which does seem to trigger it very fast,
> as expected:

Ah, nice.

It's too bad stuff like this can't be discovered by lockdep et al..

Greetings,

Andres Freund


* Re: task hung in ext4_fallocate #2
  2023-10-25 15:36                 ` Jens Axboe
  2023-10-25 16:14                   ` Andres Freund
@ 2023-10-25 19:55                   ` Theodore Ts'o
  1 sibling, 0 replies; 12+ messages in thread
From: Theodore Ts'o @ 2023-10-25 19:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andres Freund, Dave Chinner, Thorsten Leemhuis, Shreeya Patel,
	linux-ext4, Ricardo Cañuelo, gustavo.padovan, zsm, garrick,
	Linux regressions mailing list, io-uring

On Wed, Oct 25, 2023 at 09:36:01AM -0600, Jens Axboe wrote:
> 
> FWIW, I wrote a small test case which does seem to trigger it very fast,
> as expected:

Great!  I was about to ask if we had an easy reproducer.  Any chance
you can package this up as an test in xfstests?  Or would you like
some help with that?

Thanks,

					- Ted


* Re: task hung in ext4_fallocate #2
  2023-10-25  0:34             ` Jens Axboe
  2023-10-25 15:31               ` Andres Freund
@ 2023-10-25 22:28               ` Dave Chinner
  1 sibling, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2023-10-25 22:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andres Freund, Theodore Ts'o, Thorsten Leemhuis,
	Shreeya Patel, linux-ext4, Ricardo Cañuelo, gustavo.padovan,
	zsm, garrick, Linux regressions mailing list, io-uring

On Tue, Oct 24, 2023 at 06:34:05PM -0600, Jens Axboe wrote:
> On 10/24/23 6:06 PM, Dave Chinner wrote:
> > On Tue, Oct 24, 2023 at 12:35:26PM -0600, Jens Axboe wrote:
> >> On 10/24/23 8:30 AM, Jens Axboe wrote:
> >>> I don't think this is related to the io-wq workers doing non-blocking
> >>> IO.
> > 
> > The io-wq worker that has deadlocked _must_ be doing blocking IO. If
> > it was doing non-blocking IO (i.e. IOCB_NOWAIT) then it would have
> > done a trylock and returned -EAGAIN to the worker for it to try
> > again later. I'm not sure that would avoid the issue, however - it
> > seems to me like it might just turn it into a livelock rather than a
> > deadlock....
> 
> Sorry typo, yes they are doing blocking IO, that's all they ever do. My
> point is that it's not related to the issue.
> 
> >>> The callback is eventually executed by the task that originally
> >>> submitted the IO, which is the owner and not the async workers. But...
> >>> If that original task is blocked in eg fallocate, then I can see how
> >>> that would potentially be an issue.
> >>>
> >>> I'll take a closer look.
> >>
> >> I think the best way to fix this is likely to have inode_dio_wait() be
> >> interruptible, and return -ERESTARTSYS if it should be restarted. Now
> >> the below is obviously not a full patch, but I suspect it'll make ext4
> >> and xfs tick, because they should both be affected.
> > 
> > How does that solve the problem? Nothing will issue a signal to the
> > process that is waiting in inode_dio_wait() except userspace, so I
> > can't see how this does anything to solve the problem at hand...
> 
> Except task_work, which when it completes, will drop the i_dio
> count again. This is the whole point of the half assed patch I sent out.

What task_work is that?  When does that actually run?

Please don't assume that everyone is intimately familiar with the
subtle complexities of io_uring infrastructure - if the fix relies
on a signal from -somewhere- then you need to explain where
that signal comes from and why we should be able to rely on that...

> 
> > I'm also very leery of adding new error handling complexity to paths
> > like truncate, extent cloning, fallocate, etc which expect to block
> > on locks until they can perform the operation safely.
> 
> I actually looked at all of them, ext4 and xfs specifically. It really
> doesn't seem too bad.
> 
> > On further thinking, this could be a self deadlock with
> > just async direct IO submission - submit an async DIO with
> > IOCB_CALLER_COMP, then run an unaligned async DIO that attempts to
> > drain in-flight DIO before continuing. Then the thread waits in
> > inode_dio_wait() because it can't run the completion that will drop
> > the i_dio_count to zero.
> 
> No, because those will be non-blocking. Any blocking IO will go via
> io-wq, and that won't then hit the deadlock. If you're doing
> inode_dio_wait() from the task itself for a non-blocking issue, then
> that would surely be an issue. But we should not be doing that, and we
> are checking for it.

There's no documentation that says IO submission inside a
IOCB_DIO_CALLER_COMP context must be IOCB_NOWAIT.

I don't recall it being mentioned during patch submission or review,
and if it was, the implications certainly didn't register with me -
I would not have given a rvb without such a landmine either being
removed or very well documented.

I don't see anywhere that is checked and I don't see how it can be,
because the filesystem IO submission path itself has no idea if the
caller is already has a IOCB_DIO_CALLER_COMP IO in flight and
pending completion.

> > Hence it appears to me that we've missed some critical constraints
> > around nesting IO submission and completion when using
> > IOCB_CALLER_COMP. Further, it really isn't clear to me how deep the
> > scope of this problem is yet, let alone what the solution might be.
> 
> I think you're missing exactly what the deadlock is.

Then you need to explain exactly what it is, not send undocumented
hacks that appear to do absolutely nothing to fix the problem.

> > With all this in mind, and how late this is in the 6.6 cycle, can we
> > just revert the IOCB_CALLER_COMP changes for now?
> 
> Yeah I'm going to do a revert of the io_uring side, which effectively
> disables it. Then a revised series can be done, and when done, we could
> bring it back.

Please revert the whole lot, I'm now unconvinced that this is
functionality we can sanely support at the filesystem level without
a whole lot more thought.

-Dave.
-- 
Dave Chinner
[email protected]


* Re: task hung in ext4_fallocate #2
  2023-10-25 16:14                   ` Andres Freund
@ 2023-10-26  2:48                     ` Andres Freund
  0 siblings, 0 replies; 12+ messages in thread
From: Andres Freund @ 2023-10-26  2:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dave Chinner, Theodore Ts'o, Thorsten Leemhuis, Shreeya Patel,
	linux-ext4, Ricardo Cañuelo, gustavo.padovan, zsm, garrick,
	Linux regressions mailing list, io-uring

Hi,

On 2023-10-25 09:14:05 -0700, Andres Freund wrote:
> On 2023-10-25 09:36:01 -0600, Jens Axboe wrote:
> > On 10/25/23 9:31 AM, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2023-10-24 18:34:05 -0600, Jens Axboe wrote:
> > >> Yeah I'm going to do a revert of the io_uring side, which effectively
> > >> disables it. Then a revised series can be done, and when done, we could
> > >> bring it back.
> > >
> > > I'm queueing a test to confirm that the revert actually fixes things.
> > > Is there still benefit in testing your other patch in addition
> > > upstream?
> >
> > Don't think there's much point to testing the quick hack, I believe it
> > should work. So testing the most recent revert is useful, though I also
> > fully expect that to work.
>
> I'll leave it running for a few hours, just to be sure.

Quite a few hours and > 15TBW written later, no hang....

Greetings,

Andres Freund


end of thread

Thread overview: 12+ messages
     [not found] <[email protected]>
     [not found] ` <[email protected]>
     [not found]   ` <[email protected]>
2023-10-24  1:12     ` task hung in ext4_fallocate #2 Dave Chinner
2023-10-24  1:36       ` Andres Freund
2023-10-24 14:30       ` Jens Axboe
2023-10-24 18:35         ` Jens Axboe
2023-10-25  0:06           ` Dave Chinner
2023-10-25  0:34             ` Jens Axboe
2023-10-25 15:31               ` Andres Freund
2023-10-25 15:36                 ` Jens Axboe
2023-10-25 16:14                   ` Andres Freund
2023-10-26  2:48                     ` Andres Freund
2023-10-25 19:55                   ` Theodore Ts'o
2023-10-25 22:28               ` Dave Chinner
