public inbox for [email protected]
 help / color / mirror / Atom feed
* Re: [regression, v6.0-rc0, io-uring?] filesystem freeze hangs on sb_wait_write()
       [not found] <[email protected]>
@ 2022-10-11  0:40 ` Dave Chinner
  2022-10-11  1:10   ` Pavel Begunkov
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2022-10-11  0:40 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, Jens Axboe, io-uring

On Mon, Oct 10, 2022 at 04:03:19PM +1100, Dave Chinner wrote:
> Hi Al,
> 
> Just tried to run fstests on XFS on a current Linus kernel at commit
> 493ffd6605b2 ("Merge tag 'ucount-rlimits-cleanups-for-v5.19' of
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace")
> and generic/068 hangs trying to freeze the filesystem like so:
> 
> [  163.957724] task:xfs_io          state:D stack:14680 pid: 4831 ppid:  4825 flags:0x00004000
> [  163.961425] Call Trace:
> [  163.962553]  <TASK>
> [  163.963497]  __schedule+0x2f9/0xa30
> [  163.965125]  ? percpu_down_write+0x60/0x190
> [  163.966888]  schedule+0x5a/0xc0
> [  163.968206]  percpu_down_write+0xe8/0x190
> [  163.969865]  freeze_super+0x78/0x170
> [  163.971247]  __x64_sys_ioctl+0x61/0xb0
> [  163.973947]  do_syscall_64+0x35/0x80
> [  163.975119]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [  163.976781] RIP: 0033:0x7ff78910bb07
> [  163.978028] RSP: 002b:00007ffefe7279b8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
> [  163.980734] RAX: ffffffffffffffda RBX: 000055a88c183ad0 RCX: 00007ff78910bb07
> [  163.983346] RDX: 00007ffefe7279cc RSI: ffffffffc0045877 RDI: 0000000000000003
> [  163.985864] RBP: 0000000000000008 R08: 00007ff7891d5ba0 R09: 00007ff7891d5c00
> [  163.988371] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000001
> [  163.990704] R13: 000055a88c184270 R14: 000055a88c184fc0 R15: 000055a88c184fe0
> [  163.992964]  </TASK>
> [  163.993612] task:fsstress        state:D stack:12464 pid: 4832 ppid:  4822 flags:0x00000000
> [  163.996390] Call Trace:
> [  163.997310]  <TASK>
> [  163.998076]  __schedule+0x2f9/0xa30
> [  163.999323]  ? __smp_call_single_queue+0x23/0x40
> [  164.000685]  schedule+0x5a/0xc0
> [  164.001719]  ? percpu_rwsem_wait+0x123/0x150
> [  164.003139]  percpu_rwsem_wait+0x123/0x150
> [  164.004535]  ? __percpu_rwsem_trylock.part.0+0x50/0x50
> [  164.006202]  __percpu_down_read+0x5b/0x110
> [  164.007560]  mnt_want_write+0xa0/0xd0
> [  164.008816]  do_renameat2+0x17b/0x530
> [  164.010006]  ? xfs_can_free_eofblocks+0x39/0x1e0
> [  164.011558]  ? __might_fault+0x1e/0x20
> [  164.012870]  ? strncpy_from_user+0x1e/0x160
> [  164.014236]  __x64_sys_renameat2+0x4b/0x60
> [  164.015451]  do_syscall_64+0x35/0x80
> [  164.016576]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [  164.018117] RIP: 0033:0x7f3c7fe97c0f
> [  164.019124] RSP: 002b:00007fff67cf26b8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
> [  164.021202] RAX: ffffffffffffffda RBX: 00007fff67cf2980 RCX: 00007f3c7fe97c0f
> [  164.023309] RDX: 00000000ffffff9c RSI: 000055fd446825a0 RDI: 00000000ffffff9c
> [  164.025479] RBP: 00007fff67cf2990 R08: 0000000000000002 R09: 00007fff67cf2960
> [  164.027655] R10: 000055fd44682f10 R11: 0000000000000202 R12: 0000000000000079
> [  164.029721] R13: 0000000000000002 R14: 000055fd446703a0 R15: 00000000ffffffff
> [  164.031937]  </TASK>
> [  164.032675] task:fsstress        state:D stack:14224 pid: 4833 ppid:  4822 flags:0x00000000
> [  164.035086] Call Trace:
> [  164.035803]  <TASK>
> [  164.036512]  __schedule+0x2f9/0xa30
> [  164.037661]  ? __smp_call_single_queue+0x23/0x40
> [  164.039048]  schedule+0x5a/0xc0
> [  164.040076]  ? percpu_rwsem_wait+0x123/0x150
> [  164.041315]  percpu_rwsem_wait+0x123/0x150
> [  164.042458]  ? __percpu_rwsem_trylock.part.0+0x50/0x50
> [  164.043888]  __percpu_down_read+0x5b/0x110
> [  164.045097]  mnt_want_write+0xa0/0xd0
> [  164.046117]  do_fchownat+0x78/0xe0
> [  164.047060]  __x64_sys_lchown+0x1f/0x30
> [  164.048210]  do_syscall_64+0x35/0x80
> [  164.049295]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [  164.050775] RIP: 0033:0x7f3c7ff2df97
> [  164.051727] RSP: 002b:00007fff67cf2838 EFLAGS: 00000202 ORIG_RAX: 000000000000005e
> [  164.053803] RAX: ffffffffffffffda RBX: 00007fff67cf29a0 RCX: 00007f3c7ff2df97
> [  164.055773] RDX: 0000000000028dd3 RSI: 0000000000022124 RDI: 000055fd44672450
> [  164.057745] RBP: 0000000000022124 R08: 0000000000000064 R09: 00007fff67cf299c
> [  164.059668] R10: fffffffffffffb8b R11: 0000000000000202 R12: 0000000000000008
> [  164.061609] R13: 0000000000028dd3 R14: 0000000000022124 R15: 000055fd44660b50
> [  164.063578]  </TASK>
> 
> Eventually the hung task timer kicks in and reports all these
> blocked threads over and over again. The read locks look like
> sb_start_write() calls, and the freeze_super() call is waiting
> here:
> 
> 
>         sb->s_writers.frozen = SB_FREEZE_WRITE;
> 	/* Release s_umount to preserve sb_start_write -> s_umount ordering */
> 	up_write(&sb->s_umount);
> >>>>>>	sb_wait_write(sb, SB_FREEZE_WRITE);
> 	down_write(&sb->s_umount);
> 
> 
> Every test vm I ran this one failed in an identical manner. They all
> failed trying to freeze on the second iteration of the {freeze;thaw}
> loop, so whatever is going on should be easily reproducable:
> 
> # cat /tmp/9061.out
> # QA output created by 068
> *** init FS
> 
> *** iteration: 0
> *** freezing $SCRATCH_MNT
> *** thawing  $SCRATCH_MNT
> 
> *** iteration: 1
> *** freezing $SCRATCH_MNT
> #
> 
> I just tested ext4 and it hangs in an identical fashion, so whatever
> is going on is not specific to XFS - it smells like a
> mnt_want_write() or sb_want_write() imbalance somewhere in the
> generic code. I haven't had time to bisect it, so this heads-up is
> all I'll have until I get can back to this tomorrow....

Bisect points to the io-uring merge commit:

$ git bisect log
git bisect start '--' 'fs/'
# status: waiting for both good and bad commits
# bad: [cdc94798820e5cdc2fc314540ee3d28e1f2fea0e] Merge remote-tracking branch 'linux-xfs/for-next' into working
git bisect bad cdc94798820e5cdc2fc314540ee3d28e1f2fea0e
# status: waiting for good commit(s), bad commit known
# good: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
git bisect good 4fe89d07dcc2804c8b562f6c7896a45643d34b2f
# good: [cbddcc4fa3443fe8cfb2ff8e210deb1f6a0eea38] btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer
git bisect good cbddcc4fa3443fe8cfb2ff8e210deb1f6a0eea38
# good: [7f198ba7ae9874c64ffe8cd3aa60cf5dab78ce3a] Merge tag 'affs-for-6.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
git bisect good 7f198ba7ae9874c64ffe8cd3aa60cf5dab78ce3a
# good: [9f4b9beeb9cf46c4b172fca06de5bd6831108641] Merge tag '6.1-rc-ksmbd-fixes' of git://git.samba.org/ksmbd
git bisect good 9f4b9beeb9cf46c4b172fca06de5bd6831108641
# bad: [e8bc52cb8df80c31c73c726ab58ea9746e9ff734] Merge tag 'driver-core-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
git bisect bad e8bc52cb8df80c31c73c726ab58ea9746e9ff734
# bad: [513389809e138ae903b6ef43c1d5d2ffaf4dca17] Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux
git bisect bad 513389809e138ae903b6ef43c1d5d2ffaf4dca17
# good: [191249f708897fc34c78f4494f7156896aaaeca9] quota: Add more checking after reading from quota file
git bisect good 191249f708897fc34c78f4494f7156896aaaeca9
# good: [188943a15638ceb91f960e072ed7609b2d7f2a55] Merge tag 'fs-for_v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect good 188943a15638ceb91f960e072ed7609b2d7f2a55
# good: [118f3663fbc658e9ad6165e129076981c7b685c5] block: remove PSI accounting from the bio layer
git bisect good 118f3663fbc658e9ad6165e129076981c7b685c5
# bad: [0a78a376ef3c2f3d397df48909f00cd75f92137a] Merge tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux
git bisect bad 0a78a376ef3c2f3d397df48909f00cd75f92137a
# good: [9f0deaa12d832f488500a5afe9b912e9b3cfc432] eventfd: guard wake_up in eventfd fs calls as well
git bisect good 9f0deaa12d832f488500a5afe9b912e9b3cfc432
# first bad commit: [0a78a376ef3c2f3d397df48909f00cd75f92137a] Merge tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux
$

I note that there are changes to the the io_uring IO path and write
IO end accounting in the io_uring stack that was merged, and there
was no doubt about the success/failure of the reproducer at each
step. Hence I think the bisect is good, and the problem is someone
in the io-uring changes.

Jens, over to you.

The reproducer - generic/068 - is 100% reliable here, io_uring is
being exercised by fsstress in the background whilst the filesystem
is being frozen and thawed repeatedly. Some path in the io-uring
code has an unbalanced sb_start_write()/sb_end_write() pair by the
look of it....

-Dave.
-- 
Dave Chinner
[email protected]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [regression, v6.0-rc0, io-uring?] filesystem freeze hangs on sb_wait_write()
  2022-10-11  0:40 ` [regression, v6.0-rc0, io-uring?] filesystem freeze hangs on sb_wait_write() Dave Chinner
@ 2022-10-11  1:10   ` Pavel Begunkov
  2022-10-11  2:01     ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Pavel Begunkov @ 2022-10-11  1:10 UTC (permalink / raw)
  To: Dave Chinner, viro; +Cc: linux-fsdevel, Jens Axboe, io-uring

On 10/11/22 01:40, Dave Chinner wrote:
[...]
> I note that there are changes to the the io_uring IO path and write
> IO end accounting in the io_uring stack that was merged, and there
> was no doubt about the success/failure of the reproducer at each
> step. Hence I think the bisect is good, and the problem is someone
> in the io-uring changes.
> 
> Jens, over to you.
> 
> The reproducer - generic/068 - is 100% reliable here, io_uring is
> being exercised by fsstress in the background whilst the filesystem
> is being frozen and thawed repeatedly. Some path in the io-uring
> code has an unbalanced sb_start_write()/sb_end_write() pair by the
> look of it....

A quick guess, it's probably

b000145e99078 ("io_uring/rw: defer fsnotify calls to task context")

 From a quick look, it removes  kiocb_end_write() -> sb_end_write()
from kiocb_done(), which is a kind of buffered rw completion path.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [regression, v6.0-rc0, io-uring?] filesystem freeze hangs on sb_wait_write()
  2022-10-11  1:10   ` Pavel Begunkov
@ 2022-10-11  2:01     ` Jens Axboe
  2022-10-11  2:10       ` Pavel Begunkov
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2022-10-11  2:01 UTC (permalink / raw)
  To: Pavel Begunkov, Dave Chinner, viro; +Cc: linux-fsdevel, io-uring

On 10/10/22 7:10 PM, Pavel Begunkov wrote:
> On 10/11/22 01:40, Dave Chinner wrote:
> [...]
>> I note that there are changes to the the io_uring IO path and write
>> IO end accounting in the io_uring stack that was merged, and there
>> was no doubt about the success/failure of the reproducer at each
>> step. Hence I think the bisect is good, and the problem is someone
>> in the io-uring changes.
>>
>> Jens, over to you.
>>
>> The reproducer - generic/068 - is 100% reliable here, io_uring is
>> being exercised by fsstress in the background whilst the filesystem
>> is being frozen and thawed repeatedly. Some path in the io-uring
>> code has an unbalanced sb_start_write()/sb_end_write() pair by the
>> look of it....
> 
> A quick guess, it's probably
> 
> b000145e99078 ("io_uring/rw: defer fsnotify calls to task context")
> 
> From a quick look, it removes  kiocb_end_write() -> sb_end_write()
> from kiocb_done(), which is a kind of buffered rw completion path.

Yeah, I'll take a look.

Didn't get the original email, only Pavel's reply?

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [regression, v6.0-rc0, io-uring?] filesystem freeze hangs on sb_wait_write()
  2022-10-11  2:01     ` Jens Axboe
@ 2022-10-11  2:10       ` Pavel Begunkov
  2022-10-11  2:54         ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Pavel Begunkov @ 2022-10-11  2:10 UTC (permalink / raw)
  To: Jens Axboe, Dave Chinner, viro; +Cc: linux-fsdevel, io-uring

On 10/11/22 03:01, Jens Axboe wrote:
> On 10/10/22 7:10 PM, Pavel Begunkov wrote:
>> On 10/11/22 01:40, Dave Chinner wrote:
>> [...]
>>> I note that there are changes to the the io_uring IO path and write
>>> IO end accounting in the io_uring stack that was merged, and there
>>> was no doubt about the success/failure of the reproducer at each
>>> step. Hence I think the bisect is good, and the problem is someone
>>> in the io-uring changes.
>>>
>>> Jens, over to you.
>>>
>>> The reproducer - generic/068 - is 100% reliable here, io_uring is
>>> being exercised by fsstress in the background whilst the filesystem
>>> is being frozen and thawed repeatedly. Some path in the io-uring
>>> code has an unbalanced sb_start_write()/sb_end_write() pair by the
>>> look of it....
>>
>> A quick guess, it's probably
>>
>> b000145e99078 ("io_uring/rw: defer fsnotify calls to task context")
>>
>>  From a quick look, it removes  kiocb_end_write() -> sb_end_write()
>> from kiocb_done(), which is a kind of buffered rw completion path.
> 
> Yeah, I'll take a look.
> Didn't get the original email, only Pavel's reply?

Forwarded.

Not tested, but should be sth like below. Apart of obvious cases
like __io_complete_rw_common() we should also keep in mind
when we don't complete the request but ask for reissue with
REQ_F_REISSUE, that's for the first hunk


diff --git a/io_uring/rw.c b/io_uring/rw.c
index a25cd44cd415..f991aa78803e 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -239,6 +239,18 @@ static bool __io_complete_rw_common(struct io_kiocb *req, long res)
  	if (unlikely(res != req->cqe.res)) {
  		if ((res == -EAGAIN || res == -EOPNOTSUPP) &&
  		    io_rw_should_reissue(req)) {
+			struct io_rw *io = io_kiocb_to_cmd(req, struct io_rw);
+
+			/*
+			 * Need to do it for each rw retry, do it here instead
+			 * of handling it together with REQ_F_REISSUE
+			 */
+			if (io->kiocb.ki_flags & IOCB_WRITE) {
+				kiocb_end_write(req);
+				fsnotify_modify(req->file);
+			} else {
+				fsnotify_access(req->file);
+			}
  			req->flags |= REQ_F_REISSUE | REQ_F_PARTIAL_IO;
  			return true;
  		}
@@ -317,6 +329,12 @@ static int kiocb_done(struct io_kiocb *req, ssize_t ret,
  		req->file->f_pos = rw->kiocb.ki_pos;
  	if (ret >= 0 && (rw->kiocb.ki_complete == io_complete_rw)) {
  		if (!__io_complete_rw_common(req, ret)) {
+			if (rw->kiocb.ki_flags & IOCB_WRITE) {
+				kiocb_end_write(req);
+				fsnotify_modify(req->file);
+			} else {
+				fsnotify_access(req->file);
+			}
  			io_req_set_res(req, final_ret,
  				       io_put_kbuf(req, issue_flags));
  			return IOU_OK;


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [regression, v6.0-rc0, io-uring?] filesystem freeze hangs on sb_wait_write()
  2022-10-11  2:10       ` Pavel Begunkov
@ 2022-10-11  2:54         ` Jens Axboe
  2022-10-11 14:18           ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2022-10-11  2:54 UTC (permalink / raw)
  To: Pavel Begunkov, Dave Chinner, viro; +Cc: linux-fsdevel, io-uring

On 10/10/22 8:10 PM, Pavel Begunkov wrote:
> On 10/11/22 03:01, Jens Axboe wrote:
>> On 10/10/22 7:10 PM, Pavel Begunkov wrote:
>>> On 10/11/22 01:40, Dave Chinner wrote:
>>> [...]
>>>> I note that there are changes to the the io_uring IO path and write
>>>> IO end accounting in the io_uring stack that was merged, and there
>>>> was no doubt about the success/failure of the reproducer at each
>>>> step. Hence I think the bisect is good, and the problem is someone
>>>> in the io-uring changes.
>>>>
>>>> Jens, over to you.
>>>>
>>>> The reproducer - generic/068 - is 100% reliable here, io_uring is
>>>> being exercised by fsstress in the background whilst the filesystem
>>>> is being frozen and thawed repeatedly. Some path in the io-uring
>>>> code has an unbalanced sb_start_write()/sb_end_write() pair by the
>>>> look of it....
>>>
>>> A quick guess, it's probably
>>>
>>> b000145e99078 ("io_uring/rw: defer fsnotify calls to task context")
>>>
>>>  From a quick look, it removes  kiocb_end_write() -> sb_end_write()
>>> from kiocb_done(), which is a kind of buffered rw completion path.
>>
>> Yeah, I'll take a look.
>> Didn't get the original email, only Pavel's reply?
> 
> Forwarded.

Looks like the email did get delivered, it just ended up in the
fsdevel inbox.

> Not tested, but should be sth like below. Apart of obvious cases
> like __io_complete_rw_common() we should also keep in mind
> when we don't complete the request but ask for reissue with
> REQ_F_REISSUE, that's for the first hunk

Can we move this into a helper?

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [regression, v6.0-rc0, io-uring?] filesystem freeze hangs on sb_wait_write()
  2022-10-11  2:54         ` Jens Axboe
@ 2022-10-11 14:18           ` Jens Axboe
  2022-10-11 14:39             ` Pavel Begunkov
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2022-10-11 14:18 UTC (permalink / raw)
  To: Pavel Begunkov, Dave Chinner, viro; +Cc: linux-fsdevel, io-uring

On 10/10/22 8:54 PM, Jens Axboe wrote:
> On 10/10/22 8:10 PM, Pavel Begunkov wrote:
>> On 10/11/22 03:01, Jens Axboe wrote:
>>> On 10/10/22 7:10 PM, Pavel Begunkov wrote:
>>>> On 10/11/22 01:40, Dave Chinner wrote:
>>>> [...]
>>>>> I note that there are changes to the the io_uring IO path and write
>>>>> IO end accounting in the io_uring stack that was merged, and there
>>>>> was no doubt about the success/failure of the reproducer at each
>>>>> step. Hence I think the bisect is good, and the problem is someone
>>>>> in the io-uring changes.
>>>>>
>>>>> Jens, over to you.
>>>>>
>>>>> The reproducer - generic/068 - is 100% reliable here, io_uring is
>>>>> being exercised by fsstress in the background whilst the filesystem
>>>>> is being frozen and thawed repeatedly. Some path in the io-uring
>>>>> code has an unbalanced sb_start_write()/sb_end_write() pair by the
>>>>> look of it....
>>>>
>>>> A quick guess, it's probably
>>>>
>>>> b000145e99078 ("io_uring/rw: defer fsnotify calls to task context")
>>>>
>>>> ?From a quick look, it removes? kiocb_end_write() -> sb_end_write()
>>>> from kiocb_done(), which is a kind of buffered rw completion path.
>>>
>>> Yeah, I'll take a look.
>>> Didn't get the original email, only Pavel's reply?
>>
>> Forwarded.
> 
> Looks like the email did get delivered, it just ended up in the
> fsdevel inbox.

Nope, it was marked as spam by gmail...

>> Not tested, but should be sth like below. Apart of obvious cases
>> like __io_complete_rw_common() we should also keep in mind
>> when we don't complete the request but ask for reissue with
>> REQ_F_REISSUE, that's for the first hunk
> 
> Can we move this into a helper?

Something like this? Not super happy with it, but...


diff --git a/io_uring/rw.c b/io_uring/rw.c
index 453e0ae92160..1c8d00f9af9f 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -234,11 +234,32 @@ static void kiocb_end_write(struct io_kiocb *req)
 	}
 }
 
+/*
+ * Trigger the notifications after having done some IO, and finish the write
+ * accounting, if any.
+ */
+static void io_req_io_end(struct io_kiocb *req)
+{
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+
+	if (rw->kiocb.ki_flags & IOCB_WRITE) {
+		kiocb_end_write(req);
+		fsnotify_modify(req->file);
+	} else {
+		fsnotify_access(req->file);
+	}
+}
+
 static bool __io_complete_rw_common(struct io_kiocb *req, long res)
 {
 	if (unlikely(res != req->cqe.res)) {
 		if ((res == -EAGAIN || res == -EOPNOTSUPP) &&
 		    io_rw_should_reissue(req)) {
+			/*
+			 * Reissue will start accounting again, finish the
+			 * current cycle.
+			 */
+			io_req_io_end(req);
 			req->flags |= REQ_F_REISSUE | REQ_F_PARTIAL_IO;
 			return true;
 		}
@@ -264,15 +285,7 @@ static inline int io_fixup_rw_res(struct io_kiocb *req, long res)
 
 static void io_req_rw_complete(struct io_kiocb *req, bool *locked)
 {
-	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
-
-	if (rw->kiocb.ki_flags & IOCB_WRITE) {
-		kiocb_end_write(req);
-		fsnotify_modify(req->file);
-	} else {
-		fsnotify_access(req->file);
-	}
-
+	io_req_io_end(req);
 	io_req_task_complete(req, locked);
 }
 
@@ -317,6 +330,7 @@ static int kiocb_done(struct io_kiocb *req, ssize_t ret,
 		req->file->f_pos = rw->kiocb.ki_pos;
 	if (ret >= 0 && (rw->kiocb.ki_complete == io_complete_rw)) {
 		if (!__io_complete_rw_common(req, ret)) {
+			io_req_io_end(req);
 			io_req_set_res(req, final_ret,
 				       io_put_kbuf(req, issue_flags));
 			return IOU_OK;

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [regression, v6.0-rc0, io-uring?] filesystem freeze hangs on sb_wait_write()
  2022-10-11 14:18           ` Jens Axboe
@ 2022-10-11 14:39             ` Pavel Begunkov
  2022-10-11 14:47               ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Pavel Begunkov @ 2022-10-11 14:39 UTC (permalink / raw)
  To: Jens Axboe, Dave Chinner, viro; +Cc: linux-fsdevel, io-uring

On 10/11/22 15:18, Jens Axboe wrote:
> On 10/10/22 8:54 PM, Jens Axboe wrote:
>> On 10/10/22 8:10 PM, Pavel Begunkov wrote:
>>> On 10/11/22 03:01, Jens Axboe wrote:
>>>> On 10/10/22 7:10 PM, Pavel Begunkov wrote:
>>>>> On 10/11/22 01:40, Dave Chinner wrote:
>>>>> [...]
>>>>>> I note that there are changes to the the io_uring IO path and write
>>>>>> IO end accounting in the io_uring stack that was merged, and there
>>>>>> was no doubt about the success/failure of the reproducer at each
>>>>>> step. Hence I think the bisect is good, and the problem is someone
>>>>>> in the io-uring changes.
>>>>>>
>>>>>> Jens, over to you.
>>>>>>
>>>>>> The reproducer - generic/068 - is 100% reliable here, io_uring is
>>>>>> being exercised by fsstress in the background whilst the filesystem
>>>>>> is being frozen and thawed repeatedly. Some path in the io-uring
>>>>>> code has an unbalanced sb_start_write()/sb_end_write() pair by the
>>>>>> look of it....
>>>>>
>>>>> A quick guess, it's probably
>>>>>
>>>>> b000145e99078 ("io_uring/rw: defer fsnotify calls to task context")
>>>>>
>>>>> ?From a quick look, it removes? kiocb_end_write() -> sb_end_write()
>>>>> from kiocb_done(), which is a kind of buffered rw completion path.
>>>>
>>>> Yeah, I'll take a look.
>>>> Didn't get the original email, only Pavel's reply?
>>>
>>> Forwarded.
>>
>> Looks like the email did get delivered, it just ended up in the
>> fsdevel inbox.
> 
> Nope, it was marked as spam by gmail...
> 
>>> Not tested, but should be sth like below. Apart of obvious cases
>>> like __io_complete_rw_common() we should also keep in mind
>>> when we don't complete the request but ask for reissue with
>>> REQ_F_REISSUE, that's for the first hunk
>>
>> Can we move this into a helper?
> 
> Something like this? Not super happy with it, but...

Sounds good. Would be great to drop a comment why it's ok to move
back io_req_io_end() into __io_complete_rw_common() under the
io_rw_should_reissue() "if".


> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 453e0ae92160..1c8d00f9af9f 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -234,11 +234,32 @@ static void kiocb_end_write(struct io_kiocb *req)
>   	}
>   }
>   
> +/*
> + * Trigger the notifications after having done some IO, and finish the write
> + * accounting, if any.
> + */
> +static void io_req_io_end(struct io_kiocb *req)
> +{
> +	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> +
> +	if (rw->kiocb.ki_flags & IOCB_WRITE) {
> +		kiocb_end_write(req);
> +		fsnotify_modify(req->file);
> +	} else {
> +		fsnotify_access(req->file);
> +	}
> +}
> +
>   static bool __io_complete_rw_common(struct io_kiocb *req, long res)
>   {
>   	if (unlikely(res != req->cqe.res)) {
>   		if ((res == -EAGAIN || res == -EOPNOTSUPP) &&
>   		    io_rw_should_reissue(req)) {
> +			/*
> +			 * Reissue will start accounting again, finish the
> +			 * current cycle.
> +			 */
> +			io_req_io_end(req);
>   			req->flags |= REQ_F_REISSUE | REQ_F_PARTIAL_IO;
>   			return true;
>   		}
> @@ -264,15 +285,7 @@ static inline int io_fixup_rw_res(struct io_kiocb *req, long res)
>   
>   static void io_req_rw_complete(struct io_kiocb *req, bool *locked)
>   {
> -	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
> -
> -	if (rw->kiocb.ki_flags & IOCB_WRITE) {
> -		kiocb_end_write(req);
> -		fsnotify_modify(req->file);
> -	} else {
> -		fsnotify_access(req->file);
> -	}
> -
> +	io_req_io_end(req);
>   	io_req_task_complete(req, locked);
>   }
>   
> @@ -317,6 +330,7 @@ static int kiocb_done(struct io_kiocb *req, ssize_t ret,
>   		req->file->f_pos = rw->kiocb.ki_pos;
>   	if (ret >= 0 && (rw->kiocb.ki_complete == io_complete_rw)) {
>   		if (!__io_complete_rw_common(req, ret)) {
> +			io_req_io_end(req);
>   			io_req_set_res(req, final_ret,
>   				       io_put_kbuf(req, issue_flags));
>   			return IOU_OK;
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [regression, v6.0-rc0, io-uring?] filesystem freeze hangs on sb_wait_write()
  2022-10-11 14:39             ` Pavel Begunkov
@ 2022-10-11 14:47               ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2022-10-11 14:47 UTC (permalink / raw)
  To: Pavel Begunkov, Dave Chinner, viro; +Cc: linux-fsdevel, io-uring

On 10/11/22 8:39 AM, Pavel Begunkov wrote:
> On 10/11/22 15:18, Jens Axboe wrote:
>> On 10/10/22 8:54 PM, Jens Axboe wrote:
>>> On 10/10/22 8:10 PM, Pavel Begunkov wrote:
>>>> On 10/11/22 03:01, Jens Axboe wrote:
>>>>> On 10/10/22 7:10 PM, Pavel Begunkov wrote:
>>>>>> On 10/11/22 01:40, Dave Chinner wrote:
>>>>>> [...]
>>>>>>> I note that there are changes to the the io_uring IO path and write
>>>>>>> IO end accounting in the io_uring stack that was merged, and there
>>>>>>> was no doubt about the success/failure of the reproducer at each
>>>>>>> step. Hence I think the bisect is good, and the problem is someone
>>>>>>> in the io-uring changes.
>>>>>>>
>>>>>>> Jens, over to you.
>>>>>>>
>>>>>>> The reproducer - generic/068 - is 100% reliable here, io_uring is
>>>>>>> being exercised by fsstress in the background whilst the filesystem
>>>>>>> is being frozen and thawed repeatedly. Some path in the io-uring
>>>>>>> code has an unbalanced sb_start_write()/sb_end_write() pair by the
>>>>>>> look of it....
>>>>>>
>>>>>> A quick guess, it's probably
>>>>>>
>>>>>> b000145e99078 ("io_uring/rw: defer fsnotify calls to task context")
>>>>>>
>>>>>> ?From a quick look, it removes? kiocb_end_write() -> sb_end_write()
>>>>>> from kiocb_done(), which is a kind of buffered rw completion path.
>>>>>
>>>>> Yeah, I'll take a look.
>>>>> Didn't get the original email, only Pavel's reply?
>>>>
>>>> Forwarded.
>>>
>>> Looks like the email did get delivered, it just ended up in the
>>> fsdevel inbox.
>>
>> Nope, it was marked as spam by gmail...
>>
>>>> Not tested, but should be sth like below. Apart of obvious cases
>>>> like __io_complete_rw_common() we should also keep in mind
>>>> when we don't complete the request but ask for reissue with
>>>> REQ_F_REISSUE, that's for the first hunk
>>>
>>> Can we move this into a helper?
>>
>> Something like this? Not super happy with it, but...
> 
> Sounds good. Would be great to drop a comment why it's ok to move
> back io_req_io_end() into __io_complete_rw_common() under the
> io_rw_should_reissue() "if".

Agree, I'll add a comment and post this.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-10-11 14:47 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <[email protected]>
2022-10-11  0:40 ` [regression, v6.0-rc0, io-uring?] filesystem freeze hangs on sb_wait_write() Dave Chinner
2022-10-11  1:10   ` Pavel Begunkov
2022-10-11  2:01     ` Jens Axboe
2022-10-11  2:10       ` Pavel Begunkov
2022-10-11  2:54         ` Jens Axboe
2022-10-11 14:18           ` Jens Axboe
2022-10-11 14:39             ` Pavel Begunkov
2022-10-11 14:47               ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox