public inbox for [email protected]
 help / color / mirror / Atom feed
* Re: [syzbot] WARNING in mntput_no_expire (2)
       [not found] <[email protected]>
@ 2021-04-01 15:45 ` Christian Brauner
  2021-04-01 16:09   ` Jens Axboe
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-04-01 15:45 UTC (permalink / raw)
  To: syzbot, axboe; +Cc: linux-fsdevel, linux-kernel, syzkaller-bugs, viro, io-uring

On Thu, Apr 01, 2021 at 02:09:20AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    d19cc4bf Merge tag 'trace-v5.12-rc5' of git://git.kernel.o..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1018f281d00000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=d1a3d65a48dbd1bc
> dashboard link: https://syzkaller.appspot.com/bug?extid=c88a7030da47945a3cc3
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12f50d11d00000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=137694a1d00000
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: [email protected]
> 
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 8409 at fs/namespace.c:1186 mntput_no_expire+0xaca/0xcb0 fs/namespace.c:1186
> Modules linked in:
> CPU: 1 PID: 8409 Comm: syz-executor035 Not tainted 5.12.0-rc5-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:mntput_no_expire+0xaca/0xcb0 fs/namespace.c:1186
> Code: ff 48 c7 c2 e0 cb 78 89 be c2 02 00 00 48 c7 c7 a0 cb 78 89 c6 05 e5 6d e5 0b 01 e8 ff e1 f6 06 e9 3f fd ff ff e8 c6 a5 a8 ff <0f> 0b e9 fc fc ff ff e8 ba a5 a8 ff e8 55 dc 94 ff 31 ff 89 c5 89
> RSP: 0018:ffffc9000165fc78 EFLAGS: 00010293
> RAX: 0000000000000000 RBX: 1ffff920002cbf95 RCX: 0000000000000000
> RDX: ffff88802072d4c0 RSI: ffffffff81cb4b8a RDI: 0000000000000003
> RBP: ffff888011656900 R08: 0000000000000000 R09: ffffffff8fa978af
> R10: ffffffff81cb4884 R11: 0000000000000000 R12: 0000000000000008
> R13: ffffc9000165fcc8 R14: dffffc0000000000 R15: 00000000ffffffff
> FS:  0000000000000000(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000055a722053160 CR3: 000000000bc8e000 CR4: 00000000001506e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  mntput fs/namespace.c:1232 [inline]
>  cleanup_mnt+0x523/0x530 fs/namespace.c:1132
>  task_work_run+0xdd/0x1a0 kernel/task_work.c:140
>  exit_task_work include/linux/task_work.h:30 [inline]
>  do_exit+0xbfc/0x2a60 kernel/exit.c:825
>  do_group_exit+0x125/0x310 kernel/exit.c:922
>  __do_sys_exit_group kernel/exit.c:933 [inline]
>  __se_sys_exit_group kernel/exit.c:931 [inline]
>  __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x446af9
> Code: Unable to access opcode bytes at RIP 0x446acf.
> RSP: 002b:00000000005dfe48 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> RAX: ffffffffffffffda RBX: 00000000004ce450 RCX: 0000000000446af9
> RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001
> RBP: 0000000000000001 R08: ffffffffffffffbc R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004ce450
> R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001

[+Cc Jens + io_uring]

Hm, this reproducer uses io_uring and it's the io_uring_enter() that
triggers this reliably. With this reproducer I've managed to reproduce
the issue on v5.12-rc4, and v5.12-rc3, v5.12-rc2 and v5.12-rc1.
It's not reproducible at
9820b4dca0f9c6b7ab8b4307286cdace171b724d
which is the commit immediately before the first v5.12 io_uring merge.
It's first reproducible with the first io_uring merge for v5.12, i.e.
5bbb336ba75d95611a7b9456355b48705016bdb1

Christian

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-01 15:45 ` [syzbot] WARNING in mntput_no_expire (2) Christian Brauner
@ 2021-04-01 16:09   ` Jens Axboe
  2021-04-01 17:46     ` Christian Brauner
  0 siblings, 1 reply; 31+ messages in thread
From: Jens Axboe @ 2021-04-01 16:09 UTC (permalink / raw)
  To: Christian Brauner, syzbot
  Cc: linux-fsdevel, linux-kernel, syzkaller-bugs, viro, io-uring

On 4/1/21 9:45 AM, Christian Brauner wrote:
> On Thu, Apr 01, 2021 at 02:09:20AM -0700, syzbot wrote:
>> Hello,
>>
>> syzbot found the following issue on:
>>
>> HEAD commit:    d19cc4bf Merge tag 'trace-v5.12-rc5' of git://git.kernel.o..
>> git tree:       upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=1018f281d00000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=d1a3d65a48dbd1bc
>> dashboard link: https://syzkaller.appspot.com/bug?extid=c88a7030da47945a3cc3
>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12f50d11d00000
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=137694a1d00000
>>
>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>> Reported-by: [email protected]
>>
>> ------------[ cut here ]------------
>> WARNING: CPU: 1 PID: 8409 at fs/namespace.c:1186 mntput_no_expire+0xaca/0xcb0 fs/namespace.c:1186
>> Modules linked in:
>> CPU: 1 PID: 8409 Comm: syz-executor035 Not tainted 5.12.0-rc5-syzkaller #0
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>> RIP: 0010:mntput_no_expire+0xaca/0xcb0 fs/namespace.c:1186
>> Code: ff 48 c7 c2 e0 cb 78 89 be c2 02 00 00 48 c7 c7 a0 cb 78 89 c6 05 e5 6d e5 0b 01 e8 ff e1 f6 06 e9 3f fd ff ff e8 c6 a5 a8 ff <0f> 0b e9 fc fc ff ff e8 ba a5 a8 ff e8 55 dc 94 ff 31 ff 89 c5 89
>> RSP: 0018:ffffc9000165fc78 EFLAGS: 00010293
>> RAX: 0000000000000000 RBX: 1ffff920002cbf95 RCX: 0000000000000000
>> RDX: ffff88802072d4c0 RSI: ffffffff81cb4b8a RDI: 0000000000000003
>> RBP: ffff888011656900 R08: 0000000000000000 R09: ffffffff8fa978af
>> R10: ffffffff81cb4884 R11: 0000000000000000 R12: 0000000000000008
>> R13: ffffc9000165fcc8 R14: dffffc0000000000 R15: 00000000ffffffff
>> FS:  0000000000000000(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 000055a722053160 CR3: 000000000bc8e000 CR4: 00000000001506e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
>>  mntput fs/namespace.c:1232 [inline]
>>  cleanup_mnt+0x523/0x530 fs/namespace.c:1132
>>  task_work_run+0xdd/0x1a0 kernel/task_work.c:140
>>  exit_task_work include/linux/task_work.h:30 [inline]
>>  do_exit+0xbfc/0x2a60 kernel/exit.c:825
>>  do_group_exit+0x125/0x310 kernel/exit.c:922
>>  __do_sys_exit_group kernel/exit.c:933 [inline]
>>  __se_sys_exit_group kernel/exit.c:931 [inline]
>>  __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
>>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>> RIP: 0033:0x446af9
>> Code: Unable to access opcode bytes at RIP 0x446acf.
>> RSP: 002b:00000000005dfe48 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
>> RAX: ffffffffffffffda RBX: 00000000004ce450 RCX: 0000000000446af9
>> RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001
>> RBP: 0000000000000001 R08: ffffffffffffffbc R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004ce450
>> R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
> 
> [+Cc Jens + io_uring]
> 
> Hm, this reproducer uses io_uring and it's the io_uring_enter() that
> triggers this reliably. With this reproducer I've managed to reproduce
> the issue on v5.12-rc4, and v5.12-rc3, v5.12-rc2 and v5.12-rc1.
> It's not reproducible at
> 9820b4dca0f9c6b7ab8b4307286cdace171b724d
> which is the commit immediately before the first v5.12 io_uring merge.
> It's first reproducible with the first io_uring merge for v5.12, i.e.
> 5bbb336ba75d95611a7b9456355b48705016bdb1

Thanks, that's good info. I'll take a look at it and see if I can
reproduce.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-01 16:09   ` Jens Axboe
@ 2021-04-01 17:46     ` Christian Brauner
  2021-04-01 17:59       ` Christian Brauner
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-04-01 17:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs, viro,
	io-uring

On Thu, Apr 01, 2021 at 10:09:18AM -0600, Jens Axboe wrote:
> On 4/1/21 9:45 AM, Christian Brauner wrote:
> > On Thu, Apr 01, 2021 at 02:09:20AM -0700, syzbot wrote:
> >> Hello,
> >>
> >> syzbot found the following issue on:
> >>
> >> HEAD commit:    d19cc4bf Merge tag 'trace-v5.12-rc5' of git://git.kernel.o..
> >> git tree:       upstream
> >> console output: https://syzkaller.appspot.com/x/log.txt?x=1018f281d00000
> >> kernel config:  https://syzkaller.appspot.com/x/.config?x=d1a3d65a48dbd1bc
> >> dashboard link: https://syzkaller.appspot.com/bug?extid=c88a7030da47945a3cc3
> >> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12f50d11d00000
> >> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=137694a1d00000
> >>
> >> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> >> Reported-by: [email protected]
> >>
> >> ------------[ cut here ]------------
> >> WARNING: CPU: 1 PID: 8409 at fs/namespace.c:1186 mntput_no_expire+0xaca/0xcb0 fs/namespace.c:1186
> >> Modules linked in:
> >> CPU: 1 PID: 8409 Comm: syz-executor035 Not tainted 5.12.0-rc5-syzkaller #0
> >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> >> RIP: 0010:mntput_no_expire+0xaca/0xcb0 fs/namespace.c:1186
> >> Code: ff 48 c7 c2 e0 cb 78 89 be c2 02 00 00 48 c7 c7 a0 cb 78 89 c6 05 e5 6d e5 0b 01 e8 ff e1 f6 06 e9 3f fd ff ff e8 c6 a5 a8 ff <0f> 0b e9 fc fc ff ff e8 ba a5 a8 ff e8 55 dc 94 ff 31 ff 89 c5 89
> >> RSP: 0018:ffffc9000165fc78 EFLAGS: 00010293
> >> RAX: 0000000000000000 RBX: 1ffff920002cbf95 RCX: 0000000000000000
> >> RDX: ffff88802072d4c0 RSI: ffffffff81cb4b8a RDI: 0000000000000003
> >> RBP: ffff888011656900 R08: 0000000000000000 R09: ffffffff8fa978af
> >> R10: ffffffff81cb4884 R11: 0000000000000000 R12: 0000000000000008
> >> R13: ffffc9000165fcc8 R14: dffffc0000000000 R15: 00000000ffffffff
> >> FS:  0000000000000000(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
> >> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: 000055a722053160 CR3: 000000000bc8e000 CR4: 00000000001506e0
> >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >> Call Trace:
> >>  mntput fs/namespace.c:1232 [inline]
> >>  cleanup_mnt+0x523/0x530 fs/namespace.c:1132
> >>  task_work_run+0xdd/0x1a0 kernel/task_work.c:140
> >>  exit_task_work include/linux/task_work.h:30 [inline]
> >>  do_exit+0xbfc/0x2a60 kernel/exit.c:825
> >>  do_group_exit+0x125/0x310 kernel/exit.c:922
> >>  __do_sys_exit_group kernel/exit.c:933 [inline]
> >>  __se_sys_exit_group kernel/exit.c:931 [inline]
> >>  __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
> >>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> >>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >> RIP: 0033:0x446af9
> >> Code: Unable to access opcode bytes at RIP 0x446acf.
> >> RSP: 002b:00000000005dfe48 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> >> RAX: ffffffffffffffda RBX: 00000000004ce450 RCX: 0000000000446af9
> >> RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001
> >> RBP: 0000000000000001 R08: ffffffffffffffbc R09: 0000000000000000
> >> R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004ce450
> >> R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
> > 
> > [+Cc Jens + io_uring]
> > 
> > Hm, this reproducer uses io_uring and it's the io_uring_enter() that
> > triggers this reliably. With this reproducer I've managed to reproduce
> > the issue on v5.12-rc4, and v5.12-rc3, v5.12-rc2 and v5.12-rc1.
> > It's not reproducible at
> > 9820b4dca0f9c6b7ab8b4307286cdace171b724d
> > which is the commit immediately before the first v5.12 io_uring merge.
> > It's first reproducible with the first io_uring merge for v5.12, i.e.
> > 5bbb336ba75d95611a7b9456355b48705016bdb1
> 
> Thanks, that's good info. I'll take a look at it and see if I can
> reproduce.

Ok, I was deep into this anyway and it didn't make much sense to do
anything else at that point so I bisected this a bit further. The first
bad commit is:

commit 3a81fd02045c329f25e5900fa61f613c9b317644
Author: Jens Axboe <[email protected]>
Date:   Thu Dec 10 12:25:36 2020 -0700

    io_uring: enable LOOKUP_CACHED path resolution for filename lookups

    Instead of being pessimistic and assume that path lookup will block, use
    LOOKUP_CACHED to attempt just a cached lookup. This ensures that the
    fast path is always done inline, and we only punt to async context if
    IO is needed to satisfy the lookup.

    For forced nonblock open attempts, mark the file O_NONBLOCK over the
    actual ->open() call as well. We can safely clear this again before
    doing fd_install(), so it'll never be user visible that we fiddled with
    it.

    This greatly improves the performance of file open where the dentry is
    already cached:

    ached           5.10-git        5.10-git+LOOKUP_CACHED  Speedup
    ---------------------------------------------------------------
    33%             1,014,975       900,474                 1.1x
    89%              545,466        292,937                 1.9x
    100%             435,636        151,475                 2.9x

    The more cache hot we are, the faster the inline LOOKUP_CACHED
    optimization helps. This is unsurprising and expected, as a thread
    offload becomes a more dominant part of the total overhead. If we look
    at io_uring tracing, doing an IORING_OP_OPENAT on a file that isn't in
    the dentry cache will yield:

    275.550481: io_uring_create: ring 00000000ddda6278, fd 3 sq size 8, cq size 16, flags 0
    275.550491: io_uring_submit_sqe: ring 00000000ddda6278, op 18, data 0x0, non block 1, sq_thread 0
    275.550498: io_uring_queue_async_work: ring 00000000ddda6278, request 00000000c0267d17, flags 69760, normal queue, work 000000003d683991
    275.550502: io_uring_cqring_wait: ring 00000000ddda6278, min_events 1
    275.550556: io_uring_complete: ring 00000000ddda6278, user_data 0x0, result 4

    which shows a failed nonblock lookup, then punt to worker, and then we
    complete with fd == 4. This takes 65 usec in total. Re-running the same
    test case again:

    281.253956: io_uring_create: ring 0000000008207252, fd 3 sq size 8, cq size 16, flags 0
    281.253967: io_uring_submit_sqe: ring 0000000008207252, op 18, data 0x0, non block 1, sq_thread 0
    281.253973: io_uring_complete: ring 0000000008207252, user_data 0x0, result 4

    shows the same request completing inline, also returning fd == 4. This
    takes 6 usec.

    Signed-off-by: Jens Axboe <[email protected]>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-01 17:46     ` Christian Brauner
@ 2021-04-01 17:59       ` Christian Brauner
  2021-04-01 19:11         ` Al Viro
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-04-01 17:59 UTC (permalink / raw)
  To: Jens Axboe, viro
  Cc: syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs, io-uring

On Thu, Apr 01, 2021 at 07:46:13PM +0200, Christian Brauner wrote:
> On Thu, Apr 01, 2021 at 10:09:18AM -0600, Jens Axboe wrote:
> > On 4/1/21 9:45 AM, Christian Brauner wrote:
> > > On Thu, Apr 01, 2021 at 02:09:20AM -0700, syzbot wrote:
> > >> Hello,
> > >>
> > >> syzbot found the following issue on:
> > >>
> > >> HEAD commit:    d19cc4bf Merge tag 'trace-v5.12-rc5' of git://git.kernel.o..
> > >> git tree:       upstream
> > >> console output: https://syzkaller.appspot.com/x/log.txt?x=1018f281d00000
> > >> kernel config:  https://syzkaller.appspot.com/x/.config?x=d1a3d65a48dbd1bc
> > >> dashboard link: https://syzkaller.appspot.com/bug?extid=c88a7030da47945a3cc3
> > >> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12f50d11d00000
> > >> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=137694a1d00000
> > >>
> > >> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > >> Reported-by: [email protected]
> > >>
> > >> ------------[ cut here ]------------
> > >> WARNING: CPU: 1 PID: 8409 at fs/namespace.c:1186 mntput_no_expire+0xaca/0xcb0 fs/namespace.c:1186
> > >> Modules linked in:
> > >> CPU: 1 PID: 8409 Comm: syz-executor035 Not tainted 5.12.0-rc5-syzkaller #0
> > >> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > >> RIP: 0010:mntput_no_expire+0xaca/0xcb0 fs/namespace.c:1186
> > >> Code: ff 48 c7 c2 e0 cb 78 89 be c2 02 00 00 48 c7 c7 a0 cb 78 89 c6 05 e5 6d e5 0b 01 e8 ff e1 f6 06 e9 3f fd ff ff e8 c6 a5 a8 ff <0f> 0b e9 fc fc ff ff e8 ba a5 a8 ff e8 55 dc 94 ff 31 ff 89 c5 89
> > >> RSP: 0018:ffffc9000165fc78 EFLAGS: 00010293
> > >> RAX: 0000000000000000 RBX: 1ffff920002cbf95 RCX: 0000000000000000
> > >> RDX: ffff88802072d4c0 RSI: ffffffff81cb4b8a RDI: 0000000000000003
> > >> RBP: ffff888011656900 R08: 0000000000000000 R09: ffffffff8fa978af
> > >> R10: ffffffff81cb4884 R11: 0000000000000000 R12: 0000000000000008
> > >> R13: ffffc9000165fcc8 R14: dffffc0000000000 R15: 00000000ffffffff
> > >> FS:  0000000000000000(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
> > >> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> CR2: 000055a722053160 CR3: 000000000bc8e000 CR4: 00000000001506e0
> > >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >> Call Trace:
> > >>  mntput fs/namespace.c:1232 [inline]
> > >>  cleanup_mnt+0x523/0x530 fs/namespace.c:1132
> > >>  task_work_run+0xdd/0x1a0 kernel/task_work.c:140
> > >>  exit_task_work include/linux/task_work.h:30 [inline]
> > >>  do_exit+0xbfc/0x2a60 kernel/exit.c:825
> > >>  do_group_exit+0x125/0x310 kernel/exit.c:922
> > >>  __do_sys_exit_group kernel/exit.c:933 [inline]
> > >>  __se_sys_exit_group kernel/exit.c:931 [inline]
> > >>  __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
> > >>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> > >>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > >> RIP: 0033:0x446af9
> > >> Code: Unable to access opcode bytes at RIP 0x446acf.
> > >> RSP: 002b:00000000005dfe48 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> > >> RAX: ffffffffffffffda RBX: 00000000004ce450 RCX: 0000000000446af9
> > >> RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001
> > >> RBP: 0000000000000001 R08: ffffffffffffffbc R09: 0000000000000000
> > >> R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004ce450
> > >> R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
> > > 
> > > [+Cc Jens + io_uring]
> > > 
> > > Hm, this reproducer uses io_uring and it's the io_uring_enter() that
> > > triggers this reliably. With this reproducer I've managed to reproduce
> > > the issue on v5.12-rc4, and v5.12-rc3, v5.12-rc2 and v5.12-rc1.
> > > It's not reproducible at
> > > 9820b4dca0f9c6b7ab8b4307286cdace171b724d
> > > which is the commit immediately before the first v5.12 io_uring merge.
> > > It's first reproducible with the first io_uring merge for v5.12, i.e.
> > > 5bbb336ba75d95611a7b9456355b48705016bdb1
> > 
> > Thanks, that's good info. I'll take a look at it and see if I can
> > reproduce.
> 
> Ok, I was deep into this anyway and it didn't make much sense to do
> anything else at that point so I bisected this a bit further. The first
> bad commit is:
> 
> commit 3a81fd02045c329f25e5900fa61f613c9b317644
> Author: Jens Axboe <[email protected]>
> Date:   Thu Dec 10 12:25:36 2020 -0700
> 
>     io_uring: enable LOOKUP_CACHED path resolution for filename lookups
> 
>     Instead of being pessimistic and assume that path lookup will block, use
>     LOOKUP_CACHED to attempt just a cached lookup. This ensures that the
>     fast path is always done inline, and we only punt to async context if
>     IO is needed to satisfy the lookup.
> 
>     For forced nonblock open attempts, mark the file O_NONBLOCK over the
>     actual ->open() call as well. We can safely clear this again before
>     doing fd_install(), so it'll never be user visible that we fiddled with
>     it.
> 
>     This greatly improves the performance of file open where the dentry is
>     already cached:
> 
>     ached           5.10-git        5.10-git+LOOKUP_CACHED  Speedup
>     ---------------------------------------------------------------
>     33%             1,014,975       900,474                 1.1x
>     89%              545,466        292,937                 1.9x
>     100%             435,636        151,475                 2.9x
> 
>     The more cache hot we are, the faster the inline LOOKUP_CACHED
>     optimization helps. This is unsurprising and expected, as a thread
>     offload becomes a more dominant part of the total overhead. If we look
>     at io_uring tracing, doing an IORING_OP_OPENAT on a file that isn't in
>     the dentry cache will yield:
> 
>     275.550481: io_uring_create: ring 00000000ddda6278, fd 3 sq size 8, cq size 16, flags 0
>     275.550491: io_uring_submit_sqe: ring 00000000ddda6278, op 18, data 0x0, non block 1, sq_thread 0
>     275.550498: io_uring_queue_async_work: ring 00000000ddda6278, request 00000000c0267d17, flags 69760, normal queue, work 000000003d683991
>     275.550502: io_uring_cqring_wait: ring 00000000ddda6278, min_events 1
>     275.550556: io_uring_complete: ring 00000000ddda6278, user_data 0x0, result 4
> 
>     which shows a failed nonblock lookup, then punt to worker, and then we
>     complete with fd == 4. This takes 65 usec in total. Re-running the same
>     test case again:
> 
>     281.253956: io_uring_create: ring 0000000008207252, fd 3 sq size 8, cq size 16, flags 0
>     281.253967: io_uring_submit_sqe: ring 0000000008207252, op 18, data 0x0, non block 1, sq_thread 0
>     281.253973: io_uring_complete: ring 0000000008207252, user_data 0x0, result 4
> 
>     shows the same request completing inline, also returning fd == 4. This
>     takes 6 usec.
> 
>     Signed-off-by: Jens Axboe <[email protected]>

I _think_ I see what the issue is. It seems that an assumption made in
this commit might be wrong and we're missing a mnt_add_count() bump that
we would otherwise have gotten if we've moved the failure handling into
the unlazy helpers themselves.

Al, does that sound plausible?

commit eacd9aa8cedeb412842c7b339adbaa0477fdd5ad
Author: Al Viro <[email protected]>
Date:   Mon Feb 15 12:03:23 2021 -0500

    fix handling of nd->depth on LOOKUP_CACHED failures in try_to_unlazy*

    After switching to non-RCU mode, we want nd->depth to match the number
    of entries in nd->stack[] that need eventual path_put().
    legitimize_links() takes care of that on failures; unfortunately,
    failure exits added for LOOKUP_CACHED do not.

    We could add the logics for that into those failure exits, both in
    try_to_unlazy() and in try_to_unlazy_next(), but since both checks
    are immediately followed by legitimize_links() and there's no calls
    of legitimize_links() other than those two...  It's easier to
    move the check (and required handling of nd->depth on failure) into
    legitimize_links() itself.

    [caught by Jens: ... and since we are zeroing ->depth here, we need
    to do drop_links() first]

    Fixes: 6c6ec2b0a3e0 "fs: add support for LOOKUP_CACHED"
    Tested-by: Jens Axboe <[email protected]>
    Signed-off-by: Al Viro <[email protected]>

diff --git a/fs/namei.c b/fs/namei.c
index 4cae88733a5c..de74ad2bc6e2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -630,6 +630,11 @@ static inline bool legitimize_path(struct nameidata *nd,
 static bool legitimize_links(struct nameidata *nd)
 {
        int i;
+       if (unlikely(nd->flags & LOOKUP_CACHED)) {
+               drop_links(nd);
+               nd->depth = 0;
+               return false;
+       }
        for (i = 0; i < nd->depth; i++) {
                struct saved *last = nd->stack + i;
                if (unlikely(!legitimize_path(nd, &last->link, last->seq))) {
@@ -686,8 +691,6 @@ static bool try_to_unlazy(struct nameidata *nd)
        BUG_ON(!(nd->flags & LOOKUP_RCU));

        nd->flags &= ~LOOKUP_RCU;
-       if (nd->flags & LOOKUP_CACHED)
-               goto out1;
        if (unlikely(!legitimize_links(nd)))
                goto out1;
        if (unlikely(!legitimize_path(nd, &nd->path, nd->seq)))
@@ -724,8 +727,6 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry, unsi
        BUG_ON(!(nd->flags & LOOKUP_RCU));

        nd->flags &= ~LOOKUP_RCU;
-       if (nd->flags & LOOKUP_CACHED)
-               goto out2;
        if (unlikely(!legitimize_links(nd)))
                goto out2;
        if (unlikely(!legitimize_mnt(nd->path.mnt, nd->m_seq)))

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-01 17:59       ` Christian Brauner
@ 2021-04-01 19:11         ` Al Viro
  2021-04-04  2:34           ` Al Viro
  0 siblings, 1 reply; 31+ messages in thread
From: Al Viro @ 2021-04-01 19:11 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Thu, Apr 01, 2021 at 07:59:19PM +0200, Christian Brauner wrote:

> I _think_ I see what the issue is. It seems that an assumption made in
> this commit might be wrong and we're missing a mnt_add_count() bump that
> we would otherwise have gotten if we've moved the failure handling into
> the unlazy helpers themselves.
> 
> Al, does that sound plausible?

mnt_add_count() on _what_?  Failure in legitimize_links() ends up with
nd->path.mnt zeroed, in both callers.  So which vfsmount would be
affected?

Rules:
	in RCU mode: no mounts pinned
	out of RCU mode: nd->path.mnt and all nd->stack[i].link.mnt for
			i below nd->depth are either NULL or pinned

Transition from RCU to non-RCU mode happens in try_to_unlazy() and
try_to_unlazy_next().

References (if any) are dropped by eventual terminate_walk() (after
that the contents of nameidata is junk).

__legitimize_mnt() is the primitive for pinning.  Return values:
	0 -- successfully pinned (or given NULL as an argument)
	1 -- failed, refcount not affected
	-1 -- failed, refcount bumped.
It stays in RCU mode in all cases.

One user is __legitimize_path(); it also stays in RCU mode.  If it
fails to legitimize path->mnt, it will zero it *IF* __legitimize_mnt()
reports that refcount hadn't been taken.  In all other cases,
path->mnt is pinned.  IOW, the caller is responsible for path_put()
regardless of the outcome.

Another user is legitimize_mnt().  _That_ will make sure that
refcount is unaffected in case of failure (IOW, if __legitimize_mnt()
reports failure with refcount bumped, we drop out of RCU mode,
do mntput() and go back).

On failure in legitimize_links() we either leave nd->depth equal to zero
(in which case all nd->stack[...].link.mnt are to be ignored) or
we set it one higher than the last attempted legitimize_path() in there.
In the latter case, all entries in nd->stack below the value we put into
nd->depth had legitimize_path() called (and thus have ->mnt either NULL
or pinned) and everything starting from nd->depth is to be ignored.

nd->path handling:
1) Callers of legitimize_links() are responsible for zeroing nd->path.mnt
on legitimize_links() failure.  Both do that, AFAICS.
2) in try_to_unlazy() we proceed to call legitimize_path() on nd->path.
Once that call is done, we have nd->path.mnt pinned or NULL, so nothing
further is needed with it.
3) in try_to_unlazy_next() we use legitimize_mnt() instead.  Failure
of that is handled by zeroing nd->path.mnt; success means that nd->path.mnt
is pinned and should be left alone.

We could use __legitimize_mnt() in try_to_unlazy_next() (basically,
substitute the body of legitimize_mnt() there and massage it a bit),
but that ends up being harder to follow:
	res = __legitimize_mnt(nd->path.mnt, nd->m_seq);
	if (unlikely(res)) {
		if (res < 0)	// pinned, leave it there
			goto out1;
		else		// not pinned, zero it
			goto out2;
	}
instead of
        if (unlikely(!legitimize_mnt(nd->path.mnt, nd->m_seq)))
		goto out2;
we have now.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-01 19:11         ` Al Viro
@ 2021-04-04  2:34           ` Al Viro
  2021-04-04  2:38             ` Al Viro
  2021-04-04 11:34             ` Christian Brauner
  0 siblings, 2 replies; 31+ messages in thread
From: Al Viro @ 2021-04-04  2:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Thu, Apr 01, 2021 at 07:11:12PM +0000, Al Viro wrote:

> > I _think_ I see what the issue is. It seems that an assumption made in
> > this commit might be wrong and we're missing a mnt_add_count() bump that
> > we would otherwise have gotten if we've moved the failure handling into
> > the unlazy helpers themselves.
> > 
> > Al, does that sound plausible?
> 
> mnt_add_count() on _what_?  Failure in legitimize_links() ends up with
> nd->path.mnt zeroed, in both callers.  So which vfsmount would be
> affected?

Could you turn that WARN_ON(count < 0) into
	if (WARN_ON(count < 0))
		printk(KERN_ERR "id = %d, dev = %s, count = %d\n",
				mnt->mnt_id,
				mnt->mnt_sb->s_id,
				count);
add system("cat /proc/self/mountinfo"); right after sandbox_common()
call and try to reproduce that?

I really wonder what mount is it happening to.  BTW, how painful would
it be to teach syzcaller to turn those cascades of
	NONFAILING(*(uint8_t*)0x20000080 = 0x12);
	NONFAILING(*(uint8_t*)0x20000081 = 0);
	NONFAILING(*(uint16_t*)0x20000082 = 0);
	NONFAILING(*(uint32_t*)0x20000084 = 0xffffff9c);
	NONFAILING(*(uint64_t*)0x20000088 = 0);
	NONFAILING(*(uint64_t*)0x20000090 = 0x20000180);
	NONFAILING(memcpy((void*)0x20000180, "./file0\000", 8));
	NONFAILING(*(uint32_t*)0x20000098 = 0);
	NONFAILING(*(uint32_t*)0x2000009c = 0x80);
	NONFAILING(*(uint64_t*)0x200000a0 = 0x23456);
	....
	NONFAILING(syz_io_uring_submit(r[1], r[2], 0x20000080, 0));
into something more readable?  Bloody annoyance every time...  Sure, I can
manually translate it into
	struct io_uring_sqe *sqe = (void *)0x20000080;
	char *s = (void *)0x20000180;
	memset(sqe, '\0', sizeof(*sqe));
	sqe->opcode = 0x12; // IORING_OP_OPENAT?
	sqe->fd = -100;	// AT_FDCWD?
	sqe->addr = s;
	strcpy(s, "./file0");
	sqe->open_flags = 0x80;	// O_EXCL???
	sqe->user_data = 0x23456;	// random tag?
	syz_io_uring_submit(r[1], r[2], (unsigned long)p, 0);
but it's really annoying as hell, especially since syz_io_uring_submit()
comes from syzcaller and the damn thing _knows_ that the third argument
is sodding io_uring_sqe, and never passed to anything other than
memcpy() in there, at that, so the exact address can't matter.

Incidentally, solitary O_EXCL (without O_CREAT) is... curious.  Does that
sucker still trigger without it?  I.e. with store to 0x2000009c replaced
with storing 0?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-04  2:34           ` Al Viro
@ 2021-04-04  2:38             ` Al Viro
  2021-04-04 11:34             ` Christian Brauner
  1 sibling, 0 replies; 31+ messages in thread
From: Al Viro @ 2021-04-04  2:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Sun, Apr 04, 2021 at 02:34:08AM +0000, Al Viro wrote:

> I really wonder what mount is it happening to.  BTW, how painful would
> it be to teach syzcaller to turn those cascades of
> 	NONFAILING(*(uint8_t*)0x20000080 = 0x12);
> 	NONFAILING(*(uint8_t*)0x20000081 = 0);
> 	NONFAILING(*(uint16_t*)0x20000082 = 0);
> 	NONFAILING(*(uint32_t*)0x20000084 = 0xffffff9c);
> 	NONFAILING(*(uint64_t*)0x20000088 = 0);
> 	NONFAILING(*(uint64_t*)0x20000090 = 0x20000180);
> 	NONFAILING(memcpy((void*)0x20000180, "./file0\000", 8));
> 	NONFAILING(*(uint32_t*)0x20000098 = 0);
> 	NONFAILING(*(uint32_t*)0x2000009c = 0x80);
> 	NONFAILING(*(uint64_t*)0x200000a0 = 0x23456);
> 	....
> 	NONFAILING(syz_io_uring_submit(r[1], r[2], 0x20000080, 0));
> into something more readable?  Bloody annoyance every time...  Sure, I can
> manually translate it into
> 	struct io_uring_sqe *sqe = (void *)0x20000080;
> 	char *s = (void *)0x20000180;
> 	memset(sqe, '\0', sizeof(*sqe));
> 	sqe->opcode = 0x12; // IORING_OP_OPENAT?
> 	sqe->fd = -100;	// AT_FDCWD?
> 	sqe->addr = s;
> 	strcpy(s, "./file0");
> 	sqe->open_flags = 0x80;	// O_EXCL???
> 	sqe->user_data = 0x23456;	// random tag?
> 	syz_io_uring_submit(r[1], r[2], (unsigned long)p, 0);
> but it's really annoying as hell, especially since syz_io_uring_submit()
> comes from syzcaller and the damn thing _knows_ that the third argument
> is sodding io_uring_sqe, and never passed to anything other than
> memcpy() in there, at that, so the exact address can't matter.

... especially since the native syzcaller reproducer clearly *does* have
that information.  Simply putting that into comments side-by-side with
what gets put into C reproducer would be nice, especially if it goes with
field names...

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-04  2:34           ` Al Viro
  2021-04-04  2:38             ` Al Viro
@ 2021-04-04 11:34             ` Christian Brauner
  2021-04-04 15:56               ` Al Viro
  1 sibling, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-04-04 11:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Sun, Apr 04, 2021 at 02:34:08AM +0000, Al Viro wrote:
> On Thu, Apr 01, 2021 at 07:11:12PM +0000, Al Viro wrote:
> 
> > > I _think_ I see what the issue is. It seems that an assumption made in
> > > this commit might be wrong and we're missing a mnt_add_count() bump that
> > > we would otherwise have gotten if we've moved the failure handling into
> > > the unlazy helpers themselves.
> > > 
> > > Al, does that sound plausible?
> > 
> > mnt_add_count() on _what_?  Failure in legitimize_links() ends up with
> > nd->path.mnt zeroed, in both callers.  So which vfsmount would be
> > affected?

It looks to me like it's the vfsmount of the nd->root after we called
chroot or pivot_root.

> 
> Could you turn that WARN_ON(count < 0) into
> 	if (WARN_ON(count < 0))
> 		printk(KERN_ERR "id = %d, dev = %s, count = %d\n",
> 				mnt->mnt_id,
> 				mnt->mnt_sb->s_id,
> 				count);
> add system("cat /proc/self/mountinfo"); right after sandbox_common()
> call and try to reproduce that?

Sorry for not replying to your earlier mail but I've been debugging this
too. My current theory is that it's related to LOOKUP_ROOT_GRABBED when
LOOKUP_CACHED is specified _possibly_ with an interaction how
create_io_thread() is created with CLONE_FS. The reproducer requires you
either have called pivot_root() or chroot() in order for the failure to
happen. So I think the fact that we skip legitimize_root() when
LOOKUP_CACHED is set might figure into this. I can keep digging.

Funny enough I already placed a printk statement into the place you
wanted one too so I just amended mine. Here's what you get:

If pivot pivot_root() is used before the chroot() you get:

[  637.464555] AAAA: count(-1) | mnt_mntpoint(/) | mnt->mnt.mnt_root(/) | id(579) | dev(tmpfs)

if you only call chroot, i.e. make the pivot_root() branch a simple
if (true) you get:

[  955.206117] AAAA: count(-2) | mnt_mntpoint(/) | mnt->mnt.mnt_root(/) | id(580) | dev(tmpfs)

The cat /proc/self/mountinfo is for the id(579) below:

514 513 8:2 / / rw,relatime - ext4 /dev/sda2 rw
515 514 0:5 / /dev rw,nosuid,noexec,relatime - devtmpfs udev rw,size=302716k,nr_inodes=75679,mode=755
516 515 0:26 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
517 515 0:28 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw
518 515 0:48 / /dev/hugepages rw,relatime - hugetlbfs hugetlbfs rw,pagesize=2M
519 515 0:21 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
520 514 0:27 / /run rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,size=62152k,mode=755
521 520 0:29 / /run/lock rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,size=5120k
522 520 0:49 / /run/lxd_agent rw,relatime - tmpfs tmpfs rw,size=51200k,mode=700
523 520 0:59 / /run/user/1000 rw,nosuid,nodev,relatime - tmpfs tmpfs rw,size=62148k,nr_inodes=15537,mode=700,uid=1000,gid=1000
524 514 0:24 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
525 524 0:6 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime - securityfs securityfs rw
526 524 0:30 / /sys/fs/cgroup ro,nosuid,nodev,noexec - tmpfs tmpfs ro,size=4096k,nr_inodes=1024,mode=755
527 526 0:31 /../../.. /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw
528 526 0:32 /../../.. /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,name=systemd
529 526 0:36 /../../.. /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
530 526 0:37 /.. /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu,cpuacct
531 526 0:38 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,hugetlb
549 526 0:39 /.. /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
550 526 0:40 /../../.. /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
551 526 0:41 /../../.. /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids
552 526 0:42 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls,net_prio
553 526 0:43 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event
554 526 0:44 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset,clone_children
555 526 0:45 /.. /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
556 526 0:46 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,rdma
557 524 0:33 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime - pstore pstore rw
558 524 0:34 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime - efivarfs efivarfs rw
559 524 0:35 / /sys/fs/bpf rw,nosuid,nodev,noexec,relatime - bpf none rw,mode=700
560 524 0:7 / /sys/kernel/debug rw,nosuid,nodev,noexec,relatime - debugfs debugfs rw
561 524 0:12 / /sys/kernel/tracing rw,nosuid,nodev,noexec,relatime - tracefs tracefs rw
562 524 0:51 / /sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime - fusectl fusectl rw
563 524 0:20 / /sys/kernel/config rw,nosuid,nodev,noexec,relatime - configfs configfs rw
564 514 0:25 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
565 564 0:47 / /proc/sys/fs/binfmt_misc rw,relatime - autofs systemd-1 rw,fd=29,pgrp=0,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=31453
566 565 0:52 / /proc/sys/fs/binfmt_misc rw,nosuid,nodev,noexec,relatime - binfmt_misc binfmt_misc rw
567 514 0:50 / /home/ubuntu/src/compiled rw,relatime - virtiofs lxd_lxc rw
568 514 8:1 / /boot/efi rw,relatime - vfat /dev/sda1 rw,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro
569 514 0:57 / /var/lib/lxcfs rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other

> 
> I really wonder what mount is it happening to.  BTW, how painful would
> it be to teach syzcaller to turn those cascades of
> 	NONFAILING(*(uint8_t*)0x20000080 = 0x12);
> 	NONFAILING(*(uint8_t*)0x20000081 = 0);
> 	NONFAILING(*(uint16_t*)0x20000082 = 0);
> 	NONFAILING(*(uint32_t*)0x20000084 = 0xffffff9c);
> 	NONFAILING(*(uint64_t*)0x20000088 = 0);
> 	NONFAILING(*(uint64_t*)0x20000090 = 0x20000180);
> 	NONFAILING(memcpy((void*)0x20000180, "./file0\000", 8));
> 	NONFAILING(*(uint32_t*)0x20000098 = 0);
> 	NONFAILING(*(uint32_t*)0x2000009c = 0x80);
> 	NONFAILING(*(uint64_t*)0x200000a0 = 0x23456);
> 	....
> 	NONFAILING(syz_io_uring_submit(r[1], r[2], 0x20000080, 0));
> into something more readable?  Bloody annoyance every time...  Sure, I can
> manually translate it into
> 	struct io_uring_sqe *sqe = (void *)0x20000080;
> 	char *s = (void *)0x20000180;
> 	memset(sqe, '\0', sizeof(*sqe));
> 	sqe->opcode = 0x12; // IORING_OP_OPENAT?
> 	sqe->fd = -100;	// AT_FDCWD?
> 	sqe->addr = s;
> 	strcpy(s, "./file0");
> 	sqe->open_flags = 0x80;	// O_EXCL???
> 	sqe->user_data = 0x23456;	// random tag?
> 	syz_io_uring_submit(r[1], r[2], (unsigned long)p, 0);
> but it's really annoying as hell, especially since syz_io_uring_submit()
> comes from syzcaller and the damn thing _knows_ that the third argument
> is sodding io_uring_sqe, and never passed to anything other than
> memcpy() in there, at that, so the exact address can't matter.

Yeah, it's terrible to get an idea of what's going on in such produces.
I did the same exercise as you. I should probably have replied with all
of that data sooner so you wouldn't have to do the same shite I had to.
Oh well.

> 
> Incidentally, solitary O_EXCL (without O_CREAT) is... curious.  Does that
> sucker still trigger without it?  I.e. with store to 0x2000009c replaced
> with storing 0?

Yeah, I did that as well yesterday morning and it still triggers.

Christian

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-04 11:34             ` Christian Brauner
@ 2021-04-04 15:56               ` Al Viro
  2021-04-04 16:40                 ` Christian Brauner
  0 siblings, 1 reply; 31+ messages in thread
From: Al Viro @ 2021-04-04 15:56 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Sun, Apr 04, 2021 at 01:34:45PM +0200, Christian Brauner wrote:

> Sorry for not replying to your earlier mail but I've been debugging this
> too. My current theory is that it's related to LOOKUP_ROOT_GRABBED when
> LOOKUP_CACHED is specified _possibly_ with an interaction how
> create_io_thread() is created with CLONE_FS. The reproducer requires you
> either have called pivot_root() or chroot() in order for the failure to
> happen. So I think the fact that we skip legitimize_root() when
> LOOKUP_CACHED is set might figure into this. I can keep digging.
> 

> Funny enough I already placed a printk statement into the place you
> wanted one too so I just amended mine. Here's what you get:
> 
> If pivot pivot_root() is used before the chroot() you get:
> 
> [  637.464555] AAAA: count(-1) | mnt_mntpoint(/) | mnt->mnt.mnt_root(/) | id(579) | dev(tmpfs)
> 
> if you only call chroot, i.e. make the pivot_root() branch a simple
> if (true) you get:
> 
> [  955.206117] AAAA: count(-2) | mnt_mntpoint(/) | mnt->mnt.mnt_root(/) | id(580) | dev(tmpfs)

Very interesting.  What happens if you call loop() twice?  And now I wonder
whether it's root or cwd, actually...  Hmm...

How about this:
	fd = open("/proc/self/mountinfo", 0);
	mkdir("./newroot/foo", 0777);
	mount("./newroot/foo", "./newroot/foo", 0, MS_BIND, NULL);
	chroot("./newroot");
	chdir("/foo");
	while (1) {
		static char buf[4096];
		int n = read(fd, buf, 4096);
		if (n <= 0)
			break;
		write(1, buf, n);
	}
	close(fd);
	drop_caps();
	loop();
as the end of namespace_sandbox_proc(), instead of
	chroot("./newroot");
	chdir("/");
	drop_caps();
	loop();
sequence we have there?

> The cat /proc/self/mountinfo is for the id(579) below:
... and it misses the damn thing, since we call it before the mount
in question had been created ;-/  So we'd probably be better off not
trying to be clever and just doing that as explicit (and completely
untested) read-write loop above.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-04 15:56               ` Al Viro
@ 2021-04-04 16:40                 ` Christian Brauner
  2021-04-04 16:44                   ` Al Viro
  2021-04-04 16:52                   ` Christian Brauner
  0 siblings, 2 replies; 31+ messages in thread
From: Christian Brauner @ 2021-04-04 16:40 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Sun, Apr 04, 2021 at 03:56:02PM +0000, Al Viro wrote:
> On Sun, Apr 04, 2021 at 01:34:45PM +0200, Christian Brauner wrote:
> 
> > Sorry for not replying to your earlier mail but I've been debugging this
> > too. My current theory is that it's related to LOOKUP_ROOT_GRABBED when
> > LOOKUP_CACHED is specified _possibly_ with an interaction how
> > create_io_thread() is created with CLONE_FS. The reproducer requires you
> > either have called pivot_root() or chroot() in order for the failure to
> > happen. So I think the fact that we skip legitimize_root() when
> > LOOKUP_CACHED is set might figure into this. I can keep digging.
> > 
> 
> > Funny enough I already placed a printk statement into the place you
> > wanted one too so I just amended mine. Here's what you get:
> > 
> > If pivot pivot_root() is used before the chroot() you get:
> > 
> > [  637.464555] AAAA: count(-1) | mnt_mntpoint(/) | mnt->mnt.mnt_root(/) | id(579) | dev(tmpfs)
> > 
> > if you only call chroot, i.e. make the pivot_root() branch a simple
> > if (true) you get:
> > 
> > [  955.206117] AAAA: count(-2) | mnt_mntpoint(/) | mnt->mnt.mnt_root(/) | id(580) | dev(tmpfs)
> 
> Very interesting.  What happens if you call loop() twice?  And now I wonder
> whether it's root or cwd, actually...  Hmm...
> 
> How about this:
> 	fd = open("/proc/self/mountinfo", 0);
> 	mkdir("./newroot/foo", 0777);
> 	mount("./newroot/foo", "./newroot/foo", 0, MS_BIND, NULL);
> 	chroot("./newroot");
> 	chdir("/foo");
> 	while (1) {
> 		static char buf[4096];
> 		int n = read(fd, buf, 4096);
> 		if (n <= 0)
> 			break;
> 		write(1, buf, n);
> 	}
> 	close(fd);
> 	drop_caps();
> 	loop();
> as the end of namespace_sandbox_proc(), instead of
> 	chroot("./newroot");
> 	chdir("/");
> 	drop_caps();
> 	loop();
> sequence we have there?

Uhum, well then we oops properly with a null-deref.

f1-vm login: [  395.046971][ T5856] general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
[  395.049716][ T5856] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[  395.052847][ T5856] CPU: 1 PID: 5856 Comm: iou-wrk-5851 Tainted: G            E     5.12.0-rc5-020f68e042a19f59784f0962004d848181d13b9e #46
[  395.056594][ T5856] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)/LXD, BIOS 0.0.0 02/06/2015
[  395.058962][ T5856] RIP: 0010:inode_permission+0x4e/0x530
[  395.060362][ T5856] Code: ff 89 de e8 34 42 a9 ff 85 db 0f 85 ef 00 00 00 e8 87 40 a9 ff 4c 8d 7d 02 48 b8 00 00 00 00 00 fc ff df 4c 89 fa 48 c1 ea 03 <0f> b6 14 02 4c 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 c2
[  395.065442][ T5856] RSP: 0018:ffffc9000681f640 EFLAGS: 00010246
[  395.067274][ T5856] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff84ac11fc
[  395.069834][ T5856] RDX: 0000000000000000 RSI: ffff888013210000 RDI: 0000000000000002
[  395.072527][ T5856] RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed1000044f22
[  395.075287][ T5856] R10: ffff888031c105d3 R11: ffffed1000044f21 R12: ffffc9000681f8e0
[  395.077026][ T5856] R13: 0000000000000001 R14: ffffffff8e441e60 R15: 0000000000000002
[  395.079001][ T5856] FS:  00007f136886f800(0000) GS:ffff888015a80000(0000) knlGS:0000000000000000
[  395.081181][ T5856] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  395.082659][ T5856] CR2: 000055ecfefe8e70 CR3: 00000000236b7000 CR4: 0000000000350ee0
[  395.084727][ T5856] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  395.087117][ T5856] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  395.089613][ T5856] Call Trace:
[  395.090570][ T5856]  link_path_walk.part.0+0x790/0xc10
[  395.091735][ T5856]  ? write_comp_data+0x2a/0x90
[  395.092931][ T5856]  ? __sanitizer_cov_trace_pc+0x1d/0x50
[  395.094093][ T5856]  ? walk_component+0x6a0/0x6a0
[  395.095021][ T5856]  ? percpu_counter_add_batch+0xbc/0x180
[  395.096255][ T5856]  path_openat+0x269/0x2790
[  395.097305][ T5856]  ? path_lookupat.isra.0+0x530/0x530
[  395.098391][ T5856]  ? ret_from_fork+0x1f/0x30
[  395.099431][ T5856]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  395.101326][ T5856]  do_filp_open+0x208/0x270
[  395.102894][ T5856]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  395.104525][ T5856]  ? may_open_dev+0xf0/0xf0
[  395.105968][ T5856]  ? do_raw_spin_lock+0x125/0x2e0
[  395.107426][ T5856]  ? write_comp_data+0x2a/0x90
[  395.108651][ T5856]  ? __sanitizer_cov_trace_pc+0x1d/0x50
[  395.110033][ T5856]  ? _raw_spin_unlock+0x29/0x40
[  395.111397][ T5856]  ? alloc_fd+0x499/0x640
[  395.112783][ T5856]  io_openat2+0x1d1/0x8f0
[  395.114237][ T5856]  ? __lock_acquire+0x1847/0x5850
[  395.115982][ T5856]  ? io_req_complete_post+0xa90/0xa90
[  395.117347][ T5856]  ? write_comp_data+0x2a/0x90
[  395.118428][ T5856]  ? io_issue_sqe+0x5ac0/0x5ac0
[  395.119497][ T5856]  io_issue_sqe+0x2a2/0x5ac0
[  395.120707][ T5856]  ? io_poll_complete.constprop.0+0x100/0x100
[  395.130417][ T5856]  ? find_held_lock+0x2d/0x110
[  395.139772][ T5856]  ? io_worker_handle_work+0x55d/0x13b0
[  395.149051][ T5856]  ? lock_downgrade+0x690/0x690
[  395.158262][ T5856]  ? rwlock_bug.part.0+0x90/0x90
[  395.171997][ T5856]  ? io_issue_sqe+0x5ac0/0x5ac0
[  395.180513][ T5856]  io_wq_submit_work+0x1da/0x640
[  395.193646][ T5856]  ? io_issue_sqe+0x5ac0/0x5ac0
[  395.202240][ T5856]  io_worker_handle_work+0x70a/0x13b0
[  395.210276][ T5856]  ? rwlock_bug.part.0+0x90/0x90
[  395.218389][ T5856]  io_wqe_worker+0x2ec/0xd00
[  395.226152][ T5856]  ? io_worker_handle_work+0x13b0/0x13b0
[  395.238492][ T5856]  ? ret_from_fork+0x8/0x30
[  395.245595][ T5856]  ? lock_downgrade+0x690/0x690
[  395.252630][ T5856]  ? do_raw_spin_lock+0x125/0x2e0
[  395.259801][ T5856]  ? rwlock_bug.part.0+0x90/0x90
[  395.271613][ T5856]  ? write_comp_data+0x2a/0x90
[  395.279000][ T5856]  ? _raw_spin_unlock_irq+0x24/0x50
[  395.285999][ T5856]  ? io_worker_handle_work+0x13b0/0x13b0
[  395.297727][ T5856]  ret_from_fork+0x1f/0x30
[  395.304546][ T5856] Modules linked in: efi_pstore(E) efivarfs(E)
[  395.311931][ T5851] general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#2] PREEMPT SMP KASAN
[  395.313449][ T5856] ---[ end trace e9d61a560dd12328 ]---

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-04 16:40                 ` Christian Brauner
@ 2021-04-04 16:44                   ` Al Viro
  2021-04-04 17:05                     ` Christian Brauner
  2021-04-04 16:52                   ` Christian Brauner
  1 sibling, 1 reply; 31+ messages in thread
From: Al Viro @ 2021-04-04 16:44 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Sun, Apr 04, 2021 at 06:40:40PM +0200, Christian Brauner wrote:

> > Very interesting.  What happens if you call loop() twice?  And now I wonder
> > whether it's root or cwd, actually...  Hmm...
> > 
> > How about this:
> > 	fd = open("/proc/self/mountinfo", 0);
> > 	mkdir("./newroot/foo", 0777);
> > 	mount("./newroot/foo", "./newroot/foo", 0, MS_BIND, NULL);
> > 	chroot("./newroot");
> > 	chdir("/foo");
> > 	while (1) {
> > 		static char buf[4096];
> > 		int n = read(fd, buf, 4096);
> > 		if (n <= 0)
> > 			break;
> > 		write(1, buf, n);
> > 	}
> > 	close(fd);
> > 	drop_caps();
> > 	loop();
> > as the end of namespace_sandbox_proc(), instead of
> > 	chroot("./newroot");
> > 	chdir("/");
> > 	drop_caps();
> > 	loop();
> > sequence we have there?
> 
> Uhum, well then we oops properly with a null-deref.

Cute...  Could you dump namei.o (ideally - with namei.s) from your build
someplace public?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-04 16:40                 ` Christian Brauner
  2021-04-04 16:44                   ` Al Viro
@ 2021-04-04 16:52                   ` Christian Brauner
  2021-04-04 16:55                     ` Christian Brauner
  1 sibling, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-04-04 16:52 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Sun, Apr 04, 2021 at 06:40:40PM +0200, Christian Brauner wrote:
> On Sun, Apr 04, 2021 at 03:56:02PM +0000, Al Viro wrote:
> > On Sun, Apr 04, 2021 at 01:34:45PM +0200, Christian Brauner wrote:
> > 
> > > Sorry for not replying to your earlier mail but I've been debugging this
> > > too. My current theory is that it's related to LOOKUP_ROOT_GRABBED when
> > > LOOKUP_CACHED is specified _possibly_ with an interaction how
> > > create_io_thread() is created with CLONE_FS. The reproducer requires you
> > > either have called pivot_root() or chroot() in order for the failure to
> > > happen. So I think the fact that we skip legitimize_root() when
> > > LOOKUP_CACHED is set might figure into this. I can keep digging.
> > > 
> > 
> > > Funny enough I already placed a printk statement into the place you
> > > wanted one too so I just amended mine. Here's what you get:
> > > 
> > > If pivot pivot_root() is used before the chroot() you get:
> > > 
> > > [  637.464555] AAAA: count(-1) | mnt_mntpoint(/) | mnt->mnt.mnt_root(/) | id(579) | dev(tmpfs)
> > > 
> > > if you only call chroot, i.e. make the pivot_root() branch a simple
> > > if (true) you get:
> > > 
> > > [  955.206117] AAAA: count(-2) | mnt_mntpoint(/) | mnt->mnt.mnt_root(/) | id(580) | dev(tmpfs)
> > 
> > Very interesting.  What happens if you call loop() twice?  And now I wonder
> > whether it's root or cwd, actually...  Hmm...
> > 
> > How about this:
> > 	fd = open("/proc/self/mountinfo", 0);
> > 	mkdir("./newroot/foo", 0777);
> > 	mount("./newroot/foo", "./newroot/foo", 0, MS_BIND, NULL);
> > 	chroot("./newroot");
> > 	chdir("/foo");
> > 	while (1) {
> > 		static char buf[4096];
> > 		int n = read(fd, buf, 4096);
> > 		if (n <= 0)
> > 			break;
> > 		write(1, buf, n);
> > 	}
> > 	close(fd);
> > 	drop_caps();
> > 	loop();
> > as the end of namespace_sandbox_proc(), instead of
> > 	chroot("./newroot");
> > 	chdir("/");
> > 	drop_caps();
> > 	loop();
> > sequence we have there?
> 
> Uhum, well then we oops properly with a null-deref.

And note that the reproducer also requires CLONE_NEWNS which causes the
fs_struct to be unshared as well. I'm not completely in the clear what
would happen if a new io worker thread were to be created after the
caller has called unshare(CLONE_NEWNS).

> 
> f1-vm login: [  395.046971][ T5856] general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
> [  395.049716][ T5856] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> [  395.052847][ T5856] CPU: 1 PID: 5856 Comm: iou-wrk-5851 Tainted: G            E     5.12.0-rc5-020f68e042a19f59784f0962004d848181d13b9e #46
> [  395.056594][ T5856] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)/LXD, BIOS 0.0.0 02/06/2015
> [  395.058962][ T5856] RIP: 0010:inode_permission+0x4e/0x530
> [  395.060362][ T5856] Code: ff 89 de e8 34 42 a9 ff 85 db 0f 85 ef 00 00 00 e8 87 40 a9 ff 4c 8d 7d 02 48 b8 00 00 00 00 00 fc ff df 4c 89 fa 48 c1 ea 03 <0f> b6 14 02 4c 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 c2
> [  395.065442][ T5856] RSP: 0018:ffffc9000681f640 EFLAGS: 00010246
> [  395.067274][ T5856] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff84ac11fc
> [  395.069834][ T5856] RDX: 0000000000000000 RSI: ffff888013210000 RDI: 0000000000000002
> [  395.072527][ T5856] RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed1000044f22
> [  395.075287][ T5856] R10: ffff888031c105d3 R11: ffffed1000044f21 R12: ffffc9000681f8e0
> [  395.077026][ T5856] R13: 0000000000000001 R14: ffffffff8e441e60 R15: 0000000000000002
> [  395.079001][ T5856] FS:  00007f136886f800(0000) GS:ffff888015a80000(0000) knlGS:0000000000000000
> [  395.081181][ T5856] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  395.082659][ T5856] CR2: 000055ecfefe8e70 CR3: 00000000236b7000 CR4: 0000000000350ee0
> [  395.084727][ T5856] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  395.087117][ T5856] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  395.089613][ T5856] Call Trace:
> [  395.090570][ T5856]  link_path_walk.part.0+0x790/0xc10
> [  395.091735][ T5856]  ? write_comp_data+0x2a/0x90
> [  395.092931][ T5856]  ? __sanitizer_cov_trace_pc+0x1d/0x50
> [  395.094093][ T5856]  ? walk_component+0x6a0/0x6a0
> [  395.095021][ T5856]  ? percpu_counter_add_batch+0xbc/0x180
> [  395.096255][ T5856]  path_openat+0x269/0x2790
> [  395.097305][ T5856]  ? path_lookupat.isra.0+0x530/0x530
> [  395.098391][ T5856]  ? ret_from_fork+0x1f/0x30
> [  395.099431][ T5856]  ? lockdep_hardirqs_on_prepare+0x400/0x400
> [  395.101326][ T5856]  do_filp_open+0x208/0x270
> [  395.102894][ T5856]  ? rcu_read_lock_bh_held+0xb0/0xb0
> [  395.104525][ T5856]  ? may_open_dev+0xf0/0xf0
> [  395.105968][ T5856]  ? do_raw_spin_lock+0x125/0x2e0
> [  395.107426][ T5856]  ? write_comp_data+0x2a/0x90
> [  395.108651][ T5856]  ? __sanitizer_cov_trace_pc+0x1d/0x50
> [  395.110033][ T5856]  ? _raw_spin_unlock+0x29/0x40
> [  395.111397][ T5856]  ? alloc_fd+0x499/0x640
> [  395.112783][ T5856]  io_openat2+0x1d1/0x8f0
> [  395.114237][ T5856]  ? __lock_acquire+0x1847/0x5850
> [  395.115982][ T5856]  ? io_req_complete_post+0xa90/0xa90
> [  395.117347][ T5856]  ? write_comp_data+0x2a/0x90
> [  395.118428][ T5856]  ? io_issue_sqe+0x5ac0/0x5ac0
> [  395.119497][ T5856]  io_issue_sqe+0x2a2/0x5ac0
> [  395.120707][ T5856]  ? io_poll_complete.constprop.0+0x100/0x100
> [  395.130417][ T5856]  ? find_held_lock+0x2d/0x110
> [  395.139772][ T5856]  ? io_worker_handle_work+0x55d/0x13b0
> [  395.149051][ T5856]  ? lock_downgrade+0x690/0x690
> [  395.158262][ T5856]  ? rwlock_bug.part.0+0x90/0x90
> [  395.171997][ T5856]  ? io_issue_sqe+0x5ac0/0x5ac0
> [  395.180513][ T5856]  io_wq_submit_work+0x1da/0x640
> [  395.193646][ T5856]  ? io_issue_sqe+0x5ac0/0x5ac0
> [  395.202240][ T5856]  io_worker_handle_work+0x70a/0x13b0
> [  395.210276][ T5856]  ? rwlock_bug.part.0+0x90/0x90
> [  395.218389][ T5856]  io_wqe_worker+0x2ec/0xd00
> [  395.226152][ T5856]  ? io_worker_handle_work+0x13b0/0x13b0
> [  395.238492][ T5856]  ? ret_from_fork+0x8/0x30
> [  395.245595][ T5856]  ? lock_downgrade+0x690/0x690
> [  395.252630][ T5856]  ? do_raw_spin_lock+0x125/0x2e0
> [  395.259801][ T5856]  ? rwlock_bug.part.0+0x90/0x90
> [  395.271613][ T5856]  ? write_comp_data+0x2a/0x90
> [  395.279000][ T5856]  ? _raw_spin_unlock_irq+0x24/0x50
> [  395.285999][ T5856]  ? io_worker_handle_work+0x13b0/0x13b0
> [  395.297727][ T5856]  ret_from_fork+0x1f/0x30
> [  395.304546][ T5856] Modules linked in: efi_pstore(E) efivarfs(E)
> [  395.311931][ T5851] general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#2] PREEMPT SMP KASAN
> [  395.313449][ T5856] ---[ end trace e9d61a560dd12328 ]---

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-04 16:52                   ` Christian Brauner
@ 2021-04-04 16:55                     ` Christian Brauner
  0 siblings, 0 replies; 31+ messages in thread
From: Christian Brauner @ 2021-04-04 16:55 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Sun, Apr 04, 2021 at 06:52:08PM +0200, Christian Brauner wrote:
> On Sun, Apr 04, 2021 at 06:40:40PM +0200, Christian Brauner wrote:
> > On Sun, Apr 04, 2021 at 03:56:02PM +0000, Al Viro wrote:
> > > On Sun, Apr 04, 2021 at 01:34:45PM +0200, Christian Brauner wrote:
> > > 
> > > > Sorry for not replying to your earlier mail but I've been debugging this
> > > > too. My current theory is that it's related to LOOKUP_ROOT_GRABBED when
> > > > LOOKUP_CACHED is specified _possibly_ with an interaction how
> > > > create_io_thread() is created with CLONE_FS. The reproducer requires you
> > > > either have called pivot_root() or chroot() in order for the failure to
> > > > happen. So I think the fact that we skip legitimize_root() when
> > > > LOOKUP_CACHED is set might figure into this. I can keep digging.
> > > > 
> > > 
> > > > Funny enough I already placed a printk statement into the place you
> > > > wanted one too so I just amended mine. Here's what you get:
> > > > 
> > > > If pivot pivot_root() is used before the chroot() you get:
> > > > 
> > > > [  637.464555] AAAA: count(-1) | mnt_mntpoint(/) | mnt->mnt.mnt_root(/) | id(579) | dev(tmpfs)
> > > > 
> > > > if you only call chroot, i.e. make the pivot_root() branch a simple
> > > > if (true) you get:
> > > > 
> > > > [  955.206117] AAAA: count(-2) | mnt_mntpoint(/) | mnt->mnt.mnt_root(/) | id(580) | dev(tmpfs)
> > > 
> > > Very interesting.  What happens if you call loop() twice?  And now I wonder
> > > whether it's root or cwd, actually...  Hmm...
> > > 
> > > How about this:
> > > 	fd = open("/proc/self/mountinfo", 0);
> > > 	mkdir("./newroot/foo", 0777);
> > > 	mount("./newroot/foo", "./newroot/foo", 0, MS_BIND, NULL);
> > > 	chroot("./newroot");
> > > 	chdir("/foo");
> > > 	while (1) {
> > > 		static char buf[4096];
> > > 		int n = read(fd, buf, 4096);
> > > 		if (n <= 0)
> > > 			break;
> > > 		write(1, buf, n);
> > > 	}
> > > 	close(fd);
> > > 	drop_caps();
> > > 	loop();
> > > as the end of namespace_sandbox_proc(), instead of
> > > 	chroot("./newroot");
> > > 	chdir("/");
> > > 	drop_caps();
> > > 	loop();
> > > sequence we have there?
> > 
> > Uhum, well then we oops properly with a null-deref.
> 
> And note that the reproducer also requires CLONE_NEWNS which causes the
> fs_struct to be unshared as well. I'm not completely in the clear what
> would happen if a new io worker thread were to be created after the
> caller has called unshare(CLONE_NEWNS).

And here's a non-null-deref version:

[  647.257107] AAAA: count(-1) | mnt_mntpoint(foo) | mnt->mnt.mnt_root(foo) | id(1358) | dev(tmpfs)

which is

1358 1326 0:66 /newroot/foo /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/foo rw,relatime - tmpfs  rw

Just for kicks, here's the full mount table:

1224 513 8:2 / / rw,relatime - ext4 /dev/sda2 rw
1225 1224 0:5 / /dev rw,nosuid,noexec,relatime - devtmpfs udev rw,size=302716k,nr_inodes=75679,mode=755
1226 1225 0:26 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
1227 1225 0:28 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw
1228 1225 0:48 / /dev/hugepages rw,relatime - hugetlbfs hugetlbfs rw,pagesize=2M
1229 1225 0:21 / /dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
1230 1224 0:27 / /run rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,size=62152k,mode=755
1231 1230 0:29 / /run/lock rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,size=5120k
1232 1230 0:49 / /run/lxd_agent rw,relatime - tmpfs tmpfs rw,size=51200k,mode=700
1233 1230 0:59 / /run/user/1000 rw,nosuid,nodev,relatime - tmpfs tmpfs rw,size=62148k,nr_inodes=15537,mode=700,uid=1000,gid=1000
1234 1224 0:24 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
1235 1234 0:6 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime - securityfs securityfs rw
1236 1234 0:30 / /sys/fs/cgroup ro,nosuid,nodev,noexec - tmpfs tmpfs ro,size=4096k,nr_inodes=1024,mode=755
1237 1236 0:31 /../../.. /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw
1238 1236 0:32 /../../.. /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,name=systemd
1239 1236 0:36 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event
1240 1236 0:37 /.. /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
1241 1236 0:38 / /sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,rdma
1242 1236 0:39 /.. /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu,cpuacct
1243 1236 0:40 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls,net_prio
1244 1236 0:41 /.. /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
1245 1236 0:42 /../../.. /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
1246 1236 0:43 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,hugetlb
1247 1236 0:44 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset,clone_children
1248 1236 0:45 /../../.. /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids
1249 1236 0:46 /../../.. /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
1250 1234 0:33 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime - pstore pstore rw
1251 1234 0:34 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime - efivarfs efivarfs rw
1252 1234 0:35 / /sys/fs/bpf rw,nosuid,nodev,noexec,relatime - bpf none rw,mode=700
1253 1234 0:7 / /sys/kernel/debug rw,nosuid,nodev,noexec,relatime - debugfs debugfs rw
1254 1234 0:12 / /sys/kernel/tracing rw,nosuid,nodev,noexec,relatime - tracefs tracefs rw
1255 1234 0:51 / /sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime - fusectl fusectl rw
1256 1234 0:20 / /sys/kernel/config rw,nosuid,nodev,noexec,relatime - configfs configfs rw
1257 1224 0:25 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
1258 1257 0:47 / /proc/sys/fs/binfmt_misc rw,relatime - autofs systemd-1 rw,fd=29,pgrp=0,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=34137
1259 1258 0:52 / /proc/sys/fs/binfmt_misc rw,nosuid,nodev,noexec,relatime - binfmt_misc binfmt_misc rw
1260 1224 0:50 / /home/ubuntu/src/compiled rw,relatime - virtiofs lxd_lxc rw
1261 1224 8:1 / /boot/efi rw,relatime - vfat /dev/sda1 rw,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro
1262 1224 0:57 / /var/lib/lxcfs rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw,user_id=0,group_id=0,allow_other
1263 1224 0:60 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp rw,relatime - tmpfs  rw
1264 1263 0:5 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/dev rw,nosuid,noexec,relatime - devtmpfs udev rw,size=302716k,nr_inodes=75679,mode=755
1265 1264 0:26 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
1266 1264 0:28 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/dev/shm rw,nosuid,nodev - tmpfs tmpfs rw
1267 1264 0:48 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/dev/hugepages rw,relatime - hugetlbfs hugetlbfs rw,pagesize=2M
1268 1264 0:21 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
1269 1263 0:61 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/proc rw,relatime - proc none rw
1270 1263 0:24 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
1271 1270 0:6 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/kernel/security rw,nosuid,nodev,noexec,relatime - securityfs securityfs rw
1272 1270 0:30 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup ro,nosuid,nodev,noexec - tmpfs tmpfs ro,size=4096k,nr_inodes=1024,mode=755
1273 1272 0:31 /../../.. /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw
1274 1272 0:32 /../../.. /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,name=systemd
1275 1272 0:36 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event
1276 1272 0:37 /.. /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
1277 1272 0:38 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,rdma
1278 1272 0:39 /.. /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu,cpuacct
1279 1272 0:40 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls,net_prio
1280 1272 0:41 /.. /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
1281 1272 0:42 /../../.. /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
1282 1272 0:43 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,hugetlb
1283 1272 0:44 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset,clone_children
1284 1272 0:45 /../../.. /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids
1285 1272 0:46 /../../.. /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
1286 1270 0:33 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/pstore rw,nosuid,nodev,noexec,relatime - pstore pstore rw
1287 1270 0:34 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime - efivarfs efivarfs rw
1288 1270 0:35 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/bpf rw,nosuid,nodev,noexec,relatime - bpf none rw,mode=700
1289 1270 0:7 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/kernel/debug rw,nosuid,nodev,noexec,relatime - debugfs debugfs rw
1290 1270 0:12 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/kernel/tracing rw,nosuid,nodev,noexec,relatime - tracefs tracefs rw
1291 1270 0:51 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime - fusectl fusectl rw
1292 1270 0:20 / /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/sys/kernel/config rw,nosuid,nodev,noexec,relatime - configfs configfs rw
1293 1263 0:60 /newroot/foo /home/ubuntu/syzkaller.qO1bUT/syz-tmp/newroot/foo rw,relatime - tmpfs  rw
1294 1224 0:62 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp rw,relatime - tmpfs  rw
1295 1294 0:5 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/dev rw,nosuid,noexec,relatime - devtmpfs udev rw,size=302716k,nr_inodes=75679,mode=755
1296 1295 0:26 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
1297 1295 0:28 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/dev/shm rw,nosuid,nodev - tmpfs tmpfs rw
1298 1295 0:48 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/dev/hugepages rw,relatime - hugetlbfs hugetlbfs rw,pagesize=2M
1299 1295 0:21 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
1300 1294 0:63 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/proc rw,relatime - proc none rw
1301 1294 0:24 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
1302 1301 0:6 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/kernel/security rw,nosuid,nodev,noexec,relatime - securityfs securityfs rw
1303 1301 0:30 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup ro,nosuid,nodev,noexec - tmpfs tmpfs ro,size=4096k,nr_inodes=1024,mode=755
1304 1303 0:31 /../../.. /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw
1305 1303 0:32 /../../.. /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,name=systemd
1306 1303 0:36 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event
1307 1303 0:37 /.. /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
1308 1303 0:38 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,rdma
1309 1303 0:39 /.. /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu,cpuacct
1310 1303 0:40 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls,net_prio
1311 1303 0:41 /.. /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
1312 1303 0:42 /../../.. /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
1313 1303 0:43 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,hugetlb
1314 1303 0:44 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset,clone_children
1315 1303 0:45 /../../.. /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids
1316 1303 0:46 /../../.. /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
1317 1301 0:33 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/pstore rw,nosuid,nodev,noexec,relatime - pstore pstore rw
1318 1301 0:34 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime - efivarfs efivarfs rw
1319 1301 0:35 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/bpf rw,nosuid,nodev,noexec,relatime - bpf none rw,mode=700
1320 1301 0:7 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/kernel/debug rw,nosuid,nodev,noexec,relatime - debugfs debugfs rw
1321 1301 0:12 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/kernel/tracing rw,nosuid,nodev,noexec,relatime - tracefs tracefs rw
1322 1301 0:51 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime - fusectl fusectl rw
1323 1301 0:20 / /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/sys/kernel/config rw,nosuid,nodev,noexec,relatime - configfs configfs rw
1324 1294 0:62 /newroot/foo /home/ubuntu/syzkaller.AVk8bL/syz-tmp/newroot/foo rw,relatime - tmpfs  rw
1326 1224 0:66 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp rw,relatime - tmpfs  rw
1327 1326 0:5 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/dev rw,nosuid,noexec,relatime - devtmpfs udev rw,size=302716k,nr_inodes=75679,mode=755
1328 1327 0:26 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,gid=5,mode=620,ptmxmode=000
1329 1327 0:28 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/dev/shm rw,nosuid,nodev - tmpfs tmpfs rw
1330 1327 0:48 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/dev/hugepages rw,relatime - hugetlbfs hugetlbfs rw,pagesize=2M
1331 1327 0:21 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/dev/mqueue rw,nosuid,nodev,noexec,relatime - mqueue mqueue rw
1332 1326 0:67 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/proc rw,relatime - proc none rw
1333 1326 0:24 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
1334 1333 0:6 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/kernel/security rw,nosuid,nodev,noexec,relatime - securityfs securityfs rw
1335 1333 0:30 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup ro,nosuid,nodev,noexec - tmpfs tmpfs ro,size=4096k,nr_inodes=1024,mode=755
1336 1335 0:31 /../../.. /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime - cgroup2 cgroup2 rw
1337 1335 0:32 /../../.. /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,xattr,name=systemd
1338 1335 0:36 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,perf_event
1339 1335 0:37 /.. /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,blkio
1340 1335 0:38 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup/rdma rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,rdma
1341 1335 0:39 /.. /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpu,cpuacct
1342 1335 0:40 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,net_cls,net_prio
1343 1335 0:41 /.. /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,devices
1344 1335 0:42 /../../.. /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,memory
1345 1335 0:43 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,hugetlb
1346 1335 0:44 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,cpuset,clone_children
1347 1335 0:45 /../../.. /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,pids
1348 1335 0:46 /../../.. /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
1349 1333 0:33 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/pstore rw,nosuid,nodev,noexec,relatime - pstore pstore rw
1350 1333 0:34 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime - efivarfs efivarfs rw
1351 1333 0:35 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/bpf rw,nosuid,nodev,noexec,relatime - bpf none rw,mode=700
1352 1333 0:7 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/kernel/debug rw,nosuid,nodev,noexec,relatime - debugfs debugfs rw
1353 1333 0:12 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/kernel/tracing rw,nosuid,nodev,noexec,relatime - tracefs tracefs rw
1354 1333 0:51 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/fs/fuse/connections rw,nosuid,nodev,noexec,relatime - fusectl fusectl rw
1355 1333 0:20 / /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/sys/kernel/config rw,nosuid,nodev,noexec,relatime - configfs configfs rw
1358 1326 0:66 /newroot/foo /home/ubuntu/syzkaller.Wgqj6W/syz-tmp/newroot/foo rw,relatime - tmpfs  rw
OPCODE(18) | fd(-100) | path(./file0)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-04 16:44                   ` Al Viro
@ 2021-04-04 17:05                     ` Christian Brauner
  2021-04-04 18:50                       ` Al Viro
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-04-04 17:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Sun, Apr 04, 2021 at 04:44:06PM +0000, Al Viro wrote:
> On Sun, Apr 04, 2021 at 06:40:40PM +0200, Christian Brauner wrote:
> 
> > > Very interesting.  What happens if you call loop() twice?  And now I wonder
> > > whether it's root or cwd, actually...  Hmm...
> > > 
> > > How about this:
> > > 	fd = open("/proc/self/mountinfo", 0);
> > > 	mkdir("./newroot/foo", 0777);
> > > 	mount("./newroot/foo", "./newroot/foo", 0, MS_BIND, NULL);
> > > 	chroot("./newroot");
> > > 	chdir("/foo");
> > > 	while (1) {
> > > 		static char buf[4096];
> > > 		int n = read(fd, buf, 4096);
> > > 		if (n <= 0)
> > > 			break;
> > > 		write(1, buf, n);
> > > 	}
> > > 	close(fd);
> > > 	drop_caps();
> > > 	loop();
> > > as the end of namespace_sandbox_proc(), instead of
> > > 	chroot("./newroot");
> > > 	chdir("/");
> > > 	drop_caps();
> > > 	loop();
> > > sequence we have there?
> > 
> > Uhum, well then we oops properly with a null-deref.
> 
> Cute...  Could you dump namei.o (ideally - with namei.s) from your build
> someplace public?

Yeah, I have at least namei.o

https://drive.google.com/file/d/1AvO1St0YltIrA86DXjp1Xg3ojtS9owGh/view?usp=sharing

Christian

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-04 17:05                     ` Christian Brauner
@ 2021-04-04 18:50                       ` Al Viro
  2021-04-04 20:17                         ` Al Viro
  0 siblings, 1 reply; 31+ messages in thread
From: Al Viro @ 2021-04-04 18:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Sun, Apr 04, 2021 at 07:05:13PM +0200, Christian Brauner wrote:
> On Sun, Apr 04, 2021 at 04:44:06PM +0000, Al Viro wrote:
> > On Sun, Apr 04, 2021 at 06:40:40PM +0200, Christian Brauner wrote:
> > 
> > > > Very interesting.  What happens if you call loop() twice?  And now I wonder
> > > > whether it's root or cwd, actually...  Hmm...
> > > > 
> > > > How about this:
> > > > 	fd = open("/proc/self/mountinfo", 0);
> > > > 	mkdir("./newroot/foo", 0777);
> > > > 	mount("./newroot/foo", "./newroot/foo", 0, MS_BIND, NULL);
> > > > 	chroot("./newroot");
> > > > 	chdir("/foo");
> > > > 	while (1) {
> > > > 		static char buf[4096];
> > > > 		int n = read(fd, buf, 4096);
> > > > 		if (n <= 0)
> > > > 			break;
> > > > 		write(1, buf, n);
> > > > 	}
> > > > 	close(fd);
> > > > 	drop_caps();
> > > > 	loop();
> > > > as the end of namespace_sandbox_proc(), instead of
> > > > 	chroot("./newroot");
> > > > 	chdir("/");
> > > > 	drop_caps();
> > > > 	loop();
> > > > sequence we have there?
> > > 
> > > Uhum, well then we oops properly with a null-deref.
> > 
> > Cute...  Could you dump namei.o (ideally - with namei.s) from your build
> > someplace public?
> 
> Yeah, I have at least namei.o
> 
> https://drive.google.com/file/d/1AvO1St0YltIrA86DXjp1Xg3ojtS9owGh/view?usp=sharing

*grumble*

Is it reproducible without KASAN?  Would be much easier to follow the produced
asm...

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-04 18:50                       ` Al Viro
@ 2021-04-04 20:17                         ` Al Viro
  2021-04-05 11:44                           ` Christian Brauner
  0 siblings, 1 reply; 31+ messages in thread
From: Al Viro @ 2021-04-04 20:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Sun, Apr 04, 2021 at 06:50:10PM +0000, Al Viro wrote:

> > Yeah, I have at least namei.o
> > 
> > https://drive.google.com/file/d/1AvO1St0YltIrA86DXjp1Xg3ojtS9owGh/view?usp=sharing
> 
> *grumble*
> 
> Is it reproducible without KASAN?  Would be much easier to follow the produced
> asm...

	Looks like inode_permission(_, NULL, _) from may_lookup(nd).  I.e.
nd->inode == NULL.

	Mind slapping BUG_ON(!nd->inode) right before may_lookup() call in
link_path_walk() and trying to reproduce that oops?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-04 20:17                         ` Al Viro
@ 2021-04-05 11:44                           ` Christian Brauner
  2021-04-05 16:18                             ` Al Viro
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-04-05 11:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Sun, Apr 04, 2021 at 08:17:21PM +0000, Al Viro wrote:
> On Sun, Apr 04, 2021 at 06:50:10PM +0000, Al Viro wrote:
> 
> > > Yeah, I have at least namei.o
> > > 
> > > https://drive.google.com/file/d/1AvO1St0YltIrA86DXjp1Xg3ojtS9owGh/view?usp=sharing
> > 
> > *grumble*
> > 
> > Is it reproducible without KASAN?  Would be much easier to follow the produced
> > asm...
> 
> 	Looks like inode_permission(_, NULL, _) from may_lookup(nd).  I.e.
> nd->inode == NULL.

Yeah, I already saw that.

> 
> 	Mind slapping BUG_ON(!nd->inode) right before may_lookup() call in
> link_path_walk() and trying to reproduce that oops?

Yep, no problem. If you run the reproducer in a loop for a little while
you eventually trigger the BUG_ON() and then you get the following splat
(and then an endless loop) in [1] with nd->inode NULL.

_But_ I managed to debug this further and was able to trigger the BUG_ON()
directly in path_init() in the AT_FDCWD branch (after all its AT_FDCWD(./file0)
with the patch in [3] (it's in LOOKUP_RCU) the corresponding splat is in [2].
So the crash happens for a PF_IO_WORKER thread with a NULL nd->inode for the
PF_IO_WORKER's pwd (The PF_IO_WORKER seems to be in async context.).

[3]:
diff --git a/fs/namei.c b/fs/namei.c
index 28a70diff --git a/fs/namei.c b/fs/namei.c
index 28a7006bdb04..8a31ccb61c45 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2336,8 +2336,8 @@ static const char *path_init(struct nameidata *nd, unsigned flags)

        /* Relative pathname -- get the starting-point it is relative to. */
        if (nd->dfd == AT_FDCWD) {
+               struct fs_struct *fs = current->fs;
                if (flags & LOOKUP_RCU) {
-                       struct fs_struct *fs = current->fs;
                        unsigned seq;

                        do {
@@ -2347,9 +2347,14 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
                                nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
                        } while (read_seqcount_retry(&fs->seq, seq));
                } else {
-                       get_fs_pwd(current->fs, &nd->path);
+                       get_fs_pwd(fs, &nd->path);
                        nd->inode = nd->path.dentry->d_inode;
                }
+
+               BUG_ON(!fs->users);
+               BUG_ON(fs->users < 0);
+               BUG_ON(!nd->inode && (current->flags & PF_IO_WORKER));
+               BUG_ON(!nd->inode);
        } else {
                /* Caller must check execute permissions on the starting path component */
                struct fd f = fdget_raw(nd->dfd);

[1]:
[  257.564526] ------------[ cut here ]------------
[  257.564549] kernel BUG at fs/namei.c:2208!
[  257.564998] ------------[ cut here ]------------
[  257.565773] kernel BUG at fs/namei.c:2208!
[  257.567632] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[  257.569113] CPU: 3 PID: 6086 Comm: uring_viro Tainted: G        W   E     5.12.0-rc5-b47394a22e3dce3d03165e48ef0462f41c198ac7 #47
[  257.572028] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)/LXD, BIOS 0.0.0 02/06/2015
[  257.573687] RIP: 0010:link_path_walk.part.0+0x6c9/0xd30
[  257.575008] Code: 00 48 8b 43 18 48 89 44 24 18 48 8b 44 24 38 80 38 00 0f 85 b1 04 00 00 49 8b 6e 30 48 85 ed 0f 85 82 fb ff ff e8 67 35 a8 ff <0f> 0b e8 60 35 a8 ff 49 83 c7 01 48 bf 00 00 00 00 00 fc ff df 4c
[  257.580853] RSP: 0018:ffffc900052af5c0 EFLAGS: 00010246
[  257.582513] RAX: 0000000000000000 RBX: ffff88802f7b2da0 RCX: ffffffff8ecd175c
[  257.584211] RDX: 0000000000000000 RSI: ffff8880133a0000 RDI: 0000000000000002
[  257.585985] RBP: 0000000000000000 R08: 0000000000000001 R09: fffffbfff388f13e
[  257.588061] R10: ffff8880322e0353 R11: fffffbfff388f13d R12: ffffc900052af820
[  257.590059] R13: ffff8880133a0000 R14: ffffc900052af820 R15: ffff888031149120
[  257.592605] FS:  00007f445b331800(0000) GS:ffff888015b80000(0000) knlGS:0000000000000000
[  257.595266] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  257.597262] CR2: 00005578c2c17008 CR3: 00000000307a4000 CR4: 0000000000350ee0
[  257.599546] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  257.601331] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  257.602938] Call Trace:
[  257.603764]  ? write_comp_data+0x2a/0x90
[  257.605637]  ? __sanitizer_cov_trace_pc+0x1d/0x50
[  257.607107]  ? path_init+0x662/0x1800
[  257.608713]  ? walk_component+0x6a0/0x6a0
[  257.610125]  path_openat+0x269/0x2790
[  257.611601]  ? path_lookupat.isra.0+0x530/0x530
[  257.613180]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  257.614586]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  257.615898]  do_filp_open+0x197/0x270
[  257.616881]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  257.618293]  ? may_open_dev+0xf0/0xf0
[  257.619700]  ? do_raw_spin_lock+0x125/0x2e0
[  257.621159]  ? write_comp_data+0x2a/0x90
[  257.622574]  ? __sanitizer_cov_trace_pc+0x1d/0x50
[  257.624271]  ? _raw_spin_unlock+0x29/0x40
[  257.625737]  ? alloc_fd+0x499/0x640
[  257.627075]  io_openat2+0x1d1/0x8f0
[  257.628458]  ? io_req_complete_post+0xa90/0xa90
[  257.629921]  ? __lock_acquire+0x1847/0x5850
[  257.631108]  ? write_comp_data+0x2a/0x90
[  257.632069]  io_issue_sqe+0x2a2/0x5ac0
[  257.633112]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  257.634534]  ? io_poll_complete.constprop.0+0x100/0x100
[  257.636129]  ? rcu_read_lock_sched_held+0xa1/0xd0
[  257.647374]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  257.656776]  ? find_held_lock+0x2d/0x110
[  257.665481]  ? __might_fault+0xd8/0x180
[  257.674198]  __io_queue_sqe+0x19f/0xcf0
[  257.688028]  ? __check_object_size+0x1b4/0x4e0
[  257.696724]  ? __ia32_sys_io_uring_setup+0x70/0x70
[  257.705177]  ? write_comp_data+0x2a/0x90
[  257.713125]  io_queue_sqe+0x612/0xb70
[  257.720757]  io_submit_sqes+0x517d/0x6650
[  257.733042]  ? __x64_sys_io_uring_enter+0xb15/0xdd0
[  257.741308]  __x64_sys_io_uring_enter+0xb15/0xdd0
[  257.748782]  ? __ia32_sys_io_uring_enter+0xdd0/0xdd0
[  257.756094]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  257.764428]  ? syscall_enter_from_user_mode+0x27/0x70
[  257.773331]  do_syscall_64+0x2d/0x70
[  257.780447]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  257.787936] RIP: 0033:0x7f445b44767d
[  257.795186] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bb f7 0c 00 f7 d8 64 89 01 48
[  257.816184] RSP: 002b:00005578c0f91e98 EFLAGS: 00000212 ORIG_RAX: 00000000000001aa
[  257.824286] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f445b44767d
[  257.832553] RDX: 0000000000000000 RSI: 00000000000045f5 RDI: 0000000000000003
[  257.844189] RBP: 00005578c0f91f70 R08: 0000000000000000 R09: 0000000000000000
[  257.853290] R10: 0000000000000000 R11: 0000000000000212 R12: 00005578c0e64640
[  257.861454] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  257.874753] Modules linked in: efi_pstore(E) efivarfs(E)
[  257.883235] invalid opcode: 0000 [#2] PREEMPT SMP KASAN
[  257.891763] ---[ end trace a40a2f46a1222bc8 ]---

[2]:
[  228.106433] ------------[ cut here ]------------
[  228.106449] kernel BUG at fs/namei.c:2356!
[  228.106596] ------------[ cut here ]------------
[  228.109523] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[  228.110025] kernel BUG at fs/namei.c:2357!
[  228.111226] CPU: 1 PID: 6052 Comm: iou-wrk-6049 Tainted: G        W   E     5.12.0-rc5-cc402ba42485f28e932348e7627433eb1dccef5c #51
[  228.115791] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)/LXD, BIOS 0.0.0 02/06/2015
[  228.118150] RIP: 0010:path_init+0x141c/0x1730
[  228.119523] Code: fb ff ff 48 c7 c7 00 47 c1 91 e8 4f c9 f0 ff e9 b5 ed ff ff 48 c7 c7 40 46 c1 91 e8 3e c9 f0 ff e9 5b ee ff ff e8 b4 78 a8 ff <0f> 0b 48 8b bd 60 ff ff ff e8 26 c9 f0 ff e9 52 f5 ff ff e8 1c c9
[  228.124889] RSP: 0018:ffffc90004d8f6c0 EFLAGS: 00010246
[  228.126702] RAX: 0000000000000000 RBX: 0000000000000010 RCX: ffffffff884cd98c
[  228.129228] RDX: 0000000000000000 RSI: ffff8880007a9d00 RDI: 0000000000000002
[  228.131765] RBP: ffffc90004d8f760 R08: 0000000000000001 R09: ffffed10026e1ba2
[  228.133962] R10: ffff88801370dd0b R11: ffffed10026e1ba1 R12: ffff88801370dd08
[  228.135843] R13: ffff88801370dd00 R14: ffffc90004d8f8e0 R15: 0000000000000001
[  228.137812] FS:  00007fd5db164800(0000) GS:ffff888015a80000(0000) knlGS:0000000000000000
[  228.140069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  228.141894] CR2: 00007fe892801000 CR3: 0000000036041000 CR4: 0000000000350ee0
[  228.144485] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  228.146914] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  228.149326] Call Trace:
[  228.150372]  ? write_comp_data+0x2a/0x90
[  228.151870]  ? percpu_counter_add_batch+0xbc/0x180
[  228.153667]  path_openat+0x192/0x2790
[  228.155038]  ? path_lookupat.isra.0+0x530/0x530
[  228.156573]  ? ret_from_fork+0x1f/0x30
[  228.157757]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  228.159136]  do_filp_open+0x208/0x270
[  228.160252]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  228.161598]  ? may_open_dev+0xf0/0xf0
[  228.162653]  ? do_raw_spin_lock+0x125/0x2e0
[  228.163856]  ? write_comp_data+0x2a/0x90
[  228.165213]  ? __sanitizer_cov_trace_pc+0x1d/0x50
[  228.166782]  ? _raw_spin_unlock+0x29/0x40
[  228.168176]  ? alloc_fd+0x499/0x640
[  228.169402]  io_openat2+0x1d1/0x8f0
[  228.170607]  ? __lock_acquire+0x1847/0x5850
[  228.172381]  ? io_req_complete_post+0xa90/0xa90
[  228.174218]  ? write_comp_data+0x2a/0x90
[  228.175843]  ? io_issue_sqe+0x5ac0/0x5ac0
[  228.177600]  io_issue_sqe+0x2a2/0x5ac0
[  228.179183]  ? io_poll_complete.constprop.0+0x100/0x100
[  228.180821]  ? find_held_lock+0x2d/0x110
[  228.181963]  ? io_worker_handle_work+0x55d/0x13b0
[  228.183154]  ? lock_downgrade+0x690/0x690
[  228.192463]  ? rwlock_bug.part.0+0x90/0x90
[  228.202181]  ? io_issue_sqe+0x5ac0/0x5ac0
[  228.211962]  io_wq_submit_work+0x1da/0x640
[  228.227152]  ? io_issue_sqe+0x5ac0/0x5ac0
[  228.236852]  io_worker_handle_work+0x70a/0x13b0
[  228.246583]  ? rwlock_bug.part.0+0x90/0x90
[  228.256009]  io_wqe_worker+0x2ec/0xd00
[  228.264954]  ? io_worker_handle_work+0x13b0/0x13b0
[  228.274045]  ? ret_from_fork+0x8/0x30
[  228.288365]  ? lock_downgrade+0x690/0x690
[  228.297032]  ? do_raw_spin_lock+0x125/0x2e0
[  228.305457]  ? rwlock_bug.part.0+0x90/0x90
[  228.313717]  ? write_comp_data+0x2a/0x90
[  228.321685]  ? _raw_spin_unlock_irq+0x24/0x50
[  228.329596]  ? io_worker_handle_work+0x13b0/0x13b0
[  228.343015]  ret_from_fork+0x1f/0x30
[  228.351033] Modules linked in: efi_pstore(E) efivarfs(E)
[  228.359699] invalid opcode: 0000 [#2] PREEMPT SMP KASAN
[  228.370439] CPU: 3 PID: 6049 Comm: uring_viro Tainted: G      D W   E     5.12.0-rc5-cc402ba42485f28e932348e7627433eb1dccef5c #51
[  228.394578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)/LXD, BIOS 0.0.0 02/06/2015
[  228.396859] ---[ end trace 5ac6a4c645048000 ]---

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-05 11:44                           ` Christian Brauner
@ 2021-04-05 16:18                             ` Al Viro
  2021-04-05 17:08                               ` Christian Brauner
  0 siblings, 1 reply; 31+ messages in thread
From: Al Viro @ 2021-04-05 16:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Mon, Apr 05, 2021 at 01:44:37PM +0200, Christian Brauner wrote:
> On Sun, Apr 04, 2021 at 08:17:21PM +0000, Al Viro wrote:
> > On Sun, Apr 04, 2021 at 06:50:10PM +0000, Al Viro wrote:
> > 
> > > > Yeah, I have at least namei.o
> > > > 
> > > > https://drive.google.com/file/d/1AvO1St0YltIrA86DXjp1Xg3ojtS9owGh/view?usp=sharing
> > > 
> > > *grumble*
> > > 
> > > Is it reproducible without KASAN?  Would be much easier to follow the produced
> > > asm...
> > 
> > 	Looks like inode_permission(_, NULL, _) from may_lookup(nd).  I.e.
> > nd->inode == NULL.
> 
> Yeah, I already saw that.
> 
> > 
> > 	Mind slapping BUG_ON(!nd->inode) right before may_lookup() call in
> > link_path_walk() and trying to reproduce that oops?
> 
> Yep, no problem. If you run the reproducer in a loop for a little while
> you eventually trigger the BUG_ON() and then you get the following splat
> (and then an endless loop) in [1] with nd->inode NULL.
> 
> _But_ I managed to debug this further and was able to trigger the BUG_ON()
> directly in path_init() in the AT_FDCWD branch (after all its AT_FDCWD(./file0)
> with the patch in [3] (it's in LOOKUP_RCU) the corresponding splat is in [2].
> So the crash happens for a PF_IO_WORKER thread with a NULL nd->inode for the
> PF_IO_WORKER's pwd (The PF_IO_WORKER seems to be in async context.).

So we find current->fs->pwd.dentry negative, with current->fs->seq sampled
equal before and after that?  Lovely...  The only places where we assign
anything to ->pwd.dentry are
void set_fs_pwd(struct fs_struct *fs, const struct path *path)
{
        struct path old_pwd; 

        path_get(path);
        spin_lock(&fs->lock);
        write_seqcount_begin(&fs->seq);
        old_pwd = fs->pwd;
        fs->pwd = *path;
        write_seqcount_end(&fs->seq);
        spin_unlock(&fs->lock);

        if (old_pwd.dentry)
                path_put(&old_pwd);
}
where we have ->seq bumped between dget new/assignment/ dput old,
copy_fs_struct() where we have
                spin_lock(&old->lock);
                fs->root = old->root;
                path_get(&fs->root);
                fs->pwd = old->pwd;
                path_get(&fs->pwd);
                spin_unlock(&old->lock);
fs being freshly allocated instance that couldn't have been observed
by anyone and chroot_fs_refs(), where we have
                        spin_lock(&fs->lock);
                        write_seqcount_begin(&fs->seq);
                        hits += replace_path(&fs->root, old_root, new_root);
                        hits += replace_path(&fs->pwd, old_root, new_root);
                        write_seqcount_end(&fs->seq);
                        while (hits--) {
                                count++;
                                path_get(new_root);
                        }
                        spin_unlock(&fs->lock);
...
static inline int replace_path(struct path *p, const struct path *old, const struct path *new)
{
        if (likely(p->dentry != old->dentry || p->mnt != old->mnt))
                return 0;
        *p = *new;
        return 1;
}
Here we have new_root->dentry pinned from the very beginning,
and assignments are wrapped into bumps of ->seq.  Moreover,
we are holding ->lock through that sequence (as all writers
do), so these references can't be dropped before path_get()
bumps new_root->dentry refcount.

chroot_fs_refs() is called only by pivot_root(2):
        chroot_fs_refs(&root, &new);
and there new is set by
        error = user_path_at(AT_FDCWD, new_root,
                             LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new);
        if (error)
                goto out0;
which pins new.dentry *and* verifies that it's positive and a directory,
at that.  Since pinned positive dentry can't be made negative by anybody
else, we know it will remain in that state until
	path_put(&new);
well downstream of chroot_fs_refs().  In copy_fs_struct() we are
copying someone's ->pwd, so it's also pinned positive.  And it
won't be dropped outside of old->lock, so by the time somebody
manages to drop the reference in old, path_get() effects will be
visible (old->lock serving as a barrier).

That leaves set_fs_pwd() calls:
fs/init.c:54:           set_fs_pwd(current->fs, &path);
	init_chdir(), path set by LOOKUP_DIRECTORY patwalk.  Pinned positive.
fs/namespace.c:4207:    set_fs_pwd(current->fs, &root);
	init_mount_tree(), root.dentry being ->mnt_root of rootfs.  Pinned
positive (and it would've oopsed much earlier had that been it)
fs/namespace.c:4485:    set_fs_pwd(fs, &root);
	mntns_install(), root filled by successful LOOKUP_DOWN for "/"
from mnt_ns->root.  Should be pinned positive.
fs/open.c:501:  set_fs_pwd(current->fs, &path);
	chdir(2), path set by LOOKUP_DIRECTORY pathwalk.  Pinned positive.
fs/open.c:528:          set_fs_pwd(current->fs, &f.file->f_path);
	fchdir(2), file->f_path of any opened file.  Pinned positive.
kernel/usermode_driver.c:130:   set_fs_pwd(current->fs, &umd_info->wd);
	umd_setup(), ->wd.dentry equal to ->wd.mnt->mnt_root, should be pinned positive.
kernel/nsproxy.c:509:           set_fs_pwd(me->fs, &nsset->fs->pwd);
	commit_nsset().  Let's see what's going on there...

        if ((flags & CLONE_NEWNS) && (flags & ~CLONE_NEWNS)) {
                set_fs_root(me->fs, &nsset->fs->root);
                set_fs_pwd(me->fs, &nsset->fs->pwd);
        }
In those conditions nsset.fs has come from copy_fs_struct() done in
prepare_nsset().  And the only thing that might've been done to it
would be those set_fs_pwd() in mntns_install() (I'm not fond of the
entire nsset->fs thing - looks like papering over bad calling
conventions, but anyway)

Now, I might've missed some insanity (direct assignments to ->pwd.dentry,
etc. - wouldn't be the first time io_uring folks went "layering? wassat?
we'll just poke in whatever we can reach"), but I don't see anything
obvious of that sort in the area...

OK, how about this: in path_init(), right after
                        do {
                                seq = read_seqcount_begin(&fs->seq);
                                nd->path = fs->pwd;
                                nd->inode = nd->path.dentry->d_inode;
                                nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
                        } while (read_seqcount_retry(&fs->seq, seq));
slap
			if (!nd->inode) {
				// should never, ever happen
				struct dentry *fucked = nd->path.dentry;
				printk(KERN_ERR "%pd4 %d %x %p %d %d", fucked, d_count(fucked),
							fucked->d_flags, fs, fs->users, seq);
				BUG_ON(1);
				return ERR_PTR(-EINVAL);
			}
and see what it catches?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-05 16:18                             ` Al Viro
@ 2021-04-05 17:08                               ` Christian Brauner
  2021-04-05 18:23                                 ` Al Viro
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-04-05 17:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Mon, Apr 05, 2021 at 04:18:58PM +0000, Al Viro wrote:
> On Mon, Apr 05, 2021 at 01:44:37PM +0200, Christian Brauner wrote:
> > On Sun, Apr 04, 2021 at 08:17:21PM +0000, Al Viro wrote:
> > > On Sun, Apr 04, 2021 at 06:50:10PM +0000, Al Viro wrote:
> > > 
> > > > > Yeah, I have at least namei.o
> > > > > 
> > > > > https://drive.google.com/file/d/1AvO1St0YltIrA86DXjp1Xg3ojtS9owGh/view?usp=sharing
> > > > 
> > > > *grumble*
> > > > 
> > > > Is it reproducible without KASAN?  Would be much easier to follow the produced
> > > > asm...
> > > 
> > > 	Looks like inode_permission(_, NULL, _) from may_lookup(nd).  I.e.
> > > nd->inode == NULL.
> > 
> > Yeah, I already saw that.
> > 
> > > 
> > > 	Mind slapping BUG_ON(!nd->inode) right before may_lookup() call in
> > > link_path_walk() and trying to reproduce that oops?
> > 
> > Yep, no problem. If you run the reproducer in a loop for a little while
> > you eventually trigger the BUG_ON() and then you get the following splat
> > (and then an endless loop) in [1] with nd->inode NULL.
> > 
> > _But_ I managed to debug this further and was able to trigger the BUG_ON()
> > directly in path_init() in the AT_FDCWD branch (after all its AT_FDCWD(./file0)
> > with the patch in [3] (it's in LOOKUP_RCU) the corresponding splat is in [2].
> > So the crash happens for a PF_IO_WORKER thread with a NULL nd->inode for the
> > PF_IO_WORKER's pwd (The PF_IO_WORKER seems to be in async context.).
> 
> So we find current->fs->pwd.dentry negative, with current->fs->seq sampled
> equal before and after that?  Lovely...  The only places where we assign
> anything to ->pwd.dentry are
> void set_fs_pwd(struct fs_struct *fs, const struct path *path)
> {
>         struct path old_pwd; 
> 
>         path_get(path);
>         spin_lock(&fs->lock);
>         write_seqcount_begin(&fs->seq);
>         old_pwd = fs->pwd;
>         fs->pwd = *path;
>         write_seqcount_end(&fs->seq);
>         spin_unlock(&fs->lock);
> 
>         if (old_pwd.dentry)
>                 path_put(&old_pwd);
> }
> where we have ->seq bumped between dget new/assignment/ dput old,
> copy_fs_struct() where we have
>                 spin_lock(&old->lock);
>                 fs->root = old->root;
>                 path_get(&fs->root);
>                 fs->pwd = old->pwd;
>                 path_get(&fs->pwd);
>                 spin_unlock(&old->lock);
> fs being freshly allocated instance that couldn't have been observed
> by anyone and chroot_fs_refs(), where we have
>                         spin_lock(&fs->lock);
>                         write_seqcount_begin(&fs->seq);
>                         hits += replace_path(&fs->root, old_root, new_root);
>                         hits += replace_path(&fs->pwd, old_root, new_root);
>                         write_seqcount_end(&fs->seq);
>                         while (hits--) {
>                                 count++;
>                                 path_get(new_root);
>                         }
>                         spin_unlock(&fs->lock);
> ...
> static inline int replace_path(struct path *p, const struct path *old, const struct path *new)
> {
>         if (likely(p->dentry != old->dentry || p->mnt != old->mnt))
>                 return 0;
>         *p = *new;
>         return 1;
> }
> Here we have new_root->dentry pinned from the very beginning,
> and assignments are wrapped into bumps of ->seq.  Moreover,
> we are holding ->lock through that sequence (as all writers
> do), so these references can't be dropped before path_get()
> bumps new_root->dentry refcount.
> 
> chroot_fs_refs() is called only by pivot_root(2):
>         chroot_fs_refs(&root, &new);
> and there new is set by
>         error = user_path_at(AT_FDCWD, new_root,
>                              LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new);
>         if (error)
>                 goto out0;
> which pins new.dentry *and* verifies that it's positive and a directory,
> at that.  Since pinned positive dentry can't be made negative by anybody
> else, we know it will remain in that state until
> 	path_put(&new);
> well downstream of chroot_fs_refs().  In copy_fs_struct() we are
> copying someone's ->pwd, so it's also pinned positive.  And it
> won't be dropped outside of old->lock, so by the time somebody
> manages to drop the reference in old, path_get() effects will be
> visible (old->lock serving as a barrier).
> 
> That leaves set_fs_pwd() calls:
> fs/init.c:54:           set_fs_pwd(current->fs, &path);
> 	init_chdir(), path set by LOOKUP_DIRECTORY patwalk.  Pinned positive.
> fs/namespace.c:4207:    set_fs_pwd(current->fs, &root);
> 	init_mount_tree(), root.dentry being ->mnt_root of rootfs.  Pinned
> positive (and it would've oopsed much earlier had that been it)
> fs/namespace.c:4485:    set_fs_pwd(fs, &root);
> 	mntns_install(), root filled by successful LOOKUP_DOWN for "/"
> from mnt_ns->root.  Should be pinned positive.
> fs/open.c:501:  set_fs_pwd(current->fs, &path);
> 	chdir(2), path set by LOOKUP_DIRECTORY pathwalk.  Pinned positive.
> fs/open.c:528:          set_fs_pwd(current->fs, &f.file->f_path);
> 	fchdir(2), file->f_path of any opened file.  Pinned positive.
> kernel/usermode_driver.c:130:   set_fs_pwd(current->fs, &umd_info->wd);
> 	umd_setup(), ->wd.dentry equal to ->wd.mnt->mnt_root, should be pinned positive.
> kernel/nsproxy.c:509:           set_fs_pwd(me->fs, &nsset->fs->pwd);
> 	commit_nsset().  Let's see what's going on there...
> 
>         if ((flags & CLONE_NEWNS) && (flags & ~CLONE_NEWNS)) {
>                 set_fs_root(me->fs, &nsset->fs->root);
>                 set_fs_pwd(me->fs, &nsset->fs->pwd);
>         }
> In those conditions nsset.fs has come from copy_fs_struct() done in
> prepare_nsset().  And the only thing that might've been done to it
> would be those set_fs_pwd() in mntns_install() (I'm not fond of the
> entire nsset->fs thing - looks like papering over bad calling
> conventions, but anyway)
> 
> Now, I might've missed some insanity (direct assignments to ->pwd.dentry,
> etc. - wouldn't be the first time io_uring folks went "layering? wassat?
> we'll just poke in whatever we can reach"), but I don't see anything
> obvious of that sort in the area...
> 
> OK, how about this: in path_init(), right after
>                         do {
>                                 seq = read_seqcount_begin(&fs->seq);
>                                 nd->path = fs->pwd;
>                                 nd->inode = nd->path.dentry->d_inode;
>                                 nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
>                         } while (read_seqcount_retry(&fs->seq, seq));
> slap
> 			if (!nd->inode) {
> 				// should never, ever happen
> 				struct dentry *fucked = nd->path.dentry;
> 				printk(KERN_ERR "%pd4 %d %x %p %d %d", fucked, d_count(fucked),
> 							fucked->d_flags, fs, fs->users, seq);
> 				BUG_ON(1);
> 				return ERR_PTR(-EINVAL);
> 			}
> and see what it catches?

Ah dentry count of -127 looks... odd.


[  246.102077] /newroot/foo -127 18008 ffff888012819000 6 0
[  246.102240] ------------[ cut here ]------------
[  246.102264] /newroot/foo -127 18008 ffff888012819000 6 6
[  246.104163] ------------[ cut here ]------------
[  246.104943] kernel BUG at fs/namei.c:2359!
[  246.106342] kernel BUG at fs/namei.c:2359!
[  246.106385] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[  246.110540] CPU: 0 PID: 6345 Comm: uring_viro Tainted: G        W   E     5.12.0-rc5-1ebc00aa82b08217d1fc4eef5435f8499783194c #53
[  246.113725] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)/LXD, BIOS 0.0.0 02/06/2015
[  246.116115] RIP: 0010:path_init.cold+0xbb/0xea
[  246.117711] Code: d0 7c 04 84 d2 75 4b 4c 8b 45 98 41 89 d9 44 89 f9 4c 89 e6 41 8b 94 24 d0 00 00 00 48 c7 c7 20 93 1a b1 41 55 e8 1c 70 fe ff <0f> 0b 48 8b 7d b8 e8 1f ee e6 f8 e9 55 ff ff ff 48 8b 7d 98 e8 01
[  246.124372] RSP: 0018:ffffc900073275f0 EFLAGS: 00010282
[  246.126466] RAX: 000000000000002c RBX: 0000000000000006 RCX: 0000000000000000
[  246.129685] RDX: 0000000000000000 RSI: ffff88801354d700 RDI: fffff52000e64eb0
[  246.131400] RBP: ffffc900073276a0 R08: 000000000000002c R09: ffffed1002b46045
[  246.133241] R10: ffff888015a30227 R11: ffffed1002b46044 R12: ffff8880303eb028
[  246.135124] R13: 0000000000000006 R14: ffffc90007327820 R15: 0000000000018008
[  246.136931] FS:  00007f8695724800(0000) GS:ffff888015a00000(0000) knlGS:0000000000000000
[  246.139247] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  246.141604] CR2: 000055fdf3c11008 CR3: 000000002e9dd000 CR4: 0000000000350ef0
[  246.143437] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  246.145337] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  246.147114] Call Trace:
[  246.147910]  ? write_comp_data+0x2a/0x90
[  246.149010]  path_openat+0x192/0x2790
[  246.150062]  ? path_lookupat.isra.0+0x530/0x530
[  246.151295]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  246.152561]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  246.154071]  do_filp_open+0x197/0x270
[  246.155157]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  246.156392]  ? may_open_dev+0xf0/0xf0
[  246.157969]  ? do_raw_spin_lock+0x125/0x2e0
[  246.159167]  ? write_comp_data+0x2a/0x90
[  246.160319]  ? __sanitizer_cov_trace_pc+0x1d/0x50
[  246.161649]  ? _raw_spin_unlock+0x29/0x40
[  246.162823]  ? alloc_fd+0x499/0x640
[  246.164092]  io_openat2+0x1d1/0x8f0
[  246.165403]  ? io_req_complete_post+0xa90/0xa90
[  246.166974]  ? __lock_acquire+0x1847/0x5850
[  246.168455]  ? write_comp_data+0x2a/0x90
[  246.169877]  io_issue_sqe+0x2a2/0x5ac0
[  246.171226]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  246.173079]  ? io_poll_complete.constprop.0+0x100/0x100
[  246.174960]  ? rcu_read_lock_sched_held+0xa1/0xd0
[  246.176468]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  246.187303]  ? find_held_lock+0x2d/0x110
[  246.197951]  ? __might_fault+0xd8/0x180
[  246.208458]  __io_queue_sqe+0x19f/0xcf0
[  246.218694]  ? __check_object_size+0x1b4/0x4e0
[  246.228802]  ? __ia32_sys_io_uring_setup+0x70/0x70
[  246.239099]  ? write_comp_data+0x2a/0x90
[  246.249152]  io_queue_sqe+0x612/0xb70
[  246.258967]  io_submit_sqes+0x517d/0x6650
[  246.268445]  ? __x64_sys_io_uring_enter+0xb15/0xdd0
[  246.282682]  __x64_sys_io_uring_enter+0xb15/0xdd0
[  246.292110]  ? __ia32_sys_io_uring_enter+0xdd0/0xdd0
[  246.301285]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  246.310116]  ? syscall_enter_from_user_mode+0x27/0x70
[  246.318845]  do_syscall_64+0x2d/0x70
[  246.327328]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  246.336188] RIP: 0033:0x7f869583a67d
[  246.344145] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bb f7 0c 00 f7 d8 64 89 01 48
[  246.367649] RSP: 002b:000055fdf2a89e98 EFLAGS: 00000212 ORIG_RAX: 00000000000001aa
[  246.377100] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f869583a67d
[  246.386119] RDX: 0000000000000000 RSI: 00000000000045f5 RDI: 0000000000000003
[  246.395095] RBP: 000055fdf2a89f70 R08: 0000000000000000 R09: 0000000000000000
[  246.403925] R10: 0000000000000000 R11: 0000000000000212 R12: 000055fdf295c640
[  246.412977] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  246.425046] Modules linked in: efi_pstore(E) efivarfs(E)
[  246.434681] invalid opcode: 0000 [#2] PREEMPT SMP KASAN
[  246.435099] ---[ end trace b331351bc5a092fa ]---

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-05 17:08                               ` Christian Brauner
@ 2021-04-05 18:23                                 ` Al Viro
  2021-04-05 18:28                                   ` Al Viro
  0 siblings, 1 reply; 31+ messages in thread
From: Al Viro @ 2021-04-05 18:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Mon, Apr 05, 2021 at 07:08:01PM +0200, Christian Brauner wrote:

> Ah dentry count of -127 looks... odd.

dead + 1...

void lockref_mark_dead(struct lockref *lockref)
{
        assert_spin_locked(&lockref->lock);
	lockref->count = -128;
}

IOW, a leaked (uncounted) reference to dentry, that got dget() called on
it after dentry had been freed.

	IOW, current->fs->pwd.dentry happens to point to an already freed
struct dentry here.  Joy...

	Could you slap

spin_lock(&current->fs->lock);
WARN_ON(d_count(current->fs->pwd.dentry) < 0);
spin_unlock(&current->fs->lock);

before and after calls of io_issue_sqe() and see if it triggers?  We definitely
are seeing buggered dentry refcounting here.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-05 18:23                                 ` Al Viro
@ 2021-04-05 18:28                                   ` Al Viro
  2021-04-05 20:07                                     ` Christian Brauner
  0 siblings, 1 reply; 31+ messages in thread
From: Al Viro @ 2021-04-05 18:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Mon, Apr 05, 2021 at 06:23:49PM +0000, Al Viro wrote:
> On Mon, Apr 05, 2021 at 07:08:01PM +0200, Christian Brauner wrote:
> 
> > Ah dentry count of -127 looks... odd.
> 
> dead + 1...
> 
> void lockref_mark_dead(struct lockref *lockref)
> {
>         assert_spin_locked(&lockref->lock);
> 	lockref->count = -128;
> }
> 
> IOW, a leaked (uncounted) reference to dentry, that got dget() called on
> it after dentry had been freed.
> 
> 	IOW, current->fs->pwd.dentry happens to point to an already freed
> struct dentry here.  Joy...
> 
> 	Could you slap
> 
> spin_lock(&current->fs->lock);
> WARN_ON(d_count(current->fs->pwd.dentry) < 0);
> spin_unlock(&current->fs->lock);
> 
> before and after calls of io_issue_sqe() and see if it triggers?  We definitely
> are seeing buggered dentry refcounting here.

Check if this helps, please.

diff --git a/fs/namei.c b/fs/namei.c
index 216f16e74351..82344f1139ff 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2289,6 +2289,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	int error;
 	const char *s = nd->name->name;
 
+	nd->path.mnt = NULL;
+	nd->path.dentry = NULL;
+
 	/* LOOKUP_CACHED requires RCU, ask caller to retry */
 	if ((flags & (LOOKUP_RCU | LOOKUP_CACHED)) == LOOKUP_CACHED)
 		return ERR_PTR(-EAGAIN);
@@ -2322,8 +2325,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	}
 
 	nd->root.mnt = NULL;
-	nd->path.mnt = NULL;
-	nd->path.dentry = NULL;
 
 	/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
 	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-05 18:28                                   ` Al Viro
@ 2021-04-05 20:07                                     ` Christian Brauner
  2021-04-06  1:38                                       ` Al Viro
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-04-05 20:07 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Mon, Apr 05, 2021 at 06:28:54PM +0000, Al Viro wrote:
> On Mon, Apr 05, 2021 at 06:23:49PM +0000, Al Viro wrote:
> > On Mon, Apr 05, 2021 at 07:08:01PM +0200, Christian Brauner wrote:
> > 
> > > Ah dentry count of -127 looks... odd.
> > 
> > dead + 1...
> > 
> > void lockref_mark_dead(struct lockref *lockref)
> > {
> >         assert_spin_locked(&lockref->lock);
> > 	lockref->count = -128;
> > }
> > 
> > IOW, a leaked (uncounted) reference to dentry, that got dget() called on
> > it after dentry had been freed.
> > 
> > 	IOW, current->fs->pwd.dentry happens to point to an already freed
> > struct dentry here.  Joy...
> > 
> > 	Could you slap
> > 
> > spin_lock(&current->fs->lock);
> > WARN_ON(d_count(current->fs->pwd.dentry) < 0);
> > spin_unlock(&current->fs->lock);
> > 
> > before and after calls of io_issue_sqe() and see if it triggers?  We definitely
> > are seeing buggered dentry refcounting here.
> 
> Check if this helps, please.
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 216f16e74351..82344f1139ff 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2289,6 +2289,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>  	int error;
>  	const char *s = nd->name->name;
>  
> +	nd->path.mnt = NULL;
> +	nd->path.dentry = NULL;
> +
>  	/* LOOKUP_CACHED requires RCU, ask caller to retry */
>  	if ((flags & (LOOKUP_RCU | LOOKUP_CACHED)) == LOOKUP_CACHED)
>  		return ERR_PTR(-EAGAIN);
> @@ -2322,8 +2325,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>  	}
>  
>  	nd->root.mnt = NULL;
> -	nd->path.mnt = NULL;
> -	nd->path.dentry = NULL;
>  
>  	/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
>  	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {

Bingo. That fixes it.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-05 20:07                                     ` Christian Brauner
@ 2021-04-06  1:38                                       ` Al Viro
  2021-04-06  2:24                                         ` Al Viro
  2021-04-06 12:35                                         ` Christian Brauner
  0 siblings, 2 replies; 31+ messages in thread
From: Al Viro @ 2021-04-06  1:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Mon, Apr 05, 2021 at 10:07:37PM +0200, Christian Brauner wrote:

> > diff --git a/fs/namei.c b/fs/namei.c
> > index 216f16e74351..82344f1139ff 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2289,6 +2289,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> >  	int error;
> >  	const char *s = nd->name->name;
> >  
> > +	nd->path.mnt = NULL;
> > +	nd->path.dentry = NULL;
> > +
> >  	/* LOOKUP_CACHED requires RCU, ask caller to retry */
> >  	if ((flags & (LOOKUP_RCU | LOOKUP_CACHED)) == LOOKUP_CACHED)
> >  		return ERR_PTR(-EAGAIN);
> > @@ -2322,8 +2325,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> >  	}
> >  
> >  	nd->root.mnt = NULL;
> > -	nd->path.mnt = NULL;
> > -	nd->path.dentry = NULL;
> >  
> >  	/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
> >  	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
> 
> Bingo. That fixes it.

*grumble*

OK, I suppose it'll do for backports, but longer term... I don't like how
convoluted the rules for nameidata fields' validity are.  In particular,
for nd->path I would rather have it
	* cleared in set_nameidata()
	* cleared when it become invalid.  That would be
		* places that drop rcu_read_lock() without having legitimized the sucker
		  (already done, except for terminate_walk())
		* terminate_walk() in non-RCU case after path_put(&nd->path)

OTOH... wait a sec - the whole thing is this cycle regression, so...

Could you verify that the variant below fixes that crap?

Make sure nd->path.mnt and nd->path.dentry are always valid pointers

Initialize them in set_nameidata() and make sure that terminate_walk() clears them
once the pointers become potentially invalid (i.e. we leave RCU mode or drop them
in non-RCU one).  Currently we have "path_init() always initializes them and nobody
accesses them outside of path_init()/terminate_walk() segments", which is asking
for trouble.

With that change we would have nd->path.{mnt,dentry}
	1) always valid - NULL or pointing to currently allocated objects.
	2) non-NULL while we are successfully walking
	3) NULL when we are not walking at all
	4) contributing to refcounts whenever non-NULL outside of RCU mode.

Hopefully-fixes: 6c6ec2b0a3e0 ("fs: add support for LOOKUP_CACHED")
Signed-off-by: Al Viro <[email protected]>
---
diff --git a/fs/namei.c b/fs/namei.c
index 216f16e74351..fc8760d4314e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -579,6 +579,8 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
 	p->stack = p->internal;
 	p->dfd = dfd;
 	p->name = name;
+	p->path.mnt = NULL;
+	p->path.dentry = NULL;
 	p->total_link_count = old ? old->total_link_count : 0;
 	p->saved = old;
 	current->nameidata = p;
@@ -652,6 +654,8 @@ static void terminate_walk(struct nameidata *nd)
 		rcu_read_unlock();
 	}
 	nd->depth = 0;
+	nd->path.mnt = NULL;
+	nd->path.dentry = NULL;
 }
 
 /* path_put is needed afterwards regardless of success or failure */
@@ -2322,8 +2326,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	}
 
 	nd->root.mnt = NULL;
-	nd->path.mnt = NULL;
-	nd->path.dentry = NULL;
 
 	/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
 	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-06  1:38                                       ` Al Viro
@ 2021-04-06  2:24                                         ` Al Viro
  2021-04-06 12:35                                         ` Christian Brauner
  1 sibling, 0 replies; 31+ messages in thread
From: Al Viro @ 2021-04-06  2:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Tue, Apr 06, 2021 at 01:38:39AM +0000, Al Viro wrote:
> On Mon, Apr 05, 2021 at 10:07:37PM +0200, Christian Brauner wrote:
> 
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 216f16e74351..82344f1139ff 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -2289,6 +2289,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > >  	int error;
> > >  	const char *s = nd->name->name;
> > >  
> > > +	nd->path.mnt = NULL;
> > > +	nd->path.dentry = NULL;
> > > +
> > >  	/* LOOKUP_CACHED requires RCU, ask caller to retry */
> > >  	if ((flags & (LOOKUP_RCU | LOOKUP_CACHED)) == LOOKUP_CACHED)
> > >  		return ERR_PTR(-EAGAIN);
> > > @@ -2322,8 +2325,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > >  	}
> > >  
> > >  	nd->root.mnt = NULL;
> > > -	nd->path.mnt = NULL;
> > > -	nd->path.dentry = NULL;
> > >  
> > >  	/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
> > >  	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
> > 
> > Bingo. That fixes it.
> 
> *grumble*
> 
> OK, I suppose it'll do for backports, but longer term... I don't like how
> convoluted the rules for nameidata fields' validity are.  In particular,
> for nd->path I would rather have it
> 	* cleared in set_nameidata()
> 	* cleared when it become invalid.  That would be
> 		* places that drop rcu_read_lock() without having legitimized the sucker
> 		  (already done, except for terminate_walk())
> 		* terminate_walk() in non-RCU case after path_put(&nd->path)
> 
> OTOH... wait a sec - the whole thing is this cycle regression, so...

BTW, there's another piece of brittleness that almost accidentally doesn't
end up biting us - nd->depth should be cleared in set_nameidata(), not
in path_init().  In this case we are saved by the fact that the only
really early failure in path_init() can't happen on the first call,
so if we do leave the sucker before zeroing nd->depth, we are saved by
the fact that terminate_walk() has just been called and it *does*
clear nd->depth.  It's obviously cleaner to have it initialized from
the very beginning and not bother with it in path_init() at all.
Separate patch, though - this is not, strictly speaking, a bug.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-06  1:38                                       ` Al Viro
  2021-04-06  2:24                                         ` Al Viro
@ 2021-04-06 12:35                                         ` Christian Brauner
  2021-04-06 13:13                                           ` Al Viro
  1 sibling, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-04-06 12:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Tue, Apr 06, 2021 at 01:38:39AM +0000, Al Viro wrote:
> On Mon, Apr 05, 2021 at 10:07:37PM +0200, Christian Brauner wrote:
> 
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 216f16e74351..82344f1139ff 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -2289,6 +2289,9 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > >  	int error;
> > >  	const char *s = nd->name->name;
> > >  
> > > +	nd->path.mnt = NULL;
> > > +	nd->path.dentry = NULL;
> > > +
> > >  	/* LOOKUP_CACHED requires RCU, ask caller to retry */
> > >  	if ((flags & (LOOKUP_RCU | LOOKUP_CACHED)) == LOOKUP_CACHED)
> > >  		return ERR_PTR(-EAGAIN);
> > > @@ -2322,8 +2325,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
> > >  	}
> > >  
> > >  	nd->root.mnt = NULL;
> > > -	nd->path.mnt = NULL;
> > > -	nd->path.dentry = NULL;
> > >  
> > >  	/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
> > >  	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
> > 
> > Bingo. That fixes it.
> 
> *grumble*
> 
> OK, I suppose it'll do for backports, but longer term... I don't like how
> convoluted the rules for nameidata fields' validity are.  In particular,
> for nd->path I would rather have it
> 	* cleared in set_nameidata()
> 	* cleared when it become invalid.  That would be
> 		* places that drop rcu_read_lock() without having legitimized the sucker
> 		  (already done, except for terminate_walk())
> 		* terminate_walk() in non-RCU case after path_put(&nd->path)
> 
> OTOH... wait a sec - the whole thing is this cycle regression, so...
> 
> Could you verify that the variant below fixes that crap?
> 
> Make sure nd->path.mnt and nd->path.dentry are always valid pointers
> 
> Initialize them in set_nameidata() and make sure that terminate_walk() clears them
> once the pointers become potentially invalid (i.e. we leave RCU mode or drop them
> in non-RCU one).  Currently we have "path_init() always initializes them and nobody
> accesses them outside of path_init()/terminate_walk() segments", which is asking
> for trouble.
> 
> With that change we would have nd->path.{mnt,dentry}
> 	1) always valid - NULL or pointing to currently allocated objects.
> 	2) non-NULL while we are successfully walking
> 	3) NULL when we are not walking at all
> 	4) contributing to refcounts whenever non-NULL outside of RCU mode.
> 
> Hopefully-fixes: 6c6ec2b0a3e0 ("fs: add support for LOOKUP_CACHED")
> Signed-off-by: Al Viro <[email protected]>
> ---
> diff --git a/fs/namei.c b/fs/namei.c
> index 216f16e74351..fc8760d4314e 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -579,6 +579,8 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
>  	p->stack = p->internal;
>  	p->dfd = dfd;
>  	p->name = name;
> +	p->path.mnt = NULL;
> +	p->path.dentry = NULL;
>  	p->total_link_count = old ? old->total_link_count : 0;
>  	p->saved = old;
>  	current->nameidata = p;
> @@ -652,6 +654,8 @@ static void terminate_walk(struct nameidata *nd)
>  		rcu_read_unlock();
>  	}
>  	nd->depth = 0;
> +	nd->path.mnt = NULL;
> +	nd->path.dentry = NULL;
>  }
>  
>  /* path_put is needed afterwards regardless of success or failure */
> @@ -2322,8 +2326,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
>  	}
>  
>  	nd->root.mnt = NULL;
> -	nd->path.mnt = NULL;
> -	nd->path.dentry = NULL;
>  
>  	/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
>  	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {

Yeah, that works too.
Though I'm not a fan of this open-coding as it looks pretty brittle
especially if we ever introduce another LOOKUP_* flag that requires us
to clear some other field carefully. I think a tiny static inline void
helper might make it easier to grok what's going on.

And while we're at it might I bring up the possibility of an additional
cleanup of how we currently call path_init().
Right now we pass the return value from path_init() directly into e.g.
link_path_walk() which as a first thing checks for error. Which feels
rather wrong and has always confused me when looking at these codepaths.
I get that it might make sense for reasons unrelated to path_init() that
link_path_walk() checks its first argument for error but path_init()
should be checked for error right away especially now that we return
early when LOOKUP_CACHED is set without LOOKUP_RCU. I have two/three
additional patches on top of the one below that massage path_lookupat(),
path_openat() to get rid of such - imho - hard to parse while
constructs:

		while (!(error = link_path_walk(s, nd)) &&
		       (s = <foo>(nd, file, op)) != NULL)
			;

in favor of a for (;;) like we do in link_path_walk() itself already
with early exits to a cleanup label. I'm not too fond of the

if (!error)
	error = foo();
if (!error)
	error = bar();

terminate_walk();
return error;

thing especially in longer functions such as path_lookupat() where it
gets convoluted pretty quickly. I think it would be cleaner to have
something like [1]. The early exists make the code easier to reason
about imho. But I get that that's a style discussion.  If you want to
see the whole thing and not just the patch below I'm happy to send it
though. Here's the tweaked fix for the immediate syzbot issue only:

From d6d669aaaba64debb9c433e281109b5494ad9bc5 Mon Sep 17 00:00:00 2001
From: Al Viro <[email protected]>
Date: Tue, 6 Apr 2021 13:20:26 +0200
Subject: [PATCH 1/3] namei: Ensure nd->path.{dentry,mnt} are always valid
 pointers

Initialize them in set_nameidata() and make sure that terminate_walk()
clears them once the pointers become potentially invalid (i.e.
we leave RCU mode or drop them in non-RCU one).  Currently we have
"path_init() always initializes them and nobody accesses them outside of
path_init()/terminate_walk() segments", which is asking for trouble.

With that change we would have nd->path.{mnt,dentry}
1) always valid - NULL or pointing to currently allocated objects.
2) non-NULL while we are successfully walking
3) NULL when we are not walking at all
4) contributing to refcounts whenever non-NULL outside of RCU mode.

Fixes: 6c6ec2b0a3e0 ("fs: add support for LOOKUP_CACHED")
Reported-by: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
 fs/namei.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 216f16e74351..618055419da3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -573,9 +573,18 @@ struct nameidata {
 	umode_t		dir_mode;
 } __randomize_layout;
 
+static inline void clear_nameidata(struct nameidata *nd)
+{
+	nd->depth = 0;
+	nd->path.mnt = NULL;
+	nd->path.dentry = NULL;
+}
+
 static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
 {
 	struct nameidata *old = current->nameidata;
+
+	clear_nameidata(p);
 	p->stack = p->internal;
 	p->dfd = dfd;
 	p->name = name;
@@ -651,7 +660,7 @@ static void terminate_walk(struct nameidata *nd)
 		nd->flags &= ~LOOKUP_RCU;
 		rcu_read_unlock();
 	}
-	nd->depth = 0;
+	clear_nameidata(nd);
 }
 
 /* path_put is needed afterwards regardless of success or failure */
@@ -2322,8 +2331,6 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	}
 
 	nd->root.mnt = NULL;
-	nd->path.mnt = NULL;
-	nd->path.dentry = NULL;
 
 	/* Absolute pathname -- fetch the root (LOOKUP_IN_ROOT uses nd->dfd). */
 	if (*s == '/' && !(flags & LOOKUP_IN_ROOT)) {
-- 
2.27.0



[1]:
diff --git a/fs/namei.c b/fs/namei.c
index ae1b5cd7b1a9..572a0d6e22cc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2424,33 +2424,49 @@ static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path
        int err;

        s = path_init(nd, flags);
-       if (IS_ERR(s))
-               return PTR_ERR(s);
+       if (IS_ERR(s)) {
+               err = PTR_ERR(s);
+               goto out_terminate;
+       }

        if (unlikely(flags & LOOKUP_DOWN)) {
                err = handle_lookup_down(nd);
                if (unlikely(err < 0))
-                       s = ERR_PTR(err);
+                       goto out_terminate;
        }

-       while (!(err = link_path_walk(s, nd)) &&
-              (s = lookup_last(nd)) != NULL)
-               ;
-       if (!err)
-               err = complete_walk(nd);
+       for (;;) {
+               err = link_path_walk(s, nd);
+               if (err)
+                       goto out_terminate;

-       if (!err && nd->flags & LOOKUP_DIRECTORY)
-               if (!d_can_lookup(nd->path.dentry))
-                       err = -ENOTDIR;
-       if (!err && unlikely(nd->flags & LOOKUP_MOUNTPOINT)) {
+               s = lookup_last(nd);
+               if (!s)
+                       break;
+       }
+
+       err = complete_walk(nd);
+       if (err)
+               goto out_terminate;
+
+       if ((nd->flags & LOOKUP_DIRECTORY) &&
+           (!d_can_lookup(nd->path.dentry))) {
+               err = -ENOTDIR;
+               goto out_terminate;
+       }
+
+       if (unlikely(nd->flags & LOOKUP_MOUNTPOINT)) {
                err = handle_lookup_down(nd);
                nd->flags &= ~LOOKUP_JUMPED; // no d_weak_revalidate(), please...
+               if (err)
+                       goto out_terminate;
        }
-       if (!err) {
-               *path = nd->path;
-               nd->path.mnt = NULL;
-               nd->path.dentry = NULL;
-       }
+
+       *path = nd->path;
+       nd->path.mnt = NULL;
+       nd->path.dentry = NULL;
+
+out_terminate:
        terminate_walk(nd);
        return err;
 }

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-06 12:35                                         ` Christian Brauner
@ 2021-04-06 13:13                                           ` Al Viro
  2021-04-06 13:22                                             ` Christian Brauner
  0 siblings, 1 reply; 31+ messages in thread
From: Al Viro @ 2021-04-06 13:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Tue, Apr 06, 2021 at 02:35:05PM +0200, Christian Brauner wrote:

> And while we're at it might I bring up the possibility of an additional
> cleanup of how we currently call path_init().
> Right now we pass the return value from path_init() directly into e.g.
> link_path_walk() which as a first thing checks for error. Which feels
> rather wrong and has always confused me when looking at these codepaths.

Why?

> I get that it might make sense for reasons unrelated to path_init() that
> link_path_walk() checks its first argument for error but path_init()
> should be checked for error right away especially now that we return
> early when LOOKUP_CACHED is set without LOOKUP_RCU.

But you are making the _callers_ of path_init() do that, for no good
reason.

> thing especially in longer functions such as path_lookupat() where it
> gets convoluted pretty quickly. I think it would be cleaner to have
> something like [1]. The early exists make the code easier to reason
> about imho. But I get that that's a style discussion.

Your variant is a lot more brittle, actually.

> @@ -2424,33 +2424,49 @@ static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path
>         int err;
> 
>         s = path_init(nd, flags);
> -       if (IS_ERR(s))
> -               return PTR_ERR(s);

Where has that come from, BTW?  Currently path_lookupat() does no such thing.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-06 13:13                                           ` Al Viro
@ 2021-04-06 13:22                                             ` Christian Brauner
  2021-04-06 14:15                                               ` Al Viro
  0 siblings, 1 reply; 31+ messages in thread
From: Christian Brauner @ 2021-04-06 13:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Tue, Apr 06, 2021 at 01:13:13PM +0000, Al Viro wrote:
> On Tue, Apr 06, 2021 at 02:35:05PM +0200, Christian Brauner wrote:
> 
> > And while we're at it might I bring up the possibility of an additional
> > cleanup of how we currently call path_init().
> > Right now we pass the return value from path_init() directly into e.g.
> > link_path_walk() which as a first thing checks for error. Which feels
> > rather wrong and has always confused me when looking at these codepaths.
> 
> Why?

Why is a another function in charge of checking the return value of an
initialization function. If something like path_init() fails why is the
next caller responsible for rejecting it's return value and then we're
passing that failure value through the whole function with if (!err)
ladders but as I said it's mostly style preferences.

> 
> > I get that it might make sense for reasons unrelated to path_init() that
> > link_path_walk() checks its first argument for error but path_init()
> > should be checked for error right away especially now that we return
> > early when LOOKUP_CACHED is set without LOOKUP_RCU.
> 
> But you are making the _callers_ of path_init() do that, for no good
> reason.

I'm confused why having callers of functions responsible for checking
error values is such an out-of-band concept suddenly. I don't think it's
worth arguing over this though.

> 
> > thing especially in longer functions such as path_lookupat() where it
> > gets convoluted pretty quickly. I think it would be cleaner to have
> > something like [1]. The early exists make the code easier to reason
> > about imho. But I get that that's a style discussion.
> 
> Your variant is a lot more brittle, actually.
> 
> > @@ -2424,33 +2424,49 @@ static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path
> >         int err;
> > 
> >         s = path_init(nd, flags);
> > -       if (IS_ERR(s))
> > -               return PTR_ERR(s);
> 
> Where has that come from, BTW?  Currently path_lookupat() does no such thing.

Hm? Are you maybe overlooking path_init() which assigns straight into
the variable declaration? Or are you referring to sm else?

static int path_lookupat(struct nameidata *nd, unsigned flags, struct path *path)
{
	const char *s = path_init(nd, flags);
	int err;

	if (unlikely(flags & LOOKUP_DOWN) && !IS_ERR(s)) {
		err = handle_lookup_down(nd);
		if (unlikely(err < 0))
			s = ERR_PTR(err);
	}

	while (!(err = link_path_walk(s, nd)) &&
	       (s = lookup_last(nd)) != NULL)
		;

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-06 13:22                                             ` Christian Brauner
@ 2021-04-06 14:15                                               ` Al Viro
  2021-04-06 14:23                                                 ` Al Viro
  2021-04-06 14:46                                                 ` Christian Brauner
  0 siblings, 2 replies; 31+ messages in thread
From: Al Viro @ 2021-04-06 14:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Tue, Apr 06, 2021 at 03:22:05PM +0200, Christian Brauner wrote:

> Why is a another function in charge of checking the return value of an
> initialization function. If something like path_init() fails why is the
> next caller responsible for rejecting it's return value and then we're
> passing that failure value through the whole function with if (!err)
> ladders but as I said it's mostly style preferences.

Because otherwise you either need *all* paths leading to link_path_walk()
duplicate the logics (and get hurt whenever you miss one) or have "well,
in some cases link_path_walk() handles ERR_PTR() given to it, in some
cases its caller do" mess.

> > >         s = path_init(nd, flags);
> > > -       if (IS_ERR(s))
> > > -               return PTR_ERR(s);
> > 
> > Where has that come from, BTW?  Currently path_lookupat() does no such thing.
> 
> Hm? Are you maybe overlooking path_init() which assigns straight into
> the variable declaration? Or are you referring to sm else?

I'm referring to the fact that your diff is with an already modified path_lookupat()
_and_ those modifications have managed to introduce a bug your patch reverts.
No terminate_walk() paired with that path_init() failure, i.e. path_init() is
responsible for cleanups on its (many) failure exits...

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-06 14:15                                               ` Al Viro
@ 2021-04-06 14:23                                                 ` Al Viro
  2021-04-06 15:37                                                   ` Jens Axboe
  2021-04-06 14:46                                                 ` Christian Brauner
  1 sibling, 1 reply; 31+ messages in thread
From: Al Viro @ 2021-04-06 14:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Tue, Apr 06, 2021 at 02:15:01PM +0000, Al Viro wrote:

> I'm referring to the fact that your diff is with an already modified path_lookupat()
> _and_ those modifications have managed to introduce a bug your patch reverts.
> No terminate_walk() paired with that path_init() failure, i.e. path_init() is
> responsible for cleanups on its (many) failure exits...

I can't tell without seeing the variant your diff is against, but at a guess
it had a non-trivial amount of trouble with missed rcu_read_unlock() in
cases when path_init() fails after having done rcu_read_lock().  For trivial
testcase, consider passing -1 for dfd, so that it would fail with -EBADF.
Or passing 0 for dfd and "blah" for name (assuming your stdin is not a directory).
Sure, you could handle those in path_init() (or delay grabbing rcu_read_lock()
in there, spreading it in a bunch of branches), but duplicated cleanup logics
for a bunch of failure exits is asking for trouble.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-06 14:15                                               ` Al Viro
  2021-04-06 14:23                                                 ` Al Viro
@ 2021-04-06 14:46                                                 ` Christian Brauner
  1 sibling, 0 replies; 31+ messages in thread
From: Christian Brauner @ 2021-04-06 14:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs,
	io-uring

On Tue, Apr 06, 2021 at 02:15:01PM +0000, Al Viro wrote:
> On Tue, Apr 06, 2021 at 03:22:05PM +0200, Christian Brauner wrote:
> 
> > Why is a another function in charge of checking the return value of an
> > initialization function. If something like path_init() fails why is the
> > next caller responsible for rejecting it's return value and then we're
> > passing that failure value through the whole function with if (!err)
> > ladders but as I said it's mostly style preferences.
> 
> Because otherwise you either need *all* paths leading to link_path_walk()
> duplicate the logics (and get hurt whenever you miss one) or have "well,
> in some cases link_path_walk() handles ERR_PTR() given to it, in some
> cases its caller do" mess.
> 
> > > >         s = path_init(nd, flags);
> > > > -       if (IS_ERR(s))
> > > > -               return PTR_ERR(s);
> > > 
> > > Where has that come from, BTW?  Currently path_lookupat() does no such thing.
> > 
> > Hm? Are you maybe overlooking path_init() which assigns straight into
> > the variable declaration? Or are you referring to sm else?
> 
> I'm referring to the fact that your diff is with an already modified path_lookupat()
> _and_ those modifications have managed to introduce a bug your patch reverts.
> No terminate_walk() paired with that path_init() failure, i.e. path_init() is
> responsible for cleanups on its (many) failure exits...

Note that the paste post the patch was just a doodle to illustrate the
point not sm to review in earnest (I should probably comment prefix
things like this with "untested".).

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [syzbot] WARNING in mntput_no_expire (2)
  2021-04-06 14:23                                                 ` Al Viro
@ 2021-04-06 15:37                                                   ` Jens Axboe
  0 siblings, 0 replies; 31+ messages in thread
From: Jens Axboe @ 2021-04-06 15:37 UTC (permalink / raw)
  To: Al Viro, Christian Brauner
  Cc: syzbot, linux-fsdevel, linux-kernel, syzkaller-bugs, io-uring

On 4/6/21 8:23 AM, Al Viro wrote:
> On Tue, Apr 06, 2021 at 02:15:01PM +0000, Al Viro wrote:
> 
>> I'm referring to the fact that your diff is with an already modified path_lookupat()
>> _and_ those modifications have managed to introduce a bug your patch reverts.
>> No terminate_walk() paired with that path_init() failure, i.e. path_init() is
>> responsible for cleanups on its (many) failure exits...
> 
> I can't tell without seeing the variant your diff is against, but at a guess
> it had a non-trivial amount of trouble with missed rcu_read_unlock() in
> cases when path_init() fails after having done rcu_read_lock().  For trivial
> testcase, consider passing -1 for dfd, so that it would fail with -EBADF.
> Or passing 0 for dfd and "blah" for name (assuming your stdin is not a directory).
> Sure, you could handle those in path_init() (or delay grabbing rcu_read_lock()
> in there, spreading it in a bunch of branches), but duplicated cleanup logics
> for a bunch of failure exits is asking for trouble.

Thanks for taking care of this Al, fwiw I'm (mostly) out on vacation.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2021-04-06 15:37 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <[email protected]>
2021-04-01 15:45 ` [syzbot] WARNING in mntput_no_expire (2) Christian Brauner
2021-04-01 16:09   ` Jens Axboe
2021-04-01 17:46     ` Christian Brauner
2021-04-01 17:59       ` Christian Brauner
2021-04-01 19:11         ` Al Viro
2021-04-04  2:34           ` Al Viro
2021-04-04  2:38             ` Al Viro
2021-04-04 11:34             ` Christian Brauner
2021-04-04 15:56               ` Al Viro
2021-04-04 16:40                 ` Christian Brauner
2021-04-04 16:44                   ` Al Viro
2021-04-04 17:05                     ` Christian Brauner
2021-04-04 18:50                       ` Al Viro
2021-04-04 20:17                         ` Al Viro
2021-04-05 11:44                           ` Christian Brauner
2021-04-05 16:18                             ` Al Viro
2021-04-05 17:08                               ` Christian Brauner
2021-04-05 18:23                                 ` Al Viro
2021-04-05 18:28                                   ` Al Viro
2021-04-05 20:07                                     ` Christian Brauner
2021-04-06  1:38                                       ` Al Viro
2021-04-06  2:24                                         ` Al Viro
2021-04-06 12:35                                         ` Christian Brauner
2021-04-06 13:13                                           ` Al Viro
2021-04-06 13:22                                             ` Christian Brauner
2021-04-06 14:15                                               ` Al Viro
2021-04-06 14:23                                                 ` Al Viro
2021-04-06 15:37                                                   ` Jens Axboe
2021-04-06 14:46                                                 ` Christian Brauner
2021-04-04 16:52                   ` Christian Brauner
2021-04-04 16:55                     ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox