public inbox for [email protected]
 help / color / mirror / Atom feed
* occasional metadata I/O errors (-EOPNOTSUPP) on XFS + io_uring
@ 2020-09-15 11:33 Brian Foster
  2020-09-16 13:19 ` Brian Foster
  0 siblings, 1 reply; 5+ messages in thread
From: Brian Foster @ 2020-09-15 11:33 UTC (permalink / raw)
  To: linux-xfs; +Cc: io-uring, Jens Axboe

Hi Jens,

I'm seeing an occasional metadata (read) I/O error (EOPNOTSUPP) when
running Zorro's recent io_uring enabled fsstress on XFS (fsstress -d
<mnt> -n 99999999 -p 8). The storage is a 50GB dm-linear device on a
virtio disk (within a KVM guest). The full callstack of the I/O
submission path is appended below [2], acquired via inserting a
WARN_ON() in my local tree.

From tracing around a bit, it looks like what happens is that fsstress
calls into io_uring, the latter starts a plug and sets plug.nowait =
true (via io_submit_sqes() -> io_submit_state_start()) and eventually
XFS needs to read an inode cluster buffer in the context of this task.
That buffer read ultimately fails due to submit_bio_checks() setting
REQ_NOWAIT on the bio and the following logic in the same function
causing a BLK_STS_NOTSUPP status:

	if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_mq(q))
		goto not_supported;

In turn, this leads to the following behavior in XFS:

[ 3839.273519] XFS (dm-2): metadata I/O error in "xfs_imap_to_bp+0x116/0x2c0 [xfs]" at daddr 0x323a5a0 len 32 error 95
[ 3839.303283] XFS (dm-2): log I/O error -95
[ 3839.321437] XFS (dm-2): xfs_do_force_shutdown(0x2) called from line 1196 of file fs/xfs/xfs_log.c. Return address = ffffffffc12dea8a
[ 3839.323554] XFS (dm-2): Log I/O Error Detected. Shutting down filesystem
[ 3839.324773] XFS (dm-2): Please unmount the filesystem and rectify the problem(s)

I suppose it's possible fsstress is making an invalid request based on
my setup, but I find it a little strange that this state appears to leak
into filesystem I/O requests. What's more concerning is that this also
seems to impact an immediately subsequent log write submission, which is
a fatal error and causes the filesystem to shutdown.

Finally, note that I've seen your patch associated with Zorro's recent
bug report [1] and that does seem to prevent the problem. I'm still
sending this report because the connection between the plug and that
change is not obvious to me, so I wanted to 1.) confirm this is intended
to fix this problem and 2.) try to understand whether this plugging
behavior introduces any constraints on the fs when invoked in io_uring
context. Thoughts? Thanks.

Brian

[1] https://bugzilla.kernel.org/show_bug.cgi?id=209243#c8

[2] fsstress callchain that exhibits I/O error:

[ 3839.327329] ------------[ cut here ]------------
[ 3839.328309] WARNING: CPU: 1 PID: 4160 at fs/xfs/libxfs/xfs_inode_buf.c:149 xfs_imap_to_bp+0x1ec/0x2c0 [xfs]
[ 3839.330605] Modules linked in: xfs(O) libcrc32c tun bridge stp llc ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security snd_hda_codec_generic ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables sunrpc snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device snd_pcm snd_timer virtio_net snd net_failover virtio_balloon failover soundcore i2c_piix4 virtio_console virtio_blk qxl drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm virtio_pci virtio_ring ata_generic serio_raw virtio pata_acpi [last unloaded: xfs]
[ 3839.344011] CPU: 1 PID: 4160 Comm: fsstress Tainted: G           O      5.9.0-rc4+ #164
[ 3839.345550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[ 3839.348430] RIP: 0010:xfs_imap_to_bp+0x1ec/0x2c0 [xfs]
[ 3839.349547] Code: 84 24 98 00 00 00 65 48 33 04 25 28 00 00 00 0f 85 c0 00 00 00 48 81 c4 a0 00 00 00 44 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 41 83 fc f5 75 b1 41 81 e6 00 00 00 40 75 a8 b9 96 00 00 00
[ 3839.353756] RSP: 0018:ffff8881f15d72c0 EFLAGS: 00010282
[ 3839.354782] RAX: ffff8881f15d7300 RBX: 1ffff1103e2bae5c RCX: ffffffffad2fa134
[ 3839.356230] RDX: ffffffffc13d26e0 RSI: 1ffff1103e2bae1e RDI: ffff8881e76161a0
[ 3839.357601] RBP: ffff88815c2fda28 R08: 0000000000000001 R09: ffffed103cec2c35
[ 3839.359041] R10: ffff8881e76161a3 R11: ffffed103cec2c34 R12: 00000000ffffffa1
[ 3839.360332] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8881f15d73e0
[ 3839.361627] FS:  00007fe5b98f2040(0000) GS:ffff8881f7280000(0000) knlGS:0000000000000000
[ 3839.363288] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3839.364844] CR2: 00007fe5b98ef000 CR3: 00000001defd8000 CR4: 00000000000006e0
[ 3839.366409] Call Trace:
[ 3839.367065]  ? xfs_inode_buf_readahead_verify+0x10/0x10 [xfs]
[ 3839.368208]  ? do_raw_spin_lock+0x126/0x290
[ 3839.369092]  ? rwlock_bug.part.0+0x90/0x90
[ 3839.369992]  ? xfs_trans_ijoin+0xf9/0x200 [xfs]
[ 3839.370961]  xfs_trans_log_inode+0x3ef/0x930 [xfs]
[ 3839.371919]  ? rcu_read_lock_sched_held+0xa0/0xd0
[ 3839.372890]  ? xfs_trans_ichgtime+0x190/0x190 [xfs]
[ 3839.373812]  ? do_raw_spin_unlock+0x54/0x250
[ 3839.374911]  xfs_vn_update_time+0x301/0x590 [xfs]
[ 3839.375898]  ? xfs_setattr_mode.isra.0+0x80/0x80 [xfs]
[ 3839.376895]  touch_atime+0x18f/0x1e0
[ 3839.377588]  ? atime_needs_update+0x570/0x570
[ 3839.378377]  ? copy_page_to_iter+0x1e1/0xc60
[ 3839.379216]  generic_file_buffered_read+0x14ed/0x2010
[ 3839.380185]  ? do_syscall_64+0x2d/0x40
[ 3839.380912]  ? native_usergs_sysret64+0x1/0x10
[ 3839.381767]  ? fs_reclaim_release+0xf/0x30
[ 3839.382557]  ? pagecache_get_page+0x6d0/0x6d0
[ 3839.383403]  ? down_read_trylock+0x19a/0x360
[ 3839.384285]  ? xfs_ilock_nowait+0xd5/0x4e0 [xfs]
[ 3839.385175]  ? rwsem_spin_on_owner+0x2b0/0x2b0
[ 3839.386026]  ? rcu_read_lock_bh_held+0xb0/0xb0
[ 3839.386959]  ? xfs_file_buffered_aio_read+0xcf/0x340 [xfs]
[ 3839.388134]  xfs_file_buffered_aio_read+0xe2/0x340 [xfs]
[ 3839.389221]  xfs_file_read_iter+0x257/0x550 [xfs]
[ 3839.390139]  io_iter_do_read+0x82/0x190
[ 3839.390951]  io_read+0x80e/0xbd0
[ 3839.391594]  ? is_bpf_text_address+0x80/0xe0
[ 3839.392382]  ? kiocb_done+0x2e0/0x2e0
[ 3839.393089]  ? profile_setup.cold+0xa1/0xa1
[ 3839.393894]  ? arch_stack_walk+0x9c/0xf0
[ 3839.394684]  ? io_read_prep+0x71/0x2f0
[ 3839.395377]  ? io_prep_rw+0xd90/0xd90
[ 3839.396084]  ? __lock_acquire+0xbd7/0x66b0
[ 3839.397036]  io_issue_sqe+0x4e9/0x5000
[ 3839.397768]  ? lockdep_hardirqs_on_prepare+0x4f0/0x4f0
[ 3839.398753]  ? __fget_files+0x19f/0x2d0
[ 3839.399555]  ? __ia32_sys_io_uring_setup+0x70/0x70
[ 3839.400415]  ? rcu_read_lock_bh_held+0xb0/0xb0
[ 3839.401252]  ? lock_acquire+0x191/0x7d0
[ 3839.401998]  ? __fget_files+0x5/0x2d0
[ 3839.402694]  ? find_held_lock+0x2c/0x110
[ 3839.403417]  ? __fget_files+0x19f/0x2d0
[ 3839.404243]  ? __io_queue_sqe+0x266/0xc30
[ 3839.405004]  __io_queue_sqe+0x266/0xc30
[ 3839.405765]  ? io_wq_submit_work+0x270/0x270
[ 3839.406606]  ? rcu_read_lock_sched_held+0xa1/0xd0
[ 3839.407559]  ? rcu_read_lock_bh_held+0xb0/0xb0
[ 3839.408368]  ? io_file_get+0x415/0x870
[ 3839.409118]  io_submit_sqes+0x1130/0x1d60
[ 3839.409926]  ? __mutex_lock+0x44f/0x11c0
[ 3839.410683]  ? io_queue_sqe+0xf70/0xf70
[ 3839.411387]  ? __x64_sys_io_uring_enter+0x50c/0x7e0
[ 3839.412327]  ? mutex_trylock+0x2b0/0x2b0
[ 3839.413086]  ? __x64_sys_io_uring_enter+0x202/0x7e0
[ 3839.414008]  ? find_held_lock+0x2c/0x110
[ 3839.414763]  ? __x64_sys_io_uring_enter+0x28e/0x7e0
[ 3839.415773]  ? syscall_enter_from_user_mode+0x22/0x80
[ 3839.416737]  __x64_sys_io_uring_enter+0x51d/0x7e0
[ 3839.417655]  do_syscall_64+0x2d/0x40
[ 3839.418315]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3839.419262] RIP: 0033:0x7fe5b99ef37d
[ 3839.419966] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d eb 6a 0c 00 f7 d8 64 89 01 48
[ 3839.423393] RSP: 002b:00007ffcc0d57838 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
[ 3839.424975] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe5b99ef37d
[ 3839.426374] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000003
[ 3839.427672] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000008
[ 3839.428975] R10: 0000000000000001 R11: 0000000000000216 R12: 000000000000f8b8
[ 3839.430379] R13: 00007fe5b9aff000 R14: 000000000001b991 R15: 0000000000a4f440
[ 3839.431782] irq event stamp: 54797391
[ 3839.432456] hardirqs last  enabled at (54797399): [<ffffffffad3130db>] console_unlock+0x6eb/0x9b0
[ 3839.434248] hardirqs last disabled at (54797408): [<ffffffffad312fad>] console_unlock+0x5bd/0x9b0
[ 3839.435915] softirqs last  enabled at (54797422): [<ffffffffaf2005b9>] __do_softirq+0x5b9/0x8a0
[ 3839.437463] softirqs last disabled at (54797417): [<ffffffffaf000faf>] asm_call_on_stack+0xf/0x20
[ 3839.439061] ---[ end trace cbd4c49b2879030e ]---


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: occasional metadata I/O errors (-EOPNOTSUPP) on XFS + io_uring
  2020-09-15 11:33 occasional metadata I/O errors (-EOPNOTSUPP) on XFS + io_uring Brian Foster
@ 2020-09-16 13:19 ` Brian Foster
  2020-09-16 16:55   ` Jens Axboe
  0 siblings, 1 reply; 5+ messages in thread
From: Brian Foster @ 2020-09-16 13:19 UTC (permalink / raw)
  To: linux-xfs; +Cc: io-uring, Jens Axboe

On Tue, Sep 15, 2020 at 07:33:27AM -0400, Brian Foster wrote:
> Hi Jens,
> 
> I'm seeing an occasional metadata (read) I/O error (EOPNOTSUPP) when
> running Zorro's recent io_uring enabled fsstress on XFS (fsstress -d
> <mnt> -n 99999999 -p 8). The storage is a 50GB dm-linear device on a
> virtio disk (within a KVM guest). The full callstack of the I/O
> submission path is appended below [2], acquired via inserting a
> WARN_ON() in my local tree.
> 
> From tracing around a bit, it looks like what happens is that fsstress
> calls into io_uring, the latter starts a plug and sets plug.nowait =
> true (via io_submit_sqes() -> io_submit_state_start()) and eventually
> XFS needs to read an inode cluster buffer in the context of this task.
> That buffer read ultimately fails due to submit_bio_checks() setting
> REQ_NOWAIT on the bio and the following logic in the same function
> causing a BLK_STS_NOTSUPP status:
> 
> 	if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_mq(q))
> 		goto not_supported;
> 
> In turn, this leads to the following behavior in XFS:
> 
> [ 3839.273519] XFS (dm-2): metadata I/O error in "xfs_imap_to_bp+0x116/0x2c0 [xfs]" at daddr 0x323a5a0 len 32 error 95
> [ 3839.303283] XFS (dm-2): log I/O error -95
> [ 3839.321437] XFS (dm-2): xfs_do_force_shutdown(0x2) called from line 1196 of file fs/xfs/xfs_log.c. Return address = ffffffffc12dea8a
> [ 3839.323554] XFS (dm-2): Log I/O Error Detected. Shutting down filesystem
> [ 3839.324773] XFS (dm-2): Please unmount the filesystem and rectify the problem(s)
> 
> I suppose it's possible fsstress is making an invalid request based on
> my setup, but I find it a little strange that this state appears to leak
> into filesystem I/O requests. What's more concerning is that this also
> seems to impact an immediately subsequent log write submission, which is
> a fatal error and causes the filesystem to shutdown.
> 
> Finally, note that I've seen your patch associated with Zorro's recent
> bug report [1] and that does seem to prevent the problem. I'm still
> sending this report because the connection between the plug and that
> change is not obvious to me, so I wanted to 1.) confirm this is intended
> to fix this problem and 2.) try to understand whether this plugging
> behavior introduces any constraints on the fs when invoked in io_uring
> context. Thoughts? Thanks.
> 

To expand on this a bit, I was playing more with the aforementioned fix
yesterday while waiting for this email's several hour trip to the
mailing list to complete and eventually realized that I don't think the
plug.nowait thing properly accommodates XFS' use of multiple devices. A
simple example is XFS on a data device with mq support and an external
log device without mq support. Presumably io_uring requests could thus
enter XFS with plug.nowait set to true, and then any log bio submission
that happens to occur in that context is doomed to fail and shutdown the
fs.

I'm not sure how likely a use case and/or how easy to reproduce that
would be, but I think it's technically valid and possible, so probably
something else to consider with this whole io_uring context topic..
Perhaps this state is something that should be associated with bios that
target a particular device rather than something that arbitrates all bio
submissions from a particular task context (i.e. current->plug)..?

Brian

> Brian
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=209243#c8
> 
> [2] fsstress callchain that exhibits I/O error:
> 
> [ 3839.327329] ------------[ cut here ]------------
> [ 3839.328309] WARNING: CPU: 1 PID: 4160 at fs/xfs/libxfs/xfs_inode_buf.c:149 xfs_imap_to_bp+0x1ec/0x2c0 [xfs]
> [ 3839.330605] Modules linked in: xfs(O) libcrc32c tun bridge stp llc ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security snd_hda_codec_generic ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables sunrpc snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device snd_pcm snd_timer virtio_net snd net_failover virtio_balloon failover soundcore i2c_piix4 virtio_console virtio_blk qxl drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm virtio_pci virtio_ring ata_generic serio_raw virtio pata_acpi [last unloaded: xfs]
> [ 3839.344011] CPU: 1 PID: 4160 Comm: fsstress Tainted: G           O      5.9.0-rc4+ #164
> [ 3839.345550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> [ 3839.348430] RIP: 0010:xfs_imap_to_bp+0x1ec/0x2c0 [xfs]
> [ 3839.349547] Code: 84 24 98 00 00 00 65 48 33 04 25 28 00 00 00 0f 85 c0 00 00 00 48 81 c4 a0 00 00 00 44 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b 41 83 fc f5 75 b1 41 81 e6 00 00 00 40 75 a8 b9 96 00 00 00
> [ 3839.353756] RSP: 0018:ffff8881f15d72c0 EFLAGS: 00010282
> [ 3839.354782] RAX: ffff8881f15d7300 RBX: 1ffff1103e2bae5c RCX: ffffffffad2fa134
> [ 3839.356230] RDX: ffffffffc13d26e0 RSI: 1ffff1103e2bae1e RDI: ffff8881e76161a0
> [ 3839.357601] RBP: ffff88815c2fda28 R08: 0000000000000001 R09: ffffed103cec2c35
> [ 3839.359041] R10: ffff8881e76161a3 R11: ffffed103cec2c34 R12: 00000000ffffffa1
> [ 3839.360332] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8881f15d73e0
> [ 3839.361627] FS:  00007fe5b98f2040(0000) GS:ffff8881f7280000(0000) knlGS:0000000000000000
> [ 3839.363288] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3839.364844] CR2: 00007fe5b98ef000 CR3: 00000001defd8000 CR4: 00000000000006e0
> [ 3839.366409] Call Trace:
> [ 3839.367065]  ? xfs_inode_buf_readahead_verify+0x10/0x10 [xfs]
> [ 3839.368208]  ? do_raw_spin_lock+0x126/0x290
> [ 3839.369092]  ? rwlock_bug.part.0+0x90/0x90
> [ 3839.369992]  ? xfs_trans_ijoin+0xf9/0x200 [xfs]
> [ 3839.370961]  xfs_trans_log_inode+0x3ef/0x930 [xfs]
> [ 3839.371919]  ? rcu_read_lock_sched_held+0xa0/0xd0
> [ 3839.372890]  ? xfs_trans_ichgtime+0x190/0x190 [xfs]
> [ 3839.373812]  ? do_raw_spin_unlock+0x54/0x250
> [ 3839.374911]  xfs_vn_update_time+0x301/0x590 [xfs]
> [ 3839.375898]  ? xfs_setattr_mode.isra.0+0x80/0x80 [xfs]
> [ 3839.376895]  touch_atime+0x18f/0x1e0
> [ 3839.377588]  ? atime_needs_update+0x570/0x570
> [ 3839.378377]  ? copy_page_to_iter+0x1e1/0xc60
> [ 3839.379216]  generic_file_buffered_read+0x14ed/0x2010
> [ 3839.380185]  ? do_syscall_64+0x2d/0x40
> [ 3839.380912]  ? native_usergs_sysret64+0x1/0x10
> [ 3839.381767]  ? fs_reclaim_release+0xf/0x30
> [ 3839.382557]  ? pagecache_get_page+0x6d0/0x6d0
> [ 3839.383403]  ? down_read_trylock+0x19a/0x360
> [ 3839.384285]  ? xfs_ilock_nowait+0xd5/0x4e0 [xfs]
> [ 3839.385175]  ? rwsem_spin_on_owner+0x2b0/0x2b0
> [ 3839.386026]  ? rcu_read_lock_bh_held+0xb0/0xb0
> [ 3839.386959]  ? xfs_file_buffered_aio_read+0xcf/0x340 [xfs]
> [ 3839.388134]  xfs_file_buffered_aio_read+0xe2/0x340 [xfs]
> [ 3839.389221]  xfs_file_read_iter+0x257/0x550 [xfs]
> [ 3839.390139]  io_iter_do_read+0x82/0x190
> [ 3839.390951]  io_read+0x80e/0xbd0
> [ 3839.391594]  ? is_bpf_text_address+0x80/0xe0
> [ 3839.392382]  ? kiocb_done+0x2e0/0x2e0
> [ 3839.393089]  ? profile_setup.cold+0xa1/0xa1
> [ 3839.393894]  ? arch_stack_walk+0x9c/0xf0
> [ 3839.394684]  ? io_read_prep+0x71/0x2f0
> [ 3839.395377]  ? io_prep_rw+0xd90/0xd90
> [ 3839.396084]  ? __lock_acquire+0xbd7/0x66b0
> [ 3839.397036]  io_issue_sqe+0x4e9/0x5000
> [ 3839.397768]  ? lockdep_hardirqs_on_prepare+0x4f0/0x4f0
> [ 3839.398753]  ? __fget_files+0x19f/0x2d0
> [ 3839.399555]  ? __ia32_sys_io_uring_setup+0x70/0x70
> [ 3839.400415]  ? rcu_read_lock_bh_held+0xb0/0xb0
> [ 3839.401252]  ? lock_acquire+0x191/0x7d0
> [ 3839.401998]  ? __fget_files+0x5/0x2d0
> [ 3839.402694]  ? find_held_lock+0x2c/0x110
> [ 3839.403417]  ? __fget_files+0x19f/0x2d0
> [ 3839.404243]  ? __io_queue_sqe+0x266/0xc30
> [ 3839.405004]  __io_queue_sqe+0x266/0xc30
> [ 3839.405765]  ? io_wq_submit_work+0x270/0x270
> [ 3839.406606]  ? rcu_read_lock_sched_held+0xa1/0xd0
> [ 3839.407559]  ? rcu_read_lock_bh_held+0xb0/0xb0
> [ 3839.408368]  ? io_file_get+0x415/0x870
> [ 3839.409118]  io_submit_sqes+0x1130/0x1d60
> [ 3839.409926]  ? __mutex_lock+0x44f/0x11c0
> [ 3839.410683]  ? io_queue_sqe+0xf70/0xf70
> [ 3839.411387]  ? __x64_sys_io_uring_enter+0x50c/0x7e0
> [ 3839.412327]  ? mutex_trylock+0x2b0/0x2b0
> [ 3839.413086]  ? __x64_sys_io_uring_enter+0x202/0x7e0
> [ 3839.414008]  ? find_held_lock+0x2c/0x110
> [ 3839.414763]  ? __x64_sys_io_uring_enter+0x28e/0x7e0
> [ 3839.415773]  ? syscall_enter_from_user_mode+0x22/0x80
> [ 3839.416737]  __x64_sys_io_uring_enter+0x51d/0x7e0
> [ 3839.417655]  do_syscall_64+0x2d/0x40
> [ 3839.418315]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 3839.419262] RIP: 0033:0x7fe5b99ef37d
> [ 3839.419966] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d eb 6a 0c 00 f7 d8 64 89 01 48
> [ 3839.423393] RSP: 002b:00007ffcc0d57838 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
> [ 3839.424975] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe5b99ef37d
> [ 3839.426374] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000003
> [ 3839.427672] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000008
> [ 3839.428975] R10: 0000000000000001 R11: 0000000000000216 R12: 000000000000f8b8
> [ 3839.430379] R13: 00007fe5b9aff000 R14: 000000000001b991 R15: 0000000000a4f440
> [ 3839.431782] irq event stamp: 54797391
> [ 3839.432456] hardirqs last  enabled at (54797399): [<ffffffffad3130db>] console_unlock+0x6eb/0x9b0
> [ 3839.434248] hardirqs last disabled at (54797408): [<ffffffffad312fad>] console_unlock+0x5bd/0x9b0
> [ 3839.435915] softirqs last  enabled at (54797422): [<ffffffffaf2005b9>] __do_softirq+0x5b9/0x8a0
> [ 3839.437463] softirqs last disabled at (54797417): [<ffffffffaf000faf>] asm_call_on_stack+0xf/0x20
> [ 3839.439061] ---[ end trace cbd4c49b2879030e ]---
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: occasional metadata I/O errors (-EOPNOTSUPP) on XFS + io_uring
  2020-09-16 13:19 ` Brian Foster
@ 2020-09-16 16:55   ` Jens Axboe
  2020-09-16 18:05     ` Brian Foster
  0 siblings, 1 reply; 5+ messages in thread
From: Jens Axboe @ 2020-09-16 16:55 UTC (permalink / raw)
  To: Brian Foster, linux-xfs; +Cc: io-uring

On 9/16/20 7:19 AM, Brian Foster wrote:
> On Tue, Sep 15, 2020 at 07:33:27AM -0400, Brian Foster wrote:
>> Hi Jens,
>>
>> I'm seeing an occasional metadata (read) I/O error (EOPNOTSUPP) when
>> running Zorro's recent io_uring enabled fsstress on XFS (fsstress -d
>> <mnt> -n 99999999 -p 8). The storage is a 50GB dm-linear device on a
>> virtio disk (within a KVM guest). The full callstack of the I/O
>> submission path is appended below [2], acquired via inserting a
>> WARN_ON() in my local tree.
>>
>> From tracing around a bit, it looks like what happens is that fsstress
>> calls into io_uring, the latter starts a plug and sets plug.nowait =
>> true (via io_submit_sqes() -> io_submit_state_start()) and eventually
>> XFS needs to read an inode cluster buffer in the context of this task.
>> That buffer read ultimately fails due to submit_bio_checks() setting
>> REQ_NOWAIT on the bio and the following logic in the same function
>> causing a BLK_STS_NOTSUPP status:
>>
>> 	if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_mq(q))
>> 		goto not_supported;
>>
>> In turn, this leads to the following behavior in XFS:
>>
>> [ 3839.273519] XFS (dm-2): metadata I/O error in "xfs_imap_to_bp+0x116/0x2c0 [xfs]" at daddr 0x323a5a0 len 32 error 95
>> [ 3839.303283] XFS (dm-2): log I/O error -95
>> [ 3839.321437] XFS (dm-2): xfs_do_force_shutdown(0x2) called from line 1196 of file fs/xfs/xfs_log.c. Return address = ffffffffc12dea8a
>> [ 3839.323554] XFS (dm-2): Log I/O Error Detected. Shutting down filesystem
>> [ 3839.324773] XFS (dm-2): Please unmount the filesystem and rectify the problem(s)
>>
>> I suppose it's possible fsstress is making an invalid request based on
>> my setup, but I find it a little strange that this state appears to leak
>> into filesystem I/O requests. What's more concerning is that this also
>> seems to impact an immediately subsequent log write submission, which is
>> a fatal error and causes the filesystem to shutdown.
>>
>> Finally, note that I've seen your patch associated with Zorro's recent
>> bug report [1] and that does seem to prevent the problem. I'm still
>> sending this report because the connection between the plug and that
>> change is not obvious to me, so I wanted to 1.) confirm this is intended
>> to fix this problem and 2.) try to understand whether this plugging
>> behavior introduces any constraints on the fs when invoked in io_uring
>> context. Thoughts? Thanks.
>>
> 
> To expand on this a bit, I was playing more with the aforementioned fix
> yesterday while waiting for this email's several hour trip to the
> mailing list to complete and eventually realized that I don't think the
> plug.nowait thing properly accommodates XFS' use of multiple devices. A
> simple example is XFS on a data device with mq support and an external
> log device without mq support. Presumably io_uring requests could thus
> enter XFS with plug.nowait set to true, and then any log bio submission
> that happens to occur in that context is doomed to fail and shutdown the
> fs.

Do we ever read from the logdev? It'll only be a concern on the read
side. And even from there, you'd need nested reads from the log device.

In general, the 'can async' check should be advisory, the -EAGAIN
or -EOPNOTSUPP should be caught and reissued. The failure path was
just related to this happening off the retry path on arming for the
async buffered callback.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: occasional metadata I/O errors (-EOPNOTSUPP) on XFS + io_uring
  2020-09-16 16:55   ` Jens Axboe
@ 2020-09-16 18:05     ` Brian Foster
  2020-09-16 18:12       ` Jens Axboe
  0 siblings, 1 reply; 5+ messages in thread
From: Brian Foster @ 2020-09-16 18:05 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-xfs, io-uring

On Wed, Sep 16, 2020 at 10:55:08AM -0600, Jens Axboe wrote:
> On 9/16/20 7:19 AM, Brian Foster wrote:
> > On Tue, Sep 15, 2020 at 07:33:27AM -0400, Brian Foster wrote:
> >> Hi Jens,
> >>
> >> I'm seeing an occasional metadata (read) I/O error (EOPNOTSUPP) when
> >> running Zorro's recent io_uring enabled fsstress on XFS (fsstress -d
> >> <mnt> -n 99999999 -p 8). The storage is a 50GB dm-linear device on a
> >> virtio disk (within a KVM guest). The full callstack of the I/O
> >> submission path is appended below [2], acquired via inserting a
> >> WARN_ON() in my local tree.
> >>
> >> From tracing around a bit, it looks like what happens is that fsstress
> >> calls into io_uring, the latter starts a plug and sets plug.nowait =
> >> true (via io_submit_sqes() -> io_submit_state_start()) and eventually
> >> XFS needs to read an inode cluster buffer in the context of this task.
> >> That buffer read ultimately fails due to submit_bio_checks() setting
> >> REQ_NOWAIT on the bio and the following logic in the same function
> >> causing a BLK_STS_NOTSUPP status:
> >>
> >> 	if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_mq(q))
> >> 		goto not_supported;
> >>
> >> In turn, this leads to the following behavior in XFS:
> >>
> >> [ 3839.273519] XFS (dm-2): metadata I/O error in "xfs_imap_to_bp+0x116/0x2c0 [xfs]" at daddr 0x323a5a0 len 32 error 95
> >> [ 3839.303283] XFS (dm-2): log I/O error -95
> >> [ 3839.321437] XFS (dm-2): xfs_do_force_shutdown(0x2) called from line 1196 of file fs/xfs/xfs_log.c. Return address = ffffffffc12dea8a
> >> [ 3839.323554] XFS (dm-2): Log I/O Error Detected. Shutting down filesystem
> >> [ 3839.324773] XFS (dm-2): Please unmount the filesystem and rectify the problem(s)
> >>
> >> I suppose it's possible fsstress is making an invalid request based on
> >> my setup, but I find it a little strange that this state appears to leak
> >> into filesystem I/O requests. What's more concerning is that this also
> >> seems to impact an immediately subsequent log write submission, which is
> >> a fatal error and causes the filesystem to shutdown.
> >>
> >> Finally, note that I've seen your patch associated with Zorro's recent
> >> bug report [1] and that does seem to prevent the problem. I'm still
> >> sending this report because the connection between the plug and that
> >> change is not obvious to me, so I wanted to 1.) confirm this is intended
> >> to fix this problem and 2.) try to understand whether this plugging
> >> behavior introduces any constraints on the fs when invoked in io_uring
> >> context. Thoughts? Thanks.
> >>
> > 
> > To expand on this a bit, I was playing more with the aforementioned fix
> > yesterday while waiting for this email's several hour trip to the
> > mailing list to complete and eventually realized that I don't think the
> > plug.nowait thing properly accommodates XFS' use of multiple devices. A
> > simple example is XFS on a data device with mq support and an external
> > log device without mq support. Presumably io_uring requests could thus
> > enter XFS with plug.nowait set to true, and then any log bio submission
> > that happens to occur in that context is doomed to fail and shutdown the
> > fs.
> 
> Do we ever read from the logdev? It'll only be a concern on the read
> side. And even from there, you'd need nested reads from the log device.
> 

We only read from the log device on log recovery (during filesystem
mount), but I don't follow why that matters since log writes originate
within XFS (not userspace). Do you mean to ask whether we access the log
in the context of userspace reads.. ?

We currently write to the log from various runtime contexts. I don't
_think_ that we currently ever do so during a file read, but log forces
can be async and buried under layers of indirection which makes it
difficult to reason about (and prevent in the future, if necessary). For
example, attempting to lock a stale buffer can issue an async log force.

FWIW and to confirm the above, a simple experiment to issue a log force
in XFS' read_iter() does reproduce the same shutdown condition described
above when XFS is mounted with a mq data device and !mq external log
device. That may or may not be a theoretical condition at the moment,
but it kind of looks like a landmine to me. Perhaps we'll need to come
up with a more explicit way of ensuring we never submit log bios from a
context where we know the block subsystem will swat them away...

> In general, the 'can async' check should be advisory, the -EAGAIN
> or -EOPNOTSUPP should be caught and reissued. The failure path was
> just related to this happening off the retry path on arming for the
> async buffered callback.
> 

I think the issue here is that io_uring is not in the path between XFS
and the log device. Therefore, XFS receives the log I/O error directly
and shuts down. I do think it's fair to argue that io_uring should not
be setting task level context that enforces strict device specific
requirements on I/O submission and then call into subsystems that can
submit I/O to disparate/unrelated devices. That said, I'm not intimately
familiar with the problem this is trying to solve...

Brian

> -- 
> Jens Axboe
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: occasional metadata I/O errors (-EOPNOTSUPP) on XFS + io_uring
  2020-09-16 18:05     ` Brian Foster
@ 2020-09-16 18:12       ` Jens Axboe
  0 siblings, 0 replies; 5+ messages in thread
From: Jens Axboe @ 2020-09-16 18:12 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, io-uring

On 9/16/20 12:05 PM, Brian Foster wrote:
> On Wed, Sep 16, 2020 at 10:55:08AM -0600, Jens Axboe wrote:
>> On 9/16/20 7:19 AM, Brian Foster wrote:
>>> On Tue, Sep 15, 2020 at 07:33:27AM -0400, Brian Foster wrote:
>>>> Hi Jens,
>>>>
>>>> I'm seeing an occasional metadata (read) I/O error (EOPNOTSUPP) when
>>>> running Zorro's recent io_uring enabled fsstress on XFS (fsstress -d
>>>> <mnt> -n 99999999 -p 8). The storage is a 50GB dm-linear device on a
>>>> virtio disk (within a KVM guest). The full callstack of the I/O
>>>> submission path is appended below [2], acquired via inserting a
>>>> WARN_ON() in my local tree.
>>>>
>>>> From tracing around a bit, it looks like what happens is that fsstress
>>>> calls into io_uring, the latter starts a plug and sets plug.nowait =
>>>> true (via io_submit_sqes() -> io_submit_state_start()) and eventually
>>>> XFS needs to read an inode cluster buffer in the context of this task.
>>>> That buffer read ultimately fails due to submit_bio_checks() setting
>>>> REQ_NOWAIT on the bio and the following logic in the same function
>>>> causing a BLK_STS_NOTSUPP status:
>>>>
>>>> 	if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_mq(q))
>>>> 		goto not_supported;
>>>>
>>>> In turn, this leads to the following behavior in XFS:
>>>>
>>>> [ 3839.273519] XFS (dm-2): metadata I/O error in "xfs_imap_to_bp+0x116/0x2c0 [xfs]" at daddr 0x323a5a0 len 32 error 95
>>>> [ 3839.303283] XFS (dm-2): log I/O error -95
>>>> [ 3839.321437] XFS (dm-2): xfs_do_force_shutdown(0x2) called from line 1196 of file fs/xfs/xfs_log.c. Return address = ffffffffc12dea8a
>>>> [ 3839.323554] XFS (dm-2): Log I/O Error Detected. Shutting down filesystem
>>>> [ 3839.324773] XFS (dm-2): Please unmount the filesystem and rectify the problem(s)
>>>>
>>>> I suppose it's possible fsstress is making an invalid request based on
>>>> my setup, but I find it a little strange that this state appears to leak
>>>> into filesystem I/O requests. What's more concerning is that this also
>>>> seems to impact an immediately subsequent log write submission, which is
>>>> a fatal error and causes the filesystem to shutdown.
>>>>
>>>> Finally, note that I've seen your patch associated with Zorro's recent
>>>> bug report [1] and that does seem to prevent the problem. I'm still
>>>> sending this report because the connection between the plug and that
>>>> change is not obvious to me, so I wanted to 1.) confirm this is intended
>>>> to fix this problem and 2.) try to understand whether this plugging
>>>> behavior introduces any constraints on the fs when invoked in io_uring
>>>> context. Thoughts? Thanks.
>>>>
>>>
>>> To expand on this a bit, I was playing more with the aforementioned fix
>>> yesterday while waiting for this email's several hour trip to the
>>> mailing list to complete and eventually realized that I don't think the
>>> plug.nowait thing properly accommodates XFS' use of multiple devices. A
>>> simple example is XFS on a data device with mq support and an external
>>> log device without mq support. Presumably io_uring requests could thus
>>> enter XFS with plug.nowait set to true, and then any log bio submission
>>> that happens to occur in that context is doomed to fail and shutdown the
>>> fs.
>>
>> Do we ever read from the logdev? It'll only be a concern on the read
>> side. And even from there, you'd need nested reads from the log device.
>>
> 
> We only read from the log device on log recovery (during filesystem
> mount), but I don't follow why that matters since log writes originate
> within XFS (not userspace). Do you mean to ask whether we access the log
> in the context of userspace reads.. ?
> 
> We currently write to the log from various runtime contexts. I don't
> _think_ that we currently ever do so during a file read, but log forces
> can be async and buried under layers of indirection which makes it
> difficult to reason about (and prevent in the future, if necessary). For
> example, attempting to lock a stale buffer can issue an async log force.
> 
> FWIW and to confirm the above, a simple experiment to issue a log force
> in XFS' read_iter() does reproduce the same shutdown condition described
> above when XFS is mounted with a mq data device and !mq external log
> device. That may or may not be a theoretical condition at the moment,
> but it kind of looks like a landmine to me. Perhaps we'll need to come
> up with a more explicit way of ensuring we never submit log bios from a
> context where we know the block subsystem will swat them away...
> 
>> In general, the 'can async' check should be advisory, the -EAGAIN
>> or -EOPNOTSUPP should be caught and reissued. The failure path was
>> just related to this happening off the retry path on arming for the
>> async buffered callback.
>>
> 
> I think the issue here is that io_uring is not in the path between XFS
> and the log device. Therefore, XFS receives the log I/O error directly
> and shuts down. I do think it's fair to argue that io_uring should not
> be setting task level context that enforces strict device specific
> requirements on I/O submission and then call into subsystems that can
> submit I/O to disparate/unrelated devices. That said, I'm not intimately
> familiar with the problem this is trying to solve...

I agree (with both this and the above), we should make this stronger.
I'll take a look.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-09-16 20:38 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-09-15 11:33 occasional metadata I/O errors (-EOPNOTSUPP) on XFS + io_uring Brian Foster
2020-09-16 13:19 ` Brian Foster
2020-09-16 16:55   ` Jens Axboe
2020-09-16 18:05     ` Brian Foster
2020-09-16 18:12       ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox