From: John Garry <[email protected]>
To: Jens Axboe <[email protected]>, [email protected]
Cc: [email protected], [email protected]
Subject: Re: [PATCH 2/3] io_uring/rw: handle -EAGAIN retry at IO completion time
Date: Wed, 5 Mar 2025 16:57:50 +0000
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On 04/03/2025 18:10, John Garry wrote:
> On 09/01/2025 18:15, Jens Axboe wrote:
>> Rather than try and have io_read/io_write turn REQ_F_REISSUE into
>> -EAGAIN, catch the REQ_F_REISSUE when the request is otherwise
>> considered as done. This is saner as we know this isn't happening
>> during an actual submission, and it removes the need to randomly
>> check REQ_F_REISSUE after read/write submission.
>>
>> If REQ_F_REISSUE is set, __io_submit_flush_completions() will skip over
>> this request in terms of posting a CQE, and the regular request
>> cleaning will ensure that it gets reissued via io-wq.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
>
As further info: I can easily recreate this on the latest
block/io_uring-6.14 branch, on real NVMe HW:
[ 215.439892] cloud-init[2675]: Cloud-init v. 24.4-0ubuntu1~22.04.1 finished at Wed, 05 Mar 2025 16:48:44 +0000. Datasource DataSourceOracle. Up 215.38 seconds
[ 371.784849] ==================================================================
[ 371.871354] BUG: KASAN: slab-use-after-free in bio_poll+0x300/0x4c0
[ 371.946393] Read of size 4 at addr ff11000170f75b74 by task fio/3372
[ 372.022458]
[ 372.040266] CPU: 7 UID: 0 PID: 3372 Comm: fio Not tainted 6.13.0+ #16
[ 372.117374] Hardware name: Oracle Corporation ORACLE SERVER X9-2c/TLA,MB TRAY,X9-2c, BIOS 66040600 07/23/2021
[ 372.236091] Call Trace:
[ 372.265337] <TASK>
[ 372.290423] dump_stack_lvl+0x76/0xa0
[ 372.334246] print_report+0xd1/0x630
[ 372.377024] ? bio_poll+0x300/0x4c0
[ 372.418759] ? kasan_complete_mode_report_info+0x6a/0x200
[ 372.483382] ? bio_poll+0x300/0x4c0
[ 372.525110] kasan_report+0xb8/0x100
[ 372.567882] ? bio_poll+0x300/0x4c0
[ 372.609611] __asan_report_load4_noabort+0x14/0x30
[ 372.666946] bio_poll+0x300/0x4c0
[ 372.706597] iocb_bio_iopoll+0x3c/0x70
[ 372.751448] io_uring_classic_poll+0xb0/0x190
[ 372.803587] io_do_iopoll+0x471/0x1080
[ 372.848440] ? __io_read+0x206/0x1050
[ 372.892253] ? fget+0x83/0x260
[ 372.928786] ? __pfx_io_do_iopoll+0x10/0x10
[ 372.978840] ? io_read+0x17/0x50
[ 373.017454] ? __kasan_check_write+0x14/0x30
[ 373.068546] ? mutex_lock+0x86/0xe0
[ 373.110278] ? __pfx_mutex_lock+0x10/0x10
[ 373.158253] __do_sys_io_uring_enter+0x7ca/0x14a0
[ 373.214550] ? __pfx___do_sys_io_uring_enter+0x10/0x10
[ 373.276048] ? __pfx___do_sys_io_uring_enter+0x10/0x10
[ 373.337544] ? __kasan_check_write+0x14/0x30
[ 373.388637] ? fput+0x1b/0x310
[ 373.425168] ? __do_sys_io_uring_enter+0x996/0x14a0
[ 373.483546] __x64_sys_io_uring_enter+0xe0/0x1b0
[ 373.538800] x64_sys_call+0x16a1/0x2540
[ 373.584695] do_syscall_64+0x7e/0x170
[ 373.628511] ? fpregs_assert_state_consistent+0x21/0xb0
[ 373.691052] ? syscall_exit_to_user_mode+0x4e/0x240
[ 373.749435] ? do_syscall_64+0x8a/0x170
[ 373.795326] ? __kasan_check_read+0x11/0x20
[ 373.845378] ? fpregs_assert_state_consistent+0x21/0xb0
[ 373.907917] ? syscall_exit_to_user_mode+0x4e/0x240
[ 373.966292] ? do_syscall_64+0x8a/0x170
[ 374.012184] ? __kasan_check_read+0x11/0x20
[ 374.062842] ? __kasan_check_read+0x11/0x20
[ 374.113483] ? fpregs_assert_state_consistent+0x21/0xb0
[ 374.176597] ? syscall_exit_to_user_mode+0x4e/0x240
[ 374.235530] ? do_syscall_64+0x8a/0x170
[ 374.281951] ? syscall_exit_to_user_mode+0x4e/0x240
[ 374.340855] ? do_syscall_64+0x8a/0x170
[ 374.387264] ? __kasan_check_read+0x11/0x20
[ 374.437828] ? fpregs_assert_state_consistent+0x21/0xb0
[ 374.500868] ? syscall_exit_to_user_mode+0x4e/0x240
[ 374.559737] ? clear_bhb_loop+0x15/0x70
[ 374.606100] ? clear_bhb_loop+0x15/0x70
[ 374.652461] ? clear_bhb_loop+0x15/0x70
[ 374.698816] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 374.759739] RIP: 0033:0x71871b11e88d
[ 374.802982] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48
[ 375.028802] RSP: 002b:00007ffd4a4cf298 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
[ 375.119949] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 000071871b11e88d
[ 375.205904] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000006
[ 375.291858] RBP: 0000718710f762c0 R08: 0000000000000000 R09: 0000000000000000
[ 375.377805] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
[ 375.463754] R13: 0000000000000001 R14: 0000584669038c60 R15: 0000000000000000
[ 375.549701] </TASK>
[ 375.576358]
[ 375.594687] Allocated by task 2986:
[ 375.637027]
[ 375.655357] Freed by task 3376:
[ 375.693501]
[ 375.711823] Last potentially related work creation:
[ 375.770791]
[ 375.789121] The buggy address belongs to the object at ff11000170f75b00
[ 375.789121] which belongs to the cache bio-264 of size 264
[ 375.935951] The buggy address is located 116 bytes inside of
[ 375.935951] freed 264-byte region [ff11000170f75b00, ff11000170f75c08)
[ 376.083858]
[ 376.102233] The buggy address belongs to the physical page:
[ 376.169531]
[ 376.187916] Memory state around the buggy address:
[ 376.245819] ff11000170f75a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 376.332857] ff11000170f75a80: 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 376.419894] >ff11000170f75b00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 376.506945]                                                              ^
[ 376.589835] ff11000170f75b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 376.677180] ff11000170f75c00: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 376.764502] ==================================================================
Note that I use hipri for fio, as below - without it, things seem ok.
fio --filename=/dev/nvme0n1p3 --direct=1 --rw=read --bs=4k --iodepth=100
--name=iops --numjobs=20 --loops=1000 --ioengine=io_uring
--group_reporting --exitall_on_error --hipri
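
For reference, --hipri has fio's io_uring engine create the ring with
IORING_SETUP_IOPOLL, which is what drives the io_do_iopoll() ->
iocb_bio_iopoll() -> bio_poll() path seen in the trace above. A minimal
liburing sketch of that mode, for anyone trying to reproduce (my own
illustration, not fio code; assumes liburing is installed and reuses the
device path from the fio command line):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	void *buf;
	int fd;

	/* IOPOLL ring: completions are reaped by polling, not IRQs */
	if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL))
		return 1;

	/* IOPOLL requires O_DIRECT on a polled-capable block device */
	fd = open("/dev/nvme0n1p3", O_RDONLY | O_DIRECT);
	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, 4096, 0);
	io_uring_submit(&ring);

	/* reaping the CQE here is what ends up calling bio_poll() */
	if (io_uring_wait_cqe(&ring, &cqe))
		return 1;
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}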
> JFYI, this patch causes or exposes an issue in scsi_debug where we get a
> use-after-free:
>
> Starting 10 processes
> [ 9.445254] ==================================================================
> [ 9.446156] BUG: KASAN: slab-use-after-free in bio_poll+0x26b/0x420
> [ 9.447188] Read of size 4 at addr ff1100014c9b46b4 by task fio/442
> [ 9.447933]
> [ 9.448121] CPU: 8 UID: 0 PID: 442 Comm: fio Not tainted 6.13.0-rc4-00052-gfdf8fc8dce75 #3390
> [ 9.449161] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 9.450573] Call Trace:
> [ 9.450876] <TASK>
> [ 9.451186] dump_stack_lvl+0x53/0x70
> [ 9.451644] print_report+0xce/0x660
> [ 9.452077] ? sdebug_blk_mq_poll+0x92/0x100
> [ 9.452639] ? bio_poll+0x26b/0x420
> [ 9.453077] kasan_report+0xc6/0x100
> [ 9.453537] ? bio_poll+0x26b/0x420
> [ 9.453955] bio_poll+0x26b/0x420
> [ 9.454374] ? task_mm_cid_work+0x33e/0x750
> [ 9.454879] iocb_bio_iopoll+0x47/0x60
> [ 9.455355] io_do_iopoll+0x450/0x10a0
> [ 9.455814] ? _raw_spin_lock_irq+0x81/0xe0
> [ 9.456359] ? __pfx_io_do_iopoll+0x10/0x10
> [ 9.456866] ? mutex_lock+0x8c/0xe0
> [ 9.457317] ? __pfx_mutex_lock+0x10/0x10
> [ 9.457799] ? __pfx_mutex_unlock+0x10/0x10
> [ 9.458316] __do_sys_io_uring_enter+0x7b7/0x12e0
> [ 9.458866] ? __pfx___do_sys_io_uring_enter+0x10/0x10
> [ 9.459515] ? __pfx___rseq_handle_notify_resume+0x10/0x10
> [ 9.460202] ? handle_mm_fault+0x16f/0x400
> [ 9.460696] do_syscall_64+0xa6/0x1a0
> [ 9.461149] entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 9.461787] RIP: 0033:0x560572d148f8
> [ 9.462234] Code: 1c 01 00 00 48 8b 04 24 83 78 38 00 0f 85 0e 01 00 00 41 8b 3f 41 ba 01 00 00 00 45 31 c0 45 31 c9 b8 aa 01 00 00 89 ea 0f 05 <89> c6 85 c0 0f 89 ec 00 00 00 89 44 24 0c e8 55 87 fa ff 8b 74 24
> [ 9.464489] RSP: 002b:00007ffc5330a600 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
> [ 9.465400] RAX: ffffffffffffffda RBX: 00007f39cd9d9ac0 RCX: 0000560572d148f8
> [ 9.466254] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000006
> [ 9.467114] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
> [ 9.467962] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
> [ 9.468803] R13: 00007ffc5330a798 R14: 0000000000000001 R15: 0000560577589630
> [ 9.469672] </TASK>
> [ 9.469950]
> [ 9.470168] Allocated by task 441:
> [ 9.470577] kasan_save_stack+0x33/0x60
> [ 9.471033] kasan_save_track+0x14/0x30
> [ 9.471554] __kasan_slab_alloc+0x6e/0x70
> [ 9.472036] kmem_cache_alloc_noprof+0xe9/0x300
> [ 9.472599] mempool_alloc_noprof+0x11a/0x2e0
> [ 9.473161] bio_alloc_bioset+0x1ab/0x780
> [ 9.473634] blkdev_direct_IO+0x456/0x2130
> [ 9.474130] blkdev_write_iter+0x54f/0xb90
> [ 9.474647] io_write+0x3b3/0xfe0
> [ 9.475053] io_issue_sqe+0x131/0x13e0
> [ 9.475516] io_submit_sqes+0x6f6/0x21e0
> [ 9.475995] __do_sys_io_uring_enter+0xa1e/0x12e0
> [ 9.476602] do_syscall_64+0xa6/0x1a0
> [ 9.477043] entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 9.477659]
> [ 9.477848] Freed by task 441:
> [ 9.478261] kasan_save_stack+0x33/0x60
> [ 9.478715] kasan_save_track+0x14/0x30
> [ 9.479197] kasan_save_free_info+0x3b/0x60
> [ 9.479692] __kasan_slab_free+0x37/0x50
> [ 9.480191] slab_free_after_rcu_debug+0xb1/0x280
> [ 9.480755] rcu_core+0x610/0x1a80
> [ 9.481215] handle_softirqs+0x1b5/0x5c0
> [ 9.481696] irq_exit_rcu+0xaf/0xe0
> [ 9.482119] sysvec_apic_timer_interrupt+0x6c/0x80
> [ 9.482729] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 9.483389]
> [ 9.483581] Last potentially related work creation:
> [ 9.484174] kasan_save_stack+0x33/0x60
> [ 9.484661] __kasan_record_aux_stack+0x8e/0xa0
> [ 9.485228] kmem_cache_free+0x21c/0x370
> [ 9.485713] blk_update_request+0x22c/0x1070
> [ 9.486280] scsi_end_request+0x6b/0x5d0
> [ 9.486762] scsi_io_completion+0xa4/0xda0
> [ 9.487285] sdebug_blk_mq_poll_iter+0x189/0x2c0
> [ 9.487851] bt_tags_iter+0x15f/0x290
> [ 9.488310] __blk_mq_all_tag_iter+0x31d/0x960
> [ 9.488869] blk_mq_tagset_busy_iter+0xeb/0x140
> [ 9.489448] sdebug_blk_mq_poll+0x92/0x100
> [ 9.489949] blk_hctx_poll+0x160/0x330
> [ 9.490446] bio_poll+0x182/0x420
> [ 9.490853] iocb_bio_iopoll+0x47/0x60
> [ 9.491343] io_do_iopoll+0x450/0x10a0
> [ 9.491798] __do_sys_io_uring_enter+0x7b7/0x12e0
> [ 9.492398] do_syscall_64+0xa6/0x1a0
> [ 9.492852] entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 9.493484]
> [ 9.493676] The buggy address belongs to the object at ff1100014c9b4640
> [ 9.493676] which belongs to the cache bio-248 of size 248
> [ 9.495118] The buggy address is located 116 bytes inside of
> [ 9.495118] freed 248-byte region [ff1100014c9b4640, ff1100014c9b4738)
> [ 9.496597]
> [ 9.496784] The buggy address belongs to the physical page:
> [ 9.497465] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14c9b4
> [ 9.498464] head: order:2 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> [ 9.499421] flags: 0x200000000000040(head|node=0|zone=2)
> [ 9.500053] page_type: f5(slab)
> [ 9.500451] raw: 0200000000000040 ff110001052f8dc0 dead000000000122 0000000000000000
> [ 9.501386] raw: 0000000000000000 0000000080330033 00000001f5000000 0000000000000000
> [ 9.502333] head: 0200000000000040 ff110001052f8dc0 dead000000000122 0000000000000000
> [ 9.503261] head: 0000000000000000 0000000080330033 00000001f5000000 0000000000000000
> [ 9.504213] head: 0200000000000002 ffd4000005326d01 ffffffffffffffff 0000000000000000
> [ 9.505142] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> [ 9.506082] page dumped because: kasan: bad access detected
> [ 9.506752]
> [ 9.506939] Memory state around the buggy address:
> [ 9.507560] ff1100014c9b4580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc
> [ 9.508454] ff1100014c9b4600: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
> [ 9.509365] >ff1100014c9b4680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 9.510260]                                      ^
> [ 9.510842] ff1100014c9b4700: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
> [ 9.511755] ff1100014c9b4780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 9.512654] ==================================================================
> [ 9.513616] Disabling lock debugging due to kernel taint
> QEMU: Terminated
>
> Now scsi_debug does something pretty unorthodox in its mq_poll callback,
> in that it calls blk_mq_tagset_busy_iter() ... -> scsi_done(), roughly as
> sketched below.
>
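> To sketch that pattern (a simplified, hand-written illustration - not
> the actual scsi_debug code, and the *_sketch names are mine): the poll
> callback walks every busy tag in the tag set and completes commands
> inline, which is what puts scsi_done() on the poll path:
>
> #include <linux/blk-mq.h>
> #include <scsi/scsi_cmnd.h>
>
> static bool sdebug_poll_iter_sketch(struct request *rq, void *opaque)
> {
> 	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
> 	int *completed = opaque;
>
> 	/* complete the command inline, from poll context */
> 	scsi_done(cmd);
> 	(*completed)++;
> 	return true;
> }
>
> static int sdebug_poll_sketch(struct blk_mq_hw_ctx *hctx,
> 			      struct io_comp_batch *iob)
> {
> 	int completed = 0;
>
> 	/* iterate *all* busy tags in the set, not just this hctx */
> 	blk_mq_tagset_busy_iter(hctx->queue->tag_set,
> 				sdebug_poll_iter_sketch, &completed);
> 	return completed;
> }
>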
> However, for QEMU with NVMe I get this:
>
> fio-3.34
> Starting 10 processes
> [ 30.887296] ==================================================================
> [ 30.907820] BUG: KASAN: slab-use-after-free in bio_poll+0x26b/0x420
> [ 30.924793] Read of size 4 at addr ff1100015f775ab4 by task fio/458
> [ 30.949904]
> [ 30.952784] CPU: 11 UID: 0 PID: 458 Comm: fio Not tainted 6.13.0-rc4-00053-gc9c268957b58 #3391
> [ 31.036344] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 31.090860] Call Trace:
> [ 31.153928] <TASK>
> [ 31.180060] dump_stack_lvl+0x53/0x70
> [ 31.209414] print_report+0xce/0x660
> [ 31.220341] ? bio_poll+0x26b/0x420
> [ 31.236876] kasan_report+0xc6/0x100
> [ 31.253395] ? bio_poll+0x26b/0x420
> [ 31.283105] bio_poll+0x26b/0x420
> [ 31.304388] iocb_bio_iopoll+0x47/0x60
> [ 31.327575] io_do_iopoll+0x450/0x10a0
> [ 31.357706] ? __pfx_io_do_iopoll+0x10/0x10
> [ 31.381389] ? io_submit_sqes+0x6f6/0x21e0
> [ 31.397833] ? mutex_lock+0x8c/0xe0
> [ 31.436789] ? __pfx_mutex_lock+0x10/0x10
> [ 31.469967] __do_sys_io_uring_enter+0x7b7/0x12e0
> [ 31.506017] ? __pfx___do_sys_io_uring_enter+0x10/0x10
> [ 31.556819] ? __pfx___rseq_handle_notify_resume+0x10/0x10
> [ 31.599749] ? handle_mm_fault+0x16f/0x400
> [ 31.637617] ? up_read+0x1a/0xb0
> [ 31.658649] do_syscall_64+0xa6/0x1a0
> [ 31.715961] entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 31.738610] RIP: 0033:0x558b29f538f8
> [ 31.758298] Code: 1c 01 00 00 48 8b 04 24 83 78 38 00 0f 85 0e 01 00 00 41 8b 3f 41 ba 01 00 00 00 45 31 c0 45 31 c9 b8 aa 01 00 00 89 ea 0f 05 <89> c6 85 c0 0f 89 ec 00 00 00 89 44 24 0c e8 55 87 fa ff 8b 74 24
> [ 31.868980] RSP: 002b:00007ffd37d51490 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
> [ 31.946356] RAX: ffffffffffffffda RBX: 00007f120ebfeb40 RCX: 0000558b29f538f8
> [ 32.044833] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000006
> [ 32.086849] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
> [ 32.117522] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
> [ 32.155554] R13: 00007ffd37d51628 R14: 0000000000000001 R15: 0000558b3c3216b0
> [ 32.174488] </TASK>
> [ 32.183180]
> [ 32.193202] Allocated by task 458:
> [ 32.205642] kasan_save_stack+0x33/0x60
> [ 32.215908] kasan_save_track+0x14/0x30
> [ 32.231828] __kasan_slab_alloc+0x6e/0x70
> [ 32.244998] kmem_cache_alloc_noprof+0xe9/0x300
> [ 32.263654] mempool_alloc_noprof+0x11a/0x2e0
> [ 32.274050] bio_alloc_bioset+0x1ab/0x780
> [ 32.286829] blkdev_direct_IO+0x456/0x2130
> [ 32.293655] blkdev_write_iter+0x54f/0xb90
> [ 32.299844] io_write+0x3b3/0xfe0
> [ 32.309428] io_issue_sqe+0x131/0x13e0
> [ 32.315319] io_submit_sqes+0x6f6/0x21e0
> [ 32.320913] __do_sys_io_uring_enter+0xa1e/0x12e0
> [ 32.328091] do_syscall_64+0xa6/0x1a0
> [ 32.336915] entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 32.350460]
> [ 32.355097] Freed by task 455:
> [ 32.360331] kasan_save_stack+0x33/0x60
> [ 32.369595] kasan_save_track+0x14/0x30
> [ 32.377397] kasan_save_free_info+0x3b/0x60
> [ 32.386598] __kasan_slab_free+0x37/0x50
> [ 32.398562] slab_free_after_rcu_debug+0xb1/0x280
> [ 32.417108] rcu_core+0x610/0x1a80
> [ 32.424947] handle_softirqs+0x1b5/0x5c0
> [ 32.434754] irq_exit_rcu+0xaf/0xe0
> [ 32.438144] sysvec_apic_timer_interrupt+0x6c/0x80
> [ 32.443842] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 32.448109]
> [ 32.449772] Last potentially related work creation:
> [ 32.454800] kasan_save_stack+0x33/0x60
> [ 32.458743] __kasan_record_aux_stack+0x8e/0xa0
> [ 32.463802] kmem_cache_free+0x21c/0x370
> [ 32.468130] blk_mq_end_request_batch+0x26b/0x13f0
> [ 32.473935] io_do_iopoll+0xa78/0x10a0
> [ 32.477800] __do_sys_io_uring_enter+0x7b7/0x12e0
> [ 32.482678] do_syscall_64+0xa6/0x1a0
> [ 32.487671] entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 32.492551]
> [ 32.494058] The buggy address belongs to the object at ff1100015f775a40
> [ 32.494058] which belongs to the cache bio-248 of size 248
> [ 32.504485] The buggy address is located 116 bytes inside of
> [ 32.504485] freed 248-byte region [ff1100015f775a40, ff1100015f775b38)
> [ 32.518309]
> [ 32.520370] The buggy address belongs to the physical page:
> [ 32.526444] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x15f774
> [ 32.535554] head: order:2 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> [ 32.542517] flags: 0x200000000000040(head|node=0|zone=2)
> [ 32.547971] page_type: f5(slab)
> [ 32.551287] raw: 0200000000000040 ff1100010376af80 dead000000000122 0000000000000000
> [ 32.559290] raw: 0000000000000000 0000000000330033 00000001f5000000 0000000000000000
> [ 32.566773] head: 0200000000000040 ff1100010376af80 dead000000000122 0000000000000000
> [ 32.574046] head: 0000000000000000 0000000000330033 00000001f5000000 0000000000000000
> [ 32.581715] head: 0200000000000002 ffd40000057ddd01 ffffffffffffffff 0000000000000000
> [ 32.589588] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> [ 32.596963] page dumped because: kasan: bad access detected
> [ 32.603473]
> [ 32.604871] Memory state around the buggy address:
> [ 32.609617] ff1100015f775980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc
> [ 32.617652] ff1100015f775a00: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
> [ 32.625385] >ff1100015f775a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 32.634014]                                      ^
> [ 32.637444] ff1100015f775b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
> [ 32.644158] ff1100015f775b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 32.651115] ==================================================================
> [ 32.659002] Disabling lock debugging due to kernel taint
> QEMU: Terminated [W(10)][0.1%][w=150MiB/s][w=38.4k IOPS][eta 01h:24m:16s]
>
> Here's my git bisect log:
>
> git bisect start
> # good: [1cbfb828e05171ca2dd77b5988d068e6872480fe] Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux
> git bisect good 1cbfb828e05171ca2dd77b5988d068e6872480fe
> # bad: [a312e1706ce6c124f04ec85ddece240f3bb2a696] Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux
> git bisect bad a312e1706ce6c124f04ec85ddece240f3bb2a696
> # good: [3d8b5a22d40435b4a7e58f06ae2cd3506b222898] block: add support to pass user meta buffer
> git bisect good 3d8b5a22d40435b4a7e58f06ae2cd3506b222898
> # good: [ce9464081d5168ee0f279d6932ba82260a5b97c4] io_uring/msg_ring: Drop custom destructor
> git bisect good ce9464081d5168ee0f279d6932ba82260a5b97c4
> # bad: [d803d123948feffbd992213e144df224097f82b0] io_uring/rw: handle -EAGAIN retry at IO completion time
> git bisect bad d803d123948feffbd992213e144df224097f82b0
> # good: [c5f71916146033f9aba108075ff7087022075fd6] io_uring/rw: always clear ->bytes_done on io_async_rw setup
> git bisect good c5f71916146033f9aba108075ff7087022075fd6
> # good: [2a51c327d4a4a2eb62d67f4ea13a17efd0f25c5c] io_uring/rsrc: simplify the bvec iter count calculation
> git bisect good 2a51c327d4a4a2eb62d67f4ea13a17efd0f25c5c
> # good: [9ac273ae3dc296905b4d61e4c8e7a25592f6d183] io_uring/rw: use io_rw_recycle() from cleanup path
> git bisect good 9ac273ae3dc296905b4d61e4c8e7a25592f6d183
> # first bad commit: [d803d123948feffbd992213e144df224097f82b0] io_uring/rw: handle -EAGAIN retry at IO completion time
>
> Thanks,
> John
>
>> ---
>> io_uring/io_uring.c | 15 +++++++--
>> io_uring/rw.c | 80 ++++++++++++++-------------------------------
>> 2 files changed, 38 insertions(+), 57 deletions(-)
>>
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index db198bd435b5..92ba2fdcd087 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -115,7 +115,7 @@
>> REQ_F_ASYNC_DATA)
>> #define IO_REQ_CLEAN_SLOW_FLAGS (REQ_F_REFCOUNT | REQ_F_LINK | REQ_F_HARDLINK |\
>> -				IO_REQ_CLEAN_FLAGS)
>> +				REQ_F_REISSUE | IO_REQ_CLEAN_FLAGS)
>> #define IO_TCTX_REFS_CACHE_NR (1U << 10)
>> @@ -1403,6 +1403,12 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
>> 					    comp_list);
>> if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
>> + if (req->flags & REQ_F_REISSUE) {
>> + node = req->comp_list.next;
>> + req->flags &= ~REQ_F_REISSUE;
>> + io_queue_iowq(req);
>> + continue;
>> + }
>> if (req->flags & REQ_F_REFCOUNT) {
>> node = req->comp_list.next;
>> if (!req_ref_put_and_test(req))
>> @@ -1442,7 +1448,12 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
>> 	struct io_kiocb *req = container_of(node, struct io_kiocb,
>> 					    comp_list);
>> - if (!(req->flags & REQ_F_CQE_SKIP) &&
>> + /*
>> + * Requests marked with REQUEUE should not post a CQE, they
>> + * will go through the io-wq retry machinery and post one
>> + * later.
>> + */
>> + if (!(req->flags & (REQ_F_CQE_SKIP | REQ_F_REISSUE)) &&
>> unlikely(!io_fill_cqe_req(ctx, req))) {
>> if (ctx->lockless_cq) {
>> spin_lock(&ctx->completion_lock);
>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>> index afc669048c5d..c52c0515f0a2 100644
>> --- a/io_uring/rw.c
>> +++ b/io_uring/rw.c
>> @@ -202,7 +202,7 @@ static void io_req_rw_cleanup(struct io_kiocb *req, unsigned int issue_flags)
>> * mean that the underlying data can be gone at any time. But that
>> * should be fixed seperately, and then this check could be killed.
>> */
>> - if (!(req->flags & REQ_F_REFCOUNT)) {
>> + if (!(req->flags & (REQ_F_REISSUE | REQ_F_REFCOUNT))) {
>> req->flags &= ~REQ_F_NEED_CLEANUP;
>> io_rw_recycle(req, issue_flags);
>> }
>> @@ -455,19 +455,12 @@ static inline loff_t *io_kiocb_update_pos(struct io_kiocb *req)
>> return NULL;
>> }
>> -#ifdef CONFIG_BLOCK
>> -static void io_resubmit_prep(struct io_kiocb *req)
>> -{
>> - struct io_async_rw *io = req->async_data;
>> - struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>> -
>> - io_meta_restore(io, &rw->kiocb);
>> - iov_iter_restore(&io->iter, &io->iter_state);
>> -}
>> -
>> static bool io_rw_should_reissue(struct io_kiocb *req)
>> {
>> +#ifdef CONFIG_BLOCK
>> + struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>> umode_t mode = file_inode(req->file)->i_mode;
>> + struct io_async_rw *io = req->async_data;
>> struct io_ring_ctx *ctx = req->ctx;
>> if (!S_ISBLK(mode) && !S_ISREG(mode))
>> @@ -488,17 +481,14 @@ static bool io_rw_should_reissue(struct io_kiocb *req)
>> */
>> if (!same_thread_group(req->tctx->task, current) || !in_task())
>> return false;
>> +
>> + io_meta_restore(io, &rw->kiocb);
>> + iov_iter_restore(&io->iter, &io->iter_state);
>> return true;
>> -}
>> #else
>> -static void io_resubmit_prep(struct io_kiocb *req)
>> -{
>> -}
>> -static bool io_rw_should_reissue(struct io_kiocb *req)
>> -{
>> return false;
>> -}
>> #endif
>> +}
>> static void io_req_end_write(struct io_kiocb *req)
>> {
>> @@ -525,22 +515,16 @@ static void io_req_io_end(struct io_kiocb *req)
>> }
>> }
>> -static bool __io_complete_rw_common(struct io_kiocb *req, long res)
>> +static void __io_complete_rw_common(struct io_kiocb *req, long res)
>> {
>> - if (unlikely(res != req->cqe.res)) {
>> - if (res == -EAGAIN && io_rw_should_reissue(req)) {
>> - /*
>> - * Reissue will start accounting again, finish the
>> - * current cycle.
>> - */
>> - io_req_io_end(req);
>> - req->flags |= REQ_F_REISSUE | REQ_F_BL_NO_RECYCLE;
>> - return true;
>> - }
>> + if (res == req->cqe.res)
>> + return;
>> + if (res == -EAGAIN && io_rw_should_reissue(req)) {
>> + req->flags |= REQ_F_REISSUE | REQ_F_BL_NO_RECYCLE;
>> + } else {
>> req_set_fail(req);
>> req->cqe.res = res;
>> }
>> - return false;
>> }
>> static inline int io_fixup_rw_res(struct io_kiocb *req, long res)
>> @@ -583,8 +567,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
>> struct io_kiocb *req = cmd_to_io_kiocb(rw);
>> 	if (!kiocb->dio_complete || !(kiocb->ki_flags & IOCB_DIO_CALLER_COMP)) {
>> - if (__io_complete_rw_common(req, res))
>> - return;
>> + __io_complete_rw_common(req, res);
>> io_req_set_res(req, io_fixup_rw_res(req, res), 0);
>> }
>> req->io_task_work.func = io_req_rw_complete;
>> @@ -646,26 +629,19 @@ static int kiocb_done(struct io_kiocb *req, ssize_t ret,
>> if (ret >= 0 && req->flags & REQ_F_CUR_POS)
>> req->file->f_pos = rw->kiocb.ki_pos;
>> if (ret >= 0 && (rw->kiocb.ki_complete == io_complete_rw)) {
>> - if (!__io_complete_rw_common(req, ret)) {
>> - /*
>> - * Safe to call io_end from here as we're inline
>> - * from the submission path.
>> - */
>> - io_req_io_end(req);
>> - io_req_set_res(req, final_ret,
>> - io_put_kbuf(req, ret, issue_flags));
>> - io_req_rw_cleanup(req, issue_flags);
>> - return IOU_OK;
>> - }
>> + __io_complete_rw_common(req, ret);
>> + /*
>> + * Safe to call io_end from here as we're inline
>> + * from the submission path.
>> + */
>> + io_req_io_end(req);
>> +	io_req_set_res(req, final_ret, io_put_kbuf(req, ret, issue_flags));
>> + io_req_rw_cleanup(req, issue_flags);
>> + return IOU_OK;
>> } else {
>> io_rw_done(&rw->kiocb, ret);
>> }
>> - if (req->flags & REQ_F_REISSUE) {
>> - req->flags &= ~REQ_F_REISSUE;
>> - io_resubmit_prep(req);
>> - return -EAGAIN;
>> - }
>> return IOU_ISSUE_SKIP_COMPLETE;
>> }
>> @@ -944,8 +920,7 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
>> if (ret == -EOPNOTSUPP && force_nonblock)
>> ret = -EAGAIN;
>> - if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
>> - req->flags &= ~REQ_F_REISSUE;
>> + if (ret == -EAGAIN) {
>> /* If we can poll, just do that. */
>> if (io_file_can_poll(req))
>> return -EAGAIN;
>> @@ -1154,11 +1129,6 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
>> else
>> ret2 = -EINVAL;
>> - if (req->flags & REQ_F_REISSUE) {
>> - req->flags &= ~REQ_F_REISSUE;
>> - ret2 = -EAGAIN;
>> - }
>> -
>> /*
>> * Raw bdev writes will return -EOPNOTSUPP for IOCB_NOWAIT. Just
>> * retry them without IOCB_NOWAIT.
>
>