public inbox for [email protected]
From: John Garry <[email protected]>
To: Jens Axboe <[email protected]>, [email protected]
Cc: [email protected], [email protected]
Subject: Re: [PATCH 2/3] io_uring/rw: handle -EAGAIN retry at IO completion time
Date: Wed, 5 Mar 2025 16:57:50 +0000	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

On 04/03/2025 18:10, John Garry wrote:


> On 09/01/2025 18:15, Jens Axboe wrote:
>> Rather than try and have io_read/io_write turn REQ_F_REISSUE into
>> -EAGAIN, catch the REQ_F_REISSUE when the request is otherwise
>> considered as done. This is saner as we know this isn't happening
>> during an actual submission, and it removes the need to randomly
>> check REQ_F_REISSUE after read/write submission.
>>
>> If REQ_F_REISSUE is set, __io_submit_flush_completions() will skip over
>> this request in terms of posting a CQE, and the regular request
>> cleaning will ensure that it gets reissued via io-wq.
>>
>> Signed-off-by: Jens Axboe <[email protected]>
> 

Further info: I can easily recreate this on the latest block/io_uring-6.14 
branch on real NVMe HW:

[  215.439892] cloud-init[2675]: Cloud-init v. 24.4-0ubuntu1~22.04.1 
finished at Wed, 05 Mar 2025 16:48:44 +0000. Datasource 
DataSourceOracle.  Up 215.38 seconds
[  371.784849] 
==================================================================
[  371.871354] BUG: KASAN: slab-use-after-free in bio_poll+0x300/0x4c0
[  371.946393] Read of size 4 at addr ff11000170f75b74 by task fio/3372
[  372.022458]
[  372.040266] CPU: 7 UID: 0 PID: 3372 Comm: fio Not tainted 6.13.0+ #16
[  372.117374] Hardware name: Oracle Corporation ORACLE SERVER 
X9-2c/TLA,MB TRAY,X9-2c, BIOS 66040600 07/23/2021
[  372.236091] Call Trace:
[  372.265337]  <TASK>
[  372.290423]  dump_stack_lvl+0x76/0xa0
[  372.334246]  print_report+0xd1/0x630
[  372.377024]  ? bio_poll+0x300/0x4c0
[  372.418759]  ? kasan_complete_mode_report_info+0x6a/0x200
[  372.483382]  ? bio_poll+0x300/0x4c0
[  372.525110]  kasan_report+0xb8/0x100
[  372.567882]  ? bio_poll+0x300/0x4c0
[  372.609611]  __asan_report_load4_noabort+0x14/0x30
[  372.666946]  bio_poll+0x300/0x4c0
[  372.706597]  iocb_bio_iopoll+0x3c/0x70
[  372.751448]  io_uring_classic_poll+0xb0/0x190
[  372.803587]  io_do_iopoll+0x471/0x1080
[  372.848440]  ? __io_read+0x206/0x1050
[  372.892253]  ? fget+0x83/0x260
[  372.928786]  ? __pfx_io_do_iopoll+0x10/0x10
[  372.978840]  ? io_read+0x17/0x50
[  373.017454]  ? __kasan_check_write+0x14/0x30
[  373.068546]  ? mutex_lock+0x86/0xe0
[  373.110278]  ? __pfx_mutex_lock+0x10/0x10
[  373.158253]  __do_sys_io_uring_enter+0x7ca/0x14a0
[  373.214550]  ? __pfx___do_sys_io_uring_enter+0x10/0x10
[  373.276048]  ? __pfx___do_sys_io_uring_enter+0x10/0x10
[  373.337544]  ? __kasan_check_write+0x14/0x30
[  373.388637]  ? fput+0x1b/0x310
[  373.425168]  ? __do_sys_io_uring_enter+0x996/0x14a0
[  373.483546]  __x64_sys_io_uring_enter+0xe0/0x1b0
[  373.538800]  x64_sys_call+0x16a1/0x2540
[  373.584695]  do_syscall_64+0x7e/0x170
[  373.628511]  ? fpregs_assert_state_consistent+0x21/0xb0
[  373.691052]  ? syscall_exit_to_user_mode+0x4e/0x240
[  373.749435]  ? do_syscall_64+0x8a/0x170
[  373.795326]  ? __kasan_check_read+0x11/0x20
[  373.845378]  ? fpregs_assert_state_consistent+0x21/0xb0
[  373.907917]  ? syscall_exit_to_user_mode+0x4e/0x240
[  373.966292]  ? do_syscall_64+0x8a/0x170
[  374.012184]  ? __kasan_check_read+0x11/0x20
[  374.062842]  ? __kasan_check_read+0x11/0x20
[  374.113483]  ? fpregs_assert_state_consistent+0x21/0xb0
[  374.176597]  ? syscall_exit_to_user_mode+0x4e/0x240
[  374.235530]  ? do_syscall_64+0x8a/0x170
[  374.281951]  ? syscall_exit_to_user_mode+0x4e/0x240
[  374.340855]  ? do_syscall_64+0x8a/0x170
[  374.387264]  ? __kasan_check_read+0x11/0x20
[  374.437828]  ? fpregs_assert_state_consistent+0x21/0xb0
[  374.500868]  ? syscall_exit_to_user_mode+0x4e/0x240
[  374.559737]  ? clear_bhb_loop+0x15/0x70
[  374.606100]  ? clear_bhb_loop+0x15/0x70
[  374.652461]  ? clear_bhb_loop+0x15/0x70
[  374.698816]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  374.759739] RIP: 0033:0x71871b11e88d
[  374.802982] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 
48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48
[  375.028802] RSP: 002b:00007ffd4a4cf298 EFLAGS: 00000246 ORIG_RAX: 
00000000000001aa
[  375.119949] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 
000071871b11e88d
[  375.205904] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 
0000000000000006
[  375.291858] RBP: 0000718710f762c0 R08: 0000000000000000 R09: 
0000000000000000
[  375.377805] R10: 0000000000000001 R11: 0000000000000246 R12: 
0000000000000001
[  375.463754] R13: 0000000000000001 R14: 0000584669038c60 R15: 
0000000000000000
[  375.549701]  </TASK>
[  375.576358]
[  375.594687] Allocated by task 2986:
[  375.637027]
[  375.655357] Freed by task 3376:
[  375.693501]
[  375.711823] Last potentially related work creation:
[  375.770791]
[  375.789121] The buggy address belongs to the object at ff11000170f75b00
[  375.789121]  which belongs to the cache bio-264 of size 264
[  375.935951] The buggy address is located 116 bytes inside of
[  375.935951]  freed 264-byte region [ff11000170f75b00, ff11000170f75c08)
[  376.083858]
[  376.102233] The buggy address belongs to the physical page:
[  376.169531]
[  376.187916] Memory state around the buggy address:
[  376.245819]  ff11000170f75a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  376.332857]  ff11000170f75a80: 00 fc fc fc fc fc fc fc fc fc fc fc fc 
fc fc fc
[  376.419894] >ff11000170f75b00: fa fb fb fb fb fb fb fb fb fb fb fb fb 
fb fb fb
[  376.506945] 
    ^
[  376.589835]  ff11000170f75b80: fb fb fb fb fb fb fb fb fb fb fb fb fb 
fb fb fb
[  376.677180]  ff11000170f75c00: fb fc fc fc fc fc fc fc fc fc fc fc fc 
fc fc fc
[  376.764502] 
==================================================================

Note that I use hipri for fio, as below - without it, things seem OK.

fio --filename=/dev/nvme0n1p3 --direct=1 --rw=read --bs=4k --iodepth=100 \
    --name=iops --numjobs=20 --loops=1000 --ioengine=io_uring \
    --group_reporting --exitall_on_error --hipri
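
For repeated repro runs (e.g. while bisecting), a small helper to spot the 
KASAN signature from the splat above in a captured kernel log saves some 
eyeballing. This is just a convenience sketch; how the log is captured 
(e.g. `dmesg > kern.log`) is an assumption:

```shell
# Sketch: scan a captured kernel log for the KASAN use-after-free
# signature reported above. The signature string is taken verbatim
# from the splat; the log file is passed as the first argument.
check_kasan() {
    if grep -q 'KASAN: slab-use-after-free in bio_poll' "$1"; then
        echo "KASAN splat found in $1"
    else
        echo "no splat in $1"
    fi
}
```

Paired with the fio command above in a loop, something like this makes an 
unattended `git bisect run`-style check feasible.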

> JFYI, this patch causes or exposes an issue in scsi_debug where we get a 
> use-after-free:
> 
> Starting 10 processes
> [    9.445254] 
> ==================================================================
> [    9.446156] BUG: KASAN: slab-use-after-free in bio_poll+0x26b/0x420
> [    9.447188] Read of size 4 at addr ff1100014c9b46b4 by task fio/442
> [    9.447933]
> [    9.448121] CPU: 8 UID: 0 PID: 442 Comm: fio Not tainted 6.13.0- 
> rc4-00052-gfdf8fc8dce75 #3390
> [    9.449161] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [    9.450573] Call Trace:
> [    9.450876]  <TASK>
> [    9.451186]  dump_stack_lvl+0x53/0x70
> [    9.451644]  print_report+0xce/0x660
> [    9.452077]  ? sdebug_blk_mq_poll+0x92/0x100
> [    9.452639]  ? bio_poll+0x26b/0x420
> [    9.453077]  kasan_report+0xc6/0x100
> [    9.453537]  ? bio_poll+0x26b/0x420
> [    9.453955]  bio_poll+0x26b/0x420
> [    9.454374]  ? task_mm_cid_work+0x33e/0x750
> [    9.454879]  iocb_bio_iopoll+0x47/0x60
> [    9.455355]  io_do_iopoll+0x450/0x10a0
> [    9.455814]  ? _raw_spin_lock_irq+0x81/0xe0
> [    9.456359]  ? __pfx_io_do_iopoll+0x10/0x10
> [    9.456866]  ? mutex_lock+0x8c/0xe0
> [    9.457317]  ? __pfx_mutex_lock+0x10/0x10
> [    9.457799]  ? __pfx_mutex_unlock+0x10/0x10
> [    9.458316]  __do_sys_io_uring_enter+0x7b7/0x12e0
> [    9.458866]  ? __pfx___do_sys_io_uring_enter+0x10/0x10
> [    9.459515]  ? __pfx___rseq_handle_notify_resume+0x10/0x10
> [    9.460202]  ? handle_mm_fault+0x16f/0x400
> [    9.460696]  do_syscall_64+0xa6/0x1a0
> [    9.461149]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [    9.461787] RIP: 0033:0x560572d148f8
> [    9.462234] Code: 1c 01 00 00 48 8b 04 24 83 78 38 00 0f 85 0e 01 00 
> 00 41 8b 3f 41 ba 01 00 00 00 45 31 c0 45 31 c9 b8 aa 01 00 00 89 ea 0f 
> 05 <89> c6 85 c0 0f 89 ec 00 00 00 89 44 24 0c e8 55 87 fa ff 8b 74 24
> [    9.464489] RSP: 002b:00007ffc5330a600 EFLAGS: 00000246 ORIG_RAX: 
> 00000000000001aa
> [    9.465400] RAX: ffffffffffffffda RBX: 00007f39cd9d9ac0 RCX: 
> 0000560572d148f8
> [    9.466254] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 
> 0000000000000006
> [    9.467114] RBP: 0000000000000001 R08: 0000000000000000 R09: 
> 0000000000000000
> [    9.467962] R10: 0000000000000001 R11: 0000000000000246 R12: 
> 0000000000000000
> [    9.468803] R13: 00007ffc5330a798 R14: 0000000000000001 R15: 
> 0000560577589630
> [    9.469672]  </TASK>
> [    9.469950]
> [    9.470168] Allocated by task 441:
> [    9.470577]  kasan_save_stack+0x33/0x60
> [    9.471033]  kasan_save_track+0x14/0x30
> [    9.471554]  __kasan_slab_alloc+0x6e/0x70
> [    9.472036]  kmem_cache_alloc_noprof+0xe9/0x300
> [    9.472599]  mempool_alloc_noprof+0x11a/0x2e0
> [    9.473161]  bio_alloc_bioset+0x1ab/0x780
> [    9.473634]  blkdev_direct_IO+0x456/0x2130
> [    9.474130]  blkdev_write_iter+0x54f/0xb90
> [    9.474647]  io_write+0x3b3/0xfe0
> [    9.475053]  io_issue_sqe+0x131/0x13e0
> [    9.475516]  io_submit_sqes+0x6f6/0x21e0
> [    9.475995]  __do_sys_io_uring_enter+0xa1e/0x12e0
> [    9.476602]  do_syscall_64+0xa6/0x1a0
> [    9.477043]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [    9.477659]
> [    9.477848] Freed by task 441:
> [    9.478261]  kasan_save_stack+0x33/0x60
> [    9.478715]  kasan_save_track+0x14/0x30
> [    9.479197]  kasan_save_free_info+0x3b/0x60
> [    9.479692]  __kasan_slab_free+0x37/0x50
> [    9.480191]  slab_free_after_rcu_debug+0xb1/0x280
> [    9.480755]  rcu_core+0x610/0x1a80
> [    9.481215]  handle_softirqs+0x1b5/0x5c0
> [    9.481696]  irq_exit_rcu+0xaf/0xe0
> [    9.482119]  sysvec_apic_timer_interrupt+0x6c/0x80
> [    9.482729]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [    9.483389]
> [    9.483581] Last potentially related work creation:
> [    9.484174]  kasan_save_stack+0x33/0x60
> [    9.484661]  __kasan_record_aux_stack+0x8e/0xa0
> [    9.485228]  kmem_cache_free+0x21c/0x370
> [    9.485713]  blk_update_request+0x22c/0x1070
> [    9.486280]  scsi_end_request+0x6b/0x5d0
> [    9.486762]  scsi_io_completion+0xa4/0xda0
> [    9.487285]  sdebug_blk_mq_poll_iter+0x189/0x2c0
> [    9.487851]  bt_tags_iter+0x15f/0x290
> [    9.488310]  __blk_mq_all_tag_iter+0x31d/0x960
> [    9.488869]  blk_mq_tagset_busy_iter+0xeb/0x140
> [    9.489448]  sdebug_blk_mq_poll+0x92/0x100
> [    9.489949]  blk_hctx_poll+0x160/0x330
> [    9.490446]  bio_poll+0x182/0x420
> [    9.490853]  iocb_bio_iopoll+0x47/0x60
> [    9.491343]  io_do_iopoll+0x450/0x10a0
> [    9.491798]  __do_sys_io_uring_enter+0x7b7/0x12e0
> [    9.492398]  do_syscall_64+0xa6/0x1a0
> [    9.492852]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [    9.493484]
> [    9.493676] The buggy address belongs to the object at ff1100014c9b4640
> [    9.493676]  which belongs to the cache bio-248 of size 248
> [    9.495118] The buggy address is located 116 bytes inside of
> [    9.495118]  freed 248-byte region [ff1100014c9b4640, ff1100014c9b4738)
> [    9.496597]
> [    9.496784] The buggy address belongs to the physical page:
> [    9.497465] page: refcount:1 mapcount:0 mapping:0000000000000000 
> index:0x0 pfn:0x14c9b4
> [    9.498464] head: order:2 mapcount:0 entire_mapcount:0 
> nr_pages_mapped:0 pincount:0
> [    9.499421] flags: 0x200000000000040(head|node=0|zone=2)
> [    9.500053] page_type: f5(slab)
> [    9.500451] raw: 0200000000000040 ff110001052f8dc0 dead000000000122 
> 0000000000000000
> [    9.501386] raw: 0000000000000000 0000000080330033 00000001f5000000 
> 0000000000000000
> [    9.502333] head: 0200000000000040 ff110001052f8dc0 dead000000000122 
> 0000000000000000
> [    9.503261] head: 0000000000000000 0000000080330033 00000001f5000000 
> 0000000000000000
> [    9.504213] head: 0200000000000002 ffd4000005326d01 ffffffffffffffff 
> 0000000000000000
> [    9.505142] head: 0000000000000004 0000000000000000 00000000ffffffff 
> 0000000000000000
> [    9.506082] page dumped because: kasan: bad access detected
> [    9.506752]
> [    9.506939] Memory state around the buggy address:
> [    9.507560]  ff1100014c9b4580: 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 fc
> [    9.508454]  ff1100014c9b4600: fc fc fc fc fc fc fc fc fa fb fb fb fb 
> fb fb fb
> [    9.509365] >ff1100014c9b4680: fb fb fb fb fb fb fb fb fb fb fb fb fb 
> fb fb fb
> [    9.510260]                                      ^
> [    9.510842]  ff1100014c9b4700: fb fb fb fb fb fb fb fc fc fc fc fc fc 
> fc fc fc
> [    9.511755]  ff1100014c9b4780: 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00
> [    9.512654] 
> ==================================================================
> [    9.513616] Disabling lock debugging due to kernel taint
> QEMU: Terminated
> 
> Now scsi_debug does something pretty unorthodox in the mq_poll callback 
> in that it calls blk_mq_tagset_busy_iter() ... -> scsi_done().
> 
> However, for qemu with nvme I get this:
> 
> fio-3.34
> Starting 10 processes
> [   30.887296] 
> ==================================================================
> [   30.907820] BUG: KASAN: slab-use-after-free in bio_poll+0x26b/0x420
> [   30.924793] Read of size 4 at addr ff1100015f775ab4 by task fio/458
> [   30.949904]
> [   30.952784] CPU: 11 UID: 0 PID: 458 Comm: fio Not tainted 6.13.0- 
> rc4-00053-gc9c268957b58 #3391
> [   31.036344] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [   31.090860] Call Trace:
> [   31.153928]  <TASK>
> [   31.180060]  dump_stack_lvl+0x53/0x70
> [   31.209414]  print_report+0xce/0x660
> [   31.220341]  ? bio_poll+0x26b/0x420
> [   31.236876]  kasan_report+0xc6/0x100
> [   31.253395]  ? bio_poll+0x26b/0x420
> [   31.283105]  bio_poll+0x26b/0x420
> [   31.304388]  iocb_bio_iopoll+0x47/0x60
> [   31.327575]  io_do_iopoll+0x450/0x10a0
> [   31.357706]  ? __pfx_io_do_iopoll+0x10/0x10
> [   31.381389]  ? io_submit_sqes+0x6f6/0x21e0
> [   31.397833]  ? mutex_lock+0x8c/0xe0
> [   31.436789]  ? __pfx_mutex_lock+0x10/0x10
> [   31.469967]  __do_sys_io_uring_enter+0x7b7/0x12e0
> [   31.506017]  ? __pfx___do_sys_io_uring_enter+0x10/0x10
> [   31.556819]  ? __pfx___rseq_handle_notify_resume+0x10/0x10
> [   31.599749]  ? handle_mm_fault+0x16f/0x400
> [   31.637617]  ? up_read+0x1a/0xb0
> [   31.658649]  do_syscall_64+0xa6/0x1a0
> [   31.715961]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [   31.738610] RIP: 0033:0x558b29f538f8
> [   31.758298] Code: 1c 01 00 00 48 8b 04 24 83 78 38 00 0f 85 0e 01 00 
> 00 41 8b 3f 41 ba 01 00 00 00 45 31 c0 45 31 c9 b8 aa 01 00 00 89 ea 0f 
> 05 <89> c6 85 c0 0f 89 ec 00 00 00 89 44 24 0c e8 55 87 fa ff 8b 74 24
> [   31.868980] RSP: 002b:00007ffd37d51490 EFLAGS: 00000246 ORIG_RAX: 
> 00000000000001aa
> [   31.946356] RAX: ffffffffffffffda RBX: 00007f120ebfeb40 RCX: 
> 0000558b29f538f8
> [   32.044833] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 
> 0000000000000006
> [   32.086849] RBP: 0000000000000001 R08: 0000000000000000 R09: 
> 0000000000000000
> [   32.117522] R10: 0000000000000001 R11: 0000000000000246 R12: 
> 0000000000000000
> [   32.155554] R13: 00007ffd37d51628 R14: 0000000000000001 R15: 
> 0000558b3c3216b0
> [   32.174488]  </TASK>
> [   32.183180]
> [   32.193202] Allocated by task 458:
> [   32.205642]  kasan_save_stack+0x33/0x60
> [   32.215908]  kasan_save_track+0x14/0x30
> [   32.231828]  __kasan_slab_alloc+0x6e/0x70
> [   32.244998]  kmem_cache_alloc_noprof+0xe9/0x300
> [   32.263654]  mempool_alloc_noprof+0x11a/0x2e0
> [   32.274050]  bio_alloc_bioset+0x1ab/0x780
> [   32.286829]  blkdev_direct_IO+0x456/0x2130
> [   32.293655]  blkdev_write_iter+0x54f/0xb90
> [   32.299844]  io_write+0x3b3/0xfe0
> [   32.309428]  io_issue_sqe+0x131/0x13e0
> [   32.315319]  io_submit_sqes+0x6f6/0x21e0
> [   32.320913]  __do_sys_io_uring_enter+0xa1e/0x12e0
> [   32.328091]  do_syscall_64+0xa6/0x1a0
> [   32.336915]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [   32.350460]
> [   32.355097] Freed by task 455:
> [   32.360331]  kasan_save_stack+0x33/0x60
> [   32.369595]  kasan_save_track+0x14/0x30
> [   32.377397]  kasan_save_free_info+0x3b/0x60
> [   32.386598]  __kasan_slab_free+0x37/0x50
> [   32.398562]  slab_free_after_rcu_debug+0xb1/0x280
> [   32.417108]  rcu_core+0x610/0x1a80
> [   32.424947]  handle_softirqs+0x1b5/0x5c0
> [   32.434754]  irq_exit_rcu+0xaf/0xe0
> [   32.438144]  sysvec_apic_timer_interrupt+0x6c/0x80
> [   32.443842]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [   32.448109]
> [   32.449772] Last potentially related work creation:
> [   32.454800]  kasan_save_stack+0x33/0x60
> [   32.458743]  __kasan_record_aux_stack+0x8e/0xa0
> [   32.463802]  kmem_cache_free+0x21c/0x370
> [   32.468130]  blk_mq_end_request_batch+0x26b/0x13f0
> [   32.473935]  io_do_iopoll+0xa78/0x10a0
> [   32.477800]  __do_sys_io_uring_enter+0x7b7/0x12e0
> [   32.482678]  do_syscall_64+0xa6/0x1a0
> [   32.487671]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [   32.492551]
> [   32.494058] The buggy address belongs to the object at ff1100015f775a40
> [   32.494058]  which belongs to the cache bio-248 of size 248
> [   32.504485] The buggy address is located 116 bytes inside of
> [   32.504485]  freed 248-byte region [ff1100015f775a40, ff1100015f775b38)
> [   32.518309]
> [   32.520370] The buggy address belongs to the physical page:
> [   32.526444] page: refcount:1 mapcount:0 mapping:0000000000000000 
> index:0x0 pfn:0x15f774
> [   32.535554] head: order:2 mapcount:0 entire_mapcount:0 
> nr_pages_mapped:0 pincount:0
> [   32.542517] flags: 0x200000000000040(head|node=0|zone=2)
> [   32.547971] page_type: f5(slab)
> [   32.551287] raw: 0200000000000040 ff1100010376af80 dead000000000122 
> 0000000000000000
> [   32.559290] raw: 0000000000000000 0000000000330033 00000001f5000000 
> 0000000000000000
> [   32.566773] head: 0200000000000040 ff1100010376af80 dead000000000122 
> 0000000000000000
> [   32.574046] head: 0000000000000000 0000000000330033 00000001f5000000 
> 0000000000000000
> [   32.581715] head: 0200000000000002 ffd40000057ddd01 ffffffffffffffff 
> 0000000000000000
> [   32.589588] head: 0000000000000004 0000000000000000 00000000ffffffff 
> 0000000000000000
> [   32.596963] page dumped because: kasan: bad access detected
> [   32.603473]
> [   32.604871] Memory state around the buggy address:
> [   32.609617]  ff1100015f775980: 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 fc
> [   32.617652]  ff1100015f775a00: fc fc fc fc fc fc fc fc fa fb fb fb fb 
> fb fb fb
> [   32.625385] >ff1100015f775a80: fb fb fb fb fb fb fb fb fb fb fb fb fb 
> fb fb fb
> [   32.634014]                                      ^
> [   32.637444]  ff1100015f775b00: fb fb fb fb fb fb fb fc fc fc fc fc fc 
> fc fc fc
> [   32.644158]  ff1100015f775b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00
> [   32.651115] 
> ==================================================================
> [   32.659002] Disabling lock debugging due to kernel taint
> QEMU: Terminated [W(10)][0.1%][w=150MiB/s][w=38.4k IOPS][eta 01h:24m:16s]
> 
> Here's my git bisect log:
> 
> git bisect start
> # good: [1cbfb828e05171ca2dd77b5988d068e6872480fe] Merge tag
> 'for-6.14/block-20250118' of git://git.kernel.dk/linux
> git bisect good 1cbfb828e05171ca2dd77b5988d068e6872480fe
> # bad: [a312e1706ce6c124f04ec85ddece240f3bb2a696] Merge tag
> 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux
> git bisect bad a312e1706ce6c124f04ec85ddece240f3bb2a696
> # good: [3d8b5a22d40435b4a7e58f06ae2cd3506b222898] block: add support
> to pass user meta buffer
> git bisect good 3d8b5a22d40435b4a7e58f06ae2cd3506b222898
> # good: [ce9464081d5168ee0f279d6932ba82260a5b97c4] io_uring/msg_ring:
> Drop custom destructor
> git bisect good ce9464081d5168ee0f279d6932ba82260a5b97c4
> # bad: [d803d123948feffbd992213e144df224097f82b0] io_uring/rw: handle
> -EAGAIN retry at IO completion time
> git bisect bad d803d123948feffbd992213e144df224097f82b0
> # good: [c5f71916146033f9aba108075ff7087022075fd6] io_uring/rw: always
> clear ->bytes_done on io_async_rw setup
> git bisect good c5f71916146033f9aba108075ff7087022075fd6
> # good: [2a51c327d4a4a2eb62d67f4ea13a17efd0f25c5c] io_uring/rsrc:
> simplify the bvec iter count calculation
> git bisect good 2a51c327d4a4a2eb62d67f4ea13a17efd0f25c5c
> # good: [9ac273ae3dc296905b4d61e4c8e7a25592f6d183] io_uring/rw: use
> io_rw_recycle() from cleanup path
> git bisect good 9ac273ae3dc296905b4d61e4c8e7a25592f6d183
> # first bad commit: [d803d123948feffbd992213e144df224097f82b0]
> io_uring/rw: handle -EAGAIN retry at IO completion time
> 
> Thanks,
> John
> 
>> ---
>>   io_uring/io_uring.c | 15 +++++++--
>>   io_uring/rw.c       | 80 ++++++++++++++-------------------------------
>>   2 files changed, 38 insertions(+), 57 deletions(-)
>>
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index db198bd435b5..92ba2fdcd087 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -115,7 +115,7 @@
>>                   REQ_F_ASYNC_DATA)
>>   #define IO_REQ_CLEAN_SLOW_FLAGS (REQ_F_REFCOUNT | REQ_F_LINK | 
>> REQ_F_HARDLINK |\
>> -                 IO_REQ_CLEAN_FLAGS)
>> +                 REQ_F_REISSUE | IO_REQ_CLEAN_FLAGS)
>>   #define IO_TCTX_REFS_CACHE_NR    (1U << 10)
>> @@ -1403,6 +1403,12 @@ static void io_free_batch_list(struct 
>> io_ring_ctx *ctx,
>>                               comp_list);
>>           if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
>> +            if (req->flags & REQ_F_REISSUE) {
>> +                node = req->comp_list.next;
>> +                req->flags &= ~REQ_F_REISSUE;
>> +                io_queue_iowq(req);
>> +                continue;
>> +            }
>>               if (req->flags & REQ_F_REFCOUNT) {
>>                   node = req->comp_list.next;
>>                   if (!req_ref_put_and_test(req))
>> @@ -1442,7 +1448,12 @@ void __io_submit_flush_completions(struct 
>> io_ring_ctx *ctx)
>>           struct io_kiocb *req = container_of(node, struct io_kiocb,
>>                           comp_list);
>> -        if (!(req->flags & REQ_F_CQE_SKIP) &&
>> +        /*
>> +         * Requests marked with REQUEUE should not post a CQE, they
>> +         * will go through the io-wq retry machinery and post one
>> +         * later.
>> +         */
>> +        if (!(req->flags & (REQ_F_CQE_SKIP | REQ_F_REISSUE)) &&
>>               unlikely(!io_fill_cqe_req(ctx, req))) {
>>               if (ctx->lockless_cq) {
>>                   spin_lock(&ctx->completion_lock);
>> diff --git a/io_uring/rw.c b/io_uring/rw.c
>> index afc669048c5d..c52c0515f0a2 100644
>> --- a/io_uring/rw.c
>> +++ b/io_uring/rw.c
>> @@ -202,7 +202,7 @@ static void io_req_rw_cleanup(struct io_kiocb 
>> *req, unsigned int issue_flags)
>>        * mean that the underlying data can be gone at any time. But that
>>        * should be fixed seperately, and then this check could be killed.
>>        */
>> -    if (!(req->flags & REQ_F_REFCOUNT)) {
>> +    if (!(req->flags & (REQ_F_REISSUE | REQ_F_REFCOUNT))) {
>>           req->flags &= ~REQ_F_NEED_CLEANUP;
>>           io_rw_recycle(req, issue_flags);
>>       }
>> @@ -455,19 +455,12 @@ static inline loff_t *io_kiocb_update_pos(struct 
>> io_kiocb *req)
>>       return NULL;
>>   }
>> -#ifdef CONFIG_BLOCK
>> -static void io_resubmit_prep(struct io_kiocb *req)
>> -{
>> -    struct io_async_rw *io = req->async_data;
>> -    struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>> -
>> -    io_meta_restore(io, &rw->kiocb);
>> -    iov_iter_restore(&io->iter, &io->iter_state);
>> -}
>> -
>>   static bool io_rw_should_reissue(struct io_kiocb *req)
>>   {
>> +#ifdef CONFIG_BLOCK
>> +    struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
>>       umode_t mode = file_inode(req->file)->i_mode;
>> +    struct io_async_rw *io = req->async_data;
>>       struct io_ring_ctx *ctx = req->ctx;
>>       if (!S_ISBLK(mode) && !S_ISREG(mode))
>> @@ -488,17 +481,14 @@ static bool io_rw_should_reissue(struct io_kiocb 
>> *req)
>>        */
>>       if (!same_thread_group(req->tctx->task, current) || !in_task())
>>           return false;
>> +
>> +    io_meta_restore(io, &rw->kiocb);
>> +    iov_iter_restore(&io->iter, &io->iter_state);
>>       return true;
>> -}
>>   #else
>> -static void io_resubmit_prep(struct io_kiocb *req)
>> -{
>> -}
>> -static bool io_rw_should_reissue(struct io_kiocb *req)
>> -{
>>       return false;
>> -}
>>   #endif
>> +}
>>   static void io_req_end_write(struct io_kiocb *req)
>>   {
>> @@ -525,22 +515,16 @@ static void io_req_io_end(struct io_kiocb *req)
>>       }
>>   }
>> -static bool __io_complete_rw_common(struct io_kiocb *req, long res)
>> +static void __io_complete_rw_common(struct io_kiocb *req, long res)
>>   {
>> -    if (unlikely(res != req->cqe.res)) {
>> -        if (res == -EAGAIN && io_rw_should_reissue(req)) {
>> -            /*
>> -             * Reissue will start accounting again, finish the
>> -             * current cycle.
>> -             */
>> -            io_req_io_end(req);
>> -            req->flags |= REQ_F_REISSUE | REQ_F_BL_NO_RECYCLE;
>> -            return true;
>> -        }
>> +    if (res == req->cqe.res)
>> +        return;
>> +    if (res == -EAGAIN && io_rw_should_reissue(req)) {
>> +        req->flags |= REQ_F_REISSUE | REQ_F_BL_NO_RECYCLE;
>> +    } else {
>>           req_set_fail(req);
>>           req->cqe.res = res;
>>       }
>> -    return false;
>>   }
>>   static inline int io_fixup_rw_res(struct io_kiocb *req, long res)
>> @@ -583,8 +567,7 @@ static void io_complete_rw(struct kiocb *kiocb, 
>> long res)
>>       struct io_kiocb *req = cmd_to_io_kiocb(rw);
>>       if (!kiocb->dio_complete || !(kiocb->ki_flags & 
>> IOCB_DIO_CALLER_COMP)) {
>> -        if (__io_complete_rw_common(req, res))
>> -            return;
>> +        __io_complete_rw_common(req, res);
>>           io_req_set_res(req, io_fixup_rw_res(req, res), 0);
>>       }
>>       req->io_task_work.func = io_req_rw_complete;
>> @@ -646,26 +629,19 @@ static int kiocb_done(struct io_kiocb *req, 
>> ssize_t ret,
>>       if (ret >= 0 && req->flags & REQ_F_CUR_POS)
>>           req->file->f_pos = rw->kiocb.ki_pos;
>>       if (ret >= 0 && (rw->kiocb.ki_complete == io_complete_rw)) {
>> -        if (!__io_complete_rw_common(req, ret)) {
>> -            /*
>> -             * Safe to call io_end from here as we're inline
>> -             * from the submission path.
>> -             */
>> -            io_req_io_end(req);
>> -            io_req_set_res(req, final_ret,
>> -                       io_put_kbuf(req, ret, issue_flags));
>> -            io_req_rw_cleanup(req, issue_flags);
>> -            return IOU_OK;
>> -        }
>> +        __io_complete_rw_common(req, ret);
>> +        /*
>> +         * Safe to call io_end from here as we're inline
>> +         * from the submission path.
>> +         */
>> +        io_req_io_end(req);
>> +        io_req_set_res(req, final_ret, io_put_kbuf(req, ret, 
>> issue_flags));
>> +        io_req_rw_cleanup(req, issue_flags);
>> +        return IOU_OK;
>>       } else {
>>           io_rw_done(&rw->kiocb, ret);
>>       }
>> -    if (req->flags & REQ_F_REISSUE) {
>> -        req->flags &= ~REQ_F_REISSUE;
>> -        io_resubmit_prep(req);
>> -        return -EAGAIN;
>> -    }
>>       return IOU_ISSUE_SKIP_COMPLETE;
>>   }
>> @@ -944,8 +920,7 @@ static int __io_read(struct io_kiocb *req, 
>> unsigned int issue_flags)
>>       if (ret == -EOPNOTSUPP && force_nonblock)
>>           ret = -EAGAIN;
>> -    if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
>> -        req->flags &= ~REQ_F_REISSUE;
>> +    if (ret == -EAGAIN) {
>>           /* If we can poll, just do that. */
>>           if (io_file_can_poll(req))
>>               return -EAGAIN;
>> @@ -1154,11 +1129,6 @@ int io_write(struct io_kiocb *req, unsigned int 
>> issue_flags)
>>       else
>>           ret2 = -EINVAL;
>> -    if (req->flags & REQ_F_REISSUE) {
>> -        req->flags &= ~REQ_F_REISSUE;
>> -        ret2 = -EAGAIN;
>> -    }
>> -
>>       /*
>>        * Raw bdev writes will return -EOPNOTSUPP for IOCB_NOWAIT. Just
>>        * retry them without IOCB_NOWAIT.
> 
> 


Thread overview: 8+ messages
2025-01-09 18:15 [PATCHSET for-next 0/3] Fix read/write -EAGAIN failure cases Jens Axboe
2025-01-09 18:15 ` [PATCH 1/3] io_uring/rw: use io_rw_recycle() from cleanup path Jens Axboe
2025-01-09 18:15 ` [PATCH 2/3] io_uring/rw: handle -EAGAIN retry at IO completion time Jens Axboe
2025-03-04 18:10   ` John Garry
2025-03-05 16:57     ` John Garry [this message]
2025-03-05 17:03       ` Jens Axboe
2025-03-05 21:03         ` Jens Axboe
2025-01-09 18:15 ` [PATCH 3/3] io_uring/rw: don't gate retry on completion context Jens Axboe
