* [PATCHSET for-next 0/3] Fix read/write -EAGAIN failure cases
@ 2025-01-09 18:15 Jens Axboe
  2025-01-09 18:15 ` [PATCH 1/3] io_uring/rw: use io_rw_recycle() from cleanup path Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Jens Axboe @ 2025-01-09 18:15 UTC (permalink / raw)
  To: io-uring

Hi,

After I had originally ensured that all read/write requests were fully
retryable (with the 6.10 "always allocate async data" changes), there
were an attempt or two at getting rid of these -EAGAINs bubbling back
to userspace. See:

commit 039a2e800bcd5beb89909d1a488abf3d647642cf
Author: Jens Axboe <[email protected]>
Date:   Thu Apr 25 09:04:32 2024 -0600

    io_uring/rw: reinstate thread check for retries

and the other two commits it references. Here's another attempt, which
simply moves this kind of checking to completion time. This solves the
issue of REQ_F_REISSUE being missed in case it gets set outside of
submission context, and generally cleans up the scattered REQ_F_REISSUE
checking that otherwise needs to happen in the read/write issue path.

Patch 1 is just a cleanup, patch 2 makes the actual change, and patch 3
finally removes the thread group checking in the "can I resubmit?"
checks.

 io_uring/io_uring.c | 15 +++++++-
 io_uring/rw.c       | 89 ++++++++++++++------------------------------
 2 files changed, 40 insertions(+), 64 deletions(-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 8+ messages in thread
* [PATCH 1/3] io_uring/rw: use io_rw_recycle() from cleanup path
  2025-01-09 18:15 [PATCHSET for-next 0/3] Fix read/write -EAGAIN failure cases Jens Axboe
@ 2025-01-09 18:15 ` Jens Axboe
  2025-01-09 18:15 ` [PATCH 2/3] io_uring/rw: handle -EAGAIN retry at IO completion time Jens Axboe
  2025-01-09 18:15 ` [PATCH 3/3] io_uring/rw: don't gate retry on completion context Jens Axboe
  2 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2025-01-09 18:15 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Cleanup should always have the uring lock held; it's safe to recycle
from here.

Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/rw.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/io_uring/rw.c b/io_uring/rw.c
index ca1b19d3d142..afc669048c5d 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -434,7 +434,8 @@ int io_read_mshot_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 void io_readv_writev_cleanup(struct io_kiocb *req)
 {
-	io_rw_iovec_free(req->async_data);
+	lockdep_assert_held(&req->ctx->uring_lock);
+	io_rw_recycle(req, 0);
 }
 
 static inline loff_t *io_kiocb_update_pos(struct io_kiocb *req)
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread
* [PATCH 2/3] io_uring/rw: handle -EAGAIN retry at IO completion time
  2025-01-09 18:15 [PATCHSET for-next 0/3] Fix read/write -EAGAIN failure cases Jens Axboe
  2025-01-09 18:15 ` [PATCH 1/3] io_uring/rw: use io_rw_recycle() from cleanup path Jens Axboe
@ 2025-01-09 18:15 ` Jens Axboe
  2025-03-04 18:10   ` John Garry
  2025-01-09 18:15 ` [PATCH 3/3] io_uring/rw: don't gate retry on completion context Jens Axboe
  2 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2025-01-09 18:15 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Rather than try and have io_read/io_write turn REQ_F_REISSUE into
-EAGAIN, catch the REQ_F_REISSUE when the request is otherwise
considered as done. This is saner as we know this isn't happening
during an actual submission, and it removes the need to randomly
check REQ_F_REISSUE after read/write submission.

If REQ_F_REISSUE is set, __io_submit_flush_completions() will skip over
this request in terms of posting a CQE, and the regular request
cleaning will ensure that it gets reissued via io-wq.
Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/io_uring.c | 15 +++++++--
 io_uring/rw.c       | 80 ++++++++++++++-------------------------------
 2 files changed, 38 insertions(+), 57 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index db198bd435b5..92ba2fdcd087 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -115,7 +115,7 @@
 				 REQ_F_ASYNC_DATA)
 
 #define IO_REQ_CLEAN_SLOW_FLAGS (REQ_F_REFCOUNT | REQ_F_LINK | REQ_F_HARDLINK |\
-				 IO_REQ_CLEAN_FLAGS)
+				 REQ_F_REISSUE | IO_REQ_CLEAN_FLAGS)
 
 #define IO_TCTX_REFS_CACHE_NR	(1U << 10)
 
@@ -1403,6 +1403,12 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
 						    comp_list);
 
 		if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
+			if (req->flags & REQ_F_REISSUE) {
+				node = req->comp_list.next;
+				req->flags &= ~REQ_F_REISSUE;
+				io_queue_iowq(req);
+				continue;
+			}
 			if (req->flags & REQ_F_REFCOUNT) {
 				node = req->comp_list.next;
 				if (!req_ref_put_and_test(req))
@@ -1442,7 +1448,12 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 		struct io_kiocb *req = container_of(node, struct io_kiocb,
 						    comp_list);
 
-		if (!(req->flags & REQ_F_CQE_SKIP) &&
+		/*
+		 * Requests marked with REQUEUE should not post a CQE, they
+		 * will go through the io-wq retry machinery and post one
+		 * later.
+		 */
+		if (!(req->flags & (REQ_F_CQE_SKIP | REQ_F_REISSUE)) &&
 		    unlikely(!io_fill_cqe_req(ctx, req))) {
 			if (ctx->lockless_cq) {
 				spin_lock(&ctx->completion_lock);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index afc669048c5d..c52c0515f0a2 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -202,7 +202,7 @@ static void io_req_rw_cleanup(struct io_kiocb *req, unsigned int issue_flags)
	 * mean that the underlying data can be gone at any time. But that
	 * should be fixed seperately, and then this check could be killed.
	 */
-	if (!(req->flags & REQ_F_REFCOUNT)) {
+	if (!(req->flags & (REQ_F_REISSUE | REQ_F_REFCOUNT))) {
		req->flags &= ~REQ_F_NEED_CLEANUP;
		io_rw_recycle(req, issue_flags);
	}
@@ -455,19 +455,12 @@ static inline loff_t *io_kiocb_update_pos(struct io_kiocb *req)
	return NULL;
 }
 
-#ifdef CONFIG_BLOCK
-static void io_resubmit_prep(struct io_kiocb *req)
-{
-	struct io_async_rw *io = req->async_data;
-	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
-
-	io_meta_restore(io, &rw->kiocb);
-	iov_iter_restore(&io->iter, &io->iter_state);
-}
-
 static bool io_rw_should_reissue(struct io_kiocb *req)
 {
+#ifdef CONFIG_BLOCK
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
	umode_t mode = file_inode(req->file)->i_mode;
+	struct io_async_rw *io = req->async_data;
	struct io_ring_ctx *ctx = req->ctx;
 
	if (!S_ISBLK(mode) && !S_ISREG(mode))
@@ -488,17 +481,14 @@ static bool io_rw_should_reissue(struct io_kiocb *req)
	 */
	if (!same_thread_group(req->tctx->task, current) || !in_task())
		return false;
+
+	io_meta_restore(io, &rw->kiocb);
+	iov_iter_restore(&io->iter, &io->iter_state);
	return true;
-}
 #else
-static void io_resubmit_prep(struct io_kiocb *req)
-{
-}
-static bool io_rw_should_reissue(struct io_kiocb *req)
-{
	return false;
-}
 #endif
+}
 
 static void io_req_end_write(struct io_kiocb *req)
 {
@@ -525,22 +515,16 @@ static void io_req_io_end(struct io_kiocb *req)
	}
 }
 
-static bool __io_complete_rw_common(struct io_kiocb *req, long res)
+static void __io_complete_rw_common(struct io_kiocb *req, long res)
 {
-	if (unlikely(res != req->cqe.res)) {
-		if (res == -EAGAIN && io_rw_should_reissue(req)) {
-			/*
-			 * Reissue will start accounting again, finish the
-			 * current cycle.
-			 */
-			io_req_io_end(req);
-			req->flags |= REQ_F_REISSUE | REQ_F_BL_NO_RECYCLE;
-			return true;
-		}
+	if (res == req->cqe.res)
+		return;
+	if (res == -EAGAIN && io_rw_should_reissue(req)) {
+		req->flags |= REQ_F_REISSUE | REQ_F_BL_NO_RECYCLE;
+	} else {
		req_set_fail(req);
		req->cqe.res = res;
	}
-	return false;
 }
 
 static inline int io_fixup_rw_res(struct io_kiocb *req, long res)
@@ -583,8 +567,7 @@ static void io_complete_rw(struct kiocb *kiocb, long res)
	struct io_kiocb *req = cmd_to_io_kiocb(rw);
 
	if (!kiocb->dio_complete || !(kiocb->ki_flags & IOCB_DIO_CALLER_COMP)) {
-		if (__io_complete_rw_common(req, res))
-			return;
+		__io_complete_rw_common(req, res);
		io_req_set_res(req, io_fixup_rw_res(req, res), 0);
	}
	req->io_task_work.func = io_req_rw_complete;
@@ -646,26 +629,19 @@ static int kiocb_done(struct io_kiocb *req, ssize_t ret,
	if (ret >= 0 && req->flags & REQ_F_CUR_POS)
		req->file->f_pos = rw->kiocb.ki_pos;
	if (ret >= 0 && (rw->kiocb.ki_complete == io_complete_rw)) {
-		if (!__io_complete_rw_common(req, ret)) {
-			/*
-			 * Safe to call io_end from here as we're inline
-			 * from the submission path.
-			 */
-			io_req_io_end(req);
-			io_req_set_res(req, final_ret,
-				       io_put_kbuf(req, ret, issue_flags));
-			io_req_rw_cleanup(req, issue_flags);
-			return IOU_OK;
-		}
+		__io_complete_rw_common(req, ret);
+		/*
+		 * Safe to call io_end from here as we're inline
+		 * from the submission path.
+		 */
+		io_req_io_end(req);
+		io_req_set_res(req, final_ret, io_put_kbuf(req, ret, issue_flags));
+		io_req_rw_cleanup(req, issue_flags);
+		return IOU_OK;
	} else {
		io_rw_done(&rw->kiocb, ret);
	}
 
-	if (req->flags & REQ_F_REISSUE) {
-		req->flags &= ~REQ_F_REISSUE;
-		io_resubmit_prep(req);
-		return -EAGAIN;
-	}
	return IOU_ISSUE_SKIP_COMPLETE;
 }
 
@@ -944,8 +920,7 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
	if (ret == -EOPNOTSUPP && force_nonblock)
		ret = -EAGAIN;
 
-	if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
-		req->flags &= ~REQ_F_REISSUE;
+	if (ret == -EAGAIN) {
		/* If we can poll, just do that. */
		if (io_file_can_poll(req))
			return -EAGAIN;
@@ -1154,11 +1129,6 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
	else
		ret2 = -EINVAL;
 
-	if (req->flags & REQ_F_REISSUE) {
-		req->flags &= ~REQ_F_REISSUE;
-		ret2 = -EAGAIN;
-	}
-
	/*
	 * Raw bdev writes will return -EOPNOTSUPP for IOCB_NOWAIT. Just
	 * retry them without IOCB_NOWAIT.
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread
* Re: [PATCH 2/3] io_uring/rw: handle -EAGAIN retry at IO completion time
  2025-01-09 18:15 ` [PATCH 2/3] io_uring/rw: handle -EAGAIN retry at IO completion time Jens Axboe
@ 2025-03-04 18:10   ` John Garry
  2025-03-05 16:57     ` John Garry
  0 siblings, 1 reply; 8+ messages in thread
From: John Garry @ 2025-03-04 18:10 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-scsi

On 09/01/2025 18:15, Jens Axboe wrote:
> Rather than try and have io_read/io_write turn REQ_F_REISSUE into
> -EAGAIN, catch the REQ_F_REISSUE when the request is otherwise
> considered as done. This is saner as we know this isn't happening
> during an actual submission, and it removes the need to randomly
> check REQ_F_REISSUE after read/write submission.
> 
> If REQ_F_REISSUE is set, __io_submit_flush_completions() will skip over
> this request in terms of posting a CQE, and the regular request
> cleaning will ensure that it gets reissued via io-wq.
> 
> Signed-off-by: Jens Axboe <[email protected]>

JFYI, this patch causes or exposes an issue in scsi_debug where we get a
use-after-free:

Starting 10 processes
[    9.445254] ==================================================================
[    9.446156] BUG: KASAN: slab-use-after-free in bio_poll+0x26b/0x420
[    9.447188] Read of size 4 at addr ff1100014c9b46b4 by task fio/442
[    9.447933]
[    9.448121] CPU: 8 UID: 0 PID: 442 Comm: fio Not tainted 6.13.0-rc4-00052-gfdf8fc8dce75 #3390
[    9.449161] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[    9.450573] Call Trace:
[    9.450876]  <TASK>
[    9.451186]  dump_stack_lvl+0x53/0x70
[    9.451644]  print_report+0xce/0x660
[    9.452077]  ? sdebug_blk_mq_poll+0x92/0x100
[    9.452639]  ? bio_poll+0x26b/0x420
[    9.453077]  kasan_report+0xc6/0x100
[    9.453537]  ? bio_poll+0x26b/0x420
[    9.453955]  bio_poll+0x26b/0x420
[    9.454374]  ? task_mm_cid_work+0x33e/0x750
[    9.454879]  iocb_bio_iopoll+0x47/0x60
[    9.455355]  io_do_iopoll+0x450/0x10a0
[    9.455814]  ?
_raw_spin_lock_irq+0x81/0xe0 [ 9.456359] ? __pfx_io_do_iopoll+0x10/0x10 [ 9.456866] ? mutex_lock+0x8c/0xe0 [ 9.457317] ? __pfx_mutex_lock+0x10/0x10 [ 9.457799] ? __pfx_mutex_unlock+0x10/0x10 [ 9.458316] __do_sys_io_uring_enter+0x7b7/0x12e0 [ 9.458866] ? __pfx___do_sys_io_uring_enter+0x10/0x10 [ 9.459515] ? __pfx___rseq_handle_notify_resume+0x10/0x10 [ 9.460202] ? handle_mm_fault+0x16f/0x400 [ 9.460696] do_syscall_64+0xa6/0x1a0 [ 9.461149] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 9.461787] RIP: 0033:0x560572d148f8 [ 9.462234] Code: 1c 01 00 00 48 8b 04 24 83 78 38 00 0f 85 0e 01 00 00 41 8b 3f 41 ba 01 00 00 00 45 31 c0 45 31 c9 b8 aa 01 00 00 89 ea 0f 05 <89> c6 85 c0 0f 89 ec 00 00 00 89 44 24 0c e8 55 87 fa ff 8b 74 24 [ 9.464489] RSP: 002b:00007ffc5330a600 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa [ 9.465400] RAX: ffffffffffffffda RBX: 00007f39cd9d9ac0 RCX: 0000560572d148f8 [ 9.466254] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000006 [ 9.467114] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000 [ 9.467962] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000 [ 9.468803] R13: 00007ffc5330a798 R14: 0000000000000001 R15: 0000560577589630 [ 9.469672] </TASK> [ 9.469950] [ 9.470168] Allocated by task 441: [ 9.470577] kasan_save_stack+0x33/0x60 [ 9.471033] kasan_save_track+0x14/0x30 [ 9.471554] __kasan_slab_alloc+0x6e/0x70 [ 9.472036] kmem_cache_alloc_noprof+0xe9/0x300 [ 9.472599] mempool_alloc_noprof+0x11a/0x2e0 [ 9.473161] bio_alloc_bioset+0x1ab/0x780 [ 9.473634] blkdev_direct_IO+0x456/0x2130 [ 9.474130] blkdev_write_iter+0x54f/0xb90 [ 9.474647] io_write+0x3b3/0xfe0 [ 9.475053] io_issue_sqe+0x131/0x13e0 [ 9.475516] io_submit_sqes+0x6f6/0x21e0 [ 9.475995] __do_sys_io_uring_enter+0xa1e/0x12e0 [ 9.476602] do_syscall_64+0xa6/0x1a0 [ 9.477043] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 9.477659] [ 9.477848] Freed by task 441: [ 9.478261] kasan_save_stack+0x33/0x60 [ 9.478715] kasan_save_track+0x14/0x30 [ 9.479197] 
kasan_save_free_info+0x3b/0x60 [ 9.479692] __kasan_slab_free+0x37/0x50 [ 9.480191] slab_free_after_rcu_debug+0xb1/0x280 [ 9.480755] rcu_core+0x610/0x1a80 [ 9.481215] handle_softirqs+0x1b5/0x5c0 [ 9.481696] irq_exit_rcu+0xaf/0xe0 [ 9.482119] sysvec_apic_timer_interrupt+0x6c/0x80 [ 9.482729] asm_sysvec_apic_timer_interrupt+0x1a/0x20 [ 9.483389] [ 9.483581] Last potentially related work creation: [ 9.484174] kasan_save_stack+0x33/0x60 [ 9.484661] __kasan_record_aux_stack+0x8e/0xa0 [ 9.485228] kmem_cache_free+0x21c/0x370 [ 9.485713] blk_update_request+0x22c/0x1070 [ 9.486280] scsi_end_request+0x6b/0x5d0 [ 9.486762] scsi_io_completion+0xa4/0xda0 [ 9.487285] sdebug_blk_mq_poll_iter+0x189/0x2c0 [ 9.487851] bt_tags_iter+0x15f/0x290 [ 9.488310] __blk_mq_all_tag_iter+0x31d/0x960 [ 9.488869] blk_mq_tagset_busy_iter+0xeb/0x140 [ 9.489448] sdebug_blk_mq_poll+0x92/0x100 [ 9.489949] blk_hctx_poll+0x160/0x330 [ 9.490446] bio_poll+0x182/0x420 [ 9.490853] iocb_bio_iopoll+0x47/0x60 [ 9.491343] io_do_iopoll+0x450/0x10a0 [ 9.491798] __do_sys_io_uring_enter+0x7b7/0x12e0 [ 9.492398] do_syscall_64+0xa6/0x1a0 [ 9.492852] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 9.493484] [ 9.493676] The buggy address belongs to the object at ff1100014c9b4640 [ 9.493676] which belongs to the cache bio-248 of size 248 [ 9.495118] The buggy address is located 116 bytes inside of [ 9.495118] freed 248-byte region [ff1100014c9b4640, ff1100014c9b4738) [ 9.496597] [ 9.496784] The buggy address belongs to the physical page: [ 9.497465] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14c9b4 [ 9.498464] head: order:2 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 9.499421] flags: 0x200000000000040(head|node=0|zone=2) [ 9.500053] page_type: f5(slab) [ 9.500451] raw: 0200000000000040 ff110001052f8dc0 dead000000000122 0000000000000000 [ 9.501386] raw: 0000000000000000 0000000080330033 00000001f5000000 0000000000000000 [ 9.502333] head: 0200000000000040 ff110001052f8dc0 
dead000000000122 0000000000000000 [ 9.503261] head: 0000000000000000 0000000080330033 00000001f5000000 0000000000000000 [ 9.504213] head: 0200000000000002 ffd4000005326d01 ffffffffffffffff 0000000000000000 [ 9.505142] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000 [ 9.506082] page dumped because: kasan: bad access detected [ 9.506752] [ 9.506939] Memory state around the buggy address: [ 9.507560] ff1100014c9b4580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc [ 9.508454] ff1100014c9b4600: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb [ 9.509365] >ff1100014c9b4680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 9.510260] ^ [ 9.510842] ff1100014c9b4700: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc [ 9.511755] ff1100014c9b4780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 9.512654] ================================================================== [ 9.513616] Disabling lock debugging due to kernel taint QEMU: Terminated Now scsi_debug does something pretty unorthodox in the mq_poll callback in that it calls blk_mq_tagset_busy_iter() ... -> scsi_done(). However, for qemu with nvme I get this: fio-3.34 Starting 10 processes [ 30.887296] ================================================================== [ 30.907820] BUG: KASAN: slab-use-after-free in bio_poll+0x26b/0x420 [ 30.924793] Read of size 4 at addr ff1100015f775ab4 by task fio/458 [ 30.949904] [ 30.952784] CPU: 11 UID: 0 PID: 458 Comm: fio Not tainted 6.13.0-rc4-00053-gc9c268957b58 #3391 [ 31.036344] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 31.090860] Call Trace: [ 31.153928] <TASK> [ 31.180060] dump_stack_lvl+0x53/0x70 [ 31.209414] print_report+0xce/0x660 [ 31.220341] ? bio_poll+0x26b/0x420 [ 31.236876] kasan_report+0xc6/0x100 [ 31.253395] ? bio_poll+0x26b/0x420 [ 31.283105] bio_poll+0x26b/0x420 [ 31.304388] iocb_bio_iopoll+0x47/0x60 [ 31.327575] io_do_iopoll+0x450/0x10a0 [ 31.357706] ? 
__pfx_io_do_iopoll+0x10/0x10 [ 31.381389] ? io_submit_sqes+0x6f6/0x21e0 [ 31.397833] ? mutex_lock+0x8c/0xe0 [ 31.436789] ? __pfx_mutex_lock+0x10/0x10 [ 31.469967] __do_sys_io_uring_enter+0x7b7/0x12e0 [ 31.506017] ? __pfx___do_sys_io_uring_enter+0x10/0x10 [ 31.556819] ? __pfx___rseq_handle_notify_resume+0x10/0x10 [ 31.599749] ? handle_mm_fault+0x16f/0x400 [ 31.637617] ? up_read+0x1a/0xb0 [ 31.658649] do_syscall_64+0xa6/0x1a0 [ 31.715961] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 31.738610] RIP: 0033:0x558b29f538f8 [ 31.758298] Code: 1c 01 00 00 48 8b 04 24 83 78 38 00 0f 85 0e 01 00 00 41 8b 3f 41 ba 01 00 00 00 45 31 c0 45 31 c9 b8 aa 01 00 00 89 ea 0f 05 <89> c6 85 c0 0f 89 ec 00 00 00 89 44 24 0c e8 55 87 fa ff 8b 74 24 [ 31.868980] RSP: 002b:00007ffd37d51490 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa [ 31.946356] RAX: ffffffffffffffda RBX: 00007f120ebfeb40 RCX: 0000558b29f538f8 [ 32.044833] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000006 [ 32.086849] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000 [ 32.117522] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000 [ 32.155554] R13: 00007ffd37d51628 R14: 0000000000000001 R15: 0000558b3c3216b0 [ 32.174488] </TASK> [ 32.183180] [ 32.193202] Allocated by task 458: [ 32.205642] kasan_save_stack+0x33/0x60 [ 32.215908] kasan_save_track+0x14/0x30 [ 32.231828] __kasan_slab_alloc+0x6e/0x70 [ 32.244998] kmem_cache_alloc_noprof+0xe9/0x300 [ 32.263654] mempool_alloc_noprof+0x11a/0x2e0 [ 32.274050] bio_alloc_bioset+0x1ab/0x780 [ 32.286829] blkdev_direct_IO+0x456/0x2130 [ 32.293655] blkdev_write_iter+0x54f/0xb90 [ 32.299844] io_write+0x3b3/0xfe0 [ 32.309428] io_issue_sqe+0x131/0x13e0 [ 32.315319] io_submit_sqes+0x6f6/0x21e0 [ 32.320913] __do_sys_io_uring_enter+0xa1e/0x12e0 [ 32.328091] do_syscall_64+0xa6/0x1a0 [ 32.336915] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 32.350460] [ 32.355097] Freed by task 455: [ 32.360331] kasan_save_stack+0x33/0x60 [ 32.369595] 
kasan_save_track+0x14/0x30 [ 32.377397] kasan_save_free_info+0x3b/0x60 [ 32.386598] __kasan_slab_free+0x37/0x50 [ 32.398562] slab_free_after_rcu_debug+0xb1/0x280 [ 32.417108] rcu_core+0x610/0x1a80 [ 32.424947] handle_softirqs+0x1b5/0x5c0 [ 32.434754] irq_exit_rcu+0xaf/0xe0 [ 32.438144] sysvec_apic_timer_interrupt+0x6c/0x80 [ 32.443842] asm_sysvec_apic_timer_interrupt+0x1a/0x20 [ 32.448109] [ 32.449772] Last potentially related work creation: [ 32.454800] kasan_save_stack+0x33/0x60 [ 32.458743] __kasan_record_aux_stack+0x8e/0xa0 [ 32.463802] kmem_cache_free+0x21c/0x370 [ 32.468130] blk_mq_end_request_batch+0x26b/0x13f0 [ 32.473935] io_do_iopoll+0xa78/0x10a0 [ 32.477800] __do_sys_io_uring_enter+0x7b7/0x12e0 [ 32.482678] do_syscall_64+0xa6/0x1a0 [ 32.487671] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 32.492551] [ 32.494058] The buggy address belongs to the object at ff1100015f775a40 [ 32.494058] which belongs to the cache bio-248 of size 248 [ 32.504485] The buggy address is located 116 bytes inside of [ 32.504485] freed 248-byte region [ff1100015f775a40, ff1100015f775b38) [ 32.518309] [ 32.520370] The buggy address belongs to the physical page: [ 32.526444] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x15f774 [ 32.535554] head: order:2 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 32.542517] flags: 0x200000000000040(head|node=0|zone=2) [ 32.547971] page_type: f5(slab) [ 32.551287] raw: 0200000000000040 ff1100010376af80 dead000000000122 0000000000000000 [ 32.559290] raw: 0000000000000000 0000000000330033 00000001f5000000 0000000000000000 [ 32.566773] head: 0200000000000040 ff1100010376af80 dead000000000122 0000000000000000 [ 32.574046] head: 0000000000000000 0000000000330033 00000001f5000000 0000000000000000 [ 32.581715] head: 0200000000000002 ffd40000057ddd01 ffffffffffffffff 0000000000000000 [ 32.589588] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000 [ 32.596963] page dumped because: kasan: bad access 
detected [ 32.603473] [ 32.604871] Memory state around the buggy address: [ 32.609617] ff1100015f775980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc [ 32.617652] ff1100015f775a00: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb [ 32.625385] >ff1100015f775a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 32.634014] ^ [ 32.637444] ff1100015f775b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc [ 32.644158] ff1100015f775b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 32.651115] ================================================================== [ 32.659002] Disabling lock debugging due to kernel taint QEMU: Terminated [W(10)][0.1%][w=150MiB/s][w=38.4k IOPS][eta 01h:24m:16s] root@jgarry-ubuntu-bm5-instance-20230215-1843:/home/ubuntu/linux# Here's my git bisect log: git bisect start # good: [1cbfb828e05171ca2dd77b5988d068e6872480fe] Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux git bisect good 1cbfb828e05171ca2dd77b5988d068e6872480fe # bad: [a312e1706ce6c124f04ec85ddece240f3bb2a696] Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux git bisect bad a312e1706ce6c124f04ec85ddece240f3bb2a696 # good: [3d8b5a22d40435b4a7e58f06ae2cd3506b222898] block: add support to pass user meta buffer git bisect good 3d8b5a22d40435b4a7e58f06ae2cd3506b222898 # good: [ce9464081d5168ee0f279d6932ba82260a5b97c4] io_uring/msg_ring: Drop custom destructor git bisect good ce9464081d5168ee0f279d6932ba82260a5b97c4 # bad: [d803d123948feffbd992213e144df224097f82b0] io_uring/rw: handle -EAGAIN retry at IO completion time git bisect bad d803d123948feffbd992213e144df224097f82b0 # good: [c5f71916146033f9aba108075ff7087022075fd6] io_uring/rw: always clear ->bytes_done on io_async_rw setup git bisect good c5f71916146033f9aba108075ff7087022075fd6 # good: [2a51c327d4a4a2eb62d67f4ea13a17efd0f25c5c] io_uring/rsrc: simplify the bvec iter count calculation git bisect good 2a51c327d4a4a2eb62d67f4ea13a17efd0f25c5c # good: [9ac273ae3dc296905b4d61e4c8e7a25592f6d183] 
io_uring/rw: use io_rw_recycle() from cleanup path
git bisect good 9ac273ae3dc296905b4d61e4c8e7a25592f6d183
# first bad commit: [d803d123948feffbd992213e144df224097f82b0] io_uring/rw: handle -EAGAIN retry at IO completion time
john@localhost:~/mnt_sda4/john/kernel-dev2>

Thanks,
John

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH 2/3] io_uring/rw: handle -EAGAIN retry at IO completion time
  2025-03-04 18:10 ` John Garry
@ 2025-03-05 16:57   ` John Garry
  2025-03-05 17:03     ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: John Garry @ 2025-03-05 16:57 UTC (permalink / raw)
  To: Jens Axboe, io-uring; +Cc: linux-scsi, linux-nvme

On 04/03/2025 18:10, John Garry wrote:
> On 09/01/2025 18:15, Jens Axboe wrote:
>> Rather than try and have io_read/io_write turn REQ_F_REISSUE into
>> -EAGAIN, catch the REQ_F_REISSUE when the request is otherwise
>> considered as done. This is saner as we know this isn't happening
>> during an actual submission, and it removes the need to randomly
>> check REQ_F_REISSUE after read/write submission.
>>
>> If REQ_F_REISSUE is set, __io_submit_flush_completions() will skip over
>> this request in terms of posting a CQE, and the regular request
>> cleaning will ensure that it gets reissued via io-wq.
>>
>> Signed-off-by: Jens Axboe <[email protected]>

Further info, I can easily recreate this on latest block/io_uring-6.14
on real NVMe HW:

[  215.439892] cloud-init[2675]: Cloud-init v. 24.4-0ubuntu1~22.04.1 finished at Wed, 05 Mar 2025 16:48:44 +0000. Datasource DataSourceOracle. Up 215.38 seconds
[  371.784849] ==================================================================
[  371.871354] BUG: KASAN: slab-use-after-free in bio_poll+0x300/0x4c0
[  371.946393] Read of size 4 at addr ff11000170f75b74 by task fio/3372
[  372.022458]
[  372.040266] CPU: 7 UID: 0 PID: 3372 Comm: fio Not tainted 6.13.0+ #16
[  372.117374] Hardware name: Oracle Corporation ORACLE SERVER X9-2c/TLA,MB TRAY,X9-2c, BIOS 66040600 07/23/2021
[  372.236091] Call Trace:
[  372.265337]  <TASK>
[  372.290423]  dump_stack_lvl+0x76/0xa0
[  372.334246]  print_report+0xd1/0x630
[  372.377024]  ? bio_poll+0x300/0x4c0
[  372.418759]  ? kasan_complete_mode_report_info+0x6a/0x200
[  372.483382]  ? bio_poll+0x300/0x4c0
[  372.525110]  kasan_report+0xb8/0x100
[  372.567882]  ?
bio_poll+0x300/0x4c0 [ 372.609611] __asan_report_load4_noabort+0x14/0x30 [ 372.666946] bio_poll+0x300/0x4c0 [ 372.706597] iocb_bio_iopoll+0x3c/0x70 [ 372.751448] io_uring_classic_poll+0xb0/0x190 [ 372.803587] io_do_iopoll+0x471/0x1080 [ 372.848440] ? __io_read+0x206/0x1050 [ 372.892253] ? fget+0x83/0x260 [ 372.928786] ? __pfx_io_do_iopoll+0x10/0x10 [ 372.978840] ? io_read+0x17/0x50 [ 373.017454] ? __kasan_check_write+0x14/0x30 [ 373.068546] ? mutex_lock+0x86/0xe0 [ 373.110278] ? __pfx_mutex_lock+0x10/0x10 [ 373.158253] __do_sys_io_uring_enter+0x7ca/0x14a0 [ 373.214550] ? __pfx___do_sys_io_uring_enter+0x10/0x10 [ 373.276048] ? __pfx___do_sys_io_uring_enter+0x10/0x10 [ 373.337544] ? __kasan_check_write+0x14/0x30 [ 373.388637] ? fput+0x1b/0x310 [ 373.425168] ? __do_sys_io_uring_enter+0x996/0x14a0 [ 373.483546] __x64_sys_io_uring_enter+0xe0/0x1b0 [ 373.538800] x64_sys_call+0x16a1/0x2540 [ 373.584695] do_syscall_64+0x7e/0x170 [ 373.628511] ? fpregs_assert_state_consistent+0x21/0xb0 [ 373.691052] ? syscall_exit_to_user_mode+0x4e/0x240 [ 373.749435] ? do_syscall_64+0x8a/0x170 [ 373.795326] ? __kasan_check_read+0x11/0x20 [ 373.845378] ? fpregs_assert_state_consistent+0x21/0xb0 [ 373.907917] ? syscall_exit_to_user_mode+0x4e/0x240 [ 373.966292] ? do_syscall_64+0x8a/0x170 [ 374.012184] ? __kasan_check_read+0x11/0x20 [ 374.062842] ? __kasan_check_read+0x11/0x20 [ 374.113483] ? fpregs_assert_state_consistent+0x21/0xb0 [ 374.176597] ? syscall_exit_to_user_mode+0x4e/0x240 [ 374.235530] ? do_syscall_64+0x8a/0x170 [ 374.281951] ? syscall_exit_to_user_mode+0x4e/0x240 [ 374.340855] ? do_syscall_64+0x8a/0x170 [ 374.387264] ? __kasan_check_read+0x11/0x20 [ 374.437828] ? fpregs_assert_state_consistent+0x21/0xb0 [ 374.500868] ? syscall_exit_to_user_mode+0x4e/0x240 [ 374.559737] ? clear_bhb_loop+0x15/0x70 [ 374.606100] ? clear_bhb_loop+0x15/0x70 [ 374.652461] ? 
clear_bhb_loop+0x15/0x70 [ 374.698816] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 374.759739] RIP: 0033:0x71871b11e88d [ 374.802982] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48 [ 375.028802] RSP: 002b:00007ffd4a4cf298 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa [ 375.119949] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 000071871b11e88d [ 375.205904] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000006 [ 375.291858] RBP: 0000718710f762c0 R08: 0000000000000000 R09: 0000000000000000 [ 375.377805] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001 [ 375.463754] R13: 0000000000000001 R14: 0000584669038c60 R15: 0000000000000000 [ 375.549701] </TASK> [ 375.576358] [ 375.594687] Allocated by task 2986: [ 375.637027] [ 375.655357] Freed by task 3376: [ 375.693501] [ 375.711823] Last potentially related work creation: [ 375.770791] [ 375.789121] The buggy address belongs to the object at ff11000170f75b00 [ 375.789121] which belongs to the cache bio-264 of size 264 [ 375.935951] The buggy address is located 116 bytes inside of [ 375.935951] freed 264-byte region [ff11000170f75b00, ff11000170f75c08) [ 376.083858] [ 376.102233] The buggy address belongs to the physical page: [ 376.169531] [ 376.187916] Memory state around the buggy address: [ 376.245819] ff11000170f75a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 376.332857] ff11000170f75a80: 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 376.419894] >ff11000170f75b00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 376.506945] ^ [ 376.589835] ff11000170f75b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 376.677180] ff11000170f75c00: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 376.764502] ================================================================== Note that I use hipri for fio, as below - without seems ok. 
fio --filename=/dev/nvme0n1p3 --direct=1 --rw=read --bs=4k --iodepth=100 --name=iops --numjobs=20 --loops=1000 --ioengine=io_uring --group_reporting --exitall_on_error --hipri > JFYI, this patch causes or exposes an issue in scsi_debug where we get a > use-after-free: > > Starting 10 processes > [ 9.445254] > ================================================================== > [ 9.446156] BUG: KASAN: slab-use-after-free in bio_poll+0x26b/0x420 > [ 9.447188] Read of size 4 at addr ff1100014c9b46b4 by task fio/442 > [ 9.447933] > [ 9.448121] CPU: 8 UID: 0 PID: 442 Comm: fio Not tainted 6.13.0- > rc4-00052-gfdf8fc8dce75 #3390 > [ 9.449161] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 > [ 9.450573] Call Trace: > [ 9.450876] <TASK> > [ 9.451186] dump_stack_lvl+0x53/0x70 > [ 9.451644] print_report+0xce/0x660 > [ 9.452077] ? sdebug_blk_mq_poll+0x92/0x100 > [ 9.452639] ? bio_poll+0x26b/0x420 > [ 9.453077] kasan_report+0xc6/0x100 > [ 9.453537] ? bio_poll+0x26b/0x420 > [ 9.453955] bio_poll+0x26b/0x420 > [ 9.454374] ? task_mm_cid_work+0x33e/0x750 > [ 9.454879] iocb_bio_iopoll+0x47/0x60 > [ 9.455355] io_do_iopoll+0x450/0x10a0 > [ 9.455814] ? _raw_spin_lock_irq+0x81/0xe0 > [ 9.456359] ? __pfx_io_do_iopoll+0x10/0x10 > [ 9.456866] ? mutex_lock+0x8c/0xe0 > [ 9.457317] ? __pfx_mutex_lock+0x10/0x10 > [ 9.457799] ? __pfx_mutex_unlock+0x10/0x10 > [ 9.458316] __do_sys_io_uring_enter+0x7b7/0x12e0 > [ 9.458866] ? __pfx___do_sys_io_uring_enter+0x10/0x10 > [ 9.459515] ? __pfx___rseq_handle_notify_resume+0x10/0x10 > [ 9.460202] ? 
handle_mm_fault+0x16f/0x400 > [ 9.460696] do_syscall_64+0xa6/0x1a0 > [ 9.461149] entry_SYSCALL_64_after_hwframe+0x77/0x7f > [ 9.461787] RIP: 0033:0x560572d148f8 > [ 9.462234] Code: 1c 01 00 00 48 8b 04 24 83 78 38 00 0f 85 0e 01 00 > 00 41 8b 3f 41 ba 01 00 00 00 45 31 c0 45 31 c9 b8 aa 01 00 00 89 ea 0f > 05 <89> c6 85 c0 0f 89 ec 00 00 00 89 44 24 0c e8 55 87 fa ff 8b 74 24 > [ 9.464489] RSP: 002b:00007ffc5330a600 EFLAGS: 00000246 ORIG_RAX: > 00000000000001aa > [ 9.465400] RAX: ffffffffffffffda RBX: 00007f39cd9d9ac0 RCX: > 0000560572d148f8 > [ 9.466254] RDX: 0000000000000001 RSI: 0000000000000000 RDI: > 0000000000000006 > [ 9.467114] RBP: 0000000000000001 R08: 0000000000000000 R09: > 0000000000000000 > [ 9.467962] R10: 0000000000000001 R11: 0000000000000246 R12: > 0000000000000000 > [ 9.468803] R13: 00007ffc5330a798 R14: 0000000000000001 R15: > 0000560577589630 > [ 9.469672] </TASK> > [ 9.469950] > [ 9.470168] Allocated by task 441: > [ 9.470577] kasan_save_stack+0x33/0x60 > [ 9.471033] kasan_save_track+0x14/0x30 > [ 9.471554] __kasan_slab_alloc+0x6e/0x70 > [ 9.472036] kmem_cache_alloc_noprof+0xe9/0x300 > [ 9.472599] mempool_alloc_noprof+0x11a/0x2e0 > [ 9.473161] bio_alloc_bioset+0x1ab/0x780 > [ 9.473634] blkdev_direct_IO+0x456/0x2130 > [ 9.474130] blkdev_write_iter+0x54f/0xb90 > [ 9.474647] io_write+0x3b3/0xfe0 > [ 9.475053] io_issue_sqe+0x131/0x13e0 > [ 9.475516] io_submit_sqes+0x6f6/0x21e0 > [ 9.475995] __do_sys_io_uring_enter+0xa1e/0x12e0 > [ 9.476602] do_syscall_64+0xa6/0x1a0 > [ 9.477043] entry_SYSCALL_64_after_hwframe+0x77/0x7f > [ 9.477659] > [ 9.477848] Freed by task 441: > [ 9.478261] kasan_save_stack+0x33/0x60 > [ 9.478715] kasan_save_track+0x14/0x30 > [ 9.479197] kasan_save_free_info+0x3b/0x60 > [ 9.479692] __kasan_slab_free+0x37/0x50 > [ 9.480191] slab_free_after_rcu_debug+0xb1/0x280 > [ 9.480755] rcu_core+0x610/0x1a80 > [ 9.481215] handle_softirqs+0x1b5/0x5c0 > [ 9.481696] irq_exit_rcu+0xaf/0xe0 > [ 9.482119] sysvec_apic_timer_interrupt+0x6c/0x80 > 
[ 9.482729] asm_sysvec_apic_timer_interrupt+0x1a/0x20 > [ 9.483389] > [ 9.483581] Last potentially related work creation: > [ 9.484174] kasan_save_stack+0x33/0x60 > [ 9.484661] __kasan_record_aux_stack+0x8e/0xa0 > [ 9.485228] kmem_cache_free+0x21c/0x370 > [ 9.485713] blk_update_request+0x22c/0x1070 > [ 9.486280] scsi_end_request+0x6b/0x5d0 > [ 9.486762] scsi_io_completion+0xa4/0xda0 > [ 9.487285] sdebug_blk_mq_poll_iter+0x189/0x2c0 > [ 9.487851] bt_tags_iter+0x15f/0x290 > [ 9.488310] __blk_mq_all_tag_iter+0x31d/0x960 > [ 9.488869] blk_mq_tagset_busy_iter+0xeb/0x140 > [ 9.489448] sdebug_blk_mq_poll+0x92/0x100 > [ 9.489949] blk_hctx_poll+0x160/0x330 > [ 9.490446] bio_poll+0x182/0x420 > [ 9.490853] iocb_bio_iopoll+0x47/0x60 > [ 9.491343] io_do_iopoll+0x450/0x10a0 > [ 9.491798] __do_sys_io_uring_enter+0x7b7/0x12e0 > [ 9.492398] do_syscall_64+0xa6/0x1a0 > [ 9.492852] entry_SYSCALL_64_after_hwframe+0x77/0x7f > [ 9.493484] > [ 9.493676] The buggy address belongs to the object at ff1100014c9b4640 > [ 9.493676] which belongs to the cache bio-248 of size 248 > [ 9.495118] The buggy address is located 116 bytes inside of > [ 9.495118] freed 248-byte region [ff1100014c9b4640, ff1100014c9b4738) > [ 9.496597] > [ 9.496784] The buggy address belongs to the physical page: > [ 9.497465] page: refcount:1 mapcount:0 mapping:0000000000000000 > index:0x0 pfn:0x14c9b4 > [ 9.498464] head: order:2 mapcount:0 entire_mapcount:0 > nr_pages_mapped:0 pincount:0 > [ 9.499421] flags: 0x200000000000040(head|node=0|zone=2) > [ 9.500053] page_type: f5(slab) > [ 9.500451] raw: 0200000000000040 ff110001052f8dc0 dead000000000122 > 0000000000000000 > [ 9.501386] raw: 0000000000000000 0000000080330033 00000001f5000000 > 0000000000000000 > [ 9.502333] head: 0200000000000040 ff110001052f8dc0 dead000000000122 > 0000000000000000 > [ 9.503261] head: 0000000000000000 0000000080330033 00000001f5000000 > 0000000000000000 > [ 9.504213] head: 0200000000000002 ffd4000005326d01 ffffffffffffffff > 0000000000000000 > 
[ 9.505142] head: 0000000000000004 0000000000000000 00000000ffffffff > 0000000000000000 > [ 9.506082] page dumped because: kasan: bad access detected > [ 9.506752] > [ 9.506939] Memory state around the buggy address: > [ 9.507560] ff1100014c9b4580: 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 fc > [ 9.508454] ff1100014c9b4600: fc fc fc fc fc fc fc fc fa fb fb fb fb > fb fb fb > [ 9.509365] >ff1100014c9b4680: fb fb fb fb fb fb fb fb fb fb fb fb fb > fb fb fb > [ 9.510260] ^ > [ 9.510842] ff1100014c9b4700: fb fb fb fb fb fb fb fc fc fc fc fc fc > fc fc fc > [ 9.511755] ff1100014c9b4780: 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 > [ 9.512654] > ================================================================== > [ 9.513616] Disabling lock debugging due to kernel taint > QEMU: Terminated > > Now scsi_debug does something pretty unorthodox in the mq_poll callback > in that it calls blk_mq_tagset_busy_iter() ... -> scsi_done(). > > However, for qemu with nvme I get this: > > fio-3.34 > Starting 10 processes > [ 30.887296] > ================================================================== > [ 30.907820] BUG: KASAN: slab-use-after-free in bio_poll+0x26b/0x420 > [ 30.924793] Read of size 4 at addr ff1100015f775ab4 by task fio/458 > [ 30.949904] > [ 30.952784] CPU: 11 UID: 0 PID: 458 Comm: fio Not tainted 6.13.0- > rc4-00053-gc9c268957b58 #3391 > [ 31.036344] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 > [ 31.090860] Call Trace: > [ 31.153928] <TASK> > [ 31.180060] dump_stack_lvl+0x53/0x70 > [ 31.209414] print_report+0xce/0x660 > [ 31.220341] ? bio_poll+0x26b/0x420 > [ 31.236876] kasan_report+0xc6/0x100 > [ 31.253395] ? bio_poll+0x26b/0x420 > [ 31.283105] bio_poll+0x26b/0x420 > [ 31.304388] iocb_bio_iopoll+0x47/0x60 > [ 31.327575] io_do_iopoll+0x450/0x10a0 > [ 31.357706] ? __pfx_io_do_iopoll+0x10/0x10 > [ 31.381389] ? io_submit_sqes+0x6f6/0x21e0 > [ 31.397833] ? 
mutex_lock+0x8c/0xe0 > [ 31.436789] ? __pfx_mutex_lock+0x10/0x10 > [ 31.469967] __do_sys_io_uring_enter+0x7b7/0x12e0 > [ 31.506017] ? __pfx___do_sys_io_uring_enter+0x10/0x10 > [ 31.556819] ? __pfx___rseq_handle_notify_resume+0x10/0x10 > [ 31.599749] ? handle_mm_fault+0x16f/0x400 > [ 31.637617] ? up_read+0x1a/0xb0 > [ 31.658649] do_syscall_64+0xa6/0x1a0 > [ 31.715961] entry_SYSCALL_64_after_hwframe+0x77/0x7f > [ 31.738610] RIP: 0033:0x558b29f538f8 > [ 31.758298] Code: 1c 01 00 00 48 8b 04 24 83 78 38 00 0f 85 0e 01 00 > 00 41 8b 3f 41 ba 01 00 00 00 45 31 c0 45 31 c9 b8 aa 01 00 00 89 ea 0f > 05 <89> c6 85 c0 0f 89 ec 00 00 00 89 44 24 0c e8 55 87 fa ff 8b 74 24 > [ 31.868980] RSP: 002b:00007ffd37d51490 EFLAGS: 00000246 ORIG_RAX: > 00000000000001aa > [ 31.946356] RAX: ffffffffffffffda RBX: 00007f120ebfeb40 RCX: > 0000558b29f538f8 > [ 32.044833] RDX: 0000000000000001 RSI: 0000000000000000 RDI: > 0000000000000006 > [ 32.086849] RBP: 0000000000000001 R08: 0000000000000000 R09: > 0000000000000000 > [ 32.117522] R10: 0000000000000001 R11: 0000000000000246 R12: > 0000000000000000 > [ 32.155554] R13: 00007ffd37d51628 R14: 0000000000000001 R15: > 0000558b3c3216b0 > [ 32.174488] </TASK> > [ 32.183180] > [ 32.193202] Allocated by task 458: > [ 32.205642] kasan_save_stack+0x33/0x60 > [ 32.215908] kasan_save_track+0x14/0x30 > [ 32.231828] __kasan_slab_alloc+0x6e/0x70 > [ 32.244998] kmem_cache_alloc_noprof+0xe9/0x300 > [ 32.263654] mempool_alloc_noprof+0x11a/0x2e0 > [ 32.274050] bio_alloc_bioset+0x1ab/0x780 > [ 32.286829] blkdev_direct_IO+0x456/0x2130 > [ 32.293655] blkdev_write_iter+0x54f/0xb90 > [ 32.299844] io_write+0x3b3/0xfe0 > [ 32.309428] io_issue_sqe+0x131/0x13e0 > [ 32.315319] io_submit_sqes+0x6f6/0x21e0 > [ 32.320913] __do_sys_io_uring_enter+0xa1e/0x12e0 > [ 32.328091] do_syscall_64+0xa6/0x1a0 > [ 32.336915] entry_SYSCALL_64_after_hwframe+0x77/0x7f > [ 32.350460] > [ 32.355097] Freed by task 455: > [ 32.360331] kasan_save_stack+0x33/0x60 > [ 32.369595] 
kasan_save_track+0x14/0x30 > [ 32.377397] kasan_save_free_info+0x3b/0x60 > [ 32.386598] __kasan_slab_free+0x37/0x50 > [ 32.398562] slab_free_after_rcu_debug+0xb1/0x280 > [ 32.417108] rcu_core+0x610/0x1a80 > [ 32.424947] handle_softirqs+0x1b5/0x5c0 > [ 32.434754] irq_exit_rcu+0xaf/0xe0 > [ 32.438144] sysvec_apic_timer_interrupt+0x6c/0x80 > [ 32.443842] asm_sysvec_apic_timer_interrupt+0x1a/0x20 > [ 32.448109] > [ 32.449772] Last potentially related work creation: > [ 32.454800] kasan_save_stack+0x33/0x60 > [ 32.458743] __kasan_record_aux_stack+0x8e/0xa0 > [ 32.463802] kmem_cache_free+0x21c/0x370 > [ 32.468130] blk_mq_end_request_batch+0x26b/0x13f0 > [ 32.473935] io_do_iopoll+0xa78/0x10a0 > [ 32.477800] __do_sys_io_uring_enter+0x7b7/0x12e0 > [ 32.482678] do_syscall_64+0xa6/0x1a0 > [ 32.487671] entry_SYSCALL_64_after_hwframe+0x77/0x7f > [ 32.492551] > [ 32.494058] The buggy address belongs to the object at ff1100015f775a40 > [ 32.494058] which belongs to the cache bio-248 of size 248 > [ 32.504485] The buggy address is located 116 bytes inside of > [ 32.504485] freed 248-byte region [ff1100015f775a40, ff1100015f775b38) > [ 32.518309] > [ 32.520370] The buggy address belongs to the physical page: > [ 32.526444] page: refcount:1 mapcount:0 mapping:0000000000000000 > index:0x0 pfn:0x15f774 > [ 32.535554] head: order:2 mapcount:0 entire_mapcount:0 > nr_pages_mapped:0 pincount:0 > [ 32.542517] flags: 0x200000000000040(head|node=0|zone=2) > [ 32.547971] page_type: f5(slab) > [ 32.551287] raw: 0200000000000040 ff1100010376af80 dead000000000122 > 0000000000000000 > [ 32.559290] raw: 0000000000000000 0000000000330033 00000001f5000000 > 0000000000000000 > [ 32.566773] head: 0200000000000040 ff1100010376af80 dead000000000122 > 0000000000000000 > [ 32.574046] head: 0000000000000000 0000000000330033 00000001f5000000 > 0000000000000000 > [ 32.581715] head: 0200000000000002 ffd40000057ddd01 ffffffffffffffff > 0000000000000000 > [ 32.589588] head: 0000000000000004 0000000000000000 
00000000ffffffff > 0000000000000000 > [ 32.596963] page dumped because: kasan: bad access detected > [ 32.603473] > [ 32.604871] Memory state around the buggy address: > [ 32.609617] ff1100015f775980: 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 fc > [ 32.617652] ff1100015f775a00: fc fc fc fc fc fc fc fc fa fb fb fb fb > fb fb fb > [ 32.625385] >ff1100015f775a80: fb fb fb fb fb fb fb fb fb fb fb fb fb > fb fb fb > [ 32.634014] ^ > [ 32.637444] ff1100015f775b00: fb fb fb fb fb fb fb fc fc fc fc fc fc > fc fc fc > [ 32.644158] ff1100015f775b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 > [ 32.651115] > ================================================================== > [ 32.659002] Disabling lock debugging due to kernel taint > QEMU: Terminated [W(10)][0.1%][w=150MiB/s][w=38.4k IOPS][eta 01h:24m:16s] > root@jgarry-ubuntu-bm5-instance-20230215-1843:/home/ubuntu/linux# > > Here's my git bisect log: > > git bisect start > # good: [1cbfb828e05171ca2dd77b5988d068e6872480fe] Merge tag > 'for-6.14/block-20250118' of git://git.kernel.dk/linux > git bisect good 1cbfb828e05171ca2dd77b5988d068e6872480fe > # bad: [a312e1706ce6c124f04ec85ddece240f3bb2a696] Merge tag > 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux > git bisect bad a312e1706ce6c124f04ec85ddece240f3bb2a696 > # good: [3d8b5a22d40435b4a7e58f06ae2cd3506b222898] block: add support > to pass user meta buffer > git bisect good 3d8b5a22d40435b4a7e58f06ae2cd3506b222898 > # good: [ce9464081d5168ee0f279d6932ba82260a5b97c4] io_uring/msg_ring: > Drop custom destructor > git bisect good ce9464081d5168ee0f279d6932ba82260a5b97c4 > # bad: [d803d123948feffbd992213e144df224097f82b0] io_uring/rw: handle > -EAGAIN retry at IO completion time > git bisect bad d803d123948feffbd992213e144df224097f82b0 > # good: [c5f71916146033f9aba108075ff7087022075fd6] io_uring/rw: always > clear ->bytes_done on io_async_rw setup > git bisect good c5f71916146033f9aba108075ff7087022075fd6 > # good: 
[2a51c327d4a4a2eb62d67f4ea13a17efd0f25c5c] io_uring/rsrc: > simplify the bvec iter count calculation > git bisect good 2a51c327d4a4a2eb62d67f4ea13a17efd0f25c5c > # good: [9ac273ae3dc296905b4d61e4c8e7a25592f6d183] io_uring/rw: use > io_rw_recycle() from cleanup path > git bisect good 9ac273ae3dc296905b4d61e4c8e7a25592f6d183 > # first bad commit: [d803d123948feffbd992213e144df224097f82b0] > io_uring/rw: handle -EAGAIN retry at IO completion time > john@localhost:~/mnt_sda4/john/kernel-dev2> > > Thanks, > John > >> --- >> io_uring/io_uring.c | 15 +++++++-- >> io_uring/rw.c | 80 ++++++++++++++------------------------------- >> 2 files changed, 38 insertions(+), 57 deletions(-) >> >> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c >> index db198bd435b5..92ba2fdcd087 100644 >> --- a/io_uring/io_uring.c >> +++ b/io_uring/io_uring.c >> @@ -115,7 +115,7 @@ >> REQ_F_ASYNC_DATA) >> #define IO_REQ_CLEAN_SLOW_FLAGS (REQ_F_REFCOUNT | REQ_F_LINK | >> REQ_F_HARDLINK |\ >> - IO_REQ_CLEAN_FLAGS) >> + REQ_F_REISSUE | IO_REQ_CLEAN_FLAGS) >> #define IO_TCTX_REFS_CACHE_NR (1U << 10) >> @@ -1403,6 +1403,12 @@ static void io_free_batch_list(struct >> io_ring_ctx *ctx, >> comp_list); >> if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) { >> + if (req->flags & REQ_F_REISSUE) { >> + node = req->comp_list.next; >> + req->flags &= ~REQ_F_REISSUE; >> + io_queue_iowq(req); >> + continue; >> + } >> if (req->flags & REQ_F_REFCOUNT) { >> node = req->comp_list.next; >> if (!req_ref_put_and_test(req)) >> @@ -1442,7 +1448,12 @@ void __io_submit_flush_completions(struct >> io_ring_ctx *ctx) >> struct io_kiocb *req = container_of(node, struct io_kiocb, >> comp_list); >> - if (!(req->flags & REQ_F_CQE_SKIP) && >> + /* >> + * Requests marked with REQUEUE should not post a CQE, they >> + * will go through the io-wq retry machinery and post one >> + * later. 
>> + */ >> + if (!(req->flags & (REQ_F_CQE_SKIP | REQ_F_REISSUE)) && >> unlikely(!io_fill_cqe_req(ctx, req))) { >> if (ctx->lockless_cq) { >> spin_lock(&ctx->completion_lock); >> diff --git a/io_uring/rw.c b/io_uring/rw.c >> index afc669048c5d..c52c0515f0a2 100644 >> --- a/io_uring/rw.c >> +++ b/io_uring/rw.c >> @@ -202,7 +202,7 @@ static void io_req_rw_cleanup(struct io_kiocb >> *req, unsigned int issue_flags) >> * mean that the underlying data can be gone at any time. But that >> * should be fixed seperately, and then this check could be killed. >> */ >> - if (!(req->flags & REQ_F_REFCOUNT)) { >> + if (!(req->flags & (REQ_F_REISSUE | REQ_F_REFCOUNT))) { >> req->flags &= ~REQ_F_NEED_CLEANUP; >> io_rw_recycle(req, issue_flags); >> } >> @@ -455,19 +455,12 @@ static inline loff_t *io_kiocb_update_pos(struct >> io_kiocb *req) >> return NULL; >> } >> -#ifdef CONFIG_BLOCK >> -static void io_resubmit_prep(struct io_kiocb *req) >> -{ >> - struct io_async_rw *io = req->async_data; >> - struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw); >> - >> - io_meta_restore(io, &rw->kiocb); >> - iov_iter_restore(&io->iter, &io->iter_state); >> -} >> - >> static bool io_rw_should_reissue(struct io_kiocb *req) >> { >> +#ifdef CONFIG_BLOCK >> + struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw); >> umode_t mode = file_inode(req->file)->i_mode; >> + struct io_async_rw *io = req->async_data; >> struct io_ring_ctx *ctx = req->ctx; >> if (!S_ISBLK(mode) && !S_ISREG(mode)) >> @@ -488,17 +481,14 @@ static bool io_rw_should_reissue(struct io_kiocb >> *req) >> */ >> if (!same_thread_group(req->tctx->task, current) || !in_task()) >> return false; >> + >> + io_meta_restore(io, &rw->kiocb); >> + iov_iter_restore(&io->iter, &io->iter_state); >> return true; >> -} >> #else >> -static void io_resubmit_prep(struct io_kiocb *req) >> -{ >> -} >> -static bool io_rw_should_reissue(struct io_kiocb *req) >> -{ >> return false; >> -} >> #endif >> +} >> static void io_req_end_write(struct io_kiocb *req) 
>> { >> @@ -525,22 +515,16 @@ static void io_req_io_end(struct io_kiocb *req) >> } >> } >> -static bool __io_complete_rw_common(struct io_kiocb *req, long res) >> +static void __io_complete_rw_common(struct io_kiocb *req, long res) >> { >> - if (unlikely(res != req->cqe.res)) { >> - if (res == -EAGAIN && io_rw_should_reissue(req)) { >> - /* >> - * Reissue will start accounting again, finish the >> - * current cycle. >> - */ >> - io_req_io_end(req); >> - req->flags |= REQ_F_REISSUE | REQ_F_BL_NO_RECYCLE; >> - return true; >> - } >> + if (res == req->cqe.res) >> + return; >> + if (res == -EAGAIN && io_rw_should_reissue(req)) { >> + req->flags |= REQ_F_REISSUE | REQ_F_BL_NO_RECYCLE; >> + } else { >> req_set_fail(req); >> req->cqe.res = res; >> } >> - return false; >> } >> static inline int io_fixup_rw_res(struct io_kiocb *req, long res) >> @@ -583,8 +567,7 @@ static void io_complete_rw(struct kiocb *kiocb, >> long res) >> struct io_kiocb *req = cmd_to_io_kiocb(rw); >> if (!kiocb->dio_complete || !(kiocb->ki_flags & >> IOCB_DIO_CALLER_COMP)) { >> - if (__io_complete_rw_common(req, res)) >> - return; >> + __io_complete_rw_common(req, res); >> io_req_set_res(req, io_fixup_rw_res(req, res), 0); >> } >> req->io_task_work.func = io_req_rw_complete; >> @@ -646,26 +629,19 @@ static int kiocb_done(struct io_kiocb *req, >> ssize_t ret, >> if (ret >= 0 && req->flags & REQ_F_CUR_POS) >> req->file->f_pos = rw->kiocb.ki_pos; >> if (ret >= 0 && (rw->kiocb.ki_complete == io_complete_rw)) { >> - if (!__io_complete_rw_common(req, ret)) { >> - /* >> - * Safe to call io_end from here as we're inline >> - * from the submission path. >> - */ >> - io_req_io_end(req); >> - io_req_set_res(req, final_ret, >> - io_put_kbuf(req, ret, issue_flags)); >> - io_req_rw_cleanup(req, issue_flags); >> - return IOU_OK; >> - } >> + __io_complete_rw_common(req, ret); >> + /* >> + * Safe to call io_end from here as we're inline >> + * from the submission path. 
>> + */ >> + io_req_io_end(req); >> + io_req_set_res(req, final_ret, io_put_kbuf(req, ret, >> issue_flags)); >> + io_req_rw_cleanup(req, issue_flags); >> + return IOU_OK; >> } else { >> io_rw_done(&rw->kiocb, ret); >> } >> - if (req->flags & REQ_F_REISSUE) { >> - req->flags &= ~REQ_F_REISSUE; >> - io_resubmit_prep(req); >> - return -EAGAIN; >> - } >> return IOU_ISSUE_SKIP_COMPLETE; >> } >> @@ -944,8 +920,7 @@ static int __io_read(struct io_kiocb *req, >> unsigned int issue_flags) >> if (ret == -EOPNOTSUPP && force_nonblock) >> ret = -EAGAIN; >> - if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) { >> - req->flags &= ~REQ_F_REISSUE; >> + if (ret == -EAGAIN) { >> /* If we can poll, just do that. */ >> if (io_file_can_poll(req)) >> return -EAGAIN; >> @@ -1154,11 +1129,6 @@ int io_write(struct io_kiocb *req, unsigned int >> issue_flags) >> else >> ret2 = -EINVAL; >> - if (req->flags & REQ_F_REISSUE) { >> - req->flags &= ~REQ_F_REISSUE; >> - ret2 = -EAGAIN; >> - } >> - >> /* >> * Raw bdev writes will return -EOPNOTSUPP for IOCB_NOWAIT. Just >> * retry them without IOCB_NOWAIT. > > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH 2/3] io_uring/rw: handle -EAGAIN retry at IO completion time
  2025-03-05 16:57       ` John Garry
@ 2025-03-05 17:03         ` Jens Axboe
  2025-03-05 21:03           ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2025-03-05 17:03 UTC (permalink / raw)
  To: John Garry, io-uring; +Cc: linux-scsi, linux-nvme

On 3/5/25 9:57 AM, John Garry wrote:
> On 04/03/2025 18:10, John Garry wrote:
>
> +
>
>> On 09/01/2025 18:15, Jens Axboe wrote:
>>> Rather than try and have io_read/io_write turn REQ_F_REISSUE into
>>> -EAGAIN, catch the REQ_F_REISSUE when the request is otherwise
>>> considered as done. This is saner as we know this isn't happening
>>> during an actual submission, and it removes the need to randomly
>>> check REQ_F_REISSUE after read/write submission.
>>>
>>> If REQ_F_REISSUE is set, __io_submit_flush_completions() will skip over
>>> this request in terms of posting a CQE, and the regular request
>>> cleaning will ensure that it gets reissued via io-wq.
>>>
>>> Signed-off-by: Jens Axboe <[email protected]>
>>
>
> Further info, I can easily recreate this on latest block/io_uring-6.14 on real NVMe HW:

Thanks, I'll take a look!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH 2/3] io_uring/rw: handle -EAGAIN retry at IO completion time
  2025-03-05 17:03         ` Jens Axboe
@ 2025-03-05 21:03           ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2025-03-05 21:03 UTC (permalink / raw)
  To: John Garry, io-uring; +Cc: linux-scsi, linux-nvme

On 3/5/25 10:03 AM, Jens Axboe wrote:
> On 3/5/25 9:57 AM, John Garry wrote:
>> On 04/03/2025 18:10, John Garry wrote:
>>
>> +
>>
>>> On 09/01/2025 18:15, Jens Axboe wrote:
>>>> Rather than try and have io_read/io_write turn REQ_F_REISSUE into
>>>> -EAGAIN, catch the REQ_F_REISSUE when the request is otherwise
>>>> considered as done. This is saner as we know this isn't happening
>>>> during an actual submission, and it removes the need to randomly
>>>> check REQ_F_REISSUE after read/write submission.
>>>>
>>>> If REQ_F_REISSUE is set, __io_submit_flush_completions() will skip over
>>>> this request in terms of posting a CQE, and the regular request
>>>> cleaning will ensure that it gets reissued via io-wq.
>>>>
>>>> Signed-off-by: Jens Axboe <[email protected]>
>>>
>>
>> Further info, I can easily recreate this on latest block/io_uring-6.14 on real NVMe HW:
>
> Thanks, I'll take a look!

Can you give this a spin?

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 9edc6baebd01..e5528cebcd06 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -560,11 +560,10 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res)
 	if (kiocb->ki_flags & IOCB_WRITE)
 		io_req_end_write(req);
 	if (unlikely(res != req->cqe.res)) {
-		if (res == -EAGAIN && io_rw_should_reissue(req)) {
+		if (res == -EAGAIN && io_rw_should_reissue(req))
 			req->flags |= REQ_F_REISSUE | REQ_F_BL_NO_RECYCLE;
-			return;
-		}
-		req->cqe.res = res;
+		else
+			req->cqe.res = res;
 	}
 
 	/* order with io_iopoll_complete() checking ->iopoll_completed */

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* [PATCH 3/3] io_uring/rw: don't gate retry on completion context
  2025-01-09 18:15 [PATCHSET for-next 0/3] Fix read/write -EAGAIN failure cases Jens Axboe
  2025-01-09 18:15 ` [PATCH 1/3] io_uring/rw: use io_rw_recycle() from cleanup path Jens Axboe
  2025-01-09 18:15 ` [PATCH 2/3] io_uring/rw: handle -EAGAIN retry at IO completion time Jens Axboe
@ 2025-01-09 18:15 ` Jens Axboe
  2 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2025-01-09 18:15 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, Haeuptle, Michael

nvme multipath reports that they see spurious -EAGAIN bubbling back to
userspace, which is caused by how they handle retries internally through
a kworker. However, any data that needs preserving or importing for a
read/write request has always been done at prep time, and we can sanely
skip this check.

Reported-by: "Haeuptle, Michael" <[email protected]>
Link: https://lore.kernel.org/io-uring/DS7PR84MB31105C2C63CFA47BE8CBD6EE95102@DS7PR84MB3110.NAMPRD84.PROD.OUTLOOK.COM/
Signed-off-by: Jens Axboe <[email protected]>
---
 io_uring/rw.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/io_uring/rw.c b/io_uring/rw.c
index c52c0515f0a2..ee5d38db9b48 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -475,12 +475,6 @@ static bool io_rw_should_reissue(struct io_kiocb *req)
 	 */
 	if (percpu_ref_is_dying(&ctx->refs))
 		return false;
-	/*
-	 * Play it safe and assume not safe to re-import and reissue if we're
-	 * not in the original thread group (or in task context).
-	 */
-	if (!same_thread_group(req->tctx->task, current) || !in_task())
-		return false;
 
 	io_meta_restore(io, &rw->kiocb);
 	iov_iter_restore(&io->iter, &io->iter_state);

-- 
2.47.1

^ permalink raw reply related	[flat|nested] 8+ messages in thread
end of thread, other threads: [~2025-03-05 21:03 UTC | newest]

Thread overview: 8+ messages:
2025-01-09 18:15 [PATCHSET for-next 0/3] Fix read/write -EAGAIN failure cases Jens Axboe
2025-01-09 18:15 ` [PATCH 1/3] io_uring/rw: use io_rw_recycle() from cleanup path Jens Axboe
2025-01-09 18:15 ` [PATCH 2/3] io_uring/rw: handle -EAGAIN retry at IO completion time Jens Axboe
2025-03-04 18:10   ` John Garry
2025-03-05 16:57     ` John Garry
2025-03-05 17:03       ` Jens Axboe
2025-03-05 21:03         ` Jens Axboe
2025-01-09 18:15 ` [PATCH 3/3] io_uring/rw: don't gate retry on completion context Jens Axboe