* [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode
@ 2025-12-10 8:54 Fengnan Chang
2025-12-10 8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: Fengnan Chang @ 2025-12-10 8:54 UTC (permalink / raw)
To: axboe, asml.silence, io-uring; +Cc: Fengnan Chang
In io_do_iopoll(), when the poll loop over iopoll_list ends, the
current req is assumed to be the request that actually completed.
This may be reasonable for a multi-queue ctx, but it is problematic
for a single-queue ctx, because the current request may not yet be
done when the poll gets a result. In that case a completed IO has to
wait for the first IO on the list to complete before the user is
notified, which can cause IOs to accumulate in the list.
Our modification is as follows: change the io_wq_work_list to a
normal doubly linked list, so that a request can be removed from
iopoll_list and moved to the compl_reqs list as soon as it completes.
This way each IO is handled independently and all of them are
processed in time.
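For illustration, a minimal sketch of the reap step this enables,
assuming the usual <linux/list.h> helpers and the comp_list/compl_reqs
field names used by patch 2/2 (this is only the idea, not the exact
patch hunk): with the singly linked io_wq_work_list the reap walk has
to stop at the first request that has not completed yet, while with a
list_head any completed request can be unlinked on its own.

/* Illustrative sketch only; field names assumed from patch 2/2 */
static int reap_completed_reqs(struct io_ring_ctx *ctx)
{
	struct io_kiocb *req, *tmp;
	int nr_events = 0;

	list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, comp_list) {
		/* pairs with smp_store_release() in io_complete_rw_iopoll() */
		if (!smp_load_acquire(&req->iopoll_completed))
			continue;	/* skip it instead of ending the walk */
		list_move_tail(&req->comp_list,
			       &ctx->submit_state.compl_reqs);
		nr_events++;
	}
	return nr_events;
}

Completed requests are then flushed through
__io_submit_flush_completions() as before.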
After the modification, tested with:
./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
/dev/nvme6n1
Base IOPS is 725K, patch IOPS is 782K.
./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
/dev/nvme6n1
Base IOPS is 880K, patch IOPS is 895K.
Fengnan Chang (2):
blk-mq: delete task running check in blk_hctx_poll
io_uring: fix io may accumulation in poll mode
block/blk-mq.c | 2 --
include/linux/blk-mq.h | 1 +
include/linux/blkdev.h | 1 +
include/linux/io_uring_types.h | 8 +++---
io_uring/io_uring.c | 51 ++++++++++++++++------------------
io_uring/io_uring.h | 15 ++++------
io_uring/rw.c | 35 +++++++----------------
io_uring/slist.h | 9 ------
io_uring/sqpoll.c | 8 +++---
9 files changed, 49 insertions(+), 81 deletions(-)
base-commit: 4941a17751c99e17422be743c02c923ad706f888
prerequisite-patch-id: 14d5e19968d0acb074ebcd62f073c330b0e34f36
prerequisite-patch-id: 1b5c18a564a4b4788f4992dc02d70c01f69c0e0e
--
2.39.5 (Apple Git-154)
^ permalink raw reply [flat|nested] 22+ messages in thread* [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll 2025-12-10 8:54 [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode Fengnan Chang @ 2025-12-10 8:55 ` Fengnan Chang 2025-12-10 9:19 ` Jens Axboe 2025-12-10 9:53 ` Jens Axboe 2025-12-10 8:55 ` [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode Fengnan Chang 2025-12-10 9:53 ` (subset) [RFC PATCH 0/2] " Jens Axboe 2 siblings, 2 replies; 22+ messages in thread From: Fengnan Chang @ 2025-12-10 8:55 UTC (permalink / raw) To: axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li In blk_hctx_poll, it always check task is running or not, and return 1 if task is running, it's not reasonable for current caller, especially io_uring, which is always running and cause BLK_POLL_ONESHOT is set. It looks like there has been this judgment for historical reasons, and in very early versions of this function the user would set the process state to TASK_UNINTERRUPTIBLE. Signed-off-by: Diangang Li <lidiangang@bytedance.com> Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> --- block/blk-mq.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 0b8b72194003..b0eb90c50afb 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -5172,8 +5172,6 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx, if (signal_pending_state(state, current)) __set_current_state(TASK_RUNNING); - if (task_is_running(current)) - return 1; if (ret < 0 || (flags & BLK_POLL_ONESHOT)) break; -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll 2025-12-10 8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang @ 2025-12-10 9:19 ` Jens Axboe 2025-12-10 9:53 ` Jens Axboe 1 sibling, 0 replies; 22+ messages in thread From: Jens Axboe @ 2025-12-10 9:19 UTC (permalink / raw) To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li On 12/10/25 1:55 AM, Fengnan Chang wrote: > In blk_hctx_poll, it always check task is running or not, and return 1 > if task is running, it's not reasonable for current caller, especially > io_uring, which is always running and cause BLK_POLL_ONESHOT is set. > > It looks like there has been this judgment for historical reasons, and > in very early versions of this function the user would set the process > state to TASK_UNINTERRUPTIBLE. > > Signed-off-by: Diangang Li <lidiangang@bytedance.com> > Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> > --- > block/blk-mq.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 0b8b72194003..b0eb90c50afb 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -5172,8 +5172,6 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx, > > if (signal_pending_state(state, current)) > __set_current_state(TASK_RUNNING); > - if (task_is_running(current)) > - return 1; > > if (ret < 0 || (flags & BLK_POLL_ONESHOT)) > break; Heh yes, this is completely a leftover part from when polled IO was purely synchronous and the completion would wake it. Similarly, the __set_current_state(TASK_RUNNING) a bit further down needs to die as well. I'll check your patch 2 tomorrow. Not loving the complete switch from the singly linked list to the double linked, that's going to slow down both addition and iteration of the req freelist. So I think something needs to change there, but haven't really gone deep with it yet. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll 2025-12-10 8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang 2025-12-10 9:19 ` Jens Axboe @ 2025-12-10 9:53 ` Jens Axboe 1 sibling, 0 replies; 22+ messages in thread From: Jens Axboe @ 2025-12-10 9:53 UTC (permalink / raw) To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li On 12/10/25 1:55 AM, Fengnan Chang wrote: > In blk_hctx_poll, it always check task is running or not, and return 1 > if task is running, it's not reasonable for current caller, especially > io_uring, which is always running and cause BLK_POLL_ONESHOT is set. > > It looks like there has been this judgment for historical reasons, and > in very early versions of this function the user would set the process > state to TASK_UNINTERRUPTIBLE. > > Signed-off-by: Diangang Li <lidiangang@bytedance.com> > Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> > --- > block/blk-mq.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 0b8b72194003..b0eb90c50afb 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -5172,8 +5172,6 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx, > > if (signal_pending_state(state, current)) > __set_current_state(TASK_RUNNING); > - if (task_is_running(current)) > - return 1; > > if (ret < 0 || (flags & BLK_POLL_ONESHOT)) > break; Applied with the other task-is-running remnants killed, too. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-10 8:54 [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode Fengnan Chang 2025-12-10 8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang @ 2025-12-10 8:55 ` Fengnan Chang 2025-12-11 2:15 ` Jens Axboe 2025-12-10 9:53 ` (subset) [RFC PATCH 0/2] " Jens Axboe 2 siblings, 1 reply; 22+ messages in thread From: Fengnan Chang @ 2025-12-10 8:55 UTC (permalink / raw) To: axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li In the io_do_iopoll function, when the poll loop of iopoll_list ends, it is considered that the current req is the actual completed request. This may be reasonable for multi-queue ctx, but is problematic for single-queue ctx because the current request may not be done when the poll gets to the result. In this case, the completed io needs to wait for the first io on the chain to complete before notifying the user, which may cause io accumulation in the list. Our modification plan is as follows: change io_wq_work_list to normal list so that the iopoll_list list in it can be removed and put into the comp_reqs list when the request is completed. This way each io is handled independently and all gets processed in time. After modification, test with: ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 /dev/nvme6n1 base IOPS is 725K, patch IOPS is 782K. ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1 /dev/nvme6n1 Base IOPS is 880k, patch IOPS is 895K. Signed-off-by: Diangang Li <lidiangang@bytedance.com> Signed-off-by: Fengnan Chang <changfengnan@bytedance.com> --- include/linux/blk-mq.h | 1 + include/linux/blkdev.h | 1 + include/linux/io_uring_types.h | 8 +++--- io_uring/io_uring.c | 51 ++++++++++++++++------------------ io_uring/io_uring.h | 15 ++++------ io_uring/rw.c | 35 +++++++---------------- io_uring/slist.h | 9 ------ io_uring/sqpoll.c | 8 +++--- 8 files changed, 49 insertions(+), 79 deletions(-) diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index c16875b35521..39151998f8df 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -902,6 +902,7 @@ static inline bool blk_mq_add_to_batch(struct request *req, return false; iob->need_ts |= blk_mq_need_time_stamp(req); rq_list_add_tail(&iob->req_list, req); + iob->req_count++; return true; } diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index e25d9802e08b..923dbb1de6cf 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1822,6 +1822,7 @@ void bdev_fput(struct file *bdev_file); struct io_comp_batch { struct rq_list req_list; + unsigned int req_count; bool need_ts; void (*complete)(struct io_comp_batch *); }; diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index c2ea6280901d..e9c70219c0f2 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -235,9 +235,9 @@ struct io_submit_link { struct io_submit_state { /* inline/task_work completion list, under ->uring_lock */ - struct io_wq_work_node free_list; + struct list_head free_list; /* batch completion logic */ - struct io_wq_work_list compl_reqs; + struct list_head compl_reqs; struct io_submit_link link; bool plug_started; @@ -317,7 +317,7 @@ struct io_ring_ctx { * manipulate the list, hence no extra locking is needed there. 
*/ bool poll_multi_queue; - struct io_wq_work_list iopoll_list; + struct list_head iopoll_list; struct io_file_table file_table; struct io_rsrc_data buf_table; @@ -695,7 +695,7 @@ struct io_kiocb { union { /* used by request caches, completion batching and iopoll */ - struct io_wq_work_node comp_list; + struct list_head comp_list; /* cache ->apoll->events */ __poll_t apoll_events; }; diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 296667ba712c..a7166656093a 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -255,7 +255,7 @@ static inline void io_req_add_to_cache(struct io_kiocb *req, struct io_ring_ctx { if (IS_ENABLED(CONFIG_KASAN)) io_poison_cached_req(req); - wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list); + list_add(&req->comp_list, &ctx->submit_state.free_list); } static __cold void io_ring_ctx_ref_free(struct percpu_ref *ref) @@ -367,20 +367,20 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) init_waitqueue_head(&ctx->poll_wq); spin_lock_init(&ctx->completion_lock); raw_spin_lock_init(&ctx->timeout_lock); - INIT_WQ_LIST(&ctx->iopoll_list); + INIT_LIST_HEAD(&ctx->iopoll_list); INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); INIT_LIST_HEAD(&ctx->ltimeout_list); init_llist_head(&ctx->work_llist); INIT_LIST_HEAD(&ctx->tctx_list); - ctx->submit_state.free_list.next = NULL; + INIT_LIST_HEAD(&ctx->submit_state.free_list); INIT_HLIST_HEAD(&ctx->waitid_list); xa_init_flags(&ctx->zcrx_ctxs, XA_FLAGS_ALLOC); #ifdef CONFIG_FUTEX INIT_HLIST_HEAD(&ctx->futex_list); #endif INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func); - INIT_WQ_LIST(&ctx->submit_state.compl_reqs); + INIT_LIST_HEAD(&ctx->submit_state.compl_reqs); INIT_HLIST_HEAD(&ctx->cancelable_uring_cmd); io_napi_init(ctx); mutex_init(&ctx->mmap_lock); @@ -945,7 +945,7 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags) * those are flushed first before posting this one. If not, CQEs * could get reordered. 
*/ - if (!wq_list_empty(&ctx->submit_state.compl_reqs)) + if (!list_empty(&ctx->submit_state.compl_reqs)) __io_submit_flush_completions(ctx); lockdep_assert(!io_wq_current_is_worker()); @@ -1498,9 +1498,10 @@ static inline void io_req_put_rsrc_nodes(struct io_kiocb *req) } static void io_free_batch_list(struct io_ring_ctx *ctx, - struct io_wq_work_node *node) + struct list_head *head) __must_hold(&ctx->uring_lock) { + struct list_head *node = head->next; do { struct io_kiocb *req = container_of(node, struct io_kiocb, comp_list); @@ -1536,20 +1537,17 @@ static void io_free_batch_list(struct io_ring_ctx *ctx, node = req->comp_list.next; io_req_add_to_cache(req, ctx); - } while (node); + } while (!list_is_head(node, head)); } void __io_submit_flush_completions(struct io_ring_ctx *ctx) __must_hold(&ctx->uring_lock) { struct io_submit_state *state = &ctx->submit_state; - struct io_wq_work_node *node; + struct io_kiocb *req; __io_cq_lock(ctx); - __wq_list_for_each(node, &state->compl_reqs) { - struct io_kiocb *req = container_of(node, struct io_kiocb, - comp_list); - + list_for_each_entry(req, &state->compl_reqs, comp_list) { /* * Requests marked with REQUEUE should not post a CQE, they * will go through the io-wq retry machinery and post one @@ -1565,9 +1563,9 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx) } __io_cq_unlock_post(ctx); - if (!wq_list_empty(&state->compl_reqs)) { - io_free_batch_list(ctx, state->compl_reqs.first); - INIT_WQ_LIST(&state->compl_reqs); + if (!list_empty(&state->compl_reqs)) { + io_free_batch_list(ctx, &state->compl_reqs); + INIT_LIST_HEAD(&state->compl_reqs); } if (unlikely(ctx->drain_active)) @@ -1593,7 +1591,7 @@ static __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx) return; mutex_lock(&ctx->uring_lock); - while (!wq_list_empty(&ctx->iopoll_list)) { + while (!list_empty(&ctx->iopoll_list)) { /* let it sleep and repeat later if can't complete a request */ if (io_do_iopoll(ctx, true) == 0) break; @@ -1658,21 +1656,21 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events) * forever, while the workqueue is stuck trying to acquire the * very same mutex. */ - if (wq_list_empty(&ctx->iopoll_list) || + if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) { u32 tail = ctx->cached_cq_tail; (void) io_run_local_work_locked(ctx, min_events); if (task_work_pending(current) || - wq_list_empty(&ctx->iopoll_list)) { + list_empty(&ctx->iopoll_list)) { mutex_unlock(&ctx->uring_lock); io_run_task_work(); mutex_lock(&ctx->uring_lock); } /* some requests don't go through iopoll_list */ if (tail != ctx->cached_cq_tail || - wq_list_empty(&ctx->iopoll_list)) + list_empty(&ctx->iopoll_list)) break; } ret = io_do_iopoll(ctx, !min_events); @@ -1715,13 +1713,12 @@ static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags) * how we do polling eventually, not spinning if we're on potentially * different devices. */ - if (wq_list_empty(&ctx->iopoll_list)) { + if (list_empty(&ctx->iopoll_list)) { ctx->poll_multi_queue = false; } else if (!ctx->poll_multi_queue) { struct io_kiocb *list_req; - list_req = container_of(ctx->iopoll_list.first, struct io_kiocb, - comp_list); + list_req = list_first_entry(&ctx->iopoll_list, struct io_kiocb, comp_list); if (list_req->file != req->file) ctx->poll_multi_queue = true; } @@ -1731,9 +1728,9 @@ static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags) * it to the front so we find it first. 
*/ if (READ_ONCE(req->iopoll_completed)) - wq_list_add_head(&req->comp_list, &ctx->iopoll_list); + list_add(&req->comp_list, &ctx->iopoll_list); else - wq_list_add_tail(&req->comp_list, &ctx->iopoll_list); + list_add_tail(&req->comp_list, &ctx->iopoll_list); if (unlikely(needs_lock)) { /* @@ -2454,7 +2451,7 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr) if (unlikely(left)) { ret -= left; /* try again if it submitted nothing and can't allocate a req */ - if (!ret && io_req_cache_empty(ctx)) + if (!ret && list_empty(&ctx->submit_state.free_list)) ret = -EAGAIN; current->io_uring->cached_refs += left; } @@ -2818,7 +2815,7 @@ static __cold void __io_req_caches_free(struct io_ring_ctx *ctx) struct io_kiocb *req; int nr = 0; - while (!io_req_cache_empty(ctx)) { + while (!list_empty(&ctx->submit_state.free_list)) { req = io_extract_req(ctx); io_poison_req(req); kmem_cache_free(req_cachep, req); @@ -3215,7 +3212,7 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx, /* SQPOLL thread does its own polling */ if ((!(ctx->flags & IORING_SETUP_SQPOLL) && cancel_all) || is_sqpoll_thread) { - while (!wq_list_empty(&ctx->iopoll_list)) { + while (!list_empty(&ctx->iopoll_list)) { io_iopoll_try_reap_events(ctx); ret = true; cond_resched(); diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h index 46d9141d772a..acc82a1b7107 100644 --- a/io_uring/io_uring.h +++ b/io_uring/io_uring.h @@ -215,7 +215,7 @@ static inline void io_req_task_work_add(struct io_kiocb *req) static inline void io_submit_flush_completions(struct io_ring_ctx *ctx) { - if (!wq_list_empty(&ctx->submit_state.compl_reqs) || + if (!list_empty(&ctx->submit_state.compl_reqs) || ctx->submit_state.cq_flush) __io_submit_flush_completions(ctx); } @@ -502,7 +502,7 @@ static inline void io_req_complete_defer(struct io_kiocb *req) lockdep_assert_held(&req->ctx->uring_lock); - wq_list_add_tail(&req->comp_list, &state->compl_reqs); + list_add_tail(&req->comp_list, &state->compl_reqs); } static inline void io_commit_cqring_flush(struct io_ring_ctx *ctx) @@ -521,25 +521,20 @@ static inline void io_get_task_refs(int nr) io_task_refs_refill(tctx); } -static inline bool io_req_cache_empty(struct io_ring_ctx *ctx) -{ - return !ctx->submit_state.free_list.next; -} - extern struct kmem_cache *req_cachep; static inline struct io_kiocb *io_extract_req(struct io_ring_ctx *ctx) { struct io_kiocb *req; - req = container_of(ctx->submit_state.free_list.next, struct io_kiocb, comp_list); - wq_stack_extract(&ctx->submit_state.free_list); + req = list_first_entry(&ctx->submit_state.free_list, struct io_kiocb, comp_list); + list_del_init(&req->comp_list); return req; } static inline bool io_alloc_req(struct io_ring_ctx *ctx, struct io_kiocb **req) { - if (unlikely(io_req_cache_empty(ctx))) { + if (unlikely(list_empty(&ctx->submit_state.free_list))) { if (!__io_alloc_req_refill(ctx)) return false; } diff --git a/io_uring/rw.c b/io_uring/rw.c index abe68ba9c9dc..4ecd41f48812 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -616,6 +616,11 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res) /* order with io_iopoll_complete() checking ->iopoll_completed */ smp_store_release(&req->iopoll_completed, 1); + + req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL); + if (req->opcode != IORING_OP_URING_CMD) + io_req_rw_cleanup(req, 0); + list_move_tail(&req->comp_list, &req->ctx->submit_state.compl_reqs); } static inline void io_rw_done(struct io_kiocb *req, ssize_t ret) @@ -1326,10 +1331,9 @@ static int 
io_uring_hybrid_poll(struct io_kiocb *req, int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) { - struct io_wq_work_node *pos, *start, *prev; + struct io_kiocb *req, *next; unsigned int poll_flags = 0; DEFINE_IO_COMP_BATCH(iob); - int nr_events = 0; /* * Only spin for completions if we don't have multiple devices hanging @@ -1338,8 +1342,7 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) if (ctx->poll_multi_queue || force_nonspin) poll_flags |= BLK_POLL_ONESHOT; - wq_list_for_each(pos, start, &ctx->iopoll_list) { - struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list); + list_for_each_entry(req, &ctx->iopoll_list, comp_list) { int ret; /* @@ -1368,32 +1371,14 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) if (!rq_list_empty(&iob.req_list)) iob.complete(&iob); - else if (!pos) + else if (list_entry_is_head(req, &ctx->iopoll_list, comp_list)) return 0; - prev = start; - wq_list_for_each_resume(pos, prev) { - struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list); - - /* order with io_complete_rw_iopoll(), e.g. ->result updates */ - if (!smp_load_acquire(&req->iopoll_completed)) - break; - nr_events++; - req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL); - if (req->opcode != IORING_OP_URING_CMD) - io_req_rw_cleanup(req, 0); - } - if (unlikely(!nr_events)) + if (unlikely(!iob.req_count)) return 0; - pos = start ? start->next : ctx->iopoll_list.first; - wq_list_cut(&ctx->iopoll_list, prev, start); - - if (WARN_ON_ONCE(!wq_list_empty(&ctx->submit_state.compl_reqs))) - return 0; - ctx->submit_state.compl_reqs.first = pos; __io_submit_flush_completions(ctx); - return nr_events; + return iob.req_count; } void io_rw_cache_free(const void *entry) diff --git a/io_uring/slist.h b/io_uring/slist.h index 0eb194817242..f62e12c67e4f 100644 --- a/io_uring/slist.h +++ b/io_uring/slist.h @@ -99,15 +99,6 @@ static inline void wq_list_del(struct io_wq_work_list *list, wq_list_cut(list, node, prev); } -static inline -struct io_wq_work_node *wq_stack_extract(struct io_wq_work_node *stack) -{ - struct io_wq_work_node *node = stack->next; - - stack->next = node->next; - return node; -} - static inline struct io_wq_work *wq_next_work(struct io_wq_work *work) { if (!work->list.next) diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c index e22f072c7d5f..aac7a2c69892 100644 --- a/io_uring/sqpoll.c +++ b/io_uring/sqpoll.c @@ -211,7 +211,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd, if (cap_entries && to_submit > IORING_SQPOLL_CAP_ENTRIES_VALUE) to_submit = IORING_SQPOLL_CAP_ENTRIES_VALUE; - if (to_submit || !wq_list_empty(&ctx->iopoll_list)) { + if (to_submit || !list_empty(&ctx->iopoll_list)) { const struct cred *creds = NULL; io_sq_start_worktime(ist); @@ -220,7 +220,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd, creds = override_creds(ctx->sq_creds); mutex_lock(&ctx->uring_lock); - if (!wq_list_empty(&ctx->iopoll_list)) + if (!list_empty(&ctx->iopoll_list)) io_do_iopoll(ctx, true); /* @@ -343,7 +343,7 @@ static int io_sq_thread(void *data) list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) { int ret = __io_sq_thread(ctx, sqd, cap_entries, &ist); - if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list))) + if (!sqt_spin && (ret > 0 || !list_empty(&ctx->iopoll_list))) sqt_spin = true; } if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE)) @@ -378,7 +378,7 @@ static int io_sq_thread(void *data) atomic_or(IORING_SQ_NEED_WAKEUP, &ctx->rings->sq_flags); if ((ctx->flags 
& IORING_SETUP_IOPOLL) && - !wq_list_empty(&ctx->iopoll_list)) { + !list_empty(&ctx->iopoll_list)) { needs_sched = false; break; } -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-10 8:55 ` [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode Fengnan Chang @ 2025-12-11 2:15 ` Jens Axboe 2025-12-11 4:10 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2025-12-11 2:15 UTC (permalink / raw) To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li On 12/10/25 1:55 AM, Fengnan Chang wrote: > In the io_do_iopoll function, when the poll loop of iopoll_list ends, it > is considered that the current req is the actual completed request. > This may be reasonable for multi-queue ctx, but is problematic for > single-queue ctx because the current request may not be done when the > poll gets to the result. In this case, the completed io needs to wait > for the first io on the chain to complete before notifying the user, > which may cause io accumulation in the list. > Our modification plan is as follows: change io_wq_work_list to normal > list so that the iopoll_list list in it can be removed and put into the > comp_reqs list when the request is completed. This way each io is > handled independently and all gets processed in time. > > After modification, test with: > > ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 > /dev/nvme6n1 > > base IOPS is 725K, patch IOPS is 782K. > > ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1 > /dev/nvme6n1 > > Base IOPS is 880k, patch IOPS is 895K. A few notes on this: 1) Manipulating the list in io_complete_rw_iopoll() I don't think is necessarily safe. Yes generally this is invoked from the owning/polling task, but that's not guaranteed. 2) The patch doesn't apply to the current tree, must be an older version? 3) When hand-applied, it still throws a compile warning about an unused variable. Please don't send untested stuff... 4) Don't just blatantly bloat the io_kiocb. When you change from a singly to a doubly linked list, you're growing the io_kiocb size. You should be able to use a union with struct io_task_work for example. That's already 16b in size - win/win as you don't need to slow down the cache management as that can keep using the linkage it currently is using, and you're not bloating the io_kiocb. 5) The already mentioned point about the cache free list now being doubly linked. This is generally a _bad_ idea as removing and adding entries now need to touch other entries too. That's not very cache friendly. #1 is kind of the big one, as it means you'll need to re-think how you do this. I do agree that the current approach isn't necessarily ideal as we don't process completions as quickly as we could, so I think there's merrit in continuing this work. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-11 2:15 ` Jens Axboe @ 2025-12-11 4:10 ` Jens Axboe 2025-12-11 7:38 ` Fengnan 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2025-12-11 4:10 UTC (permalink / raw) To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li On 12/10/25 7:15 PM, Jens Axboe wrote: > On 12/10/25 1:55 AM, Fengnan Chang wrote: >> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it >> is considered that the current req is the actual completed request. >> This may be reasonable for multi-queue ctx, but is problematic for >> single-queue ctx because the current request may not be done when the >> poll gets to the result. In this case, the completed io needs to wait >> for the first io on the chain to complete before notifying the user, >> which may cause io accumulation in the list. >> Our modification plan is as follows: change io_wq_work_list to normal >> list so that the iopoll_list list in it can be removed and put into the >> comp_reqs list when the request is completed. This way each io is >> handled independently and all gets processed in time. >> >> After modification, test with: >> >> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 >> /dev/nvme6n1 >> >> base IOPS is 725K, patch IOPS is 782K. >> >> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1 >> /dev/nvme6n1 >> >> Base IOPS is 880k, patch IOPS is 895K. > > A few notes on this: > > 1) Manipulating the list in io_complete_rw_iopoll() I don't think is > necessarily safe. Yes generally this is invoked from the > owning/polling task, but that's not guaranteed. > > 2) The patch doesn't apply to the current tree, must be an older > version? > > 3) When hand-applied, it still throws a compile warning about an unused > variable. Please don't send untested stuff... > > 4) Don't just blatantly bloat the io_kiocb. When you change from a > singly to a doubly linked list, you're growing the io_kiocb size. You > should be able to use a union with struct io_task_work for example. > That's already 16b in size - win/win as you don't need to slow down > the cache management as that can keep using the linkage it currently > is using, and you're not bloating the io_kiocb. > > 5) The already mentioned point about the cache free list now being > doubly linked. This is generally a _bad_ idea as removing and adding > entries now need to touch other entries too. That's not very cache > friendly. > > #1 is kind of the big one, as it means you'll need to re-think how you > do this. I do agree that the current approach isn't necessarily ideal as > we don't process completions as quickly as we could, so I think there's > merrit in continuing this work. Proof of concept below, entirely untested, at a conference. Basically does what I describe above, and retains the list manipulation logic on the iopoll side, rather than on the completion side. Can you take a look at this? And if it runs, can you do some numbers on that too? diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index e1adb0d20a0a..54fd30abf2b8 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -316,7 +316,7 @@ struct io_ring_ctx { * manipulate the list, hence no extra locking is needed there. 
*/ bool poll_multi_queue; - struct io_wq_work_list iopoll_list; + struct list_head iopoll_list; struct io_file_table file_table; struct io_rsrc_data buf_table; @@ -708,7 +708,16 @@ struct io_kiocb { atomic_t refs; bool cancel_seq_set; - struct io_task_work io_task_work; + + /* + * IOPOLL doesn't use task_work, so use the ->iopoll_node list + * entry to manage pending iopoll requests. + */ + union { + struct io_task_work io_task_work; + struct list_head iopoll_node; + }; + union { /* * for polled requests, i.e. IORING_OP_POLL_ADD and async armed diff --git a/io_uring/cancel.c b/io_uring/cancel.c index ca12ac10c0ae..4136bf04464a 100644 --- a/io_uring/cancel.c +++ b/io_uring/cancel.c @@ -534,7 +534,7 @@ __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx, /* SQPOLL thread does its own polling */ if ((!(ctx->flags & IORING_SETUP_SQPOLL) && cancel_all) || is_sqpoll_thread) { - while (!wq_list_empty(&ctx->iopoll_list)) { + while (!list_empty(&ctx->iopoll_list)) { io_iopoll_try_reap_events(ctx); ret = true; cond_resched(); diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 6cb24cdf8e68..fdb0b43f6fb5 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -334,7 +334,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) init_waitqueue_head(&ctx->poll_wq); spin_lock_init(&ctx->completion_lock); raw_spin_lock_init(&ctx->timeout_lock); - INIT_WQ_LIST(&ctx->iopoll_list); + INIT_LIST_HEAD(&ctx->iopoll_list); INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); INIT_LIST_HEAD(&ctx->ltimeout_list); @@ -1561,7 +1561,7 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx) return; mutex_lock(&ctx->uring_lock); - while (!wq_list_empty(&ctx->iopoll_list)) { + while (!list_empty(&ctx->iopoll_list)) { /* let it sleep and repeat later if can't complete a request */ if (io_do_iopoll(ctx, true) == 0) break; @@ -1626,21 +1626,18 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events) * forever, while the workqueue is stuck trying to acquire the * very same mutex. */ - if (wq_list_empty(&ctx->iopoll_list) || - io_task_work_pending(ctx)) { + if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) { u32 tail = ctx->cached_cq_tail; (void) io_run_local_work_locked(ctx, min_events); - if (task_work_pending(current) || - wq_list_empty(&ctx->iopoll_list)) { + if (task_work_pending(current) || list_empty(&ctx->iopoll_list)) { mutex_unlock(&ctx->uring_lock); io_run_task_work(); mutex_lock(&ctx->uring_lock); } /* some requests don't go through iopoll_list */ - if (tail != ctx->cached_cq_tail || - wq_list_empty(&ctx->iopoll_list)) + if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list)) break; } ret = io_do_iopoll(ctx, !min_events); @@ -1683,25 +1680,17 @@ static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags) * how we do polling eventually, not spinning if we're on potentially * different devices. */ - if (wq_list_empty(&ctx->iopoll_list)) { + if (list_empty(&ctx->iopoll_list)) { ctx->poll_multi_queue = false; } else if (!ctx->poll_multi_queue) { struct io_kiocb *list_req; - list_req = container_of(ctx->iopoll_list.first, struct io_kiocb, - comp_list); + list_req = list_entry(ctx->iopoll_list.next, struct io_kiocb, iopoll_node); if (list_req->file != req->file) ctx->poll_multi_queue = true; } - /* - * For fast devices, IO may have already completed. If it has, add - * it to the front so we find it first. 
- */ - if (READ_ONCE(req->iopoll_completed)) - wq_list_add_head(&req->comp_list, &ctx->iopoll_list); - else - wq_list_add_tail(&req->comp_list, &ctx->iopoll_list); + list_add_tail(&req->iopoll_node, &ctx->iopoll_list); if (unlikely(needs_lock)) { /* diff --git a/io_uring/rw.c b/io_uring/rw.c index 70ca88cc1f54..307f1f39d9f3 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -1315,9 +1315,9 @@ static int io_uring_hybrid_poll(struct io_kiocb *req, int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) { - struct io_wq_work_node *pos, *start, *prev; unsigned int poll_flags = 0; DEFINE_IO_COMP_BATCH(iob); + struct io_kiocb *req, *tmp; int nr_events = 0; /* @@ -1327,8 +1327,7 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) if (ctx->poll_multi_queue || force_nonspin) poll_flags |= BLK_POLL_ONESHOT; - wq_list_for_each(pos, start, &ctx->iopoll_list) { - struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list); + list_for_each_entry(req, &ctx->iopoll_list, iopoll_node) { int ret; /* @@ -1357,31 +1356,20 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) if (!rq_list_empty(&iob.req_list)) iob.complete(&iob); - else if (!pos) - return 0; - - prev = start; - wq_list_for_each_resume(pos, prev) { - struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list); + list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) { /* order with io_complete_rw_iopoll(), e.g. ->result updates */ if (!smp_load_acquire(&req->iopoll_completed)) - break; + continue; + list_del(&req->iopoll_node); + wq_list_add_tail(&req->comp_list, &ctx->submit_state.compl_reqs); nr_events++; req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL); if (req->opcode != IORING_OP_URING_CMD) io_req_rw_cleanup(req, 0); } - if (unlikely(!nr_events)) - return 0; - - pos = start ? 
start->next : ctx->iopoll_list.first; - wq_list_cut(&ctx->iopoll_list, prev, start); - - if (WARN_ON_ONCE(!wq_list_empty(&ctx->submit_state.compl_reqs))) - return 0; - ctx->submit_state.compl_reqs.first = pos; - __io_submit_flush_completions(ctx); + if (nr_events) + __io_submit_flush_completions(ctx); return nr_events; } diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c index 74c1a130cd87..becdfdd323a9 100644 --- a/io_uring/sqpoll.c +++ b/io_uring/sqpoll.c @@ -212,7 +212,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd, if (cap_entries && to_submit > IORING_SQPOLL_CAP_ENTRIES_VALUE) to_submit = IORING_SQPOLL_CAP_ENTRIES_VALUE; - if (to_submit || !wq_list_empty(&ctx->iopoll_list)) { + if (to_submit || !list_empty(&ctx->iopoll_list)) { const struct cred *creds = NULL; io_sq_start_worktime(ist); @@ -221,7 +221,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd, creds = override_creds(ctx->sq_creds); mutex_lock(&ctx->uring_lock); - if (!wq_list_empty(&ctx->iopoll_list)) + if (!list_empty(&ctx->iopoll_list)) io_do_iopoll(ctx, true); /* @@ -344,7 +344,7 @@ static int io_sq_thread(void *data) list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) { int ret = __io_sq_thread(ctx, sqd, cap_entries, &ist); - if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list))) + if (!sqt_spin && (ret > 0 || !list_empty(&ctx->iopoll_list))) sqt_spin = true; } if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE)) @@ -379,7 +379,7 @@ static int io_sq_thread(void *data) atomic_or(IORING_SQ_NEED_WAKEUP, &ctx->rings->sq_flags); if ((ctx->flags & IORING_SETUP_IOPOLL) && - !wq_list_empty(&ctx->iopoll_list)) { + !list_empty(&ctx->iopoll_list)) { needs_sched = false; break; } -- Jens Axboe ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-11 4:10 ` Jens Axboe @ 2025-12-11 7:38 ` Fengnan 2025-12-11 10:22 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Fengnan @ 2025-12-11 7:38 UTC (permalink / raw) To: Jens Axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li 在 2025/12/11 12:10, Jens Axboe 写道: > On 12/10/25 7:15 PM, Jens Axboe wrote: >> On 12/10/25 1:55 AM, Fengnan Chang wrote: >>> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it >>> is considered that the current req is the actual completed request. >>> This may be reasonable for multi-queue ctx, but is problematic for >>> single-queue ctx because the current request may not be done when the >>> poll gets to the result. In this case, the completed io needs to wait >>> for the first io on the chain to complete before notifying the user, >>> which may cause io accumulation in the list. >>> Our modification plan is as follows: change io_wq_work_list to normal >>> list so that the iopoll_list list in it can be removed and put into the >>> comp_reqs list when the request is completed. This way each io is >>> handled independently and all gets processed in time. >>> >>> After modification, test with: >>> >>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 >>> /dev/nvme6n1 >>> >>> base IOPS is 725K, patch IOPS is 782K. >>> >>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1 >>> /dev/nvme6n1 >>> >>> Base IOPS is 880k, patch IOPS is 895K. >> A few notes on this: >> >> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is >> necessarily safe. Yes generally this is invoked from the >> owning/polling task, but that's not guaranteed. >> >> 2) The patch doesn't apply to the current tree, must be an older >> version? >> >> 3) When hand-applied, it still throws a compile warning about an unused >> variable. Please don't send untested stuff... >> >> 4) Don't just blatantly bloat the io_kiocb. When you change from a >> singly to a doubly linked list, you're growing the io_kiocb size. You >> should be able to use a union with struct io_task_work for example. >> That's already 16b in size - win/win as you don't need to slow down >> the cache management as that can keep using the linkage it currently >> is using, and you're not bloating the io_kiocb. >> >> 5) The already mentioned point about the cache free list now being >> doubly linked. This is generally a _bad_ idea as removing and adding >> entries now need to touch other entries too. That's not very cache >> friendly. >> >> #1 is kind of the big one, as it means you'll need to re-think how you >> do this. I do agree that the current approach isn't necessarily ideal as >> we don't process completions as quickly as we could, so I think there's >> merrit in continuing this work. > Proof of concept below, entirely untested, at a conference. Basically > does what I describe above, and retains the list manipulation logic > on the iopoll side, rather than on the completion side. Can you take > a look at this? And if it runs, can you do some numbers on that too? This patch works, and in my test case, the performance is identical to my patch. But there is a small problem, now looking for completed requests, always need to traverse the whole iopoll_list. this can be a bit inefficient in some cases, for example if the previous sent 128K io , the last io is 4K, the last io will be returned much earlier, this kind of scenario can not be verified in the current test. 
I'm not sure if it's a meaningful scenario. > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h > index e1adb0d20a0a..54fd30abf2b8 100644 > --- a/include/linux/io_uring_types.h > +++ b/include/linux/io_uring_types.h > @@ -316,7 +316,7 @@ struct io_ring_ctx { > * manipulate the list, hence no extra locking is needed there. > */ > bool poll_multi_queue; > - struct io_wq_work_list iopoll_list; > + struct list_head iopoll_list; > > struct io_file_table file_table; > struct io_rsrc_data buf_table; > @@ -708,7 +708,16 @@ struct io_kiocb { > > atomic_t refs; > bool cancel_seq_set; > - struct io_task_work io_task_work; > + > + /* > + * IOPOLL doesn't use task_work, so use the ->iopoll_node list > + * entry to manage pending iopoll requests. > + */ > + union { > + struct io_task_work io_task_work; > + struct list_head iopoll_node; > + }; > + > union { > /* > * for polled requests, i.e. IORING_OP_POLL_ADD and async armed > diff --git a/io_uring/cancel.c b/io_uring/cancel.c > index ca12ac10c0ae..4136bf04464a 100644 > --- a/io_uring/cancel.c > +++ b/io_uring/cancel.c > @@ -534,7 +534,7 @@ __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx, > /* SQPOLL thread does its own polling */ > if ((!(ctx->flags & IORING_SETUP_SQPOLL) && cancel_all) || > is_sqpoll_thread) { > - while (!wq_list_empty(&ctx->iopoll_list)) { > + while (!list_empty(&ctx->iopoll_list)) { > io_iopoll_try_reap_events(ctx); > ret = true; > cond_resched(); > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c > index 6cb24cdf8e68..fdb0b43f6fb5 100644 > --- a/io_uring/io_uring.c > +++ b/io_uring/io_uring.c > @@ -334,7 +334,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) > init_waitqueue_head(&ctx->poll_wq); > spin_lock_init(&ctx->completion_lock); > raw_spin_lock_init(&ctx->timeout_lock); > - INIT_WQ_LIST(&ctx->iopoll_list); > + INIT_LIST_HEAD(&ctx->iopoll_list); > INIT_LIST_HEAD(&ctx->defer_list); > INIT_LIST_HEAD(&ctx->timeout_list); > INIT_LIST_HEAD(&ctx->ltimeout_list); > @@ -1561,7 +1561,7 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx) > return; > > mutex_lock(&ctx->uring_lock); > - while (!wq_list_empty(&ctx->iopoll_list)) { > + while (!list_empty(&ctx->iopoll_list)) { > /* let it sleep and repeat later if can't complete a request */ > if (io_do_iopoll(ctx, true) == 0) > break; > @@ -1626,21 +1626,18 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events) > * forever, while the workqueue is stuck trying to acquire the > * very same mutex. > */ > - if (wq_list_empty(&ctx->iopoll_list) || > - io_task_work_pending(ctx)) { > + if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) { > u32 tail = ctx->cached_cq_tail; > > (void) io_run_local_work_locked(ctx, min_events); > > - if (task_work_pending(current) || > - wq_list_empty(&ctx->iopoll_list)) { > + if (task_work_pending(current) || list_empty(&ctx->iopoll_list)) { > mutex_unlock(&ctx->uring_lock); > io_run_task_work(); > mutex_lock(&ctx->uring_lock); > } > /* some requests don't go through iopoll_list */ > - if (tail != ctx->cached_cq_tail || > - wq_list_empty(&ctx->iopoll_list)) > + if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list)) > break; > } > ret = io_do_iopoll(ctx, !min_events); > @@ -1683,25 +1680,17 @@ static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags) > * how we do polling eventually, not spinning if we're on potentially > * different devices. 
> */ > - if (wq_list_empty(&ctx->iopoll_list)) { > + if (list_empty(&ctx->iopoll_list)) { > ctx->poll_multi_queue = false; > } else if (!ctx->poll_multi_queue) { > struct io_kiocb *list_req; > > - list_req = container_of(ctx->iopoll_list.first, struct io_kiocb, > - comp_list); > + list_req = list_entry(ctx->iopoll_list.next, struct io_kiocb, iopoll_node); > if (list_req->file != req->file) > ctx->poll_multi_queue = true; > } > > - /* > - * For fast devices, IO may have already completed. If it has, add > - * it to the front so we find it first. > - */ > - if (READ_ONCE(req->iopoll_completed)) > - wq_list_add_head(&req->comp_list, &ctx->iopoll_list); > - else > - wq_list_add_tail(&req->comp_list, &ctx->iopoll_list); > + list_add_tail(&req->iopoll_node, &ctx->iopoll_list); > > if (unlikely(needs_lock)) { > /* > diff --git a/io_uring/rw.c b/io_uring/rw.c > index 70ca88cc1f54..307f1f39d9f3 100644 > --- a/io_uring/rw.c > +++ b/io_uring/rw.c > @@ -1315,9 +1315,9 @@ static int io_uring_hybrid_poll(struct io_kiocb *req, > > int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) > { > - struct io_wq_work_node *pos, *start, *prev; > unsigned int poll_flags = 0; > DEFINE_IO_COMP_BATCH(iob); > + struct io_kiocb *req, *tmp; > int nr_events = 0; > > /* > @@ -1327,8 +1327,7 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) > if (ctx->poll_multi_queue || force_nonspin) > poll_flags |= BLK_POLL_ONESHOT; > > - wq_list_for_each(pos, start, &ctx->iopoll_list) { > - struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list); > + list_for_each_entry(req, &ctx->iopoll_list, iopoll_node) { > int ret; > > /* > @@ -1357,31 +1356,20 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) > > if (!rq_list_empty(&iob.req_list)) > iob.complete(&iob); > - else if (!pos) > - return 0; > - > - prev = start; > - wq_list_for_each_resume(pos, prev) { > - struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list); > > + list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) { > /* order with io_complete_rw_iopoll(), e.g. ->result updates */ > if (!smp_load_acquire(&req->iopoll_completed)) > - break; > + continue; > + list_del(&req->iopoll_node); > + wq_list_add_tail(&req->comp_list, &ctx->submit_state.compl_reqs); > nr_events++; > req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL); > if (req->opcode != IORING_OP_URING_CMD) > io_req_rw_cleanup(req, 0); > } > - if (unlikely(!nr_events)) > - return 0; > - > - pos = start ? 
start->next : ctx->iopoll_list.first; > - wq_list_cut(&ctx->iopoll_list, prev, start); > - > - if (WARN_ON_ONCE(!wq_list_empty(&ctx->submit_state.compl_reqs))) > - return 0; > - ctx->submit_state.compl_reqs.first = pos; > - __io_submit_flush_completions(ctx); > + if (nr_events) > + __io_submit_flush_completions(ctx); > return nr_events; > } > > diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c > index 74c1a130cd87..becdfdd323a9 100644 > --- a/io_uring/sqpoll.c > +++ b/io_uring/sqpoll.c > @@ -212,7 +212,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd, > if (cap_entries && to_submit > IORING_SQPOLL_CAP_ENTRIES_VALUE) > to_submit = IORING_SQPOLL_CAP_ENTRIES_VALUE; > > - if (to_submit || !wq_list_empty(&ctx->iopoll_list)) { > + if (to_submit || !list_empty(&ctx->iopoll_list)) { > const struct cred *creds = NULL; > > io_sq_start_worktime(ist); > @@ -221,7 +221,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd, > creds = override_creds(ctx->sq_creds); > > mutex_lock(&ctx->uring_lock); > - if (!wq_list_empty(&ctx->iopoll_list)) > + if (!list_empty(&ctx->iopoll_list)) > io_do_iopoll(ctx, true); > > /* > @@ -344,7 +344,7 @@ static int io_sq_thread(void *data) > list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) { > int ret = __io_sq_thread(ctx, sqd, cap_entries, &ist); > > - if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list))) > + if (!sqt_spin && (ret > 0 || !list_empty(&ctx->iopoll_list))) > sqt_spin = true; > } > if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE)) > @@ -379,7 +379,7 @@ static int io_sq_thread(void *data) > atomic_or(IORING_SQ_NEED_WAKEUP, > &ctx->rings->sq_flags); > if ((ctx->flags & IORING_SETUP_IOPOLL) && > - !wq_list_empty(&ctx->iopoll_list)) { > + !list_empty(&ctx->iopoll_list)) { > needs_sched = false; > break; > } > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-11 7:38 ` Fengnan @ 2025-12-11 10:22 ` Jens Axboe 2025-12-11 10:33 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2025-12-11 10:22 UTC (permalink / raw) To: Fengnan, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li On 12/11/25 12:38 AM, Fengnan wrote: > > > ? 2025/12/11 12:10, Jens Axboe ??: >> On 12/10/25 7:15 PM, Jens Axboe wrote: >>> On 12/10/25 1:55 AM, Fengnan Chang wrote: >>>> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it >>>> is considered that the current req is the actual completed request. >>>> This may be reasonable for multi-queue ctx, but is problematic for >>>> single-queue ctx because the current request may not be done when the >>>> poll gets to the result. In this case, the completed io needs to wait >>>> for the first io on the chain to complete before notifying the user, >>>> which may cause io accumulation in the list. >>>> Our modification plan is as follows: change io_wq_work_list to normal >>>> list so that the iopoll_list list in it can be removed and put into the >>>> comp_reqs list when the request is completed. This way each io is >>>> handled independently and all gets processed in time. >>>> >>>> After modification, test with: >>>> >>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 >>>> /dev/nvme6n1 >>>> >>>> base IOPS is 725K, patch IOPS is 782K. >>>> >>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1 >>>> /dev/nvme6n1 >>>> >>>> Base IOPS is 880k, patch IOPS is 895K. >>> A few notes on this: >>> >>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is >>> necessarily safe. Yes generally this is invoked from the >>> owning/polling task, but that's not guaranteed. >>> >>> 2) The patch doesn't apply to the current tree, must be an older >>> version? >>> >>> 3) When hand-applied, it still throws a compile warning about an unused >>> variable. Please don't send untested stuff... >>> >>> 4) Don't just blatantly bloat the io_kiocb. When you change from a >>> singly to a doubly linked list, you're growing the io_kiocb size. You >>> should be able to use a union with struct io_task_work for example. >>> That's already 16b in size - win/win as you don't need to slow down >>> the cache management as that can keep using the linkage it currently >>> is using, and you're not bloating the io_kiocb. >>> >>> 5) The already mentioned point about the cache free list now being >>> doubly linked. This is generally a _bad_ idea as removing and adding >>> entries now need to touch other entries too. That's not very cache >>> friendly. >>> >>> #1 is kind of the big one, as it means you'll need to re-think how you >>> do this. I do agree that the current approach isn't necessarily ideal as >>> we don't process completions as quickly as we could, so I think there's >>> merrit in continuing this work. >> Proof of concept below, entirely untested, at a conference. Basically >> does what I describe above, and retains the list manipulation logic >> on the iopoll side, rather than on the completion side. Can you take >> a look at this? And if it runs, can you do some numbers on that too? > > This patch works, and in my test case, the performance is identical to > my patch. Good! > But there is a small problem, now looking for completed requests, > always need to traverse the whole iopoll_list. 
this can be a bit > inefficient in some cases, for example if the previous sent 128K io , > the last io is 4K, the last io will be returned much earlier, this > kind of scenario can not be verified in the current test. I'm not sure > if it's a meaningful scenario. Not sure that's a big problem, you're just spinning to complete anyway. You could add your iob->nr_reqs or something, and break after finding those know have completed. That won't necessarily change anything, could still be the last one that completed. Would probably make more sense to pass in 'min_events' or similar and stop after that. But I think mostly tweaks that can be made after the fact. If you want to send out a new patch based on the one I sent, feel free to. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-11 10:22 ` Jens Axboe @ 2025-12-11 10:33 ` Jens Axboe 2025-12-11 11:13 ` Fengnan Chang 2025-12-12 1:41 ` Fengnan Chang 0 siblings, 2 replies; 22+ messages in thread From: Jens Axboe @ 2025-12-11 10:33 UTC (permalink / raw) To: Fengnan, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li On 12/11/25 3:22 AM, Jens Axboe wrote: > On 12/11/25 12:38 AM, Fengnan wrote: >> >> >> ? 2025/12/11 12:10, Jens Axboe ??: >>> On 12/10/25 7:15 PM, Jens Axboe wrote: >>>> On 12/10/25 1:55 AM, Fengnan Chang wrote: >>>>> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it >>>>> is considered that the current req is the actual completed request. >>>>> This may be reasonable for multi-queue ctx, but is problematic for >>>>> single-queue ctx because the current request may not be done when the >>>>> poll gets to the result. In this case, the completed io needs to wait >>>>> for the first io on the chain to complete before notifying the user, >>>>> which may cause io accumulation in the list. >>>>> Our modification plan is as follows: change io_wq_work_list to normal >>>>> list so that the iopoll_list list in it can be removed and put into the >>>>> comp_reqs list when the request is completed. This way each io is >>>>> handled independently and all gets processed in time. >>>>> >>>>> After modification, test with: >>>>> >>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 >>>>> /dev/nvme6n1 >>>>> >>>>> base IOPS is 725K, patch IOPS is 782K. >>>>> >>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1 >>>>> /dev/nvme6n1 >>>>> >>>>> Base IOPS is 880k, patch IOPS is 895K. >>>> A few notes on this: >>>> >>>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is >>>> necessarily safe. Yes generally this is invoked from the >>>> owning/polling task, but that's not guaranteed. >>>> >>>> 2) The patch doesn't apply to the current tree, must be an older >>>> version? >>>> >>>> 3) When hand-applied, it still throws a compile warning about an unused >>>> variable. Please don't send untested stuff... >>>> >>>> 4) Don't just blatantly bloat the io_kiocb. When you change from a >>>> singly to a doubly linked list, you're growing the io_kiocb size. You >>>> should be able to use a union with struct io_task_work for example. >>>> That's already 16b in size - win/win as you don't need to slow down >>>> the cache management as that can keep using the linkage it currently >>>> is using, and you're not bloating the io_kiocb. >>>> >>>> 5) The already mentioned point about the cache free list now being >>>> doubly linked. This is generally a _bad_ idea as removing and adding >>>> entries now need to touch other entries too. That's not very cache >>>> friendly. >>>> >>>> #1 is kind of the big one, as it means you'll need to re-think how you >>>> do this. I do agree that the current approach isn't necessarily ideal as >>>> we don't process completions as quickly as we could, so I think there's >>>> merrit in continuing this work. >>> Proof of concept below, entirely untested, at a conference. Basically >>> does what I describe above, and retains the list manipulation logic >>> on the iopoll side, rather than on the completion side. Can you take >>> a look at this? And if it runs, can you do some numbers on that too? >> >> This patch works, and in my test case, the performance is identical to >> my patch. > > Good! 
> >> But there is a small problem, now looking for completed requests, >> always need to traverse the whole iopoll_list. this can be a bit >> inefficient in some cases, for example if the previous sent 128K io , >> the last io is 4K, the last io will be returned much earlier, this >> kind of scenario can not be verified in the current test. I'm not sure >> if it's a meaningful scenario. > > Not sure that's a big problem, you're just spinning to complete anyway. > You could add your iob->nr_reqs or something, and break after finding > those know have completed. That won't necessarily change anything, could > still be the last one that completed. Would probably make more sense to > pass in 'min_events' or similar and stop after that. But I think mostly > tweaks that can be made after the fact. If you want to send out a new > patch based on the one I sent, feel free to. Eg, something like this on top would do that. Like I mentioned earlier, you cannot do the list manipulation where you did it, it's not safe. You have to defer it to reaping time. If we could do it from the callback where we mark it complete, then that would surely make things more trivial and avoid iteration when not needed. diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index cae9e857aea4..93709ee6c3ee 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -917,6 +917,7 @@ static inline bool blk_mq_add_to_batch(struct request *req, else if (iob->complete != complete) return false; iob->need_ts |= blk_mq_need_time_stamp(req); + iob->nr_reqs++; rq_list_add_tail(&iob->req_list, req); return true; } diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 72e34acd439c..9335b552e040 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -1821,6 +1821,7 @@ void bdev_fput(struct file *bdev_file); struct io_comp_batch { struct rq_list req_list; bool need_ts; + int nr_reqs; void (*complete)(struct io_comp_batch *); }; diff --git a/io_uring/rw.c b/io_uring/rw.c index 307f1f39d9f3..37b5b2328f24 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -1358,6 +1358,8 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) iob.complete(&iob); list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) { + if (nr_events == iob.nr_reqs) + break; /* order with io_complete_rw_iopoll(), e.g. ->result updates */ if (!smp_load_acquire(&req->iopoll_completed)) continue; -- Jens Axboe ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-11 10:33 ` Jens Axboe @ 2025-12-11 11:13 ` Fengnan Chang 2025-12-11 11:19 ` Jens Axboe 2025-12-12 1:41 ` Fengnan Chang 1 sibling, 1 reply; 22+ messages in thread From: Fengnan Chang @ 2025-12-11 11:13 UTC (permalink / raw) To: Jens Axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li 在 2025/12/11 18:33, Jens Axboe 写道: > On 12/11/25 3:22 AM, Jens Axboe wrote: >> On 12/11/25 12:38 AM, Fengnan wrote: >>> >>> ? 2025/12/11 12:10, Jens Axboe ??: >>>> On 12/10/25 7:15 PM, Jens Axboe wrote: >>>>> On 12/10/25 1:55 AM, Fengnan Chang wrote: >>>>>> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it >>>>>> is considered that the current req is the actual completed request. >>>>>> This may be reasonable for multi-queue ctx, but is problematic for >>>>>> single-queue ctx because the current request may not be done when the >>>>>> poll gets to the result. In this case, the completed io needs to wait >>>>>> for the first io on the chain to complete before notifying the user, >>>>>> which may cause io accumulation in the list. >>>>>> Our modification plan is as follows: change io_wq_work_list to normal >>>>>> list so that the iopoll_list list in it can be removed and put into the >>>>>> comp_reqs list when the request is completed. This way each io is >>>>>> handled independently and all gets processed in time. >>>>>> >>>>>> After modification, test with: >>>>>> >>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 >>>>>> /dev/nvme6n1 >>>>>> >>>>>> base IOPS is 725K, patch IOPS is 782K. >>>>>> >>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1 >>>>>> /dev/nvme6n1 >>>>>> >>>>>> Base IOPS is 880k, patch IOPS is 895K. >>>>> A few notes on this: >>>>> >>>>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is >>>>> necessarily safe. Yes generally this is invoked from the >>>>> owning/polling task, but that's not guaranteed. >>>>> >>>>> 2) The patch doesn't apply to the current tree, must be an older >>>>> version? >>>>> >>>>> 3) When hand-applied, it still throws a compile warning about an unused >>>>> variable. Please don't send untested stuff... >>>>> >>>>> 4) Don't just blatantly bloat the io_kiocb. When you change from a >>>>> singly to a doubly linked list, you're growing the io_kiocb size. You >>>>> should be able to use a union with struct io_task_work for example. >>>>> That's already 16b in size - win/win as you don't need to slow down >>>>> the cache management as that can keep using the linkage it currently >>>>> is using, and you're not bloating the io_kiocb. >>>>> >>>>> 5) The already mentioned point about the cache free list now being >>>>> doubly linked. This is generally a _bad_ idea as removing and adding >>>>> entries now need to touch other entries too. That's not very cache >>>>> friendly. >>>>> >>>>> #1 is kind of the big one, as it means you'll need to re-think how you >>>>> do this. I do agree that the current approach isn't necessarily ideal as >>>>> we don't process completions as quickly as we could, so I think there's >>>>> merrit in continuing this work. >>>> Proof of concept below, entirely untested, at a conference. Basically >>>> does what I describe above, and retains the list manipulation logic >>>> on the iopoll side, rather than on the completion side. Can you take >>>> a look at this? And if it runs, can you do some numbers on that too? 
>>> This patch works, and in my test case, the performance is identical to >>> my patch. >> Good! >> >>> But there is a small problem, now looking for completed requests, >>> always need to traverse the whole iopoll_list. this can be a bit >>> inefficient in some cases, for example if the previous sent 128K io , >>> the last io is 4K, the last io will be returned much earlier, this >>> kind of scenario can not be verified in the current test. I'm not sure >>> if it's a meaningful scenario. >> Not sure that's a big problem, you're just spinning to complete anyway. >> You could add your iob->nr_reqs or something, and break after finding >> those know have completed. That won't necessarily change anything, could >> still be the last one that completed. Would probably make more sense to >> pass in 'min_events' or similar and stop after that. But I think mostly >> tweaks that can be made after the fact. If you want to send out a new >> patch based on the one I sent, feel free to. > Eg, something like this on top would do that. Like I mentioned earlier, > you cannot do the list manipulation where you did it, it's not safe. You > have to defer it to reaping time. If we could do it from the callback > where we mark it complete, then that would surely make things more > trivial and avoid iteration when not needed. Yes, it's not safe do the list manipulation in io_complete_rw_iopoll. It looks more reasonable with the following modifications. Your changes look good enough, but please give me more time, I'd like to do some more testing and rethink this. > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h > index cae9e857aea4..93709ee6c3ee 100644 > --- a/include/linux/blk-mq.h > +++ b/include/linux/blk-mq.h > @@ -917,6 +917,7 @@ static inline bool blk_mq_add_to_batch(struct request *req, > else if (iob->complete != complete) > return false; > iob->need_ts |= blk_mq_need_time_stamp(req); > + iob->nr_reqs++; > rq_list_add_tail(&iob->req_list, req); > return true; > } > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 72e34acd439c..9335b552e040 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -1821,6 +1821,7 @@ void bdev_fput(struct file *bdev_file); > struct io_comp_batch { > struct rq_list req_list; > bool need_ts; > + int nr_reqs; > void (*complete)(struct io_comp_batch *); > }; > > diff --git a/io_uring/rw.c b/io_uring/rw.c > index 307f1f39d9f3..37b5b2328f24 100644 > --- a/io_uring/rw.c > +++ b/io_uring/rw.c > @@ -1358,6 +1358,8 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) > iob.complete(&iob); > > list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) { > + if (nr_events == iob.nr_reqs) > + break; > /* order with io_complete_rw_iopoll(), e.g. ->result updates */ > if (!smp_load_acquire(&req->iopoll_completed)) > continue; > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-11 11:13 ` Fengnan Chang @ 2025-12-11 11:19 ` Jens Axboe 0 siblings, 0 replies; 22+ messages in thread From: Jens Axboe @ 2025-12-11 11:19 UTC (permalink / raw) To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li On 12/11/25 4:13 AM, Fengnan Chang wrote: > > > ? 2025/12/11 18:33, Jens Axboe ??: >> On 12/11/25 3:22 AM, Jens Axboe wrote: >>> On 12/11/25 12:38 AM, Fengnan wrote: >>>> >>>> ? 2025/12/11 12:10, Jens Axboe ??: >>>>> On 12/10/25 7:15 PM, Jens Axboe wrote: >>>>>> On 12/10/25 1:55 AM, Fengnan Chang wrote: >>>>>>> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it >>>>>>> is considered that the current req is the actual completed request. >>>>>>> This may be reasonable for multi-queue ctx, but is problematic for >>>>>>> single-queue ctx because the current request may not be done when the >>>>>>> poll gets to the result. In this case, the completed io needs to wait >>>>>>> for the first io on the chain to complete before notifying the user, >>>>>>> which may cause io accumulation in the list. >>>>>>> Our modification plan is as follows: change io_wq_work_list to normal >>>>>>> list so that the iopoll_list list in it can be removed and put into the >>>>>>> comp_reqs list when the request is completed. This way each io is >>>>>>> handled independently and all gets processed in time. >>>>>>> >>>>>>> After modification, test with: >>>>>>> >>>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 >>>>>>> /dev/nvme6n1 >>>>>>> >>>>>>> base IOPS is 725K, patch IOPS is 782K. >>>>>>> >>>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1 >>>>>>> /dev/nvme6n1 >>>>>>> >>>>>>> Base IOPS is 880k, patch IOPS is 895K. >>>>>> A few notes on this: >>>>>> >>>>>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is >>>>>> necessarily safe. Yes generally this is invoked from the >>>>>> owning/polling task, but that's not guaranteed. >>>>>> >>>>>> 2) The patch doesn't apply to the current tree, must be an older >>>>>> version? >>>>>> >>>>>> 3) When hand-applied, it still throws a compile warning about an unused >>>>>> variable. Please don't send untested stuff... >>>>>> >>>>>> 4) Don't just blatantly bloat the io_kiocb. When you change from a >>>>>> singly to a doubly linked list, you're growing the io_kiocb size. You >>>>>> should be able to use a union with struct io_task_work for example. >>>>>> That's already 16b in size - win/win as you don't need to slow down >>>>>> the cache management as that can keep using the linkage it currently >>>>>> is using, and you're not bloating the io_kiocb. >>>>>> >>>>>> 5) The already mentioned point about the cache free list now being >>>>>> doubly linked. This is generally a _bad_ idea as removing and adding >>>>>> entries now need to touch other entries too. That's not very cache >>>>>> friendly. >>>>>> >>>>>> #1 is kind of the big one, as it means you'll need to re-think how you >>>>>> do this. I do agree that the current approach isn't necessarily ideal as >>>>>> we don't process completions as quickly as we could, so I think there's >>>>>> merrit in continuing this work. >>>>> Proof of concept below, entirely untested, at a conference. Basically >>>>> does what I describe above, and retains the list manipulation logic >>>>> on the iopoll side, rather than on the completion side. Can you take >>>>> a look at this? And if it runs, can you do some numbers on that too? 
>>>> This patch works, and in my test case, the performance is identical to >>>> my patch. >>> Good! >>> >>>> But there is a small problem, now looking for completed requests, >>>> always need to traverse the whole iopoll_list. this can be a bit >>>> inefficient in some cases, for example if the previous sent 128K io , >>>> the last io is 4K, the last io will be returned much earlier, this >>>> kind of scenario can not be verified in the current test. I'm not sure >>>> if it's a meaningful scenario. >>> Not sure that's a big problem, you're just spinning to complete anyway. >>> You could add your iob->nr_reqs or something, and break after finding >>> those know have completed. That won't necessarily change anything, could >>> still be the last one that completed. Would probably make more sense to >>> pass in 'min_events' or similar and stop after that. But I think mostly >>> tweaks that can be made after the fact. If you want to send out a new >>> patch based on the one I sent, feel free to. >> Eg, something like this on top would do that. Like I mentioned earlier, >> you cannot do the list manipulation where you did it, it's not safe. You >> have to defer it to reaping time. If we could do it from the callback >> where we mark it complete, then that would surely make things more >> trivial and avoid iteration when not needed. > > Yes, it's not safe do the list manipulation in io_complete_rw_iopoll. > It looks more reasonable with the following modifications. > Your changes look good enough, but please give me more time, I'd > like to do some more testing and rethink this. Of course, there's no rush - we're in the merge window anyway, so this will be targeted for 6.20/7.0 either way. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-11 10:33 ` Jens Axboe 2025-12-11 11:13 ` Fengnan Chang @ 2025-12-12 1:41 ` Fengnan Chang 2025-12-12 1:53 ` Jens Axboe 1 sibling, 1 reply; 22+ messages in thread From: Fengnan Chang @ 2025-12-12 1:41 UTC (permalink / raw) To: Jens Axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li 在 2025/12/11 18:33, Jens Axboe 写道: > On 12/11/25 3:22 AM, Jens Axboe wrote: >> On 12/11/25 12:38 AM, Fengnan wrote: >>> >>> ? 2025/12/11 12:10, Jens Axboe ??: >>>> On 12/10/25 7:15 PM, Jens Axboe wrote: >>>>> On 12/10/25 1:55 AM, Fengnan Chang wrote: >>>>>> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it >>>>>> is considered that the current req is the actual completed request. >>>>>> This may be reasonable for multi-queue ctx, but is problematic for >>>>>> single-queue ctx because the current request may not be done when the >>>>>> poll gets to the result. In this case, the completed io needs to wait >>>>>> for the first io on the chain to complete before notifying the user, >>>>>> which may cause io accumulation in the list. >>>>>> Our modification plan is as follows: change io_wq_work_list to normal >>>>>> list so that the iopoll_list list in it can be removed and put into the >>>>>> comp_reqs list when the request is completed. This way each io is >>>>>> handled independently and all gets processed in time. >>>>>> >>>>>> After modification, test with: >>>>>> >>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 >>>>>> /dev/nvme6n1 >>>>>> >>>>>> base IOPS is 725K, patch IOPS is 782K. >>>>>> >>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1 >>>>>> /dev/nvme6n1 >>>>>> >>>>>> Base IOPS is 880k, patch IOPS is 895K. >>>>> A few notes on this: >>>>> >>>>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is >>>>> necessarily safe. Yes generally this is invoked from the >>>>> owning/polling task, but that's not guaranteed. >>>>> >>>>> 2) The patch doesn't apply to the current tree, must be an older >>>>> version? >>>>> >>>>> 3) When hand-applied, it still throws a compile warning about an unused >>>>> variable. Please don't send untested stuff... >>>>> >>>>> 4) Don't just blatantly bloat the io_kiocb. When you change from a >>>>> singly to a doubly linked list, you're growing the io_kiocb size. You >>>>> should be able to use a union with struct io_task_work for example. >>>>> That's already 16b in size - win/win as you don't need to slow down >>>>> the cache management as that can keep using the linkage it currently >>>>> is using, and you're not bloating the io_kiocb. >>>>> >>>>> 5) The already mentioned point about the cache free list now being >>>>> doubly linked. This is generally a _bad_ idea as removing and adding >>>>> entries now need to touch other entries too. That's not very cache >>>>> friendly. >>>>> >>>>> #1 is kind of the big one, as it means you'll need to re-think how you >>>>> do this. I do agree that the current approach isn't necessarily ideal as >>>>> we don't process completions as quickly as we could, so I think there's >>>>> merrit in continuing this work. >>>> Proof of concept below, entirely untested, at a conference. Basically >>>> does what I describe above, and retains the list manipulation logic >>>> on the iopoll side, rather than on the completion side. Can you take >>>> a look at this? And if it runs, can you do some numbers on that too? 
>>> This patch works, and in my test case, the performance is identical to >>> my patch. >> Good! >> >>> But there is a small problem, now looking for completed requests, >>> always need to traverse the whole iopoll_list. this can be a bit >>> inefficient in some cases, for example if the previous sent 128K io , >>> the last io is 4K, the last io will be returned much earlier, this >>> kind of scenario can not be verified in the current test. I'm not sure >>> if it's a meaningful scenario. >> Not sure that's a big problem, you're just spinning to complete anyway. >> You could add your iob->nr_reqs or something, and break after finding >> those know have completed. That won't necessarily change anything, could >> still be the last one that completed. Would probably make more sense to >> pass in 'min_events' or similar and stop after that. But I think mostly >> tweaks that can be made after the fact. If you want to send out a new >> patch based on the one I sent, feel free to. > Eg, something like this on top would do that. Like I mentioned earlier, > you cannot do the list manipulation where you did it, it's not safe. You > have to defer it to reaping time. If we could do it from the callback > where we mark it complete, then that would surely make things more > trivial and avoid iteration when not needed. Oh, we can't add nr_events == iob.nr_reqs check, if blk_mq_add_to_batch add failed, completed IO will not add into iob, iob.nr_reqs will be 0, this may cause io hang. > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h > index cae9e857aea4..93709ee6c3ee 100644 > --- a/include/linux/blk-mq.h > +++ b/include/linux/blk-mq.h > @@ -917,6 +917,7 @@ static inline bool blk_mq_add_to_batch(struct request *req, > else if (iob->complete != complete) > return false; > iob->need_ts |= blk_mq_need_time_stamp(req); > + iob->nr_reqs++; > rq_list_add_tail(&iob->req_list, req); > return true; > } > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 72e34acd439c..9335b552e040 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -1821,6 +1821,7 @@ void bdev_fput(struct file *bdev_file); > struct io_comp_batch { > struct rq_list req_list; > bool need_ts; > + int nr_reqs; > void (*complete)(struct io_comp_batch *); > }; > > diff --git a/io_uring/rw.c b/io_uring/rw.c > index 307f1f39d9f3..37b5b2328f24 100644 > --- a/io_uring/rw.c > +++ b/io_uring/rw.c > @@ -1358,6 +1358,8 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) > iob.complete(&iob); > > list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) { > + if (nr_events == iob.nr_reqs) > + break; > /* order with io_complete_rw_iopoll(), e.g. ->result updates */ > if (!smp_load_acquire(&req->iopoll_completed)) > continue; > ^ permalink raw reply [flat|nested] 22+ messages in thread
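To make the hang Fengnan describes concrete: blk_mq_add_to_batch() is allowed to refuse a request (for example when the iob already carries a different ->complete callback, as visible in the header hunk quoted above), and the driver then completes that request individually, so iob->nr_reqs is never incremented for it. The following is only an illustrative sketch of that driver-side pattern, not code from any real driver; my_batch_complete() and my_complete_one() are invented placeholder names.

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* Placeholders for this sketch only, not real kernel symbols: a driver's
 * batch-completion callback and its ordinary per-request completion path. */
void my_batch_complete(struct io_comp_batch *iob);
void my_complete_one(struct request *req);

/*
 * Rough shape of a polled completion handler: for each request found to be
 * done, try to add it to the batch; if batching is refused, fall back to
 * completing the request individually.  The fallback still completes the
 * request - it will later be visible to io_do_iopoll() - but it was never
 * counted in iob->nr_reqs, so a "nr_events == iob.nr_reqs" exit condition
 * can fire before such requests have been reaped, which is the hang noted
 * above.
 */
void my_poll_one(struct request *req, struct io_comp_batch *iob, int ioerror)
{
        if (!blk_mq_add_to_batch(req, iob, ioerror, my_batch_complete))
                my_complete_one(req);
}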
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-12 1:41 ` Fengnan Chang @ 2025-12-12 1:53 ` Jens Axboe 2025-12-12 2:12 ` Fengnan Chang 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2025-12-12 1:53 UTC (permalink / raw) To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li On 12/11/25 6:41 PM, Fengnan Chang wrote: > Oh, we can't add nr_events == iob.nr_reqs check, if > blk_mq_add_to_batch add failed, completed IO will not add into iob, > iob.nr_reqs will be 0, this may cause io hang. Indeed, won't work as-is. I do think we're probably making a bigger deal out of the full loop than necessary. At least I'd be perfectly happy with just the current patch, performance should be better there than we currently have it. Ideally we'd have just one loop for polling and catching the completed items, but that's a bit tricky with the batch completions. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-12 1:53 ` Jens Axboe @ 2025-12-12 2:12 ` Fengnan Chang 2025-12-12 5:11 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Fengnan Chang @ 2025-12-12 2:12 UTC (permalink / raw) To: Jens Axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li On 2025/12/12 09:53, Jens Axboe wrote: > On 12/11/25 6:41 PM, Fengnan Chang wrote: >> Oh, we can't add nr_events == iob.nr_reqs check, if >> blk_mq_add_to_batch add failed, completed IO will not add into iob, >> iob.nr_reqs will be 0, this may cause io hang. > Indeed, won't work as-is. > > I do think we're probably making a bigger deal out of the full loop than > necessary. At least I'd be perfectly happy with just the current patch, > performance should be better there than we currently have it. Ideally > we'd have just one loop for polling and catching the completed items, > but that's a bit tricky with the batch completions. Yes, ideally one loop would be enough, but given that there are also multi-queue ctxs, that doesn't seem to be possible. > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-12 2:12 ` Fengnan Chang @ 2025-12-12 5:11 ` Jens Axboe 2025-12-12 8:58 ` Jens Axboe 2025-12-12 13:32 ` Diangang Li 0 siblings, 2 replies; 22+ messages in thread From: Jens Axboe @ 2025-12-12 5:11 UTC (permalink / raw) To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li On 12/11/25 7:12 PM, Fengnan Chang wrote: > > > ? 2025/12/12 09:53, Jens Axboe ??: >> On 12/11/25 6:41 PM, Fengnan Chang wrote: >>> Oh, we can't add nr_events == iob.nr_reqs check, if >>> blk_mq_add_to_batch add failed, completed IO will not add into iob, >>> iob.nr_reqs will be 0, this may cause io hang. >> Indeed, won't work as-is. >> >> I do think we're probably making a bigger deal out of the full loop than >> necessary. At least I'd be perfectly happy with just the current patch, >> performance should be better there than we currently have it. Ideally >> we'd have just one loop for polling and catching the completed items, >> but that's a bit tricky with the batch completions. > > Yes, ideally one loop would be enough, but given that there are also > multi_queue ctx, that doesn't seem to be possible. It's not removing the double loop, but the below could help _only_ iterate completed requests at the end. Rather than move items between the current list at the completion callback, have a separate list just for completed requests. Then we can simply iterate that, knowing all of them have completed. Gets rid of the ->iopoll_completed as well, and then we can move the poll_refs. Not really related at all, obviously this patch should be split into multiple pieces. This uses a lockless list. But since the producer and consumer are generally the same task, that should not add any real overhead. On top of the previous one I sent. What do you think? diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index 54fd30abf2b8..2d67d95a64ee 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -317,6 +317,7 @@ struct io_ring_ctx { */ bool poll_multi_queue; struct list_head iopoll_list; + struct llist_head iopoll_complete; struct io_file_table file_table; struct io_rsrc_data buf_table; @@ -672,8 +673,9 @@ struct io_kiocb { }; u8 opcode; - /* polled IO has completed */ - u8 iopoll_completed; + + bool cancel_seq_set; + /* * Can be either a fixed buffer index, or used with provided buffers. * For the latter, it points to the selected buffer ID. 
@@ -700,6 +702,7 @@ struct io_kiocb { union { /* used by request caches, completion batching and iopoll */ struct io_wq_work_node comp_list; + struct llist_node iopoll_done_list; /* cache ->apoll->events */ __poll_t apoll_events; }; @@ -707,7 +710,7 @@ struct io_kiocb { struct io_rsrc_node *file_node; atomic_t refs; - bool cancel_seq_set; + atomic_t poll_refs; /* * IOPOLL doesn't use task_work, so use the ->iopoll_node list @@ -734,7 +737,6 @@ struct io_kiocb { /* opcode allocated if it needs to store data for async defer */ void *async_data; /* linked requests, IFF REQ_F_HARDLINK or REQ_F_LINK are set */ - atomic_t poll_refs; struct io_kiocb *link; /* custom credentials, valid IFF REQ_F_CREDS is set */ const struct cred *creds; diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 05a660c97316..5e503a0bfcfc 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -335,6 +335,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) spin_lock_init(&ctx->completion_lock); raw_spin_lock_init(&ctx->timeout_lock); INIT_LIST_HEAD(&ctx->iopoll_list); + init_llist_head(&ctx->iopoll_complete); INIT_LIST_HEAD(&ctx->defer_list); INIT_LIST_HEAD(&ctx->timeout_list); INIT_LIST_HEAD(&ctx->ltimeout_list); diff --git a/io_uring/rw.c b/io_uring/rw.c index 307f1f39d9f3..ad481ca74a46 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -604,8 +604,7 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res) req->cqe.res = res; } - /* order with io_iopoll_complete() checking ->iopoll_completed */ - smp_store_release(&req->iopoll_completed, 1); + llist_add(&req->iopoll_done_list, &req->ctx->iopoll_complete); } static inline void io_rw_done(struct io_kiocb *req, ssize_t ret) @@ -870,7 +869,6 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type) return -EOPNOTSUPP; kiocb->private = NULL; kiocb->ki_flags |= IOCB_HIPRI; - req->iopoll_completed = 0; if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) { /* make sure every req only blocks once*/ req->flags &= ~REQ_F_IOPOLL_STATE; @@ -1317,7 +1315,8 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) { unsigned int poll_flags = 0; DEFINE_IO_COMP_BATCH(iob); - struct io_kiocb *req, *tmp; + struct llist_node *node; + struct io_kiocb *req; int nr_events = 0; /* @@ -1327,17 +1326,12 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) if (ctx->poll_multi_queue || force_nonspin) poll_flags |= BLK_POLL_ONESHOT; + /* + * Loop over uncompleted polled IO requests, and poll for them. + */ list_for_each_entry(req, &ctx->iopoll_list, iopoll_node) { int ret; - /* - * Move completed and retryable entries to our local lists. - * If we find a request that requires polling, break out - * and complete those lists first, if we have entries there. - */ - if (READ_ONCE(req->iopoll_completed)) - break; - if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) ret = io_uring_hybrid_poll(req, &iob, poll_flags); else @@ -1349,24 +1343,25 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) poll_flags |= BLK_POLL_ONESHOT; /* iopoll may have completed current req */ - if (!rq_list_empty(&iob.req_list) || - READ_ONCE(req->iopoll_completed)) + if (!rq_list_empty(&iob.req_list)) break; } if (!rq_list_empty(&iob.req_list)) iob.complete(&iob); - list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) { - /* order with io_complete_rw_iopoll(), e.g. 
->result updates */ - if (!smp_load_acquire(&req->iopoll_completed)) - continue; + node = llist_del_all(&ctx->iopoll_complete); + while (node) { + struct llist_node *next = node->next; + + req = container_of(node, struct io_kiocb, iopoll_done_list); list_del(&req->iopoll_node); wq_list_add_tail(&req->comp_list, &ctx->submit_state.compl_reqs); nr_events++; req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL); if (req->opcode != IORING_OP_URING_CMD) io_req_rw_cleanup(req, 0); + node = next; } if (nr_events) __io_submit_flush_completions(ctx); diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c index 197474911f04..0841fa541f5d 100644 --- a/io_uring/uring_cmd.c +++ b/io_uring/uring_cmd.c @@ -159,8 +159,7 @@ void __io_uring_cmd_done(struct io_uring_cmd *ioucmd, s32 ret, u64 res2, } io_req_uring_cleanup(req, issue_flags); if (req->ctx->flags & IORING_SETUP_IOPOLL) { - /* order with io_iopoll_req_issued() checking ->iopoll_complete */ - smp_store_release(&req->iopoll_completed, 1); + llist_add(&req->iopoll_done_list, &req->ctx->iopoll_complete); } else if (issue_flags & IO_URING_F_COMPLETE_DEFER) { if (WARN_ON_ONCE(issue_flags & IO_URING_F_UNLOCKED)) return; @@ -252,7 +251,6 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags) if (!file->f_op->uring_cmd_iopoll) return -EOPNOTSUPP; issue_flags |= IO_URING_F_IOPOLL; - req->iopoll_completed = 0; if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) { /* make sure every req only blocks once */ req->flags &= ~REQ_F_IOPOLL_STATE; -- Jens Axboe ^ permalink raw reply related [flat|nested] 22+ messages in thread
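The heart of the sketch above is the <linux/llist.h> producer/consumer pattern: the completion side only ever does llist_add() onto ctx->iopoll_complete, and the polling task later detaches the whole chain with llist_del_all() and walks it. Below is a self-contained illustration of just that pattern, assuming nothing beyond the llist API itself; struct my_req and the function names are invented for the example, and one detail worth keeping in mind is that llist_del_all() hands the nodes back in reverse (most-recently-added first) order.

#include <linux/llist.h>

/* Invented for this sketch; stands in for io_kiocb plus ctx->iopoll_complete. */
struct my_req {
        struct llist_node done_node;
        int result;
};

static LLIST_HEAD(my_done_list);

/* Completion side: safe from any context, no locking required. */
static void my_req_done(struct my_req *req, int result)
{
        req->result = result;
        llist_add(&req->done_node, &my_done_list);
}

/* Reaping side (the polling task): detach everything in one atomic step,
 * then walk the detached chain without any further synchronization. */
static int my_reap_done(void)
{
        struct llist_node *node = llist_del_all(&my_done_list);
        struct my_req *req, *tmp;
        int nr = 0;

        llist_for_each_entry_safe(req, tmp, node, done_node) {
                /* post a completion for req->result here */
                nr++;
        }
        return nr;
}

Since the producer here (the callback run from iob.complete()) and the consumer (io_do_iopoll()) are normally the same task, the atomic update in llist_add() is effectively uncontended, which is why the patch can reasonably claim it adds no real overhead.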
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-12 5:11 ` Jens Axboe @ 2025-12-12 8:58 ` Jens Axboe 2025-12-12 9:49 ` Fengnan Chang 2025-12-12 13:32 ` Diangang Li 1 sibling, 1 reply; 22+ messages in thread From: Jens Axboe @ 2025-12-12 8:58 UTC (permalink / raw) To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li On 12/11/25 10:11 PM, Jens Axboe wrote: > On 12/11/25 7:12 PM, Fengnan Chang wrote: >> >> >> ? 2025/12/12 09:53, Jens Axboe ??: >>> On 12/11/25 6:41 PM, Fengnan Chang wrote: >>>> Oh, we can't add nr_events == iob.nr_reqs check, if >>>> blk_mq_add_to_batch add failed, completed IO will not add into iob, >>>> iob.nr_reqs will be 0, this may cause io hang. >>> Indeed, won't work as-is. >>> >>> I do think we're probably making a bigger deal out of the full loop than >>> necessary. At least I'd be perfectly happy with just the current patch, >>> performance should be better there than we currently have it. Ideally >>> we'd have just one loop for polling and catching the completed items, >>> but that's a bit tricky with the batch completions. >> >> Yes, ideally one loop would be enough, but given that there are also >> multi_queue ctx, that doesn't seem to be possible. > > It's not removing the double loop, but the below could help _only_ > iterate completed requests at the end. Rather than move items between > the current list at the completion callback, have a separate list just > for completed requests. Then we can simply iterate that, knowing all of > them have completed. Gets rid of the ->iopoll_completed as well, and > then we can move the poll_refs. Not really related at all, obviously > this patch should be split into multiple pieces. > > This uses a lockless list. But since the producer and consumer are > generally the same task, that should not add any real overhead. On top > of the previous one I sent. What do you think? Ran some quick testing, as one interesting case is mixing slower and faster devices. Let's take this basic example: t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 -t1 -n2 /dev/nvme32n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 where nvme32n1 is about 1.27M IOPS, and the other 3 do about 3.3M IOPS, and we poll 2 devices with each IO thread. With the current kernel, we get: IOPS=5.18M, BW=2.53GiB/s, IOS/call=31/32 IOPS=5.19M, BW=2.53GiB/s, IOS/call=31/31 IOPS=5.17M, BW=2.53GiB/s, IOS/call=31/31 and with the two patches we get: IOPS=6.54M, BW=3.19GiB/s, IOS/call=31/31 IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31 IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31 or about a 25% improvement. This is mostly due to the issue you highlighted, where you end up with later completions (that are done) being stuck behind waiting on a slower completion. Note: obviously 1 thread driving multiple devices for polling could still be improved, and in fact it does improve if we simply change -c32 to something lower. The more important case is the one you identified, where different completion times on the same device will hold completions up. Multi device polling is just an interesting way to kind of emulate that, to an extent. This effect is (naturally) also apparent in the completion latencies as well, particularly in the higher percentiles. Ran peak testing too, and it's better all around than before. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-12 8:58 ` Jens Axboe @ 2025-12-12 9:49 ` Fengnan Chang 2025-12-12 20:22 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Fengnan Chang @ 2025-12-12 9:49 UTC (permalink / raw) To: Jens Axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li 在 2025/12/12 16:58, Jens Axboe 写道: > On 12/11/25 10:11 PM, Jens Axboe wrote: >> On 12/11/25 7:12 PM, Fengnan Chang wrote: >>> >>> ? 2025/12/12 09:53, Jens Axboe ??: >>>> On 12/11/25 6:41 PM, Fengnan Chang wrote: >>>>> Oh, we can't add nr_events == iob.nr_reqs check, if >>>>> blk_mq_add_to_batch add failed, completed IO will not add into iob, >>>>> iob.nr_reqs will be 0, this may cause io hang. >>>> Indeed, won't work as-is. >>>> >>>> I do think we're probably making a bigger deal out of the full loop than >>>> necessary. At least I'd be perfectly happy with just the current patch, >>>> performance should be better there than we currently have it. Ideally >>>> we'd have just one loop for polling and catching the completed items, >>>> but that's a bit tricky with the batch completions. >>> Yes, ideally one loop would be enough, but given that there are also >>> multi_queue ctx, that doesn't seem to be possible. >> It's not removing the double loop, but the below could help _only_ >> iterate completed requests at the end. Rather than move items between >> the current list at the completion callback, have a separate list just >> for completed requests. Then we can simply iterate that, knowing all of >> them have completed. Gets rid of the ->iopoll_completed as well, and >> then we can move the poll_refs. Not really related at all, obviously >> this patch should be split into multiple pieces. >> >> This uses a lockless list. But since the producer and consumer are >> generally the same task, that should not add any real overhead. On top >> of the previous one I sent. What do you think? > Ran some quick testing, as one interesting case is mixing slower and > faster devices. Let's take this basic example: > > t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 -t1 -n2 /dev/nvme32n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 > > where nvme32n1 is about 1.27M IOPS, and the other 3 do about 3.3M IOPS, > and we poll 2 devices with each IO thread. With the current kernel, we > get: > > IOPS=5.18M, BW=2.53GiB/s, IOS/call=31/32 > IOPS=5.19M, BW=2.53GiB/s, IOS/call=31/31 > IOPS=5.17M, BW=2.53GiB/s, IOS/call=31/31 > > and with the two patches we get: > > IOPS=6.54M, BW=3.19GiB/s, IOS/call=31/31 > IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31 > IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31 > > or about a 25% improvement. This is mostly due to the issue you > highlighted, where you end up with later completions (that are done) > being stuck behind waiting on a slower completion. > > Note: obviously 1 thread driving multiple devices for polling could > still be improved, and in fact it does improve if we simply change -c32 > to something lower. The more important case is the one you identified, > where different completion times on the same device will hold > completions up. Multi device polling is just an interesting way to kind > of emulate that, to an extent. > > This effect is (naturally) also apparent in the completion latencies as > well, particularly in the higher percentiles. > > Ran peak testing too, and it's better all around than before. 
I love the patch, I had a similar thought, and this addresses my concern. I ran a simple test and the performance is a bit better than before: base IOPS is 725K, previous IOPS is 782K, now 790K. It looks like all the problems are solved; I'll do more testing next week. > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-12 9:49 ` Fengnan Chang @ 2025-12-12 20:22 ` Jens Axboe 0 siblings, 0 replies; 22+ messages in thread From: Jens Axboe @ 2025-12-12 20:22 UTC (permalink / raw) To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li On 12/12/25 2:49 AM, Fengnan Chang wrote: > > > ? 2025/12/12 16:58, Jens Axboe ??: >> On 12/11/25 10:11 PM, Jens Axboe wrote: >>> On 12/11/25 7:12 PM, Fengnan Chang wrote: >>>> >>>> ? 2025/12/12 09:53, Jens Axboe ??: >>>>> On 12/11/25 6:41 PM, Fengnan Chang wrote: >>>>>> Oh, we can't add nr_events == iob.nr_reqs check, if >>>>>> blk_mq_add_to_batch add failed, completed IO will not add into iob, >>>>>> iob.nr_reqs will be 0, this may cause io hang. >>>>> Indeed, won't work as-is. >>>>> >>>>> I do think we're probably making a bigger deal out of the full loop than >>>>> necessary. At least I'd be perfectly happy with just the current patch, >>>>> performance should be better there than we currently have it. Ideally >>>>> we'd have just one loop for polling and catching the completed items, >>>>> but that's a bit tricky with the batch completions. >>>> Yes, ideally one loop would be enough, but given that there are also >>>> multi_queue ctx, that doesn't seem to be possible. >>> It's not removing the double loop, but the below could help _only_ >>> iterate completed requests at the end. Rather than move items between >>> the current list at the completion callback, have a separate list just >>> for completed requests. Then we can simply iterate that, knowing all of >>> them have completed. Gets rid of the ->iopoll_completed as well, and >>> then we can move the poll_refs. Not really related at all, obviously >>> this patch should be split into multiple pieces. >>> >>> This uses a lockless list. But since the producer and consumer are >>> generally the same task, that should not add any real overhead. On top >>> of the previous one I sent. What do you think? >> Ran some quick testing, as one interesting case is mixing slower and >> faster devices. Let's take this basic example: >> >> t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 -t1 -n2 /dev/nvme32n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 >> >> where nvme32n1 is about 1.27M IOPS, and the other 3 do about 3.3M IOPS, >> and we poll 2 devices with each IO thread. With the current kernel, we >> get: >> >> IOPS=5.18M, BW=2.53GiB/s, IOS/call=31/32 >> IOPS=5.19M, BW=2.53GiB/s, IOS/call=31/31 >> IOPS=5.17M, BW=2.53GiB/s, IOS/call=31/31 >> >> and with the two patches we get: >> >> IOPS=6.54M, BW=3.19GiB/s, IOS/call=31/31 >> IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31 >> IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31 >> >> or about a 25% improvement. This is mostly due to the issue you >> highlighted, where you end up with later completions (that are done) >> being stuck behind waiting on a slower completion. >> >> Note: obviously 1 thread driving multiple devices for polling could >> still be improved, and in fact it does improve if we simply change -c32 >> to something lower. The more important case is the one you identified, >> where different completion times on the same device will hold >> completions up. Multi device polling is just an interesting way to kind >> of emulate that, to an extent. >> >> This effect is (naturally) also apparent in the completion latencies as >> well, particularly in the higher percentiles. >> >> Ran peak testing too, and it's better all around than before. 
> I love the patch, I had a similar thought, and this addresses my concern. I ran a simple > test and the performance is a bit better than before: > > base IOPS is 725K, previous IOPS is 782K, now 790K. > It looks like all the problems are solved; I'll do more testing next week. Nice, thanks! FWIW, I put the patches in a branch here, just to have them somewhat organized and easier to iterate/test on: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/log/?h=io_uring-iopoll -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-12 5:11 ` Jens Axboe 2025-12-12 8:58 ` Jens Axboe @ 2025-12-12 13:32 ` Diangang Li 2025-12-12 20:09 ` Jens Axboe 1 sibling, 1 reply; 22+ messages in thread From: Diangang Li @ 2025-12-12 13:32 UTC (permalink / raw) To: Jens Axboe, Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang On 2025/12/12 13:11, Jens Axboe wrote: > On 12/11/25 7:12 PM, Fengnan Chang wrote: >> >> >> ? 2025/12/12 09:53, Jens Axboe ??: >>> On 12/11/25 6:41 PM, Fengnan Chang wrote: >>>> Oh, we can't add nr_events == iob.nr_reqs check, if >>>> blk_mq_add_to_batch add failed, completed IO will not add into iob, >>>> iob.nr_reqs will be 0, this may cause io hang. >>> Indeed, won't work as-is. >>> >>> I do think we're probably making a bigger deal out of the full loop than >>> necessary. At least I'd be perfectly happy with just the current patch, >>> performance should be better there than we currently have it. Ideally >>> we'd have just one loop for polling and catching the completed items, >>> but that's a bit tricky with the batch completions. >> >> Yes, ideally one loop would be enough, but given that there are also >> multi_queue ctx, that doesn't seem to be possible. > > It's not removing the double loop, but the below could help _only_ > iterate completed requests at the end. Rather than move items between > the current list at the completion callback, have a separate list just > for completed requests. Then we can simply iterate that, knowing all of > them have completed. Gets rid of the ->iopoll_completed as well, and > then we can move the poll_refs. Not really related at all, obviously > this patch should be split into multiple pieces. > > This uses a lockless list. But since the producer and consumer are > generally the same task, that should not add any real overhead. On top > of the previous one I sent. What do you think? > > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h > index 54fd30abf2b8..2d67d95a64ee 100644 > --- a/include/linux/io_uring_types.h > +++ b/include/linux/io_uring_types.h > @@ -317,6 +317,7 @@ struct io_ring_ctx { > */ > bool poll_multi_queue; > struct list_head iopoll_list; > + struct llist_head iopoll_complete; > > struct io_file_table file_table; > struct io_rsrc_data buf_table; > @@ -672,8 +673,9 @@ struct io_kiocb { > }; > > u8 opcode; > - /* polled IO has completed */ > - u8 iopoll_completed; > + > + bool cancel_seq_set; > + > /* > * Can be either a fixed buffer index, or used with provided buffers. > * For the latter, it points to the selected buffer ID. 
> @@ -700,6 +702,7 @@ struct io_kiocb { > union { > /* used by request caches, completion batching and iopoll */ > struct io_wq_work_node comp_list; > + struct llist_node iopoll_done_list; > /* cache ->apoll->events */ > __poll_t apoll_events; > }; > @@ -707,7 +710,7 @@ struct io_kiocb { > struct io_rsrc_node *file_node; > > atomic_t refs; > - bool cancel_seq_set; > + atomic_t poll_refs; > > /* > * IOPOLL doesn't use task_work, so use the ->iopoll_node list > @@ -734,7 +737,6 @@ struct io_kiocb { > /* opcode allocated if it needs to store data for async defer */ > void *async_data; > /* linked requests, IFF REQ_F_HARDLINK or REQ_F_LINK are set */ > - atomic_t poll_refs; > struct io_kiocb *link; > /* custom credentials, valid IFF REQ_F_CREDS is set */ > const struct cred *creds; > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c > index 05a660c97316..5e503a0bfcfc 100644 > --- a/io_uring/io_uring.c > +++ b/io_uring/io_uring.c > @@ -335,6 +335,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) > spin_lock_init(&ctx->completion_lock); > raw_spin_lock_init(&ctx->timeout_lock); > INIT_LIST_HEAD(&ctx->iopoll_list); > + init_llist_head(&ctx->iopoll_complete); > INIT_LIST_HEAD(&ctx->defer_list); > INIT_LIST_HEAD(&ctx->timeout_list); > INIT_LIST_HEAD(&ctx->ltimeout_list); > diff --git a/io_uring/rw.c b/io_uring/rw.c > index 307f1f39d9f3..ad481ca74a46 100644 > --- a/io_uring/rw.c > +++ b/io_uring/rw.c > @@ -604,8 +604,7 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res) > req->cqe.res = res; > } > > - /* order with io_iopoll_complete() checking ->iopoll_completed */ > - smp_store_release(&req->iopoll_completed, 1); > + llist_add(&req->iopoll_done_list, &req->ctx->iopoll_complete); > } > > static inline void io_rw_done(struct io_kiocb *req, ssize_t ret) > @@ -870,7 +869,6 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type) > return -EOPNOTSUPP; > kiocb->private = NULL; > kiocb->ki_flags |= IOCB_HIPRI; > - req->iopoll_completed = 0; > if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) { > /* make sure every req only blocks once*/ > req->flags &= ~REQ_F_IOPOLL_STATE; > @@ -1317,7 +1315,8 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) > { > unsigned int poll_flags = 0; > DEFINE_IO_COMP_BATCH(iob); > - struct io_kiocb *req, *tmp; > + struct llist_node *node; > + struct io_kiocb *req; > int nr_events = 0; > > /* > @@ -1327,17 +1326,12 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) > if (ctx->poll_multi_queue || force_nonspin) > poll_flags |= BLK_POLL_ONESHOT; > > + /* > + * Loop over uncompleted polled IO requests, and poll for them. > + */ > list_for_each_entry(req, &ctx->iopoll_list, iopoll_node) { > int ret; > > - /* > - * Move completed and retryable entries to our local lists. > - * If we find a request that requires polling, break out > - * and complete those lists first, if we have entries there. > - */ > - if (READ_ONCE(req->iopoll_completed)) > - break; Suggest keeping iopoll_completed here to avoid unnecessary subsequent polling and to process IRQ-completed requests promptly. 
> - > if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) > ret = io_uring_hybrid_poll(req, &iob, poll_flags); > else > @@ -1349,24 +1343,25 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) > poll_flags |= BLK_POLL_ONESHOT; > > /* iopoll may have completed current req */ > - if (!rq_list_empty(&iob.req_list) || > - READ_ONCE(req->iopoll_completed)) > + if (!rq_list_empty(&iob.req_list)) > break; > } > > if (!rq_list_empty(&iob.req_list)) > iob.complete(&iob); > > - list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) { > - /* order with io_complete_rw_iopoll(), e.g. ->result updates */ > - if (!smp_load_acquire(&req->iopoll_completed)) > - continue; > + node = llist_del_all(&ctx->iopoll_complete); > + while (node) { > + struct llist_node *next = node->next; > + > + req = container_of(node, struct io_kiocb, iopoll_done_list); > list_del(&req->iopoll_node); > wq_list_add_tail(&req->comp_list, &ctx->submit_state.compl_reqs); > nr_events++; > req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL); > if (req->opcode != IORING_OP_URING_CMD) > io_req_rw_cleanup(req, 0); > + node = next; > } > if (nr_events) > __io_submit_flush_completions(ctx); > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > index 197474911f04..0841fa541f5d 100644 > --- a/io_uring/uring_cmd.c > +++ b/io_uring/uring_cmd.c > @@ -159,8 +159,7 @@ void __io_uring_cmd_done(struct io_uring_cmd *ioucmd, s32 ret, u64 res2, > } > io_req_uring_cleanup(req, issue_flags); > if (req->ctx->flags & IORING_SETUP_IOPOLL) { > - /* order with io_iopoll_req_issued() checking ->iopoll_complete */ > - smp_store_release(&req->iopoll_completed, 1); > + llist_add(&req->iopoll_done_list, &req->ctx->iopoll_complete); > } else if (issue_flags & IO_URING_F_COMPLETE_DEFER) { > if (WARN_ON_ONCE(issue_flags & IO_URING_F_UNLOCKED)) > return; > @@ -252,7 +251,6 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags) > if (!file->f_op->uring_cmd_iopoll) > return -EOPNOTSUPP; > issue_flags |= IO_URING_F_IOPOLL; > - req->iopoll_completed = 0; > if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) { > /* make sure every req only blocks once */ > req->flags &= ~REQ_F_IOPOLL_STATE; ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode 2025-12-12 13:32 ` Diangang Li @ 2025-12-12 20:09 ` Jens Axboe 0 siblings, 0 replies; 22+ messages in thread From: Jens Axboe @ 2025-12-12 20:09 UTC (permalink / raw) To: Diangang Li, Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang On 12/12/25 6:32 AM, Diangang Li wrote: >> @@ -1327,17 +1326,12 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin) >> if (ctx->poll_multi_queue || force_nonspin) >> poll_flags |= BLK_POLL_ONESHOT; >> >> + /* >> + * Loop over uncompleted polled IO requests, and poll for them. >> + */ >> list_for_each_entry(req, &ctx->iopoll_list, iopoll_node) { >> int ret; >> >> - /* >> - * Move completed and retryable entries to our local lists. >> - * If we find a request that requires polling, break out >> - * and complete those lists first, if we have entries there. >> - */ >> - if (READ_ONCE(req->iopoll_completed)) >> - break; > > Suggest keeping iopoll_completed here to avoid unnecessary subsequent > polling and to process IRQ-completed requests promptly. There should not be any IRQ completed requests in here. The block layer used to allow that, but that should no longer be the case. If it's a polled request, then it will by definition end up in a polled queue and need iopoll completion. Or it'll sit for a while and be completed by a timeout. If someone is still allowing polled IO with IRQ completions then that should be fixed, and there's no reason why we should try and catch those cases here. Will not happen with NVMe, for example. For the other point, there's no way for ->iopoll_completed to be set before iob.complete(&iob) has been called to invoke the callback, which in turn sets it. Hence there's no way for us to end up in this loop and have ->iopoll_completed set. So no, I don't think we need it, and honestly it wasn't even really needed the posted changes either. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: (subset) [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode 2025-12-10 8:54 [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode Fengnan Chang 2025-12-10 8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang 2025-12-10 8:55 ` [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode Fengnan Chang @ 2025-12-10 9:53 ` Jens Axboe 2 siblings, 0 replies; 22+ messages in thread From: Jens Axboe @ 2025-12-10 9:53 UTC (permalink / raw) To: asml.silence, io-uring, Fengnan Chang; +Cc: Fengnan Chang On Wed, 10 Dec 2025 16:54:59 +0800, Fengnan Chang wrote: > In the io_do_iopoll function, when the poll loop of iopoll_list ends, it > is considered that the current req is the actual completed request. > This may be reasonable for multi-queue ctx, but is problematic for > single-queue ctx because the current request may not be done when the > poll gets to the result. In this case, the completed io needs to wait > for the first io on the chain to complete before notifying the user, > which may cause io accumulation in the list. > Our modification plan is as follows: change io_wq_work_list to normal > list so that the iopoll_list list in it can be removed and put into the > comp_reqs list when the request is completed. This way each io is > handled independently and all gets processed in time. > > [...] Applied, thanks! [1/2] blk-mq: delete task running check in blk_hctx_poll (no commit info) Best regards, -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads: [~2025-12-12 20:22 UTC | newest] Thread overview: 22+ messages 2025-12-10 8:54 [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode Fengnan Chang 2025-12-10 8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang 2025-12-10 9:19 ` Jens Axboe 2025-12-10 9:53 ` Jens Axboe 2025-12-10 8:55 ` [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode Fengnan Chang 2025-12-11 2:15 ` Jens Axboe 2025-12-11 4:10 ` Jens Axboe 2025-12-11 7:38 ` Fengnan 2025-12-11 10:22 ` Jens Axboe 2025-12-11 10:33 ` Jens Axboe 2025-12-11 11:13 ` Fengnan Chang 2025-12-11 11:19 ` Jens Axboe 2025-12-12 1:41 ` Fengnan Chang 2025-12-12 1:53 ` Jens Axboe 2025-12-12 2:12 ` Fengnan Chang 2025-12-12 5:11 ` Jens Axboe 2025-12-12 8:58 ` Jens Axboe 2025-12-12 9:49 ` Fengnan Chang 2025-12-12 20:22 ` Jens Axboe 2025-12-12 13:32 ` Diangang Li 2025-12-12 20:09 ` Jens Axboe 2025-12-10 9:53 ` (subset) [RFC PATCH 0/2] " Jens Axboe