public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode
@ 2025-12-10  8:54 Fengnan Chang
  2025-12-10  8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Fengnan Chang @ 2025-12-10  8:54 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring; +Cc: Fengnan Chang

In io_do_iopoll(), the reap pass after polling only handles the leading
run of completed requests: it stops at the first request on iopoll_list
that has not completed yet. This may be reasonable for a multi-queue
ctx, but it is problematic for a single-queue ctx, because the request
at the head of the list may not be done by the time polling produces a
result. A completed IO further down the list then has to wait for the
first IO in the chain to complete before the user is notified, which
can cause IOs to accumulate in the list.
Our change is as follows: switch from io_wq_work_list to a regular
list_head, so that a request can be removed from iopoll_list and put
onto the compl_reqs list as soon as it completes. This way each IO is
handled independently and gets processed in time.
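
Roughly, as a simplified sketch of the idea (not the exact hunks in
patch 2), the completion path then does:

	/* sketch only: once a polled request is observed as completed,
	 * detach it from ctx->iopoll_list and queue it for the batched
	 * completion flush right away, instead of leaving it parked
	 * behind an unfinished request at the head of the list.
	 */
	smp_store_release(&req->iopoll_completed, 1);
	list_move_tail(&req->comp_list, &req->ctx->submit_state.compl_reqs);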

After the modification, test with:

./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
/dev/nvme6n1

Base IOPS is 725K, patched IOPS is 782K.

./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
/dev/nvme6n1

Base IOPS is 880K, patched IOPS is 895K.

Fengnan Chang (2):
  blk-mq: delete task running check in blk_hctx_poll
  io_uring: fix io may accumulation in poll mode

 block/blk-mq.c                 |  2 --
 include/linux/blk-mq.h         |  1 +
 include/linux/blkdev.h         |  1 +
 include/linux/io_uring_types.h |  8 +++---
 io_uring/io_uring.c            | 51 ++++++++++++++++------------------
 io_uring/io_uring.h            | 15 ++++------
 io_uring/rw.c                  | 35 +++++++----------------
 io_uring/slist.h               |  9 ------
 io_uring/sqpoll.c              |  8 +++---
 9 files changed, 49 insertions(+), 81 deletions(-)


base-commit: 4941a17751c99e17422be743c02c923ad706f888
prerequisite-patch-id: 14d5e19968d0acb074ebcd62f073c330b0e34f36
prerequisite-patch-id: 1b5c18a564a4b4788f4992dc02d70c01f69c0e0e
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll
  2025-12-10  8:54 [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode Fengnan Chang
@ 2025-12-10  8:55 ` Fengnan Chang
  2025-12-10  9:19   ` Jens Axboe
  2025-12-10  9:53   ` Jens Axboe
  2025-12-10  8:55 ` [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode Fengnan Chang
  2025-12-10  9:53 ` (subset) [RFC PATCH 0/2] " Jens Axboe
  2 siblings, 2 replies; 22+ messages in thread
From: Fengnan Chang @ 2025-12-10  8:55 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li

In blk_hctx_poll(), the polling loop always checks whether the current
task is running and returns 1 if it is. That is not reasonable for the
current callers, especially io_uring, whose polling task is always in
the running state, so the check effectively behaves as if
BLK_POLL_ONESHOT were set.

The check appears to remain for historical reasons: in very early
versions of this function the caller would set the process state to
TASK_UNINTERRUPTIBLE.

Signed-off-by: Diangang Li <lidiangang@bytedance.com>
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
---
 block/blk-mq.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0b8b72194003..b0eb90c50afb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -5172,8 +5172,6 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
 
 		if (signal_pending_state(state, current))
 			__set_current_state(TASK_RUNNING);
-		if (task_is_running(current))
-			return 1;
 
 		if (ret < 0 || (flags & BLK_POLL_ONESHOT))
 			break;
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-10  8:54 [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode Fengnan Chang
  2025-12-10  8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang
@ 2025-12-10  8:55 ` Fengnan Chang
  2025-12-11  2:15   ` Jens Axboe
  2025-12-10  9:53 ` (subset) [RFC PATCH 0/2] " Jens Axboe
  2 siblings, 1 reply; 22+ messages in thread
From: Fengnan Chang @ 2025-12-10  8:55 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li

In io_do_iopoll(), the reap pass after polling only handles the leading
run of completed requests: it stops at the first request on iopoll_list
that has not completed yet. This may be reasonable for a multi-queue
ctx, but it is problematic for a single-queue ctx, because the request
at the head of the list may not be done by the time polling produces a
result. A completed IO further down the list then has to wait for the
first IO in the chain to complete before the user is notified, which
can cause IOs to accumulate in the list.
Our change is as follows: switch from io_wq_work_list to a regular
list_head, so that a request can be removed from iopoll_list and put
onto the compl_reqs list as soon as it completes. This way each IO is
handled independently and gets processed in time.

After the modification, test with:

./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
/dev/nvme6n1

Base IOPS is 725K, patched IOPS is 782K.

./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
/dev/nvme6n1

Base IOPS is 880K, patched IOPS is 895K.

Signed-off-by: Diangang Li <lidiangang@bytedance.com>
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
---
 include/linux/blk-mq.h         |  1 +
 include/linux/blkdev.h         |  1 +
 include/linux/io_uring_types.h |  8 +++---
 io_uring/io_uring.c            | 51 ++++++++++++++++------------------
 io_uring/io_uring.h            | 15 ++++------
 io_uring/rw.c                  | 35 +++++++----------------
 io_uring/slist.h               |  9 ------
 io_uring/sqpoll.c              |  8 +++---
 8 files changed, 49 insertions(+), 79 deletions(-)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index c16875b35521..39151998f8df 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -902,6 +902,7 @@ static inline bool blk_mq_add_to_batch(struct request *req,
 		return false;
 	iob->need_ts |= blk_mq_need_time_stamp(req);
 	rq_list_add_tail(&iob->req_list, req);
+	iob->req_count++;
 	return true;
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e25d9802e08b..923dbb1de6cf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1822,6 +1822,7 @@ void bdev_fput(struct file *bdev_file);
 
 struct io_comp_batch {
 	struct rq_list req_list;
+	unsigned int req_count;
 	bool need_ts;
 	void (*complete)(struct io_comp_batch *);
 };
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index c2ea6280901d..e9c70219c0f2 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -235,9 +235,9 @@ struct io_submit_link {
 
 struct io_submit_state {
 	/* inline/task_work completion list, under ->uring_lock */
-	struct io_wq_work_node	free_list;
+	struct list_head	free_list;
 	/* batch completion logic */
-	struct io_wq_work_list	compl_reqs;
+	struct list_head	compl_reqs;
 	struct io_submit_link	link;
 
 	bool			plug_started;
@@ -317,7 +317,7 @@ struct io_ring_ctx {
 		 * manipulate the list, hence no extra locking is needed there.
 		 */
 		bool			poll_multi_queue;
-		struct io_wq_work_list	iopoll_list;
+		struct list_head	iopoll_list;
 
 		struct io_file_table	file_table;
 		struct io_rsrc_data	buf_table;
@@ -695,7 +695,7 @@ struct io_kiocb {
 
 	union {
 		/* used by request caches, completion batching and iopoll */
-		struct io_wq_work_node	comp_list;
+		struct list_head	comp_list;
 		/* cache ->apoll->events */
 		__poll_t apoll_events;
 	};
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 296667ba712c..a7166656093a 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -255,7 +255,7 @@ static inline void io_req_add_to_cache(struct io_kiocb *req, struct io_ring_ctx
 {
 	if (IS_ENABLED(CONFIG_KASAN))
 		io_poison_cached_req(req);
-	wq_stack_add_head(&req->comp_list, &ctx->submit_state.free_list);
+	list_add(&req->comp_list, &ctx->submit_state.free_list);
 }
 
 static __cold void io_ring_ctx_ref_free(struct percpu_ref *ref)
@@ -367,20 +367,20 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	init_waitqueue_head(&ctx->poll_wq);
 	spin_lock_init(&ctx->completion_lock);
 	raw_spin_lock_init(&ctx->timeout_lock);
-	INIT_WQ_LIST(&ctx->iopoll_list);
+	INIT_LIST_HEAD(&ctx->iopoll_list);
 	INIT_LIST_HEAD(&ctx->defer_list);
 	INIT_LIST_HEAD(&ctx->timeout_list);
 	INIT_LIST_HEAD(&ctx->ltimeout_list);
 	init_llist_head(&ctx->work_llist);
 	INIT_LIST_HEAD(&ctx->tctx_list);
-	ctx->submit_state.free_list.next = NULL;
+	INIT_LIST_HEAD(&ctx->submit_state.free_list);
 	INIT_HLIST_HEAD(&ctx->waitid_list);
 	xa_init_flags(&ctx->zcrx_ctxs, XA_FLAGS_ALLOC);
 #ifdef CONFIG_FUTEX
 	INIT_HLIST_HEAD(&ctx->futex_list);
 #endif
 	INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
-	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
+	INIT_LIST_HEAD(&ctx->submit_state.compl_reqs);
 	INIT_HLIST_HEAD(&ctx->cancelable_uring_cmd);
 	io_napi_init(ctx);
 	mutex_init(&ctx->mmap_lock);
@@ -945,7 +945,7 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
 	 * those are flushed first before posting this one. If not, CQEs
 	 * could get reordered.
 	 */
-	if (!wq_list_empty(&ctx->submit_state.compl_reqs))
+	if (!list_empty(&ctx->submit_state.compl_reqs))
 		__io_submit_flush_completions(ctx);
 
 	lockdep_assert(!io_wq_current_is_worker());
@@ -1498,9 +1498,10 @@ static inline void io_req_put_rsrc_nodes(struct io_kiocb *req)
 }
 
 static void io_free_batch_list(struct io_ring_ctx *ctx,
-			       struct io_wq_work_node *node)
+			       struct list_head *head)
 	__must_hold(&ctx->uring_lock)
 {
+	struct list_head *node = head->next;
 	do {
 		struct io_kiocb *req = container_of(node, struct io_kiocb,
 						    comp_list);
@@ -1536,20 +1537,17 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
 
 		node = req->comp_list.next;
 		io_req_add_to_cache(req, ctx);
-	} while (node);
+	} while (!list_is_head(node, head));
 }
 
 void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_submit_state *state = &ctx->submit_state;
-	struct io_wq_work_node *node;
+	struct io_kiocb *req;
 
 	__io_cq_lock(ctx);
-	__wq_list_for_each(node, &state->compl_reqs) {
-		struct io_kiocb *req = container_of(node, struct io_kiocb,
-					    comp_list);
-
+	list_for_each_entry(req, &state->compl_reqs, comp_list) {
 		/*
 		 * Requests marked with REQUEUE should not post a CQE, they
 		 * will go through the io-wq retry machinery and post one
@@ -1565,9 +1563,9 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 	}
 	__io_cq_unlock_post(ctx);
 
-	if (!wq_list_empty(&state->compl_reqs)) {
-		io_free_batch_list(ctx, state->compl_reqs.first);
-		INIT_WQ_LIST(&state->compl_reqs);
+	if (!list_empty(&state->compl_reqs)) {
+		io_free_batch_list(ctx, &state->compl_reqs);
+		INIT_LIST_HEAD(&state->compl_reqs);
 	}
 
 	if (unlikely(ctx->drain_active))
@@ -1593,7 +1591,7 @@ static __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
 		return;
 
 	mutex_lock(&ctx->uring_lock);
-	while (!wq_list_empty(&ctx->iopoll_list)) {
+	while (!list_empty(&ctx->iopoll_list)) {
 		/* let it sleep and repeat later if can't complete a request */
 		if (io_do_iopoll(ctx, true) == 0)
 			break;
@@ -1658,21 +1656,21 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
 		 * forever, while the workqueue is stuck trying to acquire the
 		 * very same mutex.
 		 */
-		if (wq_list_empty(&ctx->iopoll_list) ||
+		if (list_empty(&ctx->iopoll_list) ||
 		    io_task_work_pending(ctx)) {
 			u32 tail = ctx->cached_cq_tail;
 
 			(void) io_run_local_work_locked(ctx, min_events);
 
 			if (task_work_pending(current) ||
-			    wq_list_empty(&ctx->iopoll_list)) {
+			    list_empty(&ctx->iopoll_list)) {
 				mutex_unlock(&ctx->uring_lock);
 				io_run_task_work();
 				mutex_lock(&ctx->uring_lock);
 			}
 			/* some requests don't go through iopoll_list */
 			if (tail != ctx->cached_cq_tail ||
-			    wq_list_empty(&ctx->iopoll_list))
+			    list_empty(&ctx->iopoll_list))
 				break;
 		}
 		ret = io_do_iopoll(ctx, !min_events);
@@ -1715,13 +1713,12 @@ static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags)
 	 * how we do polling eventually, not spinning if we're on potentially
 	 * different devices.
 	 */
-	if (wq_list_empty(&ctx->iopoll_list)) {
+	if (list_empty(&ctx->iopoll_list)) {
 		ctx->poll_multi_queue = false;
 	} else if (!ctx->poll_multi_queue) {
 		struct io_kiocb *list_req;
 
-		list_req = container_of(ctx->iopoll_list.first, struct io_kiocb,
-					comp_list);
+		list_req = list_first_entry(&ctx->iopoll_list, struct io_kiocb, comp_list);
 		if (list_req->file != req->file)
 			ctx->poll_multi_queue = true;
 	}
@@ -1731,9 +1728,9 @@ static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags)
 	 * it to the front so we find it first.
 	 */
 	if (READ_ONCE(req->iopoll_completed))
-		wq_list_add_head(&req->comp_list, &ctx->iopoll_list);
+		list_add(&req->comp_list, &ctx->iopoll_list);
 	else
-		wq_list_add_tail(&req->comp_list, &ctx->iopoll_list);
+		list_add_tail(&req->comp_list, &ctx->iopoll_list);
 
 	if (unlikely(needs_lock)) {
 		/*
@@ -2454,7 +2451,7 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 	if (unlikely(left)) {
 		ret -= left;
 		/* try again if it submitted nothing and can't allocate a req */
-		if (!ret && io_req_cache_empty(ctx))
+		if (!ret && list_empty(&ctx->submit_state.free_list))
 			ret = -EAGAIN;
 		current->io_uring->cached_refs += left;
 	}
@@ -2818,7 +2815,7 @@ static __cold void __io_req_caches_free(struct io_ring_ctx *ctx)
 	struct io_kiocb *req;
 	int nr = 0;
 
-	while (!io_req_cache_empty(ctx)) {
+	while (!list_empty(&ctx->submit_state.free_list)) {
 		req = io_extract_req(ctx);
 		io_poison_req(req);
 		kmem_cache_free(req_cachep, req);
@@ -3215,7 +3212,7 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 	/* SQPOLL thread does its own polling */
 	if ((!(ctx->flags & IORING_SETUP_SQPOLL) && cancel_all) ||
 	    is_sqpoll_thread) {
-		while (!wq_list_empty(&ctx->iopoll_list)) {
+		while (!list_empty(&ctx->iopoll_list)) {
 			io_iopoll_try_reap_events(ctx);
 			ret = true;
 			cond_resched();
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 46d9141d772a..acc82a1b7107 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -215,7 +215,7 @@ static inline void io_req_task_work_add(struct io_kiocb *req)
 
 static inline void io_submit_flush_completions(struct io_ring_ctx *ctx)
 {
-	if (!wq_list_empty(&ctx->submit_state.compl_reqs) ||
+	if (!list_empty(&ctx->submit_state.compl_reqs) ||
 	    ctx->submit_state.cq_flush)
 		__io_submit_flush_completions(ctx);
 }
@@ -502,7 +502,7 @@ static inline void io_req_complete_defer(struct io_kiocb *req)
 
 	lockdep_assert_held(&req->ctx->uring_lock);
 
-	wq_list_add_tail(&req->comp_list, &state->compl_reqs);
+	list_add_tail(&req->comp_list, &state->compl_reqs);
 }
 
 static inline void io_commit_cqring_flush(struct io_ring_ctx *ctx)
@@ -521,25 +521,20 @@ static inline void io_get_task_refs(int nr)
 		io_task_refs_refill(tctx);
 }
 
-static inline bool io_req_cache_empty(struct io_ring_ctx *ctx)
-{
-	return !ctx->submit_state.free_list.next;
-}
-
 extern struct kmem_cache *req_cachep;
 
 static inline struct io_kiocb *io_extract_req(struct io_ring_ctx *ctx)
 {
 	struct io_kiocb *req;
 
-	req = container_of(ctx->submit_state.free_list.next, struct io_kiocb, comp_list);
-	wq_stack_extract(&ctx->submit_state.free_list);
+	req = list_first_entry(&ctx->submit_state.free_list, struct io_kiocb, comp_list);
+	list_del_init(&req->comp_list);
 	return req;
 }
 
 static inline bool io_alloc_req(struct io_ring_ctx *ctx, struct io_kiocb **req)
 {
-	if (unlikely(io_req_cache_empty(ctx))) {
+	if (unlikely(list_empty(&ctx->submit_state.free_list))) {
 		if (!__io_alloc_req_refill(ctx))
 			return false;
 	}
diff --git a/io_uring/rw.c b/io_uring/rw.c
index abe68ba9c9dc..4ecd41f48812 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -616,6 +616,11 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res)
 
 	/* order with io_iopoll_complete() checking ->iopoll_completed */
 	smp_store_release(&req->iopoll_completed, 1);
+
+	req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL);
+	if (req->opcode != IORING_OP_URING_CMD)
+		io_req_rw_cleanup(req, 0);
+	list_move_tail(&req->comp_list, &req->ctx->submit_state.compl_reqs);
 }
 
 static inline void io_rw_done(struct io_kiocb *req, ssize_t ret)
@@ -1326,10 +1331,9 @@ static int io_uring_hybrid_poll(struct io_kiocb *req,
 
 int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 {
-	struct io_wq_work_node *pos, *start, *prev;
+	struct io_kiocb *req, *next;
 	unsigned int poll_flags = 0;
 	DEFINE_IO_COMP_BATCH(iob);
-	int nr_events = 0;
 
 	/*
 	 * Only spin for completions if we don't have multiple devices hanging
@@ -1338,8 +1342,7 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 	if (ctx->poll_multi_queue || force_nonspin)
 		poll_flags |= BLK_POLL_ONESHOT;
 
-	wq_list_for_each(pos, start, &ctx->iopoll_list) {
-		struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list);
+	list_for_each_entry(req, &ctx->iopoll_list, comp_list) {
 		int ret;
 
 		/*
@@ -1368,32 +1371,14 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 
 	if (!rq_list_empty(&iob.req_list))
 		iob.complete(&iob);
-	else if (!pos)
+	else if (list_entry_is_head(req, &ctx->iopoll_list, comp_list))
 		return 0;
 
-	prev = start;
-	wq_list_for_each_resume(pos, prev) {
-		struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list);
-
-		/* order with io_complete_rw_iopoll(), e.g. ->result updates */
-		if (!smp_load_acquire(&req->iopoll_completed))
-			break;
-		nr_events++;
-		req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL);
-		if (req->opcode != IORING_OP_URING_CMD)
-			io_req_rw_cleanup(req, 0);
-	}
-	if (unlikely(!nr_events))
+	if (unlikely(!iob.req_count))
 		return 0;
 
-	pos = start ? start->next : ctx->iopoll_list.first;
-	wq_list_cut(&ctx->iopoll_list, prev, start);
-
-	if (WARN_ON_ONCE(!wq_list_empty(&ctx->submit_state.compl_reqs)))
-		return 0;
-	ctx->submit_state.compl_reqs.first = pos;
 	__io_submit_flush_completions(ctx);
-	return nr_events;
+	return iob.req_count;
 }
 
 void io_rw_cache_free(const void *entry)
diff --git a/io_uring/slist.h b/io_uring/slist.h
index 0eb194817242..f62e12c67e4f 100644
--- a/io_uring/slist.h
+++ b/io_uring/slist.h
@@ -99,15 +99,6 @@ static inline void wq_list_del(struct io_wq_work_list *list,
 	wq_list_cut(list, node, prev);
 }
 
-static inline
-struct io_wq_work_node *wq_stack_extract(struct io_wq_work_node *stack)
-{
-	struct io_wq_work_node *node = stack->next;
-
-	stack->next = node->next;
-	return node;
-}
-
 static inline struct io_wq_work *wq_next_work(struct io_wq_work *work)
 {
 	if (!work->list.next)
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index e22f072c7d5f..aac7a2c69892 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -211,7 +211,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd,
 	if (cap_entries && to_submit > IORING_SQPOLL_CAP_ENTRIES_VALUE)
 		to_submit = IORING_SQPOLL_CAP_ENTRIES_VALUE;
 
-	if (to_submit || !wq_list_empty(&ctx->iopoll_list)) {
+	if (to_submit || !list_empty(&ctx->iopoll_list)) {
 		const struct cred *creds = NULL;
 
 		io_sq_start_worktime(ist);
@@ -220,7 +220,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd,
 			creds = override_creds(ctx->sq_creds);
 
 		mutex_lock(&ctx->uring_lock);
-		if (!wq_list_empty(&ctx->iopoll_list))
+		if (!list_empty(&ctx->iopoll_list))
 			io_do_iopoll(ctx, true);
 
 		/*
@@ -343,7 +343,7 @@ static int io_sq_thread(void *data)
 		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
 			int ret = __io_sq_thread(ctx, sqd, cap_entries, &ist);
 
-			if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
+			if (!sqt_spin && (ret > 0 || !list_empty(&ctx->iopoll_list)))
 				sqt_spin = true;
 		}
 		if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE))
@@ -378,7 +378,7 @@ static int io_sq_thread(void *data)
 				atomic_or(IORING_SQ_NEED_WAKEUP,
 						&ctx->rings->sq_flags);
 				if ((ctx->flags & IORING_SETUP_IOPOLL) &&
-				    !wq_list_empty(&ctx->iopoll_list)) {
+				    !list_empty(&ctx->iopoll_list)) {
 					needs_sched = false;
 					break;
 				}
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll
  2025-12-10  8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang
@ 2025-12-10  9:19   ` Jens Axboe
  2025-12-10  9:53   ` Jens Axboe
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2025-12-10  9:19 UTC (permalink / raw)
  To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li

On 12/10/25 1:55 AM, Fengnan Chang wrote:
> In blk_hctx_poll, it always check task is running or not, and return 1
> if task is running, it's not reasonable for current caller, especially
> io_uring, which is always running and cause BLK_POLL_ONESHOT is set.
> 
> It looks like there has been this judgment for historical reasons, and
> in very early versions of this function the user would set the process
> state to TASK_UNINTERRUPTIBLE.
> 
> Signed-off-by: Diangang Li <lidiangang@bytedance.com>
> Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
> ---
>  block/blk-mq.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0b8b72194003..b0eb90c50afb 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -5172,8 +5172,6 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
>  
>  		if (signal_pending_state(state, current))
>  			__set_current_state(TASK_RUNNING);
> -		if (task_is_running(current))
> -			return 1;
>  
>  		if (ret < 0 || (flags & BLK_POLL_ONESHOT))
>  			break;

Heh yes, this is completely a leftover part from when polled IO was
purely synchronous and the completion would wake it. Similarly, the
__set_current_state(TASK_RUNNING) a bit further down needs to die as
well.
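
For context, the hunk above sits near the end of blk_hctx_poll()'s
polling loop, which finishes roughly like this (paraphrased, not an
exact quote of current blk-mq.c):

		if (ret < 0 || (flags & BLK_POLL_ONESHOT))
			break;
		cpu_relax();
	} while (!need_resched());

	/* the state reset "a bit further down" referred to above */
	__set_current_state(TASK_RUNNING);
	return 0;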

I'll check your patch 2 tomorrow. Not loving the complete switch from
the singly linked list to a doubly linked one, that's going to slow down
both addition and iteration of the req freelist. So I think something
needs to change there, but haven't really gone deep with it yet.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: (subset) [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode
  2025-12-10  8:54 [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode Fengnan Chang
  2025-12-10  8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang
  2025-12-10  8:55 ` [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode Fengnan Chang
@ 2025-12-10  9:53 ` Jens Axboe
  2 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2025-12-10  9:53 UTC (permalink / raw)
  To: asml.silence, io-uring, Fengnan Chang; +Cc: Fengnan Chang


On Wed, 10 Dec 2025 16:54:59 +0800, Fengnan Chang wrote:
> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it
> is considered that the current req is the actual completed request.
> This may be reasonable for multi-queue ctx, but is problematic for
> single-queue ctx because the current request may not be done when the
> poll gets to the result. In this case, the completed io needs to wait
> for the first io on the chain to complete before notifying the user,
> which may cause io accumulation in the list.
> Our modification plan is as follows: change io_wq_work_list to normal
> list so that the iopoll_list list in it can be removed and put into the
> comp_reqs list when the request is completed. This way each io is
> handled independently and all gets processed in time.
> 
> [...]

Applied, thanks!

[1/2] blk-mq: delete task running check in blk_hctx_poll
      (no commit info)

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll
  2025-12-10  8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang
  2025-12-10  9:19   ` Jens Axboe
@ 2025-12-10  9:53   ` Jens Axboe
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2025-12-10  9:53 UTC (permalink / raw)
  To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li

On 12/10/25 1:55 AM, Fengnan Chang wrote:
> In blk_hctx_poll, it always check task is running or not, and return 1
> if task is running, it's not reasonable for current caller, especially
> io_uring, which is always running and cause BLK_POLL_ONESHOT is set.
> 
> It looks like there has been this judgment for historical reasons, and
> in very early versions of this function the user would set the process
> state to TASK_UNINTERRUPTIBLE.
> 
> Signed-off-by: Diangang Li <lidiangang@bytedance.com>
> Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
> ---
>  block/blk-mq.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0b8b72194003..b0eb90c50afb 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -5172,8 +5172,6 @@ static int blk_hctx_poll(struct request_queue *q, struct blk_mq_hw_ctx *hctx,
>  
>  		if (signal_pending_state(state, current))
>  			__set_current_state(TASK_RUNNING);
> -		if (task_is_running(current))
> -			return 1;
>  
>  		if (ret < 0 || (flags & BLK_POLL_ONESHOT))
>  			break;

Applied with the other task-is-running remnants killed, too.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-10  8:55 ` [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode Fengnan Chang
@ 2025-12-11  2:15   ` Jens Axboe
  2025-12-11  4:10     ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2025-12-11  2:15 UTC (permalink / raw)
  To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li

On 12/10/25 1:55 AM, Fengnan Chang wrote:
> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it
> is considered that the current req is the actual completed request.
> This may be reasonable for multi-queue ctx, but is problematic for
> single-queue ctx because the current request may not be done when the
> poll gets to the result. In this case, the completed io needs to wait
> for the first io on the chain to complete before notifying the user,
> which may cause io accumulation in the list.
> Our modification plan is as follows: change io_wq_work_list to normal
> list so that the iopoll_list list in it can be removed and put into the
> comp_reqs list when the request is completed. This way each io is
> handled independently and all gets processed in time.
> 
> After modification,  test with:
> 
> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
> /dev/nvme6n1
> 
> base IOPS is 725K,  patch IOPS is 782K.
> 
> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
> /dev/nvme6n1
> 
> Base IOPS is 880k, patch IOPS is 895K.

A few notes on this:

1) Manipulating the list in io_complete_rw_iopoll() I don't think is
   necessarily safe. Yes generally this is invoked from the
   owning/polling task, but that's not guaranteed.

2) The patch doesn't apply to the current tree, must be an older
   version?

3) When hand-applied, it still throws a compile warning about an unused
   variable. Please don't send untested stuff...

4) Don't just blatantly bloat the io_kiocb. When you change from a
   singly to a doubly linked list, you're growing the io_kiocb size. You
   should be able to use a union with struct io_task_work for example.
   That's already 16b in size - win/win as you don't need to slow down
   the cache management as that can keep using the linkage it currently
   is using, and you're not bloating the io_kiocb.

5) The already mentioned point about the cache free list now being
   doubly linked. This is generally a _bad_ idea as removing and adding
   entries now need to touch other entries too. That's not very cache
   friendly.

#1 is kind of the big one, as it means you'll need to re-think how you
do this. I do agree that the current approach isn't necessarily ideal as
we don't process completions as quickly as we could, so I think there's
merit in continuing this work.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-11  2:15   ` Jens Axboe
@ 2025-12-11  4:10     ` Jens Axboe
  2025-12-11  7:38       ` Fengnan
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2025-12-11  4:10 UTC (permalink / raw)
  To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li

On 12/10/25 7:15 PM, Jens Axboe wrote:
> On 12/10/25 1:55 AM, Fengnan Chang wrote:
>> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it
>> is considered that the current req is the actual completed request.
>> This may be reasonable for multi-queue ctx, but is problematic for
>> single-queue ctx because the current request may not be done when the
>> poll gets to the result. In this case, the completed io needs to wait
>> for the first io on the chain to complete before notifying the user,
>> which may cause io accumulation in the list.
>> Our modification plan is as follows: change io_wq_work_list to normal
>> list so that the iopoll_list list in it can be removed and put into the
>> comp_reqs list when the request is completed. This way each io is
>> handled independently and all gets processed in time.
>>
>> After modification,  test with:
>>
>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
>> /dev/nvme6n1
>>
>> base IOPS is 725K,  patch IOPS is 782K.
>>
>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
>> /dev/nvme6n1
>>
>> Base IOPS is 880k, patch IOPS is 895K.
> 
> A few notes on this:
> 
> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is
>    necessarily safe. Yes generally this is invoked from the
>    owning/polling task, but that's not guaranteed.
> 
> 2) The patch doesn't apply to the current tree, must be an older
>    version?
> 
> 3) When hand-applied, it still throws a compile warning about an unused
>    variable. Please don't send untested stuff...
> 
> 4) Don't just blatantly bloat the io_kiocb. When you change from a
>    singly to a doubly linked list, you're growing the io_kiocb size. You
>    should be able to use a union with struct io_task_work for example.
>    That's already 16b in size - win/win as you don't need to slow down
>    the cache management as that can keep using the linkage it currently
>    is using, and you're not bloating the io_kiocb.
> 
> 5) The already mentioned point about the cache free list now being
>    doubly linked. This is generally a _bad_ idea as removing and adding
>    entries now need to touch other entries too. That's not very cache
>    friendly.
> 
> #1 is kind of the big one, as it means you'll need to re-think how you
> do this. I do agree that the current approach isn't necessarily ideal as
> we don't process completions as quickly as we could, so I think there's
> merrit in continuing this work.

Proof of concept below, entirely untested, at a conference. Basically
does what I describe above, and retains the list manipulation logic
on the iopoll side, rather than on the completion side. Can you take
a look at this? And if it runs, can you do some numbers on that too?

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index e1adb0d20a0a..54fd30abf2b8 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -316,7 +316,7 @@ struct io_ring_ctx {
 		 * manipulate the list, hence no extra locking is needed there.
 		 */
 		bool			poll_multi_queue;
-		struct io_wq_work_list	iopoll_list;
+		struct list_head	iopoll_list;
 
 		struct io_file_table	file_table;
 		struct io_rsrc_data	buf_table;
@@ -708,7 +708,16 @@ struct io_kiocb {
 
 	atomic_t			refs;
 	bool				cancel_seq_set;
-	struct io_task_work		io_task_work;
+
+	/*
+	 * IOPOLL doesn't use task_work, so use the ->iopoll_node list
+	 * entry to manage pending iopoll requests.
+	 */
+	union {
+		struct io_task_work	io_task_work;
+		struct list_head	iopoll_node;
+	};
+
 	union {
 		/*
 		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index ca12ac10c0ae..4136bf04464a 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -534,7 +534,7 @@ __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 	/* SQPOLL thread does its own polling */
 	if ((!(ctx->flags & IORING_SETUP_SQPOLL) && cancel_all) ||
 	    is_sqpoll_thread) {
-		while (!wq_list_empty(&ctx->iopoll_list)) {
+		while (!list_empty(&ctx->iopoll_list)) {
 			io_iopoll_try_reap_events(ctx);
 			ret = true;
 			cond_resched();
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 6cb24cdf8e68..fdb0b43f6fb5 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -334,7 +334,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	init_waitqueue_head(&ctx->poll_wq);
 	spin_lock_init(&ctx->completion_lock);
 	raw_spin_lock_init(&ctx->timeout_lock);
-	INIT_WQ_LIST(&ctx->iopoll_list);
+	INIT_LIST_HEAD(&ctx->iopoll_list);
 	INIT_LIST_HEAD(&ctx->defer_list);
 	INIT_LIST_HEAD(&ctx->timeout_list);
 	INIT_LIST_HEAD(&ctx->ltimeout_list);
@@ -1561,7 +1561,7 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
 		return;
 
 	mutex_lock(&ctx->uring_lock);
-	while (!wq_list_empty(&ctx->iopoll_list)) {
+	while (!list_empty(&ctx->iopoll_list)) {
 		/* let it sleep and repeat later if can't complete a request */
 		if (io_do_iopoll(ctx, true) == 0)
 			break;
@@ -1626,21 +1626,18 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
 		 * forever, while the workqueue is stuck trying to acquire the
 		 * very same mutex.
 		 */
-		if (wq_list_empty(&ctx->iopoll_list) ||
-		    io_task_work_pending(ctx)) {
+		if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) {
 			u32 tail = ctx->cached_cq_tail;
 
 			(void) io_run_local_work_locked(ctx, min_events);
 
-			if (task_work_pending(current) ||
-			    wq_list_empty(&ctx->iopoll_list)) {
+			if (task_work_pending(current) || list_empty(&ctx->iopoll_list)) {
 				mutex_unlock(&ctx->uring_lock);
 				io_run_task_work();
 				mutex_lock(&ctx->uring_lock);
 			}
 			/* some requests don't go through iopoll_list */
-			if (tail != ctx->cached_cq_tail ||
-			    wq_list_empty(&ctx->iopoll_list))
+			if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list))
 				break;
 		}
 		ret = io_do_iopoll(ctx, !min_events);
@@ -1683,25 +1680,17 @@ static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags)
 	 * how we do polling eventually, not spinning if we're on potentially
 	 * different devices.
 	 */
-	if (wq_list_empty(&ctx->iopoll_list)) {
+	if (list_empty(&ctx->iopoll_list)) {
 		ctx->poll_multi_queue = false;
 	} else if (!ctx->poll_multi_queue) {
 		struct io_kiocb *list_req;
 
-		list_req = container_of(ctx->iopoll_list.first, struct io_kiocb,
-					comp_list);
+		list_req = list_entry(ctx->iopoll_list.next, struct io_kiocb, iopoll_node);
 		if (list_req->file != req->file)
 			ctx->poll_multi_queue = true;
 	}
 
-	/*
-	 * For fast devices, IO may have already completed. If it has, add
-	 * it to the front so we find it first.
-	 */
-	if (READ_ONCE(req->iopoll_completed))
-		wq_list_add_head(&req->comp_list, &ctx->iopoll_list);
-	else
-		wq_list_add_tail(&req->comp_list, &ctx->iopoll_list);
+	list_add_tail(&req->iopoll_node, &ctx->iopoll_list);
 
 	if (unlikely(needs_lock)) {
 		/*
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 70ca88cc1f54..307f1f39d9f3 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1315,9 +1315,9 @@ static int io_uring_hybrid_poll(struct io_kiocb *req,
 
 int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 {
-	struct io_wq_work_node *pos, *start, *prev;
 	unsigned int poll_flags = 0;
 	DEFINE_IO_COMP_BATCH(iob);
+	struct io_kiocb *req, *tmp;
 	int nr_events = 0;
 
 	/*
@@ -1327,8 +1327,7 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 	if (ctx->poll_multi_queue || force_nonspin)
 		poll_flags |= BLK_POLL_ONESHOT;
 
-	wq_list_for_each(pos, start, &ctx->iopoll_list) {
-		struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list);
+	list_for_each_entry(req, &ctx->iopoll_list, iopoll_node) {
 		int ret;
 
 		/*
@@ -1357,31 +1356,20 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 
 	if (!rq_list_empty(&iob.req_list))
 		iob.complete(&iob);
-	else if (!pos)
-		return 0;
-
-	prev = start;
-	wq_list_for_each_resume(pos, prev) {
-		struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list);
 
+	list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) {
 		/* order with io_complete_rw_iopoll(), e.g. ->result updates */
 		if (!smp_load_acquire(&req->iopoll_completed))
-			break;
+			continue;
+		list_del(&req->iopoll_node);
+		wq_list_add_tail(&req->comp_list, &ctx->submit_state.compl_reqs);
 		nr_events++;
 		req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL);
 		if (req->opcode != IORING_OP_URING_CMD)
 			io_req_rw_cleanup(req, 0);
 	}
-	if (unlikely(!nr_events))
-		return 0;
-
-	pos = start ? start->next : ctx->iopoll_list.first;
-	wq_list_cut(&ctx->iopoll_list, prev, start);
-
-	if (WARN_ON_ONCE(!wq_list_empty(&ctx->submit_state.compl_reqs)))
-		return 0;
-	ctx->submit_state.compl_reqs.first = pos;
-	__io_submit_flush_completions(ctx);
+	if (nr_events)
+		__io_submit_flush_completions(ctx);
 	return nr_events;
 }
 
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 74c1a130cd87..becdfdd323a9 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -212,7 +212,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd,
 	if (cap_entries && to_submit > IORING_SQPOLL_CAP_ENTRIES_VALUE)
 		to_submit = IORING_SQPOLL_CAP_ENTRIES_VALUE;
 
-	if (to_submit || !wq_list_empty(&ctx->iopoll_list)) {
+	if (to_submit || !list_empty(&ctx->iopoll_list)) {
 		const struct cred *creds = NULL;
 
 		io_sq_start_worktime(ist);
@@ -221,7 +221,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd,
 			creds = override_creds(ctx->sq_creds);
 
 		mutex_lock(&ctx->uring_lock);
-		if (!wq_list_empty(&ctx->iopoll_list))
+		if (!list_empty(&ctx->iopoll_list))
 			io_do_iopoll(ctx, true);
 
 		/*
@@ -344,7 +344,7 @@ static int io_sq_thread(void *data)
 		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
 			int ret = __io_sq_thread(ctx, sqd, cap_entries, &ist);
 
-			if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
+			if (!sqt_spin && (ret > 0 || !list_empty(&ctx->iopoll_list)))
 				sqt_spin = true;
 		}
 		if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE))
@@ -379,7 +379,7 @@ static int io_sq_thread(void *data)
 				atomic_or(IORING_SQ_NEED_WAKEUP,
 						&ctx->rings->sq_flags);
 				if ((ctx->flags & IORING_SETUP_IOPOLL) &&
-				    !wq_list_empty(&ctx->iopoll_list)) {
+				    !list_empty(&ctx->iopoll_list)) {
 					needs_sched = false;
 					break;
 				}

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-11  4:10     ` Jens Axboe
@ 2025-12-11  7:38       ` Fengnan
  2025-12-11 10:22         ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Fengnan @ 2025-12-11  7:38 UTC (permalink / raw)
  To: Jens Axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li



On 2025/12/11 12:10, Jens Axboe wrote:
> On 12/10/25 7:15 PM, Jens Axboe wrote:
>> On 12/10/25 1:55 AM, Fengnan Chang wrote:
>>> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it
>>> is considered that the current req is the actual completed request.
>>> This may be reasonable for multi-queue ctx, but is problematic for
>>> single-queue ctx because the current request may not be done when the
>>> poll gets to the result. In this case, the completed io needs to wait
>>> for the first io on the chain to complete before notifying the user,
>>> which may cause io accumulation in the list.
>>> Our modification plan is as follows: change io_wq_work_list to normal
>>> list so that the iopoll_list list in it can be removed and put into the
>>> comp_reqs list when the request is completed. This way each io is
>>> handled independently and all gets processed in time.
>>>
>>> After modification,  test with:
>>>
>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
>>> /dev/nvme6n1
>>>
>>> base IOPS is 725K,  patch IOPS is 782K.
>>>
>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
>>> /dev/nvme6n1
>>>
>>> Base IOPS is 880k, patch IOPS is 895K.
>> A few notes on this:
>>
>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is
>>     necessarily safe. Yes generally this is invoked from the
>>     owning/polling task, but that's not guaranteed.
>>
>> 2) The patch doesn't apply to the current tree, must be an older
>>     version?
>>
>> 3) When hand-applied, it still throws a compile warning about an unused
>>     variable. Please don't send untested stuff...
>>
>> 4) Don't just blatantly bloat the io_kiocb. When you change from a
>>     singly to a doubly linked list, you're growing the io_kiocb size. You
>>     should be able to use a union with struct io_task_work for example.
>>     That's already 16b in size - win/win as you don't need to slow down
>>     the cache management as that can keep using the linkage it currently
>>     is using, and you're not bloating the io_kiocb.
>>
>> 5) The already mentioned point about the cache free list now being
>>     doubly linked. This is generally a _bad_ idea as removing and adding
>>     entries now need to touch other entries too. That's not very cache
>>     friendly.
>>
>> #1 is kind of the big one, as it means you'll need to re-think how you
>> do this. I do agree that the current approach isn't necessarily ideal as
>> we don't process completions as quickly as we could, so I think there's
>> merrit in continuing this work.
> Proof of concept below, entirely untested, at a conference. Basically
> does what I describe above, and retains the list manipulation logic
> on the iopoll side, rather than on the completion side. Can you take
> a look at this? And if it runs, can you do some numbers on that too?

This patch works, and in my test case the performance is identical to
my patch.
But there is a small problem: finding completed requests now always
requires traversing the whole iopoll_list, which can be a bit
inefficient in some cases. For example, if the earlier submissions are
128K IOs and the last one is a 4K IO, the last IO will complete much
earlier; this kind of scenario cannot be verified with the current
test. I'm not sure whether it's a meaningful scenario.

> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index e1adb0d20a0a..54fd30abf2b8 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -316,7 +316,7 @@ struct io_ring_ctx {
>   		 * manipulate the list, hence no extra locking is needed there.
>   		 */
>   		bool			poll_multi_queue;
> -		struct io_wq_work_list	iopoll_list;
> +		struct list_head	iopoll_list;
>   
>   		struct io_file_table	file_table;
>   		struct io_rsrc_data	buf_table;
> @@ -708,7 +708,16 @@ struct io_kiocb {
>   
>   	atomic_t			refs;
>   	bool				cancel_seq_set;
> -	struct io_task_work		io_task_work;
> +
> +	/*
> +	 * IOPOLL doesn't use task_work, so use the ->iopoll_node list
> +	 * entry to manage pending iopoll requests.
> +	 */
> +	union {
> +		struct io_task_work	io_task_work;
> +		struct list_head	iopoll_node;
> +	};
> +
>   	union {
>   		/*
>   		 * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
> diff --git a/io_uring/cancel.c b/io_uring/cancel.c
> index ca12ac10c0ae..4136bf04464a 100644
> --- a/io_uring/cancel.c
> +++ b/io_uring/cancel.c
> @@ -534,7 +534,7 @@ __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
>   	/* SQPOLL thread does its own polling */
>   	if ((!(ctx->flags & IORING_SETUP_SQPOLL) && cancel_all) ||
>   	    is_sqpoll_thread) {
> -		while (!wq_list_empty(&ctx->iopoll_list)) {
> +		while (!list_empty(&ctx->iopoll_list)) {
>   			io_iopoll_try_reap_events(ctx);
>   			ret = true;
>   			cond_resched();
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 6cb24cdf8e68..fdb0b43f6fb5 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -334,7 +334,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
>   	init_waitqueue_head(&ctx->poll_wq);
>   	spin_lock_init(&ctx->completion_lock);
>   	raw_spin_lock_init(&ctx->timeout_lock);
> -	INIT_WQ_LIST(&ctx->iopoll_list);
> +	INIT_LIST_HEAD(&ctx->iopoll_list);
>   	INIT_LIST_HEAD(&ctx->defer_list);
>   	INIT_LIST_HEAD(&ctx->timeout_list);
>   	INIT_LIST_HEAD(&ctx->ltimeout_list);
> @@ -1561,7 +1561,7 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
>   		return;
>   
>   	mutex_lock(&ctx->uring_lock);
> -	while (!wq_list_empty(&ctx->iopoll_list)) {
> +	while (!list_empty(&ctx->iopoll_list)) {
>   		/* let it sleep and repeat later if can't complete a request */
>   		if (io_do_iopoll(ctx, true) == 0)
>   			break;
> @@ -1626,21 +1626,18 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
>   		 * forever, while the workqueue is stuck trying to acquire the
>   		 * very same mutex.
>   		 */
> -		if (wq_list_empty(&ctx->iopoll_list) ||
> -		    io_task_work_pending(ctx)) {
> +		if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) {
>   			u32 tail = ctx->cached_cq_tail;
>   
>   			(void) io_run_local_work_locked(ctx, min_events);
>   
> -			if (task_work_pending(current) ||
> -			    wq_list_empty(&ctx->iopoll_list)) {
> +			if (task_work_pending(current) || list_empty(&ctx->iopoll_list)) {
>   				mutex_unlock(&ctx->uring_lock);
>   				io_run_task_work();
>   				mutex_lock(&ctx->uring_lock);
>   			}
>   			/* some requests don't go through iopoll_list */
> -			if (tail != ctx->cached_cq_tail ||
> -			    wq_list_empty(&ctx->iopoll_list))
> +			if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list))
>   				break;
>   		}
>   		ret = io_do_iopoll(ctx, !min_events);
> @@ -1683,25 +1680,17 @@ static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags)
>   	 * how we do polling eventually, not spinning if we're on potentially
>   	 * different devices.
>   	 */
> -	if (wq_list_empty(&ctx->iopoll_list)) {
> +	if (list_empty(&ctx->iopoll_list)) {
>   		ctx->poll_multi_queue = false;
>   	} else if (!ctx->poll_multi_queue) {
>   		struct io_kiocb *list_req;
>   
> -		list_req = container_of(ctx->iopoll_list.first, struct io_kiocb,
> -					comp_list);
> +		list_req = list_entry(ctx->iopoll_list.next, struct io_kiocb, iopoll_node);
>   		if (list_req->file != req->file)
>   			ctx->poll_multi_queue = true;
>   	}
>   
> -	/*
> -	 * For fast devices, IO may have already completed. If it has, add
> -	 * it to the front so we find it first.
> -	 */
> -	if (READ_ONCE(req->iopoll_completed))
> -		wq_list_add_head(&req->comp_list, &ctx->iopoll_list);
> -	else
> -		wq_list_add_tail(&req->comp_list, &ctx->iopoll_list);
> +	list_add_tail(&req->iopoll_node, &ctx->iopoll_list);
>   
>   	if (unlikely(needs_lock)) {
>   		/*
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 70ca88cc1f54..307f1f39d9f3 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -1315,9 +1315,9 @@ static int io_uring_hybrid_poll(struct io_kiocb *req,
>   
>   int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
>   {
> -	struct io_wq_work_node *pos, *start, *prev;
>   	unsigned int poll_flags = 0;
>   	DEFINE_IO_COMP_BATCH(iob);
> +	struct io_kiocb *req, *tmp;
>   	int nr_events = 0;
>   
>   	/*
> @@ -1327,8 +1327,7 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
>   	if (ctx->poll_multi_queue || force_nonspin)
>   		poll_flags |= BLK_POLL_ONESHOT;
>   
> -	wq_list_for_each(pos, start, &ctx->iopoll_list) {
> -		struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list);
> +	list_for_each_entry(req, &ctx->iopoll_list, iopoll_node) {
>   		int ret;
>   
>   		/*
> @@ -1357,31 +1356,20 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
>   
>   	if (!rq_list_empty(&iob.req_list))
>   		iob.complete(&iob);
> -	else if (!pos)
> -		return 0;
> -
> -	prev = start;
> -	wq_list_for_each_resume(pos, prev) {
> -		struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list);
>   
> +	list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) {
>   		/* order with io_complete_rw_iopoll(), e.g. ->result updates */
>   		if (!smp_load_acquire(&req->iopoll_completed))
> -			break;
> +			continue;
> +		list_del(&req->iopoll_node);
> +		wq_list_add_tail(&req->comp_list, &ctx->submit_state.compl_reqs);
>   		nr_events++;
>   		req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL);
>   		if (req->opcode != IORING_OP_URING_CMD)
>   			io_req_rw_cleanup(req, 0);
>   	}
> -	if (unlikely(!nr_events))
> -		return 0;
> -
> -	pos = start ? start->next : ctx->iopoll_list.first;
> -	wq_list_cut(&ctx->iopoll_list, prev, start);
> -
> -	if (WARN_ON_ONCE(!wq_list_empty(&ctx->submit_state.compl_reqs)))
> -		return 0;
> -	ctx->submit_state.compl_reqs.first = pos;
> -	__io_submit_flush_completions(ctx);
> +	if (nr_events)
> +		__io_submit_flush_completions(ctx);
>   	return nr_events;
>   }
>   
> diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
> index 74c1a130cd87..becdfdd323a9 100644
> --- a/io_uring/sqpoll.c
> +++ b/io_uring/sqpoll.c
> @@ -212,7 +212,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd,
>   	if (cap_entries && to_submit > IORING_SQPOLL_CAP_ENTRIES_VALUE)
>   		to_submit = IORING_SQPOLL_CAP_ENTRIES_VALUE;
>   
> -	if (to_submit || !wq_list_empty(&ctx->iopoll_list)) {
> +	if (to_submit || !list_empty(&ctx->iopoll_list)) {
>   		const struct cred *creds = NULL;
>   
>   		io_sq_start_worktime(ist);
> @@ -221,7 +221,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd,
>   			creds = override_creds(ctx->sq_creds);
>   
>   		mutex_lock(&ctx->uring_lock);
> -		if (!wq_list_empty(&ctx->iopoll_list))
> +		if (!list_empty(&ctx->iopoll_list))
>   			io_do_iopoll(ctx, true);
>   
>   		/*
> @@ -344,7 +344,7 @@ static int io_sq_thread(void *data)
>   		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
>   			int ret = __io_sq_thread(ctx, sqd, cap_entries, &ist);
>   
> -			if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
> +			if (!sqt_spin && (ret > 0 || !list_empty(&ctx->iopoll_list)))
>   				sqt_spin = true;
>   		}
>   		if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE))
> @@ -379,7 +379,7 @@ static int io_sq_thread(void *data)
>   				atomic_or(IORING_SQ_NEED_WAKEUP,
>   						&ctx->rings->sq_flags);
>   				if ((ctx->flags & IORING_SETUP_IOPOLL) &&
> -				    !wq_list_empty(&ctx->iopoll_list)) {
> +				    !list_empty(&ctx->iopoll_list)) {
>   					needs_sched = false;
>   					break;
>   				}
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-11  7:38       ` Fengnan
@ 2025-12-11 10:22         ` Jens Axboe
  2025-12-11 10:33           ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2025-12-11 10:22 UTC (permalink / raw)
  To: Fengnan, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li

On 12/11/25 12:38 AM, Fengnan wrote:
> 
> 
> On 2025/12/11 12:10, Jens Axboe wrote:
>> On 12/10/25 7:15 PM, Jens Axboe wrote:
>>> On 12/10/25 1:55 AM, Fengnan Chang wrote:
>>>> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it
>>>> is considered that the current req is the actual completed request.
>>>> This may be reasonable for multi-queue ctx, but is problematic for
>>>> single-queue ctx because the current request may not be done when the
>>>> poll gets to the result. In this case, the completed io needs to wait
>>>> for the first io on the chain to complete before notifying the user,
>>>> which may cause io accumulation in the list.
>>>> Our modification plan is as follows: change io_wq_work_list to normal
>>>> list so that the iopoll_list list in it can be removed and put into the
>>>> comp_reqs list when the request is completed. This way each io is
>>>> handled independently and all gets processed in time.
>>>>
>>>> After modification,  test with:
>>>>
>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
>>>> /dev/nvme6n1
>>>>
>>>> base IOPS is 725K,  patch IOPS is 782K.
>>>>
>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
>>>> /dev/nvme6n1
>>>>
>>>> Base IOPS is 880k, patch IOPS is 895K.
>>> A few notes on this:
>>>
>>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is
>>>     necessarily safe. Yes generally this is invoked from the
>>>     owning/polling task, but that's not guaranteed.
>>>
>>> 2) The patch doesn't apply to the current tree, must be an older
>>>     version?
>>>
>>> 3) When hand-applied, it still throws a compile warning about an unused
>>>     variable. Please don't send untested stuff...
>>>
>>> 4) Don't just blatantly bloat the io_kiocb. When you change from a
>>>     singly to a doubly linked list, you're growing the io_kiocb size. You
>>>     should be able to use a union with struct io_task_work for example.
>>>     That's already 16b in size - win/win as you don't need to slow down
>>>     the cache management as that can keep using the linkage it currently
>>>     is using, and you're not bloating the io_kiocb.
>>>
>>> 5) The already mentioned point about the cache free list now being
>>>     doubly linked. This is generally a _bad_ idea as removing and adding
>>>     entries now need to touch other entries too. That's not very cache
>>>     friendly.
>>>
>>> #1 is kind of the big one, as it means you'll need to re-think how you
>>> do this. I do agree that the current approach isn't necessarily ideal as
>>> we don't process completions as quickly as we could, so I think there's
>>> merrit in continuing this work.
>> Proof of concept below, entirely untested, at a conference. Basically
>> does what I describe above, and retains the list manipulation logic
>> on the iopoll side, rather than on the completion side. Can you take
>> a look at this? And if it runs, can you do some numbers on that too?
> 
> This patch works, and in my test case, the performance is identical to
> my patch.

Good!

> But there is a small problem, now looking for completed requests,
> always need to traverse the whole iopoll_list. this can be a bit
> inefficient in some cases, for example if the previous sent 128K io ,
> the last io is 4K, the last io will be returned much earlier, this
> kind of scenario can not be verified in the current test. I'm not sure
> if it's a meaningful scenario.

Not sure that's a big problem, you're just spinning to complete anyway.
You could add your iob->nr_reqs or something, and break after finding
those you know have completed. That won't necessarily change anything, could
still be the last one that completed. Would probably make more sense to
pass in 'min_events' or similar and stop after that. But I think mostly
tweaks that can be made after the fact. If you want to send out a new
patch based on the one I sent, feel free to.
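
Something along these lines, as a rough sketch (the max_events cap and
its plumbing are illustrative, not existing code):

	list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) {
		/* order with io_complete_rw_iopoll(), e.g. ->result updates */
		if (!smp_load_acquire(&req->iopoll_completed))
			continue;
		list_del(&req->iopoll_node);
		wq_list_add_tail(&req->comp_list, &ctx->submit_state.compl_reqs);
		req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL);
		if (req->opcode != IORING_OP_URING_CMD)
			io_req_rw_cleanup(req, 0);
		/* hypothetical early stop once enough completions are reaped */
		if (++nr_events >= max_events)
			break;
	}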

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-11 10:22         ` Jens Axboe
@ 2025-12-11 10:33           ` Jens Axboe
  2025-12-11 11:13             ` Fengnan Chang
  2025-12-12  1:41             ` Fengnan Chang
  0 siblings, 2 replies; 22+ messages in thread
From: Jens Axboe @ 2025-12-11 10:33 UTC (permalink / raw)
  To: Fengnan, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li

On 12/11/25 3:22 AM, Jens Axboe wrote:
> On 12/11/25 12:38 AM, Fengnan wrote:
>>
>>
>> On 2025/12/11 12:10, Jens Axboe wrote:
>>> On 12/10/25 7:15 PM, Jens Axboe wrote:
>>>> On 12/10/25 1:55 AM, Fengnan Chang wrote:
>>>>> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it
>>>>> is considered that the current req is the actual completed request.
>>>>> This may be reasonable for multi-queue ctx, but is problematic for
>>>>> single-queue ctx because the current request may not be done when the
>>>>> poll gets to the result. In this case, the completed io needs to wait
>>>>> for the first io on the chain to complete before notifying the user,
>>>>> which may cause io accumulation in the list.
>>>>> Our modification plan is as follows: change io_wq_work_list to normal
>>>>> list so that the iopoll_list list in it can be removed and put into the
>>>>> comp_reqs list when the request is completed. This way each io is
>>>>> handled independently and all gets processed in time.
>>>>>
>>>>> After modification,  test with:
>>>>>
>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
>>>>> /dev/nvme6n1
>>>>>
>>>>> base IOPS is 725K,  patch IOPS is 782K.
>>>>>
>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
>>>>> /dev/nvme6n1
>>>>>
>>>>> Base IOPS is 880k, patch IOPS is 895K.
>>>> A few notes on this:
>>>>
>>>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is
>>>>     necessarily safe. Yes generally this is invoked from the
>>>>     owning/polling task, but that's not guaranteed.
>>>>
>>>> 2) The patch doesn't apply to the current tree, must be an older
>>>>     version?
>>>>
>>>> 3) When hand-applied, it still throws a compile warning about an unused
>>>>     variable. Please don't send untested stuff...
>>>>
>>>> 4) Don't just blatantly bloat the io_kiocb. When you change from a
>>>>     singly to a doubly linked list, you're growing the io_kiocb size. You
>>>>     should be able to use a union with struct io_task_work for example.
>>>>     That's already 16b in size - win/win as you don't need to slow down
>>>>     the cache management as that can keep using the linkage it currently
>>>>     is using, and you're not bloating the io_kiocb.
>>>>
>>>> 5) The already mentioned point about the cache free list now being
>>>>     doubly linked. This is generally a _bad_ idea as removing and adding
>>>>     entries now need to touch other entries too. That's not very cache
>>>>     friendly.
>>>>
>>>> #1 is kind of the big one, as it means you'll need to re-think how you
>>>> do this. I do agree that the current approach isn't necessarily ideal as
>>>> we don't process completions as quickly as we could, so I think there's
>>>> merrit in continuing this work.
>>> Proof of concept below, entirely untested, at a conference. Basically
>>> does what I describe above, and retains the list manipulation logic
>>> on the iopoll side, rather than on the completion side. Can you take
>>> a look at this? And if it runs, can you do some numbers on that too?
>>
>> This patch works, and in my test case, the performance is identical to
>> my patch.
> 
> Good!
> 
>> But there is a small problem, now looking for completed requests,
>> always need to traverse the whole iopoll_list. this can be a bit
>> inefficient in some cases, for example if the previous sent 128K io ,
>> the last io is 4K, the last io will be returned much earlier, this
>> kind of scenario can not be verified in the current test. I'm not sure
>> if it's a meaningful scenario.
> 
> Not sure that's a big problem, you're just spinning to complete anyway.
> You could add your iob->nr_reqs or something, and break after finding
> those know have completed. That won't necessarily change anything, could
> still be the last one that completed. Would probably make more sense to
> pass in 'min_events' or similar and stop after that. But I think mostly
> tweaks that can be made after the fact. If you want to send out a new
> patch based on the one I sent, feel free to.

Eg, something like this on top would do that. Like I mentioned earlier,
you cannot do the list manipulation where you did it, it's not safe. You
have to defer it to reaping time. If we could do it from the callback
where we mark it complete, then that would surely make things more
trivial and avoid iteration when not needed.


diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index cae9e857aea4..93709ee6c3ee 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -917,6 +917,7 @@ static inline bool blk_mq_add_to_batch(struct request *req,
 	else if (iob->complete != complete)
 		return false;
 	iob->need_ts |= blk_mq_need_time_stamp(req);
+	iob->nr_reqs++;
 	rq_list_add_tail(&iob->req_list, req);
 	return true;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 72e34acd439c..9335b552e040 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1821,6 +1821,7 @@ void bdev_fput(struct file *bdev_file);
 struct io_comp_batch {
 	struct rq_list req_list;
 	bool need_ts;
+	int nr_reqs;
 	void (*complete)(struct io_comp_batch *);
 };
 
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 307f1f39d9f3..37b5b2328f24 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1358,6 +1358,8 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		iob.complete(&iob);
 
 	list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) {
+		if (nr_events == iob.nr_reqs)
+			break;
 		/* order with io_complete_rw_iopoll(), e.g. ->result updates */
 		if (!smp_load_acquire(&req->iopoll_completed))
 			continue;

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-11 10:33           ` Jens Axboe
@ 2025-12-11 11:13             ` Fengnan Chang
  2025-12-11 11:19               ` Jens Axboe
  2025-12-12  1:41             ` Fengnan Chang
  1 sibling, 1 reply; 22+ messages in thread
From: Fengnan Chang @ 2025-12-11 11:13 UTC (permalink / raw)
  To: Jens Axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li



On 2025/12/11 18:33, Jens Axboe wrote:
> On 12/11/25 3:22 AM, Jens Axboe wrote:
>> On 12/11/25 12:38 AM, Fengnan wrote:
>>>
>>> On 2025/12/11 12:10, Jens Axboe wrote:
>>>> On 12/10/25 7:15 PM, Jens Axboe wrote:
>>>>> On 12/10/25 1:55 AM, Fengnan Chang wrote:
>>>>>> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it
>>>>>> is considered that the current req is the actual completed request.
>>>>>> This may be reasonable for multi-queue ctx, but is problematic for
>>>>>> single-queue ctx because the current request may not be done when the
>>>>>> poll gets to the result. In this case, the completed io needs to wait
>>>>>> for the first io on the chain to complete before notifying the user,
>>>>>> which may cause io accumulation in the list.
>>>>>> Our modification plan is as follows: change io_wq_work_list to normal
>>>>>> list so that the iopoll_list list in it can be removed and put into the
>>>>>> comp_reqs list when the request is completed. This way each io is
>>>>>> handled independently and all gets processed in time.
>>>>>>
>>>>>> After modification,  test with:
>>>>>>
>>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
>>>>>> /dev/nvme6n1
>>>>>>
>>>>>> base IOPS is 725K,  patch IOPS is 782K.
>>>>>>
>>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
>>>>>> /dev/nvme6n1
>>>>>>
>>>>>> Base IOPS is 880k, patch IOPS is 895K.
>>>>> A few notes on this:
>>>>>
>>>>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is
>>>>>      necessarily safe. Yes generally this is invoked from the
>>>>>      owning/polling task, but that's not guaranteed.
>>>>>
>>>>> 2) The patch doesn't apply to the current tree, must be an older
>>>>>      version?
>>>>>
>>>>> 3) When hand-applied, it still throws a compile warning about an unused
>>>>>      variable. Please don't send untested stuff...
>>>>>
>>>>> 4) Don't just blatantly bloat the io_kiocb. When you change from a
>>>>>      singly to a doubly linked list, you're growing the io_kiocb size. You
>>>>>      should be able to use a union with struct io_task_work for example.
>>>>>      That's already 16b in size - win/win as you don't need to slow down
>>>>>      the cache management as that can keep using the linkage it currently
>>>>>      is using, and you're not bloating the io_kiocb.
>>>>>
>>>>> 5) The already mentioned point about the cache free list now being
>>>>>      doubly linked. This is generally a _bad_ idea as removing and adding
>>>>>      entries now need to touch other entries too. That's not very cache
>>>>>      friendly.
>>>>>
>>>>> #1 is kind of the big one, as it means you'll need to re-think how you
>>>>> do this. I do agree that the current approach isn't necessarily ideal as
>>>>> we don't process completions as quickly as we could, so I think there's
>>>>> merrit in continuing this work.
>>>> Proof of concept below, entirely untested, at a conference. Basically
>>>> does what I describe above, and retains the list manipulation logic
>>>> on the iopoll side, rather than on the completion side. Can you take
>>>> a look at this? And if it runs, can you do some numbers on that too?
>>> This patch works, and in my test case, the performance is identical to
>>> my patch.
>> Good!
>>
>>> But there is a small problem, now looking for completed requests,
>>> always need to traverse the whole iopoll_list. this can be a bit
>>> inefficient in some cases, for example if the previous sent 128K io ,
>>> the last io is 4K, the last io will be returned much earlier, this
>>> kind of scenario can not be verified in the current test. I'm not sure
>>> if it's a meaningful scenario.
>> Not sure that's a big problem, you're just spinning to complete anyway.
>> You could add your iob->nr_reqs or something, and break after finding
>> those know have completed. That won't necessarily change anything, could
>> still be the last one that completed. Would probably make more sense to
>> pass in 'min_events' or similar and stop after that. But I think mostly
>> tweaks that can be made after the fact. If you want to send out a new
>> patch based on the one I sent, feel free to.
> Eg, something like this on top would do that. Like I mentioned earlier,
> you cannot do the list manipulation where you did it, it's not safe. You
> have to defer it to reaping time. If we could do it from the callback
> where we mark it complete, then that would surely make things more
> trivial and avoid iteration when not needed.

Yes, it's not safe to do the list manipulation in io_complete_rw_iopoll.
It looks more reasonable with the modifications below.
Your changes look good enough, but please give me more time, I'd
like to do some more testing and rethink this.

>
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index cae9e857aea4..93709ee6c3ee 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -917,6 +917,7 @@ static inline bool blk_mq_add_to_batch(struct request *req,
>   	else if (iob->complete != complete)
>   		return false;
>   	iob->need_ts |= blk_mq_need_time_stamp(req);
> +	iob->nr_reqs++;
>   	rq_list_add_tail(&iob->req_list, req);
>   	return true;
>   }
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 72e34acd439c..9335b552e040 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1821,6 +1821,7 @@ void bdev_fput(struct file *bdev_file);
>   struct io_comp_batch {
>   	struct rq_list req_list;
>   	bool need_ts;
> +	int nr_reqs;
>   	void (*complete)(struct io_comp_batch *);
>   };
>   
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 307f1f39d9f3..37b5b2328f24 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -1358,6 +1358,8 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
>   		iob.complete(&iob);
>   
>   	list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) {
> +		if (nr_events == iob.nr_reqs)
> +			break;
>   		/* order with io_complete_rw_iopoll(), e.g. ->result updates */
>   		if (!smp_load_acquire(&req->iopoll_completed))
>   			continue;
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-11 11:13             ` Fengnan Chang
@ 2025-12-11 11:19               ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2025-12-11 11:19 UTC (permalink / raw)
  To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li

On 12/11/25 4:13 AM, Fengnan Chang wrote:
> 
> 
> On 2025/12/11 18:33, Jens Axboe wrote:
>> On 12/11/25 3:22 AM, Jens Axboe wrote:
>>> On 12/11/25 12:38 AM, Fengnan wrote:
>>>>
>>>> On 2025/12/11 12:10, Jens Axboe wrote:
>>>>> On 12/10/25 7:15 PM, Jens Axboe wrote:
>>>>>> On 12/10/25 1:55 AM, Fengnan Chang wrote:
>>>>>>> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it
>>>>>>> is considered that the current req is the actual completed request.
>>>>>>> This may be reasonable for multi-queue ctx, but is problematic for
>>>>>>> single-queue ctx because the current request may not be done when the
>>>>>>> poll gets to the result. In this case, the completed io needs to wait
>>>>>>> for the first io on the chain to complete before notifying the user,
>>>>>>> which may cause io accumulation in the list.
>>>>>>> Our modification plan is as follows: change io_wq_work_list to normal
>>>>>>> list so that the iopoll_list list in it can be removed and put into the
>>>>>>> comp_reqs list when the request is completed. This way each io is
>>>>>>> handled independently and all gets processed in time.
>>>>>>>
>>>>>>> After modification,  test with:
>>>>>>>
>>>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
>>>>>>> /dev/nvme6n1
>>>>>>>
>>>>>>> base IOPS is 725K,  patch IOPS is 782K.
>>>>>>>
>>>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
>>>>>>> /dev/nvme6n1
>>>>>>>
>>>>>>> Base IOPS is 880k, patch IOPS is 895K.
>>>>>> A few notes on this:
>>>>>>
>>>>>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is
>>>>>>      necessarily safe. Yes generally this is invoked from the
>>>>>>      owning/polling task, but that's not guaranteed.
>>>>>>
>>>>>> 2) The patch doesn't apply to the current tree, must be an older
>>>>>>      version?
>>>>>>
>>>>>> 3) When hand-applied, it still throws a compile warning about an unused
>>>>>>      variable. Please don't send untested stuff...
>>>>>>
>>>>>> 4) Don't just blatantly bloat the io_kiocb. When you change from a
>>>>>>      singly to a doubly linked list, you're growing the io_kiocb size. You
>>>>>>      should be able to use a union with struct io_task_work for example.
>>>>>>      That's already 16b in size - win/win as you don't need to slow down
>>>>>>      the cache management as that can keep using the linkage it currently
>>>>>>      is using, and you're not bloating the io_kiocb.
>>>>>>
>>>>>> 5) The already mentioned point about the cache free list now being
>>>>>>      doubly linked. This is generally a _bad_ idea as removing and adding
>>>>>>      entries now need to touch other entries too. That's not very cache
>>>>>>      friendly.
>>>>>>
>>>>>> #1 is kind of the big one, as it means you'll need to re-think how you
>>>>>> do this. I do agree that the current approach isn't necessarily ideal as
>>>>>> we don't process completions as quickly as we could, so I think there's
>>>>>> merrit in continuing this work.
>>>>> Proof of concept below, entirely untested, at a conference. Basically
>>>>> does what I describe above, and retains the list manipulation logic
>>>>> on the iopoll side, rather than on the completion side. Can you take
>>>>> a look at this? And if it runs, can you do some numbers on that too?
>>>> This patch works, and in my test case, the performance is identical to
>>>> my patch.
>>> Good!
>>>
>>>> But there is a small problem, now looking for completed requests,
>>>> always need to traverse the whole iopoll_list. this can be a bit
>>>> inefficient in some cases, for example if the previous sent 128K io ,
>>>> the last io is 4K, the last io will be returned much earlier, this
>>>> kind of scenario can not be verified in the current test. I'm not sure
>>>> if it's a meaningful scenario.
>>> Not sure that's a big problem, you're just spinning to complete anyway.
>>> You could add your iob->nr_reqs or something, and break after finding
>>> those know have completed. That won't necessarily change anything, could
>>> still be the last one that completed. Would probably make more sense to
>>> pass in 'min_events' or similar and stop after that. But I think mostly
>>> tweaks that can be made after the fact. If you want to send out a new
>>> patch based on the one I sent, feel free to.
>> Eg, something like this on top would do that. Like I mentioned earlier,
>> you cannot do the list manipulation where you did it, it's not safe. You
>> have to defer it to reaping time. If we could do it from the callback
>> where we mark it complete, then that would surely make things more
>> trivial and avoid iteration when not needed.
> 
> Yes, it's not safe do the list manipulation in io_complete_rw_iopoll.
> It looks more reasonable with the following modifications.
> Your changes look good enough, but please give me more time, I'd
> like to do some more testing and rethink this.

Of course, there's no rush - we're in the merge window anyway, so this
will be targeted for 6.20/7.0 either way.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-11 10:33           ` Jens Axboe
  2025-12-11 11:13             ` Fengnan Chang
@ 2025-12-12  1:41             ` Fengnan Chang
  2025-12-12  1:53               ` Jens Axboe
  1 sibling, 1 reply; 22+ messages in thread
From: Fengnan Chang @ 2025-12-12  1:41 UTC (permalink / raw)
  To: Jens Axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li



On 2025/12/11 18:33, Jens Axboe wrote:
> On 12/11/25 3:22 AM, Jens Axboe wrote:
>> On 12/11/25 12:38 AM, Fengnan wrote:
>>>
>>> On 2025/12/11 12:10, Jens Axboe wrote:
>>>> On 12/10/25 7:15 PM, Jens Axboe wrote:
>>>>> On 12/10/25 1:55 AM, Fengnan Chang wrote:
>>>>>> In the io_do_iopoll function, when the poll loop of iopoll_list ends, it
>>>>>> is considered that the current req is the actual completed request.
>>>>>> This may be reasonable for multi-queue ctx, but is problematic for
>>>>>> single-queue ctx because the current request may not be done when the
>>>>>> poll gets to the result. In this case, the completed io needs to wait
>>>>>> for the first io on the chain to complete before notifying the user,
>>>>>> which may cause io accumulation in the list.
>>>>>> Our modification plan is as follows: change io_wq_work_list to normal
>>>>>> list so that the iopoll_list list in it can be removed and put into the
>>>>>> comp_reqs list when the request is completed. This way each io is
>>>>>> handled independently and all gets processed in time.
>>>>>>
>>>>>> After modification,  test with:
>>>>>>
>>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
>>>>>> /dev/nvme6n1
>>>>>>
>>>>>> base IOPS is 725K,  patch IOPS is 782K.
>>>>>>
>>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
>>>>>> /dev/nvme6n1
>>>>>>
>>>>>> Base IOPS is 880k, patch IOPS is 895K.
>>>>> A few notes on this:
>>>>>
>>>>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is
>>>>>      necessarily safe. Yes generally this is invoked from the
>>>>>      owning/polling task, but that's not guaranteed.
>>>>>
>>>>> 2) The patch doesn't apply to the current tree, must be an older
>>>>>      version?
>>>>>
>>>>> 3) When hand-applied, it still throws a compile warning about an unused
>>>>>      variable. Please don't send untested stuff...
>>>>>
>>>>> 4) Don't just blatantly bloat the io_kiocb. When you change from a
>>>>>      singly to a doubly linked list, you're growing the io_kiocb size. You
>>>>>      should be able to use a union with struct io_task_work for example.
>>>>>      That's already 16b in size - win/win as you don't need to slow down
>>>>>      the cache management as that can keep using the linkage it currently
>>>>>      is using, and you're not bloating the io_kiocb.
>>>>>
>>>>> 5) The already mentioned point about the cache free list now being
>>>>>      doubly linked. This is generally a _bad_ idea as removing and adding
>>>>>      entries now need to touch other entries too. That's not very cache
>>>>>      friendly.
>>>>>
>>>>> #1 is kind of the big one, as it means you'll need to re-think how you
>>>>> do this. I do agree that the current approach isn't necessarily ideal as
>>>>> we don't process completions as quickly as we could, so I think there's
>>>>> merrit in continuing this work.
>>>> Proof of concept below, entirely untested, at a conference. Basically
>>>> does what I describe above, and retains the list manipulation logic
>>>> on the iopoll side, rather than on the completion side. Can you take
>>>> a look at this? And if it runs, can you do some numbers on that too?
>>> This patch works, and in my test case, the performance is identical to
>>> my patch.
>> Good!
>>
>>> But there is a small problem, now looking for completed requests,
>>> always need to traverse the whole iopoll_list. this can be a bit
>>> inefficient in some cases, for example if the previous sent 128K io ,
>>> the last io is 4K, the last io will be returned much earlier, this
>>> kind of scenario can not be verified in the current test. I'm not sure
>>> if it's a meaningful scenario.
>> Not sure that's a big problem, you're just spinning to complete anyway.
>> You could add your iob->nr_reqs or something, and break after finding
>> those know have completed. That won't necessarily change anything, could
>> still be the last one that completed. Would probably make more sense to
>> pass in 'min_events' or similar and stop after that. But I think mostly
>> tweaks that can be made after the fact. If you want to send out a new
>> patch based on the one I sent, feel free to.
> Eg, something like this on top would do that. Like I mentioned earlier,
> you cannot do the list manipulation where you did it, it's not safe. You
> have to defer it to reaping time. If we could do it from the callback
> where we mark it complete, then that would surely make things more
> trivial and avoid iteration when not needed.

Oh, we can't add the nr_events == iob.nr_reqs check: if
blk_mq_add_to_batch() fails, the completed IO is never added to the iob
and iob.nr_reqs stays 0, so the reap loop breaks on its first iteration
without flushing the completion, which may cause an IO hang.

>
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index cae9e857aea4..93709ee6c3ee 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -917,6 +917,7 @@ static inline bool blk_mq_add_to_batch(struct request *req,
>   	else if (iob->complete != complete)
>   		return false;
>   	iob->need_ts |= blk_mq_need_time_stamp(req);
> +	iob->nr_reqs++;
>   	rq_list_add_tail(&iob->req_list, req);
>   	return true;
>   }
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 72e34acd439c..9335b552e040 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1821,6 +1821,7 @@ void bdev_fput(struct file *bdev_file);
>   struct io_comp_batch {
>   	struct rq_list req_list;
>   	bool need_ts;
> +	int nr_reqs;
>   	void (*complete)(struct io_comp_batch *);
>   };
>   
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 307f1f39d9f3..37b5b2328f24 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -1358,6 +1358,8 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
>   		iob.complete(&iob);
>   
>   	list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) {
> +		if (nr_events == iob.nr_reqs)
> +			break;
>   		/* order with io_complete_rw_iopoll(), e.g. ->result updates */
>   		if (!smp_load_acquire(&req->iopoll_completed))
>   			continue;
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-12  1:41             ` Fengnan Chang
@ 2025-12-12  1:53               ` Jens Axboe
  2025-12-12  2:12                 ` Fengnan Chang
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2025-12-12  1:53 UTC (permalink / raw)
  To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li

On 12/11/25 6:41 PM, Fengnan Chang wrote:
> Oh, we can't add nr_events == iob.nr_reqs check, if
> blk_mq_add_to_batch add failed, completed IO will not add into iob,
> iob.nr_reqs will be 0, this may cause io hang.

Indeed, won't work as-is.

I do think we're probably making a bigger deal out of the full loop than
necessary. At least I'd be perfectly happy with just the current patch;
performance should be better than what we currently have. Ideally we'd
have just one loop for polling and catching the completed items, but
that's a bit tricky with the batch completions.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-12  1:53               ` Jens Axboe
@ 2025-12-12  2:12                 ` Fengnan Chang
  2025-12-12  5:11                   ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Fengnan Chang @ 2025-12-12  2:12 UTC (permalink / raw)
  To: Jens Axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li



On 2025/12/12 09:53, Jens Axboe wrote:
> On 12/11/25 6:41 PM, Fengnan Chang wrote:
>> Oh, we can't add nr_events == iob.nr_reqs check, if
>> blk_mq_add_to_batch add failed, completed IO will not add into iob,
>> iob.nr_reqs will be 0, this may cause io hang.
> Indeed, won't work as-is.
>
> I do think we're probably making a bigger deal out of the full loop than
> necessary. At least I'd be perfectly happy with just the current patch,
> performance should be better there than we currently have it. Ideally
> we'd have just one loop for polling and catching the completed items,
> but that's a bit tricky with the batch completions.
Yes, ideally one loop would be enough, but given that there are also
multi_queue ctx, that doesn't seem to be possible.
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-12  2:12                 ` Fengnan Chang
@ 2025-12-12  5:11                   ` Jens Axboe
  2025-12-12  8:58                     ` Jens Axboe
  2025-12-12 13:32                     ` Diangang Li
  0 siblings, 2 replies; 22+ messages in thread
From: Jens Axboe @ 2025-12-12  5:11 UTC (permalink / raw)
  To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li

On 12/11/25 7:12 PM, Fengnan Chang wrote:
> 
> 
> ? 2025/12/12 09:53, Jens Axboe ??:
>> On 12/11/25 6:41 PM, Fengnan Chang wrote:
>>> Oh, we can't add nr_events == iob.nr_reqs check, if
>>> blk_mq_add_to_batch add failed, completed IO will not add into iob,
>>> iob.nr_reqs will be 0, this may cause io hang.
>> Indeed, won't work as-is.
>>
>> I do think we're probably making a bigger deal out of the full loop than
>> necessary. At least I'd be perfectly happy with just the current patch,
>> performance should be better there than we currently have it. Ideally
>> we'd have just one loop for polling and catching the completed items,
>> but that's a bit tricky with the batch completions.
>
> Yes, ideally one loop would be enough, but given that there are also
> multi_queue ctx, that doesn't seem to be possible.

It's not removing the double loop, but the below could help _only_
iterate completed requests at the end. Rather than move items off the
current list in the completion callback, have a separate list just for
completed requests. Then we can simply iterate that, knowing all of
them have completed. It gets rid of ->iopoll_completed as well, and
then we can move the poll_refs. That's not really related at all;
obviously this patch should be split into multiple pieces.

This uses a lockless list. But since the producer and consumer are
generally the same task, that should not add any real overhead. On top
of the previous one I sent. What do you think?

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 54fd30abf2b8..2d67d95a64ee 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -317,6 +317,7 @@ struct io_ring_ctx {
 		 */
 		bool			poll_multi_queue;
 		struct list_head	iopoll_list;
+		struct llist_head	iopoll_complete;
 
 		struct io_file_table	file_table;
 		struct io_rsrc_data	buf_table;
@@ -672,8 +673,9 @@ struct io_kiocb {
 	};
 
 	u8				opcode;
-	/* polled IO has completed */
-	u8				iopoll_completed;
+
+	bool				cancel_seq_set;
+
 	/*
 	 * Can be either a fixed buffer index, or used with provided buffers.
 	 * For the latter, it points to the selected buffer ID.
@@ -700,6 +702,7 @@ struct io_kiocb {
 	union {
 		/* used by request caches, completion batching and iopoll */
 		struct io_wq_work_node	comp_list;
+		struct llist_node	iopoll_done_list;
 		/* cache ->apoll->events */
 		__poll_t apoll_events;
 	};
@@ -707,7 +710,7 @@ struct io_kiocb {
 	struct io_rsrc_node		*file_node;
 
 	atomic_t			refs;
-	bool				cancel_seq_set;
+	atomic_t			poll_refs;
 
 	/*
 	 * IOPOLL doesn't use task_work, so use the ->iopoll_node list
@@ -734,7 +737,6 @@ struct io_kiocb {
 	/* opcode allocated if it needs to store data for async defer */
 	void				*async_data;
 	/* linked requests, IFF REQ_F_HARDLINK or REQ_F_LINK are set */
-	atomic_t			poll_refs;
 	struct io_kiocb			*link;
 	/* custom credentials, valid IFF REQ_F_CREDS is set */
 	const struct cred		*creds;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 05a660c97316..5e503a0bfcfc 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -335,6 +335,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	spin_lock_init(&ctx->completion_lock);
 	raw_spin_lock_init(&ctx->timeout_lock);
 	INIT_LIST_HEAD(&ctx->iopoll_list);
+	init_llist_head(&ctx->iopoll_complete);
 	INIT_LIST_HEAD(&ctx->defer_list);
 	INIT_LIST_HEAD(&ctx->timeout_list);
 	INIT_LIST_HEAD(&ctx->ltimeout_list);
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 307f1f39d9f3..ad481ca74a46 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -604,8 +604,7 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res)
 			req->cqe.res = res;
 	}
 
-	/* order with io_iopoll_complete() checking ->iopoll_completed */
-	smp_store_release(&req->iopoll_completed, 1);
+	llist_add(&req->iopoll_done_list, &req->ctx->iopoll_complete);
 }
 
 static inline void io_rw_done(struct io_kiocb *req, ssize_t ret)
@@ -870,7 +869,6 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 			return -EOPNOTSUPP;
 		kiocb->private = NULL;
 		kiocb->ki_flags |= IOCB_HIPRI;
-		req->iopoll_completed = 0;
 		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) {
 			/* make sure every req only blocks once*/
 			req->flags &= ~REQ_F_IOPOLL_STATE;
@@ -1317,7 +1315,8 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 {
 	unsigned int poll_flags = 0;
 	DEFINE_IO_COMP_BATCH(iob);
-	struct io_kiocb *req, *tmp;
+	struct llist_node *node;
+	struct io_kiocb *req;
 	int nr_events = 0;
 
 	/*
@@ -1327,17 +1326,12 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 	if (ctx->poll_multi_queue || force_nonspin)
 		poll_flags |= BLK_POLL_ONESHOT;
 
+	/*
+	 * Loop over uncompleted polled IO requests, and poll for them.
+	 */
 	list_for_each_entry(req, &ctx->iopoll_list, iopoll_node) {
 		int ret;
 
-		/*
-		 * Move completed and retryable entries to our local lists.
-		 * If we find a request that requires polling, break out
-		 * and complete those lists first, if we have entries there.
-		 */
-		if (READ_ONCE(req->iopoll_completed))
-			break;
-
 		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL)
 			ret = io_uring_hybrid_poll(req, &iob, poll_flags);
 		else
@@ -1349,24 +1343,25 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 			poll_flags |= BLK_POLL_ONESHOT;
 
 		/* iopoll may have completed current req */
-		if (!rq_list_empty(&iob.req_list) ||
-		    READ_ONCE(req->iopoll_completed))
+		if (!rq_list_empty(&iob.req_list))
 			break;
 	}
 
 	if (!rq_list_empty(&iob.req_list))
 		iob.complete(&iob);
 
-	list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) {
-		/* order with io_complete_rw_iopoll(), e.g. ->result updates */
-		if (!smp_load_acquire(&req->iopoll_completed))
-			continue;
+	node = llist_del_all(&ctx->iopoll_complete);
+	while (node) {
+		struct llist_node *next = node->next;
+
+		req = container_of(node, struct io_kiocb, iopoll_done_list);
 		list_del(&req->iopoll_node);
 		wq_list_add_tail(&req->comp_list, &ctx->submit_state.compl_reqs);
 		nr_events++;
 		req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL);
 		if (req->opcode != IORING_OP_URING_CMD)
 			io_req_rw_cleanup(req, 0);
+		node = next;
 	}
 	if (nr_events)
 		__io_submit_flush_completions(ctx);
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 197474911f04..0841fa541f5d 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -159,8 +159,7 @@ void __io_uring_cmd_done(struct io_uring_cmd *ioucmd, s32 ret, u64 res2,
 	}
 	io_req_uring_cleanup(req, issue_flags);
 	if (req->ctx->flags & IORING_SETUP_IOPOLL) {
-		/* order with io_iopoll_req_issued() checking ->iopoll_complete */
-		smp_store_release(&req->iopoll_completed, 1);
+		llist_add(&req->iopoll_done_list, &req->ctx->iopoll_complete);
 	} else if (issue_flags & IO_URING_F_COMPLETE_DEFER) {
 		if (WARN_ON_ONCE(issue_flags & IO_URING_F_UNLOCKED))
 			return;
@@ -252,7 +251,6 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
 		if (!file->f_op->uring_cmd_iopoll)
 			return -EOPNOTSUPP;
 		issue_flags |= IO_URING_F_IOPOLL;
-		req->iopoll_completed = 0;
 		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) {
 			/* make sure every req only blocks once */
 			req->flags &= ~REQ_F_IOPOLL_STATE;
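
For reference, the llist_add()/llist_del_all() pairing above boils down
to a lock-free LIFO push plus one atomic swap that detaches the whole
list. A minimal, self-contained user-space sketch of that pattern,
assuming C11 atomics and toy names (struct node, list_push,
list_take_all are illustrative, not the kernel llist API):

#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct node {
	struct node *next;
	int id;
};

typedef _Atomic(struct node *) list_head;

/* push one node; a single CAS, safe against concurrent pushers */
static void list_push(list_head *head, struct node *n)
{
	struct node *old = atomic_load(head);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(head, &old, n));
}

/* detach the entire list in one exchange; the caller walks it privately */
static struct node *list_take_all(list_head *head)
{
	return atomic_exchange(head, NULL);
}

int main(void)
{
	struct node nodes[3] = { { .id = 1 }, { .id = 2 }, { .id = 3 } };
	list_head head = NULL;
	struct node *n;

	for (int i = 0; i < 3; i++)
		list_push(&head, &nodes[i]);

	/* newest first (LIFO), like walking the detached llist_node chain */
	for (n = list_take_all(&head); n; n = n->next)
		printf("completed node %d\n", n->id);

	return 0;
}

Since the completion callback and io_do_iopoll() usually run in the same
polling task here, the push CAS should almost never have to retry, which
is why this shouldn't add any real overhead.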

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-12  5:11                   ` Jens Axboe
@ 2025-12-12  8:58                     ` Jens Axboe
  2025-12-12  9:49                       ` Fengnan Chang
  2025-12-12 13:32                     ` Diangang Li
  1 sibling, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2025-12-12  8:58 UTC (permalink / raw)
  To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li

On 12/11/25 10:11 PM, Jens Axboe wrote:
> On 12/11/25 7:12 PM, Fengnan Chang wrote:
>>
>>
>> On 2025/12/12 09:53, Jens Axboe wrote:
>>> On 12/11/25 6:41 PM, Fengnan Chang wrote:
>>>> Oh, we can't add nr_events == iob.nr_reqs check, if
>>>> blk_mq_add_to_batch add failed, completed IO will not add into iob,
>>>> iob.nr_reqs will be 0, this may cause io hang.
>>> Indeed, won't work as-is.
>>>
>>> I do think we're probably making a bigger deal out of the full loop than
>>> necessary. At least I'd be perfectly happy with just the current patch,
>>> performance should be better there than we currently have it. Ideally
>>> we'd have just one loop for polling and catching the completed items,
>>> but that's a bit tricky with the batch completions.
>>
>> Yes, ideally one loop would be enough, but given that there are also
>> multi_queue ctx, that doesn't seem to be possible.
> 
> It's not removing the double loop, but the below could help _only_
> iterate completed requests at the end. Rather than move items between
> the current list at the completion callback, have a separate list just
> for completed requests. Then we can simply iterate that, knowing all of
> them have completed. Gets rid of the ->iopoll_completed as well, and
> then we can move the poll_refs. Not really related at all, obviously
> this patch should be split into multiple pieces.
> 
> This uses a lockless list. But since the producer and consumer are
> generally the same task, that should not add any real overhead. On top
> of the previous one I sent. What do you think?

Ran some quick testing, as one interesting case is mixing slower and
faster devices. Let's take this basic example:

t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 -t1 -n2 /dev/nvme32n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

where nvme32n1 is about 1.27M IOPS, and the other 3 do about 3.3M IOPS,
and we poll 2 devices with each IO thread. With the current kernel, we
get:

IOPS=5.18M, BW=2.53GiB/s, IOS/call=31/32
IOPS=5.19M, BW=2.53GiB/s, IOS/call=31/31
IOPS=5.17M, BW=2.53GiB/s, IOS/call=31/31

and with the two patches we get:

IOPS=6.54M, BW=3.19GiB/s, IOS/call=31/31
IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31
IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31

or about a 25% improvement. This is mostly due to the issue you
highlighted, where you end up with later completions (that are done)
being stuck behind waiting on a slower completion.

Note: obviously 1 thread driving multiple devices for polling could
still be improved, and in fact it does improve if we simply change -c32
to something lower. The more important case is the one you identified,
where different completion times on the same device will hold
completions up. Multi device polling is just an interesting way to kind
of emulate that, to an extent.

This effect is (naturally) also apparent in the completion latencies as
well, particularly in the higher percentiles.

Ran peak testing too, and it's better all around than before.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-12  8:58                     ` Jens Axboe
@ 2025-12-12  9:49                       ` Fengnan Chang
  2025-12-12 20:22                         ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Fengnan Chang @ 2025-12-12  9:49 UTC (permalink / raw)
  To: Jens Axboe, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li



On 2025/12/12 16:58, Jens Axboe wrote:
> On 12/11/25 10:11 PM, Jens Axboe wrote:
>> On 12/11/25 7:12 PM, Fengnan Chang wrote:
>>>
>>> On 2025/12/12 09:53, Jens Axboe wrote:
>>>> On 12/11/25 6:41 PM, Fengnan Chang wrote:
>>>>> Oh, we can't add nr_events == iob.nr_reqs check, if
>>>>> blk_mq_add_to_batch add failed, completed IO will not add into iob,
>>>>> iob.nr_reqs will be 0, this may cause io hang.
>>>> Indeed, won't work as-is.
>>>>
>>>> I do think we're probably making a bigger deal out of the full loop than
>>>> necessary. At least I'd be perfectly happy with just the current patch,
>>>> performance should be better there than we currently have it. Ideally
>>>> we'd have just one loop for polling and catching the completed items,
>>>> but that's a bit tricky with the batch completions.
>>> Yes, ideally one loop would be enough, but given that there are also
>>> multi_queue ctx, that doesn't seem to be possible.
>> It's not removing the double loop, but the below could help _only_
>> iterate completed requests at the end. Rather than move items between
>> the current list at the completion callback, have a separate list just
>> for completed requests. Then we can simply iterate that, knowing all of
>> them have completed. Gets rid of the ->iopoll_completed as well, and
>> then we can move the poll_refs. Not really related at all, obviously
>> this patch should be split into multiple pieces.
>>
>> This uses a lockless list. But since the producer and consumer are
>> generally the same task, that should not add any real overhead. On top
>> of the previous one I sent. What do you think?
> Ran some quick testing, as one interesting case is mixing slower and
> faster devices. Let's take this basic example:
>
> t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 -t1 -n2 /dev/nvme32n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>
> where nvme32n1 is about 1.27M IOPS, and the other 3 do about 3.3M IOPS,
> and we poll 2 devices with each IO thread. With the current kernel, we
> get:
>
> IOPS=5.18M, BW=2.53GiB/s, IOS/call=31/32
> IOPS=5.19M, BW=2.53GiB/s, IOS/call=31/31
> IOPS=5.17M, BW=2.53GiB/s, IOS/call=31/31
>
> and with the two patches we get:
>
> IOPS=6.54M, BW=3.19GiB/s, IOS/call=31/31
> IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31
> IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31
>
> or about a 25% improvement. This is mostly due to the issue you
> highlighted, where you end up with later completions (that are done)
> being stuck behind waiting on a slower completion.
>
> Note: obviously 1 thread driving multiple devices for polling could
> still be improved, and in fact it does improve if we simply change -c32
> to something lower. The more important case is the one you identified,
> where different completion times on the same device will hold
> completions up. Multi device polling is just an interesting way to kind
> of emulate that, to an extent.
>
> This effect is (naturally) also apparent in the completion latencies as
> well, particularly in the higher percentiles.
>
> Ran peak testing too, and it's better all around than before.
I love the patch; I had a similar thought, and this addresses my
concern. I ran a simple test and the performance is a bit better than
with the previous version.

Base IOPS is 725K, the previous patch was 782K, and it's now 790K.
It looks like all the problems are solved; I'll do more testing next week.
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-12  5:11                   ` Jens Axboe
  2025-12-12  8:58                     ` Jens Axboe
@ 2025-12-12 13:32                     ` Diangang Li
  2025-12-12 20:09                       ` Jens Axboe
  1 sibling, 1 reply; 22+ messages in thread
From: Diangang Li @ 2025-12-12 13:32 UTC (permalink / raw)
  To: Jens Axboe, Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang

On 2025/12/12 13:11, Jens Axboe wrote:
> On 12/11/25 7:12 PM, Fengnan Chang wrote:
>>
>>
>> On 2025/12/12 09:53, Jens Axboe wrote:
>>> On 12/11/25 6:41 PM, Fengnan Chang wrote:
>>>> Oh, we can't add nr_events == iob.nr_reqs check, if
>>>> blk_mq_add_to_batch add failed, completed IO will not add into iob,
>>>> iob.nr_reqs will be 0, this may cause io hang.
>>> Indeed, won't work as-is.
>>>
>>> I do think we're probably making a bigger deal out of the full loop than
>>> necessary. At least I'd be perfectly happy with just the current patch,
>>> performance should be better there than we currently have it. Ideally
>>> we'd have just one loop for polling and catching the completed items,
>>> but that's a bit tricky with the batch completions.
>>
>> Yes, ideally one loop would be enough, but given that there are also
>> multi_queue ctx, that doesn't seem to be possible.
> 
> It's not removing the double loop, but the below could help _only_
> iterate completed requests at the end. Rather than move items between
> the current list at the completion callback, have a separate list just
> for completed requests. Then we can simply iterate that, knowing all of
> them have completed. Gets rid of the ->iopoll_completed as well, and
> then we can move the poll_refs. Not really related at all, obviously
> this patch should be split into multiple pieces.
> 
> This uses a lockless list. But since the producer and consumer are
> generally the same task, that should not add any real overhead. On top
> of the previous one I sent. What do you think?
> 
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 54fd30abf2b8..2d67d95a64ee 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -317,6 +317,7 @@ struct io_ring_ctx {
>   		 */
>   		bool			poll_multi_queue;
>   		struct list_head	iopoll_list;
> +		struct llist_head	iopoll_complete;
>   
>   		struct io_file_table	file_table;
>   		struct io_rsrc_data	buf_table;
> @@ -672,8 +673,9 @@ struct io_kiocb {
>   	};
>   
>   	u8				opcode;
> -	/* polled IO has completed */
> -	u8				iopoll_completed;
> +
> +	bool				cancel_seq_set;
> +
>   	/*
>   	 * Can be either a fixed buffer index, or used with provided buffers.
>   	 * For the latter, it points to the selected buffer ID.
> @@ -700,6 +702,7 @@ struct io_kiocb {
>   	union {
>   		/* used by request caches, completion batching and iopoll */
>   		struct io_wq_work_node	comp_list;
> +		struct llist_node	iopoll_done_list;
>   		/* cache ->apoll->events */
>   		__poll_t apoll_events;
>   	};
> @@ -707,7 +710,7 @@ struct io_kiocb {
>   	struct io_rsrc_node		*file_node;
>   
>   	atomic_t			refs;
> -	bool				cancel_seq_set;
> +	atomic_t			poll_refs;
>   
>   	/*
>   	 * IOPOLL doesn't use task_work, so use the ->iopoll_node list
> @@ -734,7 +737,6 @@ struct io_kiocb {
>   	/* opcode allocated if it needs to store data for async defer */
>   	void				*async_data;
>   	/* linked requests, IFF REQ_F_HARDLINK or REQ_F_LINK are set */
> -	atomic_t			poll_refs;
>   	struct io_kiocb			*link;
>   	/* custom credentials, valid IFF REQ_F_CREDS is set */
>   	const struct cred		*creds;
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 05a660c97316..5e503a0bfcfc 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -335,6 +335,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
>   	spin_lock_init(&ctx->completion_lock);
>   	raw_spin_lock_init(&ctx->timeout_lock);
>   	INIT_LIST_HEAD(&ctx->iopoll_list);
> +	init_llist_head(&ctx->iopoll_complete);
>   	INIT_LIST_HEAD(&ctx->defer_list);
>   	INIT_LIST_HEAD(&ctx->timeout_list);
>   	INIT_LIST_HEAD(&ctx->ltimeout_list);
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 307f1f39d9f3..ad481ca74a46 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -604,8 +604,7 @@ static void io_complete_rw_iopoll(struct kiocb *kiocb, long res)
>   			req->cqe.res = res;
>   	}
>   
> -	/* order with io_iopoll_complete() checking ->iopoll_completed */
> -	smp_store_release(&req->iopoll_completed, 1);
> +	llist_add(&req->iopoll_done_list, &req->ctx->iopoll_complete);
>   }
>   
>   static inline void io_rw_done(struct io_kiocb *req, ssize_t ret)
> @@ -870,7 +869,6 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
>   			return -EOPNOTSUPP;
>   		kiocb->private = NULL;
>   		kiocb->ki_flags |= IOCB_HIPRI;
> -		req->iopoll_completed = 0;
>   		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) {
>   			/* make sure every req only blocks once*/
>   			req->flags &= ~REQ_F_IOPOLL_STATE;
> @@ -1317,7 +1315,8 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
>   {
>   	unsigned int poll_flags = 0;
>   	DEFINE_IO_COMP_BATCH(iob);
> -	struct io_kiocb *req, *tmp;
> +	struct llist_node *node;
> +	struct io_kiocb *req;
>   	int nr_events = 0;
>   
>   	/*
> @@ -1327,17 +1326,12 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
>   	if (ctx->poll_multi_queue || force_nonspin)
>   		poll_flags |= BLK_POLL_ONESHOT;
>   
> +	/*
> +	 * Loop over uncompleted polled IO requests, and poll for them.
> +	 */
>   	list_for_each_entry(req, &ctx->iopoll_list, iopoll_node) {
>   		int ret;
>   
> -		/*
> -		 * Move completed and retryable entries to our local lists.
> -		 * If we find a request that requires polling, break out
> -		 * and complete those lists first, if we have entries there.
> -		 */
> -		if (READ_ONCE(req->iopoll_completed))
> -			break;

Suggest keeping iopoll_completed here to avoid unnecessary subsequent 
polling and to process IRQ-completed requests promptly.

> -
>   		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL)
>   			ret = io_uring_hybrid_poll(req, &iob, poll_flags);
>   		else
> @@ -1349,24 +1343,25 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
>   			poll_flags |= BLK_POLL_ONESHOT;
>   
>   		/* iopoll may have completed current req */
> -		if (!rq_list_empty(&iob.req_list) ||
> -		    READ_ONCE(req->iopoll_completed))
> +		if (!rq_list_empty(&iob.req_list))
>   			break;
>   	}
>   
>   	if (!rq_list_empty(&iob.req_list))
>   		iob.complete(&iob);
>   
> -	list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) {
> -		/* order with io_complete_rw_iopoll(), e.g. ->result updates */
> -		if (!smp_load_acquire(&req->iopoll_completed))
> -			continue;
> +	node = llist_del_all(&ctx->iopoll_complete);
> +	while (node) {
> +		struct llist_node *next = node->next;
> +
> +		req = container_of(node, struct io_kiocb, iopoll_done_list);
>   		list_del(&req->iopoll_node);
>   		wq_list_add_tail(&req->comp_list, &ctx->submit_state.compl_reqs);
>   		nr_events++;
>   		req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL);
>   		if (req->opcode != IORING_OP_URING_CMD)
>   			io_req_rw_cleanup(req, 0);
> +		node = next;
>   	}
>   	if (nr_events)
>   		__io_submit_flush_completions(ctx);
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 197474911f04..0841fa541f5d 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -159,8 +159,7 @@ void __io_uring_cmd_done(struct io_uring_cmd *ioucmd, s32 ret, u64 res2,
>   	}
>   	io_req_uring_cleanup(req, issue_flags);
>   	if (req->ctx->flags & IORING_SETUP_IOPOLL) {
> -		/* order with io_iopoll_req_issued() checking ->iopoll_complete */
> -		smp_store_release(&req->iopoll_completed, 1);
> +		llist_add(&req->iopoll_done_list, &req->ctx->iopoll_complete);
>   	} else if (issue_flags & IO_URING_F_COMPLETE_DEFER) {
>   		if (WARN_ON_ONCE(issue_flags & IO_URING_F_UNLOCKED))
>   			return;
> @@ -252,7 +251,6 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>   		if (!file->f_op->uring_cmd_iopoll)
>   			return -EOPNOTSUPP;
>   		issue_flags |= IO_URING_F_IOPOLL;
> -		req->iopoll_completed = 0;
>   		if (ctx->flags & IORING_SETUP_HYBRID_IOPOLL) {
>   			/* make sure every req only blocks once */
>   			req->flags &= ~REQ_F_IOPOLL_STATE;

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-12 13:32                     ` Diangang Li
@ 2025-12-12 20:09                       ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2025-12-12 20:09 UTC (permalink / raw)
  To: Diangang Li, Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang

On 12/12/25 6:32 AM, Diangang Li wrote:
>> @@ -1327,17 +1326,12 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
>>   	if (ctx->poll_multi_queue || force_nonspin)
>>   		poll_flags |= BLK_POLL_ONESHOT;
>>   
>> +	/*
>> +	 * Loop over uncompleted polled IO requests, and poll for them.
>> +	 */
>>   	list_for_each_entry(req, &ctx->iopoll_list, iopoll_node) {
>>   		int ret;
>>   
>> -		/*
>> -		 * Move completed and retryable entries to our local lists.
>> -		 * If we find a request that requires polling, break out
>> -		 * and complete those lists first, if we have entries there.
>> -		 */
>> -		if (READ_ONCE(req->iopoll_completed))
>> -			break;
> 
> Suggest keeping iopoll_completed here to avoid unnecessary subsequent 
> polling and to process IRQ-completed requests promptly.

There should not be any IRQ completed requests in here. The block layer
used to allow that, but that should no longer be the case. If it's a
polled request, then it will by definition end up in a polled queue and
need iopoll completion. Or it'll sit for a while and be completed by a
timeout. If someone is still allowing polled IO with IRQ completions
then that should be fixed, and there's no reason why we should try and
catch those cases here. Will not happen with NVMe, for example.

For the other point, there's no way for ->iopoll_completed to be set
before iob.complete(&iob) has been called to invoke the callback, which
in turn sets it. Hence there's no way for us to end up in this loop and
have ->iopoll_completed set.

So no, I don't think we need it, and honestly it wasn't even really
needed in the posted changes either.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
  2025-12-12  9:49                       ` Fengnan Chang
@ 2025-12-12 20:22                         ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2025-12-12 20:22 UTC (permalink / raw)
  To: Fengnan Chang, asml.silence, io-uring; +Cc: Fengnan Chang, Diangang Li

On 12/12/25 2:49 AM, Fengnan Chang wrote:
> 
> 
> ? 2025/12/12 16:58, Jens Axboe ??:
>> On 12/11/25 10:11 PM, Jens Axboe wrote:
>>> On 12/11/25 7:12 PM, Fengnan Chang wrote:
>>>>
>>>> On 2025/12/12 09:53, Jens Axboe wrote:
>>>>> On 12/11/25 6:41 PM, Fengnan Chang wrote:
>>>>>> Oh, we can't add nr_events == iob.nr_reqs check, if
>>>>>> blk_mq_add_to_batch add failed, completed IO will not add into iob,
>>>>>> iob.nr_reqs will be 0, this may cause io hang.
>>>>> Indeed, won't work as-is.
>>>>>
>>>>> I do think we're probably making a bigger deal out of the full loop than
>>>>> necessary. At least I'd be perfectly happy with just the current patch,
>>>>> performance should be better there than we currently have it. Ideally
>>>>> we'd have just one loop for polling and catching the completed items,
>>>>> but that's a bit tricky with the batch completions.
>>>> Yes, ideally one loop would be enough, but given that there are also
>>>> multi_queue ctx, that doesn't seem to be possible.
>>> It's not removing the double loop, but the below could help _only_
>>> iterate completed requests at the end. Rather than move items between
>>> the current list at the completion callback, have a separate list just
>>> for completed requests. Then we can simply iterate that, knowing all of
>>> them have completed. Gets rid of the ->iopoll_completed as well, and
>>> then we can move the poll_refs. Not really related at all, obviously
>>> this patch should be split into multiple pieces.
>>>
>>> This uses a lockless list. But since the producer and consumer are
>>> generally the same task, that should not add any real overhead. On top
>>> of the previous one I sent. What do you think?
>> Ran some quick testing, as one interesting case is mixing slower and
>> faster devices. Let's take this basic example:
>>
>> t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 -t1 -n2 /dev/nvme32n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>
>> where nvme32n1 is about 1.27M IOPS, and the other 3 do about 3.3M IOPS,
>> and we poll 2 devices with each IO thread. With the current kernel, we
>> get:
>>
>> IOPS=5.18M, BW=2.53GiB/s, IOS/call=31/32
>> IOPS=5.19M, BW=2.53GiB/s, IOS/call=31/31
>> IOPS=5.17M, BW=2.53GiB/s, IOS/call=31/31
>>
>> and with the two patches we get:
>>
>> IOPS=6.54M, BW=3.19GiB/s, IOS/call=31/31
>> IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31
>> IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31
>>
>> or about a 25% improvement. This is mostly due to the issue you
>> highlighted, where you end up with later completions (that are done)
>> being stuck behind waiting on a slower completion.
>>
>> Note: obviously 1 thread driving multiple devices for polling could
>> still be improved, and in fact it does improve if we simply change -c32
>> to something lower. The more important case is the one you identified,
>> where different completion times on the same device will hold
>> completions up. Multi device polling is just an interesting way to kind
>> of emulate that, to an extent.
>>
>> This effect is (naturally) also apparent in the completion latencies as
>> well, particularly in the higher percentiles.
>>
>> Ran peak testing too, and it's better all around than before.
> I love the patch, I had a similar thought, this addresses my concern,  I simple tested it
> and the performance is a bit better than the previous performance.
> 
> base IOPS is 725K,  previous IOPS is 782K, now 790k.
> It looks like all the problems are solved,I 'll do more testing next week.

Nice, thanks! FWIW, I put the patches in a branch here, just to have
them somewhat organized and easier to iterate/test on:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/log/?h=io_uring-iopoll

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2025-12-12 20:22 UTC | newest]

Thread overview: 22+ messages
2025-12-10  8:54 [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode Fengnan Chang
2025-12-10  8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang
2025-12-10  9:19   ` Jens Axboe
2025-12-10  9:53   ` Jens Axboe
2025-12-10  8:55 ` [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode Fengnan Chang
2025-12-11  2:15   ` Jens Axboe
2025-12-11  4:10     ` Jens Axboe
2025-12-11  7:38       ` Fengnan
2025-12-11 10:22         ` Jens Axboe
2025-12-11 10:33           ` Jens Axboe
2025-12-11 11:13             ` Fengnan Chang
2025-12-11 11:19               ` Jens Axboe
2025-12-12  1:41             ` Fengnan Chang
2025-12-12  1:53               ` Jens Axboe
2025-12-12  2:12                 ` Fengnan Chang
2025-12-12  5:11                   ` Jens Axboe
2025-12-12  8:58                     ` Jens Axboe
2025-12-12  9:49                       ` Fengnan Chang
2025-12-12 20:22                         ` Jens Axboe
2025-12-12 13:32                     ` Diangang Li
2025-12-12 20:09                       ` Jens Axboe
2025-12-10  9:53 ` (subset) [RFC PATCH 0/2] " Jens Axboe
