From: Fengnan <fengnanchang@gmail.com>
To: Jens Axboe <axboe@kernel.dk>,
asml.silence@gmail.com, io-uring@vger.kernel.org
Cc: Fengnan Chang <changfengnan@bytedance.com>,
Diangang Li <lidiangang@bytedance.com>
Subject: Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
Date: Thu, 11 Dec 2025 15:38:49 +0800
Message-ID: <0661763c-4f56-4895-afd2-7346bb2452e4@gmail.com>
In-Reply-To: <69f81ed8-2b4a-461f-90b8-0b9752140f8d@kernel.dk>
On 2025/12/11 12:10, Jens Axboe wrote:
> On 12/10/25 7:15 PM, Jens Axboe wrote:
>> On 12/10/25 1:55 AM, Fengnan Chang wrote:
>>> In io_do_iopoll(), when the poll loop over iopoll_list ends, the
>>> current req is treated as the last actually completed request. This
>>> may be reasonable for a multi-queue ctx, but it is problematic for a
>>> single-queue ctx, because the current request may not yet be done
>>> when polling reaches it. In that case a completed IO has to wait for
>>> the first IO on the chain to complete before the user is notified,
>>> which can cause IOs to accumulate in the list.
>>> Our modification plan is as follows: change io_wq_work_list to a
>>> normal list, so that a request can be removed from iopoll_list and
>>> put onto the comp_reqs list as soon as it completes. This way each IO
>>> is handled independently and all of them are processed in time.
>>>
>>> After modification, test with:
>>>
>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
>>> /dev/nvme6n1
>>>
>>> Base IOPS is 725K, patch IOPS is 782K.
>>>
>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
>>> /dev/nvme6n1
>>>
>>> Base IOPS is 880K, patch IOPS is 895K.
>> A few notes on this:
>>
>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is
>> necessarily safe. Yes generally this is invoked from the
>> owning/polling task, but that's not guaranteed.
>>
>> 2) The patch doesn't apply to the current tree, must be an older
>> version?
>>
>> 3) When hand-applied, it still throws a compile warning about an unused
>> variable. Please don't send untested stuff...
>>
>> 4) Don't just blatantly bloat the io_kiocb. When you change from a
>> singly to a doubly linked list, you're growing the io_kiocb size. You
>> should be able to use a union with struct io_task_work for example.
>> That's already 16b in size - win/win as you don't need to slow down
>> the cache management as that can keep using the linkage it currently
>> is using, and you're not bloating the io_kiocb.
>>
>> 5) The already mentioned point about the cache free list now being
>> doubly linked. This is generally a _bad_ idea as removing and adding
>> entries now need to touch other entries too. That's not very cache
>> friendly.
>>
>> #1 is kind of the big one, as it means you'll need to re-think how you
>> do this. I do agree that the current approach isn't necessarily ideal as
>> we don't process completions as quickly as we could, so I think there's
>> merit in continuing this work.
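
As a side note on point 5, the cache-friendliness difference is visible
directly in the list primitives. The sketch below is simplified from
include/linux/list.h and io_uring/slist.h and is only meant to illustrate
the write pattern, not quote the kernel code exactly:

/* field layout as in the kernel, trimmed for illustration */
struct list_head { struct list_head *next, *prev; };
struct io_wq_work_node { struct io_wq_work_node *next; };

/* doubly linked: deleting an entry writes to both neighbours' cachelines */
static inline void list_del_simplified(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
}

/* singly linked stack (the current req cache): a pop touches only the head */
static inline struct io_wq_work_node *
wq_stack_pop_simplified(struct io_wq_work_node *stack)
{
	struct io_wq_work_node *node = stack->next;

	stack->next = node->next;
	return node;
}

So keeping the request cache on the singly linked type and converting only
iopoll_list avoids paying that extra cost on every alloc/free.
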
> Proof of concept below, entirely untested, at a conference. Basically
> does what I describe above, and retains the list manipulation logic
> on the iopoll side, rather than on the completion side. Can you take
> a look at this? And if it runs, can you do some numbers on that too?
This patch works, and in my test case the performance is identical to my
patch.
But there is a small problem: to find completed requests we now always
have to traverse the whole iopoll_list, which can be a bit inefficient in
some cases. For example, if the earlier submissions are 128K IOs and the
last IO is 4K, the last IO will complete much earlier; this kind of
scenario cannot be verified with the current test.
I'm not sure whether it is a meaningful scenario.
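
If that case ever matters, something like the liburing sketch below
(purely illustrative, untested, and not part of either patch) could be
used to exercise it: a 128K read is queued ahead of a 4K read on the same
IOPOLL ring, so the small IO is likely to finish first while sitting
behind the large one on the poll list.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <liburing.h>

#define BIG_SZ   (128 * 1024)
#define SMALL_SZ 4096

int main(int argc, char **argv)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	void *big_buf, *small_buf;
	int fd, i;

	if (argc < 2)
		return 1;
	/* IOPOLL needs O_DIRECT and a poll-capable device, e.g. a raw nvme */
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0)
		return 1;
	if (posix_memalign(&big_buf, 4096, BIG_SZ) ||
	    posix_memalign(&small_buf, 4096, SMALL_SZ))
		return 1;
	if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL))
		return 1;

	/* queue the 128K read first, then the 4K read behind it */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, big_buf, BIG_SZ, 0);
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, small_buf, SMALL_SZ, BIG_SZ);
	io_uring_submit(&ring);

	/* reaping drives io_do_iopoll(); watch which IO shows up first */
	for (i = 0; i < 2; i++) {
		if (io_uring_wait_cqe(&ring, &cqe))
			break;
		printf("cqe: res=%d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}
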
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index e1adb0d20a0a..54fd30abf2b8 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -316,7 +316,7 @@ struct io_ring_ctx {
> * manipulate the list, hence no extra locking is needed there.
> */
> bool poll_multi_queue;
> - struct io_wq_work_list iopoll_list;
> + struct list_head iopoll_list;
>
> struct io_file_table file_table;
> struct io_rsrc_data buf_table;
> @@ -708,7 +708,16 @@ struct io_kiocb {
>
> atomic_t refs;
> bool cancel_seq_set;
> - struct io_task_work io_task_work;
> +
> + /*
> + * IOPOLL doesn't use task_work, so use the ->iopoll_node list
> + * entry to manage pending iopoll requests.
> + */
> + union {
> + struct io_task_work io_task_work;
> + struct list_head iopoll_node;
> + };
> +
> union {
> /*
> * for polled requests, i.e. IORING_OP_POLL_ADD and async armed
> diff --git a/io_uring/cancel.c b/io_uring/cancel.c
> index ca12ac10c0ae..4136bf04464a 100644
> --- a/io_uring/cancel.c
> +++ b/io_uring/cancel.c
> @@ -534,7 +534,7 @@ __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
> /* SQPOLL thread does its own polling */
> if ((!(ctx->flags & IORING_SETUP_SQPOLL) && cancel_all) ||
> is_sqpoll_thread) {
> - while (!wq_list_empty(&ctx->iopoll_list)) {
> + while (!list_empty(&ctx->iopoll_list)) {
> io_iopoll_try_reap_events(ctx);
> ret = true;
> cond_resched();
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 6cb24cdf8e68..fdb0b43f6fb5 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -334,7 +334,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
> init_waitqueue_head(&ctx->poll_wq);
> spin_lock_init(&ctx->completion_lock);
> raw_spin_lock_init(&ctx->timeout_lock);
> - INIT_WQ_LIST(&ctx->iopoll_list);
> + INIT_LIST_HEAD(&ctx->iopoll_list);
> INIT_LIST_HEAD(&ctx->defer_list);
> INIT_LIST_HEAD(&ctx->timeout_list);
> INIT_LIST_HEAD(&ctx->ltimeout_list);
> @@ -1561,7 +1561,7 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
> return;
>
> mutex_lock(&ctx->uring_lock);
> - while (!wq_list_empty(&ctx->iopoll_list)) {
> + while (!list_empty(&ctx->iopoll_list)) {
> /* let it sleep and repeat later if can't complete a request */
> if (io_do_iopoll(ctx, true) == 0)
> break;
> @@ -1626,21 +1626,18 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
> * forever, while the workqueue is stuck trying to acquire the
> * very same mutex.
> */
> - if (wq_list_empty(&ctx->iopoll_list) ||
> - io_task_work_pending(ctx)) {
> + if (list_empty(&ctx->iopoll_list) || io_task_work_pending(ctx)) {
> u32 tail = ctx->cached_cq_tail;
>
> (void) io_run_local_work_locked(ctx, min_events);
>
> - if (task_work_pending(current) ||
> - wq_list_empty(&ctx->iopoll_list)) {
> + if (task_work_pending(current) || list_empty(&ctx->iopoll_list)) {
> mutex_unlock(&ctx->uring_lock);
> io_run_task_work();
> mutex_lock(&ctx->uring_lock);
> }
> /* some requests don't go through iopoll_list */
> - if (tail != ctx->cached_cq_tail ||
> - wq_list_empty(&ctx->iopoll_list))
> + if (tail != ctx->cached_cq_tail || list_empty(&ctx->iopoll_list))
> break;
> }
> ret = io_do_iopoll(ctx, !min_events);
> @@ -1683,25 +1680,17 @@ static void io_iopoll_req_issued(struct io_kiocb *req, unsigned int issue_flags)
> * how we do polling eventually, not spinning if we're on potentially
> * different devices.
> */
> - if (wq_list_empty(&ctx->iopoll_list)) {
> + if (list_empty(&ctx->iopoll_list)) {
> ctx->poll_multi_queue = false;
> } else if (!ctx->poll_multi_queue) {
> struct io_kiocb *list_req;
>
> - list_req = container_of(ctx->iopoll_list.first, struct io_kiocb,
> - comp_list);
> + list_req = list_entry(ctx->iopoll_list.next, struct io_kiocb, iopoll_node);
> if (list_req->file != req->file)
> ctx->poll_multi_queue = true;
> }
>
> - /*
> - * For fast devices, IO may have already completed. If it has, add
> - * it to the front so we find it first.
> - */
> - if (READ_ONCE(req->iopoll_completed))
> - wq_list_add_head(&req->comp_list, &ctx->iopoll_list);
> - else
> - wq_list_add_tail(&req->comp_list, &ctx->iopoll_list);
> + list_add_tail(&req->iopoll_node, &ctx->iopoll_list);
>
> if (unlikely(needs_lock)) {
> /*
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 70ca88cc1f54..307f1f39d9f3 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -1315,9 +1315,9 @@ static int io_uring_hybrid_poll(struct io_kiocb *req,
>
> int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
> {
> - struct io_wq_work_node *pos, *start, *prev;
> unsigned int poll_flags = 0;
> DEFINE_IO_COMP_BATCH(iob);
> + struct io_kiocb *req, *tmp;
> int nr_events = 0;
>
> /*
> @@ -1327,8 +1327,7 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
> if (ctx->poll_multi_queue || force_nonspin)
> poll_flags |= BLK_POLL_ONESHOT;
>
> - wq_list_for_each(pos, start, &ctx->iopoll_list) {
> - struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list);
> + list_for_each_entry(req, &ctx->iopoll_list, iopoll_node) {
> int ret;
>
> /*
> @@ -1357,31 +1356,20 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
>
> if (!rq_list_empty(&iob.req_list))
> iob.complete(&iob);
> - else if (!pos)
> - return 0;
> -
> - prev = start;
> - wq_list_for_each_resume(pos, prev) {
> - struct io_kiocb *req = container_of(pos, struct io_kiocb, comp_list);
>
> + list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) {
> /* order with io_complete_rw_iopoll(), e.g. ->result updates */
> if (!smp_load_acquire(&req->iopoll_completed))
> - break;
> + continue;
> + list_del(&req->iopoll_node);
> + wq_list_add_tail(&req->comp_list, &ctx->submit_state.compl_reqs);
> nr_events++;
> req->cqe.flags = io_put_kbuf(req, req->cqe.res, NULL);
> if (req->opcode != IORING_OP_URING_CMD)
> io_req_rw_cleanup(req, 0);
> }
> - if (unlikely(!nr_events))
> - return 0;
> -
> - pos = start ? start->next : ctx->iopoll_list.first;
> - wq_list_cut(&ctx->iopoll_list, prev, start);
> -
> - if (WARN_ON_ONCE(!wq_list_empty(&ctx->submit_state.compl_reqs)))
> - return 0;
> - ctx->submit_state.compl_reqs.first = pos;
> - __io_submit_flush_completions(ctx);
> + if (nr_events)
> + __io_submit_flush_completions(ctx);
> return nr_events;
> }
>
> diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
> index 74c1a130cd87..becdfdd323a9 100644
> --- a/io_uring/sqpoll.c
> +++ b/io_uring/sqpoll.c
> @@ -212,7 +212,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd,
> if (cap_entries && to_submit > IORING_SQPOLL_CAP_ENTRIES_VALUE)
> to_submit = IORING_SQPOLL_CAP_ENTRIES_VALUE;
>
> - if (to_submit || !wq_list_empty(&ctx->iopoll_list)) {
> + if (to_submit || !list_empty(&ctx->iopoll_list)) {
> const struct cred *creds = NULL;
>
> io_sq_start_worktime(ist);
> @@ -221,7 +221,7 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, struct io_sq_data *sqd,
> creds = override_creds(ctx->sq_creds);
>
> mutex_lock(&ctx->uring_lock);
> - if (!wq_list_empty(&ctx->iopoll_list))
> + if (!list_empty(&ctx->iopoll_list))
> io_do_iopoll(ctx, true);
>
> /*
> @@ -344,7 +344,7 @@ static int io_sq_thread(void *data)
> list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
> int ret = __io_sq_thread(ctx, sqd, cap_entries, &ist);
>
> - if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
> + if (!sqt_spin && (ret > 0 || !list_empty(&ctx->iopoll_list)))
> sqt_spin = true;
> }
> if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE))
> @@ -379,7 +379,7 @@ static int io_sq_thread(void *data)
> atomic_or(IORING_SQ_NEED_WAKEUP,
> &ctx->rings->sq_flags);
> if ((ctx->flags & IORING_SETUP_IOPOLL) &&
> - !wq_list_empty(&ctx->iopoll_list)) {
> + !list_empty(&ctx->iopoll_list)) {
> needs_sched = false;
> break;
> }
>