public inbox for io-uring@vger.kernel.org
From: Jens Axboe <axboe@kernel.dk>
To: Fengnan <fengnanchang@gmail.com>,
	asml.silence@gmail.com, io-uring@vger.kernel.org
Cc: Fengnan Chang <changfengnan@bytedance.com>,
	Diangang Li <lidiangang@bytedance.com>
Subject: Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
Date: Thu, 11 Dec 2025 03:33:44 -0700	[thread overview]
Message-ID: <1acb251a-4c4a-479c-a51e-a8db9a6e0fa3@kernel.dk> (raw)
In-Reply-To: <0654d130-665a-4b1a-b99b-bb80ca06353a@kernel.dk>

On 12/11/25 3:22 AM, Jens Axboe wrote:
> On 12/11/25 12:38 AM, Fengnan wrote:
>>
>>
>> On 2025/12/11 12:10, Jens Axboe wrote:
>>> On 12/10/25 7:15 PM, Jens Axboe wrote:
>>>> On 12/10/25 1:55 AM, Fengnan Chang wrote:
>>>>> In io_do_iopoll(), when the poll loop over iopoll_list ends, the
>>>>> current req is assumed to be the request that actually completed.
>>>>> That may be reasonable for a multi-queue ctx, but it is problematic
>>>>> for a single-queue ctx, because the current request may not be done
>>>>> by the time the poll sees a result. In that case a completed IO has
>>>>> to wait for the first IO on the chain to complete before the user
>>>>> is notified, which can cause IOs to accumulate in the list.
>>>>> Our plan is as follows: change io_wq_work_list to a normal list so
>>>>> that an entry can be removed from iopoll_list and moved onto the
>>>>> comp_reqs list as soon as the request completes. That way each IO
>>>>> is handled independently and all of them are processed in time.
>>>>>
>>>>> After the modification, tested with:
>>>>>
>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
>>>>> /dev/nvme6n1
>>>>>
>>>>> Base IOPS is 725K, patched IOPS is 782K.
>>>>>
>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
>>>>> /dev/nvme6n1
>>>>>
>>>>> Base IOPS is 880K, patched IOPS is 895K.
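
As a rough illustration, the plan quoted above could look something like
the following (a hypothetical sketch based purely on the description -
io_iopoll_req_done(), the iopoll_node field, and the comp_reqs list are
assumed names, and see point 1 below for why doing this from the
completion side is problematic):

static void io_iopoll_req_done(struct io_kiocb *req)
{
	struct io_ring_ctx *ctx = req->ctx;

	/*
	 * Detach from the poll list and queue for completion posting;
	 * list_move_tail() is list_del() + list_add_tail().
	 */
	list_move_tail(&req->iopoll_node, &ctx->comp_reqs);
}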
>>>> A few notes on this:
>>>>
>>>> 1) I don't think manipulating the list in io_complete_rw_iopoll() is
>>>>     necessarily safe. Yes, generally it is invoked from the
>>>>     owning/polling task, but that's not guaranteed.
>>>>
>>>> 2) The patch doesn't apply to the current tree, must be an older
>>>>     version?
>>>>
>>>> 3) When hand-applied, it still throws a compile warning about an unused
>>>>     variable. Please don't send untested stuff...
>>>>
>>>> 4) Don't just blatantly bloat the io_kiocb. When you change from a
>>>>     singly to a doubly linked list, you're growing the io_kiocb size.
>>>>     You should be able to use a union with struct io_task_work, for
>>>>     example (see the sketch after these notes). That's already 16b in
>>>>     size - win/win: the cache management can keep using the linkage
>>>>     it currently uses, so it isn't slowed down, and you're not
>>>>     bloating the io_kiocb.
>>>>
>>>> 5) The already mentioned point about the cache free list now being
>>>>     doubly linked. This is generally a _bad_ idea as removing and adding
>>>>     entries now need to touch other entries too. That's not very cache
>>>>     friendly.
>>>>
>>>> #1 is kind of the big one, as it means you'll need to re-think how
>>>> you do this. I do agree that the current approach isn't necessarily
>>>> ideal, as we don't process completions as quickly as we could, so I
>>>> think there's merit in continuing this work.
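
To make point 4 above concrete, a minimal sketch of the union idea (the
field names mirror mainline io_uring, but the exact placement inside
io_kiocb is an assumption, not the actual patch):

struct io_kiocb {
	/* ... */
	union {
		/* linkage while the request is on ctx->iopoll_list */
		struct list_head	iopoll_node;
		/* task_work state; unused while iopoll linkage is live */
		struct io_task_work	io_task_work;
	};
	/* ... */
};

Both members are two pointers (16b on 64-bit), so the union adds no size
over the existing io_task_work field.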
>>> Proof of concept below, entirely untested, at a conference. Basically
>>> does what I describe above, and retains the list manipulation logic
>>> on the iopoll side, rather than on the completion side. Can you take
>>> a look at this? And if it runs, can you do some numbers on that too?
>>
>> This patch works, and in my test case, the performance is identical to
>> my patch.
> 
> Good!
> 
>> But there is a small problem: to find completed requests, we now
>> always have to traverse the whole iopoll_list. That can be a bit
>> inefficient in some cases. For example, if the earlier submitted IOs
>> are 128K and the last IO is 4K, the last IO will complete much
>> earlier. This kind of scenario can't be verified with the current
>> test, and I'm not sure whether it's meaningful.
> 
> Not sure that's a big problem, since you're just spinning to complete
> anyway. You could add your iob->nr_reqs or something, and break after
> finding that many known completions. That won't necessarily change
> anything - it could still be the last entry that completed. It would
> probably make more sense to pass in 'min_events' or similar and stop
> after that. But these are mostly tweaks that can be made after the
> fact. If you want to send out a new patch based on the one I sent,
> feel free to.
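
For reference, the 'min_events' variant suggested above might look
roughly like this (a sketch only - how min_events reaches io_do_iopoll()
and the actual reaping step are elided and assumed):

list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) {
	/* pairs with the ->iopoll_completed store on the completion side */
	if (!smp_load_acquire(&req->iopoll_completed))
		continue;
	nr_events++;		/* reap/post this completion here */
	if (nr_events >= min_events)
		break;
}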

E.g., something like this on top would do that. Like I mentioned
earlier, you cannot do the list manipulation where you did it; it's not
safe. You have to defer it to reaping time. If we could do it from the
callback where we mark the request complete, that would certainly make
things simpler and avoid iterating when not needed.


diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index cae9e857aea4..93709ee6c3ee 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -917,6 +917,7 @@ static inline bool blk_mq_add_to_batch(struct request *req,
 	else if (iob->complete != complete)
 		return false;
 	iob->need_ts |= blk_mq_need_time_stamp(req);
+	iob->nr_reqs++;
 	rq_list_add_tail(&iob->req_list, req);
 	return true;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 72e34acd439c..9335b552e040 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1821,6 +1821,7 @@ void bdev_fput(struct file *bdev_file);
 struct io_comp_batch {
 	struct rq_list req_list;
 	bool need_ts;
+	int nr_reqs;
 	void (*complete)(struct io_comp_batch *);
 };
 
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 307f1f39d9f3..37b5b2328f24 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1358,6 +1358,8 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
 		iob.complete(&iob);
 
 	list_for_each_entry_safe(req, tmp, &ctx->iopoll_list, iopoll_node) {
+		if (nr_events == iob.nr_reqs)
+			break;
 		/* order with io_complete_rw_iopoll(), e.g. ->result updates */
 		if (!smp_load_acquire(&req->iopoll_completed))
 			continue;

-- 
Jens Axboe

Thread overview: 22+ messages
2025-12-10  8:54 [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode Fengnan Chang
2025-12-10  8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang
2025-12-10  9:19   ` Jens Axboe
2025-12-10  9:53   ` Jens Axboe
2025-12-10  8:55 ` [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode Fengnan Chang
2025-12-11  2:15   ` Jens Axboe
2025-12-11  4:10     ` Jens Axboe
2025-12-11  7:38       ` Fengnan
2025-12-11 10:22         ` Jens Axboe
2025-12-11 10:33           ` Jens Axboe [this message]
2025-12-11 11:13             ` Fengnan Chang
2025-12-11 11:19               ` Jens Axboe
2025-12-12  1:41             ` Fengnan Chang
2025-12-12  1:53               ` Jens Axboe
2025-12-12  2:12                 ` Fengnan Chang
2025-12-12  5:11                   ` Jens Axboe
2025-12-12  8:58                     ` Jens Axboe
2025-12-12  9:49                       ` Fengnan Chang
2025-12-12 20:22                         ` Jens Axboe
2025-12-12 13:32                     ` Diangang Li
2025-12-12 20:09                       ` Jens Axboe
2025-12-10  9:53 ` (subset) [RFC PATCH 0/2] " Jens Axboe
