public inbox for io-uring@vger.kernel.org
From: Jens Axboe <axboe@kernel.dk>
To: Fengnan Chang <fengnanchang@gmail.com>,
	asml.silence@gmail.com, io-uring@vger.kernel.org
Cc: Fengnan Chang <changfengnan@bytedance.com>,
	Diangang Li <lidiangang@bytedance.com>
Subject: Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
Date: Thu, 11 Dec 2025 04:19:32 -0700	[thread overview]
Message-ID: <545bf70d-0155-49e7-9356-d296d9c17562@kernel.dk> (raw)
In-Reply-To: <9a63350f-b599-4f00-8d0b-4da2dbe99fc2@gmail.com>

On 12/11/25 4:13 AM, Fengnan Chang wrote:
> 
> 
> ? 2025/12/11 18:33, Jens Axboe ??:
>> On 12/11/25 3:22 AM, Jens Axboe wrote:
>>> On 12/11/25 12:38 AM, Fengnan wrote:
>>>>
>>>> On 2025/12/11 12:10, Jens Axboe wrote:
>>>>> On 12/10/25 7:15 PM, Jens Axboe wrote:
>>>>>> On 12/10/25 1:55 AM, Fengnan Chang wrote:
>>>>>>> In io_do_iopoll(), when the poll loop over iopoll_list ends, the
>>>>>>> current req is assumed to be the request that actually completed.
>>>>>>> That may be reasonable for a multi-queue ctx, but it is problematic
>>>>>>> for a single-queue ctx, because the current request may not yet be
>>>>>>> done when the poll returns a result. In that case, requests that
>>>>>>> have already completed must wait for the first io on the list to
>>>>>>> complete before the user is notified, which can cause io to
>>>>>>> accumulate in the list.
>>>>>>> Our plan is as follows: change the io_wq_work_list to a normal list
>>>>>>> so that a request can be removed from iopoll_list and moved to the
>>>>>>> comp_reqs list as soon as it completes. This way each io is handled
>>>>>>> independently and all of them are processed in time.
>>>>>>>
>>>>>>> After the modification, test with:
>>>>>>>
>>>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1
>>>>>>> /dev/nvme6n1
>>>>>>>
>>>>>>> Base IOPS is 725K, patch IOPS is 782K.
>>>>>>>
>>>>>>> ./t/io_uring -p1 -d128 -b4096 -s32 -c1 -F1 -B1 -R1 -X1 -n1 -P1
>>>>>>> /dev/nvme6n1
>>>>>>>
>>>>>>> Base IOPS is 880K, patch IOPS is 895K.
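
To make the accumulation described in the quoted changelog concrete, here is
a small standalone illustration. This is plain userspace C, not io_uring
code; the struct and function names (pollable_req, reap_in_order) are
invented for the example. It shows a reaper that walks a singly linked poll
list in submission order: a request that completes behind a still-in-flight
head is not reported until the head finishes.

/*
 * Standalone illustration only -- not io_uring code.  Types and names
 * are hypothetical simplifications of the behaviour described above.
 */
#include <stdbool.h>
#include <stdio.h>

struct pollable_req {
	struct pollable_req *next;	/* singly linked, submission order */
	bool done;			/* set once the device completes it */
	int id;
};

/*
 * Reap completions in submission order: stop at the first request that
 * is not yet done.  Anything that completed *behind* that request stays
 * on the list and is not reported until the head finishes -- the
 * accumulation the changelog above is talking about.
 */
static int reap_in_order(struct pollable_req **head)
{
	int reaped = 0;

	while (*head && (*head)->done) {
		printf("completed req %d\n", (*head)->id);
		*head = (*head)->next;
		reaped++;
	}
	return reaped;
}

int main(void)
{
	struct pollable_req r2 = { .next = NULL, .done = true, .id = 2 };
	struct pollable_req r1 = { .next = &r2, .done = false, .id = 1 };
	struct pollable_req *head = &r1;

	/* r2 is done, but nothing is reported while r1 is still in flight. */
	printf("reaped %d\n", reap_in_order(&head));
	return 0;
}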
>>>>>> A few notes on this:
>>>>>>
>>>>>> 1) Manipulating the list in io_complete_rw_iopoll() I don't think is
>>>>>>      necessarily safe. Yes generally this is invoked from the
>>>>>>      owning/polling task, but that's not guaranteed.
>>>>>>
>>>>>> 2) The patch doesn't apply to the current tree, must be an older
>>>>>>      version?
>>>>>>
>>>>>> 3) When hand-applied, it still throws a compile warning about an unused
>>>>>>      variable. Please don't send untested stuff...
>>>>>>
>>>>>> 4) Don't just blatantly bloat the io_kiocb. When you change from a
>>>>>>      singly to a doubly linked list, you're growing the io_kiocb size. You
>>>>>>      should be able to use a union with struct io_task_work for example.
>>>>>>      That's already 16b in size - win/win as you don't need to slow down
>>>>>>      the cache management as that can keep using the linkage it currently
>>>>>>      is using, and you're not bloating the io_kiocb.
>>>>>>
>>>>>> 5) The already mentioned point about the cache free list now being
>>>>>>      doubly linked. This is generally a _bad_ idea as removing and adding
>>>>>>      entries now need to touch other entries too. That's not very cache
>>>>>>      friendly.
>>>>>>
>>>>>> #1 is kind of the big one, as it means you'll need to re-think how you
>>>>>> do this. I do agree that the current approach isn't necessarily ideal as
>>>>>> we don't process completions as quickly as we could, so I think there's
>>>>>> merit in continuing this work.
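
As an aside on the layout suggestion in point 4 above, here is a minimal
sketch of the idea using stand-in struct definitions rather than the real
io_uring headers; the names (llnode, task_work_stub, dlist_node, kiocb_stub)
are invented, and only the 16-byte size relationship mirrors the note above.

/*
 * Stand-in definitions, not the real io_uring structures.  The point is
 * only that a doubly linked node can share storage with an existing
 * 16-byte member through a union, so the containing struct does not grow.
 */
#include <assert.h>

struct llnode {			/* stands in for a singly linked tw node */
	struct llnode *next;
};

struct task_work_stub {		/* stands in for struct io_task_work */
	struct llnode node;
	void (*func)(void *);
};

struct dlist_node {		/* stands in for struct list_head */
	struct dlist_node *next, *prev;
};

struct kiocb_stub {
	/* ... other fields ... */
	union {
		/* used while the request sits on the iopoll list */
		struct dlist_node iopoll_node;
		/* used once the request is queued as task_work */
		struct task_work_stub tw;
	};
	/* ... other fields ... */
};

int main(void)
{
	/* Both union members are the same size, so the union is free. */
	static_assert(sizeof(struct dlist_node) == sizeof(struct task_work_stub),
		      "members must be the same size for the union to cost nothing");
	return 0;
}

The union only works if a request is never on the iopoll list and queued as
task_work at the same time, which is the implicit assumption behind the
suggestion above.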
>>>>> Proof of concept below, entirely untested, at a conference. Basically
>>>>> does what I describe above, and retains the list manipulation logic
>>>>> on the iopoll side, rather than on the completion side. Can you take
>>>>> a look at this? And if it runs, can you do some numbers on that too?
>>>> This patch works, and in my test case, the performance is identical to
>>>> my patch.
>>> Good!
>>>
>>>> But there is a small problem: finding completed requests now always
>>>> requires traversing the whole iopoll_list. That can be a bit
>>>> inefficient in some cases. For example, if the earlier submissions
>>>> are 128K ios and the last io is 4K, the last io will complete much
>>>> earlier. This kind of scenario can't be exercised by the current
>>>> test, and I'm not sure whether it's a meaningful one.
>>> Not sure that's a big problem, you're just spinning to complete anyway.
>>> You could add your iob->nr_reqs or something, and break after finding
>>> those known to have completed. That won't necessarily change anything, could
>>> still be the last one that completed. Would probably make more sense to
>>> pass in 'min_events' or similar and stop after that. But I think mostly
>>> tweaks that can be made after the fact. If you want to send out a new
>>> patch based on the one I sent, feel free to.
>> Eg, something like this on top would do that. Like I mentioned earlier,
>> you cannot do the list manipulation where you did it, it's not safe. You
>> have to defer it to reaping time. If we could do it from the callback
>> where we mark it complete, then that would surely make things more
>> trivial and avoid iteration when not needed.
> 
> Yes, it's not safe to do the list manipulation in io_complete_rw_iopoll.
> It looks more reasonable with the modifications you describe.
> Your changes look good enough, but please give me some more time; I'd
> like to do more testing and rethink this.
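
For reference, here is a rough sketch of the reap-time approach discussed in
this exchange, combined with the "stop after min_events" idea from earlier in
the thread. It is not the actual patch: the list layout, the done flag and
the helper names (pollset, poll_list_del, comp_list_add, reap) are
hypothetical, and real locking and memory-ordering concerns are deliberately
ignored.

/*
 * Sketch only: the completion callback merely marks a request done; all
 * list manipulation happens in the reap pass, run by the polling task.
 */
#include <stdbool.h>
#include <stddef.h>

struct req {
	struct req *prev, *next;	/* doubly linked poll list node */
	bool done;			/* set by the completion callback */
};

struct pollset {
	struct req *poll_head;		/* requests still being polled */
	struct req *comp_head;		/* completed, waiting for CQE posting */
};

/* Unlink a request from the doubly linked poll list. */
static void poll_list_del(struct pollset *ps, struct req *req)
{
	if (req->prev)
		req->prev->next = req->next;
	else
		ps->poll_head = req->next;
	if (req->next)
		req->next->prev = req->prev;
}

/* Push onto the completed list; ordering no longer matters here. */
static void comp_list_add(struct pollset *ps, struct req *req)
{
	req->prev = NULL;
	req->next = ps->comp_head;
	if (ps->comp_head)
		ps->comp_head->prev = req;
	ps->comp_head = req;
}

/*
 * Reap pass: move every request marked done over to the completed list,
 * stopping early once at least min_events have been found.
 */
static int reap(struct pollset *ps, int min_events)
{
	struct req *req = ps->poll_head;
	int found = 0;

	while (req) {
		struct req *next = req->next;

		if (req->done) {
			poll_list_del(ps, req);
			comp_list_add(ps, req);
			if (++found >= min_events)
				break;
		}
		req = next;
	}
	return found;
}

int main(void)
{
	struct req r[3] = { { 0 }, { 0 }, { 0 } };
	struct pollset ps = { .poll_head = &r[0] };

	/* Link r0 <-> r1 <-> r2 in submission order. */
	r[0].next = &r[1]; r[1].prev = &r[0];
	r[1].next = &r[2]; r[2].prev = &r[1];

	/* Pretend the device completed the middle request first. */
	r[1].done = true;

	/* Only r1 moves over; r0 and r2 stay on the poll list. */
	return reap(&ps, 1) == 1 ? 0 : 1;
}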

Of course, there's no rush - we're in the merge window anyway, so this
will be targeted for 6.20/7.0 either way.

-- 
Jens Axboe

Thread overview: 23+ messages
2025-12-10  8:54 [RFC PATCH 0/2] io_uring: fix io may accumulation in poll mode Fengnan Chang
2025-12-10  8:55 ` [RFC PATCH 1/2] blk-mq: delete task running check in blk_hctx_poll Fengnan Chang
2025-12-10  9:19   ` Jens Axboe
2025-12-10  9:53   ` Jens Axboe
2025-12-10  8:55 ` [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode Fengnan Chang
2025-12-11  2:15   ` Jens Axboe
2025-12-11  4:10     ` Jens Axboe
2025-12-11  7:38       ` Fengnan
2025-12-11 10:22         ` Jens Axboe
2025-12-11 10:33           ` Jens Axboe
2025-12-11 11:13             ` Fengnan Chang
2025-12-11 11:19               ` Jens Axboe [this message]
2025-12-12  1:41             ` Fengnan Chang
2025-12-12  1:53               ` Jens Axboe
2025-12-12  2:12                 ` Fengnan Chang
2025-12-12  5:11                   ` Jens Axboe
2025-12-12  8:58                     ` Jens Axboe
2025-12-12  9:49                       ` Fengnan Chang
2025-12-12 20:22                         ` Jens Axboe
2025-12-12 13:32                     ` Diangang Li
2025-12-12 20:09                       ` Jens Axboe
2025-12-15  6:25                         ` Diangang Li
2025-12-10  9:53 ` (subset) [RFC PATCH 0/2] " Jens Axboe
