From: Jens Axboe <axboe@kernel.dk>
To: Fengnan Chang <fengnanchang@gmail.com>,
asml.silence@gmail.com, io-uring@vger.kernel.org
Cc: Fengnan Chang <changfengnan@bytedance.com>,
Diangang Li <lidiangang@bytedance.com>
Subject: Re: [RFC PATCH 2/2] io_uring: fix io may accumulation in poll mode
Date: Fri, 12 Dec 2025 01:58:57 -0700
Message-ID: <38972bbc-fb6f-42c5-bd17-b19db134dfad@kernel.dk>
In-Reply-To: <f763dcd7-dcb3-4cc5-a567-f922cda91ca2@kernel.dk>
On 12/11/25 10:11 PM, Jens Axboe wrote:
> On 12/11/25 7:12 PM, Fengnan Chang wrote:
>>
>>
>> On 2025/12/12 09:53, Jens Axboe wrote:
>>> On 12/11/25 6:41 PM, Fengnan Chang wrote:
>>>> Oh, we can't add the nr_events == iob.nr_reqs check: if
>>>> blk_mq_add_to_batch() fails, the completed IO is not added to the
>>>> iob and iob.nr_reqs stays 0, which may cause an IO hang.
>>> Indeed, won't work as-is.
>>>
>>> I do think we're probably making a bigger deal out of the full loop
>>> than necessary. At least I'd be perfectly happy with just the current
>>> patch; performance should be better there than what we currently
>>> have. Ideally we'd have just one loop for polling and catching the
>>> completed items, but that's a bit tricky with the batch completions.
>>
>> Yes, ideally one loop would be enough, but given that there are also
>> multi_queue ctxs, that doesn't seem to be possible.
>
> It's not removing the double loop, but the below could help us iterate
> _only_ completed requests at the end. Rather than moving items around
> on the current list in the completion callback, have a separate list
> just for completed requests. Then we can simply iterate that, knowing
> all of them have completed. This gets rid of ->iopoll_completed as
> well, and then we can move the poll_refs. That part isn't really
> related at all; obviously this patch should be split into multiple
> pieces.
>
> This uses a lockless list, but since the producer and consumer are
> generally the same task, that should not add any real overhead. It
> applies on top of the previous one I sent. What do you think?
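For reference, the rough shape of that idea is below. This is a minimal
sketch with made-up names (poll_done, done_node, io_iopoll_done(),
io_iopoll_reap()), not the actual patch:

/* made-up ctx field: lockless list of completed polled requests */
struct llist_head poll_done;

/* completion side: called once a polled request is fully done */
static void io_iopoll_done(struct io_kiocb *req)
{
        /* lockless push, safe from completion/irq context */
        llist_add(&req->done_node, &req->ctx->poll_done);
}

/* reap side: every entry on this list is known to be complete */
static int io_iopoll_reap(struct io_ring_ctx *ctx)
{
        struct llist_node *node = llist_del_all(&ctx->poll_done);
        int nr = 0;

        while (node) {
                struct io_kiocb *req = container_of(node,
                                        struct io_kiocb, done_node);

                node = node->next;
                /* ... post the CQE for req ... */
                nr++;
        }
        return nr;
}

llist_add() is a single cmpxchg, and with the producer and consumer
generally being the same task, it should rarely be contended.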
Ran some quick testing, as one interesting case is mixing slower and
faster devices. Let's take this basic example:
t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R1 -X1 -n1 -P1 -t1 -n2 /dev/nvme32n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
where nvme32n1 is about 1.27M IOPS, and the other 3 do about 3.3M IOPS,
and we poll 2 devices with each IO thread. With the current kernel, we
get:
IOPS=5.18M, BW=2.53GiB/s, IOS/call=31/32
IOPS=5.19M, BW=2.53GiB/s, IOS/call=31/31
IOPS=5.17M, BW=2.53GiB/s, IOS/call=31/31
and with the two patches we get:
IOPS=6.54M, BW=3.19GiB/s, IOS/call=31/31
IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31
IOPS=6.52M, BW=3.18GiB/s, IOS/call=31/31
or about a 25% improvement. This is mostly due to the issue you
highlighted, where later completions (that are already done) end up
stuck waiting behind a slower completion.
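For context, the reason those get held up is the shape of the reap
loop: it walks the iopoll list in submission order and stops at the
first request that hasn't completed yet. Roughly (a simplified sketch,
not the exact io_do_iopoll() code):

list_for_each_entry(req, &ctx->iopoll_list, list) {
        /* paired with the store in the completion handler */
        if (!smp_load_acquire(&req->iopoll_completed))
                break;  /* completed reqs behind this one wait */
        nr_events++;
        /* ... post the CQE for req ... */
}

so one slow request at the head blocks the reaping of everything queued
behind it.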
Note: obviously one thread driving multiple devices for polling could
still be improved, and in fact it does improve if we simply change -c32
to something lower. The more important case is the one you identified,
where different completion times on the same device will hold
completions up; multi-device polling is just an interesting way to
emulate that, to an extent.
This effect is (naturally) also apparent in the completion latencies,
particularly in the higher percentiles.
Ran peak testing too, and it's better all around than before.
--
Jens Axboe