public inbox for io-uring@vger.kernel.org
From: Jens Axboe <axboe@kernel.dk>
To: Fiona Ebner <f.ebner@proxmox.com>, linux-kernel@vger.kernel.org
Cc: hannes@cmpxchg.org, surenb@google.com, peterz@infradead.org,
	io-uring@vger.kernel.org,
	Thomas Lamprecht <t.lamprecht@proxmox.com>
Subject: Re: io_uring_prep_timeout() leading to an IO pressure close to 100
Date: Sun, 26 Apr 2026 15:13:08 -0600	[thread overview]
Message-ID: <a47f672c-d204-433f-9815-9e6606fdec1f@kernel.dk> (raw)
In-Reply-To: <db7e6abb-677b-4b63-a028-d8fe0bec0277@proxmox.com>

On 4/24/26 9:42 AM, Fiona Ebner wrote:
> Hi Jens,
> 
> On 02.04.26 at 2:30 PM, Fiona Ebner wrote:
>> On 02.04.26 at 11:12 AM, Fiona Ebner wrote:
>>> On 01.04.26 at 5:02 PM, Jens Axboe wrote:
>>>> On 4/1/26 8:59 AM, Fiona Ebner wrote:
>>>>> I'm currently investigating an issue with QEMU causing an IO pressure
>>>>> value of nearly 100 when io_uring is used for the event loop of a QEMU
>>>>> iothread (which is the case since QEMU 10.2 if io_uring is enabled
>>>>> during configuration and available).
>>>>
>>>> It's not "IO pressure", it's the useless iowait metric...
>>>
>>> But it is reported as IO pressure by the kernel, i.e. /proc/pressure/io
>>> (and for a cgroup, /sys/fs/cgroup/foo.slice/bar.scope/io.pressure).
>>>
>>>>> The cause seems to be the io_uring_prep_timeout() call that is used for
>>>>> blocking wait. I attached a minimal reproducer below, which exposes the
>>>>> issue [0].
>>>>>
>>>>> This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
>>>>> haven't investigated what happens inside the kernel yet, so I don't know
>>>>> if it is an accounting issue or within io_uring.
>>>>>
>>>>> Let me know if you need more information or if I should test something
>>>>> specific.
>>>>
>>>> If you don't want it, just turn it off with io_uring_set_iowait().
>>>
>>> QEMU does submit actual IO requests on the same ring, and I suppose
>>> iowait should still be used for those?
>>>
>>> Maybe setting the IORING_ENTER_NO_IOWAIT flag when only the timeout
>>> request is submitted, and no actual IO requests are, would be an
>>> option? But even then, if a request is submitted later via another
>>> thread, iowait for that new request won't be accounted for, right?
>>>
>>> Is there a way to say "I don't want IO wait for timeout submissions"?
>>> Wouldn't that even make sense by default?
>>
>> Turns out that in my QEMU instances, the branch doing the
>> io_uring_prep_timeout() call is not actually taken, so while the issue
>> could arise like that too, it's different in this practical case.
>>
>> What I'm actually seeing is io_uring_submit_and_wait() being called with
>> wait_nr=1 while there is nothing else going on. So a more accurate
>> reproducer for the scenario is attached below [0]. Note that it does not
>> happen without submitting and completing a single request first.
> 
> I started digging into the kernel now and am wondering whether the
> number of inflight requests is tracked correctly. Does
> current_pending_io() need to consider tctx->cached_refs?
> 
> In __io_cqring_wait_schedule(), there is
> 
>> 	if (ext_arg->iowait && current_pending_io())
>> 		current->in_iowait = 1;
> 
> and current_pending_io() is
> 
>> static bool current_pending_io(void)
>> {
>> 	struct io_uring_task *tctx = current->io_uring;
>>
>> 	if (!tctx)
>> 		return false;
>> 	return percpu_counter_read_positive(&tctx->inflight);
>> }
> 
> so okay, we get iowait when tctx->inflight is positive. Looking at where
> that variable is modified, I found
> 
>> void io_task_refs_refill(struct io_uring_task *tctx)
>> {
>> 	unsigned int refill = -tctx->cached_refs + IO_TCTX_REFS_CACHE_NR;
>>
>> 	percpu_counter_add(&tctx->inflight, refill);
>> 	refcount_add(refill, &current->usage);
>> 	tctx->cached_refs += refill;
>> }

> as well as io_put_task() and io_uring_drop_tctx_refs().

Indeed! Care to send a patch for this? That's definitely a bug. The
existing test case didn't hit this as it only tests with an actual
request pending, and never after refs have been cached.

Thanks for looking into this.

-- 
Jens Axboe


Thread overview: 6+ messages
2026-04-01 14:59 io_uring_prep_timeout() leading to an IO pressure close to 100 Fiona Ebner
2026-04-01 15:03 ` Jens Axboe
2026-04-02  9:12   ` Fiona Ebner
2026-04-02 12:31     ` Fiona Ebner
2026-04-24 15:42       ` Fiona Ebner
2026-04-26 21:13         ` Jens Axboe [this message]
