public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
* io_uring_prep_timeout() leading to an IO pressure close to 100
@ 2026-04-01 14:59 Fiona Ebner
  2026-04-01 15:03 ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Fiona Ebner @ 2026-04-01 14:59 UTC (permalink / raw)
  To: linux-kernel; +Cc: hannes, surenb, peterz, io-uring, Jens Axboe

Dear maintainers,

I'm currently investigating an issue with QEMU causing an IO pressure
value of nearly 100 when io_uring is used for the event loop of a QEMU
iothread (which is the case since QEMU 10.2 if io_uring is enabled
during configuration and available).

The cause seems to be the io_uring_prep_timeout() call that is used for
blocking wait. I attached a minimal reproducer below, which exposes the
issue [0].

This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
haven't investigated what happens inside the kernel yet, so I don't know
if it is an accounting issue or within io_uring.

Let me know if you need more information or if I should test something
specific.

Best Regards,
Fiona

[0]:

#include <errno.h>
#include <stdio.h>
#include <liburing.h>

int main(void) {
    int ret;
    struct io_uring ring;
    struct __kernel_timespec ts;
    struct io_uring_sqe *sqe;

    ret = io_uring_queue_init(128, &ring, 0);
    if (ret != 0) {
        printf("Failed to initialize io_uring\n");
        return ret;
    }

    ts = (struct __kernel_timespec){
        .tv_sec = 60,
    };

    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        printf("Full sq\n");
        return -1;
    }

    io_uring_prep_timeout(sqe, &ts, 1, 0);
    io_uring_sqe_set_data(sqe, NULL);

    do {
        ret = io_uring_submit_and_wait(&ring, 1);
        printf("got ret %d\n", ret);
    } while (ret == -EINTR);

    io_uring_queue_exit(&ring);

    return 0;
}



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: io_uring_prep_timeout() leading to an IO pressure close to 100
  2026-04-01 14:59 io_uring_prep_timeout() leading to an IO pressure close to 100 Fiona Ebner
@ 2026-04-01 15:03 ` Jens Axboe
  2026-04-02  9:12   ` Fiona Ebner
  0 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2026-04-01 15:03 UTC (permalink / raw)
  To: Fiona Ebner, linux-kernel; +Cc: hannes, surenb, peterz, io-uring

On 4/1/26 8:59 AM, Fiona Ebner wrote:
> Dear maintainers,
> 
> I'm currently investigating an issue with QEMU causing an IO pressure
> value of nearly 100 when io_uring is used for the event loop of a QEMU
> iothread (which is the case since QEMU 10.2 if io_uring is enabled
> during configuration and available).

It's not "IO pressure", it's the useless iowait metric...

> The cause seems to be the io_uring_prep_timeout() call that is used for
> blocking wait. I attached a minimal reproducer below, which exposes the
> issue [0].
> 
> This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
> haven't investigated what happens inside the kernel yet, so I don't know
> if it is an accounting issue or within io_uring.
> 
> Let me know if you need more information or if I should test something
> specific.

If you don't want it, just turn it off with io_uring_set_iowait().

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: io_uring_prep_timeout() leading to an IO pressure close to 100
  2026-04-01 15:03 ` Jens Axboe
@ 2026-04-02  9:12   ` Fiona Ebner
  2026-04-02 12:31     ` Fiona Ebner
  0 siblings, 1 reply; 6+ messages in thread
From: Fiona Ebner @ 2026-04-02  9:12 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: hannes, surenb, peterz, io-uring, Thomas Lamprecht

On 01.04.26 at 5:02 PM, Jens Axboe wrote:
> On 4/1/26 8:59 AM, Fiona Ebner wrote:
>> I'm currently investigating an issue with QEMU causing an IO pressure
>> value of nearly 100 when io_uring is used for the event loop of a QEMU
>> iothread (which is the case since QEMU 10.2 if io_uring is enabled
>> during configuration and available).
> 
> It's not "IO pressure", it's the useless iowait metric...

But it is reported as IO pressure by the kernel, i.e. /proc/pressure/io
(and for a cgroup, /sys/fs/cgroup/foo.slice/bar.scope/io.pressure).

>> The cause seems to be the io_uring_prep_timeout() call that is used for
>> blocking wait. I attached a minimal reproducer below, which exposes the
>> issue [0].
>>
>> This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
>> haven't investigated what happens inside the kernel yet, so I don't know
>> if it is an accounting issue or within io_uring.
>>
>> Let me know if you need more information or if I should test something
>> specific.
> 
> If you don't want it, just turn it off with io_uring_set_iowait().

QEMU does submit actual IO requests on the same ring, and I suppose
iowait should still be used for those?

Maybe setting the IORING_ENTER_NO_IOWAIT flag when only the timeout
request is being submitted, and no actual IO requests, is an option? But
even then, if a request is submitted later via another thread, iowait
for that new request won't be accounted for, right?

Is there a way to say "I don't want IO wait for timeout submissions"?
Wouldn't that even make sense by default?

Best Regards,
Fiona


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: io_uring_prep_timeout() leading to an IO pressure close to 100
  2026-04-02  9:12   ` Fiona Ebner
@ 2026-04-02 12:31     ` Fiona Ebner
  2026-04-24 15:42       ` Fiona Ebner
  0 siblings, 1 reply; 6+ messages in thread
From: Fiona Ebner @ 2026-04-02 12:31 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: hannes, surenb, peterz, io-uring, Thomas Lamprecht

On 02.04.26 at 11:12 AM, Fiona Ebner wrote:
> On 01.04.26 at 5:02 PM, Jens Axboe wrote:
>> On 4/1/26 8:59 AM, Fiona Ebner wrote:
>>> I'm currently investigating an issue with QEMU causing an IO pressure
>>> value of nearly 100 when io_uring is used for the event loop of a QEMU
>>> iothread (which is the case since QEMU 10.2 if io_uring is enabled
>>> during configuration and available).
>>
>> It's not "IO pressure", it's the useless iowait metric...
> 
> But it is reported as IO pressure by the kernel, i.e. /proc/pressure/io
> (and for a cgroup, /sys/fs/cgroup/foo.slice/bar.scope/io.pressure).
> 
>>> The cause seems to be the io_uring_prep_timeout() call that is used for
>>> blocking wait. I attached a minimal reproducer below, which exposes the
>>> issue [0].
>>>
>>> This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
>>> haven't investigated what happens inside the kernel yet, so I don't know
>>> if it is an accounting issue or within io_uring.
>>>
>>> Let me know if you need more information or if I should test something
>>> specific.
>>
>> If you don't want it, just turn it off with io_uring_set_iowait().
> 
> QEMU does submit actual IO requests on the same ring, and I suppose
> iowait should still be used for those?
> 
> Maybe setting the IORING_ENTER_NO_IOWAIT flag if only the timeout
> request is being submitted and no actual IO requests is an option? But
> even then, if a request is submitted later via another thread, iowait
> for that new request won't be accounted for, right?
> 
> Is there a way to say "I don't want IO wait for timeout submissions"?
> Wouldn't that even make sense by default?

Turns out that, in my QEMU instances, the branch doing the
io_uring_prep_timeout() call is not actually taken, so while the issue
could arise that way too, it is different in this practical case.

What I'm actually seeing is io_uring_submit_and_wait() being called with
wait_nr=1 while there is nothing else going on. So a more accurate
reproducer for the scenario is attached below [0]. Note that it does not
happen without submitting+completing a single request first.

Best Regards,
Fiona

[0]:

#include <errno.h>
#include <stdio.h>
#include <liburing.h>

int main(void) {
    int ret;
    struct io_uring ring;
    struct io_uring_sqe *sqe;

    ret = io_uring_queue_init(128, &ring, 0);
    if (ret != 0) {
        printf("Failed to initialize io_uring\n");
        return ret;
    }

    // a wait here, before any submit+advance, does not trigger the issue
    // ret = io_uring_submit_and_wait(&ring, 1);
    // printf("got ret %d\n", ret);

    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        printf("Full sq\n");
        return -1;
    }

    io_uring_prep_nop(sqe);

    do {
        ret = io_uring_submit_and_wait(&ring, 1);
    } while (ret == -EINTR);

    if (ret != 1) {
        printf("Expected to submit one\n");
        return -1;
    }

    // using peek+seen has the same effect
    // struct io_uring_cqe* cqe;
    // io_uring_peek_cqe(&ring, &cqe);
    // io_uring_cqe_seen(&ring, cqe);
    io_uring_cq_advance(&ring, 1);

    ret = io_uring_submit_and_wait(&ring, 1);
    printf("got ret %d\n", ret);

    io_uring_queue_exit(&ring);

    return 0;
}



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: io_uring_prep_timeout() leading to an IO pressure close to 100
  2026-04-02 12:31     ` Fiona Ebner
@ 2026-04-24 15:42       ` Fiona Ebner
  2026-04-26 21:13         ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Fiona Ebner @ 2026-04-24 15:42 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel
  Cc: hannes, surenb, peterz, io-uring, Thomas Lamprecht

Hi Jens,

On 02.04.26 at 2:30 PM, Fiona Ebner wrote:
> On 02.04.26 at 11:12 AM, Fiona Ebner wrote:
>> On 01.04.26 at 5:02 PM, Jens Axboe wrote:
>>> On 4/1/26 8:59 AM, Fiona Ebner wrote:
>>>> I'm currently investigating an issue with QEMU causing an IO pressure
>>>> value of nearly 100 when io_uring is used for the event loop of a QEMU
>>>> iothread (which is the case since QEMU 10.2 if io_uring is enabled
>>>> during configuration and available).
>>>
>>> It's not "IO pressure", it's the useless iowait metric...
>>
>> But it is reported as IO pressure by the kernel, i.e. /proc/pressure/io
>> (and for a cgroup, /sys/fs/cgroup/foo.slice/bar.scope/io.pressure).
>>
>>>> The cause seems to be the io_uring_prep_timeout() call that is used for
>>>> blocking wait. I attached a minimal reproducer below, which exposes the
>>>> issue [0].
>>>>
>>>> This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
>>>> haven't investigated what happens inside the kernel yet, so I don't know
>>>> if it is an accounting issue or within io_uring.
>>>>
>>>> Let me know if you need more information or if I should test something
>>>> specific.
>>>
>>> If you don't want it, just turn it off with io_uring_set_iowait().
>>
>> QEMU does submit actual IO requests on the same ring, and I suppose
>> iowait should still be used for those?
>>
>> Maybe setting the IORING_ENTER_NO_IOWAIT flag if only the timeout
>> request is being submitted and no actual IO requests is an option? But
>> even then, if a request is submitted later via another thread, iowait
>> for that new request won't be accounted for, right?
>>
>> Is there a way to say "I don't want IO wait for timeout submissions"?
>> Wouldn't that even make sense by default?
> 
> Turns out that, in my QEMU instances, the branch doing the
> io_uring_prep_timeout() call is not actually taken, so while the issue
> could arise that way too, it is different in this practical case.
> 
> What I'm actually seeing is io_uring_submit_and_wait() being called with
> wait_nr=1 while there is nothing else going on. So a more accurate
> reproducer for the scenario is attached below [0]. Note that it does not
> happen without submitting+completing a single request first.

I started digging into the kernel now and am wondering whether the
number of inflight requests is correctly tracked. Does
current_pending_io() need to consider tctx->cached_refs?

In __io_cqring_wait_schedule(), there is

> 	if (ext_arg->iowait && current_pending_io())
> 		current->in_iowait = 1;

and current_pending_io() is

> static bool current_pending_io(void)
> {
> 	struct io_uring_task *tctx = current->io_uring;
> 
> 	if (!tctx)
> 		return false;
> 	return percpu_counter_read_positive(&tctx->inflight);
> }

so okay, we get iowait when tctx->inflight is positive. Looking at where
that variable is modified, I found

> void io_task_refs_refill(struct io_uring_task *tctx)
> {
> 	unsigned int refill = -tctx->cached_refs + IO_TCTX_REFS_CACHE_NR;
> 
> 	percpu_counter_add(&tctx->inflight, refill);
> 	refcount_add(refill, &current->usage);
> 	tctx->cached_refs += refill;
> }

as well as io_put_task() and io_uring_drop_tctx_refs().

I made __io_cqring_wait_schedule() and io_put_task() non-static and
non-inline to be able to trace them, wrote the following bpftrace script
[1], and ran the reproducer [0], getting the following output:

> Attaching 6 probes...
> 12104: io_task_refs_refill: cached: -1 inflight: 0
> 12104: ret io_task_refs_refill: cached: 1024 inflight: 1025
> 12104: io_put_task: cached: 1024 inflight: 1025
> 12104: ret io_put_task: cached: 1025 inflight: 1025
> 12104: __io_cqring_wait_schedule: iowait: 1
> 12104: __io_cqring_wait_schedule: inflight: 1025

And then it's stuck, as expected, but AFAICS, with current->in_iowait
set, which seems surprising to me.

Best Regards,
Fiona

[1]:

> kfunc::io_task_refs_refill
> {
>     printf("%d: %s: cached: %d inflight: %d\n",
>         tid,
>         func,
>         ((struct io_uring_task*)args.tctx)->cached_refs,
>         ((struct io_uring_task*)args.tctx)->inflight.count
>     );
> }
> 
> kretfunc::io_task_refs_refill
> {
>     printf("%d: ret %s: cached: %d inflight: %d\n",
>         tid,
>         func,
>         ((struct io_uring_task*)args.tctx)->cached_refs,
>         ((struct io_uring_task*)args.tctx)->inflight.count
>     );
> }
> 
> kfunc:io_uring_drop_tctx_refs
> {
>     printf("%d: %s\n", tid, func);
> }
> 
> kfunc:__io_cqring_wait_schedule
> {
>     printf("%d: %s: iowait: %d\n",
>         tid,
>         func,
>         ((struct ext_arg*)args.ext_arg)->iowait
>     );
>     if (curtask->io_uring) {
>         printf("%d: %s: inflight: %d\n",
>             tid,
>             func,
>             curtask->io_uring->inflight.count
>         );
>     } else {
>         printf("%d: %s: got no tctx!\n", tid, func);
>     }
> }
> 
> kfunc:io_put_task
> {
>     printf("%d: %s: cached: %d inflight: %d\n",
>         tid,
>         func,
>         ((struct io_kiocb*)args.req)->tctx->cached_refs,
>         ((struct io_kiocb*)args.req)->tctx->inflight.count
>     );
> }
> 
> kretfunc:io_put_task
> {
>     printf("%d: ret %s: cached: %d inflight: %d\n",
>         tid,
>         func,
>         ((struct io_kiocb*)args.req)->tctx->cached_refs,
>         ((struct io_kiocb*)args.req)->tctx->inflight.count
>     );
> }



> 
> [0]:
> 
> #include <errno.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <liburing.h>
> 
> int main(void) {
>     int fd;
>     int ret;
>     struct io_uring ring;
>     struct io_uring_sqe *sqe;
> 
>     ret = io_uring_queue_init(128, &ring, 0);
>     if (ret != 0) {
>         printf("Failed to initialize io_uring\n");
>         return ret;
>     }
> 
>     // before submitting+advancing the issue does not happen
>     // ret = io_uring_submit_and_wait(&ring, 1);
>     // printf("got ret %d\n", ret);
> 
>     sqe = io_uring_get_sqe(&ring);
>     if (!sqe) {
>         printf("Full sq\n");
>         return -1;
>     }
> 
>     io_uring_prep_nop(sqe);
> 
>     do {
>         ret = io_uring_submit_and_wait(&ring, 1);
>     } while (ret == -EINTR);
> 
>     if (ret != 1) {
>         printf("Expected to submit one\n");
>         return -1;
>     }
> 
>     // using peek+seen has the same effect
>     // struct io_uring_cqe* cqe;
>     // io_uring_peek_cqe(&ring, &cqe);
>     // io_uring_cqe_seen(&ring, cqe);
>     io_uring_cq_advance(&ring, 1);
> 
>     ret = io_uring_submit_and_wait(&ring, 1);
>     printf("got ret %d\n", ret);
> 
>     io_uring_queue_exit(&ring);
> 
>     return 0;
> }
> 



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: io_uring_prep_timeout() leading to an IO pressure close to 100
  2026-04-24 15:42       ` Fiona Ebner
@ 2026-04-26 21:13         ` Jens Axboe
  0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2026-04-26 21:13 UTC (permalink / raw)
  To: Fiona Ebner, linux-kernel
  Cc: hannes, surenb, peterz, io-uring, Thomas Lamprecht

On 4/24/26 9:42 AM, Fiona Ebner wrote:
> Hi Jens,
> 
> On 02.04.26 at 2:30 PM, Fiona Ebner wrote:
>> On 02.04.26 at 11:12 AM, Fiona Ebner wrote:
>>> On 01.04.26 at 5:02 PM, Jens Axboe wrote:
>>>> On 4/1/26 8:59 AM, Fiona Ebner wrote:
>>>>> I'm currently investigating an issue with QEMU causing an IO pressure
>>>>> value of nearly 100 when io_uring is used for the event loop of a QEMU
>>>>> iothread (which is the case since QEMU 10.2 if io_uring is enabled
>>>>> during configuration and available).
>>>>
>>>> It's not "IO pressure", it's the useless iowait metric...
>>>
>>> But it is reported as IO pressure by the kernel, i.e. /proc/pressure/io
>>> (and for a cgroup, /sys/fs/cgroup/foo.slice/bar.scope/io.pressure).
>>>
>>>>> The cause seems to be the io_uring_prep_timeout() call that is used for
>>>>> blocking wait. I attached a minimal reproducer below, which exposes the
>>>>> issue [0].
>>>>>
>>>>> This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
>>>>> haven't investigated what happens inside the kernel yet, so I don't know
>>>>> if it is an accounting issue or within io_uring.
>>>>>
>>>>> Let me know if you need more information or if I should test something
>>>>> specific.
>>>>
>>>> If you don't want it, just turn it off with io_uring_set_iowait().
>>>
>>> QEMU does submit actual IO requests on the same ring, and I suppose
>>> iowait should still be used for those?
>>>
>>> Maybe setting the IORING_ENTER_NO_IOWAIT flag if only the timeout
>>> request is being submitted and no actual IO requests is an option? But
>>> even then, if a request is submitted later via another thread, iowait
>>> for that new request won't be accounted for, right?
>>>
>>> Is there a way to say "I don't want IO wait for timeout submissions"?
>>> Wouldn't that even make sense by default?
>>
>> Turns out that, in my QEMU instances, the branch doing the
>> io_uring_prep_timeout() call is not actually taken, so while the issue
>> could arise that way too, it is different in this practical case.
>>
>> What I'm actually seeing is io_uring_submit_and_wait() being called with
>> wait_nr=1 while there is nothing else going on. So a more accurate
>> reproducer for the scenario is attached below [0]. Note that it does not
>> happen without submitting+completing a single request first.
> 
> I started digging into the kernel now and am wondering whether the
> number of inflight requests is correctly tracked. Does
> current_pending_io() need to consider tctx->cached_refs?
> 
> In __io_cqring_wait_schedule(), there is
> 
>> 	if (ext_arg->iowait && current_pending_io())
>> 		current->in_iowait = 1;
> 
> and current_pending_io() is
> 
>> static bool current_pending_io(void)
>> {
>> 	struct io_uring_task *tctx = current->io_uring;
>>
>> 	if (!tctx)
>> 		return false;
>> 	return percpu_counter_read_positive(&tctx->inflight);
>> }
> 
> so okay, we get iowait when tctx->inflight is positive. Looking at where
> that variable is modified, I found
> 
>> void io_task_refs_refill(struct io_uring_task *tctx)
>> {
>> 	unsigned int refill = -tctx->cached_refs + IO_TCTX_REFS_CACHE_NR;
>>
>> 	percpu_counter_add(&tctx->inflight, refill);
>> 	refcount_add(refill, &current->usage);
>> 	tctx->cached_refs += refill;
>> }

> as well as io_put_task() and io_uring_drop_tctx_refs().

Indeed! Care to send a patch for this? That's definitely a bug. The
existing test case didn't hit this as it only tests with an actual
request pending, and never after refs have been cached.

Thanks for looking into this.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-04-26 21:13 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-01 14:59 io_uring_prep_timeout() leading to an IO pressure close to 100 Fiona Ebner
2026-04-01 15:03 ` Jens Axboe
2026-04-02  9:12   ` Fiona Ebner
2026-04-02 12:31     ` Fiona Ebner
2026-04-24 15:42       ` Fiona Ebner
2026-04-26 21:13         ` Jens Axboe
