* io_uring_prep_timeout() leading to an IO pressure close to 100
@ 2026-04-01 14:59 Fiona Ebner
  2026-04-01 15:03 ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread
From: Fiona Ebner @ 2026-04-01 14:59 UTC (permalink / raw)
To: linux-kernel; +Cc: hannes, surenb, peterz, io-uring, Jens Axboe
Dear maintainers,

I'm currently investigating an issue with QEMU causing an IO pressure
value of nearly 100 when io_uring is used for the event loop of a QEMU
iothread (which is the case since QEMU 10.2 if io_uring is enabled
during configuration and available).
The cause seems to be the io_uring_prep_timeout() call that is used for
the blocking wait. I've attached a minimal reproducer below which
exposes the issue [0].
This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
haven't investigated what happens inside the kernel yet, so I don't know
if it is an accounting issue or within io_uring.
Let me know if you need more information or if I should test something
specific.
Best Regards,
Fiona
[0]:
#include <errno.h>
#include <stdio.h>
#include <liburing.h>

int main(void) {
    int ret;
    struct io_uring ring;
    struct __kernel_timespec ts;
    struct io_uring_sqe *sqe;

    ret = io_uring_queue_init(128, &ring, 0);
    if (ret != 0) {
        printf("Failed to initialize io_uring\n");
        return ret;
    }

    /* 60 second timeout, mirroring the blocking wait in QEMU */
    ts = (struct __kernel_timespec){
        .tv_sec = 60,
    };

    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        printf("Full sq\n");
        return -1;
    }

    io_uring_prep_timeout(sqe, &ts, 1, 0);
    io_uring_sqe_set_data(sqe, NULL);

    /* Block until the timeout completes; IO pressure climbs meanwhile */
    do {
        ret = io_uring_submit_and_wait(&ring, 1);
        printf("got ret %d\n", ret);
    } while (ret == -EINTR);

    io_uring_queue_exit(&ring);

    return 0;
}
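
For watching the effect live, a small companion sketch that polls the
kernel's PSI interface (/proc/pressure/io, available with CONFIG_PSI);
while the reproducer above is blocked, the "some" avg10 value should
climb toward 100 on an otherwise idle machine:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Poll the PSI io file once per second and print it verbatim. */
    for (;;) {
        char buf[256];
        FILE *f = fopen("/proc/pressure/io", "r");
        if (!f) {
            perror("fopen");
            return 1;
        }
        size_t n = fread(buf, 1, sizeof(buf) - 1, f);
        buf[n] = '\0';
        fclose(f);
        fputs(buf, stdout);
        sleep(1);
    }
}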
^ permalink raw reply [flat|nested] 6+ messages in thread

* Re: io_uring_prep_timeout() leading to an IO pressure close to 100
  2026-04-01 14:59 io_uring_prep_timeout() leading to an IO pressure close to 100 Fiona Ebner
@ 2026-04-01 15:03 ` Jens Axboe
  2026-04-02  9:12   ` Fiona Ebner
  0 siblings, 1 reply; 6+ messages in thread

From: Jens Axboe @ 2026-04-01 15:03 UTC (permalink / raw)
To: Fiona Ebner, linux-kernel; +Cc: hannes, surenb, peterz, io-uring

On 4/1/26 8:59 AM, Fiona Ebner wrote:
> Dear maintainers,
>
> I'm currently investigating an issue with QEMU causing an IO pressure
> value of nearly 100 when io_uring is used for the event loop of a QEMU
> iothread (which is the case since QEMU 10.2 if io_uring is enabled
> during configuration and available).

It's not "IO pressure", it's the useless iowait metric...

> The cause seems to be the io_uring_prep_timeout() call that is used for
> the blocking wait. I've attached a minimal reproducer below which
> exposes the issue [0].
>
> This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
> haven't investigated what happens inside the kernel yet, so I don't know
> if it is an accounting issue or within io_uring.
>
> Let me know if you need more information or if I should test something
> specific.

If you don't want it, just turn it off with io_uring_set_iowait().

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 6+ messages in thread
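
A minimal sketch of that suggestion, assuming a liburing recent enough
to provide io_uring_set_iowait() and a kernel advertising
IORING_FEAT_NO_IOWAIT (on older kernels the call should fail with
-EINVAL):

#include <stdio.h>
#include <liburing.h>

int main(void) {
    struct io_uring ring;
    int ret = io_uring_queue_init(128, &ring, 0);
    if (ret != 0)
        return 1;

    /* Opt out of iowait accounting for waits on this ring;
     * -EINVAL is expected on kernels without the feature. */
    ret = io_uring_set_iowait(&ring, false);
    if (ret != 0)
        printf("io_uring_set_iowait unsupported: %d\n", ret);

    /* ... prepare, submit and wait as before; the waits no longer
     * show up as iowait / IO pressure ... */

    io_uring_queue_exit(&ring);
    return 0;
}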
* Re: io_uring_prep_timeout() leading to an IO pressure close to 100
  2026-04-01 15:03 ` Jens Axboe
@ 2026-04-02  9:12   ` Fiona Ebner
  2026-04-02 12:31     ` Fiona Ebner
  0 siblings, 1 reply; 6+ messages in thread

From: Fiona Ebner @ 2026-04-02 9:12 UTC (permalink / raw)
To: Jens Axboe, linux-kernel
Cc: hannes, surenb, peterz, io-uring, Thomas Lamprecht

On 01.04.26 at 5:02 PM, Jens Axboe wrote:
> On 4/1/26 8:59 AM, Fiona Ebner wrote:
>> I'm currently investigating an issue with QEMU causing an IO pressure
>> value of nearly 100 when io_uring is used for the event loop of a QEMU
>> iothread (which is the case since QEMU 10.2 if io_uring is enabled
>> during configuration and available).
>
> It's not "IO pressure", it's the useless iowait metric...

But it is reported as IO pressure by the kernel, i.e. /proc/pressure/io
(and for a cgroup, /sys/fs/cgroup/foo.slice/bar.scope/io.pressure).

>> The cause seems to be the io_uring_prep_timeout() call that is used for
>> the blocking wait. I've attached a minimal reproducer below which
>> exposes the issue [0].
>>
>> This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
>> haven't investigated what happens inside the kernel yet, so I don't know
>> if it is an accounting issue or within io_uring.
>>
>> Let me know if you need more information or if I should test something
>> specific.
>
> If you don't want it, just turn it off with io_uring_set_iowait().

QEMU does submit actual IO requests on the same ring, and I suppose
iowait should still be used for those?

Maybe setting the IORING_ENTER_NO_IOWAIT flag when only the timeout
request is being submitted and no actual IO requests is an option? But
even then, if a request is submitted later via another thread, iowait
for that new request won't be accounted for, right?

Is there a way to say "I don't want IO wait for timeout submissions"?
Wouldn't that even make sense by default?

Best Regards,
Fiona

^ permalink raw reply [flat|nested] 6+ messages in thread
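
A sketch of that conditional approach, as a hypothetical wrapper; it
assumes io_uring_set_iowait() merely toggles a flag consulted by
subsequent waits on the ring:

#include <stdbool.h>
#include <liburing.h>

/* Hypothetical helper: only let a wait be accounted as iowait when
 * real IO (not just a timeout SQE) is pending on this ring. */
static int submit_and_wait_tracked(struct io_uring *ring,
                                   unsigned wait_nr, bool has_real_io)
{
    /* Returns -EINVAL on kernels without IORING_FEAT_NO_IOWAIT;
     * ignored here, which keeps the old always-iowait behavior. */
    (void)io_uring_set_iowait(ring, has_real_io);
    return io_uring_submit_and_wait(ring, wait_nr);
}

As the message notes, this still races with IO submitted from another
thread while the wait is already in progress.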
* Re: io_uring_prep_timeout() leading to an IO pressure close to 100
  2026-04-02  9:12 ` Fiona Ebner
@ 2026-04-02 12:31   ` Fiona Ebner
  2026-04-24 15:42     ` Fiona Ebner
  0 siblings, 1 reply; 6+ messages in thread

From: Fiona Ebner @ 2026-04-02 12:31 UTC (permalink / raw)
To: Jens Axboe, linux-kernel
Cc: hannes, surenb, peterz, io-uring, Thomas Lamprecht

On 02.04.26 at 11:12 AM, Fiona Ebner wrote:
> On 01.04.26 at 5:02 PM, Jens Axboe wrote:
>> On 4/1/26 8:59 AM, Fiona Ebner wrote:
>>> I'm currently investigating an issue with QEMU causing an IO pressure
>>> value of nearly 100 when io_uring is used for the event loop of a QEMU
>>> iothread (which is the case since QEMU 10.2 if io_uring is enabled
>>> during configuration and available).
>>
>> It's not "IO pressure", it's the useless iowait metric...
>
> But it is reported as IO pressure by the kernel, i.e. /proc/pressure/io
> (and for a cgroup, /sys/fs/cgroup/foo.slice/bar.scope/io.pressure).
>
>>> The cause seems to be the io_uring_prep_timeout() call that is used for
>>> the blocking wait. I've attached a minimal reproducer below which
>>> exposes the issue [0].
>>>
>>> This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
>>> haven't investigated what happens inside the kernel yet, so I don't know
>>> if it is an accounting issue or within io_uring.
>>>
>>> Let me know if you need more information or if I should test something
>>> specific.
>>
>> If you don't want it, just turn it off with io_uring_set_iowait().
>
> QEMU does submit actual IO requests on the same ring, and I suppose
> iowait should still be used for those?
>
> Maybe setting the IORING_ENTER_NO_IOWAIT flag when only the timeout
> request is being submitted and no actual IO requests is an option? But
> even then, if a request is submitted later via another thread, iowait
> for that new request won't be accounted for, right?
>
> Is there a way to say "I don't want IO wait for timeout submissions"?
> Wouldn't that even make sense by default?

Turns out that in my QEMU instances, the branch doing the
io_uring_prep_timeout() call is not actually taken, so while the issue
could arise like that too, it is different in this practical case.

What I'm actually seeing is io_uring_submit_and_wait() being called with
wait_nr=1 while there is nothing else going on. So a more accurate
reproducer for the scenario is attached below [0]. Note that the issue
does not happen without submitting and completing a single request
first.

Best Regards,
Fiona

[0]:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <liburing.h>

int main(void) {
    int ret;
    struct io_uring ring;
    struct io_uring_sqe *sqe;

    ret = io_uring_queue_init(128, &ring, 0);
    if (ret != 0) {
        printf("Failed to initialize io_uring\n");
        return ret;
    }

    // before submitting+advancing, the issue does not happen
    // ret = io_uring_submit_and_wait(&ring, 1);
    // printf("got ret %d\n", ret);

    sqe = io_uring_get_sqe(&ring);
    if (!sqe) {
        printf("Full sq\n");
        return -1;
    }

    io_uring_prep_nop(sqe);

    do {
        ret = io_uring_submit_and_wait(&ring, 1);
    } while (ret == -EINTR);

    if (ret != 1) {
        printf("Expected to submit one\n");
        return -1;
    }

    // using peek+seen has the same effect
    // struct io_uring_cqe *cqe;
    // io_uring_peek_cqe(&ring, &cqe);
    // io_uring_cqe_seen(&ring, cqe);
    io_uring_cq_advance(&ring, 1);

    /* Nothing is inflight anymore, yet this wait still gets
     * accounted as iowait */
    ret = io_uring_submit_and_wait(&ring, 1);
    printf("got ret %d\n", ret);

    io_uring_queue_exit(&ring);

    return 0;
}

^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: io_uring_prep_timeout() leading to an IO pressure close to 100
  2026-04-02 12:31 ` Fiona Ebner
@ 2026-04-24 15:42   ` Fiona Ebner
  2026-04-26 21:13     ` Jens Axboe
  0 siblings, 1 reply; 6+ messages in thread

From: Fiona Ebner @ 2026-04-24 15:42 UTC (permalink / raw)
To: Jens Axboe, linux-kernel
Cc: hannes, surenb, peterz, io-uring, Thomas Lamprecht

Hi Jens,

On 02.04.26 at 2:30 PM, Fiona Ebner wrote:
> On 02.04.26 at 11:12 AM, Fiona Ebner wrote:
>> On 01.04.26 at 5:02 PM, Jens Axboe wrote:
>>> On 4/1/26 8:59 AM, Fiona Ebner wrote:
>>>> I'm currently investigating an issue with QEMU causing an IO pressure
>>>> value of nearly 100 when io_uring is used for the event loop of a QEMU
>>>> iothread (which is the case since QEMU 10.2 if io_uring is enabled
>>>> during configuration and available).
>>>
>>> It's not "IO pressure", it's the useless iowait metric...
>>
>> But it is reported as IO pressure by the kernel, i.e. /proc/pressure/io
>> (and for a cgroup, /sys/fs/cgroup/foo.slice/bar.scope/io.pressure).
>>
>>>> The cause seems to be the io_uring_prep_timeout() call that is used for
>>>> the blocking wait. I've attached a minimal reproducer below which
>>>> exposes the issue [0].
>>>>
>>>> This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
>>>> haven't investigated what happens inside the kernel yet, so I don't know
>>>> if it is an accounting issue or within io_uring.
>>>>
>>>> Let me know if you need more information or if I should test something
>>>> specific.
>>>
>>> If you don't want it, just turn it off with io_uring_set_iowait().
>>
>> QEMU does submit actual IO requests on the same ring, and I suppose
>> iowait should still be used for those?
>>
>> Maybe setting the IORING_ENTER_NO_IOWAIT flag when only the timeout
>> request is being submitted and no actual IO requests is an option? But
>> even then, if a request is submitted later via another thread, iowait
>> for that new request won't be accounted for, right?
>>
>> Is there a way to say "I don't want IO wait for timeout submissions"?
>> Wouldn't that even make sense by default?
>
> Turns out that in my QEMU instances, the branch doing the
> io_uring_prep_timeout() call is not actually taken, so while the issue
> could arise like that too, it is different in this practical case.
>
> What I'm actually seeing is io_uring_submit_and_wait() being called with
> wait_nr=1 while there is nothing else going on. So a more accurate
> reproducer for the scenario is attached below [0]. Note that the issue
> does not happen without submitting and completing a single request
> first.

I started digging in the kernel now and am wondering whether the number
of inflight requests is correctly tracked. Does current_pending_io()
need to consider tctx->cached_refs?

In __io_cqring_wait_schedule(), there is

> if (ext_arg->iowait && current_pending_io())
>         current->in_iowait = 1;

and current_pending_io() is

> static bool current_pending_io(void)
> {
>         struct io_uring_task *tctx = current->io_uring;
>
>         if (!tctx)
>                 return false;
>         return percpu_counter_read_positive(&tctx->inflight);
> }

so okay, we get iowait when tctx->inflight is positive. Looking at where
that variable is modified, I found

> void io_task_refs_refill(struct io_uring_task *tctx)
> {
>         unsigned int refill = -tctx->cached_refs + IO_TCTX_REFS_CACHE_NR;
>
>         percpu_counter_add(&tctx->inflight, refill);
>         refcount_add(refill, &current->usage);
>         tctx->cached_refs += refill;
> }

as well as io_put_task() and io_uring_drop_tctx_refs(). So the inflight
counter also counts cached, unused request references, not just requests
that are actually in flight.

I made __io_cqring_wait_schedule() and io_put_task() non-static and
non-inline to be able to trace them, wrote the following bpftrace script
[1], and ran the reproducer [0], getting the following output:

> Attaching 6 probes...
> 12104: io_task_refs_refill: cached: -1 inflight: 0
> 12104: ret io_task_refs_refill: cached: 1024 inflight: 1025
> 12104: io_put_task: cached: 1024 inflight: 1025
> 12104: ret io_put_task: cached: 1025 inflight: 1025
> 12104: __io_cqring_wait_schedule: iowait: 1
> 12104: __io_cqring_wait_schedule: inflight: 1025

And then it's stuck, as expected, but AFAICS with current->in_iowait
set, which seems surprising to me.

Best Regards,
Fiona

[1]:

> kfunc::io_task_refs_refill
> {
>         printf("%d: %s: cached: %d inflight: %d\n",
>                 tid,
>                 func,
>                 ((struct io_uring_task*)args.tctx)->cached_refs,
>                 ((struct io_uring_task*)args.tctx)->inflight.count
>         );
> }
>
> kretfunc::io_task_refs_refill
> {
>         printf("%d: ret %s: cached: %d inflight: %d\n",
>                 tid,
>                 func,
>                 ((struct io_uring_task*)args.tctx)->cached_refs,
>                 ((struct io_uring_task*)args.tctx)->inflight.count
>         );
> }
>
> kfunc:io_uring_drop_tctx_refs
> {
>         printf("%d: %s\n", tid, func);
> }
>
> kfunc:__io_cqring_wait_schedule
> {
>         printf("%d: %s: iowait: %d\n",
>                 tid,
>                 func,
>                 ((struct ext_arg*)args.ext_arg)->iowait
>         );
>         if (curtask->io_uring) {
>                 printf("%d: %s: inflight: %d\n",
>                         tid,
>                         func,
>                         curtask->io_uring->inflight.count
>                 );
>         } else {
>                 printf("%d: %s: got no tctx!\n", tid, func);
>         }
> }
>
> kfunc:io_put_task
> {
>         printf("%d: %s: cached: %d inflight: %d\n",
>                 tid,
>                 func,
>                 ((struct io_kiocb*)args.req)->tctx->cached_refs,
>                 ((struct io_kiocb*)args.req)->tctx->inflight.count
>         );
> }
>
> kretfunc:io_put_task
> {
>         printf("%d: ret %s: cached: %d inflight: %d\n",
>                 tid,
>                 func,
>                 ((struct io_kiocb*)args.req)->tctx->cached_refs,
>                 ((struct io_kiocb*)args.req)->tctx->inflight.count
>         );
> }

[0]:

> #include <errno.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <liburing.h>
>
> int main(void) {
>         int ret;
>         struct io_uring ring;
>         struct io_uring_sqe *sqe;
>
>         ret = io_uring_queue_init(128, &ring, 0);
>         if (ret != 0) {
>                 printf("Failed to initialize io_uring\n");
>                 return ret;
>         }
>
>         // before submitting+advancing, the issue does not happen
>         // ret = io_uring_submit_and_wait(&ring, 1);
>         // printf("got ret %d\n", ret);
>
>         sqe = io_uring_get_sqe(&ring);
>         if (!sqe) {
>                 printf("Full sq\n");
>                 return -1;
>         }
>
>         io_uring_prep_nop(sqe);
>
>         do {
>                 ret = io_uring_submit_and_wait(&ring, 1);
>         } while (ret == -EINTR);
>
>         if (ret != 1) {
>                 printf("Expected to submit one\n");
>                 return -1;
>         }
>
>         // using peek+seen has the same effect
>         // struct io_uring_cqe *cqe;
>         // io_uring_peek_cqe(&ring, &cqe);
>         // io_uring_cqe_seen(&ring, cqe);
>         io_uring_cq_advance(&ring, 1);
>
>         ret = io_uring_submit_and_wait(&ring, 1);
>         printf("got ret %d\n", ret);
>
>         io_uring_queue_exit(&ring);
>
>         return 0;
> }

^ permalink raw reply [flat|nested] 6+ messages in thread
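
If that reading is right, one possible direction — an untested sketch
against the snippets quoted above, not a submitted patch — is to
discount the cached refs when deciding whether a wait has IO pending:

/* Untested sketch: tctx->inflight includes the refs merely cached in
 * tctx->cached_refs, so subtract them before treating the wait as
 * having IO pending. */
static bool current_pending_io(void)
{
        struct io_uring_task *tctx = current->io_uring;

        if (!tctx)
                return false;
        return percpu_counter_read_positive(&tctx->inflight) >
               tctx->cached_refs;
}

With that change, the reproducer's final wait would see 1025 inflight
minus 1025 cached refs and therefore no pending IO, matching the traced
numbers above.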
* Re: io_uring_prep_timeout() leading to an IO pressure close to 100
  2026-04-24 15:42 ` Fiona Ebner
@ 2026-04-26 21:13   ` Jens Axboe
  0 siblings, 0 replies; 6+ messages in thread

From: Jens Axboe @ 2026-04-26 21:13 UTC (permalink / raw)
To: Fiona Ebner, linux-kernel
Cc: hannes, surenb, peterz, io-uring, Thomas Lamprecht

On 4/24/26 9:42 AM, Fiona Ebner wrote:
> Hi Jens,
>
> On 02.04.26 at 2:30 PM, Fiona Ebner wrote:
>> On 02.04.26 at 11:12 AM, Fiona Ebner wrote:
>>> On 01.04.26 at 5:02 PM, Jens Axboe wrote:
>>>> On 4/1/26 8:59 AM, Fiona Ebner wrote:
>>>>> I'm currently investigating an issue with QEMU causing an IO pressure
>>>>> value of nearly 100 when io_uring is used for the event loop of a QEMU
>>>>> iothread (which is the case since QEMU 10.2 if io_uring is enabled
>>>>> during configuration and available).
>>>>
>>>> It's not "IO pressure", it's the useless iowait metric...
>>>
>>> But it is reported as IO pressure by the kernel, i.e. /proc/pressure/io
>>> (and for a cgroup, /sys/fs/cgroup/foo.slice/bar.scope/io.pressure).
>>>
>>>>> The cause seems to be the io_uring_prep_timeout() call that is used for
>>>>> the blocking wait. I've attached a minimal reproducer below which
>>>>> exposes the issue [0].
>>>>>
>>>>> This was observed on a kernel based on 7.0-rc6 as well as 6.17.13. I
>>>>> haven't investigated what happens inside the kernel yet, so I don't know
>>>>> if it is an accounting issue or within io_uring.
>>>>>
>>>>> Let me know if you need more information or if I should test something
>>>>> specific.
>>>>
>>>> If you don't want it, just turn it off with io_uring_set_iowait().
>>>
>>> QEMU does submit actual IO requests on the same ring, and I suppose
>>> iowait should still be used for those?
>>>
>>> Maybe setting the IORING_ENTER_NO_IOWAIT flag when only the timeout
>>> request is being submitted and no actual IO requests is an option? But
>>> even then, if a request is submitted later via another thread, iowait
>>> for that new request won't be accounted for, right?
>>>
>>> Is there a way to say "I don't want IO wait for timeout submissions"?
>>> Wouldn't that even make sense by default?
>>
>> Turns out that in my QEMU instances, the branch doing the
>> io_uring_prep_timeout() call is not actually taken, so while the issue
>> could arise like that too, it is different in this practical case.
>>
>> What I'm actually seeing is io_uring_submit_and_wait() being called with
>> wait_nr=1 while there is nothing else going on. So a more accurate
>> reproducer for the scenario is attached below [0]. Note that the issue
>> does not happen without submitting and completing a single request
>> first.
>
> I started digging in the kernel now and am wondering whether the number
> of inflight requests is correctly tracked. Does current_pending_io()
> need to consider tctx->cached_refs?
>
> In __io_cqring_wait_schedule(), there is
>
>> if (ext_arg->iowait && current_pending_io())
>>         current->in_iowait = 1;
>
> and current_pending_io() is
>
>> static bool current_pending_io(void)
>> {
>>         struct io_uring_task *tctx = current->io_uring;
>>
>>         if (!tctx)
>>                 return false;
>>         return percpu_counter_read_positive(&tctx->inflight);
>> }
>
> so okay, we get iowait when tctx->inflight is positive. Looking at where
> that variable is modified, I found
>
>> void io_task_refs_refill(struct io_uring_task *tctx)
>> {
>>         unsigned int refill = -tctx->cached_refs + IO_TCTX_REFS_CACHE_NR;
>>
>>         percpu_counter_add(&tctx->inflight, refill);
>>         refcount_add(refill, &current->usage);
>>         tctx->cached_refs += refill;
>> }
>
> as well as io_put_task() and io_uring_drop_tctx_refs(). So the inflight
> counter also counts cached, unused request references, not just requests
> that are actually in flight.

Indeed! Care to send a patch for this? That's definitely a bug. The
existing test case didn't hit this as it only tests with an actual
request pending, and never after refs have been cached.

Thanks for looking into this.

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2026-04-26 21:13 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-01 14:59 io_uring_prep_timeout() leading to an IO pressure close to 100 Fiona Ebner
2026-04-01 15:03 ` Jens Axboe
2026-04-02  9:12   ` Fiona Ebner
2026-04-02 12:31     ` Fiona Ebner
2026-04-24 15:42       ` Fiona Ebner
2026-04-26 21:13         ` Jens Axboe