public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] io_uring: validate user-controlled cq.head in io_cqe_cache_refill()
@ 2026-05-13  6:32 Zizhi Wo
  2026-05-13 14:18 ` Jens Axboe
  0 siblings, 1 reply; 3+ messages in thread
From: Zizhi Wo @ 2026-05-13  6:32 UTC (permalink / raw)
  To: axboe, asml.silence, io-uring
  Cc: linux-kernel, yangerkun, chengzhihao1, wozizhi

From: Zizhi Wo <wozizhi@huawei.com>

[BUG]
A fuzzing run reproduced an unkillable io_uring task stuck at ~100% CPU:

    [root@fedora io_uring_stress]# ps -ef | grep io_uring
    root  1240  1  99 13:36 ?  00:01:35 [io_uring_stress] <defunct>

The task loops inside io_cqring_wait(), never returning to userspace, and
SIGKILL has no effect.

[CAUSE]
The CQ ring exposes rings->cq.head to userspace as writable, while the
authoritative tail lives in kernel-private ctx->cached_cq_tail.
io_cqe_cache_refill() computes free space as an unsigned subtraction:

    free = ctx->cq_entries - min(tail - head, ctx->cq_entries);

If userspace keeps head within [0, tail], the subtraction is well defined
and min() just acts as a defensive clamp. But if userspace advances head
past tail, (tail - head) wraps to a huge value, free becomes 0, and
io_cqe_cache_refill() fails. The CQE is pushed onto the overflow list and
IO_CHECK_CQ_OVERFLOW_BIT is set.
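
As a standalone illustration of the underflow (plain userspace C, not
kernel code; min() is open-coded as a ternary and the names mirror the
snippet above):

    #include <stdio.h>

    int main(void)
    {
            unsigned int cq_entries = 8;
            unsigned int tail = 100;    /* kernel-private cached_cq_tail */
            unsigned int head = 103;    /* userspace moved head past tail */
            unsigned int queued, free;

            /* (tail - head) wraps to a huge value; min() clamps it to
             * cq_entries, so free collapses to 0 on an empty ring */
            queued = (tail - head) < cq_entries ? (tail - head) : cq_entries;
            free = cq_entries - queued;
            printf("tail-head=%u queued=%u free=%u\n",
                   tail - head, queued, free);
            /* prints: tail-head=4294967293 queued=8 free=0 */
            return 0;
    }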

The wait loop in io_cqring_wait() relies on an invariant: refill() only
fails when the CQ is *physically* full, in which case rings->cq.tail has
been advanced to iowq->cq_tail and io_should_wake() returns true. The
tampered head breaks this: refill() fails while the ring is not full, no
OCQE is copied in, rings->cq.tail never catches up, io_should_wake() stays
false, and io_cqring_wait_schedule() keeps returning early because
IO_CHECK_CQ_OVERFLOW_BIT is still set. The result is a tight retry loop
that never returns to userspace.
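
Schematically, the stuck loop looks like this (a pseudo-view condensed
from the description above, not the literal kernel source):

    /* inside io_cqring_wait(), heavily simplified */
    do {
            /* returns early every iteration because
             * IO_CHECK_CQ_OVERFLOW_BIT is still set */
            io_cqring_wait_schedule(ctx, iowq);

            /* refill() keeps failing, so rings->cq.tail never reaches
             * iowq->cq_tail and this never becomes true */
            if (io_should_wake(iowq))
                    break;
    } while (1);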

Note that only head is userspace-writable; cached_cq_tail cannot be
corrupted the same way.

[FIX]
Treat rings->cq.head as untrusted, like io_get_sqe() already does for
sq_array[]. Use a wraparound-safe signed comparison: since the real
head/tail distance is bounded by cq_entries (far below 2^31),
(s32)(tail - head) < 0 reliably means userspace moved head past tail.
In that case expose the full cache as free so refill() succeeds, tail
advances, the wait loop wakes, and the task returns to userspace.
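
In isolation, the check behaves like this (a standalone sketch; s32
stands in for the kernel typedef):

    #include <stdint.h>
    #include <stdio.h>

    typedef int32_t s32;    /* stand-in for the kernel typedef */

    int main(void)
    {
            unsigned int tail = 100;
            unsigned int head_ok = 97;   /* within cq_entries of tail */
            unsigned int head_bad = 103; /* tampered, moved past tail */

            /* real distances are bounded far below 2^31, so a negative
             * signed difference can only mean head passed tail */
            printf("%d\n", (s32)(tail - head_ok) < 0);  /* 0: trusted */
            printf("%d\n", (s32)(tail - head_bad) < 0); /* 1: rejected */
            return 0;
    }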

CQEs that would otherwise be delivered may be lost when the application
corrupts its own head pointer, but that is an application-visible
consequence of its own action; the kernel's responsibility here is limited
to keeping the task killable and making forward progress.

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
---
 io_uring/io_uring.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4ed998d60c09..92e255e9e08f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -710,11 +710,13 @@ static bool io_fill_nop_cqe(struct io_ring_ctx *ctx, unsigned int off)
  * fill the cq entry
  */
 bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow, bool cqe32)
 {
 	struct io_rings *rings = ctx->rings;
-	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
+	unsigned int head = READ_ONCE(ctx->rings->cq.head);
+	unsigned int tail = ctx->cached_cq_tail;
+	unsigned int off = tail & (ctx->cq_entries - 1);
 	unsigned int free, queued, len;
 
 	/*
 	 * Posting into the CQ when there are pending overflowed CQEs may break
 	 * ordering guarantees, which will affect links, F_MORE users and more.
@@ -731,12 +733,20 @@ bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow, bool cqe32)
 		if (!io_fill_nop_cqe(ctx, off))
 			return false;
 		off = 0;
 	}
 
-	/* userspace may cheat modifying the tail, be safe and do min */
-	queued = min(__io_cqring_events(ctx), ctx->cq_entries);
+	/*
+	 * rings->cq.head is user-writable.  If userspace advances it past
+	 * cached_cq_tail, (tail - head) underflows and free becomes 0, which
+	 * traps io_cqring_wait() in an unkillable loop via the overflow path.
+	 * Treat such a state as "nothing queued" to guarantee forward progress.
+	 */
+	if (unlikely((s32)(tail - head) < 0))
+		queued = 0;
+	else
+		queued = min(tail - head, ctx->cq_entries);
 	free = ctx->cq_entries - queued;
 	/* we need a contiguous range, limit based on the current array offset */
 	len = min(free, ctx->cq_entries - off);
 	if (len < (cqe32 + 1))
 		return false;
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] io_uring: validate user-controlled cq.head in io_cqe_cache_refill()
  2026-05-13  6:32 [PATCH] io_uring: validate user-controlled cq.head in io_cqe_cache_refill() Zizhi Wo
@ 2026-05-13 14:18 ` Jens Axboe
  2026-05-13 14:20   ` Jens Axboe
  0 siblings, 1 reply; 3+ messages in thread
From: Jens Axboe @ 2026-05-13 14:18 UTC (permalink / raw)
  To: Zizhi Wo, asml.silence, io-uring; +Cc: linux-kernel, yangerkun, chengzhihao1

On 5/13/26 12:32 AM, Zizhi Wo wrote:
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 4ed998d60c09..92e255e9e08f 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -710,11 +710,13 @@ static bool io_fill_nop_cqe(struct io_ring_ctx *ctx, unsigned int off)
>   * fill the cq entry
>   */
>  bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow, bool cqe32)
>  {
>  	struct io_rings *rings = ctx->rings;
> -	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
> +	unsigned int head = READ_ONCE(ctx->rings->cq.head);
> +	unsigned int tail = ctx->cached_cq_tail;
> +	unsigned int off = tail & (ctx->cq_entries - 1);
>  	unsigned int free, queued, len;

This looks wrong, as you're snapshotting 'tail' while it could get
modified by a nop fill before the refill happens. And fwiw, looks like
the refill part potentially suffers from the same unsigned issue.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] io_uring: validate user-controlled cq.head in io_cqe_cache_refill()
  2026-05-13 14:18 ` Jens Axboe
@ 2026-05-13 14:20   ` Jens Axboe
  0 siblings, 0 replies; 3+ messages in thread
From: Jens Axboe @ 2026-05-13 14:20 UTC (permalink / raw)
  To: Zizhi Wo, asml.silence, io-uring; +Cc: linux-kernel, yangerkun, chengzhihao1

On 5/13/26 8:18 AM, Jens Axboe wrote:
> On 5/13/26 12:32 AM, Zizhi Wo wrote:
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 4ed998d60c09..92e255e9e08f 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -710,11 +710,13 @@ static bool io_fill_nop_cqe(struct io_ring_ctx *ctx, unsigned int off)
>>   * fill the cq entry
>>   */
>>  bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow, bool cqe32)
>>  {
>>  	struct io_rings *rings = ctx->rings;
>> -	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
>> +	unsigned int head = READ_ONCE(ctx->rings->cq.head);
>> +	unsigned int tail = ctx->cached_cq_tail;
>> +	unsigned int off = tail & (ctx->cq_entries - 1);
>>  	unsigned int free, queued, len;
> 
> This looks wrong, as you're snapshotting 'tail' while it could get
> modified by a nop fill before the refill happens. And fwiw, looks like
> the refill part potentially suffers from the same unsigned issue.

To be clearer, I think you want to add a helper ala:

static unsigned int io_cqring_queued(struct io_ring_ctx *ctx)
{
	struct io_rings *rings = ctx->rings;
	int diff;

	diff = (int)(ctx->cached_cq_tail - READ_ONCE(rings->cq.head));
	if (diff >= 0)
		return min((unsigned int)diff, ctx->cq_entries);
	return 0;
}

or something like that, and then use it in both spots. Would make for a
cleaner fix, too.
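
E.g. the refill side would then collapse to something like this
(untested, going by the hunk quoted above):

	queued = io_cqring_queued(ctx);
	free = ctx->cq_entries - queued;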

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-05-13 14:20 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-13  6:32 [PATCH] io_uring: validate user-controlled cq.head in io_cqe_cache_refill() Zizhi Wo
2026-05-13 14:18 ` Jens Axboe
2026-05-13 14:20   ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox