public inbox for io-uring@vger.kernel.org
* [PATCHSET RFC 0/8] Add support for mixed sized CQEs
@ 2025-08-08 17:03 Jens Axboe
  2025-08-08 17:03 ` [PATCH 1/8] io_uring: remove io_ctx_cqe32() helper Jens Axboe
                   ` (7 more replies)
  0 siblings, 8 replies; 10+ messages in thread
From: Jens Axboe @ 2025-08-08 17:03 UTC (permalink / raw)
  To: io-uring

Hi,

Currently io_uring supports two modes for CQEs:

1) The standard mode, where 16b CQEs are used
2) Setting IORING_SETUP_CQE32, which makes all CQEs posted 32b

Certain features need to pass back more information than just a single
32-bit res field, and hence mandate the use of CQE32 to work at all.
Examples include passthrough and other uses of ->uring_cmd(), such as
getting and setting socket options, or retrieving timestamps.

This patchset adds support for IORING_SETUP_CQE_MIXED, which allows
posting both 16b and 32b CQEs on the same CQ ring. The idea is to avoid
doubling the CQ ring size, or the space consumed per posted CQE, when
only some of the posted CQEs require the 32b format. On a ring set up
in CQE mixed mode, posted 32b CQEs will have IORING_CQE_F_32 set in
cqe->flags to tell the application (or liburing) about this fact.
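
For illustration, creating such a ring from userspace could look like
the minimal sketch below (assuming a liburing recent enough to carry
the flag definition; kernels without support reject unknown flags):

	struct io_uring_params p = { };
	struct io_uring ring;
	int ret;

	p.flags = IORING_SETUP_CQE_MIXED;
	ret = io_uring_queue_init_params(8, &ring, &p);
	if (ret < 0)
		return ret;	/* e.g. -EINVAL on kernels without support */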

This is mostly trivial to support, with the corner case being an
attempt to post a 32b CQE when the ring is a single 16b CQE away from
wrapping. As CQEs must be contiguous in memory, that's simply not
possible. The solution taken by this patchset is to add a special CQE
type, which has IORING_CQE_F_SKIP set. This is a pad/nop CQE, which
should simply be ignored: it carries no information and serves no
purpose other than realigning the posted CQEs across the ring wrap.

If used with liburing, both the 16b/32b postings and the skip CQEs are
handled transparently.
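
For applications reaping CQEs by hand instead, the loop conceptually
becomes something like the sketch below (illustrative only: memory
barriers are elided, and handle_cqe(), cqes, cq_ring_mask etc are
placeholder names):

	unsigned head = *cq_head_ptr;
	unsigned tail = *cq_tail_ptr;	/* needs an acquire load */

	while (head != tail) {
		struct io_uring_cqe *cqe = &cqes[head & cq_ring_mask];

		if (cqe->flags & IORING_CQE_F_SKIP) {
			/* pad entry before a ring wrap, no completion */
			head++;
			continue;
		}
		handle_cqe(cqe);
		/* a 32b CQE consumes two 16b slots */
		head += (cqe->flags & IORING_CQE_F_32) ? 2 : 1;
	}
	*cq_head_ptr = head;		/* needs a release store */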

liburing support and a few basic test cases can be found here:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/liburing.git/log/?h=cqe-mixed

and these patches can also be found here:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-cqe-mix

Patch 1 is just a prep patch, and patch 2 adds the CQE flags so that the
core can be adapted before support is actually there. Patches 3 and 4 do
exactly that adaptation, and patch 5 finally adds support for the mixed mode.
Patch 6 adds support for NOP testing of this, and patches 7/8 allow
IORING_SETUP_CQE_MIXED for uring_cmd/zcrx which previously required
IORING_SETUP_CQE32 to work.

 Documentation/networking/iou-zcrx.rst |  2 +-
 include/linux/io_uring_types.h        |  6 ---
 include/trace/events/io_uring.h       |  4 +-
 include/uapi/linux/io_uring.h         | 17 +++++++
 io_uring/cmd_net.c                    |  3 +-
 io_uring/fdinfo.c                     | 22 +++++----
 io_uring/io_uring.c                   | 71 +++++++++++++++++++++------
 io_uring/io_uring.h                   | 49 ++++++++++++------
 io_uring/nop.c                        | 17 ++++++-
 io_uring/register.c                   |  3 +-
 io_uring/uring_cmd.c                  |  2 +-
 io_uring/zcrx.c                       |  5 +-
 12 files changed, 146 insertions(+), 55 deletions(-)

-- 
Jens Axboe



* [PATCH 1/8] io_uring: remove io_ctx_cqe32() helper
  2025-08-08 17:03 [PATCHSET RFC 0/8] Add support for mixed sized CQEs Jens Axboe
@ 2025-08-08 17:03 ` Jens Axboe
  2025-08-08 17:03 ` [PATCH 2/8] io_uring: add UAPI definitions for mixed CQE postings Jens Axboe
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-08-08 17:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

It's pretty pointless and only used by the tracing helper; get rid
of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h  | 6 ------
 include/trace/events/io_uring.h | 4 ++--
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 80a178f3d896..6f4080ec968e 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -727,10 +727,4 @@ struct io_overflow_cqe {
 	struct list_head list;
 	struct io_uring_cqe cqe;
 };
-
-static inline bool io_ctx_cqe32(struct io_ring_ctx *ctx)
-{
-	return ctx->flags & IORING_SETUP_CQE32;
-}
-
 #endif
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 178ab6f611be..6a970625a3ea 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -340,8 +340,8 @@ TP_PROTO(struct io_ring_ctx *ctx, void *req, struct io_uring_cqe *cqe),
 		__entry->user_data	= cqe->user_data;
 		__entry->res		= cqe->res;
 		__entry->cflags		= cqe->flags;
-		__entry->extra1		= io_ctx_cqe32(ctx) ? cqe->big_cqe[0] : 0;
-		__entry->extra2		= io_ctx_cqe32(ctx) ? cqe->big_cqe[1] : 0;
+		__entry->extra1		= ctx->flags & IORING_SETUP_CQE32 ? cqe->big_cqe[0] : 0;
+		__entry->extra2		= ctx->flags & IORING_SETUP_CQE32 ? cqe->big_cqe[1] : 0;
 	),
 
 	TP_printk("ring %p, req %p, user_data 0x%llx, result %d, cflags 0x%x "
-- 
2.50.1



* [PATCH 2/8] io_uring: add UAPI definitions for mixed CQE postings
  2025-08-08 17:03 [PATCHSET RFC 0/8] Add support for mixed sized CQEs Jens Axboe
  2025-08-08 17:03 ` [PATCH 1/8] io_uring: remove io_ctx_cqe32() helper Jens Axboe
@ 2025-08-08 17:03 ` Jens Axboe
  2025-08-08 17:03 ` [PATCH 3/8] io_uring/fdinfo: handle mixed sized CQEs Jens Axboe
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-08-08 17:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

This adds the CQE flags related to supporting a mixed CQ ring mode, where
both normal (16b) and big (32b) CQEs may be posted.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/uapi/linux/io_uring.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 6957dc539d83..69337eb1db33 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -487,12 +487,22 @@ struct io_uring_cqe {
  *			other provided buffer type, all completions with a
  *			buffer passed back is automatically returned to the
  *			application.
+ * IORING_CQE_F_SKIP	If set, then the application/liburing must ignore this
+ *			CQE. Its only purpose is to fill a gap in the ring,
+ *			if posting a large CQE is attempted when the ring has
+ *			just a single small CQE worth of space left before
+ *			wrapping.
+ * IORING_CQE_F_32	If set, this is a 32b/big-cqe posting. Use with rings
+ *			setup in a mixed CQE mode, where both 16b and 32b
+ *			CQEs may be posted to the CQ ring.
  */
 #define IORING_CQE_F_BUFFER		(1U << 0)
 #define IORING_CQE_F_MORE		(1U << 1)
 #define IORING_CQE_F_SOCK_NONEMPTY	(1U << 2)
 #define IORING_CQE_F_NOTIF		(1U << 3)
 #define IORING_CQE_F_BUF_MORE		(1U << 4)
+#define IORING_CQE_F_SKIP		(1U << 5)
+#define IORING_CQE_F_32			(1U << 15)
 
 #define IORING_CQE_BUFFER_SHIFT		16
 
-- 
2.50.1



* [PATCH 3/8] io_uring/fdinfo: handle mixed sized CQEs
  2025-08-08 17:03 [PATCHSET RFC 0/8] Add support for mixed sized CQEs Jens Axboe
  2025-08-08 17:03 ` [PATCH 1/8] io_uring: remove io_ctx_cqe32() helper Jens Axboe
  2025-08-08 17:03 ` [PATCH 2/8] io_uring: add UAPI definitions for mixed CQE postings Jens Axboe
@ 2025-08-08 17:03 ` Jens Axboe
  2025-08-08 17:03 ` [PATCH 4/8] io_uring/trace: support completion tracing of mixed 32b CQEs Jens Axboe
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-08-08 17:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Ensure that the CQ ring iteration handles differently sized CQEs, not
just a fixed 16b or 32b size per ring. Such CQEs aren't possible just
yet, but this prepares the fdinfo CQ ring dumping to handle them.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/fdinfo.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index 9798d6fb4ec7..5c7339838769 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -65,15 +65,12 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 	unsigned int sq_tail = READ_ONCE(r->sq.tail);
 	unsigned int cq_head = READ_ONCE(r->cq.head);
 	unsigned int cq_tail = READ_ONCE(r->cq.tail);
-	unsigned int cq_shift = 0;
 	unsigned int sq_shift = 0;
-	unsigned int sq_entries, cq_entries;
+	unsigned int sq_entries;
 	int sq_pid = -1, sq_cpu = -1;
 	u64 sq_total_time = 0, sq_work_time = 0;
 	unsigned int i;
 
-	if (ctx->flags & IORING_SETUP_CQE32)
-		cq_shift = 1;
 	if (ctx->flags & IORING_SETUP_SQE128)
 		sq_shift = 1;
 
@@ -125,18 +122,23 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 		seq_printf(m, "\n");
 	}
 	seq_printf(m, "CQEs:\t%u\n", cq_tail - cq_head);
-	cq_entries = min(cq_tail - cq_head, ctx->cq_entries);
-	for (i = 0; i < cq_entries; i++) {
-		unsigned int entry = i + cq_head;
-		struct io_uring_cqe *cqe = &r->cqes[(entry & cq_mask) << cq_shift];
+	while (cq_head < cq_tail) {
+		struct io_uring_cqe *cqe;
+		bool cqe32 = false;
 
+		cqe = &r->cqes[(cq_head & cq_mask)];
+		if (cqe->flags & IORING_CQE_F_32 || ctx->flags & IORING_SETUP_CQE32)
+			cqe32 = true;
 		seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x",
-			   entry & cq_mask, cqe->user_data, cqe->res,
+			   cq_head & cq_mask, cqe->user_data, cqe->res,
 			   cqe->flags);
-		if (cq_shift)
+		if (cqe32)
 			seq_printf(m, ", extra1:%llu, extra2:%llu\n",
 					cqe->big_cqe[0], cqe->big_cqe[1]);
 		seq_printf(m, "\n");
+		cq_head++;
+		if (cqe32)
+			cq_head++;
 	}
 
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
-- 
2.50.1



* [PATCH 4/8] io_uring/trace: support completion tracing of mixed 32b CQEs
  2025-08-08 17:03 [PATCHSET RFC 0/8] Add support for mixed sized CQEs Jens Axboe
                   ` (2 preceding siblings ...)
  2025-08-08 17:03 ` [PATCH 3/8] io_uring/fdinfo: handle mixed sized CQEs Jens Axboe
@ 2025-08-08 17:03 ` Jens Axboe
  2025-08-08 17:03 ` [PATCH 5/8] io_uring: add support for IORING_SETUP_CQE_MIXED Jens Axboe
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-08-08 17:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Check for IORING_CQE_F_32 as well, not just whether the ring was set up
with IORING_SETUP_CQE32 (which makes all CQEs big).

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/trace/events/io_uring.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 6a970625a3ea..45d15460b495 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -340,8 +340,8 @@ TP_PROTO(struct io_ring_ctx *ctx, void *req, struct io_uring_cqe *cqe),
 		__entry->user_data	= cqe->user_data;
 		__entry->res		= cqe->res;
 		__entry->cflags		= cqe->flags;
-		__entry->extra1		= ctx->flags & IORING_SETUP_CQE32 ? cqe->big_cqe[0] : 0;
-		__entry->extra2		= ctx->flags & IORING_SETUP_CQE32 ? cqe->big_cqe[1] : 0;
+		__entry->extra1		= ctx->flags & IORING_SETUP_CQE32 || cqe->flags & IORING_CQE_F_32 ? cqe->big_cqe[0] : 0;
+		__entry->extra2		= ctx->flags & IORING_SETUP_CQE32 || cqe->flags & IORING_CQE_F_32 ? cqe->big_cqe[1] : 0;
 	),
 
 	TP_printk("ring %p, req %p, user_data 0x%llx, result %d, cflags 0x%x "
-- 
2.50.1



* [PATCH 5/8] io_uring: add support for IORING_SETUP_CQE_MIXED
  2025-08-08 17:03 [PATCHSET RFC 0/8] Add support for mixed sized CQEs Jens Axboe
                   ` (3 preceding siblings ...)
  2025-08-08 17:03 ` [PATCH 4/8] io_uring/trace: support completion tracing of mixed 32b CQEs Jens Axboe
@ 2025-08-08 17:03 ` Jens Axboe
  2025-08-08 17:03 ` [PATCH 6/8] io_uring/nop: " Jens Axboe
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-08-08 17:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Normal rings support 16b CQEs for posting completions, while certain
features require the ring to be configured with IORING_SETUP_CQE32, as
they need to convey more information per completion. This, in turn,
makes ALL the CQEs 32b in size. This is somewhat wasteful and
inefficient, particularly when only certain CQEs need to be of the
bigger variant.

This adds support for setting up a ring with mixed CQE sizes, using
IORING_SETUP_CQE_MIXED. When set up in this mode, CQEs posted to the ring
may be either 16b or 32b in size. If a CQE is 32b in size, then
IORING_CQE_F_32 is set in the CQE flags to indicate that this is the
case. If this flag isn't set, the CQE is the normal 16b variant.

CQEs on these types of mixed rings may also have IORING_CQE_F_SKIP set.
This can happen if the ring is one (small) CQE entry away from wrapping,
and an attempt is made to post a 32b CQE. As CQEs must be contiguous in
the CQ ring, a 32b CQE cannot wrap the ring. For this case, a single
dummy CQE is posted with the SKIP flag set. The application should
simply ignore those.
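
As a concrete example, consider a hypothetical CQ ring with 8 small-CQE
slots where the tail currently points at slot 7. A 32b CQE would need
slots 7 and 0, which aren't contiguous in memory. Instead, slot 7 gets a
16b CQE with only IORING_CQE_F_SKIP set, and the real 32b CQE lands in
slots 0 and 1 after the wrap.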

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/uapi/linux/io_uring.h |  6 +++
 io_uring/io_uring.c           | 71 +++++++++++++++++++++++++++--------
 io_uring/io_uring.h           | 49 +++++++++++++++++-------
 io_uring/register.c           |  3 +-
 4 files changed, 99 insertions(+), 30 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 69337eb1db33..9396afb01dc8 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -225,6 +225,12 @@ enum io_uring_sqe_flags_bit {
 /* Use hybrid poll in iopoll process */
 #define IORING_SETUP_HYBRID_IOPOLL	(1U << 17)
 
+/*
+ * Allow both 16b and 32b CQEs. If a 32b CQE is posted, it will have
+ * IORING_CQE_F_32 set in cqe->flags.
+ */
+#define IORING_SETUP_CQE_MIXED		(1U << 18)
+
 enum io_uring_op {
 	IORING_OP_NOP,
 	IORING_OP_READV,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4ef69dd58734..c83e065ed56d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -598,27 +598,27 @@ static void io_cq_unlock_post(struct io_ring_ctx *ctx)
 
 static void __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool dying)
 {
-	size_t cqe_size = sizeof(struct io_uring_cqe);
-
 	lockdep_assert_held(&ctx->uring_lock);
 
 	/* don't abort if we're dying, entries must get freed */
 	if (!dying && __io_cqring_events(ctx) == ctx->cq_entries)
 		return;
 
-	if (ctx->flags & IORING_SETUP_CQE32)
-		cqe_size <<= 1;
-
 	io_cq_lock(ctx);
 	while (!list_empty(&ctx->cq_overflow_list)) {
+		size_t cqe_size = sizeof(struct io_uring_cqe);
 		struct io_uring_cqe *cqe;
 		struct io_overflow_cqe *ocqe;
+		bool is_cqe32;
 
 		ocqe = list_first_entry(&ctx->cq_overflow_list,
 					struct io_overflow_cqe, list);
+		is_cqe32 = !!(ocqe->cqe.flags & IORING_CQE_F_32);
+		if (is_cqe32)
+			cqe_size <<= 1;
 
 		if (!dying) {
-			if (!io_get_cqe_overflow(ctx, &cqe, true))
+			if (!io_get_cqe_overflow(ctx, &cqe, true, is_cqe32))
 				break;
 			memcpy(cqe, &ocqe->cqe, cqe_size);
 		}
@@ -730,10 +730,10 @@ static struct io_overflow_cqe *io_alloc_ocqe(struct io_ring_ctx *ctx,
 {
 	struct io_overflow_cqe *ocqe;
 	size_t ocq_size = sizeof(struct io_overflow_cqe);
-	bool is_cqe32 = (ctx->flags & IORING_SETUP_CQE32);
+	bool is_cqe32 = cqe->flags & IORING_CQE_F_32;
 
 	if (is_cqe32)
-		ocq_size += sizeof(struct io_uring_cqe);
+		ocq_size <<= 1;
 
 	ocqe = kzalloc(ocq_size, gfp | __GFP_ACCOUNT);
 	trace_io_uring_cqe_overflow(ctx, cqe->user_data, cqe->res, cqe->flags, ocqe);
@@ -751,12 +751,29 @@ static struct io_overflow_cqe *io_alloc_ocqe(struct io_ring_ctx *ctx,
 	return ocqe;
 }
 
+/*
+ * Fill an empty dummy CQE, in case alignment is off for posting a 32b CQE
+ * because the ring is a single 16b entry away from wrapping.
+ */
+static bool io_fill_nop_cqe(struct io_ring_ctx *ctx, unsigned int off)
+{
+	if (__io_cqring_events(ctx) < ctx->cq_entries) {
+		struct io_uring_cqe *cqe = &ctx->rings->cqes[off];
+
+		memset(cqe, 0, sizeof(*cqe));
+		cqe->flags = IORING_CQE_F_SKIP;
+		ctx->cached_cq_tail++;
+		return true;
+	}
+	return false;
+}
+
 /*
  * writes to the cq entry need to come after reading head; the
  * control dependency is enough as we're using WRITE_ONCE to
  * fill the cq entry
  */
-bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow)
+bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow, bool cqe32)
 {
 	struct io_rings *rings = ctx->rings;
 	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
@@ -770,12 +787,22 @@ bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow)
 	if (!overflow && (ctx->check_cq & BIT(IO_CHECK_CQ_OVERFLOW_BIT)))
 		return false;
 
+	/*
+	 * Post dummy CQE if a 32b CQE is needed and there's only room for a
+	 * 16b CQE before the ring wraps.
+	 */
+	if (cqe32 && ctx->cq_entries - off == 1) {
+		if (!io_fill_nop_cqe(ctx, off))
+			return false;
+		off = 0;
+	}
+
 	/* userspace may cheat modifying the tail, be safe and do min */
 	queued = min(__io_cqring_events(ctx), ctx->cq_entries);
 	free = ctx->cq_entries - queued;
 	/* we need a contiguous range, limit based on the current array offset */
 	len = min(free, ctx->cq_entries - off);
-	if (!len)
+	if (len < (cqe32 + 1))
 		return false;
 
 	if (ctx->flags & IORING_SETUP_CQE32) {
@@ -793,9 +820,9 @@ static bool io_fill_cqe_aux32(struct io_ring_ctx *ctx,
 {
 	struct io_uring_cqe *cqe;
 
-	if (WARN_ON_ONCE(!(ctx->flags & IORING_SETUP_CQE32)))
+	if (WARN_ON_ONCE(!(ctx->flags & (IORING_SETUP_CQE32|IORING_SETUP_CQE_MIXED))))
 		return false;
-	if (unlikely(!io_get_cqe(ctx, &cqe)))
+	if (unlikely(!io_get_cqe(ctx, &cqe, true)))
 		return false;
 
 	memcpy(cqe, src_cqe, 2 * sizeof(*cqe));
@@ -806,14 +833,15 @@ static bool io_fill_cqe_aux32(struct io_ring_ctx *ctx,
 static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res,
 			      u32 cflags)
 {
+	bool cqe32 = cflags & IORING_CQE_F_32;
 	struct io_uring_cqe *cqe;
 
-	if (likely(io_get_cqe(ctx, &cqe))) {
+	if (likely(io_get_cqe(ctx, &cqe, cqe32))) {
 		WRITE_ONCE(cqe->user_data, user_data);
 		WRITE_ONCE(cqe->res, res);
 		WRITE_ONCE(cqe->flags, cflags);
 
-		if (ctx->flags & IORING_SETUP_CQE32) {
+		if (cqe32) {
 			WRITE_ONCE(cqe->big_cqe[0], 0);
 			WRITE_ONCE(cqe->big_cqe[1], 0);
 		}
@@ -2735,6 +2763,10 @@ unsigned long rings_size(unsigned int flags, unsigned int sq_entries,
 		if (check_shl_overflow(off, 1, &off))
 			return SIZE_MAX;
 	}
+	if (flags & IORING_SETUP_CQE_MIXED) {
+		if (cq_entries < 2)
+			return SIZE_MAX;
+	}
 
 #ifdef CONFIG_SMP
 	off = ALIGN(off, SMP_CACHE_BYTES);
@@ -3658,6 +3690,14 @@ static int io_uring_sanitise_params(struct io_uring_params *p)
 	    !(flags & IORING_SETUP_SINGLE_ISSUER))
 		return -EINVAL;
 
+	/*
+ * Nonsensical to ask for both CQE32 and mixed CQE support; posting
+ * 16b CQEs on a ring set up with CQE32 isn't supported.
+	 */
+	if ((flags & (IORING_SETUP_CQE32|IORING_SETUP_CQE_MIXED)) ==
+	    (IORING_SETUP_CQE32|IORING_SETUP_CQE_MIXED))
+		return -EINVAL;
+
 	return 0;
 }
 
@@ -3884,7 +3924,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
 			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
 			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY |
-			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HYBRID_IOPOLL))
+			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HYBRID_IOPOLL |
+			IORING_SETUP_CQE_MIXED))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index abc6de227f74..2e4f7223a767 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -75,7 +75,7 @@ static inline bool io_should_wake(struct io_wait_queue *iowq)
 unsigned long rings_size(unsigned int flags, unsigned int sq_entries,
 			 unsigned int cq_entries, size_t *sq_offset);
 int io_uring_fill_params(unsigned entries, struct io_uring_params *p);
-bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow);
+bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow, bool cqe32);
 int io_run_task_work_sig(struct io_ring_ctx *ctx);
 void io_req_defer_failed(struct io_kiocb *req, s32 res);
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
@@ -169,25 +169,31 @@ static inline void io_submit_flush_completions(struct io_ring_ctx *ctx)
 
 static inline bool io_get_cqe_overflow(struct io_ring_ctx *ctx,
 					struct io_uring_cqe **ret,
-					bool overflow)
+					bool overflow, bool cqe32)
 {
 	io_lockdep_assert_cq_locked(ctx);
 
-	if (unlikely(ctx->cqe_cached >= ctx->cqe_sentinel)) {
-		if (unlikely(!io_cqe_cache_refill(ctx, overflow)))
+	if (unlikely(ctx->cqe_sentinel - ctx->cqe_cached < (cqe32 + 1))) {
+		if (unlikely(!io_cqe_cache_refill(ctx, overflow, cqe32)))
 			return false;
 	}
 	*ret = ctx->cqe_cached;
 	ctx->cached_cq_tail++;
 	ctx->cqe_cached++;
-	if (ctx->flags & IORING_SETUP_CQE32)
+	if (ctx->flags & IORING_SETUP_CQE32) {
+		ctx->cqe_cached++;
+	} else if (cqe32 && ctx->flags & IORING_SETUP_CQE_MIXED) {
 		ctx->cqe_cached++;
+		ctx->cached_cq_tail++;
+	}
+	WARN_ON_ONCE(ctx->cqe_cached > ctx->cqe_sentinel);
 	return true;
 }
 
-static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret)
+static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret,
+				bool cqe32)
 {
-	return io_get_cqe_overflow(ctx, ret, false);
+	return io_get_cqe_overflow(ctx, ret, false, cqe32);
 }
 
 static inline bool io_defer_get_uncommited_cqe(struct io_ring_ctx *ctx,
@@ -196,25 +202,24 @@ static inline bool io_defer_get_uncommited_cqe(struct io_ring_ctx *ctx,
 	io_lockdep_assert_cq_locked(ctx);
 
 	ctx->submit_state.cq_flush = true;
-	return io_get_cqe(ctx, cqe_ret);
+	return io_get_cqe(ctx, cqe_ret, false);
 }
 
 static __always_inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
 					    struct io_kiocb *req)
 {
+	bool is_cqe32 = req->cqe.flags & IORING_CQE_F_32;
 	struct io_uring_cqe *cqe;
 
 	/*
-	 * If we can't get a cq entry, userspace overflowed the
-	 * submission (by quite a lot). Increment the overflow count in
-	 * the ring.
+	 * If we can't get a cq entry, userspace overflowed the submission
+	 * (by quite a lot).
 	 */
-	if (unlikely(!io_get_cqe(ctx, &cqe)))
+	if (unlikely(!io_get_cqe(ctx, &cqe, is_cqe32)))
 		return false;
 
-
 	memcpy(cqe, &req->cqe, sizeof(*cqe));
-	if (ctx->flags & IORING_SETUP_CQE32) {
+	if (is_cqe32) {
 		memcpy(cqe->big_cqe, &req->big_cqe, sizeof(*cqe));
 		memset(&req->big_cqe, 0, sizeof(req->big_cqe));
 	}
@@ -239,6 +244,22 @@ static inline void io_req_set_res(struct io_kiocb *req, s32 res, u32 cflags)
 	req->cqe.flags = cflags;
 }
 
+static inline u32 ctx_cqe32_flags(struct io_ring_ctx *ctx)
+{
+	if (ctx->flags & IORING_SETUP_CQE_MIXED)
+		return IORING_CQE_F_32;
+	return 0;
+}
+
+static inline void io_req_set_res32(struct io_kiocb *req, s32 res, u32 cflags,
+				    __u64 extra1, __u64 extra2)
+{
+	req->cqe.res = res;
+	req->cqe.flags = cflags | ctx_cqe32_flags(req->ctx);
+	req->big_cqe.extra1 = extra1;
+	req->big_cqe.extra2 = extra2;
+}
+
 static inline void *io_uring_alloc_async_data(struct io_alloc_cache *cache,
 					      struct io_kiocb *req)
 {
diff --git a/io_uring/register.c b/io_uring/register.c
index a59589249fce..a1a9b2884eae 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -396,7 +396,8 @@ static void io_register_free_rings(struct io_ring_ctx *ctx,
 
 #define RESIZE_FLAGS	(IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP)
 #define COPY_FLAGS	(IORING_SETUP_NO_SQARRAY | IORING_SETUP_SQE128 | \
-			 IORING_SETUP_CQE32 | IORING_SETUP_NO_MMAP)
+			 IORING_SETUP_CQE32 | IORING_SETUP_NO_MMAP | \
+			 IORING_SETUP_CQE_MIXED)
 
 static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
 {
-- 
2.50.1



* [PATCH 6/8] io_uring/nop: add support for IORING_SETUP_CQE_MIXED
  2025-08-08 17:03 [PATCHSET RFC 0/8] Add support for mixed sized CQEs Jens Axboe
                   ` (4 preceding siblings ...)
  2025-08-08 17:03 ` [PATCH 5/8] io_uring: add support for IORING_SETUP_CQE_MIXED Jens Axboe
@ 2025-08-08 17:03 ` Jens Axboe
  2025-08-08 17:03 ` [PATCH 7/8] io_uring/uring_cmd: " Jens Axboe
  2025-08-08 17:03 ` [PATCH 8/8] io_uring/zcrx: " Jens Axboe
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-08-08 17:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

This adds support for setting IORING_NOP_CQE32 as a flag for a NOP
command, in which case a 32b CQE will be posted rather than a regular
one. This is the default if the ring has been set up with
IORING_SETUP_CQE32. If the ring has been set up with
IORING_SETUP_CQE_MIXED, then 16b CQEs are posted when this flag isn't
set, and 32b CQEs when it is. For the latter case, sqe->off is
what will be posted as cqe->big_cqe[0] and sqe->addr is what will be
posted as cqe->big_cqe[1].
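
As a rough submission sketch (hypothetical values, and assuming the
sqe->nop_flags field that the other NOP flags use):

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	io_uring_prep_nop(sqe);
	sqe->nop_flags = IORING_NOP_CQE32;	/* ask for a 32b CQE */
	sqe->off = 0x1111;		/* ends up in cqe->big_cqe[0] */
	sqe->addr = 0x2222;		/* ends up in cqe->big_cqe[1] */
	io_uring_submit(&ring);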

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/uapi/linux/io_uring.h |  1 +
 io_uring/nop.c                | 17 +++++++++++++++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 9396afb01dc8..33b386f43d47 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -460,6 +460,7 @@ enum io_uring_msg_ring_flags {
 #define IORING_NOP_FIXED_FILE		(1U << 2)
 #define IORING_NOP_FIXED_BUFFER		(1U << 3)
 #define IORING_NOP_TW			(1U << 4)
+#define IORING_NOP_CQE32		(1U << 5)
 
 /*
  * IO completion data structure (Completion Queue Entry)
diff --git a/io_uring/nop.c b/io_uring/nop.c
index 20ed0f85b1c2..3caf07878f8a 100644
--- a/io_uring/nop.c
+++ b/io_uring/nop.c
@@ -17,11 +17,13 @@ struct io_nop {
 	int             result;
 	int		fd;
 	unsigned int	flags;
+	__u64		extra1;
+	__u64		extra2;
 };
 
 #define NOP_FLAGS	(IORING_NOP_INJECT_RESULT | IORING_NOP_FIXED_FILE | \
 			 IORING_NOP_FIXED_BUFFER | IORING_NOP_FILE | \
-			 IORING_NOP_TW)
+			 IORING_NOP_TW | IORING_NOP_CQE32)
 
 int io_nop_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
@@ -41,6 +43,14 @@ int io_nop_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		nop->fd = -1;
 	if (nop->flags & IORING_NOP_FIXED_BUFFER)
 		req->buf_index = READ_ONCE(sqe->buf_index);
+	if (nop->flags & IORING_NOP_CQE32) {
+		struct io_ring_ctx *ctx = req->ctx;
+
+		if (!(ctx->flags & (IORING_SETUP_CQE32|IORING_SETUP_CQE_MIXED)))
+			return -EINVAL;
+		nop->extra1 = READ_ONCE(sqe->off);
+		nop->extra2 = READ_ONCE(sqe->addr);
+	}
 	return 0;
 }
 
@@ -68,7 +78,10 @@ int io_nop(struct io_kiocb *req, unsigned int issue_flags)
 done:
 	if (ret < 0)
 		req_set_fail(req);
-	io_req_set_res(req, nop->result, 0);
+	if (nop->flags & IORING_NOP_CQE32)
+		io_req_set_res32(req, nop->result, 0, nop->extra1, nop->extra2);
+	else
+		io_req_set_res(req, nop->result, 0);
 	if (nop->flags & IORING_NOP_TW) {
 		req->io_task_work.func = io_req_task_complete;
 		io_req_task_work_add(req);
-- 
2.50.1



* [PATCH 7/8] io_uring/uring_cmd: add support for IORING_SETUP_CQE_MIXED
  2025-08-08 17:03 [PATCHSET RFC 0/8] Add support for mixed sized CQEs Jens Axboe
                   ` (5 preceding siblings ...)
  2025-08-08 17:03 ` [PATCH 6/8] io_uring/nop: " Jens Axboe
@ 2025-08-08 17:03 ` Jens Axboe
  2025-08-08 17:03 ` [PATCH 8/8] io_uring/zcrx: " Jens Axboe
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-08-08 17:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Certain users of uring_cmd currently require fixed 32b CQE support,
which is propagated through IO_URING_F_CQE32. Allow
IORING_SETUP_CQE_MIXED to cover that case as well, so not all CQEs
posted need to be 32b in size.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/cmd_net.c   | 3 ++-
 io_uring/uring_cmd.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/io_uring/cmd_net.c b/io_uring/cmd_net.c
index 3866fe6ff541..27a09aa4c9d0 100644
--- a/io_uring/cmd_net.c
+++ b/io_uring/cmd_net.c
@@ -4,6 +4,7 @@
 #include <net/sock.h>
 
 #include "uring_cmd.h"
+#include "io_uring.h"
 
 static inline int io_uring_cmd_getsockopt(struct socket *sock,
 					  struct io_uring_cmd *cmd,
@@ -73,7 +74,7 @@ static bool io_process_timestamp_skb(struct io_uring_cmd *cmd, struct sock *sk,
 
 	cqe->user_data = 0;
 	cqe->res = tskey;
-	cqe->flags = IORING_CQE_F_MORE;
+	cqe->flags = IORING_CQE_F_MORE | ctx_cqe32_flags(cmd_to_io_kiocb(cmd)->ctx);
 	cqe->flags |= tstype << IORING_TIMESTAMP_TYPE_SHIFT;
 	if (ret == SOF_TIMESTAMPING_TX_HARDWARE)
 		cqe->flags |= IORING_CQE_F_TSTAMP_HW;
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 053bac89b6c0..450d5be260a2 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -234,7 +234,7 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
 
 	if (ctx->flags & IORING_SETUP_SQE128)
 		issue_flags |= IO_URING_F_SQE128;
-	if (ctx->flags & IORING_SETUP_CQE32)
+	if (ctx->flags & (IORING_SETUP_CQE32 | IORING_SETUP_CQE_MIXED))
 		issue_flags |= IO_URING_F_CQE32;
 	if (io_is_compat(ctx))
 		issue_flags |= IO_URING_F_COMPAT;
-- 
2.50.1



* [PATCH 8/8] io_uring/zcrx: add support for IORING_SETUP_CQE_MIXED
  2025-08-08 17:03 [PATCHSET RFC 0/8] Add support for mixed sized CQEs Jens Axboe
                   ` (6 preceding siblings ...)
  2025-08-08 17:03 ` [PATCH 7/8] io_uring/uring_cmd: " Jens Axboe
@ 2025-08-08 17:03 ` Jens Axboe
  7 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-08-08 17:03 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

zcrx currently requires the ring to be set up with fixed 32b CQEs;
allow it to use IORING_SETUP_CQE_MIXED as well.
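
With this change, a zcrx ring setup could hypothetically look like the
following sketch (other registration steps elided):

	struct io_uring_params p = { };

	p.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
		  IORING_SETUP_CQE_MIXED;	/* instead of IORING_SETUP_CQE32 */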

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 Documentation/networking/iou-zcrx.rst | 2 +-
 io_uring/zcrx.c                       | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst
index 0127319b30bb..54a72e172bdc 100644
--- a/Documentation/networking/iou-zcrx.rst
+++ b/Documentation/networking/iou-zcrx.rst
@@ -75,7 +75,7 @@ Create an io_uring instance with the following required setup flags::
 
   IORING_SETUP_SINGLE_ISSUER
   IORING_SETUP_DEFER_TASKRUN
-  IORING_SETUP_CQE32
+  IORING_SETUP_CQE32 or IORING_SETUP_CQE_MIXED
 
 Create memory area
 ------------------
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index e5ff49f3425e..f1da852c496b 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -554,8 +554,9 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 		return -EPERM;
 
 	/* mandatory io_uring features for zc rx */
-	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN &&
-	      ctx->flags & IORING_SETUP_CQE32))
+	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+		return -EINVAL;
+	if (!(ctx->flags & (IORING_SETUP_CQE32|IORING_SETUP_CQE_MIXED)))
 		return -EINVAL;
 	if (copy_from_user(&reg, arg, sizeof(reg)))
 		return -EFAULT;
-- 
2.50.1



* [PATCH 1/8] io_uring: remove io_ctx_cqe32() helper
  2025-08-21 14:18 [PATCHSET v2 0/8] Add support for mixed sized CQEs Jens Axboe
@ 2025-08-21 14:18 ` Jens Axboe
  0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2025-08-21 14:18 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

It's pretty pointless and only used by the tracing helper; get rid
of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h  | 6 ------
 include/trace/events/io_uring.h | 4 ++--
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 9c6c548f43f5..d1e25f3fe0b3 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -740,10 +740,4 @@ struct io_overflow_cqe {
 	struct list_head list;
 	struct io_uring_cqe cqe;
 };
-
-static inline bool io_ctx_cqe32(struct io_ring_ctx *ctx)
-{
-	return ctx->flags & IORING_SETUP_CQE32;
-}
-
 #endif
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 178ab6f611be..6a970625a3ea 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -340,8 +340,8 @@ TP_PROTO(struct io_ring_ctx *ctx, void *req, struct io_uring_cqe *cqe),
 		__entry->user_data	= cqe->user_data;
 		__entry->res		= cqe->res;
 		__entry->cflags		= cqe->flags;
-		__entry->extra1		= io_ctx_cqe32(ctx) ? cqe->big_cqe[0] : 0;
-		__entry->extra2		= io_ctx_cqe32(ctx) ? cqe->big_cqe[1] : 0;
+		__entry->extra1		= ctx->flags & IORING_SETUP_CQE32 ? cqe->big_cqe[0] : 0;
+		__entry->extra2		= ctx->flags & IORING_SETUP_CQE32 ? cqe->big_cqe[1] : 0;
 	),
 
 	TP_printk("ring %p, req %p, user_data 0x%llx, result %d, cflags 0x%x "
-- 
2.50.1


