public inbox for io-uring@vger.kernel.org
* [PATCHv3 0/3]
@ 2025-09-24 15:12 Keith Busch
  2025-09-24 15:12 ` [PATCHv3 1/3] Add support IORING_SETUP_SQE_MIXED Keith Busch
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Keith Busch @ 2025-09-24 15:12 UTC (permalink / raw)
  To: io-uring; +Cc: axboe, csander, ming.lei, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Previous version:

  https://lore.kernel.org/io-uring/20250904192716.3064736-1-kbusch@meta.com/

The CQ supports mixed size entries, so why not for SQs too? There are
use cases that currently allocate different queues just to keep these
things separated, but we can efficiently handle both cases in a single
ring.
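
For illustration only, a minimal usage sketch (not part of the patches;
it assumes the liburing additions from this series: IORING_SETUP_SQE_MIXED,
io_uring_get_sqe128_mixed() and io_uring_prep_nop128()):

#include "liburing.h"

static int mixed_nop_example(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret, i;

	ret = io_uring_queue_init(8, &ring, IORING_SETUP_SQE_MIXED);
	if (ret)
		return ret;	/* e.g. -EINVAL on kernels without the flag */

	/* normal 64b entry */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_nop(sqe);
	sqe->user_data = 1;

	/* big 128b entry; the helper inserts the pad nop if it would wrap */
	sqe = io_uring_get_sqe128_mixed(&ring);
	io_uring_prep_nop128(sqe);
	sqe->user_data = 2;

	ret = io_uring_submit(&ring);
	for (i = 0; ret > 0 && i < 2; i++) {
		if (io_uring_wait_cqe(&ring, &cqe))
			break;
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return ret < 0 ? ret : 0;
}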

Changes since v2:

 - Define 128B opcodes to be used on mixed SQs. This is done instead of
   using the last SQE flags bit to generically identify a command as
   such. The new opcodes are valid only on a mixed SQ.

 - Fixed up the accounting of sqes left to dispatch. The big sqes on a
   mixed sq count for two entries, so previously would have fetched too
   many.

 - liburing won't bother submitting the nop-skip for the wrap-around
   condition if there are not enough free entries for the big-sqe.

kernel:

Keith Busch (1):
  io_uring: add support for IORING_SETUP_SQE_MIXED

 include/uapi/linux/io_uring.h |  8 ++++++++
 io_uring/fdinfo.c             | 34 +++++++++++++++++++++++++++-------
 io_uring/io_uring.c           | 27 +++++++++++++++++++++++----
 io_uring/io_uring.h           |  8 +++++---
 io_uring/opdef.c              | 26 ++++++++++++++++++++++++++
 io_uring/opdef.h              |  2 ++
 io_uring/register.c           |  2 +-
 io_uring/uring_cmd.c          | 12 ++++++++++--
 io_uring/uring_cmd.h          |  1 +
 9 files changed, 103 insertions(+), 17 deletions(-)

liburing:

Keith Busch (3):
  Add support IORING_SETUP_SQE_MIXED
  Add nop testing for IORING_SETUP_SQE_MIXED
  Add mixed sqe test for uring commands

 src/include/liburing.h          |  50 +++++++++++
 src/include/liburing/io_uring.h |  11 +++
 test/Makefile                   |   3 +
 test/sqe-mixed-bad-wrap.c       |  89 ++++++++++++++++++++
 test/sqe-mixed-nop.c            |  82 ++++++++++++++++++
 test/sqe-mixed-uring_cmd.c      | 142 ++++++++++++++++++++++++++++++++
 6 files changed, 377 insertions(+)
 create mode 100644 test/sqe-mixed-bad-wrap.c
 create mode 100644 test/sqe-mixed-nop.c
 create mode 100644 test/sqe-mixed-uring_cmd.c

-- 
2.47.3


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCHv3 1/3] Add support IORING_SETUP_SQE_MIXED
  2025-09-24 15:12 [PATCHv3 0/3] Keith Busch
@ 2025-09-24 15:12 ` Keith Busch
  2025-09-24 20:20   ` Caleb Sander Mateos
  2025-09-24 15:12 ` [PATCHv3 1/1] io_uring: add support for IORING_SETUP_SQE_MIXED Keith Busch
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2025-09-24 15:12 UTC (permalink / raw)
  To: io-uring; +Cc: axboe, csander, ming.lei, Keith Busch

From: Keith Busch <kbusch@kernel.org>

This adds core support for mixed sized SQEs in the same SQ ring. Before
this, SQEs were either 64b in size (the normal size), or 128b if
IORING_SETUP_SQE128 was set in the ring initialization. With the mixed
support, an SQE may be either 64b or 128b on the same SQ ring. If the
SQE is 128b in size, then a 128b opcode will be set in the sqe op. When
acquiring a large sqe at the end of the sq, the client may post a NOP
SQE with IOSQE_CQE_SKIP_SUCCESS set that the kernel should simply ignore
as it's just a pad filler that is posted when required.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 src/include/liburing.h          | 50 +++++++++++++++++++++++++++++++++
 src/include/liburing/io_uring.h | 11 ++++++++
 2 files changed, 61 insertions(+)

diff --git a/src/include/liburing.h b/src/include/liburing.h
index 052d6b56..66f1b990 100644
--- a/src/include/liburing.h
+++ b/src/include/liburing.h
@@ -575,6 +575,7 @@ IOURINGINLINE void io_uring_initialize_sqe(struct io_uring_sqe *sqe)
 	sqe->buf_index = 0;
 	sqe->personality = 0;
 	sqe->file_index = 0;
+	sqe->addr2 = 0;
 	sqe->addr3 = 0;
 	sqe->__pad2[0] = 0;
 }
@@ -799,6 +800,12 @@ IOURINGINLINE void io_uring_prep_nop(struct io_uring_sqe *sqe)
 	io_uring_prep_rw(IORING_OP_NOP, sqe, -1, NULL, 0, 0);
 }
 
+IOURINGINLINE void io_uring_prep_nop128(struct io_uring_sqe *sqe)
+	LIBURING_NOEXCEPT
+{
+	io_uring_prep_rw(IORING_OP_NOP128, sqe, -1, NULL, 0, 0);
+}
+
 IOURINGINLINE void io_uring_prep_timeout(struct io_uring_sqe *sqe,
 					 struct __kernel_timespec *ts,
 					 unsigned count, unsigned flags)
@@ -1882,6 +1889,49 @@ IOURINGINLINE struct io_uring_sqe *_io_uring_get_sqe(struct io_uring *ring)
 	return sqe;
 }
 
+/*
+ * Return a 128B sqe to fill. Applications must later call io_uring_submit()
+ * when it's ready to tell the kernel about it. The caller may call this
+ * function multiple times before calling io_uring_submit().
+ *
+ * Returns a vacant 128B sqe, or NULL if we're full. If the current tail is the
+ * last entry in the ring, this function will insert a nop + skip complete such
+ * that the 128b entry wraps back to the beginning of the queue for a
+ * contiguous big sq entry. It's up to the caller to use a 128b opcode in order
+ * for the kernel to know how to advance its sq head pointer.
+ */
+IOURINGINLINE struct io_uring_sqe *io_uring_get_sqe128_mixed(struct io_uring *ring)
+	LIBURING_NOEXCEPT
+{
+	struct io_uring_sq *sq = &ring->sq;
+	unsigned head = io_uring_load_sq_head(ring), tail = sq->sqe_tail;
+	struct io_uring_sqe *sqe;
+
+	if (!(ring->flags & IORING_SETUP_SQE_MIXED))
+		return NULL;
+
+	if (((tail + 1) & sq->ring_mask) == 0) {
+		if ((tail + 2) - head >= sq->ring_entries)
+			return NULL;
+
+		sqe = _io_uring_get_sqe(ring);
+		if (!sqe)
+			return NULL;
+
+		io_uring_prep_nop(sqe);
+		sqe->flags |= IOSQE_CQE_SKIP_SUCCESS;
+		tail = sq->sqe_tail;
+	} else if ((tail + 1) - head >= sq->ring_entries) {
+		return NULL;
+	}
+
+	sqe = &sq->sqes[tail & sq->ring_mask];
+	sq->sqe_tail = tail + 2;
+	io_uring_initialize_sqe(sqe);
+
+	return sqe;
+}
+
 /*
  * Return the appropriate mask for a buffer ring of size 'ring_entries'
  */
diff --git a/src/include/liburing/io_uring.h b/src/include/liburing/io_uring.h
index 31396057..1e0b6398 100644
--- a/src/include/liburing/io_uring.h
+++ b/src/include/liburing/io_uring.h
@@ -126,6 +126,7 @@ enum io_uring_sqe_flags_bit {
 	IOSQE_ASYNC_BIT,
 	IOSQE_BUFFER_SELECT_BIT,
 	IOSQE_CQE_SKIP_SUCCESS_BIT,
+	IOSQE_SQE_128B_BIT,
 };
 
 /*
@@ -145,6 +146,8 @@ enum io_uring_sqe_flags_bit {
 #define IOSQE_BUFFER_SELECT	(1U << IOSQE_BUFFER_SELECT_BIT)
 /* don't post CQE if request succeeded */
 #define IOSQE_CQE_SKIP_SUCCESS	(1U << IOSQE_CQE_SKIP_SUCCESS_BIT)
+/* this is a 128b/big-sqe posting */
+#define IOSQE_SQE_128B          (1U << IOSQE_SQE_128B_BIT)
 
 /*
  * io_uring_setup() flags
@@ -211,6 +214,12 @@ enum io_uring_sqe_flags_bit {
  */
 #define IORING_SETUP_CQE_MIXED		(1U << 18)
 
+/*
+ *  Allow both 64b and 128b SQEs. If a 128b SQE is posted, it will have
+ *  IOSQE_SQE_128B set in sqe->flags.
+ */
+#define IORING_SETUP_SQE_MIXED		(1U << 19)
+
 enum io_uring_op {
 	IORING_OP_NOP,
 	IORING_OP_READV,
@@ -275,6 +284,8 @@ enum io_uring_op {
 	IORING_OP_READV_FIXED,
 	IORING_OP_WRITEV_FIXED,
 	IORING_OP_PIPE,
+	IORING_OP_NOP128,
+	IORING_OP_URING_CMD128,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCHv3 1/1] io_uring: add support for IORING_SETUP_SQE_MIXED
  2025-09-24 15:12 [PATCHv3 0/3] Keith Busch
  2025-09-24 15:12 ` [PATCHv3 1/3] Add support IORING_SETUP_SQE_MIXED Keith Busch
@ 2025-09-24 15:12 ` Keith Busch
  2025-09-25 15:03   ` Jens Axboe
  2025-09-24 15:12 ` [PATCHv3 2/3] Add nop testing " Keith Busch
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2025-09-24 15:12 UTC (permalink / raw)
  To: io-uring; +Cc: axboe, csander, ming.lei, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Normal rings support 64b SQEs for posting submissions, while certain
features require the ring to be configured with IORING_SETUP_SQE128, as
they need to convey more information per submission. This, in turn,
makes ALL the SQEs be 128b in size. This is somewhat wasteful and
inefficient, particularly when only certain SQEs need to be of the
bigger variant.

This adds support for setting up a ring with mixed SQE sizes, using
IORING_SETUP_SQE_MIXED. When set up in this mode, SQEs posted to the ring
may be either 64b or 128b in size. If an SQE is 128b in size, its opcode
will be set to a 128b variant to indicate that this is the case. Any other
non-128b opcode will assume the SQ's default size.

SQEs on these types of mixed rings may also utilize NOP with skip
success set.  This can happen if the ring is one (small) SQE entry away
from wrapping, and an attempt is made to post a 128b SQE. As SQEs must be
contiguous in the SQ ring, a 128b SQE cannot wrap the ring. For this
case, a single NOP SQE should be posted with the SKIP_SUCCESS flag set.
The kernel should simply ignore those.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 include/uapi/linux/io_uring.h |  8 ++++++++
 io_uring/fdinfo.c             | 34 +++++++++++++++++++++++++++-------
 io_uring/io_uring.c           | 27 +++++++++++++++++++++++----
 io_uring/io_uring.h           |  8 +++++---
 io_uring/opdef.c              | 26 ++++++++++++++++++++++++++
 io_uring/opdef.h              |  2 ++
 io_uring/register.c           |  2 +-
 io_uring/uring_cmd.c          | 12 ++++++++++--
 io_uring/uring_cmd.h          |  1 +
 9 files changed, 103 insertions(+), 17 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index a0cc1cc0dd015..ce01f043fceec 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -231,6 +231,12 @@ enum io_uring_sqe_flags_bit {
  */
 #define IORING_SETUP_CQE_MIXED		(1U << 18)
 
+/*
+ * Allow both 64b and 128b SQEs. If a 128b SQE is posted, it will have
+ * a 128b opcode.
+ */
+#define IORING_SETUP_SQE_MIXED		(1U << 19)
+
 enum io_uring_op {
 	IORING_OP_NOP,
 	IORING_OP_READV,
@@ -295,6 +301,8 @@ enum io_uring_op {
 	IORING_OP_READV_FIXED,
 	IORING_OP_WRITEV_FIXED,
 	IORING_OP_PIPE,
+	IORING_OP_NOP128,
+	IORING_OP_URING_CMD128,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index ff3364531c77b..d14d2e983b623 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -14,6 +14,7 @@
 #include "fdinfo.h"
 #include "cancel.h"
 #include "rsrc.h"
+#include "opdef.h"
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
 static __cold void common_tracking_show_fdinfo(struct io_ring_ctx *ctx,
@@ -66,7 +67,6 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 	unsigned int cq_head = READ_ONCE(r->cq.head);
 	unsigned int cq_tail = READ_ONCE(r->cq.tail);
 	unsigned int sq_shift = 0;
-	unsigned int sq_entries;
 	int sq_pid = -1, sq_cpu = -1;
 	u64 sq_total_time = 0, sq_work_time = 0;
 	unsigned int i;
@@ -89,26 +89,45 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 	seq_printf(m, "CqTail:\t%u\n", cq_tail);
 	seq_printf(m, "CachedCqTail:\t%u\n", data_race(ctx->cached_cq_tail));
 	seq_printf(m, "SQEs:\t%u\n", sq_tail - sq_head);
-	sq_entries = min(sq_tail - sq_head, ctx->sq_entries);
-	for (i = 0; i < sq_entries; i++) {
-		unsigned int entry = i + sq_head;
+	while (sq_head < sq_tail) {
 		struct io_uring_sqe *sqe;
 		unsigned int sq_idx;
+		bool sqe128 = false;
+		u8 opcode;
 
 		if (ctx->flags & IORING_SETUP_NO_SQARRAY)
 			break;
-		sq_idx = READ_ONCE(ctx->sq_array[entry & sq_mask]);
+		sq_idx = READ_ONCE(ctx->sq_array[sq_head & sq_mask]);
 		if (sq_idx > sq_mask)
 			continue;
+
 		sqe = &ctx->sq_sqes[sq_idx << sq_shift];
+		opcode = READ_ONCE(sqe->opcode);
+		if (sq_shift)
+			sqe128 = true;
+		else if (io_issue_defs[opcode].is_128) {
+			if (!(ctx->flags & IORING_SETUP_SQE_MIXED)) {
+				seq_printf(m,
+					"%5u: invalid sqe, 128B entry on non-mixed sq\n",
+					sq_idx);
+				break;
+			}
+			if ((++sq_head & sq_mask) == 0) {
+				seq_printf(m,
+					"%5u: corrupted sqe, wrapping 128B entry\n",
+					sq_idx);
+				break;
+			}
+			sqe128 = true;
+		}
 		seq_printf(m, "%5u: opcode:%s, fd:%d, flags:%x, off:%llu, "
 			      "addr:0x%llx, rw_flags:0x%x, buf_index:%d "
 			      "user_data:%llu",
-			   sq_idx, io_uring_get_opcode(sqe->opcode), sqe->fd,
+			   sq_idx, io_uring_get_opcode(opcode), sqe->fd,
 			   sqe->flags, (unsigned long long) sqe->off,
 			   (unsigned long long) sqe->addr, sqe->rw_flags,
 			   sqe->buf_index, sqe->user_data);
-		if (sq_shift) {
+		if (sqe128) {
 			u64 *sqeb = (void *) (sqe + 1);
 			int size = sizeof(struct io_uring_sqe) / sizeof(u64);
 			int j;
@@ -120,6 +139,7 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 			}
 		}
 		seq_printf(m, "\n");
+		sq_head++;
 	}
 	seq_printf(m, "CQEs:\t%u\n", cq_tail - cq_head);
 	while (cq_head < cq_tail) {
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 1bfa124565f71..f9bc442bb4188 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2153,7 +2153,7 @@ static __cold int io_init_fail_req(struct io_kiocb *req, int err)
 }
 
 static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
-		       const struct io_uring_sqe *sqe)
+		       const struct io_uring_sqe *sqe, unsigned int *left)
 	__must_hold(&ctx->uring_lock)
 {
 	const struct io_issue_def *def;
@@ -2179,6 +2179,14 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	opcode = array_index_nospec(opcode, IORING_OP_LAST);
 
 	def = &io_issue_defs[opcode];
+	if (def->is_128) {
+		if (!(ctx->flags & IORING_SETUP_SQE_MIXED) || *left < 2 ||
+		    (ctx->cached_sq_head & (ctx->sq_entries - 1)) == 0)
+			return io_init_fail_req(req, -EINVAL);
+		ctx->cached_sq_head++;
+		(*left)--;
+	}
+
 	if (unlikely(sqe_flags & ~SQE_COMMON_FLAGS)) {
 		/* enforce forwards compatibility on users */
 		if (sqe_flags & ~SQE_VALID_FLAGS)
@@ -2288,13 +2296,13 @@ static __cold int io_submit_fail_init(const struct io_uring_sqe *sqe,
 }
 
 static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
-			 const struct io_uring_sqe *sqe)
+			 const struct io_uring_sqe *sqe, unsigned int *left)
 	__must_hold(&ctx->uring_lock)
 {
 	struct io_submit_link *link = &ctx->submit_state.link;
 	int ret;
 
-	ret = io_init_req(ctx, req, sqe);
+	ret = io_init_req(ctx, req, sqe, left);
 	if (unlikely(ret))
 		return io_submit_fail_init(sqe, req, ret);
 
@@ -2446,7 +2454,7 @@ int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr)
 		 * Continue submitting even for sqe failure if the
 		 * ring was setup with IORING_SETUP_SUBMIT_ALL
 		 */
-		if (unlikely(io_submit_sqe(ctx, req, sqe)) &&
+		if (unlikely(io_submit_sqe(ctx, req, sqe, &left)) &&
 		    !(ctx->flags & IORING_SETUP_SUBMIT_ALL)) {
 			left--;
 			break;
@@ -2791,6 +2799,10 @@ unsigned long rings_size(unsigned int flags, unsigned int sq_entries,
 		if (cq_entries < 2)
 			return SIZE_MAX;
 	}
+	if (flags & IORING_SETUP_SQE_MIXED) {
+		if (sq_entries < 2)
+			return SIZE_MAX;
+	}
 
 #ifdef CONFIG_SMP
 	off = ALIGN(off, SMP_CACHE_BYTES);
@@ -3717,6 +3729,13 @@ static int io_uring_sanitise_params(struct io_uring_params *p)
 	if ((flags & (IORING_SETUP_CQE32|IORING_SETUP_CQE_MIXED)) ==
 	    (IORING_SETUP_CQE32|IORING_SETUP_CQE_MIXED))
 		return -EINVAL;
+	/*
+	 * Nonsensical to ask for SQE128 and mixed SQE support, it's not
+	 * supported to post 64b SQEs on a ring setup with SQE128.
+	 */
+	if ((flags & (IORING_SETUP_SQE128|IORING_SETUP_SQE_MIXED)) ==
+	    (IORING_SETUP_SQE128|IORING_SETUP_SQE_MIXED))
+		return -EINVAL;
 
 	return 0;
 }
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index af4c113106523..074908d5884e4 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -54,7 +54,8 @@
 			IORING_SETUP_REGISTERED_FD_ONLY |\
 			IORING_SETUP_NO_SQARRAY |\
 			IORING_SETUP_HYBRID_IOPOLL |\
-			IORING_SETUP_CQE_MIXED)
+			IORING_SETUP_CQE_MIXED |\
+			IORING_SETUP_SQE_MIXED)
 
 #define IORING_ENTER_FLAGS (IORING_ENTER_GETEVENTS |\
 			IORING_ENTER_SQ_WAKEUP |\
@@ -582,9 +583,10 @@ static inline void io_req_queue_tw_complete(struct io_kiocb *req, s32 res)
  * IORING_SETUP_SQE128 contexts allocate twice the normal SQE size for each
  * slot.
  */
-static inline size_t uring_sqe_size(struct io_ring_ctx *ctx)
+static inline size_t uring_sqe_size(struct io_kiocb *req)
 {
-	if (ctx->flags & IORING_SETUP_SQE128)
+	if (req->ctx->flags & IORING_SETUP_SQE128 ||
+	    req->opcode == IORING_OP_URING_CMD128)
 		return 2 * sizeof(struct io_uring_sqe);
 	return sizeof(struct io_uring_sqe);
 }
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 932319633eac2..36feebcce3827 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -575,6 +575,24 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_pipe_prep,
 		.issue			= io_pipe,
 	},
+	[IORING_OP_NOP128] = {
+		.audit_skip		= 1,
+		.iopoll			= 1,
+		.is_128			= 1,
+		.prep			= io_nop_prep,
+		.issue			= io_nop,
+	},
+	[IORING_OP_URING_CMD128] = {
+		.buffer_select		= 1,
+		.needs_file		= 1,
+		.plug			= 1,
+		.iopoll			= 1,
+		.iopoll_queue		= 1,
+		.is_128			= 1,
+		.async_size		= sizeof(struct io_async_cmd),
+		.prep			= io_uring_cmd128_prep,
+		.issue			= io_uring_cmd,
+	},
 };
 
 const struct io_cold_def io_cold_defs[] = {
@@ -825,6 +843,14 @@ const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_PIPE] = {
 		.name			= "PIPE",
 	},
+	[IORING_OP_NOP128] = {
+		.name			= "NOP128",
+	},
+	[IORING_OP_URING_CMD128] = {
+		.name			= "URING_CMD128",
+		.sqe_copy		= io_uring_cmd_sqe_copy,
+		.cleanup		= io_uring_cmd_cleanup,
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/opdef.h b/io_uring/opdef.h
index c2f0907ed78cc..aa37846880ffd 100644
--- a/io_uring/opdef.h
+++ b/io_uring/opdef.h
@@ -27,6 +27,8 @@ struct io_issue_def {
 	unsigned		iopoll_queue : 1;
 	/* vectored opcode, set if 1) vectored, and 2) handler needs to know */
 	unsigned		vectored : 1;
+	/* set to 1 if this opcode uses 128b sqes in a mixed sq */
+	unsigned		is_128 : 1;
 
 	/* size of async data needed, if any */
 	unsigned short		async_size;
diff --git a/io_uring/register.c b/io_uring/register.c
index 43f04c47522c0..e97d9cbba7111 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -395,7 +395,7 @@ static void io_register_free_rings(struct io_ring_ctx *ctx,
 #define RESIZE_FLAGS	(IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP)
 #define COPY_FLAGS	(IORING_SETUP_NO_SQARRAY | IORING_SETUP_SQE128 | \
 			 IORING_SETUP_CQE32 | IORING_SETUP_NO_MMAP | \
-			 IORING_SETUP_CQE_MIXED)
+			 IORING_SETUP_CQE_MIXED | IORING_SETUP_SQE_MIXED)
 
 static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
 {
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index a688c9f1a21cd..5fa3c260bc142 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -216,6 +216,13 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	return 0;
 }
 
+int io_uring_cmd128_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	if (!(req->ctx->flags & IORING_SETUP_SQE_MIXED))
+		return -EINVAL;
+	return io_uring_cmd_prep(req, sqe);
+}
+
 void io_uring_cmd_sqe_copy(struct io_kiocb *req)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
@@ -224,7 +231,7 @@ void io_uring_cmd_sqe_copy(struct io_kiocb *req)
 	/* Should not happen, as REQ_F_SQE_COPIED covers this */
 	if (WARN_ON_ONCE(ioucmd->sqe == ac->sqes))
 		return;
-	memcpy(ac->sqes, ioucmd->sqe, uring_sqe_size(req->ctx));
+	memcpy(ac->sqes, ioucmd->sqe, uring_sqe_size(req));
 	ioucmd->sqe = ac->sqes;
 }
 
@@ -242,7 +249,8 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
 	if (ret)
 		return ret;
 
-	if (ctx->flags & IORING_SETUP_SQE128)
+	if (ctx->flags & IORING_SETUP_SQE128 ||
+	    req->opcode == IORING_OP_URING_CMD128)
 		issue_flags |= IO_URING_F_SQE128;
 	if (ctx->flags & (IORING_SETUP_CQE32 | IORING_SETUP_CQE_MIXED))
 		issue_flags |= IO_URING_F_CQE32;
diff --git a/io_uring/uring_cmd.h b/io_uring/uring_cmd.h
index 041aef8a8aa3f..0d6068fba7d0d 100644
--- a/io_uring/uring_cmd.h
+++ b/io_uring/uring_cmd.h
@@ -10,6 +10,7 @@ struct io_async_cmd {
 
 int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags);
 int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_uring_cmd128_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 void io_uring_cmd_sqe_copy(struct io_kiocb *req);
 void io_uring_cmd_cleanup(struct io_kiocb *req);
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCHv3 2/3] Add nop testing for IORING_SETUP_SQE_MIXED
  2025-09-24 15:12 [PATCHv3 0/3] Keith Busch
  2025-09-24 15:12 ` [PATCHv3 1/3] Add support IORING_SETUP_SQE_MIXED Keith Busch
  2025-09-24 15:12 ` [PATCHv3 1/1] io_uring: add support for IORING_SETUP_SQE_MIXED Keith Busch
@ 2025-09-24 15:12 ` Keith Busch
  2025-09-24 15:12 ` [PATCHv3 3/3] Add mixed sqe test for uring commands Keith Busch
  2025-09-24 15:54 ` [PATCHv3 0/3] io_uring: mixed submission queue size support Keith Busch
  4 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2025-09-24 15:12 UTC (permalink / raw)
  To: io-uring; +Cc: axboe, csander, ming.lei, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Test mixing 64 and 128 byte sqe entries on a queue.

Insert a bad 128b operation at the end of a mixed sq to test the
kernel's invalid entry detection.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 test/Makefile              |  2 +
 test/sqe-mixed-bad-wrap.c  | 89 ++++++++++++++++++++++++++++++++++++++
 test/sqe-mixed-nop.c       | 82 +++++++++++++++++++++++++++++++++++
 test/sqe-mixed-uring_cmd.c |  0
 4 files changed, 173 insertions(+)
 create mode 100644 test/sqe-mixed-bad-wrap.c
 create mode 100644 test/sqe-mixed-nop.c
 create mode 100644 test/sqe-mixed-uring_cmd.c

diff --git a/test/Makefile b/test/Makefile
index 64d67a1e..2c250c81 100644
--- a/test/Makefile
+++ b/test/Makefile
@@ -234,6 +234,8 @@ test_srcs := \
 	sq-poll-share.c \
 	sqpoll-sleep.c \
 	sq-space_left.c \
+	sqe-mixed-nop.c \
+	sqe-mixed-bad-wrap.c \
 	sqwait.c \
 	stdout.c \
 	submit-and-wait.c \
diff --git a/test/sqe-mixed-bad-wrap.c b/test/sqe-mixed-bad-wrap.c
new file mode 100644
index 00000000..61f711da
--- /dev/null
+++ b/test/sqe-mixed-bad-wrap.c
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Description: test an invalid 128b sqe at the ring wrap on a mixed sq
+ *
+ */
+#include <stdio.h>
+
+#include "liburing.h"
+#include "helpers.h"
+#include "test.h"
+
+static int seq;
+
+static int test_single_nop(struct io_uring *ring, bool should_fail)
+{
+	struct io_uring_cqe *cqe;
+	struct io_uring_sqe *sqe;
+	int ret;
+
+	sqe = io_uring_get_sqe(ring);
+	if (!sqe) {
+		fprintf(stderr, "get sqe failed\n");
+		return T_EXIT_FAIL;
+	}
+
+	io_uring_prep_nop(sqe);
+	sqe->user_data = ++seq;
+
+	if (should_fail)
+		io_uring_prep_nop128(sqe);
+	else
+		io_uring_prep_nop(sqe);
+
+	ret = io_uring_submit(ring);
+	if (ret <= 0) {
+		fprintf(stderr, "sqe submit failed: %d\n", ret);
+		return T_EXIT_FAIL;
+	}
+
+	ret = io_uring_wait_cqe(ring, &cqe);
+	if (ret < 0)
+		fprintf(stderr, "wait completion %d\n", ret);
+	else if (should_fail && cqe->res == 0)
+		fprintf(stderr, "Unexpected success\n");
+	else if (!should_fail && cqe->res != 0)
+		fprintf(stderr, "Completion error:%d\n", cqe->res);
+	else if (cqe->res == 0 && cqe->user_data != seq)
+		fprintf(stderr, "Unexpected user_data: %ld\n", (long) cqe->user_data);
+	else {
+		io_uring_cqe_seen(ring, cqe);
+		return T_EXIT_PASS;
+	}
+	return T_EXIT_FAIL;
+}
+
+int main(int argc, char *argv[])
+{
+	struct io_uring ring;
+	int ret, i;
+
+	if (argc > 1)
+		return T_EXIT_SKIP;
+
+	ret = io_uring_queue_init(8, &ring, IORING_SETUP_SQE_MIXED);
+	if (ret) {
+		if (ret == -EINVAL)
+			return T_EXIT_SKIP;
+		fprintf(stderr, "ring setup failed: %d\n", ret);
+		return T_EXIT_FAIL;
+	}
+
+	/* prime the sq to the last entry before wrapping */
+	for (i = 0; i < 7; i++) {
+		ret = test_single_nop(&ring, false);
+		if (ret != T_EXIT_PASS)
+			goto done;
+	}
+
+	/* inserting a 128b sqe in the last entry should fail */
+	ret = test_single_nop(&ring, true);
+	if (ret != T_EXIT_PASS)
+		goto done;
+
+	/* proceeding from the bad wrap should succeed */
+	ret = test_single_nop(&ring, false);
+done:
+	io_uring_queue_exit(&ring);
+	return ret;
+}
diff --git a/test/sqe-mixed-nop.c b/test/sqe-mixed-nop.c
new file mode 100644
index 00000000..88bd6ad2
--- /dev/null
+++ b/test/sqe-mixed-nop.c
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Description: alternate 64b and 128b nop sqes on a mixed sq
+ *
+ */
+#include <stdio.h>
+
+#include "liburing.h"
+#include "helpers.h"
+#include "test.h"
+
+static int seq;
+
+static int test_single_nop(struct io_uring *ring, bool sqe128)
+{
+	struct io_uring_cqe *cqe;
+	struct io_uring_sqe *sqe;
+	int ret;
+
+	if (sqe128)
+		sqe = io_uring_get_sqe128_mixed(ring);
+	else
+		sqe = io_uring_get_sqe(ring);
+
+	if (!sqe) {
+		fprintf(stderr, "get sqe failed\n");
+		return T_EXIT_FAIL;
+	}
+
+	if (sqe128)
+		io_uring_prep_nop128(sqe);
+	else
+		io_uring_prep_nop(sqe);
+
+	sqe->user_data = ++seq;
+
+	ret = io_uring_submit(ring);
+	if (ret <= 0) {
+		fprintf(stderr, "sqe submit failed: %d\n", ret);
+		return T_EXIT_FAIL;
+	}
+
+	ret = io_uring_wait_cqe(ring, &cqe);
+	if (ret < 0)
+		fprintf(stderr, "wait completion %d\n", ret);
+	else if (cqe->res != 0)
+		fprintf(stderr, "Completion error:%d\n", cqe->res);
+	else if (cqe->user_data != seq)
+		fprintf(stderr, "Unexpected user_data: %ld\n", (long) cqe->user_data);
+	else {
+		io_uring_cqe_seen(ring, cqe);
+		return T_EXIT_PASS;
+	}
+	return T_EXIT_FAIL;
+}
+
+int main(int argc, char *argv[])
+{
+	struct io_uring ring;
+	int ret, i;
+
+	if (argc > 1)
+		return T_EXIT_SKIP;
+
+	ret = io_uring_queue_init(8, &ring, IORING_SETUP_SQE_MIXED);
+	if (ret) {
+		if (ret == -EINVAL)
+			return T_EXIT_SKIP;
+		fprintf(stderr, "ring setup failed: %d\n", ret);
+		return T_EXIT_FAIL;
+	}
+
+	/* alternate big and little sqe's */
+	for (i = 0; i < 32; i++) {
+		ret = test_single_nop(&ring, i & 1);
+		if (ret != T_EXIT_PASS)
+			break;
+	}
+
+	io_uring_queue_exit(&ring);
+	return ret;
+}
diff --git a/test/sqe-mixed-uring_cmd.c b/test/sqe-mixed-uring_cmd.c
new file mode 100644
index 00000000..e69de29b
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCHv3 3/3] Add mixed sqe test for uring commands
  2025-09-24 15:12 [PATCHv3 0/3] Keith Busch
                   ` (2 preceding siblings ...)
  2025-09-24 15:12 ` [PATCHv3 2/3] Add nop testing " Keith Busch
@ 2025-09-24 15:12 ` Keith Busch
  2025-09-24 15:54 ` [PATCHv3 0/3] io_uring: mixed submission queue size support Keith Busch
  4 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2025-09-24 15:12 UTC (permalink / raw)
  To: io-uring; +Cc: axboe, csander, ming.lei, Keith Busch

From: Keith Busch <kbusch@kernel.org>

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 test/Makefile              |   1 +
 test/sqe-mixed-uring_cmd.c | 142 +++++++++++++++++++++++++++++++++++++
 2 files changed, 143 insertions(+)

diff --git a/test/Makefile b/test/Makefile
index 2c250c81..2b2e3967 100644
--- a/test/Makefile
+++ b/test/Makefile
@@ -236,6 +236,7 @@ test_srcs := \
 	sq-space_left.c \
 	sqe-mixed-nop.c \
 	sqe-mixed-bad-wrap.c \
+	sqe-mixed-uring_cmd.c \
 	sqwait.c \
 	stdout.c \
 	submit-and-wait.c \
diff --git a/test/sqe-mixed-uring_cmd.c b/test/sqe-mixed-uring_cmd.c
index e69de29b..4a6e7fd3 100644
--- a/test/sqe-mixed-uring_cmd.c
+++ b/test/sqe-mixed-uring_cmd.c
@@ -0,0 +1,142 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Description: mixed sqes utilizing basic nop and io_uring passthrough commands
+ */
+#include <errno.h>
+#include <stdio.h>
+#include <string.h>
+
+#include "helpers.h"
+#include "liburing.h"
+#include "nvme.h"
+
+#define len 0x1000
+static unsigned char buf[len];
+static int seq;
+
+static int test_single_nop(struct io_uring *ring)
+{
+	struct io_uring_cqe *cqe;
+	struct io_uring_sqe *sqe;
+	int ret;
+
+	sqe = io_uring_get_sqe(ring);
+	if (!sqe) {
+		fprintf(stderr, "get sqe failed\n");
+		return T_EXIT_FAIL;
+	}
+
+	io_uring_prep_nop(sqe);
+	sqe->user_data = ++seq;
+
+	ret = io_uring_submit(ring);
+	if (ret <= 0) {
+		fprintf(stderr, "sqe submit failed: %d\n", ret);
+		return T_EXIT_FAIL;
+	}
+
+	ret = io_uring_wait_cqe(ring, &cqe);
+	if (ret < 0)
+		fprintf(stderr, "wait completion %d\n", ret);
+	else if (cqe->user_data != seq)
+		fprintf(stderr, "Unexpected user_data: %ld\n", (long) cqe->user_data);
+	else {
+		io_uring_cqe_seen(ring, cqe);
+		return T_EXIT_PASS;
+	}
+	return T_EXIT_FAIL;
+}
+
+static int test_single_nvme_read(struct io_uring *ring, int fd)
+{
+	struct nvme_uring_cmd *cmd;
+	struct io_uring_cqe *cqe;
+	struct io_uring_sqe *sqe;
+	int ret;
+
+	sqe = io_uring_get_sqe128_mixed(ring);
+	if (!sqe) {
+		fprintf(stderr, "get sqe failed\n");
+		return T_EXIT_FAIL;
+	}
+
+	sqe->fd = fd;
+	sqe->user_data = ++seq;
+	sqe->opcode = IORING_OP_URING_CMD128;
+	sqe->cmd_op = NVME_URING_CMD_IO;
+
+	cmd = (struct nvme_uring_cmd *)sqe->cmd;
+	memset(cmd, 0, sizeof(struct nvme_uring_cmd));
+	cmd->opcode = nvme_cmd_read;
+	cmd->cdw12 = (len >> lba_shift) - 1;
+	cmd->addr = (__u64)(uintptr_t)buf;
+	cmd->data_len = len;
+	cmd->nsid = nsid;
+
+	ret = io_uring_submit(ring);
+	if (ret <= 0) {
+		fprintf(stderr, "sqe submit failed: %d\n", ret);
+		return T_EXIT_FAIL;
+	}
+
+	ret = io_uring_wait_cqe(ring, &cqe);
+	if (ret < 0)
+		fprintf(stderr, "wait completion %d\n", ret);
+	else if (cqe->res != 0)
+		fprintf(stderr, "cqe res %d, wanted 0\n", cqe->res);
+	else if (cqe->user_data != seq)
+		fprintf(stderr, "Unexpected user_data: %ld\n", (long) cqe->user_data);
+	else {
+		io_uring_cqe_seen(ring, cqe);
+		return T_EXIT_PASS;
+	}
+	return T_EXIT_FAIL;
+}
+
+int main(int argc, char *argv[])
+{
+	struct io_uring ring;
+	int fd, ret, i;
+
+	if (argc < 2)
+		return T_EXIT_SKIP;
+
+	ret = nvme_get_info(argv[1]);
+	if (ret)
+		return T_EXIT_SKIP;
+
+	fd = open(argv[1], O_RDONLY);
+	if (fd < 0) {
+		if (errno == EACCES || errno == EPERM)
+			return T_EXIT_SKIP;
+		perror("file open");
+		return T_EXIT_FAIL;
+	}
+
+	ret = io_uring_queue_init(8, &ring,
+		IORING_SETUP_CQE_MIXED | IORING_SETUP_SQE_MIXED);
+	if (ret) {
+		if (ret == -EINVAL)
+			ret = T_EXIT_SKIP;
+		else {
+			fprintf(stderr, "ring setup failed: %d\n", ret);
+			ret = T_EXIT_FAIL;
+		}
+		goto close;
+	}
+
+	for (i = 0; i < 32; i++) {
+		if (i & 1)
+			ret = test_single_nvme_read(&ring, fd);
+		else
+			ret = test_single_nop(&ring);
+
+		if (ret)
+			break;
+	}
+
+	io_uring_queue_exit(&ring);
+close:
+	close(fd);
+	return ret;
+}
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCHv3 0/3] io_uring: mixed submission queue size support
  2025-09-24 15:12 [PATCHv3 0/3] Keith Busch
                   ` (3 preceding siblings ...)
  2025-09-24 15:12 ` [PATCHv3 3/3] Add mixed sqe test for uring commands Keith Busch
@ 2025-09-24 15:54 ` Keith Busch
  4 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2025-09-24 15:54 UTC (permalink / raw)
  To: Keith Busch; +Cc: io-uring, axboe, csander, ming.lei

Resend for missing subject... I keep pointing to the wrong directory
after merging the liburing and kernel cover letters.

On Wed, Sep 24, 2025 at 08:12:06AM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Previous version:
> 
>   https://lore.kernel.org/io-uring/20250904192716.3064736-1-kbusch@meta.com/
> 
> The CQ supports mixed size entries, so why not for SQs too? There are
> use cases that currently allocate different queues just to keep these
> things separated, but we can efficiently handle both cases in a single
> ring.
> 
> Changes since v2:
> 
>  - Define 128B opcodes to be used on mixed SQs. This is done instead of
>    using the last SQE flags bit to generically identify a command as
>    such. The new opcodes are valid only on a mixed SQ.
> 
>  - Fixed up the accounting of sqes left to dispatch. The big sqes on a
>    mixed sq count for two entries, so previously would have fetched too
>    many.
> 
>  - liburing won't bother submitting the nop-skip for the wrap-around
>    condition if there are not enough free entries for the big-sqe.
> 
> kernel:
> 
> Keith Busch (1):
>   io_uring: add support for IORING_SETUP_SQE_MIXED
> 
>  include/uapi/linux/io_uring.h |  8 ++++++++
>  io_uring/fdinfo.c             | 34 +++++++++++++++++++++++++++-------
>  io_uring/io_uring.c           | 27 +++++++++++++++++++++++----
>  io_uring/io_uring.h           |  8 +++++---
>  io_uring/opdef.c              | 26 ++++++++++++++++++++++++++
>  io_uring/opdef.h              |  2 ++
>  io_uring/register.c           |  2 +-
>  io_uring/uring_cmd.c          | 12 ++++++++++--
>  io_uring/uring_cmd.h          |  1 +
>  9 files changed, 103 insertions(+), 17 deletions(-)
> 
> liburing:
> 
> Keith Busch (3):
>   Add support IORING_SETUP_SQE_MIXED
>   Add nop testing for IORING_SETUP_SQE_MIXED
>   Add mixed sqe test for uring commands
> 
>  src/include/liburing.h          |  50 +++++++++++
>  src/include/liburing/io_uring.h |  11 +++
>  test/Makefile                   |   3 +
>  test/sqe-mixed-bad-wrap.c       |  89 ++++++++++++++++++++
>  test/sqe-mixed-nop.c            |  82 ++++++++++++++++++
>  test/sqe-mixed-uring_cmd.c      | 142 ++++++++++++++++++++++++++++++++
>  6 files changed, 377 insertions(+)
>  create mode 100644 test/sqe-mixed-bad-wrap.c
>  create mode 100644 test/sqe-mixed-nop.c
>  create mode 100644 test/sqe-mixed-uring_cmd.c
> 
> -- 
> 2.47.3
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCHv3 1/3] Add support IORING_SETUP_SQE_MIXED
  2025-09-24 15:12 ` [PATCHv3 1/3] Add support IORING_SETUP_SQE_MIXED Keith Busch
@ 2025-09-24 20:20   ` Caleb Sander Mateos
  2025-09-24 20:30     ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Caleb Sander Mateos @ 2025-09-24 20:20 UTC (permalink / raw)
  To: Keith Busch; +Cc: io-uring, axboe, ming.lei, Keith Busch

On Wed, Sep 24, 2025 at 8:12 AM Keith Busch <kbusch@meta.com> wrote:
>
> From: Keith Busch <kbusch@kernel.org>
>
> This adds core support for mixed sized SQEs in the same SQ ring. Before
> this, SQEs were either 64b in size (the normal size), or 128b if
> IORING_SETUP_SQE128 was set in the ring initialization. With the mixed
> support, an SQE may be either 64b or 128b on the same SQ ring. If the
> SQE is 128b in size, then a 128b opcode will be set in the sqe op. When
> acquiring a large sqe at the end of the sq, the client may post a NOP
> SQE with IOSQE_CQE_SKIP_SUCCESS set that the kernel should simply ignore
> as it's just a pad filler that is posted when required.
>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  src/include/liburing.h          | 50 +++++++++++++++++++++++++++++++++
>  src/include/liburing/io_uring.h | 11 ++++++++
>  2 files changed, 61 insertions(+)
>
> diff --git a/src/include/liburing.h b/src/include/liburing.h
> index 052d6b56..66f1b990 100644
> --- a/src/include/liburing.h
> +++ b/src/include/liburing.h
> @@ -575,6 +575,7 @@ IOURINGINLINE void io_uring_initialize_sqe(struct io_uring_sqe *sqe)
>         sqe->buf_index = 0;
>         sqe->personality = 0;
>         sqe->file_index = 0;
> +       sqe->addr2 = 0;

Why is this necessary for mixed SQE size support? It looks like this
field is already initialized in io_uring_prep_rw() via the unioned off
field. Though, to be honest, I can't say I understand why the
initialization of the SQE fields is split between
io_uring_initialize_sqe() and io_uring_prep_rw().

>         sqe->addr3 = 0;
>         sqe->__pad2[0] = 0;
>  }
> @@ -799,6 +800,12 @@ IOURINGINLINE void io_uring_prep_nop(struct io_uring_sqe *sqe)
>         io_uring_prep_rw(IORING_OP_NOP, sqe, -1, NULL, 0, 0);
>  }
>
> +IOURINGINLINE void io_uring_prep_nop128(struct io_uring_sqe *sqe)
> +       LIBURING_NOEXCEPT
> +{
> +       io_uring_prep_rw(IORING_OP_NOP128, sqe, -1, NULL, 0, 0);
> +}
> +
>  IOURINGINLINE void io_uring_prep_timeout(struct io_uring_sqe *sqe,
>                                          struct __kernel_timespec *ts,
>                                          unsigned count, unsigned flags)
> @@ -1882,6 +1889,49 @@ IOURINGINLINE struct io_uring_sqe *_io_uring_get_sqe(struct io_uring *ring)
>         return sqe;
>  }
>
> +/*
> + * Return a 128B sqe to fill. Applications must later call io_uring_submit()
> + * when it's ready to tell the kernel about it. The caller may call this
> + * function multiple times before calling io_uring_submit().
> + *
> + * Returns a vacant 128B sqe, or NULL if we're full. If the current tail is the
> + * last entry in the ring, this function will insert a nop + skip complete such
> + * that the 128b entry wraps back to the beginning of the queue for a
> + * contiguous big sq entry. It's up to the caller to use a 128b opcode in order
> + * for the kernel to know how to advance its sq head pointer.
> + */
> +IOURINGINLINE struct io_uring_sqe *io_uring_get_sqe128_mixed(struct io_uring *ring)
> +       LIBURING_NOEXCEPT
> +{
> +       struct io_uring_sq *sq = &ring->sq;
> +       unsigned head = io_uring_load_sq_head(ring), tail = sq->sqe_tail;
> +       struct io_uring_sqe *sqe;
> +
> +       if (!(ring->flags & IORING_SETUP_SQE_MIXED))
> +               return NULL;
> +
> +       if (((tail + 1) & sq->ring_mask) == 0) {
> +               if ((tail + 2) - head >= sq->ring_entries)
> +                       return NULL;
> +
> +               sqe = _io_uring_get_sqe(ring);
> +               if (!sqe)
> +                       return NULL;

This case should be impossible since we just checked there is an empty
SQ slot at the end of the ring plus two more at the beginning.

> +
> +               io_uring_prep_nop(sqe);
> +               sqe->flags |= IOSQE_CQE_SKIP_SUCCESS;
> +               tail = sq->sqe_tail;
> +       } else if ((tail + 1) - head >= sq->ring_entries) {
> +               return NULL;
> +       }
> +
> +       sqe = &sq->sqes[tail & sq->ring_mask];
> +       sq->sqe_tail = tail + 2;
> +       io_uring_initialize_sqe(sqe);
> +
> +       return sqe;
> +}
> +
>  /*
>   * Return the appropriate mask for a buffer ring of size 'ring_entries'
>   */
> diff --git a/src/include/liburing/io_uring.h b/src/include/liburing/io_uring.h
> index 31396057..1e0b6398 100644
> --- a/src/include/liburing/io_uring.h
> +++ b/src/include/liburing/io_uring.h
> @@ -126,6 +126,7 @@ enum io_uring_sqe_flags_bit {
>         IOSQE_ASYNC_BIT,
>         IOSQE_BUFFER_SELECT_BIT,
>         IOSQE_CQE_SKIP_SUCCESS_BIT,
> +       IOSQE_SQE_128B_BIT,

I thought we decided against using an SQE flag bit for this? Looks
like this needs to be re-synced with the kernel uapi header.

Best,
Caleb

>  };
>
>  /*
> @@ -145,6 +146,8 @@ enum io_uring_sqe_flags_bit {
>  #define IOSQE_BUFFER_SELECT    (1U << IOSQE_BUFFER_SELECT_BIT)
>  /* don't post CQE if request succeeded */
>  #define IOSQE_CQE_SKIP_SUCCESS (1U << IOSQE_CQE_SKIP_SUCCESS_BIT)
> +/* this is a 128b/big-sqe posting */
> +#define IOSQE_SQE_128B          (1U << IOSQE_SQE_128B_BIT)
>
>  /*
>   * io_uring_setup() flags
> @@ -211,6 +214,12 @@ enum io_uring_sqe_flags_bit {
>   */
>  #define IORING_SETUP_CQE_MIXED         (1U << 18)
>
> +/*
> + *  Allow both 64b and 128b SQEs. If a 128b SQE is posted, it will have
> + *  IOSQE_SQE_128B set in sqe->flags.
> + */
> +#define IORING_SETUP_SQE_MIXED         (1U << 19)
> +
>  enum io_uring_op {
>         IORING_OP_NOP,
>         IORING_OP_READV,
> @@ -275,6 +284,8 @@ enum io_uring_op {
>         IORING_OP_READV_FIXED,
>         IORING_OP_WRITEV_FIXED,
>         IORING_OP_PIPE,
> +       IORING_OP_NOP128,
> +       IORING_OP_URING_CMD128,
>
>         /* this goes last, obviously */
>         IORING_OP_LAST,
> --
> 2.47.3
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCHv3 1/3] Add support IORING_SETUP_SQE_MIXED
  2025-09-24 20:20   ` Caleb Sander Mateos
@ 2025-09-24 20:30     ` Keith Busch
  2025-09-24 20:37       ` Caleb Sander Mateos
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2025-09-24 20:30 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: Keith Busch, io-uring, axboe, ming.lei

On Wed, Sep 24, 2025 at 01:20:44PM -0700, Caleb Sander Mateos wrote:
> > index 052d6b56..66f1b990 100644
> > --- a/src/include/liburing.h
> > +++ b/src/include/liburing.h
> > @@ -575,6 +575,7 @@ IOURINGINLINE void io_uring_initialize_sqe(struct io_uring_sqe *sqe)
> >         sqe->buf_index = 0;
> >         sqe->personality = 0;
> >         sqe->file_index = 0;
> > +       sqe->addr2 = 0;
> 
> Why is this necessary for mixed SQE size support? It looks like this
> field is already initialized in io_uring_prep_rw() via the unioned off
> field. Though, to be honest, I can't say I understand why the
> initialization of the SQE fields is split between
> io_uring_initialize_sqe() and io_uring_prep_rw().

The nvme passthrough uring_cmd doesn't call io_uring_prep_rw(), so we'd
just get a stale value in that field if we don't clear it. But you're
right that many cases would end up setting the field twice when we don't
need that.
 
> > +       IOSQE_SQE_128B_BIT,
> 
> I thought we decided against using an SQE flag bit for this? Looks
> like this needs to be re-synced with the kernel uapi header.

We did, and this is a left over artifact that is not supposed to be
here. :( Nothing is depending on the bit in this series.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCHv3 1/3] Add support IORING_SETUP_SQE_MIXED
  2025-09-24 20:30     ` Keith Busch
@ 2025-09-24 20:37       ` Caleb Sander Mateos
  0 siblings, 0 replies; 12+ messages in thread
From: Caleb Sander Mateos @ 2025-09-24 20:37 UTC (permalink / raw)
  To: Keith Busch; +Cc: Keith Busch, io-uring, axboe, ming.lei

On Wed, Sep 24, 2025 at 1:30 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Sep 24, 2025 at 01:20:44PM -0700, Caleb Sander Mateos wrote:
> > > index 052d6b56..66f1b990 100644
> > > --- a/src/include/liburing.h
> > > +++ b/src/include/liburing.h
> > > @@ -575,6 +575,7 @@ IOURINGINLINE void io_uring_initialize_sqe(struct io_uring_sqe *sqe)
> > >         sqe->buf_index = 0;
> > >         sqe->personality = 0;
> > >         sqe->file_index = 0;
> > > +       sqe->addr2 = 0;
> >
> > Why is this necessary for mixed SQE size support? It looks like this
> > field is already initialized in io_uring_prep_rw() via the unioned off
> > field. Though, to be honest, I can't say I understand why the
> > initialization of the SQE fields is split between
> > io_uring_initialize_sqe() and io_uring_prep_rw().
>
> The nvme passthrough uring_cmd doesn't call io_uring_prep_rw(), so we'd
> just get a stale value in that field if we don't clear it. But you're
> right that many cases would end up setting the field twice when we don't
> need that.

Sure, that's a reasonable concern. Perhaps a helper for initializing an
NVMe passthru operation would make sense, though maybe it's difficult
to do that without requiring the linux/nvme_ioctl.h uapi header. But
regardless, it seems unrelated to the mixed SQE size work.
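
A purely hypothetical sketch of what such a helper could look like (the
name and shape are made up here, and it dodges the header question by
handing back a void pointer to the cmd area):

	static inline void *io_uring_prep_uring_cmd128(struct io_uring_sqe *sqe,
						       int fd, __u32 cmd_op)
	{
		io_uring_prep_rw(IORING_OP_URING_CMD128, sqe, fd, NULL, 0, 0);
		sqe->cmd_op = cmd_op;
		/* caller fills and casts this, e.g. to struct nvme_uring_cmd */
		return sqe->cmd;
	}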

>
> > > +       IOSQE_SQE_128B_BIT,
> >
> > I thought we decided against using an SQE flag bit for this? Looks
> > like this needs to be re-synced with the kernel uapi header.
>
> We did, and this is a left over artifact that is not supposed to be
> here. :( Nothing is depending on the bit in this series.

Yeah I figured this file just needed to be updated with the current
version of the uapi header defined in your latest kernel patch.

Best,
Caleb

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCHv3 1/1] io_uring: add support for IORING_SETUP_SQE_MIXED
  2025-09-24 15:12 ` [PATCHv3 1/1] io_uring: add support for IORING_SETUP_SQE_MIXED Keith Busch
@ 2025-09-25 15:03   ` Jens Axboe
  2025-09-25 18:21     ` Caleb Sander Mateos
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2025-09-25 15:03 UTC (permalink / raw)
  To: Keith Busch, io-uring; +Cc: csander, ming.lei, Keith Busch

On 9/24/25 9:12 AM, Keith Busch wrote:
> contiguous in the SQ ring, a 128b SQE cannot wrap the ring. For this
> case, a single NOP SQE should be posted with the SKIP_SUCCESS flag set.
> The kernel should simply ignore those.

I think this mirrors the CQE side too much - the kernel doesn't ignore
them, they get processed just like any other NOP that has SKIP_SUCCESS
set. They don't post a CQE, but that's not because they are ignored,
that's just the nature of a successful NOP w/SKIP_SUCCESS set.

> @@ -2179,6 +2179,14 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
>  	opcode = array_index_nospec(opcode, IORING_OP_LAST);
>  
>  	def = &io_issue_defs[opcode];
> +	if (def->is_128) {
> +		if (!(ctx->flags & IORING_SETUP_SQE_MIXED) || *left < 2 ||
> +		    (ctx->cached_sq_head & (ctx->sq_entries - 1)) == 0)
> +			return io_init_fail_req(req, -EINVAL);
> +		ctx->cached_sq_head++;
> +		(*left)--;
> +	}

This could do with a comment!
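
Something along these lines, perhaps (untested sketch):

	/*
	 * A 128b opcode consumes two contiguous SQ slots: it's only valid
	 * on a mixed SQ, there must be a second slot left in this submit
	 * batch, and the SQE can't sit in the last slot of the ring since
	 * it isn't allowed to wrap.
	 */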

> @@ -582,9 +583,10 @@ static inline void io_req_queue_tw_complete(struct io_kiocb *req, s32 res)
>   * IORING_SETUP_SQE128 contexts allocate twice the normal SQE size for each
>   * slot.
>   */
> -static inline size_t uring_sqe_size(struct io_ring_ctx *ctx)
> +static inline size_t uring_sqe_size(struct io_kiocb *req)
>  {
> -	if (ctx->flags & IORING_SETUP_SQE128)
> +	if (req->ctx->flags & IORING_SETUP_SQE128 ||
> +	    req->opcode == IORING_OP_URING_CMD128)
>  		return 2 * sizeof(struct io_uring_sqe);
>  	return sizeof(struct io_uring_sqe);

This one really confused me, but then I grep'ed, and it's uring_cmd
specific. Should probably move this one to uring_cmd.c rather than have
it elsewhere.

> +int io_uring_cmd128_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> +{
> +	if (!(req->ctx->flags & IORING_SETUP_SQE_MIXED))
> +		return -EINVAL;
> +	return io_uring_cmd_prep(req, sqe);
> +}

Why isn't this just allowed for SQE128 as well? There should be no
reason to disallow explicitly 128b sqe commands in SQE128 mode, they
should work for any mode that supports 128b SQEs which is either
SQE_MIXED or SQE128?
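
IOW, something like this (untested):

	if (!(req->ctx->flags & (IORING_SETUP_SQE128 | IORING_SETUP_SQE_MIXED)))
		return -EINVAL;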

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCHv3 1/1] io_uring: add support for IORING_SETUP_SQE_MIXED
  2025-09-25 15:03   ` Jens Axboe
@ 2025-09-25 18:21     ` Caleb Sander Mateos
  2025-09-25 18:44       ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Caleb Sander Mateos @ 2025-09-25 18:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Keith Busch, io-uring, ming.lei, Keith Busch

On Thu, Sep 25, 2025 at 8:03 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 9/24/25 9:12 AM, Keith Busch wrote:
> > contiguous in the SQ ring, a 128b SQE cannot wrap the ring. For this
> > case, a single NOP SQE should be posted with the SKIP_SUCCESS flag set.
> > The kernel should simply ignore those.
>
> I think this mirrors the CQE side too much - the kernel doesn't ignore
> them, they get processed just like any other NOP that has SKIP_SUCCESS
> set. They don't post a CQE, but that's not because they are ignored,
> that's just the nature of a successful NOP w/SKIP_SUCCESS set.
>
> > @@ -2179,6 +2179,14 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
> >       opcode = array_index_nospec(opcode, IORING_OP_LAST);
> >
> >       def = &io_issue_defs[opcode];
> > +     if (def->is_128) {
> > +             if (!(ctx->flags & IORING_SETUP_SQE_MIXED) || *left < 2 ||
> > +                 (ctx->cached_sq_head & (ctx->sq_entries - 1)) == 0)
> > +                     return io_init_fail_req(req, -EINVAL);
> > +             ctx->cached_sq_head++;
> > +             (*left)--;
> > +     }
>
> This could do with a comment!
>
> > @@ -582,9 +583,10 @@ static inline void io_req_queue_tw_complete(struct io_kiocb *req, s32 res)
> >   * IORING_SETUP_SQE128 contexts allocate twice the normal SQE size for each
> >   * slot.
> >   */
> > -static inline size_t uring_sqe_size(struct io_ring_ctx *ctx)
> > +static inline size_t uring_sqe_size(struct io_kiocb *req)
> >  {
> > -     if (ctx->flags & IORING_SETUP_SQE128)
> > +     if (req->ctx->flags & IORING_SETUP_SQE128 ||
> > +         req->opcode == IORING_OP_URING_CMD128)
> >               return 2 * sizeof(struct io_uring_sqe);
> >       return sizeof(struct io_uring_sqe);
>
> This one really confused me, but then I grep'ed, and it's uring_cmd
> specific. Should probably move this one to uring_cmd.c rather than have
> it elsewhere.
>
> > +int io_uring_cmd128_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> > +{
> > +     if (!(req->ctx->flags & IORING_SETUP_SQE_MIXED))
> > +             return -EINVAL;
> > +     return io_uring_cmd_prep(req, sqe);
> > +}
>
> Why isn't this just allowed for SQE128 as well? There should be no
> reason to disallow explicitly 128b sqe commands in SQE128 mode, they
> should work for any mode that supports 128b SQEs which is either
> SQE_MIXED or SQE128?

Not to mention, the check in io_init_req() should already have
rejected a 128-byte operation on a non-IORING_SETUP_SQE_MIXED
io_ring_ctx.

Best,
Caleb

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCHv3 1/1] io_uring: add support for IORING_SETUP_SQE_MIXED
  2025-09-25 18:21     ` Caleb Sander Mateos
@ 2025-09-25 18:44       ` Jens Axboe
  0 siblings, 0 replies; 12+ messages in thread
From: Jens Axboe @ 2025-09-25 18:44 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: Keith Busch, io-uring, ming.lei, Keith Busch

On 9/25/25 12:21 PM, Caleb Sander Mateos wrote:
>>> +int io_uring_cmd128_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>> +{
>>> +     if (!(req->ctx->flags & IORING_SETUP_SQE_MIXED))
>>> +             return -EINVAL;
>>> +     return io_uring_cmd_prep(req, sqe);
>>> +}
>>
>> Why isn't this just allowed for SQE128 as well? There should be no
>> reason to disallow explicitly 128b sqe commands in SQE128 mode, they
>> should work for any mode that supports 128b SQEs which is either
>> SQE_MIXED or SQE128?
> 
> Not to mention, the check in io_init_req() should already have
> rejected a 128-byte operation on a non-IORING_SETUP_SQE_MIXED
> io_ring_ctx.

Yes good point, it should all be caught there, no need for any per-op
checking (or the separate prep handler).

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread

Thread overview: 12+ messages
2025-09-24 15:12 [PATCHv3 0/3] Keith Busch
2025-09-24 15:12 ` [PATCHv3 1/3] Add support IORING_SETUP_SQE_MIXED Keith Busch
2025-09-24 20:20   ` Caleb Sander Mateos
2025-09-24 20:30     ` Keith Busch
2025-09-24 20:37       ` Caleb Sander Mateos
2025-09-24 15:12 ` [PATCHv3 1/1] io_uring: add support for IORING_SETUP_SQE_MIXED Keith Busch
2025-09-25 15:03   ` Jens Axboe
2025-09-25 18:21     ` Caleb Sander Mateos
2025-09-25 18:44       ` Jens Axboe
2025-09-24 15:12 ` [PATCHv3 2/3] Add nop testing " Keith Busch
2025-09-24 15:12 ` [PATCHv3 3/3] Add mixed sqe test for uring commands Keith Busch
2025-09-24 15:54 ` [PATCHv3 0/3] io_uring: mixed submission queue size support Keith Busch
