public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCHSET RFC v2 0/4] uring_cmd copy avoidance
@ 2025-06-05 19:40 Jens Axboe
  2025-06-05 19:40 ` [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag Jens Axboe
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Jens Axboe @ 2025-06-05 19:40 UTC (permalink / raw)
  To: io-uring; +Cc: csander

Hi,

Currently uring_cmd unconditionally copies the SQE at prep time, as it
has no other choice - the SQE data must remain stable after submit.
This can lead to excessive memory bandwidth being used for that copy,
as passthrough will often use 128b SQEs, and efficiency concerns as
those copies will potentially use quite a lot of CPU cycles as well.

As a quick test, running the current -git kernel on a box with 23
NVMe drives doing passthrough IO, memcpy() is the highest cycle user
at 9.05%, which is all off the uring_cmd prep path. The test case is
a 512b random read, which runs at 91-92M IOPS.

With these patches, memcpy() is gone from the profiles, and it runs
at 98-99M IOPS, or about 7-8% faster.

Before:

IOPS=91.12M, BW=44.49GiB/s, IOS/call=32/32
IOPS=91.16M, BW=44.51GiB/s, IOS/call=32/32
IOPS=91.18M, BW=44.52GiB/s, IOS/call=31/32
IOPS=91.92M, BW=44.88GiB/s, IOS/call=32/32
IOPS=91.88M, BW=44.86GiB/s, IOS/call=32/32
IOPS=91.82M, BW=44.83GiB/s, IOS/call=32/31
IOPS=91.52M, BW=44.69GiB/s, IOS/call=32/32

with the top perf report -g --no-children being:

+    9.07%  io_uring  [kernel.kallsyms]  [k] memcpy

and after:

# bash run-peak-pass.sh
[...]
IOPS=99.30M, BW=48.49GiB/s, IOS/call=32/32
IOPS=99.27M, BW=48.47GiB/s, IOS/call=31/32
IOPS=99.60M, BW=48.63GiB/s, IOS/call=32/32
IOPS=99.68M, BW=48.67GiB/s, IOS/call=32/31
IOPS=99.80M, BW=48.73GiB/s, IOS/call=31/32
IOPS=99.84M, BW=48.75GiB/s, IOS/call=32/32

with memcpy not even in profiles. If you do the actual math of 100M
requests per second, and 128b of copying per IOP, then it's almost
12GB/sec of reduced memory bandwidth.

Even for lower IOPS production testing, Caleb reports that memcpy()
overhead is in the realm of 1.1% of CPU time.

v2 of this patchset takes a different approach than v1 did - rather
than have the core mark a request as being potentially issued
out-of-line, this one adds an io_cold_def ->sqe_copy() helper, and
puts the onus on io_uring core to call it appropriately. Outside of
that, it also adds an IO_URING_F_INLINE flag so that the copy helper
_knows_ if it may sanely copy the SQE, or whether there's a bug in
the core and it should just be ended with -EFAULT. Where possible,
the actual SQE is also passed in.

I think this approach is saner, and in fact it can be extended to
reduce over-eager copies in other spots. For now I just did uring_cmd,
and verified that the memcpy's are still gone from my test.
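
For reference, the shape of it (a simplified sketch, not the literal
code - see patches 2 and 4 for the real thing): io_cold_def grows an
optional ->sqe_copy() handler, and the core dispatches to it for any
request that may end up being issued out-of-line:

  int (*sqe_copy)(struct io_kiocb *, const struct io_uring_sqe *,
                  unsigned int issue_flags);

  static int io_req_sqe_copy(struct io_kiocb *req,
                             const struct io_uring_sqe *sqe,
                             unsigned int issue_flags)
  {
          const struct io_cold_def *def = &io_cold_defs[req->opcode];

          if (!def->sqe_copy)
                  return 0;
          return def->sqe_copy(req, sqe, issue_flags);
  }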

Can also be found here:

https://git.kernel.dk/cgit/linux/log/?h=uring_cmd.2

 include/linux/io_uring_types.h |  2 ++
 io_uring/io_uring.c            | 35 +++++++++++++++------
 io_uring/opdef.c               |  1 +
 io_uring/opdef.h               |  1 +
 io_uring/uring_cmd.c           | 57 ++++++++++++++++++----------------
 io_uring/uring_cmd.h           |  2 ++
 6 files changed, 63 insertions(+), 35 deletions(-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag
  2025-06-05 19:40 [PATCHSET RFC v2 0/4] uring_cmd copy avoidance Jens Axboe
@ 2025-06-05 19:40 ` Jens Axboe
  2025-06-06 17:31   ` Caleb Sander Mateos
  2025-06-05 19:40 ` [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method Jens Axboe
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2025-06-05 19:40 UTC (permalink / raw)
  To: io-uring; +Cc: csander, Jens Axboe

Set when the execution of the request is done inline from the system
call itself. Any deferred issue will never have this flag set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h | 2 ++
 io_uring/io_uring.c            | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 2922635986f5..054c43c02c96 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -26,6 +26,8 @@ enum io_uring_cmd_flags {
 	IO_URING_F_MULTISHOT		= 4,
 	/* executed by io-wq */
 	IO_URING_F_IOWQ			= 8,
+	/* executed inline from syscall */
+	IO_URING_F_INLINE		= 16,
 	/* int's last bit, sign checks are usually faster than a bit test */
 	IO_URING_F_NONBLOCK		= INT_MIN,
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index cf759c172083..079a95e1bd82 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1962,7 +1962,8 @@ static inline void io_queue_sqe(struct io_kiocb *req)
 {
 	int ret;
 
-	ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
+	ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER|
+				IO_URING_F_INLINE);
 
 	/*
 	 * We async punt it if the file wasn't marked NOWAIT, or if the file
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method
  2025-06-05 19:40 [PATCHSET RFC v2 0/4] uring_cmd copy avoidance Jens Axboe
  2025-06-05 19:40 ` [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag Jens Axboe
@ 2025-06-05 19:40 ` Jens Axboe
  2025-06-05 20:05   ` Jens Axboe
  2025-06-06 17:36   ` Caleb Sander Mateos
  2025-06-05 19:40 ` [PATCH 3/4] io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup() Jens Axboe
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 18+ messages in thread
From: Jens Axboe @ 2025-06-05 19:40 UTC (permalink / raw)
  To: io-uring; +Cc: csander, Jens Axboe

Will be called by the core of io_uring, if inline issue is not going
to be tried for a request. Opcodes can define this handler to defer
copying of SQE data that should remain stable.

Called with IO_URING_F_INLINE set if this is an inline issue, and that
flag NOT set if it's an out-of-line call. The handler can use this to
determine if it's still safe to copy the SQE. The core should always
guarantee that it will be safe, but by having this flag available the
handler is able to check and fail.
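
To make the contract concrete, here's a minimal sketch of what an
opcode's handler could look like (purely illustrative - the 'foo' op
and its async data are made up; patch 4 adds the real uring_cmd
version):

  static int io_foo_sqe_copy(struct io_kiocb *req,
                             const struct io_uring_sqe *sqe,
                             unsigned int issue_flags)
  {
          struct io_foo_async *af = req->async_data;

          /* already copied to stable storage, nothing to do */
          if (sqe == af->sqes)
                  return 0;
          /* out-of-line call, passed-in SQE may be gone - core bug */
          if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_INLINE)))
                  return -EFAULT;
          memcpy(af->sqes, sqe, uring_sqe_size(req->ctx));
          return 0;
  }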

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/io_uring.c | 32 ++++++++++++++++++++++++--------
 io_uring/opdef.h    |  1 +
 2 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 079a95e1bd82..fdf23e81c4ff 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -147,7 +147,7 @@ static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 					 bool cancel_all,
 					 bool is_sqpoll_thread);
 
-static void io_queue_sqe(struct io_kiocb *req);
+static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 static void __io_req_caches_free(struct io_ring_ctx *ctx);
 
 static __read_mostly DEFINE_STATIC_KEY_FALSE(io_key_has_sqarray);
@@ -1377,7 +1377,7 @@ void io_req_task_submit(struct io_kiocb *req, io_tw_token_t tw)
 	else if (req->flags & REQ_F_FORCE_ASYNC)
 		io_queue_iowq(req);
 	else
-		io_queue_sqe(req);
+		io_queue_sqe(req, NULL);
 }
 
 void io_req_task_queue_fail(struct io_kiocb *req, int ret)
@@ -1935,14 +1935,30 @@ struct file *io_file_get_normal(struct io_kiocb *req, int fd)
 	return file;
 }
 
-static void io_queue_async(struct io_kiocb *req, int ret)
+static int io_req_sqe_copy(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			   unsigned int issue_flags)
+{
+	const struct io_cold_def *def = &io_cold_defs[req->opcode];
+
+	if (!def->sqe_copy)
+		return 0;
+	return def->sqe_copy(req, sqe, issue_flags);
+}
+
+static void io_queue_async(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			   int ret)
 	__must_hold(&req->ctx->uring_lock)
 {
 	if (ret != -EAGAIN || (req->flags & REQ_F_NOWAIT)) {
+fail:
 		io_req_defer_failed(req, ret);
 		return;
 	}
 
+	ret = io_req_sqe_copy(req, sqe, 0);
+	if (unlikely(ret))
+		goto fail;
+
 	switch (io_arm_poll_handler(req, 0)) {
 	case IO_APOLL_READY:
 		io_kbuf_recycle(req, 0);
@@ -1957,7 +1973,7 @@ static void io_queue_async(struct io_kiocb *req, int ret)
 	}
 }
 
-static inline void io_queue_sqe(struct io_kiocb *req)
+static inline void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	__must_hold(&req->ctx->uring_lock)
 {
 	int ret;
@@ -1970,7 +1986,7 @@ static inline void io_queue_sqe(struct io_kiocb *req)
 	 * doesn't support non-blocking read/write attempts
 	 */
 	if (unlikely(ret))
-		io_queue_async(req, ret);
+		io_queue_async(req, sqe, ret);
 }
 
 static void io_queue_sqe_fallback(struct io_kiocb *req)
@@ -2200,7 +2216,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		link->last = req;
 
 		if (req->flags & IO_REQ_LINK_FLAGS)
-			return 0;
+			return io_req_sqe_copy(req, sqe, IO_URING_F_INLINE);
 		/* last request of the link, flush it */
 		req = link->head;
 		link->head = NULL;
@@ -2216,10 +2232,10 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 fallback:
 			io_queue_sqe_fallback(req);
 		}
-		return 0;
+		return io_req_sqe_copy(req, sqe, IO_URING_F_INLINE);
 	}
 
-	io_queue_sqe(req);
+	io_queue_sqe(req, sqe);
 	return 0;
 }
 
diff --git a/io_uring/opdef.h b/io_uring/opdef.h
index 719a52104abe..71bfaa3c8afd 100644
--- a/io_uring/opdef.h
+++ b/io_uring/opdef.h
@@ -38,6 +38,7 @@ struct io_issue_def {
 struct io_cold_def {
 	const char		*name;
 
+	int (*sqe_copy)(struct io_kiocb *, const struct io_uring_sqe *, unsigned int issue_flags);
 	void (*cleanup)(struct io_kiocb *);
 	void (*fail)(struct io_kiocb *);
 };
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/4] io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup()
  2025-06-05 19:40 [PATCHSET RFC v2 0/4] uring_cmd copy avoidance Jens Axboe
  2025-06-05 19:40 ` [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag Jens Axboe
  2025-06-05 19:40 ` [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method Jens Axboe
@ 2025-06-05 19:40 ` Jens Axboe
  2025-06-06 17:37   ` Caleb Sander Mateos
  2025-06-05 19:40 ` [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies Jens Axboe
  2025-06-06 17:29 ` [PATCHSET RFC v2 0/4] uring_cmd copy avoidance Caleb Sander Mateos
  4 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2025-06-05 19:40 UTC (permalink / raw)
  To: io-uring; +Cc: csander, Jens Axboe

It's a pretty pointless helper, just allocates and copies data. Fold it
into io_uring_cmd_prep().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/uring_cmd.c | 34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 929cad6ee326..e204f4941d72 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -181,8 +181,7 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, u64 res2,
 }
 EXPORT_SYMBOL_GPL(io_uring_cmd_done);
 
-static int io_uring_cmd_prep_setup(struct io_kiocb *req,
-				   const struct io_uring_sqe *sqe)
+int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
 	struct io_async_cmd *ac;
@@ -190,6 +189,18 @@ static int io_uring_cmd_prep_setup(struct io_kiocb *req,
 	/* see io_uring_cmd_get_async_data() */
 	BUILD_BUG_ON(offsetof(struct io_async_cmd, data) != 0);
 
+	if (sqe->__pad1)
+		return -EINVAL;
+
+	ioucmd->flags = READ_ONCE(sqe->uring_cmd_flags);
+	if (ioucmd->flags & ~IORING_URING_CMD_MASK)
+		return -EINVAL;
+
+	if (ioucmd->flags & IORING_URING_CMD_FIXED)
+		req->buf_index = READ_ONCE(sqe->buf_index);
+
+	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
+
 	ac = io_uring_alloc_async_data(&req->ctx->cmd_cache, req);
 	if (!ac)
 		return -ENOMEM;
@@ -207,25 +218,6 @@ static int io_uring_cmd_prep_setup(struct io_kiocb *req,
 	return 0;
 }
 
-int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
-{
-	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-
-	if (sqe->__pad1)
-		return -EINVAL;
-
-	ioucmd->flags = READ_ONCE(sqe->uring_cmd_flags);
-	if (ioucmd->flags & ~IORING_URING_CMD_MASK)
-		return -EINVAL;
-
-	if (ioucmd->flags & IORING_URING_CMD_FIXED)
-		req->buf_index = READ_ONCE(sqe->buf_index);
-
-	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
-
-	return io_uring_cmd_prep_setup(req, sqe);
-}
-
 int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-05 19:40 [PATCHSET RFC v2 0/4] uring_cmd copy avoidance Jens Axboe
                   ` (2 preceding siblings ...)
  2025-06-05 19:40 ` [PATCH 3/4] io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup() Jens Axboe
@ 2025-06-05 19:40 ` Jens Axboe
  2025-06-06 17:39   ` Caleb Sander Mateos
  2025-06-06 17:29 ` [PATCHSET RFC v2 0/4] uring_cmd copy avoidance Caleb Sander Mateos
  4 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2025-06-05 19:40 UTC (permalink / raw)
  To: io-uring; +Cc: csander, Jens Axboe

uring_cmd currently copies the full SQE at prep time, just in case it
needs it to be stable. Opt in to using ->sqe_copy() to let the core of
io_uring decide when to copy SQEs.

This provides two checks to see if ioucmd->sqe is still valid:

1) If ioucmd->sqe is not the uring copied version AND IO_URING_F_INLINE
   isn't set, then the core of io_uring has a bug. Warn and return
   -EFAULT.

2) If sqe is NULL AND IO_URING_F_INLINE isn't set, then the core of
   io_uring has a bug. Warn and return -EFAULT.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/opdef.c     |  1 +
 io_uring/uring_cmd.c | 35 ++++++++++++++++++++++++-----------
 io_uring/uring_cmd.h |  2 ++
 3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 6e0882b051f9..287f9a23b816 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -759,6 +759,7 @@ const struct io_cold_def io_cold_defs[] = {
 	},
 	[IORING_OP_URING_CMD] = {
 		.name			= "URING_CMD",
+		.sqe_copy		= io_uring_cmd_sqe_copy,
 		.cleanup		= io_uring_cmd_cleanup,
 	},
 	[IORING_OP_SEND_ZC] = {
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index e204f4941d72..f682b9d442e1 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -205,16 +205,25 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (!ac)
 		return -ENOMEM;
 	ac->data.op_data = NULL;
+	ioucmd->sqe = sqe;
+	return 0;
+}
+
+int io_uring_cmd_sqe_copy(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			  unsigned int issue_flags)
+{
+	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	struct io_async_cmd *ac = req->async_data;
+
+	if (sqe != ac->sqes) {
+		if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_INLINE)))
+			return -EFAULT;
+		if (WARN_ON_ONCE(!sqe))
+			return -EFAULT;
+		memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
+		ioucmd->sqe = ac->sqes;
+	}
 
-	/*
-	 * Unconditionally cache the SQE for now - this is only needed for
-	 * requests that go async, but prep handlers must ensure that any
-	 * sqe data is stable beyond prep. Since uring_cmd is special in
-	 * that it doesn't read in per-op data, play it safe and ensure that
-	 * any SQE data is stable beyond prep. This can later get relaxed.
-	 */
-	memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
-	ioucmd->sqe = ac->sqes;
 	return 0;
 }
 
@@ -251,8 +260,12 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
 	}
 
 	ret = file->f_op->uring_cmd(ioucmd, issue_flags);
-	if (ret == -EAGAIN || ret == -EIOCBQUEUED)
-		return ret;
+	if (ret == -EAGAIN) {
+		io_uring_cmd_sqe_copy(req, ioucmd->sqe, issue_flags);
+		return -EAGAIN;
+	} else if (ret == -EIOCBQUEUED) {
+		return -EIOCBQUEUED;
+	}
 	if (ret < 0)
 		req_set_fail(req);
 	io_req_uring_cleanup(req, issue_flags);
diff --git a/io_uring/uring_cmd.h b/io_uring/uring_cmd.h
index e6a5142c890e..f956b0e7c351 100644
--- a/io_uring/uring_cmd.h
+++ b/io_uring/uring_cmd.h
@@ -11,6 +11,8 @@ struct io_async_cmd {
 
 int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags);
 int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_uring_cmd_sqe_copy(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			  unsigned int issue_flags);
 void io_uring_cmd_cleanup(struct io_kiocb *req);
 
 bool io_uring_try_cancel_uring_cmd(struct io_ring_ctx *ctx,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method
  2025-06-05 19:40 ` [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method Jens Axboe
@ 2025-06-05 20:05   ` Jens Axboe
  2025-06-06 17:36   ` Caleb Sander Mateos
  1 sibling, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2025-06-05 20:05 UTC (permalink / raw)
  To: io-uring; +Cc: csander

Ran all test cases and found that...

> +static void io_queue_async(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> +			   int ret)
>  	__must_hold(&req->ctx->uring_lock)
>  {
>  	if (ret != -EAGAIN || (req->flags & REQ_F_NOWAIT)) {
> +fail:
>  		io_req_defer_failed(req, ret);
>  		return;
>  	}
>  
> +	ret = io_req_sqe_copy(req, sqe, 0);
> +	if (unlikely(ret))
> +		goto fail;
> +

we of course need to pass 'issue_flags' from io_queue_sqe() here,
otherwise we risk breaking the sanity checks in uring_cmd later on.

The git branch referenced in the cover letter has been updated.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCHSET RFC v2 0/4] uring_cmd copy avoidance
  2025-06-05 19:40 [PATCHSET RFC v2 0/4] uring_cmd copy avoidance Jens Axboe
                   ` (3 preceding siblings ...)
  2025-06-05 19:40 ` [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies Jens Axboe
@ 2025-06-06 17:29 ` Caleb Sander Mateos
  2025-06-06 17:32   ` Jens Axboe
  4 siblings, 1 reply; 18+ messages in thread
From: Caleb Sander Mateos @ 2025-06-06 17:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Thu, Jun 5, 2025 at 12:47 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Hi,
>
> Currently uring_cmd unconditionally copies the SQE at prep time, as it
> has no other choice - the SQE data must remain stable after submit.
> This can lead to excessive memory bandwidth being used for that copy,
> as passthrough will often use 128b SQEs, and efficiency concerns as
> those copies will potentially use quite a lot of CPU cycles as well.
>
> As a quick test, running the current -git kernel on a box with 23
> NVMe drives doing passthrough IO, memcpy() is the highest cycle user
> at 9.05%, which is all off the uring_cmd prep path. The test case is
> a 512b random read, which runs at 91-92M IOPS.
>
> With these patches, memcpy() is gone from the profiles, and it runs
> at 98-99M IOPS, or about 7-8% faster.
>
> Before:
>
> IOPS=91.12M, BW=44.49GiB/s, IOS/call=32/32
> IOPS=91.16M, BW=44.51GiB/s, IOS/call=32/32
> IOPS=91.18M, BW=44.52GiB/s, IOS/call=31/32
> IOPS=91.92M, BW=44.88GiB/s, IOS/call=32/32
> IOPS=91.88M, BW=44.86GiB/s, IOS/call=32/32
> IOPS=91.82M, BW=44.83GiB/s, IOS/call=32/31
> IOPS=91.52M, BW=44.69GiB/s, IOS/call=32/32
>
> with the top perf report -g --no-children being:
>
> +    9.07%  io_uring  [kernel.kallsyms]  [k] memcpy
>
> and after:
>
> # bash run-peak-pass.sh
> [...]
> IOPS=99.30M, BW=48.49GiB/s, IOS/call=32/32
> IOPS=99.27M, BW=48.47GiB/s, IOS/call=31/32
> IOPS=99.60M, BW=48.63GiB/s, IOS/call=32/32
> IOPS=99.68M, BW=48.67GiB/s, IOS/call=32/31
> IOPS=99.80M, BW=48.73GiB/s, IOS/call=31/32
> IOPS=99.84M, BW=48.75GiB/s, IOS/call=32/32
>
> with memcpy not even in profiles. If you do the actual math of 100M
> requests per second, and 128b of copying per IOP, then it's almost
> 12GB/sec of reduced memory bandwidth.
>
> Even for lower IOPS production testing, Caleb reports that memcpy()
> overhead is in the realm of 1.1% of CPU time.
>
> v2 of this patchset takes a different approach than v1 did - rather
> than have the core mark a request as being potentially issued
> out-of-line, this one adds an io_cold_def ->sqe_copy() helper, and
> puts the onus on io_uring core to call it appropriately. Outside of
> that, it also adds an IO_URING_F_INLINE flag so that the copy helper
> _knows_ if it may sanely copy the SQE, or whether there's a bug in
> the core and it should just be ended with -EFAULT. Where possible,
> the actual SQE is also passed in.

I like the ->sqe_copy() approach. I'm not totally convinced the
complexity of computing and checking IO_URING_F_INLINE is worth it for
what's effectively an assertion, but I'm not strongly opposed to it
either.

Thanks,
Caleb


>
> I think this approach is saner, and in fact it can be extended to
> reduce over-eager copies in other spots. For now I just did uring_cmd,
> and verified that the memcpy's are still gone from my test.
>
> Can also be found here:
>
> https://git.kernel.dk/cgit/linux/log/?h=uring_cmd.2
>
>  include/linux/io_uring_types.h |  2 ++
>  io_uring/io_uring.c            | 35 +++++++++++++++------
>  io_uring/opdef.c               |  1 +
>  io_uring/opdef.h               |  1 +
>  io_uring/uring_cmd.c           | 57 ++++++++++++++++++----------------
>  io_uring/uring_cmd.h           |  2 ++
>  6 files changed, 63 insertions(+), 35 deletions(-)
>
> --
> Jens Axboe
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag
  2025-06-05 19:40 ` [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag Jens Axboe
@ 2025-06-06 17:31   ` Caleb Sander Mateos
  2025-06-06 21:02     ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Caleb Sander Mateos @ 2025-06-06 17:31 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Thu, Jun 5, 2025 at 12:47 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Set when the execution of the request is done inline from the system
> call itself. Any deferred issue will never have this flag set.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/io_uring_types.h | 2 ++
>  io_uring/io_uring.c            | 3 ++-
>  2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 2922635986f5..054c43c02c96 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -26,6 +26,8 @@ enum io_uring_cmd_flags {
>         IO_URING_F_MULTISHOT            = 4,
>         /* executed by io-wq */
>         IO_URING_F_IOWQ                 = 8,
> +       /* executed inline from syscall */
> +       IO_URING_F_INLINE               = 16,
>         /* int's last bit, sign checks are usually faster than a bit test */
>         IO_URING_F_NONBLOCK             = INT_MIN,
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index cf759c172083..079a95e1bd82 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -1962,7 +1962,8 @@ static inline void io_queue_sqe(struct io_kiocb *req)
>  {
>         int ret;
>
> -       ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
> +       ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER|
> +                               IO_URING_F_INLINE);

Isn't io_queue_sqe() also called from io_req_task_submit(), which is
an io_req_tw_func_t callback? Task work runs after the SQ slots have
already been returned to userspace, right? Before the unconditional
memcpy() was added, we had observed requests in linked chains with
corrupted SQEs due to the async task work issue.

As is, it looks like IO_URING_F_INLINE is just the inverse of
IO_URING_F_IOWQ, so it may not be necessary to add a new flag. But I
can see how core io_uring might add additional async issue cases in
the future.

Best,
Caleb



>
>         /*
>          * We async punt it if the file wasn't marked NOWAIT, or if the file
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCHSET RFC v2 0/4] uring_cmd copy avoidance
  2025-06-06 17:29 ` [PATCHSET RFC v2 0/4] uring_cmd copy avoidance Caleb Sander Mateos
@ 2025-06-06 17:32   ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2025-06-06 17:32 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring

On 6/6/25 11:29 AM, Caleb Sander Mateos wrote:
> On Thu, Jun 5, 2025 at 12:47 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Hi,
>>
>> Currently uring_cmd unconditionally copies the SQE at prep time, as it
>> has no other choice - the SQE data must remain stable after submit.
>> This can lead to excessive memory bandwidth being used for that copy,
>> as passthrough will often use 128b SQEs, and efficiency concerns as
>> those copies will potentially use quite a lot of CPU cycles as well.
>>
>> As a quick test, running the current -git kernel on a box with 23
>> NVMe drives doing passthrough IO, memcpy() is the highest cycle user
>> at 9.05%, which is all off the uring_cmd prep path. The test case is
>> a 512b random read, which runs at 91-92M IOPS.
>>
>> With these patches, memcpy() is gone from the profiles, and it runs
>> at 98-99M IOPS, or about 7-8% faster.
>>
>> Before:
>>
>> IOPS=91.12M, BW=44.49GiB/s, IOS/call=32/32
>> IOPS=91.16M, BW=44.51GiB/s, IOS/call=32/32
>> IOPS=91.18M, BW=44.52GiB/s, IOS/call=31/32
>> IOPS=91.92M, BW=44.88GiB/s, IOS/call=32/32
>> IOPS=91.88M, BW=44.86GiB/s, IOS/call=32/32
>> IOPS=91.82M, BW=44.83GiB/s, IOS/call=32/31
>> IOPS=91.52M, BW=44.69GiB/s, IOS/call=32/32
>>
>> with the top perf report -g --no-children being:
>>
>> +    9.07%  io_uring  [kernel.kallsyms]  [k] memcpy
>>
>> and after:
>>
>> # bash run-peak-pass.sh
>> [...]
>> IOPS=99.30M, BW=48.49GiB/s, IOS/call=32/32
>> IOPS=99.27M, BW=48.47GiB/s, IOS/call=31/32
>> IOPS=99.60M, BW=48.63GiB/s, IOS/call=32/32
>> IOPS=99.68M, BW=48.67GiB/s, IOS/call=32/31
>> IOPS=99.80M, BW=48.73GiB/s, IOS/call=31/32
>> IOPS=99.84M, BW=48.75GiB/s, IOS/call=32/32
>>
>> with memcpy not even in profiles. If you do the actual math of 100M
>> requests per second, and 128b of copying per IOP, then it's almost
>> 12GB/sec of reduced memory bandwidth.
>>
>> Even for lower IOPS production testing, Caleb reports that memcpy()
>> overhead is in the realm of 1.1% of CPU time.
>>
>> v2 of this patchset takes a different approach than v1 did - rather
>> than have the core mark a request as being potentially issued
>> out-of-line, this one adds an io_cold_def ->sqe_copy() helper, and
>> puts the onus on io_uring core to call it appropriately. Outside of
>> that, it also adds an IO_URING_F_INLINE flag so that the copy helper
>> _knows_ if it may sanely copy the SQE, or whether there's a bug in
>> the core and it should just be ended with -EFAULT. Where possible,
>> the actual SQE is also passed in.
> 
> I like the ->sqe_copy() approach. I'm not totally convinced the
> complexity of computing and checking IO_URING_F_INLINE is worth it for
> what's effectively an assertion, but I'm not strongly opposed to it
> either.

It's no extra overhead on the normal issue side, as the mask isn't
conditional. For now it's just belt and suspenders; down the line we can
relax (and remove) some of this on the uring_cmd side.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method
  2025-06-05 19:40 ` [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method Jens Axboe
  2025-06-05 20:05   ` Jens Axboe
@ 2025-06-06 17:36   ` Caleb Sander Mateos
  2025-06-06 21:01     ` Jens Axboe
  1 sibling, 1 reply; 18+ messages in thread
From: Caleb Sander Mateos @ 2025-06-06 17:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Thu, Jun 5, 2025 at 12:47 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Will be called by the core of io_uring, if inline issue is not going
> to be tried for a request. Opcodes can define this handler to defer
> copying of SQE data that should remain stable.
>
> Called with IO_URING_F_INLINE set if this is an inline issue, and that
> flag NOT set if it's an out-of-line call. The handler can use this to
> determine if it's still safe to copy the SQE. The core should always
> guarantee that it will be safe, but by having this flag available the
> handler is able to check and fail.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  io_uring/io_uring.c | 32 ++++++++++++++++++++++++--------
>  io_uring/opdef.h    |  1 +
>  2 files changed, 25 insertions(+), 8 deletions(-)
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 079a95e1bd82..fdf23e81c4ff 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -147,7 +147,7 @@ static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
>                                          bool cancel_all,
>                                          bool is_sqpoll_thread);
>
> -static void io_queue_sqe(struct io_kiocb *req);
> +static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe);
>  static void __io_req_caches_free(struct io_ring_ctx *ctx);
>
>  static __read_mostly DEFINE_STATIC_KEY_FALSE(io_key_has_sqarray);
> @@ -1377,7 +1377,7 @@ void io_req_task_submit(struct io_kiocb *req, io_tw_token_t tw)
>         else if (req->flags & REQ_F_FORCE_ASYNC)
>                 io_queue_iowq(req);
>         else
> -               io_queue_sqe(req);
> +               io_queue_sqe(req, NULL);

Passing NULL here is a bit weird. As I mentioned on patch 1, I think
it would make more sense to consider this task work path a
"non-inline" issue.

>  }
>
>  void io_req_task_queue_fail(struct io_kiocb *req, int ret)
> @@ -1935,14 +1935,30 @@ struct file *io_file_get_normal(struct io_kiocb *req, int fd)
>         return file;
>  }
>
> -static void io_queue_async(struct io_kiocb *req, int ret)
> +static int io_req_sqe_copy(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> +                          unsigned int issue_flags)
> +{
> +       const struct io_cold_def *def = &io_cold_defs[req->opcode];
> +
> +       if (!def->sqe_copy)
> +               return 0;
> +       return def->sqe_copy(req, sqe, issue_flags);
> +}
> +
> +static void io_queue_async(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> +                          int ret)
>         __must_hold(&req->ctx->uring_lock)
>  {
>         if (ret != -EAGAIN || (req->flags & REQ_F_NOWAIT)) {
> +fail:
>                 io_req_defer_failed(req, ret);
>                 return;
>         }
>
> +       ret = io_req_sqe_copy(req, sqe, 0);
> +       if (unlikely(ret))
> +               goto fail;

It seems possible to avoid the goto by just adding "|| unlikely(ret =
io_req_sqe_copy(req, sqe, 0))" to the if condition above. But the
control flow isn't super convoluted with the goto, so I don't feel
strongly.

> +
>         switch (io_arm_poll_handler(req, 0)) {
>         case IO_APOLL_READY:
>                 io_kbuf_recycle(req, 0);
> @@ -1957,7 +1973,7 @@ static void io_queue_async(struct io_kiocb *req, int ret)
>         }
>  }
>
> -static inline void io_queue_sqe(struct io_kiocb *req)
> +static inline void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>         __must_hold(&req->ctx->uring_lock)
>  {
>         int ret;
> @@ -1970,7 +1986,7 @@ static inline void io_queue_sqe(struct io_kiocb *req)
>          * doesn't support non-blocking read/write attempts
>          */
>         if (unlikely(ret))
> -               io_queue_async(req, ret);
> +               io_queue_async(req, sqe, ret);
>  }
>
>  static void io_queue_sqe_fallback(struct io_kiocb *req)
> @@ -2200,7 +2216,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>                 link->last = req;
>
>                 if (req->flags & IO_REQ_LINK_FLAGS)
> -                       return 0;
> +                       return io_req_sqe_copy(req, sqe, IO_URING_F_INLINE);

This only copies the SQE for the middle reqs in a linked chain. Don't
we need to copy it for the last req too? I would call
io_req_sqe_copy() unconditionally before the req->flags &
IO_REQ_LINK_FLAGS check.

>                 /* last request of the link, flush it */
>                 req = link->head;
>                 link->head = NULL;
> @@ -2216,10 +2232,10 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>  fallback:
>                         io_queue_sqe_fallback(req);
>                 }
> -               return 0;
> +               return io_req_sqe_copy(req, sqe, IO_URING_F_INLINE);

The first req in a linked chain hits this code path too, but it
doesn't usually need its SQE copied since it will still be issued
synchronously by default. (The io_queue_sqe_fallback() to handle
unterminated links in io_submit_state_end() might be an exception.)
But I am fine with copying the SQE for the first req too if it keeps
the code simpler. If the first req in a chain also has
REQ_F_FORCE_ASYNC | REQ_F_FAIL set, it will actually reach this
io_req_sqe_copy() twice (the second time from the goto fallback path).
I think the io_req_sqe_copy() can just move into the fallback block?

>         }
>
> -       io_queue_sqe(req);
> +       io_queue_sqe(req, sqe);
>         return 0;
>  }
>
> diff --git a/io_uring/opdef.h b/io_uring/opdef.h
> index 719a52104abe..71bfaa3c8afd 100644
> --- a/io_uring/opdef.h
> +++ b/io_uring/opdef.h
> @@ -38,6 +38,7 @@ struct io_issue_def {
>  struct io_cold_def {
>         const char              *name;
>
> +       int (*sqe_copy)(struct io_kiocb *, const struct io_uring_sqe *, unsigned int issue_flags);

nit: this line is a tad long

Best,
Caleb

>         void (*cleanup)(struct io_kiocb *);
>         void (*fail)(struct io_kiocb *);
>  };
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/4] io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup()
  2025-06-05 19:40 ` [PATCH 3/4] io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup() Jens Axboe
@ 2025-06-06 17:37   ` Caleb Sander Mateos
  0 siblings, 0 replies; 18+ messages in thread
From: Caleb Sander Mateos @ 2025-06-06 17:37 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Thu, Jun 5, 2025 at 12:47 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> It's a pretty pointless helper, just allocates and copies data. Fold it
> into io_uring_cmd_prep().
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-05 19:40 ` [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies Jens Axboe
@ 2025-06-06 17:39   ` Caleb Sander Mateos
  2025-06-06 21:05     ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Caleb Sander Mateos @ 2025-06-06 17:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Thu, Jun 5, 2025 at 12:47 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> uring_cmd currently copies the full SQE at prep time, just in case it
> needs it to be stable. Opt in to using ->sqe_copy() to let the core of
> io_uring decide when to copy SQEs.
>
> This provides two checks to see if ioucmd->sqe is still valid:
>
> 1) If ioucmd->sqe is not the uring copied version AND IO_URING_F_INLINE
>    isn't set, then the core of io_uring has a bug. Warn and return
>    -EFAULT.
>
> 2) If sqe is NULL AND IO_URING_F_INLINE isn't set, then the core of
>    io_uring has a bug. Warn and return -EFAULT.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  io_uring/opdef.c     |  1 +
>  io_uring/uring_cmd.c | 35 ++++++++++++++++++++++++-----------
>  io_uring/uring_cmd.h |  2 ++
>  3 files changed, 27 insertions(+), 11 deletions(-)
>
> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
> index 6e0882b051f9..287f9a23b816 100644
> --- a/io_uring/opdef.c
> +++ b/io_uring/opdef.c
> @@ -759,6 +759,7 @@ const struct io_cold_def io_cold_defs[] = {
>         },
>         [IORING_OP_URING_CMD] = {
>                 .name                   = "URING_CMD",
> +               .sqe_copy               = io_uring_cmd_sqe_copy,
>                 .cleanup                = io_uring_cmd_cleanup,
>         },
>         [IORING_OP_SEND_ZC] = {
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index e204f4941d72..f682b9d442e1 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -205,16 +205,25 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>         if (!ac)
>                 return -ENOMEM;
>         ac->data.op_data = NULL;
> +       ioucmd->sqe = sqe;
> +       return 0;
> +}
> +
> +int io_uring_cmd_sqe_copy(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> +                         unsigned int issue_flags)

Is it necessary to pass the sqe? Wouldn't it always be ioucmd->sqe?
Presumably any other opcode that implements ->sqe_copy() would also
have the sqe pointer stashed somewhere. Seems like it would simplify
the core io_uring code a bit not to have to thread the sqe through
several function calls.

> +{
> +       struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> +       struct io_async_cmd *ac = req->async_data;
> +
> +       if (sqe != ac->sqes) {

Maybe return early if sqe == ac->sqes to reduce indentation?
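i.e. something like (untested, just restating the same checks with an
early return):

        if (sqe == ac->sqes)
                return 0;
        if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_INLINE)))
                return -EFAULT;
        if (WARN_ON_ONCE(!sqe))
                return -EFAULT;
        memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
        ioucmd->sqe = ac->sqes;
        return 0;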

> +               if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_INLINE)))
> +                       return -EFAULT;
> +               if (WARN_ON_ONCE(!sqe))
> +                       return -EFAULT;
> +               memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
> +               ioucmd->sqe = ac->sqes;
> +       }
>
> -       /*
> -        * Unconditionally cache the SQE for now - this is only needed for
> -        * requests that go async, but prep handlers must ensure that any
> -        * sqe data is stable beyond prep. Since uring_cmd is special in
> -        * that it doesn't read in per-op data, play it safe and ensure that
> -        * any SQE data is stable beyond prep. This can later get relaxed.
> -        */
> -       memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
> -       ioucmd->sqe = ac->sqes;
>         return 0;
>  }
>
> @@ -251,8 +260,12 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>         }
>
>         ret = file->f_op->uring_cmd(ioucmd, issue_flags);
> -       if (ret == -EAGAIN || ret == -EIOCBQUEUED)
> -               return ret;
> +       if (ret == -EAGAIN) {
> +               io_uring_cmd_sqe_copy(req, ioucmd->sqe, issue_flags);

Is it necessary to call io_uring_cmd_sqe_copy() here? Won't the call
in io_queue_async() already handle this case?

> +               return -EAGAIN;
> +       } else if (ret == -EIOCBQUEUED) {

nit: else could be omitted since the if case diverges

Best,
Caleb

> +               return -EIOCBQUEUED;
> +       }
>         if (ret < 0)
>                 req_set_fail(req);
>         io_req_uring_cleanup(req, issue_flags);
> diff --git a/io_uring/uring_cmd.h b/io_uring/uring_cmd.h
> index e6a5142c890e..f956b0e7c351 100644
> --- a/io_uring/uring_cmd.h
> +++ b/io_uring/uring_cmd.h
> @@ -11,6 +11,8 @@ struct io_async_cmd {
>
>  int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags);
>  int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
> +int io_uring_cmd_sqe_copy(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> +                         unsigned int issue_flags);
>  void io_uring_cmd_cleanup(struct io_kiocb *req);
>
>  bool io_uring_try_cancel_uring_cmd(struct io_ring_ctx *ctx,
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method
  2025-06-06 17:36   ` Caleb Sander Mateos
@ 2025-06-06 21:01     ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2025-06-06 21:01 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring

On 6/6/25 11:36 AM, Caleb Sander Mateos wrote:
> On Thu, Jun 5, 2025 at 12:47 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Will be called by the core of io_uring, if inline issue is not going
>> to be tried for a request. Opcodes can define this handler to defer
>> copying of SQE data that should remain stable.
>>
>> Called with IO_URING_F_INLINE set if this is an inline issue, and that
>> flag NOT set if it's an out-of-line call. The handler can use this to
>> determine if it's still safe to copy the SQE. The core should always
>> guarantee that it will be safe, but by having this flag available the
>> handler is able to check and fail.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  io_uring/io_uring.c | 32 ++++++++++++++++++++++++--------
>>  io_uring/opdef.h    |  1 +
>>  2 files changed, 25 insertions(+), 8 deletions(-)
>>
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 079a95e1bd82..fdf23e81c4ff 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -147,7 +147,7 @@ static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
>>                                          bool cancel_all,
>>                                          bool is_sqpoll_thread);
>>
>> -static void io_queue_sqe(struct io_kiocb *req);
>> +static void io_queue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe);
>>  static void __io_req_caches_free(struct io_ring_ctx *ctx);
>>
>>  static __read_mostly DEFINE_STATIC_KEY_FALSE(io_key_has_sqarray);
>> @@ -1377,7 +1377,7 @@ void io_req_task_submit(struct io_kiocb *req, io_tw_token_t tw)
>>         else if (req->flags & REQ_F_FORCE_ASYNC)
>>                 io_queue_iowq(req);
>>         else
>> -               io_queue_sqe(req);
>> +               io_queue_sqe(req, NULL);
> 
> Passing NULL here is a bit weird. As I mentioned on patch 1, I think
> it would make more sense to consider this task work path a
> "non-inline" issue.

It very much _is_ a non-inline issue; the sqe being NULL is how the
->sqe_copy() would treat it. But it is a bit confusing in that the
callback would need to check

if (!sqe || !(flags & INLINE))

as the condition for whether or not the sqe is valid. Probably better to
make the INLINE flag dependent on sqe != NULL as well, and then that's
all the callback would need to check for.
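
IOW, roughly this on the callback side (a sketch, not the final code):

        /* core guarantees a valid sqe whenever INLINE is set */
        if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_INLINE)))
                return -EFAULT;
        memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
        ioucmd->sqe = ac->sqes;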

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag
  2025-06-06 17:31   ` Caleb Sander Mateos
@ 2025-06-06 21:02     ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2025-06-06 21:02 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring

On 6/6/25 11:31 AM, Caleb Sander Mateos wrote:
> On Thu, Jun 5, 2025 at 12:47 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Set when the execution of the request is done inline from the system
>> call itself. Any deferred issue will never have this flag set.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  include/linux/io_uring_types.h | 2 ++
>>  io_uring/io_uring.c            | 3 ++-
>>  2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index 2922635986f5..054c43c02c96 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -26,6 +26,8 @@ enum io_uring_cmd_flags {
>>         IO_URING_F_MULTISHOT            = 4,
>>         /* executed by io-wq */
>>         IO_URING_F_IOWQ                 = 8,
>> +       /* executed inline from syscall */
>> +       IO_URING_F_INLINE               = 16,
>>         /* int's last bit, sign checks are usually faster than a bit test */
>>         IO_URING_F_NONBLOCK             = INT_MIN,
>>
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index cf759c172083..079a95e1bd82 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -1962,7 +1962,8 @@ static inline void io_queue_sqe(struct io_kiocb *req)
>>  {
>>         int ret;
>>
>> -       ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
>> +       ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER|
>> +                               IO_URING_F_INLINE);
> 
> Isn't io_queue_sqe() also called from io_req_task_submit(), which is
> an io_req_tw_func_t callback? Task work runs after the SQ slots have
> already been returned to userspace, right? Before the unconditional
> memcpy() was added, we had observed requests in linked chains with
> corrupted SQEs due to the async task work issue.

It's really just a patch ordering issue, as per the other email.

> As is, it looks like IO_URING_F_INLINE is just the inverse of
> IO_URING_F_IOWQ, so it may not be necessary to add a new flag. But I
> can see how core io_uring might add additional async issue cases in
> the future.

I'd greatly prefer a separate flag; it's more robust than tying it
to some other flag.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-06 17:39   ` Caleb Sander Mateos
@ 2025-06-06 21:05     ` Jens Axboe
  2025-06-06 22:08       ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2025-06-06 21:05 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring

On 6/6/25 11:39 AM, Caleb Sander Mateos wrote:
> On Thu, Jun 5, 2025 at 12:47 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> uring_cmd currently copies the full SQE at prep time, just in case it
>> needs it to be stable. Opt in to using ->sqe_copy() to let the core of
>> io_uring decide when to copy SQEs.
>>
>> This provides two checks to see if ioucmd->sqe is still valid:
>>
>> 1) If ioucmd->sqe is not the uring copied version AND IO_URING_F_INLINE
>>    isn't set, then the core of io_uring has a bug. Warn and return
>>    -EFAULT.
>>
>> 2) If sqe is NULL AND IO_URING_F_INLINE isn't set, then the core of
>>    io_uring has a bug. Warn and return -EFAULT.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  io_uring/opdef.c     |  1 +
>>  io_uring/uring_cmd.c | 35 ++++++++++++++++++++++++-----------
>>  io_uring/uring_cmd.h |  2 ++
>>  3 files changed, 27 insertions(+), 11 deletions(-)
>>
>> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
>> index 6e0882b051f9..287f9a23b816 100644
>> --- a/io_uring/opdef.c
>> +++ b/io_uring/opdef.c
>> @@ -759,6 +759,7 @@ const struct io_cold_def io_cold_defs[] = {
>>         },
>>         [IORING_OP_URING_CMD] = {
>>                 .name                   = "URING_CMD",
>> +               .sqe_copy               = io_uring_cmd_sqe_copy,
>>                 .cleanup                = io_uring_cmd_cleanup,
>>         },
>>         [IORING_OP_SEND_ZC] = {
>> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
>> index e204f4941d72..f682b9d442e1 100644
>> --- a/io_uring/uring_cmd.c
>> +++ b/io_uring/uring_cmd.c
>> @@ -205,16 +205,25 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>         if (!ac)
>>                 return -ENOMEM;
>>         ac->data.op_data = NULL;
>> +       ioucmd->sqe = sqe;
>> +       return 0;
>> +}
>> +
>> +int io_uring_cmd_sqe_copy(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>> +                         unsigned int issue_flags)
> 
> Is it necessary to pass the sqe? Wouldn't it always be ioucmd->sqe?
> Presumably any other opcode that implements ->sqe_copy() would also
> have the sqe pointer stashed somewhere. Seems like it would simplify
> the core io_uring code a bit not to have to thread the sqe through
> several function calls.

It's not necessary, but I would rather get rid of needing to store an
SQE since that is a bit iffy than get rid of passing the SQE. When it
comes from the core, you _know_ it's going to be valid. I feel like you
need a fairly intimate understanding of io_uring issue flow to make any
judgement on this, if you were adding an opcode and defining this type
of handler.


>> +{
>> +       struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
>> +       struct io_async_cmd *ac = req->async_data;
>> +
>> +       if (sqe != ac->sqes) {
> 
> Maybe return early if sqe == ac->sqes to reduce indentation?

Heh, the current -git version has it like that, since yesterday. So
yeah, I agree.

>> @@ -251,8 +260,12 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>>         }
>>
>>         ret = file->f_op->uring_cmd(ioucmd, issue_flags);
>> -       if (ret == -EAGAIN || ret == -EIOCBQUEUED)
>> -               return ret;
>> +       if (ret == -EAGAIN) {
>> +               io_uring_cmd_sqe_copy(req, ioucmd->sqe, issue_flags);
> 
> Is it necessary to call io_uring_cmd_sqe_copy() here? Won't the call
> in io_queue_async() already handle this case?
> 

Good point, yes it should not be necessary at all. I'll kill it.

>> +               return -EAGAIN;
>> +       } else if (ret == -EIOCBQUEUED) {
> 
> nit: else could be omitted since the if case diverges

With the above removal of the cmd_sqe_copy() call, this entire hunk
goes away.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-06 21:05     ` Jens Axboe
@ 2025-06-06 22:08       ` Jens Axboe
  2025-06-06 22:09         ` Caleb Sander Mateos
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Axboe @ 2025-06-06 22:08 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring

On 6/6/25 3:05 PM, Jens Axboe wrote:
>> Is it necessary to pass the sqe? Wouldn't it always be ioucmd->sqe?
>> Presumably any other opcode that implements ->sqe_copy() would also
>> have the sqe pointer stashed somewhere. Seems like it would simplify
>> the core io_uring code a bit not to have to thread the sqe through
>> several function calls.
> 
> It's not necessary, but I would rather get rid of needing to store an
> SQE since that is a bit iffy than get rid of passing the SQE. When it
> comes from the core, you _know_ it's going to be valid. I feel like you
> need a fairly intimate understanding of io_uring issue flow to make any
> judgement on this, if you were adding an opcode and defining this type
> of handler.

Actually did go that route anyway, because we still need to stash it.
And if we do go that route, then we can keep all the checking in the
core and leave the handler just a basic copy with a void return. Which
is pretty nice.
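
Roughly this shape (sketch only, not the literal v3 code - the branch
has the real thing):

        void io_uring_cmd_sqe_copy(struct io_kiocb *req)
        {
                struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
                struct io_async_cmd *ac = req->async_data;

                /* nothing to do if we already point at the stable copy */
                if (ioucmd->sqe == ac->sqes)
                        return;
                memcpy(ac->sqes, ioucmd->sqe, uring_sqe_size(req->ctx));
                ioucmd->sqe = ac->sqes;
        }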

Anyway, check out v3 and see what you think.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-06 22:08       ` Jens Axboe
@ 2025-06-06 22:09         ` Caleb Sander Mateos
  2025-06-06 23:53           ` Jens Axboe
  0 siblings, 1 reply; 18+ messages in thread
From: Caleb Sander Mateos @ 2025-06-06 22:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Fri, Jun 6, 2025 at 3:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 6/6/25 3:05 PM, Jens Axboe wrote:
> >> Is it necessary to pass the sqe? Wouldn't it always be ioucmd->sqe?
> >> Presumably any other opcode that implements ->sqe_copy() would also
> >> have the sqe pointer stashed somewhere. Seems like it would simplify
> >> the core io_uring code a bit not to have to thread the sqe through
> >> several function calls.
> >
> > It's not necessary, but I would rather get rid of needing to store an
> > SQE since that is a bit iffy than get rid of passing the SQE. When it
> > comes from the core, you _know_ it's going to be valid. I feel like you
> > need a fairly intimate understanding of io_uring issue flow to make any
> > judgement on this, if you were adding an opcode and defining this type
> > of handler.
>
> Actually did go that route anyway, because we still need to stash it.
> And if we do go that route, then we can keep all the checking in the
> core and leave the handler just a basic copy with a void return. Which
> is pretty nice.
>
> Anyway, check out v3 and see what you think.

From a quick glance it looks good to me. Let me give it a more
detailed look with a fresh pair of eyes.

Best,
Caleb

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-06 22:09         ` Caleb Sander Mateos
@ 2025-06-06 23:53           ` Jens Axboe
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2025-06-06 23:53 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring

On 6/6/25 4:09 PM, Caleb Sander Mateos wrote:
> On Fri, Jun 6, 2025 at 3:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 6/6/25 3:05 PM, Jens Axboe wrote:
>>>> Is it necessary to pass the sqe? Wouldn't it always be ioucmd->sqe?
>>>> Presumably any other opcode that implements ->sqe_copy() would also
>>>> have the sqe pointer stashed somewhere. Seems like it would simplify
>>>> the core io_uring code a bit not to have to thread the sqe through
>>>> several function calls.
>>>
>>> It's not necessary, but I would rather get rid of needing to store an
>>> SQE since that is a bit iffy than get rid of passing the SQE. When it
>>> comes from the core, you _know_ it's going to be valid. I feel like you
>>> need a fairly intimate understanding of io_uring issue flow to make any
>>> judgement on this, if you were adding an opcode and defining this type
>>> of handler.
>>
>> Actually did go that route anyway, because we still need to stash it.
>> And if we do go that route, then we can keep all the checking in the
>> core and leave the handler just a basic copy with a void return. Which
>> is pretty nice.
>>
>> Anyway, check out v3 and see what you think.
> 
> From a quick glance it looks good to me. Let me give it a more
> detailed look with a fresh pair of eyes.

Thanks, that'd be great. Ignore the commit message for patch 4,
I did update it but didn't amend before sending out... The one
in the git tree is accurate.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-06-06 23:53 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-05 19:40 [PATCHSET RFC v2 0/4] uring_cmd copy avoidance Jens Axboe
2025-06-05 19:40 ` [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag Jens Axboe
2025-06-06 17:31   ` Caleb Sander Mateos
2025-06-06 21:02     ` Jens Axboe
2025-06-05 19:40 ` [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method Jens Axboe
2025-06-05 20:05   ` Jens Axboe
2025-06-06 17:36   ` Caleb Sander Mateos
2025-06-06 21:01     ` Jens Axboe
2025-06-05 19:40 ` [PATCH 3/4] io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup() Jens Axboe
2025-06-06 17:37   ` Caleb Sander Mateos
2025-06-05 19:40 ` [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies Jens Axboe
2025-06-06 17:39   ` Caleb Sander Mateos
2025-06-06 21:05     ` Jens Axboe
2025-06-06 22:08       ` Jens Axboe
2025-06-06 22:09         ` Caleb Sander Mateos
2025-06-06 23:53           ` Jens Axboe
2025-06-06 17:29 ` [PATCHSET RFC v2 0/4] uring_cmd copy avoidance Caleb Sander Mateos
2025-06-06 17:32   ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox