public inbox for io-uring@vger.kernel.org
* [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-05 19:40 [PATCHSET RFC v2 0/4] uring_cmd copy avoidance Jens Axboe
@ 2025-06-05 19:40 ` Jens Axboe
  2025-06-06 17:39   ` Caleb Sander Mateos
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2025-06-05 19:40 UTC (permalink / raw)
  To: io-uring; +Cc: csander, Jens Axboe

uring_cmd currently copies the full SQE at prep time, just in case it
needs it to be stable. Opt in to using ->sqe_copy() to let the core of
io_uring decide when to copy SQEs.

This provides two checks to see if ioucmd->sqe is still valid:

1) If ioucmd->sqe is not the uring copied version AND IO_URING_F_INLINE
   isn't set, then the core of io_uring has a bug. Warn and return
   -EFAULT.

2) If sqe is NULL AND IO_URING_F_INLINE isn't set, then the core of
   io_uring has a bug. Warn and return -EFAULT.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/opdef.c     |  1 +
 io_uring/uring_cmd.c | 35 ++++++++++++++++++++++++-----------
 io_uring/uring_cmd.h |  2 ++
 3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 6e0882b051f9..287f9a23b816 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -759,6 +759,7 @@ const struct io_cold_def io_cold_defs[] = {
 	},
 	[IORING_OP_URING_CMD] = {
 		.name			= "URING_CMD",
+		.sqe_copy		= io_uring_cmd_sqe_copy,
 		.cleanup		= io_uring_cmd_cleanup,
 	},
 	[IORING_OP_SEND_ZC] = {
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index e204f4941d72..f682b9d442e1 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -205,16 +205,25 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (!ac)
 		return -ENOMEM;
 	ac->data.op_data = NULL;
+	ioucmd->sqe = sqe;
+	return 0;
+}
+
+int io_uring_cmd_sqe_copy(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			  unsigned int issue_flags)
+{
+	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	struct io_async_cmd *ac = req->async_data;
+
+	if (sqe != ac->sqes) {
+		if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_INLINE)))
+			return -EFAULT;
+		if (WARN_ON_ONCE(!sqe))
+			return -EFAULT;
+		memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
+		ioucmd->sqe = ac->sqes;
+	}
 
-	/*
-	 * Unconditionally cache the SQE for now - this is only needed for
-	 * requests that go async, but prep handlers must ensure that any
-	 * sqe data is stable beyond prep. Since uring_cmd is special in
-	 * that it doesn't read in per-op data, play it safe and ensure that
-	 * any SQE data is stable beyond prep. This can later get relaxed.
-	 */
-	memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
-	ioucmd->sqe = ac->sqes;
 	return 0;
 }
 
@@ -251,8 +260,12 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
 	}
 
 	ret = file->f_op->uring_cmd(ioucmd, issue_flags);
-	if (ret == -EAGAIN || ret == -EIOCBQUEUED)
-		return ret;
+	if (ret == -EAGAIN) {
+		io_uring_cmd_sqe_copy(req, ioucmd->sqe, issue_flags);
+		return -EAGAIN;
+	} else if (ret == -EIOCBQUEUED) {
+		return -EIOCBQUEUED;
+	}
 	if (ret < 0)
 		req_set_fail(req);
 	io_req_uring_cleanup(req, issue_flags);
diff --git a/io_uring/uring_cmd.h b/io_uring/uring_cmd.h
index e6a5142c890e..f956b0e7c351 100644
--- a/io_uring/uring_cmd.h
+++ b/io_uring/uring_cmd.h
@@ -11,6 +11,8 @@ struct io_async_cmd {
 
 int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags);
 int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_uring_cmd_sqe_copy(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			  unsigned int issue_flags);
 void io_uring_cmd_cleanup(struct io_kiocb *req);
 
 bool io_uring_try_cancel_uring_cmd(struct io_ring_ctx *ctx,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-05 19:40 ` [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies Jens Axboe
@ 2025-06-06 17:39   ` Caleb Sander Mateos
  2025-06-06 21:05     ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Caleb Sander Mateos @ 2025-06-06 17:39 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Thu, Jun 5, 2025 at 12:47 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> uring_cmd currently copies the full SQE at prep time, just in case it
> needs it to be stable. Opt in to using ->sqe_copy() to let the core of
> io_uring decide when to copy SQEs.
>
> This provides two checks to see if ioucmd->sqe is still valid:
>
> 1) If ioucmd->sqe is not the uring copied version AND IO_URING_F_INLINE
>    isn't set, then the core of io_uring has a bug. Warn and return
>    -EFAULT.
>
> 2) If sqe is NULL AND IO_URING_F_INLINE isn't set, then the core of
>    io_uring has a bug. Warn and return -EFAULT.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  io_uring/opdef.c     |  1 +
>  io_uring/uring_cmd.c | 35 ++++++++++++++++++++++++-----------
>  io_uring/uring_cmd.h |  2 ++
>  3 files changed, 27 insertions(+), 11 deletions(-)
>
> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
> index 6e0882b051f9..287f9a23b816 100644
> --- a/io_uring/opdef.c
> +++ b/io_uring/opdef.c
> @@ -759,6 +759,7 @@ const struct io_cold_def io_cold_defs[] = {
>         },
>         [IORING_OP_URING_CMD] = {
>                 .name                   = "URING_CMD",
> +               .sqe_copy               = io_uring_cmd_sqe_copy,
>                 .cleanup                = io_uring_cmd_cleanup,
>         },
>         [IORING_OP_SEND_ZC] = {
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index e204f4941d72..f682b9d442e1 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -205,16 +205,25 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>         if (!ac)
>                 return -ENOMEM;
>         ac->data.op_data = NULL;
> +       ioucmd->sqe = sqe;
> +       return 0;
> +}
> +
> +int io_uring_cmd_sqe_copy(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> +                         unsigned int issue_flags)

Is it necessary to pass the sqe? Wouldn't it always be ioucmd->sqe?
Presumably any other opcode that implements ->sqe_copy() would also
have the sqe pointer stashed somewhere. Seems like it would simplify
the core io_uring code a bit not to have to thread the sqe through
several function calls.
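
E.g. (just a sketch of what I mean), the copy handler prototype could
then shrink to:

	void io_uring_cmd_sqe_copy(struct io_kiocb *req);

and pick the SQE up from ioucmd->sqe internally.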

> +{
> +       struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> +       struct io_async_cmd *ac = req->async_data;
> +
> +       if (sqe != ac->sqes) {

Maybe return early if sqe == ac->sqes to reduce indentation?
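
Something like this, perhaps (sketch only, same checks as above):

	if (sqe == ac->sqes)
		return 0;
	if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_INLINE)))
		return -EFAULT;
	if (WARN_ON_ONCE(!sqe))
		return -EFAULT;
	memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
	ioucmd->sqe = ac->sqes;
	return 0;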

> +               if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_INLINE)))
> +                       return -EFAULT;
> +               if (WARN_ON_ONCE(!sqe))
> +                       return -EFAULT;
> +               memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
> +               ioucmd->sqe = ac->sqes;
> +       }
>
> -       /*
> -        * Unconditionally cache the SQE for now - this is only needed for
> -        * requests that go async, but prep handlers must ensure that any
> -        * sqe data is stable beyond prep. Since uring_cmd is special in
> -        * that it doesn't read in per-op data, play it safe and ensure that
> -        * any SQE data is stable beyond prep. This can later get relaxed.
> -        */
> -       memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
> -       ioucmd->sqe = ac->sqes;
>         return 0;
>  }
>
> @@ -251,8 +260,12 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>         }
>
>         ret = file->f_op->uring_cmd(ioucmd, issue_flags);
> -       if (ret == -EAGAIN || ret == -EIOCBQUEUED)
> -               return ret;
> +       if (ret == -EAGAIN) {
> +               io_uring_cmd_sqe_copy(req, ioucmd->sqe, issue_flags);

Is it necessary to call io_uring_cmd_sqe_copy() here? Won't the call
in io_queue_async() already handle this case?

> +               return -EAGAIN;
> +       } else if (ret == -EIOCBQUEUED) {

nit: else could be omitted since the if case diverges

Best,
Caleb

> +               return -EIOCBQUEUED;
> +       }
>         if (ret < 0)
>                 req_set_fail(req);
>         io_req_uring_cleanup(req, issue_flags);
> diff --git a/io_uring/uring_cmd.h b/io_uring/uring_cmd.h
> index e6a5142c890e..f956b0e7c351 100644
> --- a/io_uring/uring_cmd.h
> +++ b/io_uring/uring_cmd.h
> @@ -11,6 +11,8 @@ struct io_async_cmd {
>
>  int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags);
>  int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
> +int io_uring_cmd_sqe_copy(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> +                         unsigned int issue_flags);
>  void io_uring_cmd_cleanup(struct io_kiocb *req);
>
>  bool io_uring_try_cancel_uring_cmd(struct io_ring_ctx *ctx,
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-06 17:39   ` Caleb Sander Mateos
@ 2025-06-06 21:05     ` Jens Axboe
  2025-06-06 22:08       ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2025-06-06 21:05 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring

On 6/6/25 11:39 AM, Caleb Sander Mateos wrote:
> On Thu, Jun 5, 2025 at 12:47 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> uring_cmd currently copies the full SQE at prep time, just in case it
>> needs it to be stable. Opt in to using ->sqe_copy() to let the core of
>> io_uring decide when to copy SQEs.
>>
>> This provides two checks to see if ioucmd->sqe is still valid:
>>
>> 1) If ioucmd->sqe is not the uring copied version AND IO_URING_F_INLINE
>>    isn't set, then the core of io_uring has a bug. Warn and return
>>    -EFAULT.
>>
>> 2) If sqe is NULL AND IO_URING_F_INLINE isn't set, then the core of
>>    io_uring has a bug. Warn and return -EFAULT.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  io_uring/opdef.c     |  1 +
>>  io_uring/uring_cmd.c | 35 ++++++++++++++++++++++++-----------
>>  io_uring/uring_cmd.h |  2 ++
>>  3 files changed, 27 insertions(+), 11 deletions(-)
>>
>> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
>> index 6e0882b051f9..287f9a23b816 100644
>> --- a/io_uring/opdef.c
>> +++ b/io_uring/opdef.c
>> @@ -759,6 +759,7 @@ const struct io_cold_def io_cold_defs[] = {
>>         },
>>         [IORING_OP_URING_CMD] = {
>>                 .name                   = "URING_CMD",
>> +               .sqe_copy               = io_uring_cmd_sqe_copy,
>>                 .cleanup                = io_uring_cmd_cleanup,
>>         },
>>         [IORING_OP_SEND_ZC] = {
>> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
>> index e204f4941d72..f682b9d442e1 100644
>> --- a/io_uring/uring_cmd.c
>> +++ b/io_uring/uring_cmd.c
>> @@ -205,16 +205,25 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>         if (!ac)
>>                 return -ENOMEM;
>>         ac->data.op_data = NULL;
>> +       ioucmd->sqe = sqe;
>> +       return 0;
>> +}
>> +
>> +int io_uring_cmd_sqe_copy(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>> +                         unsigned int issue_flags)
> 
> Is it necessary to pass the sqe? Wouldn't it always be ioucmd->sqe?
> Presumably any other opcode that implements ->sqe_copy() would also
> have the sqe pointer stashed somewhere. Seems like it would simplify
> the core io_uring code a bit not to have to thread the sqe through
> several function calls.

It's not necessary, but I would rather get rid of needing to store an
SQE since that is a bit iffy than get rid of passing the SQE. When it
comes from the core, you _know_ it's going to be valid. I feel like you
need a fairly intimate understanding of io_uring issue flow to make any
judgement on this, if you were adding an opcode and defining this type
of handler.


>> +{
>> +       struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
>> +       struct io_async_cmd *ac = req->async_data;
>> +
>> +       if (sqe != ac->sqes) {
> 
> Maybe return early if sqe == ac->sqes to reduce indentation?

Heh, the current -git version has it like that, since yesterday. So
yeah, I agree.

>> @@ -251,8 +260,12 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>>         }
>>
>>         ret = file->f_op->uring_cmd(ioucmd, issue_flags);
>> -       if (ret == -EAGAIN || ret == -EIOCBQUEUED)
>> -               return ret;
>> +       if (ret == -EAGAIN) {
>> +               io_uring_cmd_sqe_copy(req, ioucmd->sqe, issue_flags);
> 
> Is it necessary to call io_uring_cmd_sqe_copy() here? Won't the call
> in io_queue_async() already handle this case?
> 

Good point, yes it should not be necessary at all. I'll kill it.

>> +               return -EAGAIN;
>> +       } else if (ret == -EIOCBQUEUED) {
> 
> nit: else could be omitted since the if case diverges

With the above removal of the cmd_sqe_copy() call, this entire hunk
goes away.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCHSET v3 0/4] uring_cmd copy avoidance
@ 2025-06-06 21:54 Jens Axboe
  2025-06-06 21:54 ` [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag Jens Axboe
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Jens Axboe @ 2025-06-06 21:54 UTC (permalink / raw)
  To: io-uring; +Cc: csander

Hi,

Currently uring_cmd unconditionally copies the SQE at prep time, as it
has no other choice - the SQE data must remain stable after submit.
This can lead to excessive memory bandwidth being used for that copy,
as passthrough will often use 128b SQEs, and to efficiency concerns, as
those copies can also consume quite a lot of CPU cycles.

As a quick test, running the current -git kernel on a box with 23
NVMe drives doing passthrough IO, memcpy() is the highest cycle user
at 9.05%, which is all off the uring_cmd prep path. The test case is
a 512b random read, which runs at 91-92M IOPS.

With these patches, memcpy() is gone from the profiles, and it runs
at 98-99M IOPS, or about 7-8% faster.

Before:

IOPS=91.12M, BW=44.49GiB/s, IOS/call=32/32
IOPS=91.16M, BW=44.51GiB/s, IOS/call=32/32
IOPS=91.18M, BW=44.52GiB/s, IOS/call=31/32
IOPS=91.92M, BW=44.88GiB/s, IOS/call=32/32
IOPS=91.88M, BW=44.86GiB/s, IOS/call=32/32
IOPS=91.82M, BW=44.83GiB/s, IOS/call=32/31
IOPS=91.52M, BW=44.69GiB/s, IOS/call=32/32

with the top perf report -g --no-children being:

+    9.07%  io_uring  [kernel.kallsyms]  [k] memcpy

and after:

# bash run-peak-pass.sh
[...]
IOPS=99.30M, BW=48.49GiB/s, IOS/call=32/32
IOPS=99.27M, BW=48.47GiB/s, IOS/call=31/32
IOPS=99.60M, BW=48.63GiB/s, IOS/call=32/32
IOPS=99.68M, BW=48.67GiB/s, IOS/call=32/31
IOPS=99.80M, BW=48.73GiB/s, IOS/call=31/32
IOPS=99.84M, BW=48.75GiB/s, IOS/call=32/32

with memcpy not even in profiles. If you do the actual math of 100M
requests per second, and 128b of copying per IOP, then it's about
12.8GB/sec of reduced memory bandwidth.

Even for lower IOPS production testing, Caleb reports that memcpy()
overhead is in the realm of 1.1% of CPU time.

v3 cleans up the ->sqe_copy() handling, relying solely on the
IO_URING_F_INLINE flag. The core will handle it now, rejecting commands
if they should not be copied. This results in an -EFAULT failure.

I think this approach is saner, and in fact it can be extended to
reduce over-eager copies in other spots. For now I just did uring_cmd,
and verified that the memcpy's are still gone from my test.

Can also be found here:

https://git.kernel.dk/cgit/linux/log/?h=uring_cmd.2

 include/linux/io_uring_types.h |  2 ++
 io_uring/io_uring.c            | 37 ++++++++++++++++++++++-------
 io_uring/opdef.c               |  1 +
 io_uring/opdef.h               |  1 +
 io_uring/uring_cmd.c           | 43 +++++++++++++++-------------------
 io_uring/uring_cmd.h           |  1 +
 6 files changed, 53 insertions(+), 32 deletions(-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag
  2025-06-06 21:54 [PATCHSET v3 0/4] uring_cmd copy avoidance Jens Axboe
@ 2025-06-06 21:54 ` Jens Axboe
  2025-06-07  0:49   ` Caleb Sander Mateos
  2025-06-06 21:54 ` [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method Jens Axboe
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2025-06-06 21:54 UTC (permalink / raw)
  To: io-uring; +Cc: csander, Jens Axboe

Set when the execution of the request is done inline from the system
call itself. Any deferred issue will never have this flag set.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |  2 ++
 io_uring/io_uring.c            | 12 +++++++-----
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 2922635986f5..054c43c02c96 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -26,6 +26,8 @@ enum io_uring_cmd_flags {
 	IO_URING_F_MULTISHOT		= 4,
 	/* executed by io-wq */
 	IO_URING_F_IOWQ			= 8,
+	/* executed inline from syscall */
+	IO_URING_F_INLINE		= 16,
 	/* int's last bit, sign checks are usually faster than a bit test */
 	IO_URING_F_NONBLOCK		= INT_MIN,
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index cf759c172083..0f9f6a173e66 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -147,7 +147,7 @@ static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 					 bool cancel_all,
 					 bool is_sqpoll_thread);
 
-static void io_queue_sqe(struct io_kiocb *req);
+static void io_queue_sqe(struct io_kiocb *req, unsigned int extra_flags);
 static void __io_req_caches_free(struct io_ring_ctx *ctx);
 
 static __read_mostly DEFINE_STATIC_KEY_FALSE(io_key_has_sqarray);
@@ -1377,7 +1377,7 @@ void io_req_task_submit(struct io_kiocb *req, io_tw_token_t tw)
 	else if (req->flags & REQ_F_FORCE_ASYNC)
 		io_queue_iowq(req);
 	else
-		io_queue_sqe(req);
+		io_queue_sqe(req, 0);
 }
 
 void io_req_task_queue_fail(struct io_kiocb *req, int ret)
@@ -1957,12 +1957,14 @@ static void io_queue_async(struct io_kiocb *req, int ret)
 	}
 }
 
-static inline void io_queue_sqe(struct io_kiocb *req)
+static inline void io_queue_sqe(struct io_kiocb *req, unsigned int extra_flags)
 	__must_hold(&req->ctx->uring_lock)
 {
+	unsigned int issue_flags = IO_URING_F_NONBLOCK |
+				   IO_URING_F_COMPLETE_DEFER | extra_flags;
 	int ret;
 
-	ret = io_issue_sqe(req, IO_URING_F_NONBLOCK|IO_URING_F_COMPLETE_DEFER);
+	ret = io_issue_sqe(req, issue_flags);
 
 	/*
 	 * We async punt it if the file wasn't marked NOWAIT, or if the file
@@ -2218,7 +2220,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		return 0;
 	}
 
-	io_queue_sqe(req);
+	io_queue_sqe(req, IO_URING_F_INLINE);
 	return 0;
 }
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method
  2025-06-06 21:54 [PATCHSET v3 0/4] uring_cmd copy avoidance Jens Axboe
  2025-06-06 21:54 ` [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag Jens Axboe
@ 2025-06-06 21:54 ` Jens Axboe
  2025-06-07  0:50   ` Caleb Sander Mateos
  2025-06-06 21:54 ` [PATCH 3/4] io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup() Jens Axboe
  2025-06-06 21:54 ` [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies Jens Axboe
  3 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2025-06-06 21:54 UTC (permalink / raw)
  To: io-uring; +Cc: csander, Jens Axboe

Will be called by the core of io_uring, if inline issue is not going
to be tried for a request. Opcodes can define this handler to defer
copying of SQE data that should remain stable.

Only called if IO_URING_F_INLINE is set. If it isn't set, then there's a
bug in the core handling of this, and -EFAULT will be returned instead
to terminate the request. This will trigger a WARN_ON_ONCE(). Don't
expect this to ever trigger, and down the line this can be removed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/io_uring.c | 25 ++++++++++++++++++++++---
 io_uring/opdef.h    |  1 +
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0f9f6a173e66..9799a31a2b29 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1935,14 +1935,31 @@ struct file *io_file_get_normal(struct io_kiocb *req, int fd)
 	return file;
 }
 
-static void io_queue_async(struct io_kiocb *req, int ret)
+static int io_req_sqe_copy(struct io_kiocb *req, unsigned int issue_flags)
+{
+	const struct io_cold_def *def = &io_cold_defs[req->opcode];
+
+	if (!def->sqe_copy)
+		return 0;
+	if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_INLINE)))
+		return -EFAULT;
+	def->sqe_copy(req);
+	return 0;
+}
+
+static void io_queue_async(struct io_kiocb *req, unsigned int issue_flags, int ret)
 	__must_hold(&req->ctx->uring_lock)
 {
 	if (ret != -EAGAIN || (req->flags & REQ_F_NOWAIT)) {
+fail:
 		io_req_defer_failed(req, ret);
 		return;
 	}
 
+	ret = io_req_sqe_copy(req, issue_flags);
+	if (unlikely(ret))
+		goto fail;
+
 	switch (io_arm_poll_handler(req, 0)) {
 	case IO_APOLL_READY:
 		io_kbuf_recycle(req, 0);
@@ -1971,7 +1988,7 @@ static inline void io_queue_sqe(struct io_kiocb *req, unsigned int extra_flags)
 	 * doesn't support non-blocking read/write attempts
 	 */
 	if (unlikely(ret))
-		io_queue_async(req, ret);
+		io_queue_async(req, issue_flags, ret);
 }
 
 static void io_queue_sqe_fallback(struct io_kiocb *req)
@@ -1986,6 +2003,8 @@ static void io_queue_sqe_fallback(struct io_kiocb *req)
 		req->flags |= REQ_F_LINK;
 		io_req_defer_failed(req, req->cqe.res);
 	} else {
+		/* can't fail with IO_URING_F_INLINE */
+		io_req_sqe_copy(req, IO_URING_F_INLINE);
 		if (unlikely(req->ctx->drain_active))
 			io_drain_req(req);
 		else
@@ -2201,7 +2220,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		link->last = req;
 
 		if (req->flags & IO_REQ_LINK_FLAGS)
-			return 0;
+			return io_req_sqe_copy(req, IO_URING_F_INLINE);
 		/* last request of the link, flush it */
 		req = link->head;
 		link->head = NULL;
diff --git a/io_uring/opdef.h b/io_uring/opdef.h
index 719a52104abe..c2f0907ed78c 100644
--- a/io_uring/opdef.h
+++ b/io_uring/opdef.h
@@ -38,6 +38,7 @@ struct io_issue_def {
 struct io_cold_def {
 	const char		*name;
 
+	void (*sqe_copy)(struct io_kiocb *);
 	void (*cleanup)(struct io_kiocb *);
 	void (*fail)(struct io_kiocb *);
 };
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/4] io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup()
  2025-06-06 21:54 [PATCHSET v3 0/4] uring_cmd copy avoidance Jens Axboe
  2025-06-06 21:54 ` [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag Jens Axboe
  2025-06-06 21:54 ` [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method Jens Axboe
@ 2025-06-06 21:54 ` Jens Axboe
  2025-06-08 19:57   ` Anuj gupta
  2025-06-06 21:54 ` [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies Jens Axboe
  3 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2025-06-06 21:54 UTC (permalink / raw)
  To: io-uring; +Cc: csander, Jens Axboe

It's a pretty pointless helper, just allocates and copies data. Fold it
into io_uring_cmd_prep().

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/uring_cmd.c | 34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 929cad6ee326..e204f4941d72 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -181,8 +181,7 @@ void io_uring_cmd_done(struct io_uring_cmd *ioucmd, ssize_t ret, u64 res2,
 }
 EXPORT_SYMBOL_GPL(io_uring_cmd_done);
 
-static int io_uring_cmd_prep_setup(struct io_kiocb *req,
-				   const struct io_uring_sqe *sqe)
+int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
 	struct io_async_cmd *ac;
@@ -190,6 +189,18 @@ static int io_uring_cmd_prep_setup(struct io_kiocb *req,
 	/* see io_uring_cmd_get_async_data() */
 	BUILD_BUG_ON(offsetof(struct io_async_cmd, data) != 0);
 
+	if (sqe->__pad1)
+		return -EINVAL;
+
+	ioucmd->flags = READ_ONCE(sqe->uring_cmd_flags);
+	if (ioucmd->flags & ~IORING_URING_CMD_MASK)
+		return -EINVAL;
+
+	if (ioucmd->flags & IORING_URING_CMD_FIXED)
+		req->buf_index = READ_ONCE(sqe->buf_index);
+
+	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
+
 	ac = io_uring_alloc_async_data(&req->ctx->cmd_cache, req);
 	if (!ac)
 		return -ENOMEM;
@@ -207,25 +218,6 @@ static int io_uring_cmd_prep_setup(struct io_kiocb *req,
 	return 0;
 }
 
-int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
-{
-	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-
-	if (sqe->__pad1)
-		return -EINVAL;
-
-	ioucmd->flags = READ_ONCE(sqe->uring_cmd_flags);
-	if (ioucmd->flags & ~IORING_URING_CMD_MASK)
-		return -EINVAL;
-
-	if (ioucmd->flags & IORING_URING_CMD_FIXED)
-		req->buf_index = READ_ONCE(sqe->buf_index);
-
-	ioucmd->cmd_op = READ_ONCE(sqe->cmd_op);
-
-	return io_uring_cmd_prep_setup(req, sqe);
-}
-
 int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-06 21:54 [PATCHSET v3 0/4] uring_cmd copy avoidance Jens Axboe
                   ` (2 preceding siblings ...)
  2025-06-06 21:54 ` [PATCH 3/4] io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup() Jens Axboe
@ 2025-06-06 21:54 ` Jens Axboe
  2025-06-07  0:50   ` Caleb Sander Mateos
  3 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2025-06-06 21:54 UTC (permalink / raw)
  To: io-uring; +Cc: csander, Jens Axboe

uring_cmd currently copies the full SQE at prep time, just in case it
needs it to be stable. However, for inline completions or requests that
get queued up on the device side, there's no need to ever copy the SQE.
This is particularly important, as various use cases of uring_cmd will
be using 128b sized SQEs.

Opt in to using ->sqe_copy() to let the core of io_uring decide when to
copy SQEs, rather than doing it upfront unconditionally.

This provides two checks to see if ioucmd->sqe is still valid:

1) IO_URING_F_INLINE must be set, indicating the ->sqe_copy() call is
   happening inline from the syscall submitting the IO. As long as we're
   in that context, the SQE cannot have been reused.

2) If the SQE being passed in is NULL, then we're off the task_work
   submission path. This check could be combined with IO_URING_F_INLINE,
   but it'd require an additional branch-and-check in SQE queueing.

If either of these aren't true and the SQE hasn't been copied already,
then fail the request with -EFAULT and trigger a WARN_ON_ONCE() to
indicate that there's a bug to figure out. With that, it should not be
possible to ever reuse an SQE outside of the direct syscall path.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/opdef.c     |  1 +
 io_uring/uring_cmd.c | 21 ++++++++++++---------
 io_uring/uring_cmd.h |  1 +
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 6e0882b051f9..287f9a23b816 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -759,6 +759,7 @@ const struct io_cold_def io_cold_defs[] = {
 	},
 	[IORING_OP_URING_CMD] = {
 		.name			= "URING_CMD",
+		.sqe_copy		= io_uring_cmd_sqe_copy,
 		.cleanup		= io_uring_cmd_cleanup,
 	},
 	[IORING_OP_SEND_ZC] = {
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index e204f4941d72..a99dc2f9c4b5 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -205,17 +205,20 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (!ac)
 		return -ENOMEM;
 	ac->data.op_data = NULL;
+	ioucmd->sqe = sqe;
+	return 0;
+}
+
+void io_uring_cmd_sqe_copy(struct io_kiocb *req)
+{
+	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	struct io_async_cmd *ac = req->async_data;
 
-	/*
-	 * Unconditionally cache the SQE for now - this is only needed for
-	 * requests that go async, but prep handlers must ensure that any
-	 * sqe data is stable beyond prep. Since uring_cmd is special in
-	 * that it doesn't read in per-op data, play it safe and ensure that
-	 * any SQE data is stable beyond prep. This can later get relaxed.
-	 */
-	memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
+	/* already copied, nothing to do */
+	if (ioucmd->sqe == ac->sqes)
+		return;
+	memcpy(ac->sqes, ioucmd->sqe, uring_sqe_size(req->ctx));
 	ioucmd->sqe = ac->sqes;
-	return 0;
 }
 
 int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
diff --git a/io_uring/uring_cmd.h b/io_uring/uring_cmd.h
index e6a5142c890e..a6dad47afc6b 100644
--- a/io_uring/uring_cmd.h
+++ b/io_uring/uring_cmd.h
@@ -11,6 +11,7 @@ struct io_async_cmd {
 
 int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags);
 int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+void io_uring_cmd_sqe_copy(struct io_kiocb *req);
 void io_uring_cmd_cleanup(struct io_kiocb *req);
 
 bool io_uring_try_cancel_uring_cmd(struct io_ring_ctx *ctx,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-06 21:05     ` Jens Axboe
@ 2025-06-06 22:08       ` Jens Axboe
  2025-06-06 22:09         ` Caleb Sander Mateos
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2025-06-06 22:08 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring

On 6/6/25 3:05 PM, Jens Axboe wrote:
>> Is it necessary to pass the sqe? Wouldn't it always be ioucmd->sqe?
>> Presumably any other opcode that implements ->sqe_copy() would also
>> have the sqe pointer stashed somewhere. Seems like it would simplify
>> the core io_uring code a bit not to have to thread the sqe through
>> several function calls.
> 
> It's not necessary, but I would rather get rid of needing to store an
> SQE since that is a bit iffy than get rid of passing the SQE. When it
> comes from the core, you _know_ it's going to be valid. I feel like you
> need a fairly intimate understanding of io_uring issue flow to make any
> judgement on this, if you were adding an opcode and defining this type
> of handler.

Actually did go that route anyway, because we still need to stash it.
And if we do go that route, then we can keep all the checking in the
core and leave the handler just a basic copy with a void return. Which
is pretty nice.

Anyway, check out v3 and see what you think.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-06 22:08       ` Jens Axboe
@ 2025-06-06 22:09         ` Caleb Sander Mateos
  2025-06-06 23:53           ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Caleb Sander Mateos @ 2025-06-06 22:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Fri, Jun 6, 2025 at 3:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 6/6/25 3:05 PM, Jens Axboe wrote:
> >> Is it necessary to pass the sqe? Wouldn't it always be ioucmd->sqe?
> >> Presumably any other opcode that implements ->sqe_copy() would also
> >> have the sqe pointer stashed somewhere. Seems like it would simplify
> >> the core io_uring code a bit not to have to thread the sqe through
> >> several function calls.
> >
> > It's not necessary, but I would rather get rid of needing to store an
> > SQE since that is a bit iffy than get rid of passing the SQE. When it
> > comes from the core, you _know_ it's going to be valid. I feel like you
> > need a fairly intimate understanding of io_uring issue flow to make any
> > judgement on this, if you were adding an opcode and defining this type
> > of handler.
>
> Actually did go that route anyway, because we still need to stash it.
> And if we do go that route, then we can keep all the checking in the
> core and leave the handler just a basic copy with a void return. Which
> is pretty nice.
>
> Anyway, check out v3 and see what you think.

From a quick glance it looks good to me. Let me give it a more
detailed look with a fresh pair of eyes.

Best,
Caleb

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-06 22:09         ` Caleb Sander Mateos
@ 2025-06-06 23:53           ` Jens Axboe
  0 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2025-06-06 23:53 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring

On 6/6/25 4:09 PM, Caleb Sander Mateos wrote:
> On Fri, Jun 6, 2025 at 3:08 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 6/6/25 3:05 PM, Jens Axboe wrote:
>>>> Is it necessary to pass the sqe? Wouldn't it always be ioucmd->sqe?
>>>> Presumably any other opcode that implements ->sqe_copy() would also
>>>> have the sqe pointer stashed somewhere. Seems like it would simplify
>>>> the core io_uring code a bit not to have to thread the sqe through
>>>> several function calls.
>>>
>>> It's not necessary, but I would rather get rid of needing to store an
>>> SQE since that is a bit iffy than get rid of passing the SQE. When it
>>> comes from the core, you _know_ it's going to be valid. I feel like you
>>> need a fairly intimate understanding of io_uring issue flow to make any
>>> judgement on this, if you were adding an opcode and defining this type
>>> of handler.
>>
>> Actually did go that route anyway, because we still need to stash it.
>> And if we do go that route, then we can keep all the checking in the
>> core and leave the handler just a basic copy with a void return. Which
>> is pretty nice.
>>
>> Anyway, check out v3 and see what you think.
> 
> From a quick glance it looks good to me. Let me give it a more
> detailed look with a fresh pair of eyes.

Thanks, that'd be great. Ignore the commit message for patch 4,
I did update it but didn't amend before sending out... The one
in the git tree is accurate.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag
  2025-06-06 21:54 ` [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag Jens Axboe
@ 2025-06-07  0:49   ` Caleb Sander Mateos
  0 siblings, 0 replies; 19+ messages in thread
From: Caleb Sander Mateos @ 2025-06-07  0:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Fri, Jun 6, 2025 at 2:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Set when the execution of the request is done inline from the system
> call itself. Any deferred issue will never have this flag set.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method
  2025-06-06 21:54 ` [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method Jens Axboe
@ 2025-06-07  0:50   ` Caleb Sander Mateos
  2025-06-07 11:16     ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Caleb Sander Mateos @ 2025-06-07  0:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Fri, Jun 6, 2025 at 2:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Will be called by the core of io_uring, if inline issue is not going
> to be tried for a request. Opcodes can define this handler to defer
> copying of SQE data that should remain stable.
>
> Only called if IO_URING_F_INLINE is set. If it isn't set, then there's a
> bug in the core handling of this, and -EFAULT will be returned instead
> to terminate the request. This will trigger a WARN_ON_ONCE(). Don't
> expect this to ever trigger, and down the line this can be removed.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  io_uring/io_uring.c | 25 ++++++++++++++++++++++---
>  io_uring/opdef.h    |  1 +
>  2 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 0f9f6a173e66..9799a31a2b29 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -1935,14 +1935,31 @@ struct file *io_file_get_normal(struct io_kiocb *req, int fd)
>         return file;
>  }
>
> -static void io_queue_async(struct io_kiocb *req, int ret)
> +static int io_req_sqe_copy(struct io_kiocb *req, unsigned int issue_flags)
> +{
> +       const struct io_cold_def *def = &io_cold_defs[req->opcode];
> +
> +       if (!def->sqe_copy)
> +               return 0;
> +       if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_INLINE)))

I'm pretty confident that every initial async path under
io_submit_sqe() will call io_req_sqe_copy(). But I'm not positive that
io_req_sqe_copy() won't get called *additional* times from non-inline
contexts. One example scenario:
- io_submit_sqe() calls io_queue_sqe()
- io_issue_sqe() returns -EAGAIN, so io_queue_sqe() calls io_queue_async()
- io_queue_async() calls io_req_sqe_copy() in inline context
- io_queue_async() calls io_arm_poll_handler(), which returns
IO_APOLL_READY, so io_req_task_queue() is called
- Some other I/O to the file (possibly on a different task) clears the
ready poll events
- io_req_task_submit() calls io_queue_sqe() in task work context
- io_issue_sqe() returns -EAGAIN again, so io_queue_async() is called
- io_queue_async() calls io_req_sqe_copy() a second time in non-inline
(task work) context

If this is indeed possible, then I think we may need to relax this
check so it only verifies that IO_URING_F_INLINE is set *the first
time* io_req_sqe_copy() is called for a given req. (Or just remove the
IO_URING_F_INLINE check entirely.)

> +               return -EFAULT;
> +       def->sqe_copy(req);
> +       return 0;
> +}
> +
> +static void io_queue_async(struct io_kiocb *req, unsigned int issue_flags, int ret)
>         __must_hold(&req->ctx->uring_lock)
>  {
>         if (ret != -EAGAIN || (req->flags & REQ_F_NOWAIT)) {
> +fail:
>                 io_req_defer_failed(req, ret);
>                 return;
>         }
>
> +       ret = io_req_sqe_copy(req, issue_flags);
> +       if (unlikely(ret))
> +               goto fail;
> +
>         switch (io_arm_poll_handler(req, 0)) {
>         case IO_APOLL_READY:
>                 io_kbuf_recycle(req, 0);
> @@ -1971,7 +1988,7 @@ static inline void io_queue_sqe(struct io_kiocb *req, unsigned int extra_flags)
>          * doesn't support non-blocking read/write attempts
>          */
>         if (unlikely(ret))
> -               io_queue_async(req, ret);
> +               io_queue_async(req, issue_flags, ret);
>  }
>
>  static void io_queue_sqe_fallback(struct io_kiocb *req)
> @@ -1986,6 +2003,8 @@ static void io_queue_sqe_fallback(struct io_kiocb *req)
>                 req->flags |= REQ_F_LINK;
>                 io_req_defer_failed(req, req->cqe.res);
>         } else {
> +               /* can't fail with IO_URING_F_INLINE */
> +               io_req_sqe_copy(req, IO_URING_F_INLINE);
>                 if (unlikely(req->ctx->drain_active))
>                         io_drain_req(req);
>                 else
> @@ -2201,7 +2220,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>                 link->last = req;
>
>                 if (req->flags & IO_REQ_LINK_FLAGS)
> -                       return 0;
> +                       return io_req_sqe_copy(req, IO_URING_F_INLINE);

I still think this misses the last req in a linked chain, which will
be issued async but won't have IO_REQ_LINK_FLAGS set. Am I missing
something?

Best,
Caleb


>                 /* last request of the link, flush it */
>                 req = link->head;
>                 link->head = NULL;
> diff --git a/io_uring/opdef.h b/io_uring/opdef.h
> index 719a52104abe..c2f0907ed78c 100644
> --- a/io_uring/opdef.h
> +++ b/io_uring/opdef.h
> @@ -38,6 +38,7 @@ struct io_issue_def {
>  struct io_cold_def {
>         const char              *name;
>
> +       void (*sqe_copy)(struct io_kiocb *);
>         void (*cleanup)(struct io_kiocb *);
>         void (*fail)(struct io_kiocb *);
>  };
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-06 21:54 ` [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies Jens Axboe
@ 2025-06-07  0:50   ` Caleb Sander Mateos
  0 siblings, 0 replies; 19+ messages in thread
From: Caleb Sander Mateos @ 2025-06-07  0:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Fri, Jun 6, 2025 at 2:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> uring_cmd currently copies the full SQE at prep time, just in case it
> needs it to be stable. However, for inline completions or requests that
> get queued up on the device side, there's no need to ever copy the SQE.
> This is particularly important, as various use cases of uring_cmd will
> be using 128b sized SQEs.
>
> Opt in to using ->sqe_copy() to let the core of io_uring decide when to
> copy SQEs, rather than doing it upfront unconditionally.
>
> This provides two checks to see if ioucmd->sqe is still valid:
>
> 1) IO_URING_F_INLINE must be set, indicating the ->sqe_copy() call is
>    happening inline from the syscall submitting the IO. As long as we're
>    in that context, the SQE cannot have been reused.
>
> 2) If the SQE being passed in is NULL, then we're off the task_work
>    submission path. This check could be combined with IO_URING_F_INLINE,
>    but it'd require an additional branch-and-check in SQE queueing.
>
> If either of these aren't true and the SQE hasn't been copied already,
> then fail the request with -EFAULT and trigger a WARN_ON_ONCE() to
> indicate that there's a bug to figure out. With that, it should not be
> possible to ever reuse an SQE outside of the direct syscall path.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

With the parts about IO_URING_F_INLINE and NULL SQE removed from the
commit message,

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method
  2025-06-07  0:50   ` Caleb Sander Mateos
@ 2025-06-07 11:16     ` Jens Axboe
  0 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2025-06-07 11:16 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring

On 6/6/25 6:50 PM, Caleb Sander Mateos wrote:
> On Fri, Jun 6, 2025 at 2:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Will be called by the core of io_uring, if inline issue is not going
>> to be tried for a request. Opcodes can define this handler to defer
>> copying of SQE data that should remain stable.
>>
>> Only called if IO_URING_F_INLINE is set. If it isn't set, then there's a
>> bug in the core handling of this, and -EFAULT will be returned instead
>> to terminate the request. This will trigger a WARN_ON_ONCE(). Don't
>> expect this to ever trigger, and down the line this can be removed.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  io_uring/io_uring.c | 25 ++++++++++++++++++++++---
>>  io_uring/opdef.h    |  1 +
>>  2 files changed, 23 insertions(+), 3 deletions(-)
>>
>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>> index 0f9f6a173e66..9799a31a2b29 100644
>> --- a/io_uring/io_uring.c
>> +++ b/io_uring/io_uring.c
>> @@ -1935,14 +1935,31 @@ struct file *io_file_get_normal(struct io_kiocb *req, int fd)
>>         return file;
>>  }
>>
>> -static void io_queue_async(struct io_kiocb *req, int ret)
>> +static int io_req_sqe_copy(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +       const struct io_cold_def *def = &io_cold_defs[req->opcode];
>> +
>> +       if (!def->sqe_copy)
>> +               return 0;
>> +       if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_INLINE)))
> 
> I'm pretty confident that every initial async path under
> io_submit_sqe() will call io_req_sqe_copy(). But I'm not positive that
> io_req_sqe_copy() won't get called *additional* times from non-inline
> contexts. One example scenario:
> - io_submit_sqe() calls io_queue_sqe()
> - io_issue_sqe() returns -EAGAIN, so io_queue_sqe() calls io_queue_async()
> - io_queue_async() calls io_req_sqe_copy() in inline context
> - io_queue_async() calls io_arm_poll_handler(), which returns
> IO_APOLL_READY, so io_req_task_queue() is called
> - Some other I/O to the file (possibly on a different task) clears the
> ready poll events
> - io_req_task_submit() calls io_queue_sqe() in task work context
> - io_issue_sqe() returns -EAGAIN again, so io_queue_async() is called
> - io_queue_async() calls io_req_sqe_copy() a second time in non-inline
> (task work) context
> 
> If this is indeed possible, then I think we may need to relax this
> check so it only verifies that IO_URING_F_INLINE is set *the first
> time* io_req_sqe_copy() is called for a given req. (Or just remove the
> IO_URING_F_INLINE check entirely.)

Yes, the check is a bit eager indeed. I've added a flag for this, so
that we only go through the IO_URING_F_INLINE check and ->sqe_copy()
callback once.
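
Roughly along these lines - sketch only, and the flag name here is just
illustrative:

	static int io_req_sqe_copy(struct io_kiocb *req, unsigned int issue_flags)
	{
		const struct io_cold_def *def = &io_cold_defs[req->opcode];

		if (!def->sqe_copy || (req->flags & REQ_F_SQE_COPIED))
			return 0;
		if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_INLINE)))
			return -EFAULT;
		req->flags |= REQ_F_SQE_COPIED;
		def->sqe_copy(req);
		return 0;
	}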

>> +               return -EFAULT;
>> +       def->sqe_copy(req);
>> +       return 0;
>> +}
>> +
>> +static void io_queue_async(struct io_kiocb *req, unsigned int issue_flags, int ret)
>>         __must_hold(&req->ctx->uring_lock)
>>  {
>>         if (ret != -EAGAIN || (req->flags & REQ_F_NOWAIT)) {
>> +fail:
>>                 io_req_defer_failed(req, ret);
>>                 return;
>>         }
>>
>> +       ret = io_req_sqe_copy(req, issue_flags);
>> +       if (unlikely(ret))
>> +               goto fail;
>> +
>>         switch (io_arm_poll_handler(req, 0)) {
>>         case IO_APOLL_READY:
>>                 io_kbuf_recycle(req, 0);
>> @@ -1971,7 +1988,7 @@ static inline void io_queue_sqe(struct io_kiocb *req, unsigned int extra_flags)
>>          * doesn't support non-blocking read/write attempts
>>          */
>>         if (unlikely(ret))
>> -               io_queue_async(req, ret);
>> +               io_queue_async(req, issue_flags, ret);
>>  }
>>
>>  static void io_queue_sqe_fallback(struct io_kiocb *req)
>> @@ -1986,6 +2003,8 @@ static void io_queue_sqe_fallback(struct io_kiocb *req)
>>                 req->flags |= REQ_F_LINK;
>>                 io_req_defer_failed(req, req->cqe.res);
>>         } else {
>> +               /* can't fail with IO_URING_F_INLINE */
>> +               io_req_sqe_copy(req, IO_URING_F_INLINE);
>>                 if (unlikely(req->ctx->drain_active))
>>                         io_drain_req(req);
>>                 else
>> @@ -2201,7 +2220,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>                 link->last = req;
>>
>>                 if (req->flags & IO_REQ_LINK_FLAGS)
>> -                       return 0;
>> +                       return io_req_sqe_copy(req, IO_URING_F_INLINE);
> 
> I still think this misses the last req in a linked chain, which will
> be issued async but won't have IO_REQ_LINK_FLAGS set. Am I missing
> something?

Indeed, we need to call this before that section. Fixed that up too,
thanks.
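
IOW, roughly this (sketch of the idea, not the exact diff):

	link->last = req;

	/* can't fail with IO_URING_F_INLINE */
	io_req_sqe_copy(req, IO_URING_F_INLINE);

	if (req->flags & IO_REQ_LINK_FLAGS)
		return 0;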

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup()
  2025-06-06 21:54 ` [PATCH 3/4] io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup() Jens Axboe
@ 2025-06-08 19:57   ` Anuj gupta
  0 siblings, 0 replies; 19+ messages in thread
From: Anuj gupta @ 2025-06-08 19:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, csander

Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-09 17:36 [PATCHSET v4 0/4] uring_cmd copy avoidance Jens Axboe
@ 2025-06-09 17:36 ` Jens Axboe
  2025-06-09 21:54   ` Caleb Sander Mateos
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2025-06-09 17:36 UTC (permalink / raw)
  To: io-uring; +Cc: csander, Jens Axboe

uring_cmd currently copies the full SQE at prep time, just in case it
needs it to be stable. However, for inline completions or requests that
get queued up on the device side, there's no need to ever copy the SQE.
This is particularly important, as various use cases of uring_cmd will
be using 128b sized SQEs.

Opt in to using ->sqe_copy() to let the core of io_uring decide when to
copy SQEs. This callback will only be called if it is safe to do so.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/opdef.c     |  1 +
 io_uring/uring_cmd.c | 21 ++++++++++++---------
 io_uring/uring_cmd.h |  1 +
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 6e0882b051f9..287f9a23b816 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -759,6 +759,7 @@ const struct io_cold_def io_cold_defs[] = {
 	},
 	[IORING_OP_URING_CMD] = {
 		.name			= "URING_CMD",
+		.sqe_copy		= io_uring_cmd_sqe_copy,
 		.cleanup		= io_uring_cmd_cleanup,
 	},
 	[IORING_OP_SEND_ZC] = {
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index e204f4941d72..a99dc2f9c4b5 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -205,17 +205,20 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (!ac)
 		return -ENOMEM;
 	ac->data.op_data = NULL;
+	ioucmd->sqe = sqe;
+	return 0;
+}
+
+void io_uring_cmd_sqe_copy(struct io_kiocb *req)
+{
+	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	struct io_async_cmd *ac = req->async_data;
 
-	/*
-	 * Unconditionally cache the SQE for now - this is only needed for
-	 * requests that go async, but prep handlers must ensure that any
-	 * sqe data is stable beyond prep. Since uring_cmd is special in
-	 * that it doesn't read in per-op data, play it safe and ensure that
-	 * any SQE data is stable beyond prep. This can later get relaxed.
-	 */
-	memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
+	/* already copied, nothing to do */
+	if (ioucmd->sqe == ac->sqes)
+		return;
+	memcpy(ac->sqes, ioucmd->sqe, uring_sqe_size(req->ctx));
 	ioucmd->sqe = ac->sqes;
-	return 0;
 }
 
 int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
diff --git a/io_uring/uring_cmd.h b/io_uring/uring_cmd.h
index e6a5142c890e..a6dad47afc6b 100644
--- a/io_uring/uring_cmd.h
+++ b/io_uring/uring_cmd.h
@@ -11,6 +11,7 @@ struct io_async_cmd {
 
 int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags);
 int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+void io_uring_cmd_sqe_copy(struct io_kiocb *req);
 void io_uring_cmd_cleanup(struct io_kiocb *req);
 
 bool io_uring_try_cancel_uring_cmd(struct io_ring_ctx *ctx,
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-09 17:36 ` [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies Jens Axboe
@ 2025-06-09 21:54   ` Caleb Sander Mateos
  2025-06-10 13:35     ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Caleb Sander Mateos @ 2025-06-09 21:54 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Mon, Jun 9, 2025 at 10:39 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> uring_cmd currently copies the full SQE at prep time, just in case it
> needs it to be stable. However, for inline completions or requests that
> get queued up on the device side, there's no need to ever copy the SQE.
> This is particularly important, as various use cases of uring_cmd will
> be using 128b sized SQEs.
>
> Opt in to using ->sqe_copy() to let the core of io_uring decide when to
> copy SQEs. This callback will only be called if it is safe to do so.
>
> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  io_uring/opdef.c     |  1 +
>  io_uring/uring_cmd.c | 21 ++++++++++++---------
>  io_uring/uring_cmd.h |  1 +
>  3 files changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
> index 6e0882b051f9..287f9a23b816 100644
> --- a/io_uring/opdef.c
> +++ b/io_uring/opdef.c
> @@ -759,6 +759,7 @@ const struct io_cold_def io_cold_defs[] = {
>         },
>         [IORING_OP_URING_CMD] = {
>                 .name                   = "URING_CMD",
> +               .sqe_copy               = io_uring_cmd_sqe_copy,
>                 .cleanup                = io_uring_cmd_cleanup,
>         },
>         [IORING_OP_SEND_ZC] = {
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index e204f4941d72..a99dc2f9c4b5 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -205,17 +205,20 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>         if (!ac)
>                 return -ENOMEM;
>         ac->data.op_data = NULL;
> +       ioucmd->sqe = sqe;
> +       return 0;
> +}
> +
> +void io_uring_cmd_sqe_copy(struct io_kiocb *req)
> +{
> +       struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> +       struct io_async_cmd *ac = req->async_data;
>
> -       /*
> -        * Unconditionally cache the SQE for now - this is only needed for
> -        * requests that go async, but prep handlers must ensure that any
> -        * sqe data is stable beyond prep. Since uring_cmd is special in
> -        * that it doesn't read in per-op data, play it safe and ensure that
> -        * any SQE data is stable beyond prep. This can later get relaxed.
> -        */
> -       memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
> +       /* already copied, nothing to do */
> +       if (ioucmd->sqe == ac->sqes)

REQ_F_SQE_COPY should prevent this from ever happening, right? Could
we make it a WARN_ON()? Or drop it entirely?

Best,
Caleb


> +               return;
> +       memcpy(ac->sqes, ioucmd->sqe, uring_sqe_size(req->ctx));
>         ioucmd->sqe = ac->sqes;
> -       return 0;
>  }
>
>  int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
> diff --git a/io_uring/uring_cmd.h b/io_uring/uring_cmd.h
> index e6a5142c890e..a6dad47afc6b 100644
> --- a/io_uring/uring_cmd.h
> +++ b/io_uring/uring_cmd.h
> @@ -11,6 +11,7 @@ struct io_async_cmd {
>
>  int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags);
>  int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
> +void io_uring_cmd_sqe_copy(struct io_kiocb *req);
>  void io_uring_cmd_cleanup(struct io_kiocb *req);
>
>  bool io_uring_try_cancel_uring_cmd(struct io_ring_ctx *ctx,
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies
  2025-06-09 21:54   ` Caleb Sander Mateos
@ 2025-06-10 13:35     ` Jens Axboe
  0 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2025-06-10 13:35 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring

On 6/9/25 3:54 PM, Caleb Sander Mateos wrote:
> On Mon, Jun 9, 2025 at 10:39 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> uring_cmd currently copies the full SQE at prep time, just in case it
>> needs it to be stable. However, for inline completions or requests that
>> get queued up on the device side, there's no need to ever copy the SQE.
>> This is particularly important, as various use cases of uring_cmd will
>> be using 128b sized SQEs.
>>
>> Opt in to using ->sqe_copy() to let the core of io_uring decide when to
>> copy SQEs. This callback will only be called if it is safe to do so.
>>
>> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  io_uring/opdef.c     |  1 +
>>  io_uring/uring_cmd.c | 21 ++++++++++++---------
>>  io_uring/uring_cmd.h |  1 +
>>  3 files changed, 14 insertions(+), 9 deletions(-)
>>
>> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
>> index 6e0882b051f9..287f9a23b816 100644
>> --- a/io_uring/opdef.c
>> +++ b/io_uring/opdef.c
>> @@ -759,6 +759,7 @@ const struct io_cold_def io_cold_defs[] = {
>>         },
>>         [IORING_OP_URING_CMD] = {
>>                 .name                   = "URING_CMD",
>> +               .sqe_copy               = io_uring_cmd_sqe_copy,
>>                 .cleanup                = io_uring_cmd_cleanup,
>>         },
>>         [IORING_OP_SEND_ZC] = {
>> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
>> index e204f4941d72..a99dc2f9c4b5 100644
>> --- a/io_uring/uring_cmd.c
>> +++ b/io_uring/uring_cmd.c
>> @@ -205,17 +205,20 @@ int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>         if (!ac)
>>                 return -ENOMEM;
>>         ac->data.op_data = NULL;
>> +       ioucmd->sqe = sqe;
>> +       return 0;
>> +}
>> +
>> +void io_uring_cmd_sqe_copy(struct io_kiocb *req)
>> +{
>> +       struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
>> +       struct io_async_cmd *ac = req->async_data;
>>
>> -       /*
>> -        * Unconditionally cache the SQE for now - this is only needed for
>> -        * requests that go async, but prep handlers must ensure that any
>> -        * sqe data is stable beyond prep. Since uring_cmd is special in
>> -        * that it doesn't read in per-op data, play it safe and ensure that
>> -        * any SQE data is stable beyond prep. This can later get relaxed.
>> -        */
>> -       memcpy(ac->sqes, sqe, uring_sqe_size(req->ctx));
>> +       /* already copied, nothing to do */
>> +       if (ioucmd->sqe == ac->sqes)
> 
> REQ_F_SQE_COPY should prevent this from ever happening, right? Could
> we make it a WARN_ON()? Or drop it entirely?

It should in the current form of the patchset, now that the copy is
gone inside uring_cmd and we have the SQE_COPIED flag. I'll make it
a WARN_ON_ONCE() instead.
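
IOW (sketch):

	/* the core should never call this twice for the same request */
	if (WARN_ON_ONCE(ioucmd->sqe == ac->sqes))
		return;
	memcpy(ac->sqes, ioucmd->sqe, uring_sqe_size(req->ctx));
	ioucmd->sqe = ac->sqes;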

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-06-10 13:35 UTC | newest]

Thread overview: 19+ messages
2025-06-06 21:54 [PATCHSET v3 0/4] uring_cmd copy avoidance Jens Axboe
2025-06-06 21:54 ` [PATCH 1/4] io_uring: add IO_URING_F_INLINE issue flag Jens Axboe
2025-06-07  0:49   ` Caleb Sander Mateos
2025-06-06 21:54 ` [PATCH 2/4] io_uring: add struct io_cold_def->sqe_copy() method Jens Axboe
2025-06-07  0:50   ` Caleb Sander Mateos
2025-06-07 11:16     ` Jens Axboe
2025-06-06 21:54 ` [PATCH 3/4] io_uring/uring_cmd: get rid of io_uring_cmd_prep_setup() Jens Axboe
2025-06-08 19:57   ` Anuj gupta
2025-06-06 21:54 ` [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies Jens Axboe
2025-06-07  0:50   ` Caleb Sander Mateos
  -- strict thread matches above, loose matches on Subject: below --
2025-06-09 17:36 [PATCHSET v4 0/4] uring_cmd copy avoidance Jens Axboe
2025-06-09 17:36 ` [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies Jens Axboe
2025-06-09 21:54   ` Caleb Sander Mateos
2025-06-10 13:35     ` Jens Axboe
2025-06-05 19:40 [PATCHSET RFC v2 0/4] uring_cmd copy avoidance Jens Axboe
2025-06-05 19:40 ` [PATCH 4/4] io_uring/uring_cmd: implement ->sqe_copy() to avoid unnecessary copies Jens Axboe
2025-06-06 17:39   ` Caleb Sander Mateos
2025-06-06 21:05     ` Jens Axboe
2025-06-06 22:08       ` Jens Axboe
2025-06-06 22:09         ` Caleb Sander Mateos
2025-06-06 23:53           ` Jens Axboe
