public inbox for [email protected]
* memory access op ideas
@ 2022-04-13 10:33 Avi Kivity
  2022-04-22 12:52 ` Hao Xu
  2022-04-22 14:50 ` Jens Axboe
  0 siblings, 2 replies; 22+ messages in thread
From: Avi Kivity @ 2022-04-13 10:33 UTC (permalink / raw)
  To: io-uring

Unfortunately, only ideas, no patches. But at least the first seems very 
easy.


- IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op 
itself (1-8 bytes) to a user memory location specified by the op.


Linked to another op, this can generate an in-memory notification useful 
for busy-waiters or the UMWAIT instruction


This would be useful for Seastar, which looks at a timer-managed memory 
location to check when to break computation loops.


- IORING_OP_MEMCPY - asynchronously copy memory


Some CPUs include a DMA engine, and io_uring is a perfect interface to 
exercise it. It may be difficult to find space for two iovecs though.
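
For the first idea, a rough sketch of the polling side, just to make the
intended use concrete (do_one_unit_of_work() is a made-up placeholder, the
flag layout is only an assumption, and UMWAIT/TPAUSE could replace the
plain load on CPUs that have waitpkg):

#include <stdint.h>

/* Written to by the linked io_uring op when the timer fires. */
static volatile uint64_t need_preempt;

static void do_one_unit_of_work(void)
{
	/* real work goes here */
}

static void run_until_notified(void)
{
	while (!need_preempt)
		do_one_unit_of_work();
	need_preempt = 0;	/* re-arm for the next notification */
}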




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-13 10:33 memory access op ideas Avi Kivity
@ 2022-04-22 12:52 ` Hao Xu
  2022-04-22 13:24   ` Hao Xu
                     ` (2 more replies)
  2022-04-22 14:50 ` Jens Axboe
  1 sibling, 3 replies; 22+ messages in thread
From: Hao Xu @ 2022-04-22 12:52 UTC (permalink / raw)
  To: Avi Kivity, io-uring

Hi Avi,
On 4/13/22 6:33 PM, Avi Kivity wrote:
> Unfortunately, only ideas, no patches. But at least the first seems very 
> easy.
> 
> 
> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op 
> itself (1-8 bytes) to a user memory location specified by the op.
> 
> 
> Linked to another op, this can generate an in-memory notification useful 
> for busy-waiters or the UMWAIT instruction
> 
> 
> This would be useful for Seastar, which looks at a timer-managed memory 
> location to check when to break computation loops.
> 
> 
> - IORING_OP_MEMCPY - asynchronously copy memory
> 
> 
> Some CPUs include a DMA engine, and io_uring is a perfect interface to 
> exercise it. It may be difficult to find space for two iovecs though.

I have a question about the 'DMA' here: do you mean a DMA device for
memory copy? My understanding is that you want async memcpy so that the
CPU can relax while the dedicated hardware is doing the memory copy. The
thing is, for cases like busy waiting or UMWAIT, the length of the memory
to be copied is usually small (otherwise we wouldn't use busy waiting or
UMWAIT, right?). Making it async via io_uring's io-wq may then introduce
much more overhead (the context switch).

Regards,
Hao

> 
> 
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-22 12:52 ` Hao Xu
@ 2022-04-22 13:24   ` Hao Xu
  2022-04-22 13:38   ` Jens Axboe
  2022-04-23 16:14   ` Avi Kivity
  2 siblings, 0 replies; 22+ messages in thread
From: Hao Xu @ 2022-04-22 13:24 UTC (permalink / raw)
  To: Avi Kivity, io-uring



On 4/22/22 8:52 PM, Hao Xu wrote:
> Hi Avi,
> On 4/13/22 6:33 PM, Avi Kivity wrote:
>> Unfortunately, only ideas, no patches. But at least the first seems 
>> very easy.
>>
>>
>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op 
>> itself (1-8 bytes) to a user memory location specified by the op.
>>
>>
>> Linked to another op, this can generate an in-memory notification 
>> useful for busy-waiters or the UMWAIT instruction
>>
>>
>> This would be useful for Seastar, which looks at a timer-managed 
>> memory location to check when to break computation loops.
>>
>>
>> - IORING_OP_MEMCPY - asynchronously copy memory
>>
>>
>> Some CPUs include a DMA engine, and io_uring is a perfect interface 
>> to exercise it. It may be difficult to find space for two iovecs though.
>
> I have a question about the 'DMA' here, do you mean DMA device for
> memory copy? My understanding is you want async memcpy so that the
> cpu can relax when the specific hardware is doing memory copy. the
> thing is for cases like busy waiting or UMAIT, the length of the memory
> to be copied is usually small(otherwise we don't use busy waiting or
> UMAIT, right?). Then making it async by io_uring's iowq may introduce
> much more overhead(the context switch).

OK, maybe we can leverage task work to avoid the context switch, but it's
still unclear how much the overhead is.

>
> Regards,
> Hao
>
>>
>>
>>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-22 12:52 ` Hao Xu
  2022-04-22 13:24   ` Hao Xu
@ 2022-04-22 13:38   ` Jens Axboe
  2022-04-23  7:19     ` Hao Xu
  2022-04-23 16:14   ` Avi Kivity
  2 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2022-04-22 13:38 UTC (permalink / raw)
  To: Hao Xu, Avi Kivity, io-uring

On 4/22/22 6:52 AM, Hao Xu wrote:
> Hi Avi,
> On 4/13/22 6:33 PM, Avi Kivity wrote:
>> Unfortunately, only ideas, no patches. But at least the first seems very easy.
>>
>>
>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op itself (1-8 bytes) to a user memory location specified by the op.
>>
>>
>> Linked to another op, this can generate an in-memory notification useful for busy-waiters or the UMWAIT instruction
>>
>>
>> This would be useful for Seastar, which looks at a timer-managed memory location to check when to break computation loops.
>>
>>
>> - IORING_OP_MEMCPY - asynchronously copy memory
>>
>>
>> Some CPUs include a DMA engine, and io_uring is a perfect interface to exercise it. It may be difficult to find space for two iovecs though.
> 
> I have a question about the 'DMA' here, do you mean DMA device for
> memory copy? My understanding is you want async memcpy so that the
> cpu can relax when the specific hardware is doing memory copy. the
> thing is for cases like busy waiting or UMAIT, the length of the memory
> to be copied is usually small(otherwise we don't use busy waiting or
> UMAIT, right?). Then making it async by io_uring's iowq may introduce
> much more overhead(the context switch).

Nothing fast should use io-wq. But I'm not sure why you think this would
need it: as long as you can start the operation in a sane fashion and
get notified when done, why would it need io-wq?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-13 10:33 memory access op ideas Avi Kivity
  2022-04-22 12:52 ` Hao Xu
@ 2022-04-22 14:50 ` Jens Axboe
  2022-04-22 15:03   ` Jens Axboe
                     ` (2 more replies)
  1 sibling, 3 replies; 22+ messages in thread
From: Jens Axboe @ 2022-04-22 14:50 UTC (permalink / raw)
  To: Avi Kivity, io-uring

On 4/13/22 4:33 AM, Avi Kivity wrote:
> Unfortunately, only ideas, no patches. But at least the first seems very easy.
> 
> 
> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op
> itself (1-8 bytes) to a user memory location specified by the op.
> 
> 
> Linked to another op, this can generate an in-memory notification
> useful for busy-waiters or the UMWAIT instruction
>
> This would be useful for Seastar, which looks at a timer-managed
> memory location to check when to break computation loops.

This one would indeed be trivial to do. If we limit the max size
supported to eg 8 bytes like suggested, then it could be in the sqe
itself and just copied to the user address specified.

Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination
address, and sqe->off the data to copy.

If you'll commit to testing this, I can hack it up pretty quickly...
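
For reference, submission from userspace with that layout could look
roughly like the below (untested sketch; IORING_OP_MEMCPY_IMM and its
opcode value are made up here, and a real notification use would likely
link this behind a timeout with IOSQE_IO_LINK):

#include <liburing.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcode, not in any released kernel or liburing. */
#define IORING_OP_MEMCPY_IMM	46

/* Queue a write of 'value' (1..8 bytes) to *dst: sqe->len is the size,
 * sqe->addr the destination, sqe->off the immediate payload.
 */
static int queue_mem_notify(struct io_uring *ring, uint64_t *dst,
			    uint64_t value, unsigned int len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_MEMCPY_IMM;
	sqe->addr = (uintptr_t) dst;
	sqe->len = len;
	sqe->off = value;
	sqe->user_data = 0x1234;	/* arbitrary completion tag */
	return io_uring_submit(ring);
}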

> - IORING_OP_MEMCPY - asynchronously copy memory
> 
> 
> Some CPUs include a DMA engine, and io_uring is a perfect interface to
> exercise it. It may be difficult to find space for two iovecs though.

I've considered this one in the past too, and it is indeed an ideal fit
in terms of API. Outside of the DMA engines, it can also be used to
offload memcpy to a GPU, for example.

The io_uring side would not be hard to wire up, basically just have the
sqe specify source, destination, length. Add some well-defined flags
depending on what the copy engine offers, for example.

But some work is probably required here in exposing an API, testing,
etc...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-22 14:50 ` Jens Axboe
@ 2022-04-22 15:03   ` Jens Axboe
  2022-04-23 16:30     ` Avi Kivity
  2022-04-22 20:03   ` Walker, Benjamin
  2022-04-23 16:23   ` Avi Kivity
  2 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2022-04-22 15:03 UTC (permalink / raw)
  To: Avi Kivity, io-uring

On 4/22/22 8:50 AM, Jens Axboe wrote:
> On 4/13/22 4:33 AM, Avi Kivity wrote:
>> Unfortunately, only ideas, no patches. But at least the first seems very easy.
>>
>>
>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op
>> itself (1-8 bytes) to a user memory location specified by the op.
>>
>>
>> Linked to another op, this can generate an in-memory notification
>> useful for busy-waiters or the UMWAIT instruction
>>
>> This would be useful for Seastar, which looks at a timer-managed
>> memory location to check when to break computation loops.
> 
> This one would indeed be trivial to do. If we limit the max size
> supported to eg 8 bytes like suggested, then it could be in the sqe
> itself and just copied to the user address specified.
> 
> Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination
> address, and sqe->off the data to copy.
> 
> If you'll commit to testing this, I can hack it up pretty quickly...

Something like this, totally untested. Maybe the return value should be
bytes copied? Just returns 0/error right now.

Follows the above convention.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2052a796436c..d2a95f9d9d2d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -586,6 +586,13 @@ struct io_socket {
 	unsigned long			nofile;
 };
 
+struct io_mem {
+	struct file			*file;
+	u64				value;
+	void __user			*dest;
+	u32				len;
+};
+
 struct io_sync {
 	struct file			*file;
 	loff_t				len;
@@ -962,6 +969,7 @@ struct io_kiocb {
 		struct io_msg		msg;
 		struct io_xattr		xattr;
 		struct io_socket	sock;
+		struct io_mem		mem;
 	};
 
 	u8				opcode;
@@ -1231,16 +1239,19 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 	},
 	[IORING_OP_FSETXATTR] = {
-		.needs_file = 1
+		.needs_file		= 1,
 	},
 	[IORING_OP_SETXATTR] = {},
 	[IORING_OP_FGETXATTR] = {
-		.needs_file = 1
+		.needs_file		= 1,
 	},
 	[IORING_OP_GETXATTR] = {},
 	[IORING_OP_SOCKET] = {
 		.audit_skip		= 1,
 	},
+	[IORING_OP_MEMCPY_IMM] = {
+		.audit_skip		= 1,
+	},
 };
 
 /* requests with any of those set should undergo io_disarm_next() */
@@ -5527,6 +5538,38 @@ static int io_sync_file_range(struct io_kiocb *req, unsigned int issue_flags)
 	return 0;
 }
 
+static int io_memcpy_imm_prep(struct io_kiocb *req,
+			      const struct io_uring_sqe *sqe)
+{
+	struct io_mem *mem = &req->mem;
+
+	if (unlikely(sqe->ioprio || sqe->rw_flags || sqe->buf_index ||
+		     sqe->splice_fd_in))
+		return -EINVAL;
+
+	mem->value = READ_ONCE(sqe->off);
+	mem->dest = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	mem->len = READ_ONCE(sqe->len);
+	if (!mem->len || mem->len > sizeof(u64))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int io_memcpy_imm(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_mem *mem = &req->mem;
+	int ret = 0;
+
+	if (copy_to_user(mem->dest, &mem->value, mem->len))
+		ret = -EFAULT;
+
+	if (ret < 0)
+		req_set_fail(req);
+	io_req_complete(req, ret);
+	return 0;
+}
+
 #if defined(CONFIG_NET)
 static bool io_net_retry(struct socket *sock, int flags)
 {
@@ -7494,6 +7537,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return io_getxattr_prep(req, sqe);
 	case IORING_OP_SOCKET:
 		return io_socket_prep(req, sqe);
+	case IORING_OP_MEMCPY_IMM:
+		return io_memcpy_imm_prep(req, sqe);
 	}
 
 	printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -7815,6 +7860,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	case IORING_OP_SOCKET:
 		ret = io_socket(req, issue_flags);
 		break;
+	case IORING_OP_MEMCPY_IMM:
+		ret = io_memcpy_imm(req, issue_flags);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 5fb52bf32435..853f00a2bddd 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -152,6 +152,7 @@ enum {
 	IORING_OP_FGETXATTR,
 	IORING_OP_GETXATTR,
 	IORING_OP_SOCKET,
+	IORING_OP_MEMCPY_IMM,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-22 14:50 ` Jens Axboe
  2022-04-22 15:03   ` Jens Axboe
@ 2022-04-22 20:03   ` Walker, Benjamin
  2022-04-23 10:19     ` Pavel Begunkov
  2022-04-23 13:20     ` Jens Axboe
  2022-04-23 16:23   ` Avi Kivity
  2 siblings, 2 replies; 22+ messages in thread
From: Walker, Benjamin @ 2022-04-22 20:03 UTC (permalink / raw)
  To: Jens Axboe, Avi Kivity, io-uring

On 4/22/2022 7:50 AM, Jens Axboe wrote:
> On 4/13/22 4:33 AM, Avi Kivity wrote:
>> Unfortunately, only ideas, no patches. But at least the first seems very easy.
>>
>>
>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op
>> itself (1-8 bytes) to a user memory location specified by the op.
>>
>>
>> Linked to another op, this can generate an in-memory notification
>> useful for busy-waiters or the UMWAIT instruction
>>
>> This would be useful for Seastar, which looks at a timer-managed
>> memory location to check when to break computation loops.
> 
> This one would indeed be trivial to do. If we limit the max size
> supported to eg 8 bytes like suggested, then it could be in the sqe
> itself and just copied to the user address specified.
> 
> Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination
> address, and sqe->off the data to copy.
> 
> If you'll commit to testing this, I can hack it up pretty quickly...
> 
>> - IORING_OP_MEMCPY - asynchronously copy memory
>>
>>
>> Some CPUs include a DMA engine, and io_uring is a perfect interface to
>> exercise it. It may be difficult to find space for two iovecs though.
> 
> I've considered this one in the past too, and it is indeed an ideal fit
> in terms of API. Outside of the DMA engines, it can also be used to
> offload memcpy to a GPU, for example.
> 
> The io_uring side would not be hard to wire up, basically just have the
> sqe specfy source, destination, length. Add some well defined flags
> depending on what the copy engine offers, for example.
> 
> But probably some work required here in exposing an API and testing
> etc...
> 

I'm about to send to this list a set of patches that associate an io_uring
with a dmaengine channel. I'm not necessarily thinking of using it to
directly drive the DMA engine itself (although we could, and there are
some nice advantages to that), but rather as a way to offload the data
copies/transforms of existing io_uring operations. My primary focus has
been the copy between kernel and user space when receiving from a socket.

Upcoming DMA engines also support SVA, allowing them to copy from kernel 
to user without page pinning. We've got patches for full SVA enabling in 
dmaengine prepared, such that each io_uring can allocate a PASID 
describing the user+kernel address space for the current context, 
allocate a channel via dmaengine and assign it the PASID, and then do 
DMA between kernel/user with new dmaengine APIs without any page pinning.

As preparation, I have submitted a series to dmaengine that allows for 
polling and out-of-order completions. See 
https://lore.kernel.org/dmaengine/[email protected]/T/#u. 
This is a necessary first step.

I'll get the patches out ASAP as an RFC. I'm sure my approach was 
entirely wrong, but you'll get the idea.

Thanks,
Ben



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-22 13:38   ` Jens Axboe
@ 2022-04-23  7:19     ` Hao Xu
  0 siblings, 0 replies; 22+ messages in thread
From: Hao Xu @ 2022-04-23  7:19 UTC (permalink / raw)
  To: Jens Axboe, Avi Kivity, io-uring

On 4/22/22 9:38 PM, Jens Axboe wrote:
> On 4/22/22 6:52 AM, Hao Xu wrote:
>> Hi Avi,
>> On 4/13/22 6:33 PM, Avi Kivity wrote:
>>> Unfortunately, only ideas, no patches. But at least the first seems very easy.
>>>
>>>
>>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op itself (1-8 bytes) to a user memory location specified by the op.
>>>
>>>
>>> Linked to another op, this can generate an in-memory notification useful for busy-waiters or the UMWAIT instruction
>>>
>>>
>>> This would be useful for Seastar, which looks at a timer-managed memory location to check when to break computation loops.
>>>
>>>
>>> - IORING_OP_MEMCPY - asynchronously copy memory
>>>
>>>
>>> Some CPUs include a DMA engine, and io_uring is a perfect interface to exercise it. It may be difficult to find space for two iovecs though.
>>
>> I have a question about the 'DMA' here, do you mean DMA device for
>> memory copy? My understanding is you want async memcpy so that the
>> cpu can relax when the specific hardware is doing memory copy. the
>> thing is for cases like busy waiting or UMAIT, the length of the memory
>> to be copied is usually small(otherwise we don't use busy waiting or
>> UMAIT, right?). Then making it async by io_uring's iowq may introduce
>> much more overhead(the context switch).
> 
> Nothing fast should use io-wq. But not sure why you think this would
> need it, as long as you can start the operation in a sane fashion and
> get notified when done, why would it need io-wq?
Ah, yes, I forgot some basic logic while writing the email.
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-22 20:03   ` Walker, Benjamin
@ 2022-04-23 10:19     ` Pavel Begunkov
  2022-04-23 13:20     ` Jens Axboe
  1 sibling, 0 replies; 22+ messages in thread
From: Pavel Begunkov @ 2022-04-23 10:19 UTC (permalink / raw)
  To: Walker, Benjamin, Jens Axboe, Avi Kivity, io-uring

On 4/22/22 21:03, Walker, Benjamin wrote:
> On 4/22/2022 7:50 AM, Jens Axboe wrote:
>> On 4/13/22 4:33 AM, Avi Kivity wrote:
>>> Unfortunately, only ideas, no patches. But at least the first seems very easy.
>>>
>>>
>>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op
>>> itself (1-8 bytes) to a user memory location specified by the op.
>>>
>>>
>>> Linked to another op, this can generate an in-memory notification
>>> useful for busy-waiters or the UMWAIT instruction
>>>
>>> This would be useful for Seastar, which looks at a timer-managed
>>> memory location to check when to break computation loops.
>>
>> This one would indeed be trivial to do. If we limit the max size
>> supported to eg 8 bytes like suggested, then it could be in the sqe
>> itself and just copied to the user address specified.
>>
>> Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination
>> address, and sqe->off the data to copy.
>>
>> If you'll commit to testing this, I can hack it up pretty quickly...
>>
>>> - IORING_OP_MEMCPY - asynchronously copy memory
>>>
>>>
>>> Some CPUs include a DMA engine, and io_uring is a perfect interface to
>>> exercise it. It may be difficult to find space for two iovecs though.
>>
>> I've considered this one in the past too, and it is indeed an ideal fit
>> in terms of API. Outside of the DMA engines, it can also be used to
>> offload memcpy to a GPU, for example.
>>
>> The io_uring side would not be hard to wire up, basically just have the
>> sqe specfy source, destination, length. Add some well defined flags
>> depending on what the copy engine offers, for example.
>>
>> But probably some work required here in exposing an API and testing
>> etc...
>>
> 
> I'm about to send a set of patches to associate an io_uring with a dmaengine channel to this list. I'm not necessarily thinking of using it to directly drive the DMA engine itself (although we could, and there's some nice advantages to that), but rather as a way to offload data copies/transforms on existing io_uring operations. My primary focus has been the copy between kernel and user space when receiving from a socket.
> 
> Upcoming DMA engines also support SVA, allowing them to copy from kernel to user without page pinning. We've got patches for full SVA enabling in dmaengine prepared, such that each io_uring can allocate a PASID describing the user+kernel address space for the current context, allocate a channel via dmaengine and assign it the PASID, and then do DMA between kernel/user with new dmaengine APIs without any page pinning.
> 
> As preparation, I have submitted a series to dmaengine that allows for polling and out-of-order completions. See https://lore.kernel.org/dmaengine/[email protected]/T/#u. This is a necessary first step.
> 
> I'll get the patches out ASAP as an RFC. I'm sure my approach was entirely wrong, but you'll get the idea.

Please do. Sounds interesting and it will likely be nicely
aligned with our work/plans on p2p.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-22 20:03   ` Walker, Benjamin
  2022-04-23 10:19     ` Pavel Begunkov
@ 2022-04-23 13:20     ` Jens Axboe
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2022-04-23 13:20 UTC (permalink / raw)
  To: Walker, Benjamin, Avi Kivity, io-uring

On 4/22/22 2:03 PM, Walker, Benjamin wrote:
> On 4/22/2022 7:50 AM, Jens Axboe wrote:
>> On 4/13/22 4:33 AM, Avi Kivity wrote:
>>> Unfortunately, only ideas, no patches. But at least the first seems very easy.
>>>
>>>
>>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op
>>> itself (1-8 bytes) to a user memory location specified by the op.
>>>
>>>
>>> Linked to another op, this can generate an in-memory notification
>>> useful for busy-waiters or the UMWAIT instruction
>>>
>>> This would be useful for Seastar, which looks at a timer-managed
>>> memory location to check when to break computation loops.
>>
>> This one would indeed be trivial to do. If we limit the max size
>> supported to eg 8 bytes like suggested, then it could be in the sqe
>> itself and just copied to the user address specified.
>>
>> Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination
>> address, and sqe->off the data to copy.
>>
>> If you'll commit to testing this, I can hack it up pretty quickly...
>>
>>> - IORING_OP_MEMCPY - asynchronously copy memory
>>>
>>>
>>> Some CPUs include a DMA engine, and io_uring is a perfect interface to
>>> exercise it. It may be difficult to find space for two iovecs though.
>>
>> I've considered this one in the past too, and it is indeed an ideal fit
>> in terms of API. Outside of the DMA engines, it can also be used to
>> offload memcpy to a GPU, for example.
>>
>> The io_uring side would not be hard to wire up, basically just have the
>> sqe specfy source, destination, length. Add some well defined flags
>> depending on what the copy engine offers, for example.
>>
>> But probably some work required here in exposing an API and testing
>> etc...
>>
> 
> I'm about to send a set of patches to associate an io_uring with a
> dmaengine channel to this list. I'm not necessarily thinking of using
> it to directly drive the DMA engine itself (although we could, and
> there's some nice advantages to that), but rather as a way to offload
> data copies/transforms on existing io_uring operations. My primary
> focus has been the copy between kernel and user space when receiving
> from a socket.

Interesting - I think both uses cases are actually valid, offloading a
memcpy or using the engine to copy the data of an existing operation.

> Upcoming DMA engines also support SVA, allowing them to copy from
> kernel to user without page pinning. We've got patches for full SVA
> enabling in dmaengine prepared, such that each io_uring can allocate a
> PASID describing the user+kernel address space for the current
> context, allocate a channel via dmaengine and assign it the PASID, and
> then do DMA between kernel/user with new dmaengine APIs without any
> page pinning.
> 
> As preparation, I have submitted a series to dmaengine that allows for
> polling and out-of-order completions. See
> https://lore.kernel.org/dmaengine/[email protected]/T/#u.
> This is a necessary first step.
> 
> I'll get the patches out ASAP as an RFC. I'm sure my approach was
> entirely wrong, but you'll get the idea.

Please do, this sounds exciting! The whole point of an RFC is to get
some feedback on initial design before it potentially goes too far down
the wrong path.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-22 12:52 ` Hao Xu
  2022-04-22 13:24   ` Hao Xu
  2022-04-22 13:38   ` Jens Axboe
@ 2022-04-23 16:14   ` Avi Kivity
  2 siblings, 0 replies; 22+ messages in thread
From: Avi Kivity @ 2022-04-23 16:14 UTC (permalink / raw)
  To: Hao Xu, io-uring


On 22/04/2022 15.52, Hao Xu wrote:
> Hi Avi,
> On 4/13/22 6:33 PM, Avi Kivity wrote:
>> Unfortunately, only ideas, no patches. But at least the first seems 
>> very easy.
>>
>>
>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op 
>> itself (1-8 bytes) to a user memory location specified by the op.
>>
>>
>> Linked to another op, this can generate an in-memory notification 
>> useful for busy-waiters or the UMWAIT instruction
>>
>>
>> This would be useful for Seastar, which looks at a timer-managed 
>> memory location to check when to break computation loops.
>>
>>
>> - IORING_OP_MEMCPY - asynchronously copy memory
>>
>>
>> Some CPUs include a DMA engine, and io_uring is a perfect interface 
>> to exercise it. It may be difficult to find space for two iovecs though.
>
> I have a question about the 'DMA' here, do you mean DMA device for
> memory copy?


Yes. I understand some Intel server processors have them.


> My understanding is you want async memcpy so that the
> cpu can relax when the specific hardware is doing memory copy. the
> thing is for cases like busy waiting or UMAIT, the length of the memory
> to be copied is usually small(otherwise we don't use busy waiting or
> UMAIT, right?). Then making it async by io_uring's iowq may introduce
> much more overhead(the context switch).


These are two separate cases.


1. Bulk data copies (usually large): use DMA.

2. Small memory writes to wake up a thread that is using UMWAIT or
busy-polling: do not use DMA.


Whether to use DMA or not can be based on the write size. I'd say anything
<= 512 bytes should be done by the CPU.
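
Something like the following is what I have in mind, as a sketch only
(submit_dma_copy() is a stand-in for whatever async interface ends up
existing, and the 512-byte cutoff is just a guess to be tuned):

#include <stddef.h>
#include <string.h>

#define DMA_COPY_THRESHOLD	512	/* guess; tune by benchmarking */

/* Stand-in: pretend to queue to a DMA engine.  A real version would
 * submit an async op (e.g. a future io_uring memcpy) and complete later.
 */
static int submit_dma_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return 1;
}

/* Returns 0 if the copy was done inline by the CPU, 1 if it was queued. */
static int copy_maybe_offload(void *dst, const void *src, size_t len)
{
	if (len <= DMA_COPY_THRESHOLD) {
		memcpy(dst, src, len);	/* small: cheaper on the CPU */
		return 0;
	}
	return submit_dma_copy(dst, src, len);	/* large: hand to the engine */
}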



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-22 14:50 ` Jens Axboe
  2022-04-22 15:03   ` Jens Axboe
  2022-04-22 20:03   ` Walker, Benjamin
@ 2022-04-23 16:23   ` Avi Kivity
  2022-04-23 17:30     ` Jens Axboe
  2 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2022-04-23 16:23 UTC (permalink / raw)
  To: Jens Axboe, io-uring


On 22/04/2022 17.50, Jens Axboe wrote:
> On 4/13/22 4:33 AM, Avi Kivity wrote:
>> Unfortunately, only ideas, no patches. But at least the first seems very easy.
>>
>>
>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op
>> itself (1-8 bytes) to a user memory location specified by the op.
>>
>>
>> Linked to another op, this can generate an in-memory notification
>> useful for busy-waiters or the UMWAIT instruction
>>
>> This would be useful for Seastar, which looks at a timer-managed
>> memory location to check when to break computation loops.
> This one would indeed be trivial to do. If we limit the max size
> supported to eg 8 bytes like suggested, then it could be in the sqe
> itself and just copied to the user address specified.
>
> Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination
> address, and sqe->off the data to copy.
>
> If you'll commit to testing this, I can hack it up pretty quickly...


I can certainly commit to testing it in a VM (my workstation has a
hate-hate relationship with custom kernels).


>> - IORING_OP_MEMCPY - asynchronously copy memory
>>
>>
>> Some CPUs include a DMA engine, and io_uring is a perfect interface to
>> exercise it. It may be difficult to find space for two iovecs though.
> I've considered this one in the past too, and it is indeed an ideal fit
> in terms of API. Outside of the DMA engines, it can also be used to
> offload memcpy to a GPU, for example.
>
> The io_uring side would not be hard to wire up, basically just have the
> sqe specfy source, destination, length. Add some well defined flags
> depending on what the copy engine offers, for example.
>
> But probably some work required here in exposing an API and testing
> etc...


Perhaps the interface should be kept separate from io_uring, e.g. use a
pidfd to represent the address space, and then issue
IORING_OP_PREADV/IORING_OP_PWRITEV to initiate DMA. Then one can copy
across process boundaries.


A different angle is to expose the DMA device as a separate fd. This
can be useful as DMA engines can often do other operations, like XOR,
CRC, encryption, or compression. In any case I'd argue for the
interface to be useful outside io_uring, although that considerably
increases the scope. I also don't have a direct use case for it, though
I'm sure others will.


The kernel itself should find the DMA engine useful for things like 
memory compaction.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-22 15:03   ` Jens Axboe
@ 2022-04-23 16:30     ` Avi Kivity
  2022-04-23 17:32       ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2022-04-23 16:30 UTC (permalink / raw)
  To: Jens Axboe, io-uring


On 22/04/2022 18.03, Jens Axboe wrote:
> On 4/22/22 8:50 AM, Jens Axboe wrote:
>> On 4/13/22 4:33 AM, Avi Kivity wrote:
>>> Unfortunately, only ideas, no patches. But at least the first seems very easy.
>>>
>>>
>>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op
>>> itself (1-8 bytes) to a user memory location specified by the op.
>>>
>>>
>>> Linked to another op, this can generate an in-memory notification
>>> useful for busy-waiters or the UMWAIT instruction
>>>
>>> This would be useful for Seastar, which looks at a timer-managed
>>> memory location to check when to break computation loops.
>> This one would indeed be trivial to do. If we limit the max size
>> supported to eg 8 bytes like suggested, then it could be in the sqe
>> itself and just copied to the user address specified.
>>
>> Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination
>> address, and sqe->off the data to copy.
>>
>> If you'll commit to testing this, I can hack it up pretty quickly...
> Something like this, totally untested. Maybe the return value should be
> bytes copied?


Yes. It could be less than what the user expected, unless we enforce 
alignment (perhaps we should).


> Just returns 0/error right now.
>
> Follows the above convention.
>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 2052a796436c..d2a95f9d9d2d 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -586,6 +586,13 @@ struct io_socket {
>   	unsigned long			nofile;
>   };
>   
> +struct io_mem {
> +	struct file			*file;
> +	u64				value;
> +	void __user			*dest;
> +	u32				len;
> +};
> +


>   struct io_sync {
>   	struct file			*file;
>   	loff_t				len;
> @@ -962,6 +969,7 @@ struct io_kiocb {
>   		struct io_msg		msg;
>   		struct io_xattr		xattr;
>   		struct io_socket	sock;
> +		struct io_mem		mem;
>   	};
>   
>   	u8				opcode;
> @@ -1231,16 +1239,19 @@ static const struct io_op_def io_op_defs[] = {
>   		.needs_file		= 1,
>   	},
>   	[IORING_OP_FSETXATTR] = {
> -		.needs_file = 1
> +		.needs_file		= 1,
>   	},
>   	[IORING_OP_SETXATTR] = {},
>   	[IORING_OP_FGETXATTR] = {
> -		.needs_file = 1
> +		.needs_file		= 1,
>   	},
>   	[IORING_OP_GETXATTR] = {},
>   	[IORING_OP_SOCKET] = {
>   		.audit_skip		= 1,
>   	},
> +	[IORING_OP_MEMCPY_IMM] = {
> +		.audit_skip		= 1,
> +	},
>   };
>   
>   /* requests with any of those set should undergo io_disarm_next() */
> @@ -5527,6 +5538,38 @@ static int io_sync_file_range(struct io_kiocb *req, unsigned int issue_flags)
>   	return 0;
>   }
>   
> +static int io_memcpy_imm_prep(struct io_kiocb *req,
> +			      const struct io_uring_sqe *sqe)
> +{
> +	struct io_mem *mem = &req->mem;
> +
> +	if (unlikely(sqe->ioprio || sqe->rw_flags || sqe->buf_index ||
> +		     sqe->splice_fd_in))
> +		return -EINVAL;
> +
> +	mem->value = READ_ONCE(sqe->off);
> +	mem->dest = u64_to_user_ptr(READ_ONCE(sqe->addr));
> +	mem->len = READ_ONCE(sqe->len);
> +	if (!mem->len || mem->len > sizeof(u64))
> +		return -EINVAL;
> +


I'd also check that the length is a power-of-two to avoid having to deal 
with weird sizes if we later find it inconvenient.


> +	return 0;
> +}
> +
> +static int io_memcpy_imm(struct io_kiocb *req, unsigned int issue_flags)
> +{
> +	struct io_mem *mem = &req->mem;
> +	int ret = 0;
> +
> +	if (copy_to_user(mem->dest, &mem->value, mem->len))
> +		ret = -EFAULT;
> +


Is copy_to_user efficient for tiny sizes? Or is it better to use a 
switch and put_user()?


I guess copy_to_user saves us from having to consider endianness.


> +	if (ret < 0)
> +		req_set_fail(req);
> +	io_req_complete(req, ret);
> +	return 0;
> +}
> +
>   #if defined(CONFIG_NET)
>   static bool io_net_retry(struct socket *sock, int flags)
>   {
> @@ -7494,6 +7537,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>   		return io_getxattr_prep(req, sqe);
>   	case IORING_OP_SOCKET:
>   		return io_socket_prep(req, sqe);
> +	case IORING_OP_MEMCPY_IMM:
> +		return io_memcpy_imm_prep(req, sqe);
>   	}
>   
>   	printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
> @@ -7815,6 +7860,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>   	case IORING_OP_SOCKET:
>   		ret = io_socket(req, issue_flags);
>   		break;
> +	case IORING_OP_MEMCPY_IMM:
> +		ret = io_memcpy_imm(req, issue_flags);
> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 5fb52bf32435..853f00a2bddd 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -152,6 +152,7 @@ enum {
>   	IORING_OP_FGETXATTR,
>   	IORING_OP_GETXATTR,
>   	IORING_OP_SOCKET,
> +	IORING_OP_MEMCPY_IMM,
>   
>   	/* this goes last, obviously */
>   	IORING_OP_LAST,
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-23 16:23   ` Avi Kivity
@ 2022-04-23 17:30     ` Jens Axboe
  2022-04-24 13:04       ` Avi Kivity
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2022-04-23 17:30 UTC (permalink / raw)
  To: Avi Kivity, io-uring

On 4/23/22 10:23 AM, Avi Kivity wrote:
> Perhaps the interface should be kept separate from io_uring. e.g. use
> a pidfd to represent the address space, and then issue
> IORING_OP_PREADV/IORING_OP_PWRITEV to initiate dma. Then one can copy
> across process boundaries.

Then you just made it a ton less efficient, particularly if you used the
vectored read/write. For this to make sense, I think it has to be a
separate op. At least that's the only implementation I'd be willing to
entertain for the immediate copy.

> A different angle is to use expose the dma device as a separate fd.
> This can be useful as dma engine can often do other operations, like
> xor or crc or encryption or compression. In any case I'd argue for the
> interface to be useful outside io_uring, although that considerably
> increases the scope. I also don't have a direct use case for it,
> though I'm sure others will.

I'd say that whoever does it gets to at least dictate the initial
implementation. For outside of io_uring, you're looking at a sync
interface, which I think already exists for this (ioctls?).

> The kernel itself should find the DMA engine useful for things like
> memory compaction.

That's a very different use case though and just deals with wiring it up
internally.

Let's try and keep the scope here reasonable, imho nothing good comes
out of attempting to do all the things at once.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-23 16:30     ` Avi Kivity
@ 2022-04-23 17:32       ` Jens Axboe
  2022-04-23 18:02         ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2022-04-23 17:32 UTC (permalink / raw)
  To: Avi Kivity, io-uring

On 4/23/22 10:30 AM, Avi Kivity wrote:
> 
> On 22/04/2022 18.03, Jens Axboe wrote:
>> On 4/22/22 8:50 AM, Jens Axboe wrote:
>>> On 4/13/22 4:33 AM, Avi Kivity wrote:
>>>> Unfortunately, only ideas, no patches. But at least the first seems very easy.
>>>>
>>>>
>>>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op
>>>> itself (1-8 bytes) to a user memory location specified by the op.
>>>>
>>>>
>>>> Linked to another op, this can generate an in-memory notification
>>>> useful for busy-waiters or the UMWAIT instruction
>>>>
>>>> This would be useful for Seastar, which looks at a timer-managed
>>>> memory location to check when to break computation loops.
>>> This one would indeed be trivial to do. If we limit the max size
>>> supported to eg 8 bytes like suggested, then it could be in the sqe
>>> itself and just copied to the user address specified.
>>>
>>> Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination
>>> address, and sqe->off the data to copy.
>>>
>>> If you'll commit to testing this, I can hack it up pretty quickly...
>> Something like this, totally untested. Maybe the return value should be
>> bytes copied?
> 
> 
> Yes. It could be less than what the user expected, unless we enforce
> alignment (perhaps we should).

Yes, this is just a quick hack. I did make some changes after that,
notably collapsing it into IORING_OP_MEMCPY and having a flag for
immediate mode or not.

>>   +static int io_memcpy_imm_prep(struct io_kiocb *req,
>> +                  const struct io_uring_sqe *sqe)
>> +{
>> +    struct io_mem *mem = &req->mem;
>> +
>> +    if (unlikely(sqe->ioprio || sqe->rw_flags || sqe->buf_index ||
>> +             sqe->splice_fd_in))
>> +        return -EINVAL;
>> +
>> +    mem->value = READ_ONCE(sqe->off);
>> +    mem->dest = u64_to_user_ptr(READ_ONCE(sqe->addr));
>> +    mem->len = READ_ONCE(sqe->len);
>> +    if (!mem->len || mem->len > sizeof(u64))
>> +        return -EINVAL;
>> +
> 
> 
> I'd also check that the length is a power-of-two to avoid having to
> deal with weird sizes if we later find it inconvenient.

Yes, that's a good idea.

>> +    return 0;
>> +}
>> +
>> +static int io_memcpy_imm(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +    struct io_mem *mem = &req->mem;
>> +    int ret = 0;
>> +
>> +    if (copy_to_user(mem->dest, &mem->value, mem->len))
>> +        ret = -EFAULT;
>> +
> 
> 
> Is copy_to_user efficient for tiny sizes? Or is it better to use a
> switch and put_user()?
> 
> 
> I guess copy_to_user saves us from having to consider endianness.

I was considering that too, definitely something that should be
investigated. Making it a 1/2/4/8 switch and using put_user() is
probably a better idea. Easy enough to benchmark.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-23 17:32       ` Jens Axboe
@ 2022-04-23 18:02         ` Jens Axboe
  2022-04-23 18:11           ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2022-04-23 18:02 UTC (permalink / raw)
  To: Avi Kivity, io-uring

On 4/23/22 11:32 AM, Jens Axboe wrote:
>> I guess copy_to_user saves us from having to consider endianness.
> 
> I was considering that too, definitely something that should be
> investigated. Making it a 1/2/4/8 switch and using put_user() is
> probably a better idea. Easy enough to benchmark.

FWIW, this is the current version. Some quick benchmarking doesn't show
any difference between copy_to_user and put_user, but that may depend on
the arch as well (using aarch64). But we might as well use put_user() and
combine it with the length check, so we explicitly only support 1/2/4/8
sizes.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2052a796436c..3b94cb4b67ed 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -586,6 +586,14 @@ struct io_socket {
 	unsigned long			nofile;
 };
 
+struct io_mem {
+	struct file			*file;
+	u64				value;
+	void __user			*dest;
+	u32				len;
+	u32				flags;
+};
+
 struct io_sync {
 	struct file			*file;
 	loff_t				len;
@@ -962,6 +970,7 @@ struct io_kiocb {
 		struct io_msg		msg;
 		struct io_xattr		xattr;
 		struct io_socket	sock;
+		struct io_mem		mem;
 	};
 
 	u8				opcode;
@@ -1231,16 +1240,19 @@ static const struct io_op_def io_op_defs[] = {
 		.needs_file		= 1,
 	},
 	[IORING_OP_FSETXATTR] = {
-		.needs_file = 1
+		.needs_file		= 1,
 	},
 	[IORING_OP_SETXATTR] = {},
 	[IORING_OP_FGETXATTR] = {
-		.needs_file = 1
+		.needs_file		= 1,
 	},
 	[IORING_OP_GETXATTR] = {},
 	[IORING_OP_SOCKET] = {
 		.audit_skip		= 1,
 	},
+	[IORING_OP_MEMCPY] = {
+		.audit_skip		= 1,
+	},
 };
 
 /* requests with any of those set should undergo io_disarm_next() */
@@ -5527,6 +5539,71 @@ static int io_sync_file_range(struct io_kiocb *req, unsigned int issue_flags)
 	return 0;
 }
 
+static int io_memcpy_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_mem *mem = &req->mem;
+
+	if (unlikely(sqe->ioprio || sqe->buf_index || sqe->splice_fd_in))
+		return -EINVAL;
+
+	mem->value = READ_ONCE(sqe->off);
+	mem->dest = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	mem->len = READ_ONCE(sqe->len);
+	if (!mem->len || mem->len > sizeof(u64))
+		return -EINVAL;
+
+	mem->flags = READ_ONCE(sqe->memcpy_flags);
+	if (mem->flags & ~IORING_MEMCPY_IMM)
+		return -EINVAL;
+
+	/* only supports immediate mode for now */
+	if (!(mem->flags & IORING_MEMCPY_IMM))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int io_memcpy(struct io_kiocb *req)
+{
+	struct io_mem *mem = &req->mem;
+	int ret = mem->len;
+
+	switch (mem->len) {
+	case 1: {
+		u8 val = mem->value;
+		if (put_user(val, (u8 *) mem->dest))
+			ret = -EFAULT;
+		break;
+		}
+	case 2: {
+		u16 val = mem->value;
+		if (put_user(val, (u16 *) mem->dest))
+			ret = -EFAULT;
+		break;
+		}
+	case 4: {
+		u32 val = mem->value;
+		if (put_user(val, (u32 *) mem->dest))
+			ret = -EFAULT;
+		break;
+		}
+	case 8: {
+		u64 val = mem->value;
+		if (put_user(val, (u64 *) mem->dest))
+			ret = -EFAULT;
+		break;
+		}
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	if (ret < 0)
+		req_set_fail(req);
+	io_req_complete(req, ret);
+	return 0;
+}
+
 #if defined(CONFIG_NET)
 static bool io_net_retry(struct socket *sock, int flags)
 {
@@ -7494,6 +7571,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		return io_getxattr_prep(req, sqe);
 	case IORING_OP_SOCKET:
 		return io_socket_prep(req, sqe);
+	case IORING_OP_MEMCPY:
+		return io_memcpy_prep(req, sqe);
 	}
 
 	printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -7815,6 +7894,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	case IORING_OP_SOCKET:
 		ret = io_socket(req, issue_flags);
 		break;
+	case IORING_OP_MEMCPY:
+		ret = io_memcpy(req);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 5fb52bf32435..9e69d70a3b5b 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -46,6 +46,7 @@ struct io_uring_sqe {
 		__u32		unlink_flags;
 		__u32		hardlink_flags;
 		__u32		xattr_flags;
+		__u32		memcpy_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -152,6 +153,7 @@ enum {
 	IORING_OP_FGETXATTR,
 	IORING_OP_GETXATTR,
 	IORING_OP_SOCKET,
+	IORING_OP_MEMCPY,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -206,6 +208,14 @@ enum {
 #define IORING_ASYNC_CANCEL_FD	(1U << 1)
 #define IORING_ASYNC_CANCEL_ANY	(1U << 2)
 
+/*
+ * IORING_OP_MEMCPY flags.
+ *
+ * IORING_MEMCPY_IMM		Immediate copy. 'off' contains an immediate
+ *				value. If not set, 'off' is a source address.
+ */
+#define IORING_MEMCPY_IMM	(1U << 0)
+
 /*
  * IO completion data structure (Completion Queue Entry)
  */

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-23 18:02         ` Jens Axboe
@ 2022-04-23 18:11           ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2022-04-23 18:11 UTC (permalink / raw)
  To: Avi Kivity, io-uring

On 4/23/22 12:02 PM, Jens Axboe wrote:
> On 4/23/22 11:32 AM, Jens Axboe wrote:
>>> I guess copy_to_user saves us from having to consider endianness.
>>
>> I was considering that too, definitely something that should be
>> investigated. Making it a 1/2/4/8 switch and using put_user() is
>> probably a better idea. Easy enough to benchmark.
> 
> FWIW, this is the current version. Some quick benchmarking doesn't show
> any difference between copy_to_user and put_user, but that may depend on
> the arch as well (using aarch64). But we might as well use put user and
> combine it with the length check, so we explicitly only support 1/2/4/8
> sizes.

In terms of performance, on this laptop, I can do about 36-37M NOP
requests per second. If I use IORING_OP_MEMCPY with immediate mode, it's
around 15M ops/sec. This is regardless of the size; I get about the same
whether it's 1-byte or 8-byte memory writes.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-23 17:30     ` Jens Axboe
@ 2022-04-24 13:04       ` Avi Kivity
  2022-04-24 13:30         ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2022-04-24 13:04 UTC (permalink / raw)
  To: Jens Axboe, io-uring


On 23/04/2022 20.30, Jens Axboe wrote:
> On 4/23/22 10:23 AM, Avi Kivity wrote:
>> Perhaps the interface should be kept separate from io_uring. e.g. use
>> a pidfd to represent the address space, and then issue
>> IORING_OP_PREADV/IORING_OP_PWRITEV to initiate dma. Then one can copy
>> across process boundaries.
> Then you just made it a ton less efficient, particularly if you used the
> vectored read/write. For this to make sense, I think it has to be a
> separate op. At least that's the only implementation I'd be willing to
> entertain for the immediate copy.


Sorry, I caused a lot of confusion by bundling immediate copy and a DMA 
engine interface. For sure the immediate copy should be a direct 
implementation like you posted!


User-to-user copies are another matter. I feel like that should be a 
stand-alone driver, and that io_uring should be an io_uring-y way to 
access it. Just like io_uring isn't an NVMe driver.


>> A different angle is to use expose the dma device as a separate fd.
>> This can be useful as dma engine can often do other operations, like
>> xor or crc or encryption or compression. In any case I'd argue for the
>> interface to be useful outside io_uring, although that considerably
>> increases the scope. I also don't have a direct use case for it,
>> though I'm sure others will.
> I'd say that whoever does it get to at least dictate the initial
> implementation.


Of course, but bikeshedding from the sidelines never hurt anyone.


>   For outside of io_uring, you're looking at a sync
> interface, which I think already exists for this (ioctls?).


Yes, it would be a asynchronous interface. I don't know if one exists, 
but I can't claim to have kept track.


>
>> The kernel itself should find the DMA engine useful for things like
>> memory compaction.
> That's a very different use case though and just deals with wiring it up
> internally.
>
> Let's try and keep the scope here reasonable, imho nothing good comes
> out of attempting to do all the things at once.
>

For sure, I'm just noting that the DMA engine has many different uses 
and so deserves an interface that is untied to io_uring.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-24 13:04       ` Avi Kivity
@ 2022-04-24 13:30         ` Jens Axboe
  2022-04-24 14:56           ` Avi Kivity
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2022-04-24 13:30 UTC (permalink / raw)
  To: Avi Kivity, io-uring

On 4/24/22 7:04 AM, Avi Kivity wrote:
> 
> On 23/04/2022 20.30, Jens Axboe wrote:
>> On 4/23/22 10:23 AM, Avi Kivity wrote:
>>> Perhaps the interface should be kept separate from io_uring. e.g. use
>>> a pidfd to represent the address space, and then issue
>>> IORING_OP_PREADV/IORING_OP_PWRITEV to initiate dma. Then one can copy
>>> across process boundaries.
>> Then you just made it a ton less efficient, particularly if you used the
>> vectored read/write. For this to make sense, I think it has to be a
>> separate op. At least that's the only implementation I'd be willing to
>> entertain for the immediate copy.
> 
> 
> Sorry, I caused a lot of confusion by bundling immediate copy and a
> DMA engine interface. For sure the immediate copy should be a direct
> implementation like you posted!
>
> User-to-user copies are another matter. I feel like that should be a
> stand-alone driver, and that io_uring should be an io_uring-y way to
> access it. Just like io_uring isn't an NVMe driver.

I'm not sure I understand your logic here or the io_uring vs NVMe driver
reference, to be honest. io_uring _is_ a standalone way to access it,
you can use it sync or async through that.

If you're talking about a standalone op vs being useful from a command
itself, I do think both have merit and I can see good use cases for
both.

>>   For outside of io_uring, you're looking at a sync
>> interface, which I think already exists for this (ioctls?).
> 
> 
> Yes, it would be a asynchronous interface. I don't know if one exists,
> but I can't claim to have kept track.

Again not following. So you're saying there should be a 2nd async
interface for it?

>>> The kernel itself should find the DMA engine useful for things like
>>> memory compaction.
>> That's a very different use case though and just deals with wiring it up
>> internally.
>>
>> Let's try and keep the scope here reasonable, imho nothing good comes
>> out of attempting to do all the things at once.
>>
> 
> For sure, I'm just noting that the DMA engine has many different uses
> and so deserves an interface that is untied to io_uring.

And again, not following, what's the point of having 2 interfaces for
the same thing? I can sort of agree if one is just the basic ioctl kind
of interface, a basic sync one. But outside of that I'm a bit puzzled as
to why that would be useful at all.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-24 13:30         ` Jens Axboe
@ 2022-04-24 14:56           ` Avi Kivity
  2022-04-25  0:45             ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2022-04-24 14:56 UTC (permalink / raw)
  To: Jens Axboe, io-uring


On 24/04/2022 16.30, Jens Axboe wrote:
> On 4/24/22 7:04 AM, Avi Kivity wrote:
>> On 23/04/2022 20.30, Jens Axboe wrote:
>>> On 4/23/22 10:23 AM, Avi Kivity wrote:
>>>> Perhaps the interface should be kept separate from io_uring. e.g. use
>>>> a pidfd to represent the address space, and then issue
>>>> IORING_OP_PREADV/IORING_OP_PWRITEV to initiate dma. Then one can copy
>>>> across process boundaries.
>>> Then you just made it a ton less efficient, particularly if you used the
>>> vectored read/write. For this to make sense, I think it has to be a
>>> separate op. At least that's the only implementation I'd be willing to
>>> entertain for the immediate copy.
>>
>> Sorry, I caused a lot of confusion by bundling immediate copy and a
>> DMA engine interface. For sure the immediate copy should be a direct
>> implementation like you posted!
>>
>> User-to-user copies are another matter. I feel like that should be a
>> stand-alone driver, and that io_uring should be an io_uring-y way to
>> access it. Just like io_uring isn't an NVMe driver.
> Not sure I understand your logic here or the io_uring vs nvme driver
> reference, to be honest. io_uring _is_ a standalone way to access it,
> you can use it sync or async through that.
>
> If you're talking about a standalone op vs being useful from a command
> itself, I do think both have merit and I can see good use cases for
> both.


I'm saying that if DMA is exposed to userspace, it should have a regular
synchronous interface (maybe open("/dev/dma"), maybe something else).
io_uring adds asynchrony to everything, but it's not everything's driver.


Anyway maybe we drifted off somewhere and this should be decided by 
pragmatic concerns (like whatever the author of the driver prefers).


>
>>>    For outside of io_uring, you're looking at a sync
>>> interface, which I think already exists for this (ioctls?).
>>
>> Yes, it would be a asynchronous interface. I don't know if one exists,
>> but I can't claim to have kept track.
> Again not following. So you're saying there should be a 2nd async
> interface for it?


No. And I misspelled "synchronous" as "asynchronous" (I was agreeing 
with you that it would be a sync interface).


>
>>>> The kernel itself should find the DMA engine useful for things like
>>>> memory compaction.
>>> That's a very different use case though and just deals with wiring it up
>>> internally.
>>>
>>> Let's try and keep the scope here reasonable, imho nothing good comes
>>> out of attempting to do all the things at once.
>>>
>> For sure, I'm just noting that the DMA engine has many different uses
>> and so deserves an interface that is untied to io_uring.
> And again, not following, what's the point of having 2 interfaces for
> the same thing? I can sort of agree if one is just the basic ioctl kind
> of interface, a basic sync one. But outside of that I'm a bit puzzled as
> to why that would be useful at all.
>

Yes I meant the basic sync one. Sorry I caused quite a lot of confusion 
here!



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-24 14:56           ` Avi Kivity
@ 2022-04-25  0:45             ` Jens Axboe
  2022-04-25 18:05               ` Walker, Benjamin
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2022-04-25  0:45 UTC (permalink / raw)
  To: Avi Kivity, io-uring

On 4/24/22 8:56 AM, Avi Kivity wrote:
> 
> On 24/04/2022 16.30, Jens Axboe wrote:
>> On 4/24/22 7:04 AM, Avi Kivity wrote:
>>> On 23/04/2022 20.30, Jens Axboe wrote:
>>>> On 4/23/22 10:23 AM, Avi Kivity wrote:
>>>>> Perhaps the interface should be kept separate from io_uring. e.g. use
>>>>> a pidfd to represent the address space, and then issue
>>>>> IORING_OP_PREADV/IORING_OP_PWRITEV to initiate dma. Then one can copy
>>>>> across process boundaries.
>>>> Then you just made it a ton less efficient, particularly if you used the
>>>> vectored read/write. For this to make sense, I think it has to be a
>>>> separate op. At least that's the only implementation I'd be willing to
>>>> entertain for the immediate copy.
>>>
>>> Sorry, I caused a lot of confusion by bundling immediate copy and a
>>> DMA engine interface. For sure the immediate copy should be a direct
>>> implementation like you posted!
>>>
>>> User-to-user copies are another matter. I feel like that should be a
>>> stand-alone driver, and that io_uring should be an io_uring-y way to
>>> access it. Just like io_uring isn't an NVMe driver.
>> Not sure I understand your logic here or the io_uring vs nvme driver
>> reference, to be honest. io_uring _is_ a standalone way to access it,
>> you can use it sync or async through that.
>>
>> If you're talking about a standalone op vs being useful from a command
>> itself, I do think both have merit and I can see good use cases for
>> both.
> 
> 
> I'm saying that if dma is exposed to userspace, it should have a
> regular synchronous interface (maybe open("/dev/dma"), maybe something
> else). io_uring adds asynchrony to everything, but it's not
> everything's driver.

Sure, my point is that if/when someone wants to add that, they should be
free to do so. It's not a fair requirement to put on someone doing the
initial work on wiring this up. It may not be something they would want
to use to begin with, and it's perfectly easy to run io_uring in sync
mode should you wish to do so. The hard part is making the
issue+complete separate actions; rolling a sync API on top of that would
be trivial.
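
E.g. with liburing and today's ops, the sync wrapper is just
submit-and-wait (rough sketch, minimal error handling):

#include <liburing.h>

/* Synchronously execute one already-prepared SQE and return its result. */
static int do_sync(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	int ret;

	ret = io_uring_submit(ring);
	if (ret < 0)
		return ret;
	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret < 0)
		return ret;
	ret = cqe->res;			/* the operation's own return value */
	io_uring_cqe_seen(ring, cqe);
	return ret;
}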

> Anyway maybe we drifted off somewhere and this should be decided by
> pragmatic concerns (like whatever the author of the driver prefers).

Indeed!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: memory access op ideas
  2022-04-25  0:45             ` Jens Axboe
@ 2022-04-25 18:05               ` Walker, Benjamin
  0 siblings, 0 replies; 22+ messages in thread
From: Walker, Benjamin @ 2022-04-25 18:05 UTC (permalink / raw)
  To: Jens Axboe, Avi Kivity, io-uring

On 4/24/2022 5:45 PM, Jens Axboe wrote:
> On 4/24/22 8:56 AM, Avi Kivity wrote:
>>
>> On 24/04/2022 16.30, Jens Axboe wrote:
>>> On 4/24/22 7:04 AM, Avi Kivity wrote:
>>>> On 23/04/2022 20.30, Jens Axboe wrote:
>>>>> On 4/23/22 10:23 AM, Avi Kivity wrote:
>>>>>> Perhaps the interface should be kept separate from io_uring. e.g. use
>>>>>> a pidfd to represent the address space, and then issue
>>>>>> IORING_OP_PREADV/IORING_OP_PWRITEV to initiate dma. Then one can copy
>>>>>> across process boundaries.
>>>>> Then you just made it a ton less efficient, particularly if you used the
>>>>> vectored read/write. For this to make sense, I think it has to be a
>>>>> separate op. At least that's the only implementation I'd be willing to
>>>>> entertain for the immediate copy.
>>>>
>>>> Sorry, I caused a lot of confusion by bundling immediate copy and a
>>>> DMA engine interface. For sure the immediate copy should be a direct
>>>> implementation like you posted!
>>>>
>>>> User-to-user copies are another matter. I feel like that should be a
>>>> stand-alone driver, and that io_uring should be an io_uring-y way to
>>>> access it. Just like io_uring isn't an NVMe driver.
>>> Not sure I understand your logic here or the io_uring vs nvme driver
>>> reference, to be honest. io_uring _is_ a standalone way to access it,
>>> you can use it sync or async through that.
>>>
>>> If you're talking about a standalone op vs being useful from a command
>>> itself, I do think both have merit and I can see good use cases for
>>> both.

I'm actually not so certain the model where io_uring has special
operations for driving DMA engines works out. I think in all cases you
can accomplish what you want by reading or writing to existing file
constructs, and just having those transparently offload to a DMA engine
on your behalf if one is available.

As a concrete example, let's take an inter-process copy. The main 
challenges with this one are the security model (who's allowed to copy 
where?) and synchronization between the two applications (when did the 
data change?).

Rather, I'd consider implementing the inter-process copy using an 
existing mechanism like a Unix domain socket. The sender maybe does a 
MSG_ZEROCOPY send via io_uring, and the receiver does an async recv, and 
the kernel can use a DMA engine to move the data directly between the 
two buffers if it has one available. Then you get the existing security
model and coordination, and software works whether there's a DMA engine 
available or not.
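
To make that concrete, the application side stays ordinary io_uring code,
e.g. the receiver is just an async recv (sketch; whether the copy into
'buf' goes through a DMA engine would be the kernel's decision, invisible
to the application):

#include <liburing.h>
#include <stdint.h>
#include <stddef.h>

/* Queue an async recv on a connected Unix domain socket.  If the kernel
 * has a DMA channel associated with this ring, the copy into 'buf' could
 * be offloaded transparently; this code does not change either way.
 */
static int queue_recv(struct io_uring *ring, int sockfd, void *buf, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	io_uring_prep_recv(sqe, sockfd, buf, len, 0);
	sqe->user_data = (uintptr_t) buf;	/* locate the buffer at completion */
	return io_uring_submit(ring);
}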

It's a similar story for copying to memory on a PCI device. You'd need 
some security model to decide you're allowed to copy there, which is 
probably best expressed by opening a file that represents that BAR and 
then doing reads/writes to it.

This is at least the direction I've been pursuing. The DMA engine 
channel is associated with the io_uring and the kernel just 
intelligently offloads whatever it can.

>>
>>
>> I'm saying that if dma is exposed to userspace, it should have a
>> regular synchronous interface (maybe open("/dev/dma"), maybe something
>> else). io_uring adds asynchrony to everything, but it's not
>> everything's driver.
> 
> Sure, my point is that if/when someone wants to add that, they should be
> free to do so. It's not a fair requirement to put on someone doing the
> initial work on wiring this up. It may not be something they would want
> to use to begin with, and it's perfectly easy to run io_uring in sync
> mode should you wish to do so. The hard part is making the
> issue+complete separate actions, rolling a sync API on top of that would
> be trivial.

Just FYI, the Intel idxd driver already has a userspace interface
that's async/poll-mode. Commands are submitted to an mmap'd portal using
the movdir64b/enqcmd instructions directly. It does not expose an fd you
can read/write to in order to trigger copies, so it is not compatible
with io_uring, but it doesn't really need to be, since it is already async.

What isn't currently exposed to userspace is access to the "dmaengine" 
framework. Prior to the patchset I have pending that I linked earlier in 
the thread, the "dmaengine" framework couldn't really operate in 
async/poll mode and handle out-of-order processing, etc. But after that 
series maybe.

> 
>> Anyway maybe we drifted off somewhere and this should be decided by
>> pragmatic concerns (like whatever the author of the driver prefers).
> 
> Indeed!
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-04-25 18:05 UTC | newest]

Thread overview: 22+ messages
2022-04-13 10:33 memory access op ideas Avi Kivity
2022-04-22 12:52 ` Hao Xu
2022-04-22 13:24   ` Hao Xu
2022-04-22 13:38   ` Jens Axboe
2022-04-23  7:19     ` Hao Xu
2022-04-23 16:14   ` Avi Kivity
2022-04-22 14:50 ` Jens Axboe
2022-04-22 15:03   ` Jens Axboe
2022-04-23 16:30     ` Avi Kivity
2022-04-23 17:32       ` Jens Axboe
2022-04-23 18:02         ` Jens Axboe
2022-04-23 18:11           ` Jens Axboe
2022-04-22 20:03   ` Walker, Benjamin
2022-04-23 10:19     ` Pavel Begunkov
2022-04-23 13:20     ` Jens Axboe
2022-04-23 16:23   ` Avi Kivity
2022-04-23 17:30     ` Jens Axboe
2022-04-24 13:04       ` Avi Kivity
2022-04-24 13:30         ` Jens Axboe
2022-04-24 14:56           ` Avi Kivity
2022-04-25  0:45             ` Jens Axboe
2022-04-25 18:05               ` Walker, Benjamin
