* memory access op ideas @ 2022-04-13 10:33 Avi Kivity 2022-04-22 12:52 ` Hao Xu 2022-04-22 14:50 ` Jens Axboe 0 siblings, 2 replies; 22+ messages in thread From: Avi Kivity @ 2022-04-13 10:33 UTC (permalink / raw) To: io-uring Unfortunately, only ideas, no patches. But at least the first seems very easy. - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op itself (1-8 bytes) to a user memory location specified by the op. Linked to another op, this can generate an in-memory notification useful for busy-waiters or the UMWAIT instruction This would be useful for Seastar, which looks at a timer-managed memory location to check when to break computation loops. - IORING_OP_MEMCPY - asynchronously copy memory Some CPUs include a DMA engine, and io_uring is a perfect interface to exercise it. It may be difficult to find space for two iovecs though. ^ permalink raw reply [flat|nested] 22+ messages in thread
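For context, the computation-loop pattern that the immediate-copy op is meant to feed could look roughly like the sketch below. This is only an illustration, not Seastar's actual code, and the need_preempt flag name is invented here; the point is that a linked op writing a single word to user memory is enough to break such a loop (or to satisfy a UMWAIT on that address):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Word that the kernel-side op (or a timer) would write to; the name is made up. */
    static _Atomic uint32_t need_preempt;

    static void compute_loop(void)
    {
            /* keep computing until something asks us to yield */
            while (!atomic_load_explicit(&need_preempt, memory_order_relaxed)) {
                    /* ... one unit of computation ... */
            }
            /* back to the reactor: reap completions, re-arm the timer, clear the flag */
            atomic_store_explicit(&need_preempt, 0, memory_order_relaxed);
    }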
* Re: memory access op ideas 2022-04-13 10:33 memory access op ideas Avi Kivity @ 2022-04-22 12:52 ` Hao Xu 2022-04-22 13:24 ` Hao Xu ` (2 more replies) 2022-04-22 14:50 ` Jens Axboe 1 sibling, 3 replies; 22+ messages in thread From: Hao Xu @ 2022-04-22 12:52 UTC (permalink / raw) To: Avi Kivity, io-uring Hi Avi, On 4/13/22 6:33 PM, Avi Kivity wrote: > Unfortunately, only ideas, no patches. But at least the first seems very > easy. > > > - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op > itself (1-8 bytes) to a user memory location specified by the op. > > > Linked to another op, this can generate an in-memory notification useful > for busy-waiters or the UMWAIT instruction > > > This would be useful for Seastar, which looks at a timer-managed memory > location to check when to break computation loops. > > > - IORING_OP_MEMCPY - asynchronously copy memory > > > Some CPUs include a DMA engine, and io_uring is a perfect interface to > exercise it. It may be difficult to find space for two iovecs though. I have a question about the 'DMA' here: do you mean a DMA device for memory copy? My understanding is that you want async memcpy so that the CPU can relax while the specific hardware is doing the memory copy. The thing is, for cases like busy waiting or UMWAIT, the length of the memory to be copied is usually small (otherwise we wouldn't use busy waiting or UMWAIT, right?). Then making it async via io_uring's io-wq may introduce much more overhead (the context switch). Regards, Hao > > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-22 12:52 ` Hao Xu @ 2022-04-22 13:24 ` Hao Xu 2022-04-22 13:38 ` Jens Axboe 2022-04-23 16:14 ` Avi Kivity 2 siblings, 0 replies; 22+ messages in thread From: Hao Xu @ 2022-04-22 13:24 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/22/22 8:52 PM, Hao Xu wrote: > Hi Avi, > On 4/13/22 6:33 PM, Avi Kivity wrote: >> Unfortunately, only ideas, no patches. But at least the first seems >> very easy. >> >> >> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op >> itself (1-8 bytes) to a user memory location specified by the op. >> >> >> Linked to another op, this can generate an in-memory notification >> useful for busy-waiters or the UMWAIT instruction >> >> >> This would be useful for Seastar, which looks at a timer-managed >> memory location to check when to break computation loops. >> >> >> - IORING_OP_MEMCPY - asynchronously copy memory >> >> >> Some CPUs include a DMA engine, and io_uring is a perfect interface >> to exercise it. It may be difficult to find space for two iovecs though. > > I have a question about the 'DMA' here: do you mean a DMA device for > memory copy? My understanding is that you want async memcpy so that the > CPU can relax while the specific hardware is doing the memory copy. The > thing is, for cases like busy waiting or UMWAIT, the length of the memory > to be copied is usually small (otherwise we wouldn't use busy waiting or > UMWAIT, right?). Then making it async via io_uring's io-wq may introduce > much more overhead (the context switch). OK, maybe we can leverage task work to avoid the context switch, but it is still unclear how much the overhead would be. > > Regards, > Hao > >> >> >> > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-22 12:52 ` Hao Xu 2022-04-22 13:24 ` Hao Xu @ 2022-04-22 13:38 ` Jens Axboe 2022-04-23 7:19 ` Hao Xu 2022-04-23 16:14 ` Avi Kivity 2 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2022-04-22 13:38 UTC (permalink / raw) To: Hao Xu, Avi Kivity, io-uring On 4/22/22 6:52 AM, Hao Xu wrote: > Hi Avi, > On 4/13/22 6:33 PM, Avi Kivity wrote: >> Unfortunately, only ideas, no patches. But at least the first seems very easy. >> >> >> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op itself (1-8 bytes) to a user memory location specified by the op. >> >> >> Linked to another op, this can generate an in-memory notification useful for busy-waiters or the UMWAIT instruction >> >> >> This would be useful for Seastar, which looks at a timer-managed memory location to check when to break computation loops. >> >> >> - IORING_OP_MEMCPY - asynchronously copy memory >> >> >> Some CPUs include a DMA engine, and io_uring is a perfect interface to exercise it. It may be difficult to find space for two iovecs though. > > I have a question about the 'DMA' here: do you mean a DMA device for > memory copy? My understanding is that you want async memcpy so that the > CPU can relax while the specific hardware is doing the memory copy. The > thing is, for cases like busy waiting or UMWAIT, the length of the memory > to be copied is usually small (otherwise we wouldn't use busy waiting or > UMWAIT, right?). Then making it async via io_uring's io-wq may introduce > much more overhead (the context switch). Nothing fast should use io-wq. But I'm not sure why you think this would need it: as long as you can start the operation in a sane fashion and get notified when done, why would it need io-wq? -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-22 13:38 ` Jens Axboe @ 2022-04-23 7:19 ` Hao Xu 0 siblings, 0 replies; 22+ messages in thread From: Hao Xu @ 2022-04-23 7:19 UTC (permalink / raw) To: Jens Axboe, Avi Kivity, io-uring On 4/22/22 9:38 PM, Jens Axboe wrote: > On 4/22/22 6:52 AM, Hao Xu wrote: >> Hi Avi, >> On 4/13/22 6:33 PM, Avi Kivity wrote: >>> Unfortunately, only ideas, no patches. But at least the first seems very easy. >>> >>> >>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op itself (1-8 bytes) to a user memory location specified by the op. >>> >>> >>> Linked to another op, this can generate an in-memory notification useful for busy-waiters or the UMWAIT instruction >>> >>> >>> This would be useful for Seastar, which looks at a timer-managed memory location to check when to break computation loops. >>> >>> >>> - IORING_OP_MEMCPY - asynchronously copy memory >>> >>> >>> Some CPUs include a DMA engine, and io_uring is a perfect interface to exercise it. It may be difficult to find space for two iovecs though. >> >> I have a question about the 'DMA' here: do you mean a DMA device for >> memory copy? My understanding is that you want async memcpy so that the >> CPU can relax while the specific hardware is doing the memory copy. The >> thing is, for cases like busy waiting or UMWAIT, the length of the memory >> to be copied is usually small (otherwise we wouldn't use busy waiting or >> UMWAIT, right?). Then making it async via io_uring's io-wq may introduce >> much more overhead (the context switch). > > Nothing fast should use io-wq. But I'm not sure why you think this would > need it: as long as you can start the operation in a sane fashion and > get notified when done, why would it need io-wq? Ah, yes, I forgot some basic logic while sending the email. > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-22 12:52 ` Hao Xu 2022-04-22 13:24 ` Hao Xu 2022-04-22 13:38 ` Jens Axboe @ 2022-04-23 16:14 ` Avi Kivity 2 siblings, 0 replies; 22+ messages in thread From: Avi Kivity @ 2022-04-23 16:14 UTC (permalink / raw) To: Hao Xu, io-uring On 22/04/2022 15.52, Hao Xu wrote: > Hi Avi, > On 4/13/22 6:33 PM, Avi Kivity wrote: >> Unfortunately, only ideas, no patches. But at least the first seems >> very easy. >> >> >> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op >> itself (1-8 bytes) to a user memory location specified by the op. >> >> >> Linked to another op, this can generate an in-memory notification >> useful for busy-waiters or the UMWAIT instruction >> >> >> This would be useful for Seastar, which looks at a timer-managed >> memory location to check when to break computation loops. >> >> >> - IORING_OP_MEMCPY - asynchronously copy memory >> >> >> Some CPUs include a DMA engine, and io_uring is a perfect interface >> to exercise it. It may be difficult to find space for two iovecs though. > > I have a question about the 'DMA' here: do you mean a DMA device for > memory copy? Yes. I understand some Intel server processors have them. > My understanding is that you want async memcpy so that the > CPU can relax while the specific hardware is doing the memory copy. The > thing is, for cases like busy waiting or UMWAIT, the length of the memory > to be copied is usually small (otherwise we wouldn't use busy waiting or > UMWAIT, right?). Then making it async via io_uring's io-wq may introduce > much more overhead (the context switch). These are two separate cases: 1. Bulk data copies (usually large): use DMA. 2. Small memory writes to wake up a thread that is using UMWAIT or busy-polling: do not use DMA. Whether to use DMA or not can be based on write size. I'd say anything <= 512 bytes should be done by the CPU. ^ permalink raw reply [flat|nested] 22+ messages in thread
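A sketch of the size-based dispatch described above, with the 512-byte cutoff taken from the message (a suggestion, not a measured threshold) and submit_dma_copy() standing in as a hypothetical placeholder for whatever engine interface ends up existing:

    #include <stddef.h>
    #include <string.h>

    #define CPU_COPY_CUTOFF 512     /* suggested cutoff, would need tuning per platform */

    /* hypothetical placeholder for a DMA-engine submission path */
    void submit_dma_copy(void *dst, const void *src, size_t len);

    static void copy_maybe_offload(void *dst, const void *src, size_t len)
    {
            if (len <= CPU_COPY_CUTOFF)
                    memcpy(dst, src, len);          /* small writes: cheaper on the CPU */
            else
                    submit_dma_copy(dst, src, len); /* bulk copies: offload to the engine */
    }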
* Re: memory access op ideas 2022-04-13 10:33 memory access op ideas Avi Kivity 2022-04-22 12:52 ` Hao Xu @ 2022-04-22 14:50 ` Jens Axboe 2022-04-22 15:03 ` Jens Axboe ` (2 more replies) 1 sibling, 3 replies; 22+ messages in thread From: Jens Axboe @ 2022-04-22 14:50 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/13/22 4:33 AM, Avi Kivity wrote: > Unfortunately, only ideas, no patches. But at least the first seems very easy. > > > - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op > itself (1-8 bytes) to a user memory location specified by the op. > > > Linked to another op, this can generate an in-memory notification > useful for busy-waiters or the UMWAIT instruction > > This would be useful for Seastar, which looks at a timer-managed > memory location to check when to break computation loops. This one would indeed be trivial to do. If we limit the max size supported to eg 8 bytes like suggested, then it could be in the sqe itself and just copied to the user address specified. Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination address, and sqe->off the data to copy. If you'll commit to testing this, I can hack it up pretty quickly... > - IORING_OP_MEMCPY - asynchronously copy memory > > > Some CPUs include a DMA engine, and io_uring is a perfect interface to > exercise it. It may be difficult to find space for two iovecs though. I've considered this one in the past too, and it is indeed an ideal fit in terms of API. Outside of the DMA engines, it can also be used to offload memcpy to a GPU, for example. The io_uring side would not be hard to wire up, basically just have the sqe specfy source, destination, length. Add some well defined flags depending on what the copy engine offers, for example. But probably some work required here in exposing an API and testing etc... -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-22 14:50 ` Jens Axboe @ 2022-04-22 15:03 ` Jens Axboe 2022-04-23 16:30 ` Avi Kivity 2022-04-22 20:03 ` Walker, Benjamin 2022-04-23 16:23 ` Avi Kivity 2 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2022-04-22 15:03 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/22/22 8:50 AM, Jens Axboe wrote: > On 4/13/22 4:33 AM, Avi Kivity wrote: >> Unfortunately, only ideas, no patches. But at least the first seems very easy. >> >> >> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op >> itself (1-8 bytes) to a user memory location specified by the op. >> >> >> Linked to another op, this can generate an in-memory notification >> useful for busy-waiters or the UMWAIT instruction >> >> This would be useful for Seastar, which looks at a timer-managed >> memory location to check when to break computation loops. > > This one would indeed be trivial to do. If we limit the max size > supported to eg 8 bytes like suggested, then it could be in the sqe > itself and just copied to the user address specified. > > Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination > address, and sqe->off the data to copy. > > If you'll commit to testing this, I can hack it up pretty quickly... Something like this, totally untested. Maybe the return value should be bytes copied? Just returns 0/error right now. Follows the above convention. diff --git a/fs/io_uring.c b/fs/io_uring.c index 2052a796436c..d2a95f9d9d2d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -586,6 +586,13 @@ struct io_socket { unsigned long nofile; }; +struct io_mem { + struct file *file; + u64 value; + void __user *dest; + u32 len; +}; + struct io_sync { struct file *file; loff_t len; @@ -962,6 +969,7 @@ struct io_kiocb { struct io_msg msg; struct io_xattr xattr; struct io_socket sock; + struct io_mem mem; }; u8 opcode; @@ -1231,16 +1239,19 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, }, [IORING_OP_FSETXATTR] = { - .needs_file = 1 + .needs_file = 1, }, [IORING_OP_SETXATTR] = {}, [IORING_OP_FGETXATTR] = { - .needs_file = 1 + .needs_file = 1, }, [IORING_OP_GETXATTR] = {}, [IORING_OP_SOCKET] = { .audit_skip = 1, }, + [IORING_OP_MEMCPY_IMM] = { + .audit_skip = 1, + }, }; /* requests with any of those set should undergo io_disarm_next() */ @@ -5527,6 +5538,38 @@ static int io_sync_file_range(struct io_kiocb *req, unsigned int issue_flags) return 0; } +static int io_memcpy_imm_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + struct io_mem *mem = &req->mem; + + if (unlikely(sqe->ioprio || sqe->rw_flags || sqe->buf_index || + sqe->splice_fd_in)) + return -EINVAL; + + mem->value = READ_ONCE(sqe->off); + mem->dest = u64_to_user_ptr(READ_ONCE(sqe->addr)); + mem->len = READ_ONCE(sqe->len); + if (!mem->len || mem->len > sizeof(u64)) + return -EINVAL; + + return 0; +} + +static int io_memcpy_imm(struct io_kiocb *req, unsigned int issue_flags) +{ + struct io_mem *mem = &req->mem; + int ret = 0; + + if (copy_to_user(mem->dest, &mem->value, mem->len)) + ret = -EFAULT; + + if (ret < 0) + req_set_fail(req); + io_req_complete(req, ret); + return 0; +} + #if defined(CONFIG_NET) static bool io_net_retry(struct socket *sock, int flags) { @@ -7494,6 +7537,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return io_getxattr_prep(req, sqe); case IORING_OP_SOCKET: return io_socket_prep(req, sqe); + case IORING_OP_MEMCPY_IMM: + return io_memcpy_imm_prep(req, sqe); } printk_once(KERN_WARNING "io_uring: unhandled opcode 
%d\n", @@ -7815,6 +7860,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) case IORING_OP_SOCKET: ret = io_socket(req, issue_flags); break; + case IORING_OP_MEMCPY_IMM: + ret = io_memcpy_imm(req, issue_flags); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 5fb52bf32435..853f00a2bddd 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -152,6 +152,7 @@ enum { IORING_OP_FGETXATTR, IORING_OP_GETXATTR, IORING_OP_SOCKET, + IORING_OP_MEMCPY_IMM, /* this goes last, obviously */ IORING_OP_LAST, -- Jens Axboe ^ permalink raw reply related [flat|nested] 22+ messages in thread
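For anyone wanting to try the patch above from userspace, a minimal liburing sketch might look like the following. It assumes the experimental IORING_OP_MEMCPY_IMM opcode and the sqe field convention from the patch (len = 1..8 bytes, addr = destination, off = payload), none of which exist in released headers, so the sqe is filled in by hand:

    #include <liburing.h>
    #include <stdint.h>
    #include <string.h>

    static int queue_imm_copy(struct io_uring *ring, uint64_t *dest, uint64_t value)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            if (!sqe)
                    return -EAGAIN;
            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode = IORING_OP_MEMCPY_IMM;     /* from the patch above, not upstream */
            sqe->fd = -1;                           /* no file is involved */
            sqe->addr = (unsigned long) dest;       /* destination user address */
            sqe->len = sizeof(*dest);               /* 1..8 bytes */
            sqe->off = value;                       /* payload carried in the sqe itself */
            return io_uring_submit(ring);
    }

Issued behind another request with IOSQE_IO_LINK, the completed write is exactly the kind of in-memory notification the busy-wait/UMWAIT case above needs.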
* Re: memory access op ideas 2022-04-22 15:03 ` Jens Axboe @ 2022-04-23 16:30 ` Avi Kivity 2022-04-23 17:32 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Avi Kivity @ 2022-04-23 16:30 UTC (permalink / raw) To: Jens Axboe, io-uring On 22/04/2022 18.03, Jens Axboe wrote: > On 4/22/22 8:50 AM, Jens Axboe wrote: >> On 4/13/22 4:33 AM, Avi Kivity wrote: >>> Unfortunately, only ideas, no patches. But at least the first seems very easy. >>> >>> >>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op >>> itself (1-8 bytes) to a user memory location specified by the op. >>> >>> >>> Linked to another op, this can generate an in-memory notification >>> useful for busy-waiters or the UMWAIT instruction >>> >>> This would be useful for Seastar, which looks at a timer-managed >>> memory location to check when to break computation loops. >> This one would indeed be trivial to do. If we limit the max size >> supported to eg 8 bytes like suggested, then it could be in the sqe >> itself and just copied to the user address specified. >> >> Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination >> address, and sqe->off the data to copy. >> >> If you'll commit to testing this, I can hack it up pretty quickly... > Something like this, totally untested. Maybe the return value should be > bytes copied? Yes. It could be less than what the user expected, unless we enforce alignment (perhaps we should). > Just returns 0/error right now. > > Follows the above convention. > > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index 2052a796436c..d2a95f9d9d2d 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -586,6 +586,13 @@ struct io_socket { > unsigned long nofile; > }; > > +struct io_mem { > + struct file *file; > + u64 value; > + void __user *dest; > + u32 len; > +}; > + > struct io_sync { > struct file *file; > loff_t len; > @@ -962,6 +969,7 @@ struct io_kiocb { > struct io_msg msg; > struct io_xattr xattr; > struct io_socket sock; > + struct io_mem mem; > }; > > u8 opcode; > @@ -1231,16 +1239,19 @@ static const struct io_op_def io_op_defs[] = { > .needs_file = 1, > }, > [IORING_OP_FSETXATTR] = { > - .needs_file = 1 > + .needs_file = 1, > }, > [IORING_OP_SETXATTR] = {}, > [IORING_OP_FGETXATTR] = { > - .needs_file = 1 > + .needs_file = 1, > }, > [IORING_OP_GETXATTR] = {}, > [IORING_OP_SOCKET] = { > .audit_skip = 1, > }, > + [IORING_OP_MEMCPY_IMM] = { > + .audit_skip = 1, > + }, > }; > > /* requests with any of those set should undergo io_disarm_next() */ > @@ -5527,6 +5538,38 @@ static int io_sync_file_range(struct io_kiocb *req, unsigned int issue_flags) > return 0; > } > > +static int io_memcpy_imm_prep(struct io_kiocb *req, > + const struct io_uring_sqe *sqe) > +{ > + struct io_mem *mem = &req->mem; > + > + if (unlikely(sqe->ioprio || sqe->rw_flags || sqe->buf_index || > + sqe->splice_fd_in)) > + return -EINVAL; > + > + mem->value = READ_ONCE(sqe->off); > + mem->dest = u64_to_user_ptr(READ_ONCE(sqe->addr)); > + mem->len = READ_ONCE(sqe->len); > + if (!mem->len || mem->len > sizeof(u64)) > + return -EINVAL; > + I'd also check that the length is a power-of-two to avoid having to deal with weird sizes if we later find it inconvenient. > + return 0; > +} > + > +static int io_memcpy_imm(struct io_kiocb *req, unsigned int issue_flags) > +{ > + struct io_mem *mem = &req->mem; > + int ret = 0; > + > + if (copy_to_user(mem->dest, &mem->value, mem->len)) > + ret = -EFAULT; > + Is copy_to_user efficient for tiny sizes? Or is it better to use a switch and put_user()? 
I guess copy_to_user saves us from having to consider endianness. > + if (ret < 0) > + req_set_fail(req); > + io_req_complete(req, ret); > + return 0; > +} > + > #if defined(CONFIG_NET) > static bool io_net_retry(struct socket *sock, int flags) > { > @@ -7494,6 +7537,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > return io_getxattr_prep(req, sqe); > case IORING_OP_SOCKET: > return io_socket_prep(req, sqe); > + case IORING_OP_MEMCPY_IMM: > + return io_memcpy_imm_prep(req, sqe); > } > > printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n", > @@ -7815,6 +7860,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) > case IORING_OP_SOCKET: > ret = io_socket(req, issue_flags); > break; > + case IORING_OP_MEMCPY_IMM: > + ret = io_memcpy_imm(req, issue_flags); > + break; > default: > ret = -EINVAL; > break; > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h > index 5fb52bf32435..853f00a2bddd 100644 > --- a/include/uapi/linux/io_uring.h > +++ b/include/uapi/linux/io_uring.h > @@ -152,6 +152,7 @@ enum { > IORING_OP_FGETXATTR, > IORING_OP_GETXATTR, > IORING_OP_SOCKET, > + IORING_OP_MEMCPY_IMM, > > /* this goes last, obviously */ > IORING_OP_LAST, > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-23 16:30 ` Avi Kivity @ 2022-04-23 17:32 ` Jens Axboe 2022-04-23 18:02 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2022-04-23 17:32 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/23/22 10:30 AM, Avi Kivity wrote: > > On 22/04/2022 18.03, Jens Axboe wrote: >> On 4/22/22 8:50 AM, Jens Axboe wrote: >>> On 4/13/22 4:33 AM, Avi Kivity wrote: >>>> Unfortunately, only ideas, no patches. But at least the first seems very easy. >>>> >>>> >>>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op >>>> itself (1-8 bytes) to a user memory location specified by the op. >>>> >>>> >>>> Linked to another op, this can generate an in-memory notification >>>> useful for busy-waiters or the UMWAIT instruction >>>> >>>> This would be useful for Seastar, which looks at a timer-managed >>>> memory location to check when to break computation loops. >>> This one would indeed be trivial to do. If we limit the max size >>> supported to eg 8 bytes like suggested, then it could be in the sqe >>> itself and just copied to the user address specified. >>> >>> Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination >>> address, and sqe->off the data to copy. >>> >>> If you'll commit to testing this, I can hack it up pretty quickly... >> Something like this, totally untested. Maybe the return value should be >> bytes copied? > > > Yes. It could be less than what the user expected, unless we enforce > alignment (perhaps we should). Yes, this is just a quick hack. Did make some changes after that, notably just collapsing it into IORING_OP_MEMCPY and having a flag for immediate mode or not. >> +static int io_memcpy_imm_prep(struct io_kiocb *req, >> + const struct io_uring_sqe *sqe) >> +{ >> + struct io_mem *mem = &req->mem; >> + >> + if (unlikely(sqe->ioprio || sqe->rw_flags || sqe->buf_index || >> + sqe->splice_fd_in)) >> + return -EINVAL; >> + >> + mem->value = READ_ONCE(sqe->off); >> + mem->dest = u64_to_user_ptr(READ_ONCE(sqe->addr)); >> + mem->len = READ_ONCE(sqe->len); >> + if (!mem->len || mem->len > sizeof(u64)) >> + return -EINVAL; >> + > > > I'd also check that the length is a power-of-two to avoid having to > deal with weird sizes if we later find it inconvenient. Yes, that's a good idea. >> + return 0; >> +} >> + >> +static int io_memcpy_imm(struct io_kiocb *req, unsigned int issue_flags) >> +{ >> + struct io_mem *mem = &req->mem; >> + int ret = 0; >> + >> + if (copy_to_user(mem->dest, &mem->value, mem->len)) >> + ret = -EFAULT; >> + > > > Is copy_to_user efficient for tiny sizes? Or is it better to use a > switch and put_user()? > > > I guess copy_to_user saves us from having to consider endianness. I was considering that too, definitely something that should be investigated. Making it a 1/2/4/8 switch and using put_user() is probably a better idea. Easy enough to benchmark. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-23 17:32 ` Jens Axboe @ 2022-04-23 18:02 ` Jens Axboe 2022-04-23 18:11 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2022-04-23 18:02 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/23/22 11:32 AM, Jens Axboe wrote: >> I guess copy_to_user saves us from having to consider endianness. > > I was considering that too, definitely something that should be > investigated. Making it a 1/2/4/8 switch and using put_user() is > probably a better idea. Easy enough to benchmark. FWIW, this is the current version. Some quick benchmarking doesn't show any difference between copy_to_user and put_user, but that may depend on the arch as well (using aarch64). But we might as well use put user and combine it with the length check, so we explicitly only support 1/2/4/8 sizes. diff --git a/fs/io_uring.c b/fs/io_uring.c index 2052a796436c..3b94cb4b67ed 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -586,6 +586,14 @@ struct io_socket { unsigned long nofile; }; +struct io_mem { + struct file *file; + u64 value; + void __user *dest; + u32 len; + u32 flags; +}; + struct io_sync { struct file *file; loff_t len; @@ -962,6 +970,7 @@ struct io_kiocb { struct io_msg msg; struct io_xattr xattr; struct io_socket sock; + struct io_mem mem; }; u8 opcode; @@ -1231,16 +1240,19 @@ static const struct io_op_def io_op_defs[] = { .needs_file = 1, }, [IORING_OP_FSETXATTR] = { - .needs_file = 1 + .needs_file = 1, }, [IORING_OP_SETXATTR] = {}, [IORING_OP_FGETXATTR] = { - .needs_file = 1 + .needs_file = 1, }, [IORING_OP_GETXATTR] = {}, [IORING_OP_SOCKET] = { .audit_skip = 1, }, + [IORING_OP_MEMCPY] = { + .audit_skip = 1, + }, }; /* requests with any of those set should undergo io_disarm_next() */ @@ -5527,6 +5539,71 @@ static int io_sync_file_range(struct io_kiocb *req, unsigned int issue_flags) return 0; } +static int io_memcpy_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_mem *mem = &req->mem; + + if (unlikely(sqe->ioprio || sqe->buf_index || sqe->splice_fd_in)) + return -EINVAL; + + mem->value = READ_ONCE(sqe->off); + mem->dest = u64_to_user_ptr(READ_ONCE(sqe->addr)); + mem->len = READ_ONCE(sqe->len); + if (!mem->len || mem->len > sizeof(u64)) + return -EINVAL; + + mem->flags = READ_ONCE(sqe->memcpy_flags); + if (mem->flags & ~IORING_MEMCPY_IMM) + return -EINVAL; + + /* only supports immediate mode for now */ + if (!(mem->flags & IORING_MEMCPY_IMM)) + return -EINVAL; + + return 0; +} + +static int io_memcpy(struct io_kiocb *req) +{ + struct io_mem *mem = &req->mem; + int ret = mem->len; + + switch (mem->len) { + case 1: { + u8 val = mem->value; + if (put_user(val, (u8 *) mem->dest)) + ret = -EFAULT; + break; + } + case 2: { + u16 val = mem->value; + if (put_user(val, (u16 *) mem->dest)) + ret = -EFAULT; + break; + } + case 4: { + u32 val = mem->value; + if (put_user(val, (u32 *) mem->dest)) + ret = -EFAULT; + break; + } + case 8: { + u64 val = mem->value; + if (put_user(val, (u64 *) mem->dest)) + ret = -EFAULT; + break; + } + default: + ret = -EINVAL; + break; + } + + if (ret < 0) + req_set_fail(req); + io_req_complete(req, ret); + return 0; +} + #if defined(CONFIG_NET) static bool io_net_retry(struct socket *sock, int flags) { @@ -7494,6 +7571,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return io_getxattr_prep(req, sqe); case IORING_OP_SOCKET: return io_socket_prep(req, sqe); + case IORING_OP_MEMCPY: + return io_memcpy_prep(req, sqe); } printk_once(KERN_WARNING "io_uring: unhandled opcode 
%d\n", @@ -7815,6 +7894,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags) case IORING_OP_SOCKET: ret = io_socket(req, issue_flags); break; + case IORING_OP_MEMCPY: + ret = io_memcpy(req); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 5fb52bf32435..9e69d70a3b5b 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -46,6 +46,7 @@ struct io_uring_sqe { __u32 unlink_flags; __u32 hardlink_flags; __u32 xattr_flags; + __u32 memcpy_flags; }; __u64 user_data; /* data to be passed back at completion time */ /* pack this to avoid bogus arm OABI complaints */ @@ -152,6 +153,7 @@ enum { IORING_OP_FGETXATTR, IORING_OP_GETXATTR, IORING_OP_SOCKET, + IORING_OP_MEMCPY, /* this goes last, obviously */ IORING_OP_LAST, @@ -206,6 +208,14 @@ enum { #define IORING_ASYNC_CANCEL_FD (1U << 1) #define IORING_ASYNC_CANCEL_ANY (1U << 2) +/* + * IORING_OP_MEMCPY flags. + * + * IORING_MEMCPY_IMM Immediate copy. 'off' contains an immediate + * value. If not set, 'off' is a source address. + */ +#define IORING_MEMCPY_IMM (1U << 0) + /* * IO completion data structure (Completion Queue Entry) */ -- Jens Axboe ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-23 18:02 ` Jens Axboe @ 2022-04-23 18:11 ` Jens Axboe 0 siblings, 0 replies; 22+ messages in thread From: Jens Axboe @ 2022-04-23 18:11 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/23/22 12:02 PM, Jens Axboe wrote: > On 4/23/22 11:32 AM, Jens Axboe wrote: >>> I guess copy_to_user saves us from having to consider endianness. >> >> I was considering that too, definitely something that should be >> investigated. Making it a 1/2/4/8 switch and using put_user() is >> probably a better idea. Easy enough to benchmark. > > FWIW, this is the current version. Some quick benchmarking doesn't show > any difference between copy_to_user and put_user, but that may depend on > the arch as well (using aarch64). But we might as well use put user and > combine it with the length check, so we explicitly only support 1/2/4/8 > sizes. In terms of performance, on this laptop, I can do about 36-37M NOP requests per second. If I use IORING_OP_MEMCPY with immediate mode, it's around 15M ops/sec. This is regardless of the size, get about the same whether it's 1 or 8 byte memory writes. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-22 14:50 ` Jens Axboe 2022-04-22 15:03 ` Jens Axboe @ 2022-04-22 20:03 ` Walker, Benjamin 2022-04-23 10:19 ` Pavel Begunkov 2022-04-23 13:20 ` Jens Axboe 2022-04-23 16:23 ` Avi Kivity 2 siblings, 2 replies; 22+ messages in thread From: Walker, Benjamin @ 2022-04-22 20:03 UTC (permalink / raw) To: Jens Axboe, Avi Kivity, io-uring On 4/22/2022 7:50 AM, Jens Axboe wrote: > On 4/13/22 4:33 AM, Avi Kivity wrote: >> Unfortunately, only ideas, no patches. But at least the first seems very easy. >> >> >> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op >> itself (1-8 bytes) to a user memory location specified by the op. >> >> >> Linked to another op, this can generate an in-memory notification >> useful for busy-waiters or the UMWAIT instruction >> >> This would be useful for Seastar, which looks at a timer-managed >> memory location to check when to break computation loops. > > This one would indeed be trivial to do. If we limit the max size > supported to eg 8 bytes like suggested, then it could be in the sqe > itself and just copied to the user address specified. > > Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination > address, and sqe->off the data to copy. > > If you'll commit to testing this, I can hack it up pretty quickly... > >> - IORING_OP_MEMCPY - asynchronously copy memory >> >> >> Some CPUs include a DMA engine, and io_uring is a perfect interface to >> exercise it. It may be difficult to find space for two iovecs though. > > I've considered this one in the past too, and it is indeed an ideal fit > in terms of API. Outside of the DMA engines, it can also be used to > offload memcpy to a GPU, for example. > > The io_uring side would not be hard to wire up, basically just have the > sqe specfy source, destination, length. Add some well defined flags > depending on what the copy engine offers, for example. > > But probably some work required here in exposing an API and testing > etc... > I'm about to send a set of patches to associate an io_uring with a dmaengine channel to this list. I'm not necessarily thinking of using it to directly drive the DMA engine itself (although we could, and there's some nice advantages to that), but rather as a way to offload data copies/transforms on existing io_uring operations. My primary focus has been the copy between kernel and user space when receiving from a socket. Upcoming DMA engines also support SVA, allowing them to copy from kernel to user without page pinning. We've got patches for full SVA enabling in dmaengine prepared, such that each io_uring can allocate a PASID describing the user+kernel address space for the current context, allocate a channel via dmaengine and assign it the PASID, and then do DMA between kernel/user with new dmaengine APIs without any page pinning. As preparation, I have submitted a series to dmaengine that allows for polling and out-of-order completions. See https://lore.kernel.org/dmaengine/[email protected]/T/#u. This is a necessary first step. I'll get the patches out ASAP as an RFC. I'm sure my approach was entirely wrong, but you'll get the idea. Thanks, Ben ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-22 20:03 ` Walker, Benjamin @ 2022-04-23 10:19 ` Pavel Begunkov 2022-04-23 13:20 ` Jens Axboe 1 sibling, 0 replies; 22+ messages in thread From: Pavel Begunkov @ 2022-04-23 10:19 UTC (permalink / raw) To: Walker, Benjamin, Jens Axboe, Avi Kivity, io-uring On 4/22/22 21:03, Walker, Benjamin wrote: > On 4/22/2022 7:50 AM, Jens Axboe wrote: >> On 4/13/22 4:33 AM, Avi Kivity wrote: >>> Unfortunately, only ideas, no patches. But at least the first seems very easy. >>> >>> >>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op >>> itself (1-8 bytes) to a user memory location specified by the op. >>> >>> >>> Linked to another op, this can generate an in-memory notification >>> useful for busy-waiters or the UMWAIT instruction >>> >>> This would be useful for Seastar, which looks at a timer-managed >>> memory location to check when to break computation loops. >> >> This one would indeed be trivial to do. If we limit the max size >> supported to eg 8 bytes like suggested, then it could be in the sqe >> itself and just copied to the user address specified. >> >> Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination >> address, and sqe->off the data to copy. >> >> If you'll commit to testing this, I can hack it up pretty quickly... >> >>> - IORING_OP_MEMCPY - asynchronously copy memory >>> >>> >>> Some CPUs include a DMA engine, and io_uring is a perfect interface to >>> exercise it. It may be difficult to find space for two iovecs though. >> >> I've considered this one in the past too, and it is indeed an ideal fit >> in terms of API. Outside of the DMA engines, it can also be used to >> offload memcpy to a GPU, for example. >> >> The io_uring side would not be hard to wire up, basically just have the >> sqe specfy source, destination, length. Add some well defined flags >> depending on what the copy engine offers, for example. >> >> But probably some work required here in exposing an API and testing >> etc... >> > > I'm about to send a set of patches to associate an io_uring with a dmaengine channel to this list. I'm not necessarily thinking of using it to directly drive the DMA engine itself (although we could, and there's some nice advantages to that), but rather as a way to offload data copies/transforms on existing io_uring operations. My primary focus has been the copy between kernel and user space when receiving from a socket. > > Upcoming DMA engines also support SVA, allowing them to copy from kernel to user without page pinning. We've got patches for full SVA enabling in dmaengine prepared, such that each io_uring can allocate a PASID describing the user+kernel address space for the current context, allocate a channel via dmaengine and assign it the PASID, and then do DMA between kernel/user with new dmaengine APIs without any page pinning. > > As preparation, I have submitted a series to dmaengine that allows for polling and out-of-order completions. See https://lore.kernel.org/dmaengine/[email protected]/T/#u. This is a necessary first step. > > I'll get the patches out ASAP as an RFC. I'm sure my approach was entirely wrong, but you'll get the idea. Please do. Sounds interesting and it will likely be nicely aligned with our work/plans on p2p. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-22 20:03 ` Walker, Benjamin 2022-04-23 10:19 ` Pavel Begunkov @ 2022-04-23 13:20 ` Jens Axboe 1 sibling, 0 replies; 22+ messages in thread From: Jens Axboe @ 2022-04-23 13:20 UTC (permalink / raw) To: Walker, Benjamin, Avi Kivity, io-uring On 4/22/22 2:03 PM, Walker, Benjamin wrote: > On 4/22/2022 7:50 AM, Jens Axboe wrote: >> On 4/13/22 4:33 AM, Avi Kivity wrote: >>> Unfortunately, only ideas, no patches. But at least the first seems very easy. >>> >>> >>> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op >>> itself (1-8 bytes) to a user memory location specified by the op. >>> >>> >>> Linked to another op, this can generate an in-memory notification >>> useful for busy-waiters or the UMWAIT instruction >>> >>> This would be useful for Seastar, which looks at a timer-managed >>> memory location to check when to break computation loops. >> >> This one would indeed be trivial to do. If we limit the max size >> supported to eg 8 bytes like suggested, then it could be in the sqe >> itself and just copied to the user address specified. >> >> Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination >> address, and sqe->off the data to copy. >> >> If you'll commit to testing this, I can hack it up pretty quickly... >> >>> - IORING_OP_MEMCPY - asynchronously copy memory >>> >>> >>> Some CPUs include a DMA engine, and io_uring is a perfect interface to >>> exercise it. It may be difficult to find space for two iovecs though. >> >> I've considered this one in the past too, and it is indeed an ideal fit >> in terms of API. Outside of the DMA engines, it can also be used to >> offload memcpy to a GPU, for example. >> >> The io_uring side would not be hard to wire up, basically just have the >> sqe specfy source, destination, length. Add some well defined flags >> depending on what the copy engine offers, for example. >> >> But probably some work required here in exposing an API and testing >> etc... >> > > I'm about to send a set of patches to associate an io_uring with a > dmaengine channel to this list. I'm not necessarily thinking of using > it to directly drive the DMA engine itself (although we could, and > there's some nice advantages to that), but rather as a way to offload > data copies/transforms on existing io_uring operations. My primary > focus has been the copy between kernel and user space when receiving > from a socket. Interesting - I think both uses cases are actually valid, offloading a memcpy or using the engine to copy the data of an existing operation. > Upcoming DMA engines also support SVA, allowing them to copy from > kernel to user without page pinning. We've got patches for full SVA > enabling in dmaengine prepared, such that each io_uring can allocate a > PASID describing the user+kernel address space for the current > context, allocate a channel via dmaengine and assign it the PASID, and > then do DMA between kernel/user with new dmaengine APIs without any > page pinning. > > As preparation, I have submitted a series to dmaengine that allows for > polling and out-of-order completions. See > https://lore.kernel.org/dmaengine/[email protected]/T/#u. > This is a necessary first step. > > I'll get the patches out ASAP as an RFC. I'm sure my approach was > entirely wrong, but you'll get the idea. Please do, this sounds exciting! The whole point of an RFC is to get some feedback on initial design before it potentially goes too far down the wrong path. 
-- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-22 14:50 ` Jens Axboe 2022-04-22 15:03 ` Jens Axboe 2022-04-22 20:03 ` Walker, Benjamin @ 2022-04-23 16:23 ` Avi Kivity 2022-04-23 17:30 ` Jens Axboe 2 siblings, 1 reply; 22+ messages in thread From: Avi Kivity @ 2022-04-23 16:23 UTC (permalink / raw) To: Jens Axboe, io-uring On 22/04/2022 17.50, Jens Axboe wrote: > On 4/13/22 4:33 AM, Avi Kivity wrote: >> Unfortunately, only ideas, no patches. But at least the first seems very easy. >> >> >> - IORING_OP_MEMCPY_IMMEDIATE - copy some payload included in the op >> itself (1-8 bytes) to a user memory location specified by the op. >> >> >> Linked to another op, this can generate an in-memory notification >> useful for busy-waiters or the UMWAIT instruction >> >> This would be useful for Seastar, which looks at a timer-managed >> memory location to check when to break computation loops. > This one would indeed be trivial to do. If we limit the max size > supported to eg 8 bytes like suggested, then it could be in the sqe > itself and just copied to the user address specified. > > Eg have sqe->len be the length (1..8 bytes), sqe->addr the destination > address, and sqe->off the data to copy. > > If you'll commit to testing this, I can hack it up pretty quickly... I can certainly commit to test it in a VM (my workstation has a hate-hate relationship with custom kernels). >> - IORING_OP_MEMCPY - asynchronously copy memory >> >> >> Some CPUs include a DMA engine, and io_uring is a perfect interface to >> exercise it. It may be difficult to find space for two iovecs though. > I've considered this one in the past too, and it is indeed an ideal fit > in terms of API. Outside of the DMA engines, it can also be used to > offload memcpy to a GPU, for example. > > The io_uring side would not be hard to wire up, basically just have the > sqe specfy source, destination, length. Add some well defined flags > depending on what the copy engine offers, for example. > > But probably some work required here in exposing an API and testing > etc... Perhaps the interface should be kept separate from io_uring. e.g. use a pidfd to represent the address space, and then issue IORING_OP_PREADV/IORING_OP_PWRITEV to initiate dma. Then one can copy across process boundaries. A different angle is to use expose the dma device as a separate fd. This can be useful as dma engine can often do other operations, like xor or crc or encryption or compression. In any case I'd argue for the interface to be useful outside io_uring, although that considerably increases the scope. I also don't have a direct use case for it, though I'm sure others will. The kernel itself should find the DMA engine useful for things like memory compaction. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-23 16:23 ` Avi Kivity @ 2022-04-23 17:30 ` Jens Axboe 2022-04-24 13:04 ` Avi Kivity 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2022-04-23 17:30 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/23/22 10:23 AM, Avi Kivity wrote: > Perhaps the interface should be kept separate from io_uring. e.g. use > a pidfd to represent the address space, and then issue > IORING_OP_PREADV/IORING_OP_PWRITEV to initiate dma. Then one can copy > across process boundaries. Then you just made it a ton less efficient, particularly if you used the vectored read/write. For this to make sense, I think it has to be a separate op. At least that's the only implementation I'd be willing to entertain for the immediate copy. > A different angle is to use expose the dma device as a separate fd. > This can be useful as dma engine can often do other operations, like > xor or crc or encryption or compression. In any case I'd argue for the > interface to be useful outside io_uring, although that considerably > increases the scope. I also don't have a direct use case for it, > though I'm sure others will. I'd say that whoever does it get to at least dictate the initial implementation. For outside of io_uring, you're looking at a sync interface, which I think already exists for this (ioctls?). > The kernel itself should find the DMA engine useful for things like > memory compaction. That's a very different use case though and just deals with wiring it up internally. Let's try and keep the scope here reasonable, imho nothing good comes out of attempting to do all the things at once. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-23 17:30 ` Jens Axboe @ 2022-04-24 13:04 ` Avi Kivity 2022-04-24 13:30 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Avi Kivity @ 2022-04-24 13:04 UTC (permalink / raw) To: Jens Axboe, io-uring On 23/04/2022 20.30, Jens Axboe wrote: > On 4/23/22 10:23 AM, Avi Kivity wrote: >> Perhaps the interface should be kept separate from io_uring. e.g. use >> a pidfd to represent the address space, and then issue >> IORING_OP_PREADV/IORING_OP_PWRITEV to initiate dma. Then one can copy >> across process boundaries. > Then you just made it a ton less efficient, particularly if you used the > vectored read/write. For this to make sense, I think it has to be a > separate op. At least that's the only implementation I'd be willing to > entertain for the immediate copy. Sorry, I caused a lot of confusion by bundling immediate copy and a DMA engine interface. For sure the immediate copy should be a direct implementation like you posted! User-to-user copies are another matter. I feel like that should be a stand-alone driver, and that io_uring should be an io_uring-y way to access it. Just like io_uring isn't an NVMe driver. >> A different angle is to use expose the dma device as a separate fd. >> This can be useful as dma engine can often do other operations, like >> xor or crc or encryption or compression. In any case I'd argue for the >> interface to be useful outside io_uring, although that considerably >> increases the scope. I also don't have a direct use case for it, >> though I'm sure others will. > I'd say that whoever does it get to at least dictate the initial > implementation. Of course, but bikeshedding from the sidelines never hurt anyone. > For outside of io_uring, you're looking at a sync > interface, which I think already exists for this (ioctls?). Yes, it would be a asynchronous interface. I don't know if one exists, but I can't claim to have kept track. > >> The kernel itself should find the DMA engine useful for things like >> memory compaction. > That's a very different use case though and just deals with wiring it up > internally. > > Let's try and keep the scope here reasonable, imho nothing good comes > out of attempting to do all the things at once. > For sure, I'm just noting that the DMA engine has many different uses and so deserves an interface that is untied to io_uring. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-24 13:04 ` Avi Kivity @ 2022-04-24 13:30 ` Jens Axboe 2022-04-24 14:56 ` Avi Kivity 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2022-04-24 13:30 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/24/22 7:04 AM, Avi Kivity wrote: > > On 23/04/2022 20.30, Jens Axboe wrote: >> On 4/23/22 10:23 AM, Avi Kivity wrote: >>> Perhaps the interface should be kept separate from io_uring. e.g. use >>> a pidfd to represent the address space, and then issue >>> IORING_OP_PREADV/IORING_OP_PWRITEV to initiate dma. Then one can copy >>> across process boundaries. >> Then you just made it a ton less efficient, particularly if you used the >> vectored read/write. For this to make sense, I think it has to be a >> separate op. At least that's the only implementation I'd be willing to >> entertain for the immediate copy. > > > Sorry, I caused a lot of confusion by bundling immediate copy and a > DMA engine interface. For sure the immediate copy should be a direct > implementation like you posted! > > User-to-user copies are another matter. I feel like that should be a > stand-alone driver, and that io_uring should be an io_uring-y way to > access it. Just like io_uring isn't an NVMe driver. Not sure I understand your logic here or the io_uring vs nvme driver reference, to be honest. io_uring _is_ a standalone way to access it, you can use it sync or async through that. If you're talking about a standalone op vs being useful from a command itself, I do think both have merit and I can see good use cases for both. >> For outside of io_uring, you're looking at a sync >> interface, which I think already exists for this (ioctls?). > > > Yes, it would be a asynchronous interface. I don't know if one exists, > but I can't claim to have kept track. Again not following. So you're saying there should be a 2nd async interface for it? >>> The kernel itself should find the DMA engine useful for things like >>> memory compaction. >> That's a very different use case though and just deals with wiring it up >> internally. >> >> Let's try and keep the scope here reasonable, imho nothing good comes >> out of attempting to do all the things at once. >> > > For sure, I'm just noting that the DMA engine has many different uses > and so deserves an interface that is untied to io_uring. And again, not following, what's the point of having 2 interfaces for the same thing? I can sort of agree if one is just the basic ioctl kind of interface, a basic sync one. But outside of that I'm a bit puzzled as to why that would be useful at all. -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-24 13:30 ` Jens Axboe @ 2022-04-24 14:56 ` Avi Kivity 2022-04-25 0:45 ` Jens Axboe 0 siblings, 1 reply; 22+ messages in thread From: Avi Kivity @ 2022-04-24 14:56 UTC (permalink / raw) To: Jens Axboe, io-uring On 24/04/2022 16.30, Jens Axboe wrote: > On 4/24/22 7:04 AM, Avi Kivity wrote: >> On 23/04/2022 20.30, Jens Axboe wrote: >>> On 4/23/22 10:23 AM, Avi Kivity wrote: >>>> Perhaps the interface should be kept separate from io_uring. e.g. use >>>> a pidfd to represent the address space, and then issue >>>> IORING_OP_PREADV/IORING_OP_PWRITEV to initiate dma. Then one can copy >>>> across process boundaries. >>> Then you just made it a ton less efficient, particularly if you used the >>> vectored read/write. For this to make sense, I think it has to be a >>> separate op. At least that's the only implementation I'd be willing to >>> entertain for the immediate copy. >> >> Sorry, I caused a lot of confusion by bundling immediate copy and a >> DMA engine interface. For sure the immediate copy should be a direct >> implementation like you posted! >> >> User-to-user copies are another matter. I feel like that should be a >> stand-alone driver, and that io_uring should be an io_uring-y way to >> access it. Just like io_uring isn't an NVMe driver. > Not sure I understand your logic here or the io_uring vs nvme driver > reference, to be honest. io_uring _is_ a standalone way to access it, > you can use it sync or async through that. > > If you're talking about a standalone op vs being useful from a command > itself, I do think both have merit and I can see good use cases for > both. I'm saying that if dma is exposed to userspace, it should have a regular synchronous interface (maybe open("/dev/dma"), maybe something else). io_uring adds asynchrony to everything, but it's not everything's driver. Anyway maybe we drifted off somewhere and this should be decided by pragmatic concerns (like whatever the author of the driver prefers). > >>> For outside of io_uring, you're looking at a sync >>> interface, which I think already exists for this (ioctls?). >> >> Yes, it would be a asynchronous interface. I don't know if one exists, >> but I can't claim to have kept track. > Again not following. So you're saying there should be a 2nd async > interface for it? No. And I misspelled "synchronous" as "asynchronous" (I was agreeing with you that it would be a sync interface). > >>>> The kernel itself should find the DMA engine useful for things like >>>> memory compaction. >>> That's a very different use case though and just deals with wiring it up >>> internally. >>> >>> Let's try and keep the scope here reasonable, imho nothing good comes >>> out of attempting to do all the things at once. >>> >> For sure, I'm just noting that the DMA engine has many different uses >> and so deserves an interface that is untied to io_uring. > And again, not following, what's the point of having 2 interfaces for > the same thing? I can sort of agree if one is just the basic ioctl kind > of interface, a basic sync one. But outside of that I'm a bit puzzled as > to why that would be useful at all. > Yes I meant the basic sync one. Sorry I caused quite a lot of confusion here! ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-24 14:56 ` Avi Kivity @ 2022-04-25 0:45 ` Jens Axboe 2022-04-25 18:05 ` Walker, Benjamin 0 siblings, 1 reply; 22+ messages in thread From: Jens Axboe @ 2022-04-25 0:45 UTC (permalink / raw) To: Avi Kivity, io-uring On 4/24/22 8:56 AM, Avi Kivity wrote: > > On 24/04/2022 16.30, Jens Axboe wrote: >> On 4/24/22 7:04 AM, Avi Kivity wrote: >>> On 23/04/2022 20.30, Jens Axboe wrote: >>>> On 4/23/22 10:23 AM, Avi Kivity wrote: >>>>> Perhaps the interface should be kept separate from io_uring. e.g. use >>>>> a pidfd to represent the address space, and then issue >>>>> IORING_OP_PREADV/IORING_OP_PWRITEV to initiate dma. Then one can copy >>>>> across process boundaries. >>>> Then you just made it a ton less efficient, particularly if you used the >>>> vectored read/write. For this to make sense, I think it has to be a >>>> separate op. At least that's the only implementation I'd be willing to >>>> entertain for the immediate copy. >>> >>> Sorry, I caused a lot of confusion by bundling immediate copy and a >>> DMA engine interface. For sure the immediate copy should be a direct >>> implementation like you posted! >>> >>> User-to-user copies are another matter. I feel like that should be a >>> stand-alone driver, and that io_uring should be an io_uring-y way to >>> access it. Just like io_uring isn't an NVMe driver. >> Not sure I understand your logic here or the io_uring vs nvme driver >> reference, to be honest. io_uring _is_ a standalone way to access it, >> you can use it sync or async through that. >> >> If you're talking about a standalone op vs being useful from a command >> itself, I do think both have merit and I can see good use cases for >> both. > > > I'm saying that if dma is exposed to userspace, it should have a > regular synchronous interface (maybe open("/dev/dma"), maybe something > else). io_uring adds asynchrony to everything, but it's not > everything's driver. Sure, my point is that if/when someone wants to add that, they should be free to do so. It's not a fair requirement to put on someone doing the initial work on wiring this up. It may not be something they would want to use to begin with, and it's perfectly easy to run io_uring in sync mode should you wish to do so. The hard part is making the issue+complete separate actions, rolling a sync API on top of that would be trivial. > Anyway maybe we drifted off somewhere and this should be decided by > pragmatic concerns (like whatever the author of the driver prefers). Indeed! -- Jens Axboe ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: memory access op ideas 2022-04-25 0:45 ` Jens Axboe @ 2022-04-25 18:05 ` Walker, Benjamin 0 siblings, 0 replies; 22+ messages in thread From: Walker, Benjamin @ 2022-04-25 18:05 UTC (permalink / raw) To: Jens Axboe, Avi Kivity, io-uring On 4/24/2022 5:45 PM, Jens Axboe wrote: > On 4/24/22 8:56 AM, Avi Kivity wrote: >> >> On 24/04/2022 16.30, Jens Axboe wrote: >>> On 4/24/22 7:04 AM, Avi Kivity wrote: >>>> On 23/04/2022 20.30, Jens Axboe wrote: >>>>> On 4/23/22 10:23 AM, Avi Kivity wrote: >>>>>> Perhaps the interface should be kept separate from io_uring. e.g. use >>>>>> a pidfd to represent the address space, and then issue >>>>>> IORING_OP_PREADV/IORING_OP_PWRITEV to initiate dma. Then one can copy >>>>>> across process boundaries. >>>>> Then you just made it a ton less efficient, particularly if you used the >>>>> vectored read/write. For this to make sense, I think it has to be a >>>>> separate op. At least that's the only implementation I'd be willing to >>>>> entertain for the immediate copy. >>>> >>>> Sorry, I caused a lot of confusion by bundling immediate copy and a >>>> DMA engine interface. For sure the immediate copy should be a direct >>>> implementation like you posted! >>>> >>>> User-to-user copies are another matter. I feel like that should be a >>>> stand-alone driver, and that io_uring should be an io_uring-y way to >>>> access it. Just like io_uring isn't an NVMe driver. >>> Not sure I understand your logic here or the io_uring vs nvme driver >>> reference, to be honest. io_uring _is_ a standalone way to access it, >>> you can use it sync or async through that. >>> >>> If you're talking about a standalone op vs being useful from a command >>> itself, I do think both have merit and I can see good use cases for >>> both. I'm actually not so certain the model where io_uring has special operations for driving DMA engines works out. I think in all cases you can accomplish what you want by reading or writing to existing file constructs, and just having those transparently offload to a DMA engine if one is available on your behalf. As a concrete example, let's take an inter-process copy. The main challenges with this one are the security model (who's allowed to copy where?) and synchronization between the two applications (when did the data change?). Rather, I'd consider implementing the inter-process copy using an existing mechanism like a Unix domain socket. The sender maybe does a MSG_ZEROCOPY send via io_uring, and the receiver does an async recv, and the kernel can use a DMA engine to move the data directly between the two buffers if it has one avaiable. Then you get the existing security model and coordination, and software works whether there's a DMA engine available or not. It's a similar story for copying to memory on a PCI device. You'd need some security model to decide you're allowed to copy there, which is probably best expressed by opening a file that represents that BAR and then doing reads/writes to it. This is at least the direction I've been pursuing. The DMA engine channel is associated with the io_uring and the kernel just intelligently offloads whatever it can. >> >> >> I'm saying that if dma is exposed to userspace, it should have a >> regular synchronous interface (maybe open("/dev/dma"), maybe something >> else). io_uring adds asynchrony to everything, but it's not >> everything's driver. > > Sure, my point is that if/when someone wants to add that, they should be > free to do so. 
It's not a fair requirement to put on someone doing the > initial work on wiring this up. It may not be something they would want > to use to begin with, and it's perfectly easy to run io_uring in sync > mode should you wish to do so. The hard part is making the > issue+complete separate actions, rolling a sync API on top of that would > be trivial. Just FYI but the Intel idxd driver already has a userspace interface that's async/poll-mode. Commands are submitted to a mmap'd portal using the movdir64/enqcmd instructions directly. It does not expose an fd you can read/write to in order to trigger copies, so it is not compatible with io_uring, but it doesn't really need to be since it is already async. What isn't currently exposed to userspace is access to the "dmaengine" framework. Prior to the patchset I have pending that I linked earlier in the thread, the "dmaengine" framework couldn't really operate in async/poll mode and handle out-of-order processing, etc. But after that series maybe. > >> Anyway maybe we drifted off somewhere and this should be decided by >> pragmatic concerns (like whatever the author of the driver prefers). > > Indeed! > ^ permalink raw reply [flat|nested] 22+ messages in thread
Thread overview: 22+ messages (newest: 2022-04-25 18:05 UTC)
2022-04-13 10:33 memory access op ideas Avi Kivity
2022-04-22 12:52 ` Hao Xu
2022-04-22 13:24   ` Hao Xu
2022-04-22 13:38   ` Jens Axboe
2022-04-23 7:19      ` Hao Xu
2022-04-23 16:14   ` Avi Kivity
2022-04-22 14:50 ` Jens Axboe
2022-04-22 15:03   ` Jens Axboe
2022-04-23 16:30     ` Avi Kivity
2022-04-23 17:32       ` Jens Axboe
2022-04-23 18:02         ` Jens Axboe
2022-04-23 18:11           ` Jens Axboe
2022-04-22 20:03   ` Walker, Benjamin
2022-04-23 10:19     ` Pavel Begunkov
2022-04-23 13:20     ` Jens Axboe
2022-04-23 16:23   ` Avi Kivity
2022-04-23 17:30     ` Jens Axboe
2022-04-24 13:04       ` Avi Kivity
2022-04-24 13:30         ` Jens Axboe
2022-04-24 14:56           ` Avi Kivity
2022-04-25 0:45             ` Jens Axboe
2022-04-25 18:05              ` Walker, Benjamin