public inbox for [email protected]
* io_uring-only sendmsg + recvmsg zerocopy
@ 2020-11-10 21:31 Victor Stewart
  2020-11-10 23:23 ` Pavel Begunkov
  0 siblings, 1 reply; 6+ messages in thread
From: Victor Stewart @ 2020-11-10 21:31 UTC (permalink / raw)
  To: io-uring

here’s the design i’m flirting with for "recvmsg and sendmsg zerocopy"
with persistent buffers patch.

we'd be looking at approx +100% throughput each on the send and recv
paths (per TCP_ZEROCOPY_RECEIVE benchmarks).

these would be io_uring-only operations given the sendmsg completion
logic described below. want to get some consensus that this design
would be acceptable for merging before I begin writing the code.

the problem with zerocopy send is the asynchronous ACK from the NIC
confirming transmission. and you can’t just block on a syscall til
then. MSG_ZEROCOPY tackled this by putting the ACK on the
MSG_ERRQUEUE. but that logic is very disjointed, requires a double
completion (once from sendmsg when the send is enqueued, and again
once the NIC ACKs the transmission), and requires costly userspace
bookkeeping.

so what i propose instead is to exploit the asynchrony of io_uring.

you’d submit the IORING_OP_SENDMSG_ZEROCOPY operation, and then
sometime later receive the completion event on the ring’s completion
queue (either failure or success once ACK-ed by the NIC). 1 unified
completion flow.

we can somehow tag the socket as registered to io_uring, then when the
NIC ACKs, instead of finding the socket's error queue and putting the
completion there like MSG_ZEROCOPY, the kernel would find the io_uring
instance the socket is registered to and call into an io_uring
sendmsg_zerocopy_completion function. Then the cqe would get pushed
onto the completion queue.

the "recvmsg zerocopy" side is straightforward enough, mimicking
TCP_ZEROCOPY_RECEIVE. i'll go into specifics next time.

the other big concern is the lifecycle of the persistent memory
buffers in the case of nefarious actors. but since we already have
buffer registration for O_DIRECT, I assume those mechanics already
address those issues and can just be repurposed?

and so with those persistent memory buffers, you'd only pay the cost
of pinning the memory into the kernel once upon registration, before
you even start your server listening... thus "free". versus pinning
per sendmsg like with MSG_ZEROCOPY... thus "true zerocopy".

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: io_uring-only sendmsg + recvmsg zerocopy
  2020-11-10 21:31 io_uring-only sendmsg + recvmsg zerocopy Victor Stewart
@ 2020-11-10 23:23 ` Pavel Begunkov
       [not found]   ` <CAM1kxwjSyLb9ijs0=RZUA06E20qjwBnAZygwM3ckh10WozExag@mail.gmail.com>
  0 siblings, 1 reply; 6+ messages in thread
From: Pavel Begunkov @ 2020-11-10 23:23 UTC (permalink / raw)
  To: Victor Stewart, io-uring

On 10/11/2020 21:31, Victor Stewart wrote:
> here’s the design i’m flirting with for "recvmsg and sendmsg zerocopy"
> with persistent buffers patch.

Ok, first we need to make it work with registered buffers. I had patches
for that but need to rebase+refresh them; I'll send them out this week.

Zerocopy would still go through some pinning,
e.g. skb_zerocopy_iter_*() -> iov_iter_get_pages()
		-> get_page() -> atomic_inc()
but it's lighter for bvec and can be optimised later if needed.

And that leaves hooking up into struct ubuf_info with callbacks
for zerocopy. 

> 
> we'd be looking at approx +100% throughput each on the send and recv
> paths (per TCP_ZEROCOPY_RECEIVE benchmarks).
>
> these would be io_uring-only operations given the sendmsg completion
> logic described below. want to get some consensus that this design
> would be acceptable for merging before I begin writing the code.
> 
> the problem with zerocopy send is the asynchronous ACK from the NIC
> confirming transmission. and you can’t just block on a syscall til
> then. MSG_ZEROCOPY tackled this by putting the ACK on the
> MSG_ERRQUEUE. but that logic is very disjointed and requires a double
> completion (once from sendmsg once the send is enqueued, and again
> once the NIC ACKs the transmission), and requires costly userspace
> bookkeeping.
> 
> so what i propose instead is to exploit the asynchrony of io_uring.
> 
> you’d submit the IORING_OP_SENDMSG_ZEROCOPY operation, and then
> sometime later receive the completion event on the ring’s completion
> queue (either failure or success once ACK-ed by the NIC). 1 unified
> completion flow.

I thought about it after your other email. It makes sense for
message-oriented protocols but may not for streams. That's because a
user may want to call

send();
send();

And expect right ordering, and that's where waiting for the ACK may add
a lot of latency, so returning from the call here is a notification that
"it's accounted, you may send more and order will be preserved".

And since ACKs may come long after, you may put a lot of code
between send()s and still suffer latency (and so potentially a
throughput drop).

As for me, as an optional feature it sounds sensible, and should work
well for some use cases. But for others it may be good to have 2
notifications (1. ready for the next send(), 2. ready to recycle the
buf), e.g. 2 CQEs; that wouldn't work without a bit of io_uring patching.

> 
> we can somehow tag the socket as registered to io_uring, then when the

I'd rather tag a request

> NIC ACKs, instead of finding the socket's error queue and putting the
> completion there like MSG_ZEROCOPY, the kernel would find the io_uring
> instance the socket is registered to and call into an io_uring
> sendmsg_zerocopy_completion function. Then the cqe would get pushed
> onto the completion queue.
>
> the "recvmsg zerocopy" side is straightforward enough, mimicking
> TCP_ZEROCOPY_RECEIVE. i'll go into specifics next time.

Receive side is inherently messed up. IIRC, TCP_ZEROCOPY_RECEIVE just
maps skbuffs into userspace, and in general unless there is a better
suited protocol (e.g. infiniband with richer src/dst tagging) or a very
very smart NIC, "true zerocopy" is not possible without breaking
multiplexing.

For registered buffers you still need to copy skbuff, at least because
of security implications.

> 
> the other big concern is the lifecycle of the persistent memory
> buffers in the case of nefarious actors. but since we already have
> buffer registration for O_DIRECT, I assume those mechanics already

just buffer registration, not specifically for O_DIRECT

> address those issues and can just be repurposed?

Depending on how long it could be stuck in the net stack, we might need
to be able to cancel those requests. That may be a problem.

> 
> and so with those persistent memory buffers, you'd only pay the cost
> of pinning the memory into the kernel once upon registration, before
> you even start your server listening... thus "free". versus pinning
> per sendmsg like with MSG_ZEROCOPY... thus "true zerocopy".
> 

-- 
Pavel Begunkov


* Fwd: io_uring-only sendmsg + recvmsg zerocopy
       [not found]   ` <CAM1kxwjSyLb9ijs0=RZUA06E20qjwBnAZygwM3ckh10WozExag@mail.gmail.com>
@ 2020-11-11  0:25     ` Victor Stewart
  2020-11-11  0:57     ` Pavel Begunkov
  1 sibling, 0 replies; 6+ messages in thread
From: Victor Stewart @ 2020-11-11  0:25 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

forgot to reply-all on this lol.

also FYI

https://www.spinics.net/lists/netdev/msg698969.html

On Tue, Nov 10, 2020 at 11:26 PM Pavel Begunkov <[email protected]> wrote:
>
> On 10/11/2020 21:31, Victor Stewart wrote:
> > here’s the design i’m flirting with for "recvmsg and sendmsg zerocopy"
> > with persistent buffers patch.
>
> Ok, first we need to make it work with registered buffers. I had patches
> for that but need to rebase+refresh them; I'll send them out this week.
>
> Zerocopy would still go through some pinning,
> e.g. skb_zerocopy_iter_*() -> iov_iter_get_pages()
>                 -> get_page() -> atomic_inc()
> but it's lighter for bvec and can be optimised later if needed.
>
> And that leaves hooking up into struct ubuf_info with callbacks
> for zerocopy.

okay!

>
> >
> > we'd be looking at approx +100% throughput each on the send and recv
> > paths (per TCP_ZEROCOPY_RECEIVE benchmarks).
> >
> > these would be io_uring-only operations given the sendmsg completion
> > logic described below. want to get some consensus that this design
> > would be acceptable for merging before I begin writing the code.
> >
> > the problem with zerocopy send is the asynchronous ACK from the NIC
> > confirming transmission. and you can’t just block on a syscall til
> > then. MSG_ZEROCOPY tackled this by putting the ACK on the
> > MSG_ERRQUEUE. but that logic is very disjointed and requires a double
> > completion (once from sendmsg once the send is enqueued, and again
> > once the NIC ACKs the transmission), and requires costly userspace
> > bookkeeping.
> >
> > so what i propose instead is to exploit the asynchrony of io_uring.
> >
> > you’d submit the IORING_OP_SENDMSG_ZEROCOPY operation, and then
> > sometime later receive the completion event on the ring’s completion
> > queue (either failure or success once ACK-ed by the NIC). 1 unified
> > completion flow.
>
> I thought about it after your other email. It makes sense for
> message-oriented protocols but may not for streams. That's because a
> user may want to call
>
> send();
> send();
>
> And expect right ordering, and that's where waiting for the ACK may add
> a lot of latency, so returning from the call here is a notification that
> "it's accounted, you may send more and order will be preserved".
>
> And since ACKs may come long after, you may put a lot of code
> between send()s and still suffer latency (and so potentially a
> throughput drop).
>
> As for me, as an optional feature it sounds sensible, and should work
> well for some use cases. But for others it may be good to have 2
> notifications (1. ready for the next send(), 2. ready to recycle the
> buf), e.g. 2 CQEs; that wouldn't work without a bit of io_uring patching.
>

we could make it datagram-only, like check the socket was created with
SOCK_DGRAM and fail otherwise... if it requires too many io_uring
changes / possible regressions to accommodate a 2 CQE mode.

> >
> > we can somehow tag the socket as registered to io_uring, then when the
>
> I'd rather tag a request

as long as the NIC is able to find / call back the ring about the
transmission ACK, whatever the path of least resistance is will be best.

>
> > NIC ACKs, instead of finding the socket's error queue and putting the
> > completion there like MSG_ZEROCOPY, the kernel would find the io_uring
> > instance the socket is registered to and call into an io_uring
> > sendmsg_zerocopy_completion function. Then the cqe would get pushed
> > onto the completion queue.
> >
> > the "recvmsg zerocopy" side is straightforward enough, mimicking
> > TCP_ZEROCOPY_RECEIVE. i'll go into specifics next time.
>
> Receive side is inherently messed up. IIRC, TCP_ZEROCOPY_RECEIVE just
> maps skbuffs into userspace, and in general unless there is a better
> suited protocol (e.g. infiniband with richer src/dst tagging) or a very
> very smart NIC, "true zerocopy" is not possible without breaking
> multiplexing.
>
> For registered buffers you still need to copy skbuff, at least because
> of security implications.

we can actually just force those buffers to be mmap-ed, and then when
packets arrive use vm_insert_page or remap_pfn_range to change the
physical pages backing the virtual memory pages submitted for reading
via msg_iov. so it's transparent to userspace but still zerocopy.
(might require the user to notify io_uring when reading is
completed... but no matter).


>
> >
> > the other big concern is the lifecycle of the persistent memory
> > buffers in the case of nefarious actors. but since we already have
> > buffer registration for O_DIRECT, I assume those mechanics already
>
> just buffer registration, not specifically for O_DIRECT
>
> > address those issues and can just be repurposed?
>
> Depending on how long it could be stuck in the net stack, we might need
> to be able to cancel those requests. That may be a problem.

I spoke about this idea with Willem the other day and he mentioned...

"As long as the mappings aren't unwound on process exit. But then you
open up to malicious applications that purposely register ranges and
then exit. The basics are straightforward to implement, but it's not
that easy to arrive at something robust."

>
> >
> > and so with those persistent memory buffers, you'd only pay the cost
> > of pinning the memory into the kernel once upon registration, before
> > you even start your server listening... thus "free". versus pinning
> > per sendmsg like with MSG_ZEROCOPY... thus "true zerocopy".
> >
>
> --
> Pavel Begunkov


* Re: io_uring-only sendmsg + recvmsg zerocopy
       [not found]   ` <CAM1kxwjSyLb9ijs0=RZUA06E20qjwBnAZygwM3ckh10WozExag@mail.gmail.com>
  2020-11-11  0:25     ` Fwd: " Victor Stewart
@ 2020-11-11  0:57     ` Pavel Begunkov
  2020-11-11 16:49       ` Victor Stewart
  1 sibling, 1 reply; 6+ messages in thread
From: Pavel Begunkov @ 2020-11-11  0:57 UTC (permalink / raw)
  To: Victor Stewart, io-uring

On 11/11/2020 00:07, Victor Stewart wrote:
> On Tue, Nov 10, 2020 at 11:26 PM Pavel Begunkov <[email protected]> wrote:
>>> we'd be looking at approx +100% throughput each on the send and recv
>>> paths (per TCP_ZEROCOPY_RECEIVE benchmarks).
>>>
>>> these would be io_uring-only operations given the sendmsg completion
>>> logic described below. want to get some consensus that this design
>>> would be acceptable for merging before I begin writing the code.
>>>
>>> the problem with zerocopy send is the asynchronous ACK from the NIC
>>> confirming transmission. and you can’t just block on a syscall til
>>> then. MSG_ZEROCOPY tackled this by putting the ACK on the
>>> MSG_ERRQUEUE. but that logic is very disjointed and requires a double
>>> completion (once from sendmsg once the send is enqueued, and again
>>> once the NIC ACKs the transmission), and requires costly userspace
>>> bookkeeping.
>>>
>>> so what i propose instead is to exploit the asynchrony of io_uring.
>>>
>>> you’d submit the IORING_OP_SENDMSG_ZEROCOPY operation, and then
>>> sometime later receive the completion event on the ring’s completion
>>> queue (either failure or success once ACK-ed by the NIC). 1 unified
>>> completion flow.
>>
>> I thought about it after your other email. It makes sense for
>> message-oriented protocols but may not for streams. That's because a
>> user may want to call
>>
>> send();
>> send();
>>
>> And expect right ordering, and that's where waiting for the ACK may add
>> a lot of latency, so returning from the call here is a notification that
>> "it's accounted, you may send more and order will be preserved".
>>
>> And since ACKs may come long after, you may put a lot of code
>> between send()s and still suffer latency (and so potentially a
>> throughput drop).
>>
>> As for me, as an optional feature it sounds sensible, and should work
>> well for some use cases. But for others it may be good to have 2
>> notifications (1. ready for the next send(), 2. ready to recycle the
>> buf), e.g. 2 CQEs; that wouldn't work without a bit of io_uring patching.
>>
> 
> we could make it datagram only, like check the socket was created with

no need, streams can also benefit from it.

> SOCK_DGRAM and fail otherwise... if it requires too much io_uring
> changes / possible regressions to accommodate a 2 CQE mode.

May be easier to do via two requests with the second receiving
errors (yeah, msg_control again).

>>> we can somehow tag the socket as registered to io_uring, then when the
>>
>> I'd rather tag a request
> 
> as long as the NIC is able to find / callback the ring about
> transmission ACK, whatever the path of least resistance is is best.
> 
>>
>>> NIC ACKs, instead of finding the socket's error queue and putting the
>>> completion there like MSG_ZEROCOPY, the kernel would find the io_uring
>>> instance the socket is registered to and call into an io_uring
>>> sendmsg_zerocopy_completion function. Then the cqe would get pushed
>>> onto the completion queue.
>>>
>>> the "recvmsg zerocopy" side is straightforward enough, mimicking
>>> TCP_ZEROCOPY_RECEIVE. i'll go into specifics next time.
>>
>> Receive side is inherently messed up. IIRC, TCP_ZEROCOPY_RECEIVE just
>> maps skbuffs into userspace, and in general unless there is a better
>> suited protocol (e.g. infiniband with richer src/dst tagging) or a very
>> very smart NIC, "true zerocopy" is not possible without breaking
>> multiplexing.
>>
>> For registered buffers you still need to copy skbuff, at least because
>> of security implications.
> 
> we can actually just force those buffers to be mmap-ed, and then when
> packets arrive use vm_insert_page or remap_pfn_range to change the
> physical pages backing the virtual memory pages submitted for reading
> via msg_iov. so it's transparent to userspace but still zerocopy.
> (might require the user to notify io_uring when reading is
> completed... but no matter).

Yes, with io_uring zerocopy-recv may be done better than
TCP_ZEROCOPY_RECEIVE but
1) it's still a remap. Yes, zerocopy, but not ideal
2) won't work with registered buffers, which is basically a set
of pinned pages that have a userspace mapping. After such remap
that mapping wouldn't be in sync and that gets messy.

>>> the other big concern is the lifecycle of the persistent memory
>>> buffers in the case of nefarious actors. but since we already have
>>> buffer registration for O_DIRECT, I assume those mechanics already
>>
>> just buffer registration, not specifically for O_DIRECT
>>
>>> address those issues and can just be repurposed?
>>
>> Depending on how long it could be stuck in the net stack, we might need
>> to be able to cancel those requests. That may be a problem.
> 
> I spoke about this idea with Willem the other day and he mentioned...
> 
> "As long as the mappings aren't unwound on process exit. But then you

The pages won't be unpinned until all related requests are gone, but
for that, on exit, io_uring waits for them to complete. That's one of the
reasons why requests should be either cancellable or short-lived and
somewhat predictably time-bound.

> open up to malicious applications that purposely register ranges and
> then exit. The basics are straightforward to implement, but it's not
> that easy to arrive at something robust."
> 
>>
>>>
>>> and so with those persistent memory buffers, you'd only pay the cost
>>> of pinning the memory into the kernel once upon registration, before
>>> you even start your server listening... thus "free". versus pinning
>>> per sendmsg like with MSG_ZEROCOPY... thus "true zerocopy".
-- 
Pavel Begunkov


* Re: io_uring-only sendmsg + recvmsg zerocopy
  2020-11-11  0:57     ` Pavel Begunkov
@ 2020-11-11 16:49       ` Victor Stewart
  2020-11-11 18:50         ` Pavel Begunkov
  0 siblings, 1 reply; 6+ messages in thread
From: Victor Stewart @ 2020-11-11 16:49 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring

On Wed, Nov 11, 2020 at 1:00 AM Pavel Begunkov <[email protected]> wrote:
>
> On 11/11/2020 00:07, Victor Stewart wrote:
> > On Tue, Nov 10, 2020 at 11:26 PM Pavel Begunkov <[email protected]> wrote:
> >>> we'd be looking at approx +100% throughput each on the send and recv
> >>> paths (per TCP_ZEROCOPY_RECEIVE benchmarks).
> >>>
> >>> these would be io_uring-only operations given the sendmsg completion
> >>> logic described below. want to get some consensus that this design
> >>> would be acceptable for merging before I begin writing the code.
> >>>
> >>> the problem with zerocopy send is the asynchronous ACK from the NIC
> >>> confirming transmission. and you can’t just block on a syscall til
> >>> then. MSG_ZEROCOPY tackled this by putting the ACK on the
> >>> MSG_ERRQUEUE. but that logic is very disjointed and requires a double
> >>> completion (once from sendmsg once the send is enqueued, and again
> >>> once the NIC ACKs the transmission), and requires costly userspace
> >>> bookkeeping.
> >>>
> >>> so what i propose instead is to exploit the asynchrony of io_uring.
> >>>
> >>> you’d submit the IORING_OP_SENDMSG_ZEROCOPY operation, and then
> >>> sometime later receive the completion event on the ring’s completion
> >>> queue (either failure or success once ACK-ed by the NIC). 1 unified
> >>> completion flow.
> >>
> >> I thought about it after your other email. It makes sense for
> >> message-oriented protocols but may not for streams. That's because a
> >> user may want to call
> >>
> >> send();
> >> send();
> >>
> >> And expect right ordering, and that's where waiting for the ACK may add
> >> a lot of latency, so returning from the call here is a notification that
> >> "it's accounted, you may send more and order will be preserved".
> >>
> >> And since ACKs may come long after, you may put a lot of code
> >> between send()s and still suffer latency (and so potentially a
> >> throughput drop).
> >>
> >> As for me, as an optional feature it sounds sensible, and should work
> >> well for some use cases. But for others it may be good to have 2
> >> notifications (1. ready for the next send(), 2. ready to recycle the
> >> buf), e.g. 2 CQEs; that wouldn't work without a bit of io_uring patching.
> >>
> >
> > we could make it datagram only, like check the socket was created with
>
> no need, streams can also benefit from it.
>
> > SOCK_DGRAM and fail otherwise... if it requires too much io_uring
> > changes / possible regressions to accommodate a 2 CQE mode.
>
> May be easier to do via two requests with the second receiving
> errors (yeah, msg_control again).
>
> >>> we can somehow tag the socket as registered to io_uring, then when the
> >>
> >> I'd rather tag a request
> >
> > as long as the NIC is able to find / callback the ring about
> > transmission ACK, whatever the path of least resistance is is best.
> >
> >>
> >>> NIC ACKs, instead of finding the socket's error queue and putting the
> >>> completion there like MSG_ZEROCOPY, the kernel would find the io_uring
> >>> instance the socket is registered to and call into an io_uring
> >>> sendmsg_zerocopy_completion function. Then the cqe would get pushed
> >>> onto the completion queue.
> >>>
> >>> the "recvmsg zerocopy" side is straightforward enough, mimicking
> >>> TCP_ZEROCOPY_RECEIVE. i'll go into specifics next time.
> >>
> >> Receive side is inherently messed up. IIRC, TCP_ZEROCOPY_RECEIVE just
> >> maps skbuffs into userspace, and in general unless there is a better
> >> suited protocol (e.g. infiniband with richer src/dst tagging) or a very
> >> very smart NIC, "true zerocopy" is not possible without breaking
> >> multiplexing.
> >>
> >> For registered buffers you still need to copy skbuff, at least because
> >> of security implications.
> >
> > we can actually just force those buffers to be mmap-ed, and then when
> > packets arrive use vm_insert_page or remap_pfn_range to change the
> > physical pages backing the virtual memory pages submitted for reading
> > via msg_iov. so it's transparent to userspace but still zerocopy.
> > (might require the user to notify io_uring when reading is
> > completed... but no matter).
>
> Yes, with io_uring zerocopy-recv may be done better than
> TCP_ZEROCOPY_RECEIVE but
> 1) it's still a remap. Yes, zerocopy, but not ideal
> 2) won't work with registered buffers, which is basically a set
> of pinned pages that have a userspace mapping. After such remap
> that mapping wouldn't be in sync and that gets messy.

well, unless we can eliminate all copies, there isn’t any point,
because then it isn’t zerocopy.

so in my server, i have a ceiling on the number of clients,
preallocate them, and mmap anonymous noreserve read + write buffers
for each.

so say, 150,000 clients x (2MB * 2), which is 585GB. way more than the
physical memory of my machine. (and i have 10 instances of it per
machine, so ~6TB lol). but at any one time probably 0.01% of that
memory is in use. and i just MADV_COLD the pages after consumption.

this provides a persistent “vmem contiguous” stream buffer per client,
which has a litany of benefits. but if we persistently pin pages, this
ceases to work, because pinned pages require persistent physical memory
backing them.

But on the send side, if you don’t pin persistently, you’d have to pin
on demand, which costs more than it’s worth for sends smaller than ~10KB.
And I guess there’s no way to avoid pinning and maintain kernel
integrity. Maybe we could erase those userspace -> physical page
mappings, then recreate them once the operation completes, but 1) that
would require page-aligned sends so that you could keep writing and
sending while you waited for completions and 2) beyond being
nonstandard and possibly unsafe, who says that would even cost less
than pinning? it definitely costs something, and might cost more because
you’d have to take page table locks.

So essentially on the send side the only way to zerocopy for free is
to persistently pin (and give up my per client stream buffers).

On the receive side, actually the only way to realistically do zerocopy
is to somehow pin a NIC RX queue to a process, and then persistently
map the queue into the process’s memory as read only. That’s a
security absurdity in the general case, but it could be root-only
usage. Then you’d recvmsg with a NULL msg_iov[0].iov_base, and have
the packet buffer location and length written in. Might require driver
buy-in, so might be impractical, but unsure.

Otherwise the only option is the even worse nightmare of how
TCP_ZEROCOPY_RECEIVE works, which is ridiculously impractical for
general purpose use…

“Mapping of memory into a process's address space is done on a
per-page granularity; there is no way to map a fraction of a page. So
inbound network data must be both page-aligned and page-sized when it
ends up in the receive buffer, or it will not be possible to map it
into user space. Alignment can be a bit tricky because the packets
coming out of the interface start with the protocol headers, not the
data the receiving process is interested in. It is the data that must
be aligned, not the headers. Achieving this alignment is possible, but
it requires cooperation from the network interface

It is also necessary to ensure that the data arrives in chunks that
are a multiple of the system's page size, or partial pages of data
will result. That can be done by setting the maximum transfer unit
(MTU) size properly on the interface. That, in turn, can require
knowledge of exactly what the incoming packets will look like; in a
test program posted with the patch set, Dumazet sets the MTU to
61,512. That turns out to be space for fifteen 4096-byte pages of
data, plus 40 bytes for the IPv6 header and 32 bytes for the TCP
header.”

https://lwn.net/Articles/752188/

Either receive case also makes zerocopy with my persistent per-client
stream buffers impossible lol.

in short, zerocopy sendmsg with persistently pinned buffers is
definitely possible and we should do that. (I'll just make it work on
my end).

recvmsg i'll have to do more research into the practicality of what I
proposed above.

>
> >>> the other big concern is the lifecycle of the persistent memory
> >>> buffers in the case of nefarious actors. but since we already have
> >>> buffer registration for O_DIRECT, I assume those mechanics already
> >>
> >> just buffer registration, not specifically for O_DIRECT
> >>
> >>> address those issues and can just be repurposed?
> >>
> >> Depending on how long it could be stuck in the net stack, we might need
> >> to be able to cancel those requests. That may be a problem.
> >
> > I spoke about this idea with Willem the other day and he mentioned...
> >
> > "As long as the mappings aren't unwound on process exit. But then you
>
> The pages won't be unpinned until all related requests are gone, but
> for that, on exit, io_uring waits for them to complete. That's one of the
> reasons why requests should be either cancellable or short-lived and
> somewhat predictably time-bound.
>
> > open up to malicious applications that purposely register ranges and
> > then exit. The basics are straightforward to implement, but it's not
> > that easy to arrive at something robust."
> >
> >>
> >>>
> >>> and so with those persistent memory buffers, you'd only pay the cost
> >>> of pinning the memory into the kernel once upon registration, before
> >>> you even start your server listening... thus "free". versus pinning
> >>> per sendmsg like with MSG_ZEROCOPY... thus "true zerocopy".
> --
> Pavel Begunkov


* Re: io_uring-only sendmsg + recvmsg zerocopy
  2020-11-11 16:49       ` Victor Stewart
@ 2020-11-11 18:50         ` Pavel Begunkov
  0 siblings, 0 replies; 6+ messages in thread
From: Pavel Begunkov @ 2020-11-11 18:50 UTC (permalink / raw)
  To: Victor Stewart; +Cc: io-uring

On 11/11/2020 16:49, Victor Stewart wrote:
> On Wed, Nov 11, 2020 at 1:00 AM Pavel Begunkov <[email protected]> wrote:
>> On 11/11/2020 00:07, Victor Stewart wrote:
>>> On Tue, Nov 10, 2020 at 11:26 PM Pavel Begunkov <[email protected]> wrote:
>>>>> NIC ACKs, instead of finding the socket's error queue and putting the
>>>>> completion there like MSG_ZEROCOPY, the kernel would find the io_uring
>>>>> instance the socket is registered to and call into an io_uring
>>>>> sendmsg_zerocopy_completion function. Then the cqe would get pushed
>>>>> onto the completion queue.
>>>>>
>>>>> the "recvmsg zerocopy" side is straightforward enough, mimicking
>>>>> TCP_ZEROCOPY_RECEIVE. i'll go into specifics next time.
>>>>
>>>> Receive side is inherently messed up. IIRC, TCP_ZEROCOPY_RECEIVE just
>>>> maps skbuffs into userspace, and in general unless there is a better
>>>> suited protocol (e.g. infiniband with richer src/dst tagging) or a very
>>>> very smart NIC, "true zerocopy" is not possible without breaking
>>>> multiplexing.
>>>>
>>>> For registered buffers you still need to copy skbuff, at least because
>>>> of security implications.
>>>
>>> we can actually just force those buffers to be mmap-ed, and then when
>>> packets arrive use vm_insert_page or remap_pfn_range to change the
>>> physical pages backing the virtual memory pages submitted for reading
>>> via msg_iov. so it's transparent to userspace but still zerocopy.
>>> (might require the user to notify io_uring when reading is
>>> completed... but no matter).
>>
>> Yes, with io_uring zerocopy-recv may be done better than
>> TCP_ZEROCOPY_RECEIVE but
>> 1) it's still a remap. Yes, zerocopy, but not ideal
>> 2) won't work with registered buffers, which is basically a set
>> of pinned pages that have a userspace mapping. After such remap
>> that mapping wouldn't be in sync and that gets messy.
> 
> well unless we can alleviate all copies, then there isn’t any point
> because it isn’t zerocopy.
> 
> so in my server, i have a ceiling on the number of clients,
> preallocate them, and mmap anonymous noreserve read + write buffers
> for each.
> 
> so say, 150,000 clients x (2MB * 2). which is 585GB. way more than the
> physical memory of my machine. (and have 10 instance of it per
> machine, so ~6TB lol). but at any one time probably 0.01% of that
> memory is in usage. and i just MADV_COLD the pages after consumption.
> 
> this provides a persistent “vmem contiguous” stream buffer per client.
> which has a litany of benefits. but if we persistently pin pages, this
> ceases to work, because pin pages require persistent physical memory
> backing pages.
> 
> But on the send side, if you don’t pin persistently, you’d have to pin
> on demand, which costs more than it’s worth for sends less than ~10KB.

having it non-contiguous and doing round-robin would IMHO be a better shot

> And I guess there’s no way to avoid pinning and maintain kernel
> integrity. Maybe we could erase those userspace -> physical page
> mappings, then recreate them once the operation completes, but 1) that
> would require page-aligned sends so that you could keep writing and
> sending while you waited for completions and 2) beyond being
> nonstandard and possibly unsafe, who says that would even cost less
> than pinning? it definitely costs something, and might cost more
> because you’d have to take page table locks.
> 
> So essentially on the send side the only way to zerocopy for free is
> to persistently pin (and give up my per client stream buffers).
> 
> On the receive side actually the only way to realistically do zerocopy
> is to somehow pin a NIC RX queue to a process, and then persistently
> map the queue into the process’s memory as read only. That’s a
> security absurdity in the general case, but it could be root-only
> usage. Then you’d recvmsg with a NULL msg_iov[0].iov_base, and have
> the packet buffer location and length written in. Might require driver
> buy-in, so might be impractical, but unsure.

https://blogs.oracle.com/linux/zero-copy-networking-in-uek6
scroll to AF_XDP

> 
> Otherwise the only option is the even worse nightmare of how
> TCP_ZEROCOPY_RECEIVE works, which is ridiculously impractical for
> general-purpose use…

Well, that's not so bad, an API with io_uring might be much better, but
it would still require an unmap. However, depending on the use case,
the overhead for small packets and/or an mm shared between many threads
can potentially be a deal breaker.

> “Mapping of memory into a process's address space is done on a
> per-page granularity; there is no way to map a fraction of a page. So
> inbound network data must be both page-aligned and page-sized when it
> ends up in the receive buffer, or it will not be possible to map it
> into user space. Alignment can be a bit tricky because the packets
> coming out of the interface start with the protocol headers, not the
> data the receiving process is interested in. It is the data that must
> be aligned, not the headers. Achieving this alignment is possible, but
> it requires cooperation from the network interface

should support scatter-gather in other words

> 
> It is also necessary to ensure that the data arrives in chunks that
> are a multiple of the system's page size, or partial pages of data
> will result. That can be done by setting the maximum transfer unit
> (MTU) size properly on the interface. That, in turn, can require
> knowledge of exactly what the incoming packets will look like; in a
> test program posted with the patch set, Dumazet sets the MTU to
> 61,512. That turns out to be space for fifteen 4096-byte pages of
> data, plus 40 bytes for the IPv6 header and 32 bytes for the TCP
> header.”
> 
> https://lwn.net/Articles/752188/
> 
> Either receive case also makes my persistent per client stream buffer
> zerocopy impossible lol.

it depends

> 
> in short, zerocopy sendmsg with persistently pinned buffers is
> definitely possible and we should do that. (I'll just make it work on
> my end).
> 
> recvmsg i'll have to do more research into the practicality of what I
> proposed above.

1. NIC is smart enough and can locate the end (userspace) buffer and
DMA there directly. That requires parsing TCP/UDP headers, etc., or
having a more versatile API like infiniband. + extra NIC features.

2. map skbuffs into userspace as TCP_ZEROCOPY_RECEIVE does.

-- 
Pavel Begunkov


Thread overview: 6+ messages
2020-11-10 21:31 io_uring-only sendmsg + recvmsg zerocopy Victor Stewart
2020-11-10 23:23 ` Pavel Begunkov
     [not found]   ` <CAM1kxwjSyLb9ijs0=RZUA06E20qjwBnAZygwM3ckh10WozExag@mail.gmail.com>
2020-11-11  0:25     ` Fwd: " Victor Stewart
2020-11-11  0:57     ` Pavel Begunkov
2020-11-11 16:49       ` Victor Stewart
2020-11-11 18:50         ` Pavel Begunkov
