From: Pavel Begunkov <[email protected]>
To: Victor Stewart <[email protected]>, io-uring <[email protected]>
Subject: Re: io_uring-only sendmsg + recvmsg zerocopy
Date: Wed, 11 Nov 2020 00:57:43 +0000 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <CAM1kxwjSyLb9ijs0=RZUA06E20qjwBnAZygwM3ckh10WozExag@mail.gmail.com>
On 11/11/2020 00:07, Victor Stewart wrote:
> On Tue, Nov 10, 2020 at 11:26 PM Pavel Begunkov <[email protected]> wrote:
>>> we'd be looking at approx +100% throughput each on the send and recv
>>> paths (per TCP_ZEROCOPY_RECEIVE benchmarks).
>>>
>>> these would be io_uring only operations given the sendmsg completion
>>> logic described below. want to get some consensus that this design
>>> could/would be acceptable for merging before I begin writing the code.
>>>
>>> the problem with zerocopy send is the asynchronous ACK from the NIC
>>> confirming transmission. and you can’t just block on a syscall til
>>> then. MSG_ZEROCOPY tackled this by putting the ACK on the
>>> MSG_ERRQUEUE. but that logic is very disjointed and requires a double
>>> completion (once from sendmsg once the send is enqueued, and again
>>> once the NIC ACKs the transmission), and requires costly userspace
>>> bookkeeping.
>>>
>>> so what i propose instead is to exploit the asynchrony of io_uring.
>>>
>>> you’d submit the IORING_OP_SENDMSG_ZEROCOPY operation, and then
>>> sometime later receive the completion event on the ring’s completion
>>> queue (either failure or success once ACK-ed by the NIC). 1 unified
>>> completion flow.
>>
>> I thought about it after your other email. It makes sense for message
>> oriented protocols but may not for streams. That's because a user
>> may want to call
>>
>> send();
>> send();
>>
>> and expect the right ordering, and that's where waiting for the ACK
>> may add a lot of latency. So returning from the call here is a
>> notification that "it's accounted, you may send more and order will be
>> preserved".
>>
>> And since ACKs may come a long time after, you may put a lot of code
>> between send()s and still suffer latency (and so potentially a
>> throughput drop).
>>
>> As for me, as an optional feature it sounds sensible, and should work
>> well for some use cases. But for others it may be good to have 2
>> notifications (1. ready for the next send(), 2. ready to recycle the
>> buf). E.g. 2 CQEs, which wouldn't work without a bit of io_uring
>> patching.
>>
>
> we could make it datagram only, like check the socket was created with
> SOCK_DGRAM and fail otherwise... if it requires too much io_uring
> changes / possible regression to accommodate a 2 cqe mode.

No need, streams can also benefit from it.

It may be easier to do via two requests, with the second receiving
errors (yeah, msg_control again).
>>> we can somehow tag the socket as registered to io_uring, then when the
>>
>> I'd rather tag a request
>
> as long as the NIC is able to find / callback the ring about
> transmission ACK, whatever the path of least resistance is is best.
>
>>
>>> NIC ACKs, instead of finding the socket's error queue and putting the
>>> completion there like MSG_ZEROCOPY, the kernel would find the io_uring
>>> instance the socket is registered to and call into an io_uring
>>> sendmsg_zerocopy_completion function. Then the cqe would get pushed
>>> onto the completion queue.
>>>
>>> the "recvmsg zerocopy" is straightforward enough. mimicking
>>> TCP_ZEROCOPY_RECEIVE, i'll go into specifics next time.
>>
>> Receive side is inherently messed up. IIRC, TCP_ZEROCOPY_RECEIVE just
>> maps skbuffs into userspace, and in general unless there is a better
>> suited protocol (e.g. infiniband with richer src/dst tagging) or a very
>> very smart NIC, "true zerocopy" is not possible without breaking
>> multiplexing.
>>
>> For registered buffers you still need to copy skbuff, at least because
>> of security implications.
>
> we can actually just force those buffers to be mmap-ed, and then when
> packets arrive use vm_insert_page() or remap_pfn_range() to change the
> physical pages backing the virtual memory pages submitted for reading
> via msg_iov. so it's transparent to userspace but still zerocopy.
> (might require the user to notify io_uring when reading is
> completed... but no matter).
Yes, with io_uring zerocopy-recv may be done better than
TCP_ZEROCOPY_RECEIVE, but
1) it's still a remap. Yes, zerocopy, but not ideal.
2) it won't work with registered buffers, which are basically a set
of pinned pages that have a userspace mapping. After such a remap
that mapping wouldn't be in sync, and that gets messy.
>>> the other big concern is the lifecycle of the persistent memory
>>> buffers in the case of nefarious actors. but since we already have
>>> buffer registration for O_DIRECT, I assume those mechanics already
>>
>> just buffer registration, not specifically for O_DIRECT
>>
>>> address those issues and can just be repurposed?
>>
>> Depending on how long it could be stuck in the net stack, we might need
>> to be able to cancel those requests. That may be a problem.
>
> I spoke about this idea with Willem the other day and he mentioned...
>
> "As long as the mappings aren't unwound on process exit. But then you
The pages won't be unpinned until all related requests are gone, but
for that io_uring waits on exit for them to complete. That's one of the
reasons why requests should be either cancellable or short-lived and
somewhat predictably time-bound.
> open up to malicious applications that purposely register ranges and
> then exit. The basics are straightforward to implement, but it's not
> that easy to arrive at something robust."
>
>>
>>>
>>> and so with those persistent memory buffers, you'd only pay the cost
>>> of pinning the memory into the kernel once upon registration, before
>>> you even start your server listening... thus "free". versus pinning
>>> per sendmsg like with MSG_ZEROCOPY... thus "true zerocopy".
--
Pavel Begunkov
Thread overview: 6+ messages
2020-11-10 21:31 io_uring-only sendmsg + recvmsg zerocopy Victor Stewart
2020-11-10 23:23 ` Pavel Begunkov
[not found] ` <CAM1kxwjSyLb9ijs0=RZUA06E20qjwBnAZygwM3ckh10WozExag@mail.gmail.com>
2020-11-11 0:25 ` Fwd: " Victor Stewart
2020-11-11 0:57 ` Pavel Begunkov [this message]
2020-11-11 16:49 ` Victor Stewart
2020-11-11 18:50 ` Pavel Begunkov