From: Pavel Begunkov <[email protected]>
To: Victor Stewart <[email protected]>, io-uring <[email protected]>
Subject: Re: io_uring-only sendmsg + recvmsg zerocopy
Date: Wed, 11 Nov 2020 00:57:43 +0000 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <CAM1kxwjSyLb9ijs0=RZUA06E20qjwBnAZygwM3ckh10WozExag@mail.gmail.com>
On 11/11/2020 00:07, Victor Stewart wrote:
> On Tue, Nov 10, 2020 at 11:26 PM Pavel Begunkov <[email protected]> wrote:
>>> we'd be looking at approx +100% throughput each on the send and recv
>>> paths (per TCP_ZEROCOPY_RECEIVE benchmarks).
>>>
>>> these would be io_uring only operations given the sendmsg completion
>>> logic described below. want to get some consensus that this design
>>> could/would be acceptable for merging before I begin writing the code.
>>>
>>> the problem with zerocopy send is the asynchronous ACK from the NIC
>>> confirming transmission. and you can’t just block on a syscall til
>>> then. MSG_ZEROCOPY tackled this by putting the ACK on the
>>> MSG_ERRQUEUE. but that logic is very disjointed and requires a double
>>> completion (once from sendmsg once the send is enqueued, and again
>>> once the NIC ACKs the transmission), and requires costly userspace
>>> bookkeeping.
>>>
>>> so what i propose instead is to exploit the asynchrony of io_uring.
>>>
>>> you’d submit the IORING_OP_SENDMSG_ZEROCOPY operation, and then
>>> sometime later receive the completion event on the ring’s completion
>>> queue (either failure or success once ACK-ed by the NIC). 1 unified
>>> completion flow.
>>
>> I thought about it after your other email. It makes sense for message
>> oriented protocols but may not for streams. That's because a user
>> may want to call
>>
>> send();
>> send();
>>
>> and expect the right ordering, and that's where waiting for the ACK
>> may add a lot of latency. So returning from the call here is a
>> notification that "it's accounted, you may send more and order will be
>> preserved".
>>
>> And since ACKs may come a long time after, you may put a lot of code
>> between send()s and still suffer latency (and so potentially a
>> throughput drop).
>>
>> As for me, as an optional feature it sounds sensible, and should work
>> well for some use cases. But for others it may be good to have 2
>> notifications (1. ready for the next send(), 2. ready to recycle the
>> buf). E.g. 2 CQEs, which wouldn't work without a bit of io_uring
>> patching.
>>
>
> we could make it datagram only, like check the socket was created with
> SOCK_DGRAM and fail otherwise... if it requires too much io_uring
> changes / possible regression to accommodate a 2 cqe mode.

No need, streams can also benefit from it.

It may be easier to do via two requests, with the second receiving
errors (yeah, msg_control again).
>>> we can somehow tag the socket as registered to io_uring, then when the
>>
>> I'd rather tag a request
>
> as long as the NIC is able to find / callback the ring about
> transmission ACK, whatever the path of least resistance is is best.
>
>>
>>> NIC ACKs, instead of finding the socket's error queue and putting the
>>> completion there like MSG_ZEROCOPY, the kernel would find the io_uring
>>> instance the socket is registered to and call into an io_uring
>>> sendmsg_zerocopy_completion function. Then the cqe would get pushed
>>> onto the completion queue.
>>>
>>> the "recvmsg zerocopy" is straightforward enough. mimicking
>>> TCP_ZEROCOPY_RECEIVE, i'll go into specifics next time.
>>
>> Receive side is inherently messed up. IIRC, TCP_ZEROCOPY_RECEIVE just
>> maps skbuffs into userspace, and in general unless there is a better
>> suited protocol (e.g. infiniband with richer src/dst tagging) or a very
>> very smart NIC, "true zerocopy" is not possible without breaking
>> multiplexing.
>>
>> For registered buffers you still need to copy skbuff, at least because
>> of security implications.
>
> we can actually just force those buffers to be mmap-ed, and then when
> packets arrive use vm_insert_page() or remap_pfn_range() to change the
> physical pages backing the virtual memory pages submitted for reading
> via msg_iov. so it's transparent to userspace but still zerocopy.
> (might require the user to notify io_uring when reading is
> completed... but no matter).
Yes, with io_uring zerocopy-recv may be done better than
TCP_ZEROCOPY_RECEIVE, but
1) it's still a remap. Yes, zerocopy, but not ideal.
2) it won't work with registered buffers, which are basically a set
of pinned pages that have a userspace mapping. After such a remap
that mapping wouldn't be in sync, and that gets messy.
>>> the other big concern is the lifecycle of the persistent memory
>>> buffers in the case of nefarious actors. but since we already have
>>> buffer registration for O_DIRECT, I assume those mechanics already
>>
>> just buffer registration, not specifically for O_DIRECT
>>
>>> address those issues and can just be repurposed?
>>
>> Depending on how long it could be stuck in the net stack, we might need
>> to be able to cancel those requests. That may be a problem.
>
> I spoke about this idea with Willem the other day and he mentioned...
>
> "As long as the mappings aren't unwound on process exit. But then you
The pages won't be unpinned until all related requests are gone, but
for that io_uring waits on exit for them to complete. That's one of the
reasons why requests should be either cancellable or short-lived and
somewhat predictably time-bound.
> open up to malicious applications that purposely register ranges and
> then exit. The basics are straightforward to implement, but it's not
> that easy to arrive at something robust."
>
>>
>>>
>>> and so with those persistent memory buffers, you'd only pay the cost
>>> of pinning the memory into the kernel once upon registration, before
>>> you even start your server listening... thus "free". versus pinning
>>> per sendmsg like with MSG_ZEROCOPY... thus "true zerocopy".
--
Pavel Begunkov
Thread overview: 6+ messages
2020-11-10 21:31 io_uring-only sendmsg + recvmsg zerocopy Victor Stewart
2020-11-10 23:23 ` Pavel Begunkov
[not found] ` <CAM1kxwjSyLb9ijs0=RZUA06E20qjwBnAZygwM3ckh10WozExag@mail.gmail.com>
2020-11-11 0:25 ` Fwd: " Victor Stewart
2020-11-11 0:57 ` Pavel Begunkov [this message]
2020-11-11 16:49 ` Victor Stewart
2020-11-11 18:50 ` Pavel Begunkov