From: Pavel Begunkov <[email protected]>
To: Victor Stewart <[email protected]>, [email protected]
Subject: Re: io_uring-only sendmsg + recvmsg zerocopy
Date: Tue, 10 Nov 2020 23:23:22 +0000 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <CAM1kxwgKGMz9UcvpFr1239kmdvmKPuzAyBEwKi_rxDog1MshRQ@mail.gmail.com>
On 10/11/2020 21:31, Victor Stewart wrote:
> here’s the design i’m flirting with for "recvmsg and sendmsg zerocopy"
> with persistent buffers patch.
Ok, first we need to make it work with registered buffers. I had patches
for that but need to rebase and refresh them; I'll send them out this week.
Zerocopy would still go through some pinning,
e.g. skb_zerocopy_iter_*() -> iov_iter_get_pages()
-> get_page() -> atomic_inc()
but it's lighter for bvec and can be optimised later if needed.
That still leaves hooking into struct ubuf_info with completion
callbacks for zerocopy.
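As a rough kernel-side sketch of what that hookup could look like, loosely
modeled on how msg_zerocopy_callback() works today. The names io_notif,
io_uring_tx_zerocopy_callback and io_post_cqe are invented for illustration,
not existing kernel symbols:

```c
/* Hypothetical: an io_uring notification object embedding the
 * ubuf_info that travels down the TX path with the skbs. */
struct io_notif {
	struct ubuf_info uarg;      /* passed to skb_zerocopy_iter_*() */
	struct io_ring_ctx *ctx;    /* ring to post the CQE to */
	u64 user_data;              /* request's CQE tag */
};

static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
					  struct ubuf_info *uarg,
					  bool success)
{
	struct io_notif *notif = container_of(uarg, struct io_notif, uarg);

	/* The NIC is done with the pages: post the CQE so userspace
	 * knows the registered buffer can be recycled. */
	io_post_cqe(notif->ctx, notif->user_data, success ? 0 : -EIO);
}
```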
>
> we'd be looking at approx +100% throughput each on the send and recv
> paths (per TCP_ZEROCOPY_RECEIVE benchmarks).
>
> these would be io_uring only operations given the sendmsg completion
> logic described below. want to get some consensus that this design
> could/would be acceptable for merging before I begin writing the code.
>
> the problem with zerocopy send is the asynchronous ACK from the NIC
> confirming transmission. and you can’t just block on a syscall til
> then. MSG_ZEROCOPY tackled this by putting the ACK on the
> MSG_ERRQUEUE. but that logic is very disjointed and requires a double
> completion (once from sendmsg once the send is enqueued, and again
> once the NIC ACKs the transmission), and requires costly userspace
> bookkeeping.
>
> so what i propose instead is to exploit the asynchrony of io_uring.
>
> you’d submit the IORING_OP_SENDMSG_ZEROCOPY operation, and then
> sometime later receive the completion event on the ring’s completion
> queue (either failure or success once ACK-ed by the NIC). 1 unified
> completion flow.
I thought about it after your other email. It makes sense for message
oriented protocols, but maybe not for streams. That's because a user
may want to call
send();
send();
and expect the right ordering, and that's where waiting for the ACK may
add a lot of latency. Returning from the call there is a notification
that "it's accounted, you may send more and the order will be preserved".
And since ACKs may come long after, you can put a lot of code between
the send()s and still suffer the latency (and so potentially a
throughput drop).
As for me, it sounds sensible as an optional feature, and should work
well for some use cases. But for others it may be better to have 2
notifications (1. ready for the next send(), 2. ready to recycle the
buffer). E.g. 2 CQEs, which wouldn't work without a bit of io_uring
patching.
>
> we can somehow tag the socket as registered to io_uring, then when the
I'd rather tag a request
> NIC ACKs, instead of finding the socket's error queue and putting the
> completion there like MSG_ZEROCOPY, the kernel would find the io_uring
> instance the socket is registered to and call into an io_uring
> sendmsg_zerocopy_completion function. Then the cqe would get pushed
> onto the completion queue.
>
> the "recvmsg zerocopy" is straight forward enough. mimicking
> TCP_ZEROCOPY_RECEIVE, i'll go into specifics next time.
Receive side is inherently messed up. IIRC, TCP_ZEROCOPY_RECEIVE just
maps skbuff pages into userspace, and in general, unless there is a
better suited protocol (e.g. infiniband with richer src/dst tagging) or
a very smart NIC, "true zerocopy" is not possible without breaking
multiplexing.
For registered buffers you'd still need a copy out of the skbuff, at
least because of the security implications.
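For reference, the TCP_ZEROCOPY_RECEIVE mechanism being mimicked looks
roughly like the sketch below on 4.18+ kernels: the kernel remaps
page-aligned received data into a user VMA instead of copying it.
connect_loopback_tcp() and consume() are hypothetical helpers, not real
APIs:

```c
/* Sketch: map received TCP payload instead of copying it. */
int fd = connect_loopback_tcp();               /* hypothetical helper */
size_t len = 1 << 16;
void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

struct tcp_zerocopy_receive zc = {
	.address = (__u64)(unsigned long)addr,
	.length  = len,
};
socklen_t zc_len = sizeof(zc);

/* On success, zc.length bytes are now mapped at addr; any unaligned
 * tail (zc.recv_skip_hint) must still be read() the normal way. */
if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len) == 0)
	consume(addr, zc.length);              /* hypothetical */
```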
>
> the other big concern is the lifecycle of the persistent memory
> buffers in the case of nefarious actors. but since we already have
> buffer registration for O_DIRECT, I assume those mechanics already
just buffer registration, not specifically for O_DIRECT
> address those issues and can just be repurposed?
Depending on how long requests can get stuck in the net stack, we might
need to be able to cancel them. That may be a problem.
>
> and so with those persistent memory buffers, you'd only pay the cost
> of pinning the memory into the kernel once upon registration, before
> you even start your server listening... thus "free". versus pinning
> per sendmsg like with MSG_ZEROCOPY... thus "true zerocopy".
>
--
Pavel Begunkov
Thread overview: 6+ messages
2020-11-10 21:31 io_uring-only sendmsg + recvmsg zerocopy Victor Stewart
2020-11-10 23:23 ` Pavel Begunkov [this message]
[not found] ` <CAM1kxwjSyLb9ijs0=RZUA06E20qjwBnAZygwM3ckh10WozExag@mail.gmail.com>
2020-11-11 0:25 ` Fwd: " Victor Stewart
2020-11-11 0:57 ` Pavel Begunkov
2020-11-11 16:49 ` Victor Stewart
2020-11-11 18:50 ` Pavel Begunkov