public inbox for [email protected]
From: Victor Stewart <[email protected]>
To: [email protected]
Subject: io_uring-only sendmsg + recvmsg zerocopy
Date: Tue, 10 Nov 2020 21:31:06 +0000
Message-ID: <CAM1kxwgKGMz9UcvpFr1239kmdvmKPuzAyBEwKi_rxDog1MshRQ@mail.gmail.com>

here’s the design i’m flirting with for a "recvmsg and sendmsg zerocopy
with persistent buffers" patch.

we'd be looking at approx +100% throughput each on the send and recv
paths (per TCP_ZEROCOPY_RECEIVE benchmarks).

these would be io_uring-only operations given the sendmsg completion
logic described below. i want to get some consensus that this design
could/would be acceptable for merging before i begin writing the code.

the problem with zerocopy send is the asynchronous ACK from the NIC
confirming transmission: you can’t just block on a syscall until
then. MSG_ZEROCOPY tackled this by putting the ACK on the
MSG_ERRQUEUE. but that logic is very disjointed. it requires a double
completion (first from sendmsg when the send is enqueued, and again
when the NIC ACKs the transmission), plus costly userspace
bookkeeping.
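
to make the bookkeeping cost concrete, here is a minimal sketch of what userspace ends up tracking today. it is a model only: no sockets are opened. it assumes the documented MSG_ZEROCOPY behaviour that each successful zerocopy send gets the next value of a per-socket 32-bit counter, and that an error-queue notification reports an inclusive id range (ee_info..ee_data). the names zc_tracker, zc_on_send_returned, zc_on_notification and MAX_INFLIGHT are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_INFLIGHT 64 /* hypothetical cap on unacknowledged sends */

/* One slot per zerocopy send whose buffer the kernel may still reference. */
struct zc_tracker {
    void    *buf[MAX_INFLIGHT];  /* buffer held by send id (id % MAX_INFLIGHT) */
    bool     busy[MAX_INFLIGHT];
    uint32_t next_id;            /* mirrors the kernel's per-socket counter */
    unsigned released;           /* total buffers handed back so far */
};

/*
 * Completion #1: sendmsg() has returned. The data is only queued, so the
 * buffer must be remembered and left untouched. Returns the id a later
 * notification will carry, or -1 if too many sends are in flight.
 */
static int64_t zc_on_send_returned(struct zc_tracker *t, void *buf)
{
    assert(t != NULL && buf != NULL);
    uint32_t id = t->next_id;
    size_t slot = id % MAX_INFLIGHT;
    if (t->busy[slot])
        return -1;
    t->buf[slot] = buf;
    t->busy[slot] = true;
    t->next_id++;
    return (int64_t)id;
}

/*
 * Completion #2: a notification was read from the error queue covering
 * ids lo..hi inclusive. Only now may those buffers be reused.
 * Returns how many buffers were released.
 */
static unsigned zc_on_notification(struct zc_tracker *t, uint32_t lo, uint32_t hi)
{
    unsigned n = 0;
    for (uint32_t id = lo; ; id++) {   /* id++ wraps like the kernel counter */
        size_t slot = id % MAX_INFLIGHT;
        if (t->busy[slot]) {
            t->busy[slot] = false;
            t->buf[slot] = NULL;
            n++;
        }
        if (id == hi)
            break;
    }
    t->released += n;
    return n;
}
```

every send therefore touches this table twice, once per completion, and the application has to poll the error queue separately to drive the second step.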

so what i propose instead is to exploit the asynchrony of io_uring.

you’d submit the IORING_OP_SENDMSG_ZEROCOPY operation, and then
sometime later receive the completion event on the ring’s completion
queue (either failure or success once ACK-ed by the NIC). 1 unified
completion flow.

we can somehow tag the socket as registered to io_uring, then when the
NIC ACKs, instead of finding the socket's error queue and putting the
completion there like MSG_ZEROCOPY, the kernel would find the io_uring
instance the socket is registered to and call into an io_uring
sendmsg_zerocopy_completion function. Then the cqe would get pushed
onto the completion queue.
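
a rough model of that single-completion flow, as a sketch only. nothing here is existing kernel or liburing API: ring_model, zc_send_model, submit_sendmsg_zc and the body of sendmsg_zerocopy_completion are hypothetical stand-ins. the assumption is that submission posts nothing, and that exactly one cqe is posted when the NIC ACK (or an error) arrives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CQ_ENTRIES 8 /* power of two so indices can be masked */

struct cqe_model { uint64_t user_data; int32_t res; };

struct ring_model {
    struct cqe_model cq[CQ_ENTRIES];
    uint32_t cq_head, cq_tail;   /* free-running counters */
};

/* Kernel-side state for one in-flight zerocopy send (hypothetical). */
struct zc_send_model {
    struct ring_model *ring;     /* ring the socket is registered to */
    uint64_t user_data;          /* tag the application chose at submit */
    int32_t  bytes;              /* result to report on success */
    bool     completed;
};

static bool post_cqe(struct ring_model *r, uint64_t user_data, int32_t res)
{
    assert(r != NULL);
    if (r->cq_tail - r->cq_head == CQ_ENTRIES)
        return false;            /* completion queue is full */
    r->cq[r->cq_tail & (CQ_ENTRIES - 1)] = (struct cqe_model){ user_data, res };
    r->cq_tail++;
    return true;
}

/* Submission: remember where the completion must go. No cqe is posted. */
static void submit_sendmsg_zc(struct ring_model *r, struct zc_send_model *s,
                              uint64_t user_data, int32_t bytes)
{
    s->ring = r;
    s->user_data = user_data;
    s->bytes = bytes;
    s->completed = false;
}

/*
 * Called when the NIC reports the transmission finished (err == 0) or
 * failed (err is a negative errno). Posts the one and only cqe.
 */
static bool sendmsg_zerocopy_completion(struct zc_send_model *s, int32_t err)
{
    if (s->completed)
        return false;            /* never complete the same send twice */
    s->completed = true;
    return post_cqe(s->ring, s->user_data, err ? err : s->bytes);
}

/* Application side: pop one cqe if available. */
static bool reap_cqe(struct ring_model *r, struct cqe_model *out)
{
    if (r->cq_head == r->cq_tail)
        return false;
    *out = r->cq[r->cq_head & (CQ_ENTRIES - 1)];
    r->cq_head++;
    return true;
}
```

in this model the application sees one event per send, and seeing it means the buffer is safe to reuse, so no separate id-range tracking is needed.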

the "recvmsg zerocopy" is straightforward enough, mimicking
TCP_ZEROCOPY_RECEIVE. i'll go into specifics next time.

the other big concern is the lifecycle of the persistent memory
buffers in the case of nefarious actors. but since we already have
buffer registration for O_DIRECT, I assume those mechanics already
address those issues and can just be repurposed?

and so with those persistent memory buffers, you'd only pay the cost
of pinning the memory into the kernel once upon registration, before
you even start your server listening... thus "free". versus pinning
per sendmsg like with MSG_ZEROCOPY... thus "true zerocopy".
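
a small sketch of the "pin once, then reference by index" idea, again as a userspace model rather than real kernel code. buf_table_model, register_buffer_model and send_fixed_model are hypothetical names, and pin_ops is just a counter standing in for the page-pinning work. the per-send bounds check is an assumption about how a send would be validated against the registered region, loosely modelled on how existing fixed-buffer operations refer to a buffer index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BUFS 4 /* hypothetical table size */

struct reg_buf { uintptr_t base; size_t len; };

struct buf_table_model {
    struct reg_buf bufs[MAX_BUFS];
    unsigned nr_bufs;
    unsigned pin_ops;   /* how many times pages were "pinned" */
};

/* One-time registration: the only place pinning cost is paid.
 * Returns the buffer index, or -1 on failure. */
static int register_buffer_model(struct buf_table_model *t, void *base, size_t len)
{
    assert(t != NULL);
    if (t->nr_bufs == MAX_BUFS || base == NULL || len == 0)
        return -1;
    t->bufs[t->nr_bufs] = (struct reg_buf){ (uintptr_t)base, len };
    t->pin_ops++;
    return (int)t->nr_bufs++;
}

/* Per-send path: no pinning, only a check that [addr, addr + len)
 * lies inside the registered buffer named by idx. */
static bool send_fixed_model(const struct buf_table_model *t, unsigned idx,
                             const void *addr, size_t len)
{
    if (idx >= t->nr_bufs)
        return false;
    const struct reg_buf *b = &t->bufs[idx];
    uintptr_t a = (uintptr_t)addr;
    if (a < b->base)
        return false;
    size_t off = a - b->base;
    return off <= b->len && len <= b->len - off;   /* overflow-safe */
}
```

with one registered pool, any number of sends leaves pin_ops at 1, and a request that names an unknown index or runs past the end of the pool is rejected before any data is touched.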
