From: Pavel Begunkov <[email protected]>
To: dormando <[email protected]>, io-uring <[email protected]>
Subject: Re: tcp short writes / write ordering / etc
Date: Mon, 1 Feb 2021 10:56:38 +0000
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On 31/01/2021 09:10, dormando wrote:
> Hey,
>
> I'm trying to puzzle out an architecture on top of io_uring for a tcp
> proxy I'm working on. I have a high level question, then I'll explain what
> I'm doing for context:
>
> - How (is?) order maintained for write()'s to the same FD from different
> SQE's to a network socket? ie; I get request A and queue a write(), later
> request B comes in and gets queued, A finishes short. There was no chance
> to IOSQE_LINK A to B. Does B cancel? This makes sense for disk IO but I
> can't wrap my head around it for network sockets.

Without IOSQE_LINK or anything else, there are no ordering guarantees. Even
if the CQEs arrive in some order, the actual I/O may have been executed in
the reverse order.
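
For illustration, linking the two writes with liburing would look roughly
like the sketch below (untested; the fd and buffer names are placeholders).
If A fails or completes short, the link is severed and B's CQE comes back
with -ECANCELED:

#include <liburing.h>
#include <stddef.h>

/* Illustrative only: B is ordered after A on the same socket. */
static int send_a_then_b(struct io_uring *ring, int fd,
                         const void *a, size_t alen,
                         const void *b, size_t blen)
{
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        if (!sqe)
                return -1;
        io_uring_prep_send(sqe, fd, a, alen, 0);
        sqe->flags |= IOSQE_IO_LINK;    /* chain A -> B */

        sqe = io_uring_get_sqe(ring);
        if (!sqe)
                return -1;
        io_uring_prep_send(sqe, fd, b, blen, 0);

        return io_uring_submit(ring);
}
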
>
> The setup:
>
> - N per-core worker threads. Each thread handles X client sockets.
> - Y backend sockets in a global shared pool. These point to storage
> servers (or other proxies/anything).
>
> - client sockets wake up with requests for an arbitrary number of keys (1
> to 100 or so).
> - each key is mapped to a backend (like keyhash % Y).
> - new requests are dispatched for each key to each backend socket.
> - the results are put back into order and returned to the client.
>
> The workers are designed such that they should not have to wait for a
> large request set before processing the next ready client socket. ie;
> thread N1 gets a request for 100 keys; it queues that work off, and then
> starts on a request for a single key. it picks up the results of the
> original request later and returns it. Else we get poor long tail latency.
>
> I've been working out a test program to mock this new backend. I have mock
> worker threads that submit batches of work from fake connections, and then
> have libevent or io_uring handle things.
>
> In libevent/epoll mode:
> - workers can directly call write() to backend sockets while holding a
> lock around a descriptive structure. this ensures order.
> - OR workers submit stacks to one or more threads which the backend
> sockets are striped across. These threads lock and write(). This mode
> helps with latency pileup.
> - a dedicated thread sits in epoll_wait() on EPOLLIN for each backend
> socket. This avoids repeated epoll_ctl() add/mod/etc calls. As responses
> are parsed, completed sets of requests are shipped back to the worker
> threads.
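
For reference, that dedicated reader boils down to a loop like the one
below (illustrative sketch; handle_readable() is a placeholder for the
read/parse/dispatch step described above):

#include <sys/epoll.h>

void handle_readable(int fd);   /* placeholder: read, parse, ship to workers */

static void epoll_reader(int *backend_fds, int nfds)
{
        int ep = epoll_create1(0);
        struct epoll_event ev, events[64];

        /* each backend socket is registered once, so no per-wakeup
         * epoll_ctl() traffic is needed afterwards */
        for (int i = 0; i < nfds; i++) {
                ev.events = EPOLLIN;
                ev.data.fd = backend_fds[i];
                epoll_ctl(ep, EPOLL_CTL_ADD, backend_fds[i], &ev);
        }

        for (;;) {
                int n = epoll_wait(ep, events, 64, -1);

                for (int i = 0; i < n; i++)
                        handle_readable(events[i].data.fd);
        }
}
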
>
> In uring mode:
> - workers should submit to a single (or few) threads which have a private
> ring. sqe's are stacked and submit()'ed in a batch. Ideally saving all of
> the overhead of write()'ing to a bunch of sockets. (not working yet)
> - a dedicated thread with its own ring is sitting on recv() for each
> backend socket. It handles the same as epoll mode, except after each read
> I have to re-submit a new SQE for the next read.
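
That re-arm pattern would look roughly like this (illustrative liburing
sketch; parsing, error handling and buffer management are elided, and all
names are placeholders):

#include <liburing.h>
#include <stdint.h>
#include <stdlib.h>

#define BUF_SZ 16384

static void queue_recv(struct io_uring *ring, int fd, void *buf, intptr_t idx)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_recv(sqe, fd, buf, BUF_SZ, 0);
        io_uring_sqe_set_data(sqe, (void *)idx);    /* which backend */
}

/* Dedicated reader: one outstanding recv per backend socket, re-armed
 * after every completion. */
static void uring_reader(struct io_uring *ring, int *fds, int nfds)
{
        char (*bufs)[BUF_SZ] = calloc(nfds, BUF_SZ);

        for (intptr_t i = 0; i < nfds; i++)
                queue_recv(ring, fds[i], bufs[i], i);
        io_uring_submit(ring);

        for (;;) {
                struct io_uring_cqe *cqe;

                if (io_uring_wait_cqe(ring, &cqe) < 0)
                        break;

                intptr_t i = (intptr_t)io_uring_cqe_get_data(cqe);
                int res = cqe->res;     /* bytes received; <= 0 needs handling */

                io_uring_cqe_seen(ring, cqe);

                /* parse 'res' bytes from bufs[i], hand completed request
                 * sets back to the worker threads, then re-arm the recv */
                (void)res;
                queue_recv(ring, fds[i], bufs[i], i);
                io_uring_submit(ring);
        }
}
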
>
> (I have everything sharing the same WQ, for what it's worth)
>
> I'm trying to figure out uring mode's single submission thread, but
> figuring out the IO ordering issues is blanking my mind. Requests can come
> in interleaved as the backends are shared, and waiting for a batch to
> complete before submitting the next one defeats the purpose (I think).
>
> What would be super nice but I'm pretty sure is impossible:
>
> - M (possibly 1) thread(s) sitting on recv() in its own ring
> - N client handling worker threads with independent rings on the same WQ
> - SQE's with writes to the same backend FD are serialized by a magical
> unicorn.
>
> Then:
> - worker with a request for 100 keys makes and submits the SQE's itself,
> then moves on to the next client connection.
> - recv() thread gathers responses and signals worker when the batch is
> complete.
>
> If I can avoid issues with short/colliding writes I can still make this
> work as my protocol can allow for out of order responses, but it's not the
> default mode so I need both to work anyway.
>
> Apologies if this isn't clear or was answered recently; I did try to read
> archives/code/etc.
--
Pavel Begunkov