Date: Sun, 31 Jan 2021 01:10:27 -0800 (PST)
From: dormando
To: io-uring <io-uring@vger.kernel.org>
Subject: tcp short writes / write ordering / etc
Message-ID: <855d3bc1-f694-e42e-283e-f8ee8f9c8e6e@rydia.net>

Hey,

I'm trying to puzzle out an architecture on top of io_uring for a TCP proxy
I'm working on. I have a high-level question first, then I'll explain what
I'm doing for context:

- How is ordering maintained for write()'s to the same FD from different
  SQE's on a network socket? i.e., I get request A and queue a write(),
  later request B comes in and gets queued, and then A finishes short.
  There was no chance to IOSQE_IO_LINK A to B. Does B get cancelled? This
  makes sense to me for disk IO, but I can't wrap my head around it for
  network sockets.

The setup:

- N per-core worker threads. Each thread handles X client sockets.
- Y backend sockets in a global shared pool. These point to storage servers
  (or other proxies/anything).
- Client sockets wake up with requests for an arbitrary number of keys
  (1 to 100 or so).
- Each key is mapped to a backend (like keyhash % Y).
- A new request is dispatched for each key to its backend socket.
- The results are put back into order and returned to the client.

The workers are designed such that they should not have to wait for a large
request set before processing the next ready client socket. i.e., thread N1
gets a request for 100 keys; it queues that work off, then starts on a
request for a single key, and picks up the results of the original request
later and returns them. Otherwise we get poor long-tail latency.

I've been working on a test program to mock up this new backend. Mock worker
threads submit batches of work from fake connections, and then either
libevent or io_uring handles the IO.

In libevent/epoll mode:

- Workers can directly call write() to backend sockets while holding a lock
  around a descriptive structure. This ensures ordering.
- OR workers submit stacks of work to one or more threads across which the
  backend sockets are striped. These threads lock and write(). This mode
  helps with latency pileup.
- A dedicated thread sits in epoll_wait() on EPOLLIN for every backend
  socket. This avoids repeated epoll_ctl() add/mod calls. As responses are
  parsed, completed sets of requests are shipped back to the worker threads.

In uring mode:

- Workers should submit to a single thread (or a few threads), each with a
  private ring. SQE's are stacked and submit()'ed in a batch, ideally saving
  all of the overhead of write()'ing to a bunch of sockets individually.
  (Not working yet.)
- A dedicated thread with its own ring sits on recv() for each backend
  socket. It works the same as in epoll mode, except that after each read I
  have to re-submit a new SQE for the next read; a rough sketch follows
  below. (I have everything sharing the same WQ, for what it's worth.)
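The reader side of uring mode looks roughly like this (a simplified sketch;
"struct backend", parse_response(), and all error handling stand in for the
real code):

#include <liburing.h>

/* One outstanding recv per backend socket, re-armed after every CQE. */
struct backend {
    int fd;
    char buf[16384];
};

/* Existing parser; ships completed request sets back to the workers. */
extern void parse_response(struct backend *be, int nbytes);

static void arm_recv(struct io_uring *ring, struct backend *be)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    io_uring_prep_recv(sqe, be->fd, be->buf, sizeof(be->buf), 0);
    io_uring_sqe_set_data(sqe, be);
}

static void reader_loop(struct io_uring *ring, struct backend *bes, int nbackends)
{
    for (int i = 0; i < nbackends; i++)
        arm_recv(ring, &bes[i]);
    io_uring_submit(ring);

    for (;;) {
        struct io_uring_cqe *cqe;
        struct backend *be;
        int res;

        if (io_uring_wait_cqe(ring, &cqe) < 0)
            continue;
        be = io_uring_cqe_get_data(cqe);
        res = cqe->res;
        io_uring_cqe_seen(ring, cqe);

        if (res > 0)
            parse_response(be, res);

        /* Unlike epoll, the recv must be re-queued for the next read. */
        arm_recv(ring, be);
        io_uring_submit(ring);
    }
}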
I'm trying to figure out uring mode's single submission thread, but the IO
ordering issues are blanking my mind. Requests can come in interleaved since
the backends are shared, and waiting for a batch to complete before
submitting the next one defeats the purpose (I think).

What would be super nice, but I'm pretty sure is impossible:

- M (possibly 1) thread(s) sitting on recv(), each with its own ring.
- N client-handling worker threads with independent rings on the same WQ.
- SQE's with writes to the same backend FD are serialized by a magical
  unicorn.

Then:

- A worker with a request for 100 keys makes and submits the SQE's itself,
  then moves on to the next client connection.
- The recv() thread gathers responses and signals the worker when the batch
  is complete.

If I can avoid issues with short/colliding writes I can still make this
work, since my protocol allows out-of-order responses, but that isn't the
default mode so I need both to work anyway.

Apologies if this isn't clear or was answered recently; I did try to read
the archives/code/etc.

Thanks,
-Dormando
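P.S. To make the ordering question at the top concrete, this is roughly the
submission pattern I'm worried about (simplified; the fd, buffers, and tags
are placeholders, and io_uring_get_sqe() failures are ignored):

#include <liburing.h>

/* Two independent writes to the same backend fd, queued at different times.
 * There is no IOSQE_IO_LINK between them because request B did not exist
 * yet when A was submitted. */
static void queue_two_requests(struct io_uring *ring, int backend_fd,
                               const char *req_a, size_t alen,
                               const char *req_b, size_t blen)
{
    struct io_uring_sqe *sqe;

    /* Request A arrives. */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_send(sqe, backend_fd, req_a, alen, 0);
    sqe->user_data = 1;                 /* tag A */
    io_uring_submit(ring);

    /* ... later, request B arrives for the same backend ... */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_send(sqe, backend_fd, req_b, blen, 0);
    sqe->user_data = 2;                 /* tag B */
    io_uring_submit(ring);

    /* If A's CQE comes back with res < alen (a short write), has B already
     * gone out behind A's partial data? IOSQE_IO_LINK would only have
     * helped if A and B had been queued together. */
}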