From: Pavel Begunkov
To: Jens Axboe, Dylan Yudaken
Cc: io-uring@vger.kernel.org
Subject: Re: [PATCH 6/8] io_uring/net: support multishot for send
Date: Mon, 26 Feb 2024 19:21:38 +0000
Message-ID: <68adc174-802a-455d-b6ca-a6908e592689@gmail.com>
In-Reply-To: <4823c201-8c5d-4a4f-a77e-bd3e6c239cbe@kernel.dk>
References: <20240225003941.129030-1-axboe@kernel.dk>
 <20240225003941.129030-7-axboe@kernel.dk>
 <63859888-5602-41fb-9a42-4edc6132766f@kernel.dk>
 <454ef0d2-066f-4bdf-af42-52fd0c57bd56@kernel.dk>
 <4823c201-8c5d-4a4f-a77e-bd3e6c239cbe@kernel.dk>

On 2/26/24 19:11, Jens Axboe wrote:
> On 2/26/24 8:41 AM, Pavel Begunkov wrote:
>> On 2/26/24 15:16, Jens Axboe wrote:
>>> On 2/26/24 7:36 AM, Pavel Begunkov wrote:
>>>> On 2/26/24 14:27, Jens Axboe wrote:
>>>>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>>>>> On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe wrote:
>>>>>>>
>>>>>>> On 2/26/24 3:47 AM, Dylan Yudaken wrote:
>>>>>>>> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe wrote:
>>>>>>>>>
>>>>>>>>> This works very much like the receive side, except for sends. The
>>>>>>>>> idea is that an application can fill outgoing buffers in a provided
>>>>>>>>> buffer group, and then arm a single send that will service them all.
>>>>>>>>> For now this variant just terminates when we are out of buffers to
>>>>>>>>> send, and hence the application needs to re-arm it if
>>>>>>>>> IORING_CQE_F_MORE isn't set, as per usual for multishot requests.
>>>>>>>>>
>>>>>>>>
>>>>>>>> This feels to me a lot like just using OP_SEND with MSG_WAITALL as
>>>>>>>> described, unless I'm missing something?
>>>>>>>
>>>>>>> How so? MSG_WAITALL is "send X amount of data, and if it's a short
>>>>>>> send, try again", whereas multishot is "send data from this buffer
>>>>>>> group, and keep sending data until it's empty". Hence it's the mirror
>>>>>>> of multishot on the receive side. Unless I'm misunderstanding you
>>>>>>> somehow, I'm not sure it'd be smart to add special meaning to
>>>>>>> MSG_WAITALL with provided buffers.
>>>>>>>
>>>>>>
>>>>>> _If_ you have the data upfront these are very similar, and only differ
>>>>>> in that the multishot approach will give you more granular progress
>>>>>> updates. My point was that this might not be a valuable API to people
>>>>>> for only this use case.
>>>>>
>>>>> Not sure I agree; it feels like attributing a different meaning to
>>>>> MSG_WAITALL if you use a provided buffer vs if you don't. And that to
>>>>> me would seem to be confusing. Particularly when we have multishot on
>>>>> the receive side, and this is identical, just for sends. Receives will
>>>>> keep receiving as long as there are buffers in the provided group to
>>>>> receive into, and sends will keep sending under the same condition.
>>>>> Either one will terminate if we run out of buffers.
>>>>>
>>>>> If you make MSG_WAITALL be that for provided buffers + send, then that
>>>>> behaves differently than MSG_WAITALL with receive, and MSG_WAITALL with
>>>>> send _without_ provided buffers. I don't think overloading an existing
>>>>> flag for this purpose is a good idea, particularly when we already have
>>>>> the existing semantics for multishot on the receive side.
>>>>
>>>> I'm actually with Dylan on that and wonder where the perf win could
>>>> come from. Let's assume TCP: sends are usually completed in the same
>>>> syscall, otherwise your pacing is just bad. Thrift, for example,
>>>> collects sends and packs them into one multi-iov request during a loop
>>>> iteration. If the request completes immediately then userspace just
>>>> wouldn't have time to push more buffers, by definition (assuming single
>>>> threading).
>>>
>>> The problem only occurs when they don't complete inline, and now you get
>>> reordering. The application could of course attempt to do proper pacing
>>> and see if it can avoid that condition. If not, it now needs to
>>> serialize sends.
>>
>> Ok, I admit that there are more than valid cases where artificial pacing
>> is not an option, which is why I also laid out the polling case. Let's
>> also say that it limits the potential perf wins to streaming and very
>> large transfers (like files), not "lots of relatively small
>> request-response" kinds of apps.
>
> I don't think that's true - if you're doing large streaming, you're more
> likely to keep the socket buffer full, whereas for smallish sends, it's
> less likely to be full. Testing with the silly proxy confirms that.

I don't see any contradiction with what I said. With streaming/large
sends it's more likely to be polled. For small sends and
send-receive-send-... patterns the socket queue is unlikely to be full,
in which case the send is processed inline, and so the feature doesn't
add performance, as you agreed a couple of emails back.

> And outside of cases where pacing just isn't feasible, it's extra
> overhead for cases where you potentially could or what.

I lost it, what overhead?

> To me, the main appeal of this is the simplicity.

I'd argue it doesn't seem any simpler than the alternative.

>>> Using provided buffers makes this very easy, as you don't need to care
>>> about it at all, and it eliminates complexity in the application dealing
>>> with this.
>>
>> If I'm correct, the example also serialises sends(?). I don't think it's
>> that much simpler. You batch, you send. Same with this, but you batch
>> into a provided buffer and the send is conditional.
>
> Do you mean the proxy example? Just want to be sure we're talking about
> the same thing.

Yes, proxy, the one you referenced in the cover letter. And FWIW, I don't
think it's a fair comparison without batching followed by multi-iov.
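To spell out what I mean by batching: something like the rough liburing
sketch below, where buffers are collected into an iovec over a loop
iteration and flushed as a single sendmsg, with at most one send in flight
per socket. It's not code from the proxy, and struct send_batch with its
fields is made up for illustration.

#include <liburing.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>

#define MAX_BATCH	64

/* Per-socket backlog collected in userspace; the names here are made up
 * for the example, this is not from the proxy. */
struct send_batch {
	struct iovec iov[MAX_BATCH];
	struct msghdr msg;
	unsigned nr;
	bool inflight;
};

/* Flush whatever has been collected as one multi-iov sendmsg. Only one
 * send is ever in flight per socket, which is the serialisation being
 * discussed. The iov entries must stay untouched until the CQE arrives. */
static void flush_batch(struct io_uring *ring, int fd, struct send_batch *b)
{
	struct io_uring_sqe *sqe;

	if (!b->nr || b->inflight)
		return;

	memset(&b->msg, 0, sizeof(b->msg));
	b->msg.msg_iov = b->iov;
	b->msg.msg_iovlen = b->nr;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_sendmsg(sqe, fd, &b->msg, MSG_WAITALL);
	io_uring_sqe_set_data(sqe, b);
	b->inflight = true;

	io_uring_submit(ring);
}

/* On the sendmsg CQE: mark the batch free so the next flush can go out. */
static void batch_done(struct send_batch *b)
{
	b->nr = 0;
	b->inflight = false;
}

Ordering is then maintained by construction, which is the same thing the
provided buffer approach gives you, just done in userspace.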
> Yes it has to serialize sends, because otherwise we can run into the
> condition described in the patch that adds provided buffer support for
> send. But I did bench multishot separately from there, here's some of it:
>
> 10G network, 3 hosts, 1 acting as a mirror proxy shuffling N-byte packets.
> Send ring and send multishot not used:
>
> Pkt sz | Send ring | mshot |  usec  |   QPS  |    Bw
> =====================================================
>  1000  |    No     |  No   |   437  |  1.22M |  9598M
>    32  |    No     |  No   |  5856  |  2.87M |   734M
>
> Same test, now turn on send ring:
>
> Pkt sz | Send ring | mshot |  usec  |   QPS  |    Bw   |  Diff
> ===============================================================
>  1000  |    Yes    |  No   |   436  |  1.23M |  9620M  | + 0.2%
>    32  |    Yes    |  No   |  3462  |  4.85M |  1237M  | +68.5%
>
> Same test, now turn on send mshot as well:
>
> Pkt sz | Send ring | mshot |  usec  |   QPS  |    Bw   |  Diff
> ===============================================================
>  1000  |    Yes    |  Yes  |   436  |  1.23M |  9620M  | + 0.2%
>    32  |    Yes    |  Yes  |  3125  |  5.37M |  1374M  | +87.2%
>
> which does show that there's another win on top for just queueing these
> sends and doing a single send to handle them, rather than needing to
> prepare a send for each buffer. Part of that may be that you simply run
> out of SQEs and then have to submit regardless of where you are at.

How many sockets did you test with? It's 1 SQE per socket max.

+87% sounds like a huge difference, and I don't understand where it comes
from, hence the question.

>> Another downside is that you need a provided queue per socket, which
>> sounds pretty expensive for apps with 100s if not 1000s of sockets.
>
> That's certainly true. But either you need a backlog per socket anyway in
> the app, or you only send single buffers anyway (in a typical
> request/response kind of fashion) between receives and you don't need it
> at all.

That's pinning pages and mapping them, which surely is not bad, but with
everything else equal malloc()/stack alloc is much nicer in terms of
resources. (Not talking about CPU setup overhead.)
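For scale, the per-socket setup being discussed is roughly the below,
repeated for every connection (a sketch assuming liburing's
io_uring_setup_buf_ring() helper; the buffer count and size are arbitrary,
and bgid is whatever per-socket group id the app picks):

#include <liburing.h>
#include <stdlib.h>

#define BUFS_PER_SOCK	256
#define BUF_SIZE	4096

/* One provided buffer ring per socket. The ring itself is registered
 * with the kernel and kept mapped/pinned for the life of the socket -
 * that's the setup cost compared to a plain malloc()/stack buffer
 * handed to a normal send. (Error handling trimmed.) */
static struct io_uring_buf_ring *setup_sock_bufs(struct io_uring *ring,
						 int bgid, void **backing)
{
	struct io_uring_buf_ring *br;
	unsigned char *mem;
	int i, err;

	br = io_uring_setup_buf_ring(ring, BUFS_PER_SOCK, bgid, 0, &err);
	if (!br)
		return NULL;

	mem = malloc((size_t)BUFS_PER_SOCK * BUF_SIZE);
	if (!mem)
		return NULL;

	/* Feed the group with BUFS_PER_SOCK equally sized buffers */
	for (i = 0; i < BUFS_PER_SOCK; i++)
		io_uring_buf_ring_add(br, mem + i * BUF_SIZE, BUF_SIZE, i,
				      io_uring_buf_ring_mask(BUFS_PER_SOCK), i);
	io_uring_buf_ring_advance(br, BUFS_PER_SOCK);

	*backing = mem;
	return br;
}

Multiply that by hundreds or thousands of connections and it's a fair
amount of registered memory and setup work, which is the comparison I'm
making.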
>>>> If you actually need to poll tx, you send a request and collect data
>>>> into an iov in userspace in the background. When the request completes
>>>> you send all that in a batch. You can probably find a niche example
>>>> where batch=1 in this case, but I don't think anyone would care.
>>>>
>>>> The example doesn't use multi-iov, and also still has to serialise
>>>> requests, which naturally serialises buffer consumption w/o provided
>>>> bufs.
>>>
>>> IMHO there's no reason NOT to have both a send with provided buffers and
>>> a multishot send. The alternative would be to have send-N, where you
>>> pass in N. But I don't see much point to that over "just drain the whole
>>> pending list". The obvious use case is definitely send multishot, but
>>
>> Not sure I follow, but in all the cases I was contemplating you send
>> everything you have at the moment.
>>
>>> what would the reasoning be to prohibit pacing by explicitly disallowing
>>> only doing a single buffer (or a partial queue)? As mentioned earlier, I
>>> like keeping the symmetry with the receive side for multishot, and not
>>> making it any different unless there's a reason to.
>>
>> They are different: buffer content is kernel (rx) vs userspace (tx)
>> provided, and the provided queue/group is per socket vs shared, plus the
>> wake-ups for multishot as per below. It's not like it's a one line
>> change, so IMHO it needs to bring some benefit.
>
> Are you talking about provided buffers, or multishot specifically?

I assumed that either of them would retry until the queue is exhausted;
at least that sounds more efficient and is what all the comments assume.

> I think both are standalone and pretty much as simple as they can be. And
> if the argument is "just have send with provided buffers be multishot by
> default",

It's not; rx and tx are different, e.g. true tx multishot doesn't seem to
be possible because of that.

> then that single patch is basically the two patches combined. There's no
> simplification there. Outside of a strong argument for why it would never
> make sense to do a single shot send with provided buffers, I really don't
> want to combine them into one single action.

In the current form it does make more sense to have multishot be optional.
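Just to make sure we're comparing the same things, the userspace
difference I picture between the two modes is roughly the sketch below:
io_uring_prep_send() with IOSQE_BUFFER_SELECT either way, plus one extra
flag and re-arming on a missing IORING_CQE_F_MORE for multishot. The
flag's name and placement (I put it in sqe->ioprio by analogy with
IORING_RECV_MULTISHOT) are assumptions about this series, not the final
API, and I'm assuming addr/len are taken from the picked buffer as on the
recv side.

#include <liburing.h>

/* Arm one send picking from provided buffer group bgid. With buffer
 * select the addr/len are assumed to come from the chosen buffer, so
 * NULL/0 is passed here. 'mshot_flag' stands in for whatever multishot
 * bit this series defines (placement in sqe->ioprio is assumed, by
 * analogy with IORING_RECV_MULTISHOT); pass 0 for a single-shot send. */
static void arm_send(struct io_uring *ring, int fd, int bgid,
		     unsigned short mshot_flag)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_send(sqe, fd, NULL, 0, 0);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = bgid;
	sqe->ioprio |= mshot_flag;
	io_uring_sqe_set_data64(sqe, (unsigned long long)fd);
}

/* Completion side: one CQE per sent buffer while IORING_CQE_F_MORE is
 * set; once it's clear the request has terminated (e.g. the group ran
 * dry) and has to be re-armed by the application. */
static void handle_send_cqe(struct io_uring *ring, struct io_uring_cqe *cqe,
			    int bgid, unsigned short mshot_flag)
{
	int fd = (int)io_uring_cqe_get_data64(cqe);

	if (!(cqe->flags & IORING_CQE_F_MORE))
		arm_send(ring, fd, bgid, mshot_flag);
}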
>>>>>> You do make a good point about MSG_WAITALL though - multishot send
>>>>>> doesn't really make sense to me without MSG_WAITALL semantics. I
>>>>>> cannot imagine a useful use case where the first buffer being
>>>>>> partially sent will still want the second buffer sent.
>>>>>
>>>>> Right, and I need to tweak that. Maybe we require MSG_WAITALL, or we
>>>>> make it implied for multishot send. Currently the code doesn't deal
>>>>> with that.
>>>>>
>>>>> Maybe if MSG_WAITALL isn't set and we get a short send we don't set
>>>>> CQE_F_MORE and we just stop. If it is set, then we go through the
>>>>> usual retry logic. That would make it identical to a MSG_WAITALL send
>>>>> without multishot, which again is something I like in that we don't
>>>>> have different behaviors depending on which mode we are using.
>>>>>
>>>>>>>> I actually could imagine it being useful for the previous patches'
>>>>>>>> use case of queuing up sends and keeping ordering, and I think the
>>>>>>>> API is more obvious (rather than the second CQE sending the first
>>>>>>>> CQE's data). So maybe it's worth only keeping one approach?
>>>>>>>
>>>>>>> And here you totally lost me :-)
>>>>>>
>>>>>> I am suggesting here that you don't really need to support buffer
>>>>>> lists on send without multishot.
>>>>>
>>>>> That is certainly true, but I also don't see a reason _not_ to support
>>>>> it. Again mostly because this is how receive and everything else
>>>>> works. The app is free to issue a single SQE for send without
>>>>> multishot, and pick the first buffer and send it.
>>>>
>>>> Multishot sounds interesting, but I don't see it being much use if it
>>>> terminates when there are no buffers. Otherwise, if it continues to
>>>> sit there, someone would have to wake it up.
>>>
>>> I did think about the termination case, and the problem is that if there
>>> are no buffers, you need it to wake when there are buffers. And at that
>>> point you may as well just do another send, as you need the application
>>> to trigger it. The alternative would be to invent a way to trigger that
>>> wakeup, which would be send only and weird just because of that.
>>
>> Yeah, that's the point, wake-ups would be userspace driven, and how to do
>> that without heavy stuff like syscalls is not so clear.
>
> It's just not possible without e.g. polling, either directly or using
> some monitor/mwait arch-specific thing, which would be awful. Or by doing
> some manual wakeup, which would need to look up and kick the request,
> which I bet would be worse than just re-arming the send multishot.

Right.

> If you could poll-trigger it somehow, it also further complicates things
> as now it could potentially happen at any time. As it stands, the app
> knows when a poll multishot is armed (and submitted, or not), and can
> serialize with the outgoing buffer queue trivially.

-- 
Pavel Begunkov