From: Pavel Begunkov <[email protected]>
To: Jens Axboe <[email protected]>, Dylan Yudaken <[email protected]>
Cc: [email protected]
Subject: Re: [PATCH 6/8] io_uring/net: support multishot for send
Date: Mon, 26 Feb 2024 19:21:38 +0000	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

On 2/26/24 19:11, Jens Axboe wrote:
> On 2/26/24 8:41 AM, Pavel Begunkov wrote:
>> On 2/26/24 15:16, Jens Axboe wrote:
>>> On 2/26/24 7:36 AM, Pavel Begunkov wrote:
>>>> On 2/26/24 14:27, Jens Axboe wrote:
>>>>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>>>>> On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe <[email protected]> wrote:
>>>>>>>
>>>>>>> On 2/26/24 3:47 AM, Dylan Yudaken wrote:
>>>>>>>> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> This works very much like the receive side, except for sends. The idea
>>>>>>>>> is that an application can fill outgoing buffers in a provided buffer
>>>>>>>>> group, and then arm a single send that will service them all. For now
>>>>>>>>> this variant just terminates when we are out of buffers to send, and
>>>>>>>>> hence the application needs to re-arm it if IORING_CQE_F_MORE isn't
>>>>>>>>> set, as per usual for multishot requests.
>>>>>>>>>
>>>>>>>>
>>>>>>>> This feels to me a lot like just using OP_SEND with MSG_WAITALL as
>>>>>>>> described, unless I'm missing something?
>>>>>>>
>>>>>>> How so? MSG_WAITALL is "send X amount of data, and if it's a short send,
>>>>>>> try again" where multishot is "send data from this buffer group, and
>>>>>>> keep sending data until it's empty". Hence it's the mirror of multishot
>>>>>>> on the receive side. Unless I'm misunderstanding you somehow, not sure
>>>>>>> it'd be smart to add special meaning to MSG_WAITALL with provided
>>>>>>> buffers.
>>>>>>>
>>>>>>
>>>>>> _If_ you have the data upfront these are very similar, and only differ
>>>>>> in that the multishot approach will give you more granular progress
>>>>>> updates. My point was that this might not be a valuable API to people
>>>>>> for only this use case.
>>>>>
>>>>> Not sure I agree, it feels like attributing a different meaning to
>>>>> MSG_WAITALL if you use a provided buffer vs if you don't. And that to me
>>>>> would seem to be confusing. Particularly when we have multishot on the
>>>>> receive side, and this is identical, just for sends. Receives will keep
>>>>> receiving as long as there are buffers in the provided group to receive
>>>>> into, and sends will keep sending for the same condition. Either one
>>>>> will terminate if we run out of buffers.
>>>>>
>>>>> If you make MSG_WAITALL be that for provided buffers + send, then that
>>>>> behaves differently than MSG_WAITALL with receive, and MSG_WAITALL with
>>>>> send _without_ provided buffers. I don't think overloading an existing
>>>>> flag for this purpose is a good idea, particularly when we already have
>>>>> the existing semantics for multishot on the receive side.
>>>>
>>>> I'm actually with Dylan on that and wonder where the perf win
>>>> could come from. Let's assume TCP, sends are usually completed
>>>> in the same syscall, otherwise your pacing is just bad. Thrift,
>>>> for example, collects sends and packs into one multi iov request
>>>> during a loop iteration. If the req completes immediately then
>>>> the userspace just wouldn't have time to push more buffers by
>>>> definition (assuming single threading).
>>>
>>> The problem only occurs when they don't complete inline, and now you get
>>> reordering. The application could of course attempt to do proper pacing
>>> and see if it can avoid that condition. If not, it now needs to
>>
>> Ok, I admit that there are more than enough valid cases where
>> artificial pacing is not an option, which is why I also laid out the
>> polling case. Let's also say that this limits the potential perf wins
>> to streaming and very large transfers (like files), not "lots of
>> relatively small request-response" kinds of apps.
> 
> I don't think that's true - if you're doing large streaming, you're more
> likely to keep the socket buffer full, whereas for smallish sends, it's
> less likely to be full. Testing with the silly proxy confirms that. And

I don't see any contradiction with what I said. With streaming/large
sends the request is more likely to go through poll. For small sends
and send-receive-send-... patterns the socket queue is unlikely to be
full, in which case the send is processed inline, and so the feature
doesn't add performance, as you agreed a couple of emails back.

> outside of cases where pacing just isn't feasible, it's extra overhead
> for cases where you potentially could or what.

I'm lost here, what overhead?

> To me, the main appeal of this is the simplicity.

I'd argue it doesn't seem any simpler than the alternative.

>>> serialize sends. Using provided buffers makes this very easy, as you
>>> don't need to care about it at all, and it eliminates complexity in the
>>> application dealing with this.
>>
>> If I'm correct, the example also serialises sends(?). I don't
>> think it's that much simpler. You batch, you send. Same with this,
>> but you batch into a provided buffer and the send is conditional.
> 
> Do you mean the proxy example? Just want to be sure we're talking about

Yes, the proxy, the one you referenced in the cover letter. And FWIW,
I don't think it's a fair comparison without batching followed by a
multi-iov send.
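
To make it concrete, the kind of batching I mean looks roughly like the
sketch below (liburing names; error handling and short-send handling are
left out, and the buffer cap is arbitrary):

#include <liburing.h>
#include <string.h>
#include <sys/socket.h>

/* Per-socket backlog collected during one loop iteration. */
struct backlog {
	struct msghdr msg;		/* must stay alive until the CQE */
	struct iovec iov[64];		/* arbitrary cap for the sketch */
	int cnt;
};

/* Queue a buffer; the actual send is deferred until flush time. */
static void backlog_add(struct backlog *b, void *buf, size_t len)
{
	b->iov[b->cnt].iov_base = buf;
	b->iov[b->cnt].iov_len = len;
	b->cnt++;
}

/* One sendmsg SQE covering everything queued so far - the "batch
 * followed by multi-iov" case, no provided buffers involved.
 */
static void backlog_flush(struct io_uring *ring, int fd, struct backlog *b)
{
	struct io_uring_sqe *sqe;

	if (!b->cnt)
		return;
	memset(&b->msg, 0, sizeof(b->msg));
	b->msg.msg_iov = b->iov;
	b->msg.msg_iovlen = b->cnt;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_sendmsg(sqe, fd, &b->msg, MSG_WAITALL);
	io_uring_sqe_set_data(sqe, b);
	/* The iov/msg stay owned by the in-flight send; the app starts a
	 * fresh backlog and reuses this one only after its CQE arrives,
	 * which is what serialises the sends.
	 */
}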

> the same thing. Yes it has to serialize sends, because otherwise we can
> run into the condition described in the patch that adds provided buffer
> support for send. But I did bench multishot separately from there,
> here's some of it:
> 
> 10G network, 3 hosts, 1 acting as a mirror proxy shuffling N-byte packets.
> Send ring and send multishot not used:
> 
> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw
> =====================================================
> 1000   |    No	   |  No   |   437  | 1.22M | 9598M
> 32     |    No	   |  No   |  5856  | 2.87M |  734M
> 
> Same test, now turn on send ring:
> 
> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
> ===========================================================
> 1000   |    Yes	   |  No   |   436  | 1.23M | 9620M | + 0.2%
> 32     |    Yes	   |  No   |  3462  | 4.85M | 1237M | +68.5%
> 
> Same test, now turn on send mshot as well:
> 
> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
> ===========================================================
> 1000   |    Yes	   |  Yes  |   436  | 1.23M | 9620M | + 0.2%
> 32     |    Yes	   |  Yes  |  3125  | 5.37M | 1374M | +87.2%
> 
> which does show that there's another win on top for just queueing these
> sends and doing a single send to handle them, rather than needing to
> prepare a send for each buffer. Part of that may be that you simply run
> out of SQEs and then have to submit regardless of where you are at.

How many sockets did you test with? It's 1 SQE per socket max.

+87% sounds like a huge difference, and I don't understand where
it comes from, hence the question.

>> Another downside is that you need a provided queue per socket,
>> which sounds pretty expensive for apps with 100s if not 1000s of
>> sockets.
> 
> That's certainly true. But either you need backlog per socket anyway in
> the app, or you only send single buffers anyway (in a typical
> request/response kind of fashion) between receives and you don't need it
> at all.

That means pinning pages and mapping them, which is surely not
terrible, but with everything else equal a malloc()/stack allocation
is much nicer in terms of resources (not even talking about the CPU
setup overhead).
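
For scale, the per-socket setup I'm talking about is roughly this
(a sketch using liburing's io_uring_setup_buf_ring() helper; the sizes
are made up and error handling is trimmed):

#include <errno.h>
#include <liburing.h>
#include <stdlib.h>

#define BUFS_PER_SOCK	64		/* made-up sizing */
#define BUF_SIZE	4096

struct sock_bufs {
	struct io_uring_buf_ring *br;	/* mapped ring, one per socket/bgid */
	unsigned char *mem;		/* backing memory for the buffers */
};

/* One buffer group (bgid) per socket: the ring plus backing buffers
 * have to be set up and kept around per connection, vs just building
 * an iovec on the stack / from malloc() when a send is actually issued.
 */
static int setup_sock_bufs(struct io_uring *ring, int bgid, struct sock_bufs *sb)
{
	int err;

	sb->br = io_uring_setup_buf_ring(ring, BUFS_PER_SOCK, bgid, 0, &err);
	if (!sb->br)
		return err;
	sb->mem = malloc((size_t)BUFS_PER_SOCK * BUF_SIZE);
	if (!sb->mem)
		return -ENOMEM;
	return 0;
}

/* On the tx side the app adds a buffer to the group only once it has
 * been filled with data, so an armed send can pick it up.
 */
static void queue_tx_buf(struct sock_bufs *sb, int idx, unsigned int len)
{
	io_uring_buf_ring_add(sb->br, sb->mem + (size_t)idx * BUF_SIZE, len,
			      idx, io_uring_buf_ring_mask(BUFS_PER_SOCK), 0);
	io_uring_buf_ring_advance(sb->br, 1);
}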

>>>> If you actually need to poll tx, you send a request and collect
>>>> data into iov in userspace in background. When the request
>>>> completes you send all that in batch. You can probably find
>>>> a niche example when batch=1 in this case, but I don't think
>>>> anyone would care.
>>>>
>>>> The example doesn't use multi-iov, and also still has to
>>>> serialise requests, which naturally serialises buffer consumption
>>>> w/o provided bufs.
>>>
>>> IMHO there's no reason NOT to have both a send with provided buffers and
>>> a multishot send. The alternative would be to have send-N, where you
>>> pass in N. But I don't see much point to that over "just drain the whole
>>> pending list". The obvious use case is definitely send multishot, but
>>
>> Not sure I follow, but in all the cases I was contemplating you
>> send everything you have at the moment.
>>
>>> what would the reasoning be to prohibit pacing by explicitly disallowing
>>> only doing a single buffer (or a partial queue)? As mentioned earlier, I
>>> like keeping the symmetry with the receive side for multishot, and not
>>> make it any different unless there's a reason to.
>>
>> They are different: buffer content is kernel (rx) vs userspace (tx)
>> provided, and the provided queue / group is per socket vs shared.
>> Wake-ups for multishot are as per below. It's not like it's a
>> one-line change, so IMHO it needs to bring some benefit.
> 
> Are you talking about provided buffers, or multishot specifically? I

I assumed that either of them would retry until the queue is exhausted;
at least that sounds more efficient and is what all the comments imply.

> think both are standalone pretty much as simple as they can be. And if
> the argument is "just have send with provided buffers be multishot by
> default",

It's not; rx and tx are different, e.g. a true tx multishot doesn't
seem to be possible because of that.

> then that single patch is basically the two patches combined.
> There's no simplification there. Outside of a strong argument for why it
> would never make sense to do single shot send with provided buffers, I
> really don't want to combine them into one single action.

In its current form it does make more sense to keep multishot
optional.

>>>>>> You do make a good point about MSG_WAITALL though - multishot send
>>>>>> doesn't really make sense to me without MSG_WAITALL semantics. I
>>>>>> cannot imagine a useful use case where the first buffer being
>>>>>> partially sent will still want the second buffer sent.
>>>>>
>>>>> Right, and I need to tweak that. Maybe we require MSG_WAITALL, or we
>>>>> make it implied for multishot send. Currently the code doesn't deal with
>>>>> that.
>>>>>
>>>>> Maybe if MSG_WAITALL isn't set and we get a short send we don't set
>>>>> CQE_F_MORE and we just stop. If it is set, then we go through the usual
>>>>> retry logic. That would make it identical to MSG_WAITALL send without
>>>>> multishot, which again is something I like in that we don't have
>>>>> different behaviors depending on which mode we are using.
>>>>>
>>>>>>>> I actually could imagine it being useful for the previous patches' use
>>>>>>>> case of queuing up sends and keeping ordering,
>>>>>>>> and I think the API is more obvious (rather than the second CQE
>>>>>>>> sending the first CQE's data). So maybe it's worth only
>>>>>>>> keeping one approach?
>>>>>>>
>>>>>>> And here you totally lost me :-)
>>>>>>
>>>>>> I am suggesting here that you don't really need to support buffer
>>>>>> lists on send without multishot.
>>>>>
>>>>> That is certainly true, but I also don't see a reason _not_ to support
>>>>> it. Again mostly because this is how receive and everything else works.
>>>>> The app is free to issue a single SQE for send without multishot, and
>>>>> pick the first buffer and send it.
>>>>
>>>> Multishot sounds interesting, but I don't see it as very useful if
>>>> you terminate when there are no buffers. Otherwise, if it continues
>>>> to sit there, someone would have to wake it up.
>>>
>>> I did think about the termination case, and the problem is that if there
>>> are no buffers, you need it to wake when there are buffers. And at that
>>> point you may as well just do another send, as you need the application
>>> to trigger it. The alternative would be to invent a way to trigger that
>>> wakeup, which would be send only and weird just because of that.
>>
>> Yeah, that's the point, wake ups would be userspace driven, and how
>> to do it without heavy stuff like syscalls is not so clear.
> 
> It's just not possible without eg polling, either directly or using some
> monitor/mwait arch specific thing which would be awful. Or by doing some
> manual wakeup, which would need to lookup and kick the request, which I
> bet would be worse than just re-arming the send multishot.

Right

> If you could poll trigger it somehow, it also further complicates things
> as now it could potentially happen at any time. As it stands, the app
> knows when a poll multishot is armed (and submitted, or not), and can
> serialize with the outgoing buffer queue trivially.
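
For illustration, the app-side serialization described above could look
roughly like this (a sketch with liburing names; provided-buffer send is
assumed to mirror the recv side with IOSQE_BUFFER_SELECT and a NULL
buffer, and the multishot flag added by this series is only noted in a
comment since its final name isn't settled in this sub-thread):

#include <liburing.h>
#include <stdbool.h>

/* Track whether a multishot send is currently armed for this socket. */
static bool send_armed;

static void arm_send(struct io_uring *ring, int fd, int bgid)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	/* No buffer given: picked from the provided group at issue time,
	 * assuming provided-buffer send mirrors the recv side. MSG_WAITALL
	 * so short sends get retried instead of terminating, per the
	 * discussion above.
	 */
	io_uring_prep_send(sqe, fd, NULL, 0, MSG_WAITALL);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = bgid;
	/* plus the multishot send flag from this series; exact name and
	 * sqe field are not settled here, so it is omitted in the sketch
	 */
	send_armed = true;
}

static void handle_send_cqe(struct io_uring *ring, struct io_uring_cqe *cqe,
			    int fd, int bgid, bool have_pending_bufs)
{
	/* A CQE without IORING_CQE_F_MORE means the send terminated (the
	 * group ran dry), so nothing is armed any more and newly queued
	 * buffers won't be consumed until the app re-arms the send.
	 */
	if (!(cqe->flags & IORING_CQE_F_MORE))
		send_armed = false;

	if (!send_armed && have_pending_bufs)
		arm_send(ring, fd, bgid);
}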

-- 
Pavel Begunkov


Thread overview: 34+ messages
2024-02-25  0:35 [PATCHSET v3 0/8] Support for provided buffers for send Jens Axboe
2024-02-25  0:35 ` [PATCH 1/8] io_uring/net: unify how recvmsg and sendmsg copy in the msghdr Jens Axboe
2024-02-26 14:41   ` Pavel Begunkov
2024-02-26 15:03     ` Jens Axboe
2024-02-25  0:35 ` [PATCH 2/8] net: remove {revc,send}msg_copy_msghdr() from exports Jens Axboe
2024-02-25  0:35 ` [PATCH 3/8] io_uring/net: add provided buffer support for IORING_OP_SEND Jens Axboe
2024-02-25  0:35 ` [PATCH 4/8] io_uring/net: add provided buffer support for IORING_OP_SENDMSG Jens Axboe
2024-02-25  0:35 ` [PATCH 5/8] io_uring/kbuf: flag request if buffer pool is empty after buffer pick Jens Axboe
2024-02-25  0:35 ` [PATCH 6/8] io_uring/net: support multishot for send Jens Axboe
2024-02-26 10:47   ` Dylan Yudaken
2024-02-26 13:38     ` Jens Axboe
2024-02-26 14:02       ` Dylan Yudaken
2024-02-26 14:27         ` Jens Axboe
2024-02-26 14:36           ` Pavel Begunkov
2024-02-26 15:16             ` Jens Axboe
2024-02-26 15:41               ` Pavel Begunkov
2024-02-26 19:11                 ` Jens Axboe
2024-02-26 19:21                   ` Pavel Begunkov [this message]
2024-02-26 20:12                     ` Jens Axboe
2024-02-26 20:51                       ` Pavel Begunkov
2024-02-26 21:27                         ` Jens Axboe
2024-02-28 12:39                           ` Pavel Begunkov
2024-02-28 17:28                             ` Jens Axboe
2024-02-28 23:49                               ` Jens Axboe
2024-02-29  1:46                                 ` Jens Axboe
2024-02-29 15:42                                   ` Jens Axboe
2024-02-26 19:31           ` Dylan Yudaken
2024-02-26 19:49             ` Jens Axboe
2024-02-25  0:35 ` [PATCH 7/8] io_uring/net: support multishot for sendmsg Jens Axboe
2024-02-25  0:35 ` [PATCH 8/8] io_uring/net: set MSG_MORE if we're doing multishot send and have more Jens Axboe
2024-02-26 10:59   ` Dylan Yudaken
2024-02-26 13:42     ` Jens Axboe
2024-02-26 14:24       ` Pavel Begunkov
2024-02-26 14:52         ` Jens Axboe
