From: Stefan Metzmacher <[email protected]>
To: Pavel Begunkov <[email protected]>, [email protected]
Cc: Jens Axboe <[email protected]>, Dylan Yudaken <[email protected]>
Subject: Re: [RFC 2/2] io_uring/net: allow to override notification tag
Date: Fri, 19 Aug 2022 14:36:58 +0200
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>


Am 19.08.22 um 13:42 schrieb Pavel Begunkov:
> On 8/18/22 19:13, Stefan Metzmacher wrote:
>> Am 17.08.22 um 14:42 schrieb Pavel Begunkov:
>>> On 8/16/22 09:23, Stefan Metzmacher wrote:
>>>> Am 16.08.22 um 09:42 schrieb Pavel Begunkov:
>>>>> Considering the limited amount of slots, some users struggle with
>>>>> registration-time notification tag assignment, as it's hard to manage
>>>>> notifications using sequence numbers. Add a simple feature that copies
>>>>> sqe->user_data of a send(+flush) request into the notification CQE it
>>>>> flushes (and only when it flushes it).
>>>>>
>>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>>> ---
>>>>>   include/uapi/linux/io_uring.h | 4 ++++
>>>>>   io_uring/net.c                | 6 +++++-
>>>>>   2 files changed, 9 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>>> index 20368394870e..91e7944c9c78 100644
>>>>> --- a/include/uapi/linux/io_uring.h
>>>>> +++ b/include/uapi/linux/io_uring.h
>>>>> @@ -280,11 +280,15 @@ enum io_uring_op {
>>>>>    *
>>>>>    * IORING_RECVSEND_NOTIF_FLUSH    Flush a notification after a successful
>>>>>    *                successful. Only for zerocopy sends.
>>>>> + *
>>>>> + * IORING_RECVSEND_NOTIF_COPY_TAG Copy the request's user_data into the
>>>>> + *                  notification completion when it's flushed.
>>>>>    */
>>>>>   #define IORING_RECVSEND_POLL_FIRST    (1U << 0)
>>>>>   #define IORING_RECV_MULTISHOT        (1U << 1)
>>>>>   #define IORING_RECVSEND_FIXED_BUF    (1U << 2)
>>>>>   #define IORING_RECVSEND_NOTIF_FLUSH    (1U << 3)
>>>>> +#define IORING_RECVSEND_NOTIF_COPY_TAG    (1U << 4)
>>>>>   /* cqe->res mask for extracting the notification sequence number */
>>>>>   #define IORING_NOTIF_SEQ_MASK        0xFFFFU
>>>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>>>> index bd3fad9536ef..4d271a269979 100644
>>>>> --- a/io_uring/net.c
>>>>> +++ b/io_uring/net.c
>>>>> @@ -858,7 +858,9 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>       zc->flags = READ_ONCE(sqe->ioprio);
>>>>>       if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
>>>>> -              IORING_RECVSEND_FIXED_BUF | IORING_RECVSEND_NOTIF_FLUSH))
>>>>> +              IORING_RECVSEND_FIXED_BUF |
>>>>> +              IORING_RECVSEND_NOTIF_FLUSH |
>>>>> +              IORING_RECVSEND_NOTIF_COPY_TAG))
>>>>>           return -EINVAL;
>>>>>       if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
>>>>>           unsigned idx = READ_ONCE(sqe->buf_index);
>>>>> @@ -1024,6 +1026,8 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
>>>>>           if (ret == -ERESTARTSYS)
>>>>>               ret = -EINTR;
>>>>>       } else if (zc->flags & IORING_RECVSEND_NOTIF_FLUSH) {
>>>>> +        if (zc->flags & IORING_RECVSEND_NOTIF_COPY_TAG)
>>>>> +            notif->cqe.user_data = req->cqe.user_data;
>>>>>           io_notif_slot_flush_submit(notif_slot, 0);
>>>>>       }
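
(For reference, used from userspace this would look roughly like the
untested sketch below; the opcode and sqe fields are the ones from the
current zc series if I read it right, and liburing has no helper for
this yet, so the raw sqe is poked directly:

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	io_uring_prep_send(sqe, sockfd, buf, len, 0);
	sqe->opcode = IORING_OP_SENDZC_NOTIF;	/* zerocopy send variant */
	/* the zc flags travel in sqe->ioprio, see io_sendzc_prep() */
	sqe->ioprio = IORING_RECVSEND_NOTIF_FLUSH |
		      IORING_RECVSEND_NOTIF_COPY_TAG;
	sqe->notification_idx = 0;	/* registered slot to flush */
	sqe->user_data = 0xdead;	/* copied into the notif CQE */

so the flushed notification CQE would then also carry 0xdead in
cqe->user_data.)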
>>>>
>>>> This would work but it seems to be confusing.
>>>>
>>>> Can't we have a slot-less mode, with slot_idx==U16_MAX,
>>>> where we always allocate a new notif for each request?
>>>> This would then get the same user_data and would be referenced on the
>>>> request in order to reuse the same notif on an async retry after a short send.
>>>
>>> Ok, retries may make slots managing much harder, let me think
>>
>> With retries it would be much saner to use the same
>> notif for the whole request. So keeping it referenced
>> as zc->notif might be a way, maybe doing that in the _prep
>> function in order to do it just once; then we don't need
>> zc->slot_idx anymore.
> 
> Even though it's possible atm with some userspace consideration,
> it definitely should be patched up.
> 
>>>> And this notif will always be flushed at the end of the request.
>>>>
>>>> This:
>>>>
>>>> struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
>>>>                                  struct io_notif_slot *slot)
>>>>
>>>> would change to:
>>>>
>>>> struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
>>>>                                  __u64 cqe_user_data,
>>>>                  __s32 cqe_res)
>>>>
>>>>
>>>> and:
>>>>
>>>> void io_notif_slot_flush(struct io_notif_slot *slot) __must_hold(&ctx->uring_lock)
>>>>
>>>> (__must_hold looks wrong there...)
>>>
>>> Nope, it should be there
>>
>> Shouldn't it be something like
>> __must_hold(&slot->notif->ctx->uring_lock)
>>
>> There is no 'ctx' argument.
> 
> Ah, in this sense, agree
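
I.e. just:

	void io_notif_slot_flush(struct io_notif_slot *slot)
		__must_hold(&slot->notif->ctx->uring_lock)

so that the annotation names a lock that's actually reachable from the
function's arguments.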
> 
>>>> What do you think? It would remove the whole notif slot complexity
>>>> from callers using IORING_RECVSEND_NOTIF_FLUSH for every request anyway.
>>>
>>> The downside is that requests then need to be pretty large or it'll
>>> lose in performance. Surely not a problem for 8MB per request, but
>>> even 4KB won't suffice. And users may want to put smaller chunks
>>> on the wire instead of waiting for more data, to let tcp handle
>>> pacing and potentially improve latencies by sending earlier.
>>
>> If this is optional applications can decide what fits better.
>>
>>> On the other hand, that one-notification-per-request idea mentioned
>>> before can be extended to 1-2 CQEs per request, which is interestingly
>>> the approach the zc send discussions started with.
>>
>> In order to make use of any of this I need some way
>> to get 2 CQEs with user_data being the same or related.
> 
> The idea described above will post 2 CQEs per request (mostly), as
> you want, with an optional way to have only 1 CQE. My current
> sentiment is to kill all the slot business, leave this 1-2 CQE
> per request and see if there are users for whom it won't be
> enough. It's anyway just a slight deviation from what I wanted
> to push as a complementary interface.

Ah, ok, removing the slot stuff again would be fine for me...
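
For my use case the consumption side would then look roughly like the
untested sketch below (assuming both CQEs carry the request's
user_data and the notification CQE is marked somehow; the
IORING_CQE_F_NOTIF flag name is made up here, as are the helpers):

	struct io_uring_cqe *cqe;

	while (io_uring_wait_cqe(&ring, &cqe) == 0) {
		struct send_ctx *s;

		s = (struct send_ctx *)(uintptr_t)cqe->user_data;
		if (cqe->flags & IORING_CQE_F_NOTIF) {
			/* 2nd CQE: the kernel dropped its page refs,
			 * so the buffer can be reused */
			send_buffer_release(s);
		} else {
			/* 1st CQE: the send request itself completed */
			send_done(s, cqe->res);
		}
		io_uring_cqe_seen(&ring, cqe);
	}

With the optional 1-CQE mode the notification completion would simply
be folded into the request completion.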

>> The only benefit of slots is being able to avoid or
>> batch additional CQEs, correct? Or is there more to it?
> 
> CQE batching is a lesser problem, I'm more concerned about how
> it interacts with the network. In short, it'll hugely underperform
> with TCP if requests are not large enough.
> 
> A simple bench with some hacks, localhost, TCP, run by
> 
> ./msg_zerocopy -6 -r tcp -s <size> &
> ./io_uring_zerocopy_tx -6 -D "::1" -s <size> -m <0,2> tcp
> 
> 
> non-zerocopy:
> 4000B:  tx=8711880 (MB=33233), tx/s=1742376 (MB/s=6646)
> 16000B: tx=3196528 (MB=48775), tx/s=639305 (MB/s=9755)
> 60000B: tx=1036536 (MB=59311), tx/s=207307 (MB/s=11862)
> 
> zerocopy:
> 4000B:  tx=3003488 (MB=11457), tx/s=600697 (MB/s=2291)
> 16000B: tx=2940296 (MB=44865), tx/s=588059 (MB/s=8973)
> 60000B: tx=2621792 (MB=150020), tx/s=524358 (MB/s=30004)

So with something between 16k and 60k we reach the point where
ZC starts to be faster, correct?
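
(Crudely interpolating linearly between the 16000B and 60000B points:
non-ZC gains 11862-9755 = 2107 MB/s over those 44000B while ZC gains
30004-8973 = 21031 MB/s, so the 782 MB/s gap at 16000B would close
after about another 782*44000/(21031-2107) ~= 1800 bytes, i.e. around
~18KB - though the real curves are surely not linear.)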

Did you remove the loopback restriction as described in
Documentation/networking/msg_zerocopy.rst?

Are the results similar when using ./msg_zerocopy -6 tcp -s <size>
as the client?

And is the reason some page-pinning overhead from iov_iter_get_pages2()
in __zerocopy_sg_from_iter()?

> Reusing notifications with slots will change the picture.
> And this has nothing to do with io_uring overhead like
> CQE posting and so on.

Hmm, I don't understand how the number of notif structures
would have any impact. Is it related to io_sg_from_iter()?

metze


Thread overview: 14+ messages
2022-08-16  7:41 [RFC 0/2] io_uring zc notification tag override Pavel Begunkov
2022-08-16  7:42 ` [RFC 1/2] io_uring/notif: change notif CQE uapi format Pavel Begunkov
2022-08-16  8:14   ` Stefan Metzmacher
2022-08-16  7:42 ` [RFC 2/2] io_uring/net: allow to override notification tag Pavel Begunkov
2022-08-16  8:23   ` Stefan Metzmacher
2022-08-17 12:42     ` Pavel Begunkov
2022-08-18 18:13       ` Stefan Metzmacher
2022-08-19 11:42         ` Pavel Begunkov
2022-08-19 12:36           ` Stefan Metzmacher [this message]
2022-08-22 11:49             ` Pavel Begunkov
2022-08-16  8:37   ` Dylan Yudaken
2022-08-17 10:48     ` Pavel Begunkov
2022-08-17 12:04       ` Dylan Yudaken
2022-08-17 12:44         ` Pavel Begunkov
