From: Pavel Begunkov <[email protected]>
To: Stefan Metzmacher <[email protected]>, [email protected]
Cc: Jens Axboe <[email protected]>, Dylan Yudaken <[email protected]>
Subject: Re: [RFC 2/2] io_uring/net: allow to override notification tag
Date: Fri, 19 Aug 2022 12:42:28 +0100 [thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
On 8/18/22 19:13, Stefan Metzmacher wrote:
> Am 17.08.22 um 14:42 schrieb Pavel Begunkov:
>> On 8/16/22 09:23, Stefan Metzmacher wrote:
>>> Am 16.08.22 um 09:42 schrieb Pavel Begunkov:
>>>> Considering the limited number of slots, some users struggle with
>>>> registration-time notification tag assignment, as it's hard to manage
>>>> notifications using sequence numbers. Add a simple feature that copies
>>>> sqe->user_data of a send(+flush) request into the notification CQE it
>>>> flushes (and only when it flushes one).
>>>>
>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>> ---
>>>> include/uapi/linux/io_uring.h | 4 ++++
>>>> io_uring/net.c | 6 +++++-
>>>> 2 files changed, 9 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>> index 20368394870e..91e7944c9c78 100644
>>>> --- a/include/uapi/linux/io_uring.h
>>>> +++ b/include/uapi/linux/io_uring.h
>>>> @@ -280,11 +280,15 @@ enum io_uring_op {
>>>> *
>>>> * IORING_RECVSEND_NOTIF_FLUSH Flush a notification after a successful
>>>> * send. Only for zerocopy sends.
>>>> + *
>>>> + * IORING_RECVSEND_NOTIF_COPY_TAG Copy request's user_data into the notification
>>>> + * completion even if it's flushed.
>>>> */
>>>> #define IORING_RECVSEND_POLL_FIRST (1U << 0)
>>>> #define IORING_RECV_MULTISHOT (1U << 1)
>>>> #define IORING_RECVSEND_FIXED_BUF (1U << 2)
>>>> #define IORING_RECVSEND_NOTIF_FLUSH (1U << 3)
>>>> +#define IORING_RECVSEND_NOTIF_COPY_TAG (1U << 4)
>>>> /* cqe->res mask for extracting the notification sequence number */
>>>> #define IORING_NOTIF_SEQ_MASK 0xFFFFU
>>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>>> index bd3fad9536ef..4d271a269979 100644
>>>> --- a/io_uring/net.c
>>>> +++ b/io_uring/net.c
>>>> @@ -858,7 +858,9 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>> zc->flags = READ_ONCE(sqe->ioprio);
>>>> if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
>>>> - IORING_RECVSEND_FIXED_BUF | IORING_RECVSEND_NOTIF_FLUSH))
>>>> + IORING_RECVSEND_FIXED_BUF |
>>>> + IORING_RECVSEND_NOTIF_FLUSH |
>>>> + IORING_RECVSEND_NOTIF_COPY_TAG))
>>>> return -EINVAL;
>>>> if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
>>>> unsigned idx = READ_ONCE(sqe->buf_index);
>>>> @@ -1024,6 +1026,8 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
>>>> if (ret == -ERESTARTSYS)
>>>> ret = -EINTR;
>>>> } else if (zc->flags & IORING_RECVSEND_NOTIF_FLUSH) {
>>>> + if (zc->flags & IORING_RECVSEND_NOTIF_COPY_TAG)
>>>> + notif->cqe.user_data = req->cqe.user_data;
>>>> io_notif_slot_flush_submit(notif_slot, 0);
>>>> }
>>>
>>> This would work but it seems to be confusing.
>>>
>>> Can't we have a slot-less mode, with slot_idx==U16_MAX,
>>> where we always allocate a new notif for each request?
>>> This would then get the same user_data and would be referenced on the
>>> request in order to reuse the same notif on an async retry after a short send.
>>
>> Ok, retries may make slot management much harder, let me think
>
> With retries it would be much saner to use the same
> notif for the whole request. So keeping it referenced
> as zc->notif might be a way, maybe doing that in the _prep
> function in order to do it just once; then we don't need
> zc->slot_idx anymore.
Even though it's possible atm with some userspace consideration,
it definitely should be patched up.
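Roughly something like this, as an untested sketch (zc->notif and the
two-argument io_alloc_notif() below are made up, following the signature
change you suggest further down):

static int io_sendzc_prep_notif(struct io_kiocb *req, struct io_sendzc *zc)
{
	/* one notification per request, inheriting the request's user_data */
	zc->notif = io_alloc_notif(req->ctx, req->cqe.user_data, 0);
	if (!zc->notif)
		return -ENOMEM;
	/* an async retry after a short send keeps reusing the same notif */
	return 0;
}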
>>> And this notif will always be flushed at the end of the request.
>>>
>>> This:
>>>
>>> struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
>>> struct io_notif_slot *slot)
>>>
>>> would change to:
>>>
>>> struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
>>> __u64 cqe_user_data,
>>> __s32 cqe_res)
>>>
>>>
>>> and:
>>>
>>> void io_notif_slot_flush(struct io_notif_slot *slot) __must_hold(&ctx->uring_lock)
>>>
>>> (__must_hold looks wrong there...)
>>
>> Nope, it should be there
>
> Shouldn't it be something like
> __must_hold(&slot->notif->ctx->uring_lock)
>
> There is no 'ctx' argument.
Ah, in this sense, agree
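i.e. something like this (just the prototype, not compile-tested):

void io_notif_slot_flush(struct io_notif_slot *slot)
	__must_hold(&slot->notif->ctx->uring_lock);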
>>> What do you think? It would remove the whole notif slot complexity
>>> from callers using IORING_RECVSEND_NOTIF_FLUSH for every request anyway.
>>
>> The downside is that requests then should be pretty large or it'll
>> lose out on performance. Surely not a problem for 8MB per request, but
>> even 4KB won't suffice. And users may want to put smaller chunks
>> on the wire instead of waiting for more data, to let TCP handle
>> pacing and potentially improve latencies by sending earlier.
>
> If this is optional applications can decide what fits better.
>
>> On the other hand, that one-notification-per-request idea mentioned
>> before can be extended to 1-2 CQEs per request, which is interestingly
>> the approach zc send discussions started with.
>
> In order to make use of any of this I need some way
> to get 2 CQEs with user_data being the same or related.
The idea described above will (mostly) post 2 CQEs per request,
as you want, with an optional way to have only 1 CQE. My current
sentiment is to kill all the slot business, leave this 1-2 CQE
per request and see if there are users for whom it won't be
enough. It's anyway just a slight deviation from what I wanted
to push as a complementary interface.
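From the userspace side it would look roughly like this (sketch only;
is_notif_cqe() is a stand-in for however notification CQEs end up being
told apart from the send CQE, which is still to be decided):

struct io_uring_cqe *cqe;

while (io_uring_peek_cqe(&ring, &cqe) == 0) {
	struct my_send *s = (struct my_send *)(uintptr_t)cqe->user_data;

	if (is_notif_cqe(cqe))		/* hypothetical check */
		s->buf_reusable = true;	/* data is out of the buffer */
	else
		s->sent_res = cqe->res;	/* result of the send itself */
	io_uring_cqe_seen(&ring, cqe);
}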
> The only benefit with slots is being able to avoid or
> batch additional CQEs, correct? Or is there more to it?
CQE batching is a lesser problem; I'm more concerned about how
it interacts with the network. In short, it'll hugely underperform
with TCP if requests are not large enough.
A simple bench with some hacks, over localhost TCP, run by:
./msg_zerocopy -6 -r tcp -s <size> &
./io_uring_zerocopy_tx -6 -D "::1" -s <size> -m <0,2> tcp
non-zerocopy:
4000B: tx=8711880 (MB=33233), tx/s=1742376 (MB/s=6646)
16000B: tx=3196528 (MB=48775), tx/s=639305 (MB/s=9755)
60000B: tx=1036536 (MB=59311), tx/s=207307 (MB/s=11862)
zerocopy:
4000B: tx=3003488 (MB=11457), tx/s=600697 (MB/s=2291)
16000B: tx=2940296 (MB=44865), tx/s=588059 (MB/s=8973)
60000B: tx=2621792 (MB=150020), tx/s=524358 (MB/s=30004)
Reusing notifications with slots will change the picture.
And this has nothing to do with io_uring overhead like
CQE posting and so on.
--
Pavel Begunkov
Thread overview: 14+ messages
2022-08-16 7:41 [RFC 0/2] io_uring zc notification tag override Pavel Begunkov
2022-08-16 7:42 ` [RFC 1/2] io_uring/notif: change notif CQE uapi format Pavel Begunkov
2022-08-16 8:14 ` Stefan Metzmacher
2022-08-16 7:42 ` [RFC 2/2] io_uring/net: allow to override notification tag Pavel Begunkov
2022-08-16 8:23 ` Stefan Metzmacher
2022-08-17 12:42 ` Pavel Begunkov
2022-08-18 18:13 ` Stefan Metzmacher
2022-08-19 11:42 ` Pavel Begunkov [this message]
2022-08-19 12:36 ` Stefan Metzmacher
2022-08-22 11:49 ` Pavel Begunkov
2022-08-16 8:37 ` Dylan Yudaken
2022-08-17 10:48 ` Pavel Begunkov
2022-08-17 12:04 ` Dylan Yudaken
2022-08-17 12:44 ` Pavel Begunkov