From: Jens Axboe <[email protected]>
To: Victor Stewart <[email protected]>
Cc: [email protected]
Subject: Re: [PATCH 10/10] io_uring: add support for ring mapped supplied buffers
Date: Fri, 29 Apr 2022 08:57:51 -0600	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <CAM1kxwhyPpZFQ2ZEhWGdENz6Bw6a0QN-NWMkmAuYjVxDDHP_Aw@mail.gmail.com>

On 4/29/22 7:21 AM, Victor Stewart wrote:
> top posting because this is a tangential but related comment.
> 
> the way i manage memory in my network server is by initializing with a
> fixed maximum number of supported clients, and then mmap an enormous
> contiguous buffer of something like (100MB + 100MB) * nMaxClients, and
> then for each client assign a fixed 100MB range for receive and
> another for send.
> 
> then with transparent huge pages disabled, only the pages with bytes
> in them are ever resident, memset-ing bytes to 0 as they're consumed
> by the send or receive paths.
> 
> so this provides a perfectly optimal deterministic memory
> architecture, which makes client memory management effortless, while
> costing nothing, without the hassle of recycling buffers or worrying
> about what range to recv into or write into.
> 
> but i know that registered buffers as is have some restriction on
> maximum number of bytes one can register (i forget exactly).

You can have 64K groups, and 64K buffers in each group. Each buffer can
be up to INT_MAX bytes in size.

> so maybe there's some way in the future to accommodate this scheme as
> well, which i believe is optimal out of all options.

As you noted, this patch doesn't change how provided buffers work, it
merely changes the mechanism with which they can be provided and
consumed to be more efficient.

One idea that we have entertained internally is to allow incremental
consumption of a buffer. Let's assume your setup; I'll exclude send, as
it isn't relevant for this discussion. That means you have 100MB of
receive buffer space per client. Each client would have a buffer group
ID associated with it for its receive buffers. If you know what size
your receives will be, then you'd provide your 100MB in chunks of that
size. Each receive would pick a chunk, recv data into it, then post a
completion that holds information on which buffer was picked for it.
When the client is done with the data, it puts the buffer back into the
provided pool, as sketched below.
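
In liburing terms, that classic cycle looks something like this
(untested sketch, reusing the illustrative BGID/CHUNK names from above;
client_fd and base are assumed):

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

io_uring_prep_recv(sqe, client_fd, NULL, CHUNK, 0);
sqe->flags |= IOSQE_BUFFER_SELECT;	/* pick any free buffer... */
sqe->buf_group = BGID;			/* ...from this group */
io_uring_submit(&ring);

struct io_uring_cqe *cqe;

io_uring_wait_cqe(&ring, &cqe);
if (cqe->res > 0 && (cqe->flags & IORING_CQE_F_BUFFER)) {
	int bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;

	/* consume cqe->res bytes at base + bid * CHUNK, then put the
	 * chunk back into the provided pool */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_provide_buffers(sqe, (char *)base + (size_t)bid * CHUNK,
				      CHUNK, 1, BGID, bid);
	io_uring_submit(&ring);
}
io_uring_cqe_seen(&ring, cqe);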

If you have wildly different receive sizes, and no idea how much data
you'll get at any point in time, this scheme doesn't work so well: you
then have to either do multiple receive requests to get all the data, or
size your chunks such that any receive will fit. Obviously that can be
wasteful, as you end up with fewer available chunks (eg sizing every
chunk for a worst-case 64KB receive leaves the 100MB pool with just 1600
buffers), and maybe you need to throw more than 100MB at it at that
point.

If we allowed incremental consumption, you could provide your 100MB as
just one chunk. When a recv request is posted for eg 1500 bytes, you'd
simply chop 1500 off the front of that buffer and use it. You're now
left with a single chunk that's 100MB-1500B in size.

One complication here is that we don't have enough room in the CQE to
tell the app where we consumed from. Hence we'd need to ensure that the
kernel and application agree on where data is consumed from for any
given receive. Given full ordering of completions wrt data receive, this
isn't impossible, but it does seem a bit fragile to me.
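
To make that concrete, the app-side bookkeeping would have to look
something like the below. This is purely hypothetical, nothing here is
an existing API, and it's only correct if completions are processed
strictly in the order the data was received:

#include <liburing.h>
#include <stddef.h>

/* app mirrors the kernel's consumption point in the single big chunk */
struct incr_buf {
	unsigned char *base;	/* start of the provided 100MB region */
	size_t off;		/* next consumption point, per agreement */
};

static void *incr_recv_data(struct incr_buf *ib,
			    const struct io_uring_cqe *cqe)
{
	void *data = ib->base + ib->off;	/* where this recv landed */

	ib->off += cqe->res;	/* chop cqe->res bytes off the front */
	return data;
}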

We do have pending patches that allow for bigger CQEs, with the initial
use case being the passthrough support for eg NVMe. Those give you two
extra u64 fields in any CQE, if you configure your ring to use big CQEs.
With that in place, we could do incremental consumption and just have
the recv completion be:

cqe = {
	.user_data	/* whatever app set user_data to for recv */
	.res		/* bytes received */
	.flags		/* IORING_CQE_F_INC_BUFFER */
	.extra1		/* start address of where data landed */
	.extra2		/* still unused */
}

and the client now knows that data was received at the address in
.extra1, with .res bytes in length. This would not support vectored
recv, but that seems like a minor thing as you can just use big buffers.
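
With liburing, consuming that would look roughly like the below. The
setup flag for big CQEs is IORING_SETUP_CQE32 from the pending patches;
IORING_CQE_F_INC_BUFFER is just the flag proposed above, and its value
here is made up:

#include <liburing.h>
#include <stdint.h>

/* proposed flag, does not exist in the kernel - value is illustrative */
#define IORING_CQE_F_INC_BUFFER	(1U << 2)

static void handle_incr_recv(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;

	/* ring must have been created with IORING_SETUP_CQE32 */
	io_uring_wait_cqe(ring, &cqe);
	if (cqe->res > 0 && (cqe->flags & IORING_CQE_F_INC_BUFFER)) {
		/* big CQEs expose the two extra u64 fields as big_cqe[] */
		void *data = (void *)(uintptr_t)cqe->big_cqe[0]; /* .extra1 */
		int len = cqe->res;	/* bytes received */

		/* process len bytes at data */
	}
	io_uring_cqe_seen(ring, cqe);
}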

This does suffer from the fragmentation issue again. Your case probably
does not, as you have a group per client, but other use cases might have
shared groups.

That was a long-winded way of saying "yes, this patch doesn't
fundamentally change how provided buffers work, it just makes them more
efficient to use and allows easy re-provide options that previously made
provided buffers too slow to use for some use cases".

I welcome feedback! It's not entirely clear to me what your suggestion
is; it looks more like you're describing your use case and soliciting
ideas on how provided buffers could work better for that?

-- 
Jens Axboe


