From: Mina Almasry <almasrymina@google.com>
To: Pavel Begunkov <asml.silence@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>,
	netdev@vger.kernel.org, io-uring@vger.kernel.org,
	Eric Dumazet <edumazet@google.com>,
	Willem de Bruijn <willemb@google.com>,
	Paolo Abeni <pabeni@redhat.com>,
	andrew+netdev@lunn.ch, horms@kernel.org, davem@davemloft.net,
	sdf@fomichev.me, dw@davidwei.uk, michael.chan@broadcom.com,
	dtatulea@nvidia.com, ap420073@gmail.com
Subject: Re: [RFC v1 00/22] Large rx buffer support for zcrx
Date: Mon, 28 Jul 2025 13:23:48 -0700
Message-ID: <CAHS8izMR+PsD12BA+Rq2yixKn=656V1jQhryiVZrC6z05Kq1SQ@mail.gmail.com>
In-Reply-To: <9922111a-63e6-468c-b2de-f9899e5b95cc@gmail.com>

On Mon, Jul 28, 2025 at 12:40 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 7/28/25 19:54, Mina Almasry wrote:
> > On Mon, Jul 28, 2025 at 4:03 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> This series implements large rx buffer support for io_uring/zcrx on
> >> top of Jakub's queue configuration changes, but it can also be used
> >> by other memory providers. Large rx buffers can be drastically
> >> beneficial with high-end hw-gro enabled cards that can coalesce traffic
> >> into larger pages, reducing the number of frags traversing the network
> >> stack and resulting in larger contiguous chunks of data for
> >> userspace. Benchmarks showed up to ~30% improvement in CPU util.
> >>
> >
> > Very exciting.
> >
> > I have not yet had a chance to look thoroughly, but even so I have
> > a few high-level questions/concerns. Maybe you already have answers
> > to them that can make my life a bit easier as I try to take a
> > thorough look.
> >
> > - I'm a bit confused that you're not making changes to the core net
> > stack to support non-PAGE_SIZE netmems. From a quick glance, it seems
> > that there are potentially a ton of places in the net stack that
> > assume PAGE_SIZE:
>
> The stack already supports large frags and it's not new. Page pools
> have higher-order allocations, see __page_pool_alloc_page_order. The
> tx path can allocate large pages / coalesce user pages.

Right, large-order allocations are not new, but I'm not sure they
actually work reliably. AFAICT most drivers set pp_params.order = 0;
I'm not sure how well tested multi-order pages are.

It may be reasonable to assume multi-order pages just work and see
what blows up, though.
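
For reference, opting a driver into larger chunks today looks roughly
like this; a sketch from memory, with illustrative field values:

	struct page_pool_params pp_params = {
		.order		= 2,	/* 16K buffers on a 4K-page system */
		.pool_size	= ring_size,
		.nid		= NUMA_NO_NODE,
		.dev		= &pdev->dev,
		.dma_dir	= DMA_FROM_DEVICE,
	};
	struct page_pool *pool = page_pool_create(&pp_params);

order > 0 requests take the __page_pool_alloc_page_order() path you
mention, while order = 0 drivers exercise the bulk-allocation path
instead, which is why I doubt the coverage.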

> Any specific
> place that concerns you? There are many places legitimately using
> PAGE_SIZE: kmap'ing folios, shifting it by order to get the size,
> linear allocations, etc.
>

From a 5-min look:

- skb_splice_from_iter: size_t part = min_t(size_t, PAGE_SIZE - off, len);
- skb_pp_cow_data: max_head_size = SKB_WITH_OVERHEAD(PAGE_SIZE - headroom);
- skb_seq_read: pg_sz = min_t(unsigned int, pg_sz - st->frag_off, PAGE_SIZE - pg_off);
- zerocopy_fill_skb_from_iter: int size = min_t(int, copied, PAGE_SIZE - start);

I think the `PAGE_SIZE -` logic in general assumes the memory is
PAGE_SIZEd, although these cases look page-specific, i.e. net_iovs
wouldn't be exposed to these particular call sites.
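
Concretely, they all share the same pattern, and a size-aware version
would need some per-netmem size helper; netmem_size() below is
hypothetical, it doesn't exist today:

	/* today: assumes every chunk is exactly PAGE_SIZE */
	size_t part = min_t(size_t, PAGE_SIZE - off, len);

	/* hypothetical size-aware variant */
	part = min_t(size_t, netmem_size(netmem) - off, len);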

I spent a few weeks auditing the net stack for all page accesses to
prune them when adding unreadable netmem... are you somewhat confident
there are no PAGE_SIZE assumptions in the net stack that affect
net_iovs and require a deep look? Or is the approach here to merge
this and see what breaks, if anything?

> > cd net
> > ack "PAGE_SIZE|PAGE_SHIFT" | wc -l
> > 468
> >
> > Are we sure none of these places assuming PAGE_SIZE or PAGE_SHIFT are
> > concerning?
> >
> > - You're not adding a field in the net_iov that tells us how big the
> > net_iov is. It seems to me you're configuring the driver to set the rx
> > buffer size, then assuming all the pp allocations are of that size,
> > then assuming in the zcrx code that all the net_iovs are of that size.
> > I think a few problems may happen?
> >
> > (a) what happens if the rx buffer size is re-configured? Does the
> > io_uring zcrx instance get recreated as well?
>
> Any reason you even want it to work? You can't, and frankly
> shouldn't be allowed to, at least in the case of io_uring. Unless it's
> rejected somewhere earlier, in this case it'll fail on the order
> check while trying to create a page pool with a zcrx provider.
>

I think it's reasonable to disallow rx-buffer-size reconfiguration
when the queue is memory-config bound. I can check to see what this
code is doing.
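
Something like this in the queue config validation callback is what I
have in mind; purely illustrative, not the actual code from the
series:

	static int rxq_cfg_validate(struct net_device *dev, int idx,
				    struct netlink_ext_ack *extack)
	{
		struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, idx);

		/* don't allow rx-buf-len changes on an mp-bound queue */
		if (rxq->mp_params.mp_priv) {
			NL_SET_ERR_MSG(extack,
				       "queue is bound to a memory provider");
			return -EBUSY;
		}
		return 0;
	}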

> > (b) what happens with skb coalescing? skb coalescing is already a bit
> > of a mess. We don't allow coalescing unreadable and readable skbs, but
> > we do allow coalescing devmem and io_uring zcrx skbs, which I'm
> > guessing could already lead to some bugs. AFAICT as of this patch
> > series we may allow coalescing of skbs with netmems of different
> > sizes inside of them, but so far the zcrx code assumes the size is
> > constant across all the netmems it gets, which I'm not sure is
> > always true?
>
> It rejects niovs from other providers incl. from any other io_uring
> instances, so it only assumes a uniform size for its own niovs.

Thanks. What is 'it' and where is the code that does the rejection?
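
My guess at the shape of it, with unverified field and helper names:
somewhere in the zcrx receive path each frag is checked against the
owning instance, roughly:

	niov = netmem_to_net_iov(frag->netmem);
	if (niov->pp->mp_ops != &io_uring_pp_zc_ops ||
	    io_pp_to_ifq(niov->pp) != ifq)
		return -EFAULT;	/* not ours, don't touch it */

But a pointer to the actual check would help.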

> The
> backing memory is verified to make sure it can be chunked.
>
> > For all these reasons I had assumed that we'd need space in the
> > net_iov that tells us its size: net_iov->size.
>
> Nope, not in this case.
>
> > And then netmem_size(netmem) would replace all the PAGE_SIZE
> > assumptions in the net stack, and then we'd disallow coalescing of
> > skbs with different-sized netmems (else we need to handle them
> > correctly per the netmem_size).
>
> I'm not even sure what the concern is. What's the difference b/w
> tcp_recvmsg_dmabuf() getting one skb with differently sized frags
> or same frags in separate skbs? You still need to handle it
> somehow, even if by failing.
>

Right, I just wanted to understand what the design is. I take it the
design allows netmems in the same skb to have different max frag
lens, yes?

I am guessing that it works, even in tcp_recvmsg_dmabuf. The frag len
is actually stored in frag->len, so it may already vary from frag to
frag. Even if coalescing happens, some frags would have frag->len ==
PAGE_SIZE and some > PAGE_SIZE. Seems fine to me off the bat.
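
I.e. the receive path already has to read the per-frag length rather
than assume PAGE_SIZE; a sketch of the walk I have in mind:

	struct skb_shared_info *shinfo = skb_shinfo(skb);
	unsigned int i;

	for (i = 0; i < shinfo->nr_frags; i++) {
		skb_frag_t *frag = &shinfo->frags[i];
		/* may be PAGE_SIZE for one frag and larger for the next */
		unsigned int len = skb_frag_size(frag);
		netmem_ref netmem = skb_frag_netmem(frag);

		/* hand (netmem, offset, len) to the consumer */
	}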

> Also, we should never coalesce different niovs together regardless
> of sizes. And for coalescing two chunks of the same niov, it should
> work just fine even without knowing the length.
>

Yeah, we should probably not coalesce 2 netmems together, although I
vaguely remember reading code in a net stack helper that does that
somewhere already. Whatever.

-- 
Thanks,
Mina
