From: Pavel Begunkov <asml.silence@gmail.com>
To: Mina Almasry <almasrymina@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>,
netdev@vger.kernel.org, io-uring@vger.kernel.org,
Eric Dumazet <edumazet@google.com>,
Willem de Bruijn <willemb@google.com>,
Paolo Abeni <pabeni@redhat.com>,
andrew+netdev@lunn.ch, horms@kernel.org, davem@davemloft.net,
sdf@fomichev.me, dw@davidwei.uk, michael.chan@broadcom.com,
dtatulea@nvidia.com, ap420073@gmail.com
Subject: Re: [RFC v1 00/22] Large rx buffer support for zcrx
Date: Mon, 28 Jul 2025 21:57:49 +0100 [thread overview]
Message-ID: <8c90f485-192f-4f7a-ac94-8171d78f3c4a@gmail.com> (raw)
In-Reply-To: <CAHS8izMR+PsD12BA+Rq2yixKn=656V1jQhryiVZrC6z05Kq1SQ@mail.gmail.com>
On 7/28/25 21:23, Mina Almasry wrote:
...>>>
>>> - I'm a bit confused that you're not making changes to the core net
>>> stack to support non-PAGE_SIZE netmems. From a quick glance, it seems
>>> that there are potentially a ton of places in the net stack that
>>> assume PAGE_SIZE:
>>
>> The stack already supports large frags and it's not new. Page pools
>> has higher order allocations, see __page_pool_alloc_page_order. The
>> tx path can allocate large pages / coalesce user pages.
>
> Right, large order allocations are not new, but I'm not sure they
> actually work reliably. AFAICT most drivers set pp_params.order = 0;
> I'm not sure how well tested multi-order pages are.
>
> It may be reasonable to assume multi order pages just work and see
> what blows up, though.
>
>> Any specific
>> place that concerns you? There are many places legitimately using
>> PAGE_SIZE: kmap'ing folios, shifting it by order to get the size,
>> linear allocations, etc.
>>
>
> From a 5-min look:
>
> - skb_splice_from_iter, this line: size_t part = min_t(size_t,
> PAGE_SIZE - off, len);
It does it for pages that it got from
iov_iter_extract_pages() a few lines above, those are PAGE_SIZE'd.
> - skb_pp_cow_data, this line: max_head_size =
> SKB_WITH_OVERHEAD(PAGE_SIZE - headroom);
This one should be about the linear part, not frags
> - skb_seq_read, this line: pg_sz = min_t(unsigned int, pg_sz -
> st->frag_off, PAGE_SIZE - pg_off
That's kmap handling, it can iterate a frag multiple times
in PAGE_SIZE chunks for high mem archs.
> - zerocopy_fill_skb_from_iter, this line: int size = min_t(int,
> copied, PAGE_SIZE - start);
Pages from iov_iter_get_pages2(), same as with
skb_splice_from_iter()
> I think the `PAGE_SIZE -` logic in general assumes the memory is
> PAGE_SIZEd. Although in these cases it seems page specifics, i.e.
> net_iovs wouldn't be exposed to these particular call sites.
>
> I spent a few weeks acking the net stack for all page-access to prune
> all of them to add unreadable netmem... are you somewhat confident
> there are no PAGE_SIZE assumptions in the net stack that affect
> net_iovs that require a deep look? Or is the approach here to merge
The difference is that this one is already supported and the
stack is large page aware, while unreadable frags was a new
concept.
> this and see what/if breaks?
No reason for it not to work. Even if breaks somewhere on that,
it should be a pre-existent problem, which needs to be fixed
either way.
>>> cd net
>>> ackc "PAGE_SIZE|PAGE_SHIFT" | wc -l
>>> 468
>>>
>>> Are we sure none of these places assuming PAGE_SIZE or PAGE_SHIFT are
>>> concerning?
>>>
>>> - You're not adding a field in the net_iov that tells us how big the
>>> net_iov is. It seems to me you're configuring the driver to set the rx
>>> buffer size, then assuming all the pp allocations are of that size,
>>> then assuming in the rxzc code that all the net_iov are of that size.
>>> I think a few problems may happen?
>>>
>>> (a) what happens if the rx buffer size is re-configured? Does the
>>> io_uring rxrc instance get recreated as well?
>>
>> Any reason you even want it to work? You can't and frankly
>> shouldn't be allowed to, at least in case of io_uring. Unless it's
>> rejected somewhere earlier, in this case it'll fail on the order
>> check while trying to create a page pool with a zcrx provider.
>>
>
> I think it's reasonable to disallow rx-buffer-size reconfiguration
> when the queue is memory-config bound. I can check to see what this
> code is doing.
Right, it doesn't make sense to reconfigure zcrx, and we can
only fail the operation one way or another.
>>> (b) what happens with skb coalescing? skb coalescing is already a bit
>>> of a mess. We don't allow coalescing unreadable and readable skbs, but
>>> we do allow coalescing devmem and iozcrx skbs which could lead to some
>>> bugs I'm guessing already. AFAICT as of this patch series we may allow
>>> coalescing of skbs with netmems inside of them of different sizes, but
>>> AFAICT so far, the iozcrx assume the size is constant across all the
>>> netmems it gets, which I'm not sure is always true?
>>
>> It rejects niovs from other providers incl. from any other io_uring
>> instances, so it only assume a uniform size for its own niovs.
>
> Thanks. What is 'it' and where is the code that does the rejection?
zcrx does, you're familiar with this chunk:
io_uring/zcrx.c:
io_zcrx_recv_frag() {
if (niov->pp->mp_ops != &io_uring_pp_zc_ops ||
io_pp_to_ifq(niov->pp) != ifq)
return -EFAULT;
}
>> The
>> backing memory is verified that it can be chunked.
>> > For all these reasons I had assumed that we'd need space in the
>>> net_iov that tells us its size: net_iov->size.
>>
>> Nope, not in this case.
>>
>>> And then netmem_size(netmem) would replace all the PAGE_SIZE
>>> assumptions in the net stack, and then we'd disallow coalescing of
>>> skbs with different-sized netmems (else we need to handle them
>>> correctly per the netmem_size).
>> I'm not even sure what's the concern. What's the difference b/w
>> tcp_recvmsg_dmabuf() getting one skb with differently sized frags
>> or same frags in separate skbs? You still need to handle it
>> somehow, even if by failing.
>>
>
> Right, I just wanted to understand what the design is. I guess the
> design is allowing the netmems in the same skb to have different max
> frag lens, yes?
Yeah, and it's already allowed for higher order pages.
> I am guessing that it works, even in tcp_recvmsg_dmabuf. I guess the
And you won't see it unless adds support for that, that's why
I added this:
if (!net_is_devmem_iov(niov)) {
err = -ENODEV;
goto out;
}
> frag len is actually in frag->len, so already it may vary from frag to
> frag. Even if coalescing happens, some frags would have a frag->len =
> PAGE_SIZE and some > PAGE_SIZE. Seems fine to me off the bat.
>
>> Also, we should never coalesce different niovs together regardless
>> of sizes. And for coalescing two chunks of the same niov, it should
>> work just fine even without knowing the length.
>>
>
> Yeah, we should probably not coalesce 2 netmems together, although I
> vaguely remember reading code in a net stack hepler that does that
> somewhere already. Whatever.
Let know if that turns out to be true, because it should already
be broken. You shouldn't coalesce pages from different folios,
and to check that you need to get the head page / etc., which
niovs obviously don't have.
--
Pavel Begunkov
prev parent reply other threads:[~2025-07-28 20:56 UTC|newest]
Thread overview: 66+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-28 11:04 [RFC v1 00/22] Large rx buffer support for zcrx Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
2025-07-28 18:11 ` Mina Almasry
2025-07-28 21:36 ` Mina Almasry
2025-08-01 23:13 ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 02/22] net: ethtool: report max value for rx-buf-len Pavel Begunkov
2025-07-29 5:00 ` Subbaraya Sundeep
2025-07-28 11:04 ` [RFC v1 03/22] net: use zero value to restore rx_buf_len to default Pavel Begunkov
2025-07-29 5:03 ` Subbaraya Sundeep
2025-07-28 11:04 ` [RFC v1 04/22] net: clarify the meaning of netdev_config members Pavel Begunkov
2025-07-28 21:44 ` Mina Almasry
2025-08-01 23:14 ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 05/22] net: add rx_buf_len to netdev config Pavel Begunkov
2025-07-28 21:50 ` Mina Almasry
2025-08-01 23:18 ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 06/22] eth: bnxt: read the page size from the adapter struct Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 07/22] eth: bnxt: set page pool page order based on rx_page_size Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 08/22] eth: bnxt: support setting size of agg buffers via ethtool Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 09/22] net: move netdev_config manipulation to dedicated helpers Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 10/22] net: reduce indent of struct netdev_queue_mgmt_ops members Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 11/22] net: allocate per-queue config structs and pass them thru the queue API Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 12/22] net: pass extack to netdev_rx_queue_restart() Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 13/22] net: add queue config validation callback Pavel Begunkov
2025-07-28 22:26 ` Mina Almasry
2025-07-28 11:04 ` [RFC v1 14/22] eth: bnxt: always set the queue mgmt ops Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 15/22] eth: bnxt: store the rx buf size per queue Pavel Begunkov
2025-07-28 22:33 ` Mina Almasry
2025-08-01 23:20 ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 16/22] eth: bnxt: adjust the fill level of agg queues with larger buffers Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 17/22] netdev: add support for setting rx-buf-len per queue Pavel Begunkov
2025-07-28 23:10 ` Mina Almasry
2025-08-01 23:37 ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 18/22] net: wipe the setting of deactived queues Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 19/22] eth: bnxt: use queue op config validate Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 20/22] eth: bnxt: support per queue configuration of rx-buf-len Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 21/22] net: parametrise mp open with a queue config Pavel Begunkov
2025-08-02 0:10 ` Jakub Kicinski
2025-08-04 12:50 ` Pavel Begunkov
2025-08-05 22:43 ` Jakub Kicinski
2025-08-06 0:05 ` Jakub Kicinski
2025-08-06 16:48 ` Mina Almasry
2025-08-06 18:11 ` Jakub Kicinski
2025-08-06 18:30 ` Mina Almasry
2025-08-06 22:05 ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 22/22] io_uring/zcrx: implement large rx buffer support Pavel Begunkov
2025-07-28 17:13 ` [RFC v1 00/22] Large rx buffer support for zcrx Stanislav Fomichev
2025-07-28 18:18 ` Pavel Begunkov
2025-07-28 20:21 ` Stanislav Fomichev
2025-07-28 21:28 ` Pavel Begunkov
2025-07-28 22:06 ` Stanislav Fomichev
2025-07-28 22:44 ` Pavel Begunkov
2025-07-29 16:33 ` Stanislav Fomichev
2025-07-30 14:16 ` Pavel Begunkov
2025-07-30 15:50 ` Stanislav Fomichev
2025-07-31 19:34 ` Mina Almasry
2025-07-31 19:57 ` Pavel Begunkov
2025-07-31 20:05 ` Mina Almasry
2025-08-01 9:48 ` Pavel Begunkov
2025-08-01 9:58 ` Pavel Begunkov
2025-07-28 23:22 ` Mina Almasry
2025-07-29 16:41 ` Stanislav Fomichev
2025-07-29 17:01 ` Mina Almasry
2025-07-28 18:54 ` Mina Almasry
2025-07-28 19:42 ` Pavel Begunkov
2025-07-28 20:23 ` Mina Almasry
2025-07-28 20:57 ` Pavel Begunkov [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8c90f485-192f-4f7a-ac94-8171d78f3c4a@gmail.com \
--to=asml.silence@gmail.com \
--cc=almasrymina@google.com \
--cc=andrew+netdev@lunn.ch \
--cc=ap420073@gmail.com \
--cc=davem@davemloft.net \
--cc=dtatulea@nvidia.com \
--cc=dw@davidwei.uk \
--cc=edumazet@google.com \
--cc=horms@kernel.org \
--cc=io-uring@vger.kernel.org \
--cc=kuba@kernel.org \
--cc=michael.chan@broadcom.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=sdf@fomichev.me \
--cc=willemb@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox