From: Pavel Begunkov <asml.silence@gmail.com>
To: Jakub Kicinski <kuba@kernel.org>, netdev@vger.kernel.org
Cc: asml.silence@gmail.com, io-uring@vger.kernel.org,
	Eric Dumazet <edumazet@google.com>,
	Willem de Bruijn <willemb@google.com>,
	Paolo Abeni <pabeni@redhat.com>,
	andrew+netdev@lunn.ch, horms@kernel.org, davem@davemloft.net,
	sdf@fomichev.me, almasrymina@google.com, dw@davidwei.uk,
	michael.chan@broadcom.com, dtatulea@nvidia.com,
	ap420073@gmail.com
Subject: [RFC v1 00/22] Large rx buffer support for zcrx
Date: Mon, 28 Jul 2025 12:04:04 +0100
Message-ID: <cover.1753694913.git.asml.silence@gmail.com>

This series implements large rx buffer support for io_uring/zcrx on
top of Jakub's queue configuration changes, but it can also be used
by other memory providers. Large rx buffers can be drastically
beneficial with high-end HW-GRO-enabled cards that can coalesce
traffic into larger pages, reducing the number of frags traversing the
network stack and resulting in larger contiguous chunks of data for
userspace. Benchmarks showed up to ~30% improvement in CPU utilisation.

For example, with a 200Gbit Broadcom NIC, 4K vs 32K buffers, and napi
and userspace pinned to the same CPU:

packets=23987040 (MB=2745098), rps=199559 (MB/s=22837)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    1.53    0.00   27.78    2.72    1.31   66.45    0.22
packets=24078368 (MB=2755550), rps=200319 (MB/s=22924)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    0.69    0.00    8.26   31.65    1.83   57.00    0.57

And for napi and userspace on different CPUs:

packets=10725082 (MB=1227388), rps=198285 (MB/s=22692)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    0.10    0.00    0.50    0.00    0.50   74.50    24.40
  1    4.51    0.00   44.33   47.22    2.08    1.85    0.00
packets=14026235 (MB=1605175), rps=198388 (MB/s=22703)
CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
  0    0.10    0.00    0.70    0.00    1.00   43.78   54.42
  1    1.09    0.00   31.95   62.91    1.42    2.63    0.00

Patch 21 allows a memory provider to pass a queue config. The zcrx
changes are contained in a single patch, as I have already queued most
of the work making zcrx size-agnostic into my zcrx branch. The uAPI is
simple and imperative: it uses the exact value specified by the user,
if one is given. In the future we might extend it to "choose the best
size in a given range".

The remaining patches (the first 20) are from Jakub's series
implementing per-queue configuration. Quoting Jakub:

"... The direct motivation for the series is that zero-copy Rx queues would
like to use larger Rx buffers. Most modern high-speed NICs support HW-GRO,
and can coalesce payloads into pages much larger than the MTU.
Enabling larger buffers globally is a bit precarious as it exposes us
to potentially very inefficient memory use. Also allocating large
buffers may not be easy or cheap under load. Zero-copy queues service
only select traffic and have pre-allocated memory so the concerns don't
apply as much.

The per-queue config has to address 3 problems:
- user API
- driver API
- memory provider API

For user API the main question is whether we expose the config via
ethtool or netdev nl. I picked the latter - via queue GET/SET, rather
than extending the ethtool RINGS_GET API. I worry slightly that queue
GET/SET will turn into a monster like SETLINK. OTOH the only per-queue
setting we have in ethtool which does not go via RINGS_SET is
IRQ coalescing.

My goal for the driver API was to avoid complexity in the drivers.
The queue management API has gained two ops, responsible for preparing
configuration for a given queue, and validating whether the config
is supported. The validating is used both for NIC-wide and per-queue
changes. Queue alloc/start ops have a new "config" argument which
contains the current config for a given queue (we use queue restart
to apply per-queue settings). Outside of queue reset paths drivers
can call netdev_queue_config() which returns the config for an arbitrary
queue. Long story short I anticipate it to be used during ndo_open.

In the core I extended struct netdev_config with per queue settings.
All in all this isn't too far from what was there in my "queue API
prototype" a few years ago ..."


Kernel branch with all dependencies: 
git: https://github.com/isilence/linux.git zcrx/large-buffers
url: https://github.com/isilence/linux/tree/zcrx/large-buffers

Jakub Kicinski (20):
  docs: ethtool: document that rx_buf_len must control payload lengths
  net: ethtool: report max value for rx-buf-len
  net: use zero value to restore rx_buf_len to default
  net: clarify the meaning of netdev_config members
  net: add rx_buf_len to netdev config
  eth: bnxt: read the page size from the adapter struct
  eth: bnxt: set page pool page order based on rx_page_size
  eth: bnxt: support setting size of agg buffers via ethtool
  net: move netdev_config manipulation to dedicated helpers
  net: reduce indent of struct netdev_queue_mgmt_ops members
  net: allocate per-queue config structs and pass them thru the queue
    API
  net: pass extack to netdev_rx_queue_restart()
  net: add queue config validation callback
  eth: bnxt: always set the queue mgmt ops
  eth: bnxt: store the rx buf size per queue
  eth: bnxt: adjust the fill level of agg queues with larger buffers
  netdev: add support for setting rx-buf-len per queue
  net: wipe the setting of deactived queues
  eth: bnxt: use queue op config validate
  eth: bnxt: support per queue configuration of rx-buf-len

Pavel Begunkov (2):
  net: parametrise mp open with a queue config
  io_uring/zcrx: implement large rx buffer support

 Documentation/netlink/specs/ethtool.yaml      |   4 +
 Documentation/netlink/specs/netdev.yaml       |  15 ++
 Documentation/networking/ethtool-netlink.rst  |   7 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 135 ++++++++++++----
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |   5 +-
 .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c |   9 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |   6 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h |   2 +-
 drivers/net/ethernet/google/gve/gve_main.c    |   9 +-
 .../marvell/octeontx2/nic/otx2_ethtool.c      |   6 +-
 drivers/net/netdevsim/netdev.c                |   8 +-
 include/linux/ethtool.h                       |   3 +
 include/net/netdev_queues.h                   |  83 ++++++++--
 include/net/netdev_rx_queue.h                 |   3 +-
 include/net/netlink.h                         |  19 +++
 include/net/page_pool/memory_provider.h       |   4 +-
 .../uapi/linux/ethtool_netlink_generated.h    |   1 +
 include/uapi/linux/io_uring.h                 |   2 +-
 include/uapi/linux/netdev.h                   |   2 +
 io_uring/zcrx.c                               |  39 ++++-
 net/core/Makefile                             |   2 +-
 net/core/dev.c                                |  12 +-
 net/core/dev.h                                |  12 ++
 net/core/netdev-genl-gen.c                    |  15 ++
 net/core/netdev-genl-gen.h                    |   1 +
 net/core/netdev-genl.c                        |  92 +++++++++++
 net/core/netdev_config.c                      | 150 ++++++++++++++++++
 net/core/netdev_rx_queue.c                    |  54 +++++--
 net/ethtool/common.c                          |   4 +-
 net/ethtool/netlink.c                         |  14 +-
 net/ethtool/rings.c                           |  14 +-
 tools/include/uapi/linux/netdev.h             |   2 +
 32 files changed, 642 insertions(+), 92 deletions(-)
 create mode 100644 net/core/netdev_config.c

-- 
2.49.0


Thread overview: 66+ messages
2025-07-28 11:04 Pavel Begunkov [this message]
2025-07-28 11:04 ` [RFC v1 01/22] docs: ethtool: document that rx_buf_len must control payload lengths Pavel Begunkov
2025-07-28 18:11   ` Mina Almasry
2025-07-28 21:36   ` Mina Almasry
2025-08-01 23:13     ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 02/22] net: ethtool: report max value for rx-buf-len Pavel Begunkov
2025-07-29  5:00   ` Subbaraya Sundeep
2025-07-28 11:04 ` [RFC v1 03/22] net: use zero value to restore rx_buf_len to default Pavel Begunkov
2025-07-29  5:03   ` Subbaraya Sundeep
2025-07-28 11:04 ` [RFC v1 04/22] net: clarify the meaning of netdev_config members Pavel Begunkov
2025-07-28 21:44   ` Mina Almasry
2025-08-01 23:14     ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 05/22] net: add rx_buf_len to netdev config Pavel Begunkov
2025-07-28 21:50   ` Mina Almasry
2025-08-01 23:18     ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 06/22] eth: bnxt: read the page size from the adapter struct Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 07/22] eth: bnxt: set page pool page order based on rx_page_size Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 08/22] eth: bnxt: support setting size of agg buffers via ethtool Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 09/22] net: move netdev_config manipulation to dedicated helpers Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 10/22] net: reduce indent of struct netdev_queue_mgmt_ops members Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 11/22] net: allocate per-queue config structs and pass them thru the queue API Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 12/22] net: pass extack to netdev_rx_queue_restart() Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 13/22] net: add queue config validation callback Pavel Begunkov
2025-07-28 22:26   ` Mina Almasry
2025-07-28 11:04 ` [RFC v1 14/22] eth: bnxt: always set the queue mgmt ops Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 15/22] eth: bnxt: store the rx buf size per queue Pavel Begunkov
2025-07-28 22:33   ` Mina Almasry
2025-08-01 23:20     ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 16/22] eth: bnxt: adjust the fill level of agg queues with larger buffers Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 17/22] netdev: add support for setting rx-buf-len per queue Pavel Begunkov
2025-07-28 23:10   ` Mina Almasry
2025-08-01 23:37     ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 18/22] net: wipe the setting of deactived queues Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 19/22] eth: bnxt: use queue op config validate Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 20/22] eth: bnxt: support per queue configuration of rx-buf-len Pavel Begunkov
2025-07-28 11:04 ` [RFC v1 21/22] net: parametrise mp open with a queue config Pavel Begunkov
2025-08-02  0:10   ` Jakub Kicinski
2025-08-04 12:50     ` Pavel Begunkov
2025-08-05 22:43       ` Jakub Kicinski
2025-08-06  0:05       ` Jakub Kicinski
2025-08-06 16:48       ` Mina Almasry
2025-08-06 18:11         ` Jakub Kicinski
2025-08-06 18:30           ` Mina Almasry
2025-08-06 22:05             ` Jakub Kicinski
2025-07-28 11:04 ` [RFC v1 22/22] io_uring/zcrx: implement large rx buffer support Pavel Begunkov
2025-07-28 17:13 ` [RFC v1 00/22] Large rx buffer support for zcrx Stanislav Fomichev
2025-07-28 18:18   ` Pavel Begunkov
2025-07-28 20:21     ` Stanislav Fomichev
2025-07-28 21:28       ` Pavel Begunkov
2025-07-28 22:06         ` Stanislav Fomichev
2025-07-28 22:44           ` Pavel Begunkov
2025-07-29 16:33             ` Stanislav Fomichev
2025-07-30 14:16               ` Pavel Begunkov
2025-07-30 15:50                 ` Stanislav Fomichev
2025-07-31 19:34                   ` Mina Almasry
2025-07-31 19:57                     ` Pavel Begunkov
2025-07-31 20:05                       ` Mina Almasry
2025-08-01  9:48                         ` Pavel Begunkov
2025-08-01  9:58                     ` Pavel Begunkov
2025-07-28 23:22           ` Mina Almasry
2025-07-29 16:41             ` Stanislav Fomichev
2025-07-29 17:01               ` Mina Almasry
2025-07-28 18:54 ` Mina Almasry
2025-07-28 19:42   ` Pavel Begunkov
2025-07-28 20:23     ` Mina Almasry
2025-07-28 20:57       ` Pavel Begunkov
