* [PATCH net-next v8 1/9] net: memzero mp params when closing a queue
2026-01-09 11:28 [PATCH net-next v8 0/9] Add support for providers with large rx buffer Pavel Begunkov
@ 2026-01-09 11:28 ` Pavel Begunkov
2026-01-09 11:28 ` [PATCH net-next v8 2/9] net: reduce indent of struct netdev_queue_mgmt_ops members Pavel Begunkov
` (7 subsequent siblings)
8 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2026-01-09 11:28 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jonathan Corbet, Michael Chan, Pavan Chebbi, Andrew Lunn,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Joshua Washington, Harshitha Ramamurthy,
Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Alexander Duyck, Ilias Apalodimas, Shuah Khan, Willem de Bruijn,
Ankit Garg, Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, Pavel Begunkov,
David Wei, Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
Instead of resetting memory provider parameters one by one in
__net_mp_{open,close}_rxq, memzero the entire structure. This prepares
for extending the structure in the following patches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
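A minimal sketch of why the memset form helps as the struct grows; the
rx_page_size field below is only added later in this series, and the
commented-out per-field reset is hypothetical:

	/* per-field resets need updating for every new member */
	rxq->mp_params.mp_ops = NULL;
	rxq->mp_params.mp_priv = NULL;
	/* rxq->mp_params.rx_page_size = 0;  <- easy to forget */

	/* memzero covers current and future members alike */
	memset(&rxq->mp_params, 0, sizeof(rxq->mp_params));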
net/core/netdev_rx_queue.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index c7d9341b7630..a0083f176a9c 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -139,10 +139,9 @@ int __net_mp_open_rxq(struct net_device *dev, unsigned int rxq_idx,
rxq->mp_params = *p;
ret = netdev_rx_queue_restart(dev, rxq_idx);
- if (ret) {
- rxq->mp_params.mp_ops = NULL;
- rxq->mp_params.mp_priv = NULL;
- }
+ if (ret)
+ memset(&rxq->mp_params, 0, sizeof(rxq->mp_params));
+
return ret;
}
@@ -179,8 +178,7 @@ void __net_mp_close_rxq(struct net_device *dev, unsigned int ifq_idx,
rxq->mp_params.mp_priv != old_p->mp_priv))
return;
- rxq->mp_params.mp_ops = NULL;
- rxq->mp_params.mp_priv = NULL;
+ memset(&rxq->mp_params, 0, sizeof(rxq->mp_params));
err = netdev_rx_queue_restart(dev, ifq_idx);
WARN_ON(err && err != -ENETDOWN);
}
--
2.52.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH net-next v8 2/9] net: reduce indent of struct netdev_queue_mgmt_ops members
2026-01-09 11:28 [PATCH net-next v8 0/9] Add support for providers with large rx buffer Pavel Begunkov
2026-01-09 11:28 ` [PATCH net-next v8 1/9] net: memzero mp params when closing a queue Pavel Begunkov
@ 2026-01-09 11:28 ` Pavel Begunkov
2026-01-09 11:28 ` [PATCH net-next v8 3/9] net: add bare bone queue configs Pavel Begunkov
` (6 subsequent siblings)
8 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2026-01-09 11:28 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jonathan Corbet, Michael Chan, Pavan Chebbi, Andrew Lunn,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Joshua Washington, Harshitha Ramamurthy,
Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Alexander Duyck, Ilias Apalodimas, Shuah Khan, Willem de Bruijn,
Ankit Garg, Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, Pavel Begunkov,
David Wei, Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
From: Jakub Kicinski <kuba@kernel.org>
Trivial change, reduce the indent. I think the original was copied
from real NDOs. It's unnecessarily deep and makes passing struct args
problematic.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
include/net/netdev_queues.h | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index cd00e0406cf4..541e7d9853b1 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -135,20 +135,20 @@ void netdev_stat_queue_sum(struct net_device *netdev,
* be called for an interface which is open.
*/
struct netdev_queue_mgmt_ops {
- size_t ndo_queue_mem_size;
- int (*ndo_queue_mem_alloc)(struct net_device *dev,
- void *per_queue_mem,
- int idx);
- void (*ndo_queue_mem_free)(struct net_device *dev,
- void *per_queue_mem);
- int (*ndo_queue_start)(struct net_device *dev,
- void *per_queue_mem,
- int idx);
- int (*ndo_queue_stop)(struct net_device *dev,
- void *per_queue_mem,
- int idx);
- struct device * (*ndo_queue_get_dma_dev)(struct net_device *dev,
- int idx);
+ size_t ndo_queue_mem_size;
+ int (*ndo_queue_mem_alloc)(struct net_device *dev,
+ void *per_queue_mem,
+ int idx);
+ void (*ndo_queue_mem_free)(struct net_device *dev,
+ void *per_queue_mem);
+ int (*ndo_queue_start)(struct net_device *dev,
+ void *per_queue_mem,
+ int idx);
+ int (*ndo_queue_stop)(struct net_device *dev,
+ void *per_queue_mem,
+ int idx);
+ struct device * (*ndo_queue_get_dma_dev)(struct net_device *dev,
+ int idx);
};
bool netif_rxq_has_unreadable_mp(struct net_device *dev, int idx);
--
2.52.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH net-next v8 3/9] net: add bare bone queue configs
2026-01-09 11:28 [PATCH net-next v8 0/9] Add support for providers with large rx buffer Pavel Begunkov
2026-01-09 11:28 ` [PATCH net-next v8 1/9] net: memzero mp params when closing a queue Pavel Begunkov
2026-01-09 11:28 ` [PATCH net-next v8 2/9] net: reduce indent of struct netdev_queue_mgmt_ops members Pavel Begunkov
@ 2026-01-09 11:28 ` Pavel Begunkov
2026-01-09 11:28 ` [PATCH net-next v8 4/9] net: pass queue rx page size from memory provider Pavel Begunkov
` (5 subsequent siblings)
8 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2026-01-09 11:28 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jonathan Corbet, Michael Chan, Pavan Chebbi, Andrew Lunn,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Joshua Washington, Harshitha Ramamurthy,
Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Alexander Duyck, Ilias Apalodimas, Shuah Khan, Willem de Bruijn,
Ankit Garg, Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, Pavel Begunkov,
David Wei, Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
We'll need to pass extra parameters when allocating a queue for memory
providers. Define a new structure for queue configurations, and pass it
to the queue API (qapi) callbacks. It's empty for now; actual parameters
will be added in following patches.
Configurations should persist across resets, so they're
default-initialised on device registration and stored in struct
netdev_rx_queue. We also add a new qapi callback that populates a given
config with defaults. It must be implemented if a driver wants to use
queue configs and is optional otherwise.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
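A rough sketch of the driver-side contract with this patch applied,
using a hypothetical driver "foo" (names are illustrative, not from
this series): the driver populates defaults in the new callback and
receives the same struct in its alloc/start hooks.

	static void foo_default_qcfg(struct net_device *dev,
				     struct netdev_queue_config *qcfg)
	{
		/* fill in driver defaults; the struct is empty at this
		 * point in the series, fields arrive in later patches
		 */
	}

	static int foo_queue_mem_alloc(struct net_device *dev,
				       struct netdev_queue_config *qcfg,
				       void *per_queue_mem, int idx)
	{
		/* qcfg holds the defaults or a caller-supplied config */
		return 0;
	}

	static const struct netdev_queue_mgmt_ops foo_queue_mgmt_ops = {
		.ndo_queue_mem_alloc	= foo_queue_mem_alloc,
		.ndo_default_qcfg	= foo_default_qcfg,
		/* remaining callbacks unchanged */
	};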
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 ++++++--
drivers/net/ethernet/google/gve/gve_main.c | 9 ++++++---
.../net/ethernet/mellanox/mlx5/core/en_main.c | 10 ++++++----
drivers/net/ethernet/meta/fbnic/fbnic_txrx.c | 8 ++++++--
drivers/net/netdevsim/netdev.c | 7 +++++--
include/net/netdev_queues.h | 9 +++++++++
include/net/netdev_rx_queue.h | 2 ++
net/core/dev.c | 17 +++++++++++++++++
net/core/netdev_rx_queue.c | 12 +++++++++---
9 files changed, 66 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index d17d0ea89c36..73f954da39b9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -15902,7 +15902,9 @@ static const struct netdev_stat_ops bnxt_stat_ops = {
.get_base_stats = bnxt_get_base_stats,
};
-static int bnxt_queue_mem_alloc(struct net_device *dev, void *qmem, int idx)
+static int bnxt_queue_mem_alloc(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *qmem, int idx)
{
struct bnxt_rx_ring_info *rxr, *clone;
struct bnxt *bp = netdev_priv(dev);
@@ -16068,7 +16070,9 @@ static void bnxt_copy_rx_ring(struct bnxt *bp,
dst->rx_agg_bmap = src->rx_agg_bmap;
}
-static int bnxt_queue_start(struct net_device *dev, void *qmem, int idx)
+static int bnxt_queue_start(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *qmem, int idx)
{
struct bnxt *bp = netdev_priv(dev);
struct bnxt_rx_ring_info *rxr, *clone;
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 7eb64e1e4d85..c42640da15a5 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -2616,8 +2616,9 @@ static void gve_rx_queue_mem_free(struct net_device *dev, void *per_q_mem)
gve_rx_free_ring_dqo(priv, gve_per_q_mem, &cfg);
}
-static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
- int idx)
+static int gve_rx_queue_mem_alloc(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *per_q_mem, int idx)
{
struct gve_priv *priv = netdev_priv(dev);
struct gve_rx_alloc_rings_cfg cfg = {0};
@@ -2638,7 +2639,9 @@ static int gve_rx_queue_mem_alloc(struct net_device *dev, void *per_q_mem,
return err;
}
-static int gve_rx_queue_start(struct net_device *dev, void *per_q_mem, int idx)
+static int gve_rx_queue_start(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *per_q_mem, int idx)
{
struct gve_priv *priv = netdev_priv(dev);
struct gve_rx_ring *gve_per_q_mem;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 07fc4d2c8fad..0e2132b58257 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -5596,8 +5596,9 @@ struct mlx5_qmgmt_data {
struct mlx5e_channel_param cparam;
};
-static int mlx5e_queue_mem_alloc(struct net_device *dev, void *newq,
- int queue_index)
+static int mlx5e_queue_mem_alloc(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *newq, int queue_index)
{
struct mlx5_qmgmt_data *new = (struct mlx5_qmgmt_data *)newq;
struct mlx5e_priv *priv = netdev_priv(dev);
@@ -5658,8 +5659,9 @@ static int mlx5e_queue_stop(struct net_device *dev, void *oldq, int queue_index)
return 0;
}
-static int mlx5e_queue_start(struct net_device *dev, void *newq,
- int queue_index)
+static int mlx5e_queue_start(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *newq, int queue_index)
{
struct mlx5_qmgmt_data *new = (struct mlx5_qmgmt_data *)newq;
struct mlx5e_priv *priv = netdev_priv(dev);
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c b/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c
index 13d508ce637f..e36ed25462b4 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_txrx.c
@@ -2809,7 +2809,9 @@ void fbnic_napi_depletion_check(struct net_device *netdev)
fbnic_wrfl(fbd);
}
-static int fbnic_queue_mem_alloc(struct net_device *dev, void *qmem, int idx)
+static int fbnic_queue_mem_alloc(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *qmem, int idx)
{
struct fbnic_net *fbn = netdev_priv(dev);
const struct fbnic_q_triad *real;
@@ -2861,7 +2863,9 @@ static void __fbnic_nv_restart(struct fbnic_net *fbn,
netif_wake_subqueue(fbn->netdev, nv->qt[i].sub0.q_idx);
}
-static int fbnic_queue_start(struct net_device *dev, void *qmem, int idx)
+static int fbnic_queue_start(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *qmem, int idx)
{
struct fbnic_net *fbn = netdev_priv(dev);
struct fbnic_napi_vector *nv;
diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
index 6927c1962277..6285fbefe38a 100644
--- a/drivers/net/netdevsim/netdev.c
+++ b/drivers/net/netdevsim/netdev.c
@@ -758,7 +758,9 @@ struct nsim_queue_mem {
};
static int
-nsim_queue_mem_alloc(struct net_device *dev, void *per_queue_mem, int idx)
+nsim_queue_mem_alloc(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
+ void *per_queue_mem, int idx)
{
struct nsim_queue_mem *qmem = per_queue_mem;
struct netdevsim *ns = netdev_priv(dev);
@@ -807,7 +809,8 @@ static void nsim_queue_mem_free(struct net_device *dev, void *per_queue_mem)
}
static int
-nsim_queue_start(struct net_device *dev, void *per_queue_mem, int idx)
+nsim_queue_start(struct net_device *dev, struct netdev_queue_config *qcfg,
+ void *per_queue_mem, int idx)
{
struct nsim_queue_mem *qmem = per_queue_mem;
struct netdevsim *ns = netdev_priv(dev);
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index 541e7d9853b1..f6f1f71a24e1 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -14,6 +14,9 @@ struct netdev_config {
u8 hds_config;
};
+struct netdev_queue_config {
+};
+
/* See the netdev.yaml spec for definition of each statistic */
struct netdev_queue_stats_rx {
u64 bytes;
@@ -130,6 +133,8 @@ void netdev_stat_queue_sum(struct net_device *netdev,
* @ndo_queue_get_dma_dev: Get dma device for zero-copy operations to be used
* for this queue. Return NULL on error.
*
+ * @ndo_default_qcfg: Populate queue config struct with defaults. Optional.
+ *
* Note that @ndo_queue_mem_alloc and @ndo_queue_mem_free may be called while
* the interface is closed. @ndo_queue_start and @ndo_queue_stop will only
* be called for an interface which is open.
@@ -137,16 +142,20 @@ void netdev_stat_queue_sum(struct net_device *netdev,
struct netdev_queue_mgmt_ops {
size_t ndo_queue_mem_size;
int (*ndo_queue_mem_alloc)(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
void *per_queue_mem,
int idx);
void (*ndo_queue_mem_free)(struct net_device *dev,
void *per_queue_mem);
int (*ndo_queue_start)(struct net_device *dev,
+ struct netdev_queue_config *qcfg,
void *per_queue_mem,
int idx);
int (*ndo_queue_stop)(struct net_device *dev,
void *per_queue_mem,
int idx);
+ void (*ndo_default_qcfg)(struct net_device *dev,
+ struct netdev_queue_config *qcfg);
struct device * (*ndo_queue_get_dma_dev)(struct net_device *dev,
int idx);
};
diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h
index 8cdcd138b33f..cfa72c485387 100644
--- a/include/net/netdev_rx_queue.h
+++ b/include/net/netdev_rx_queue.h
@@ -7,6 +7,7 @@
#include <linux/sysfs.h>
#include <net/xdp.h>
#include <net/page_pool/types.h>
+#include <net/netdev_queues.h>
/* This structure contains an instance of an RX queue. */
struct netdev_rx_queue {
@@ -27,6 +28,7 @@ struct netdev_rx_queue {
struct xsk_buff_pool *pool;
#endif
struct napi_struct *napi;
+ struct netdev_queue_config qcfg;
struct pp_memory_provider_params mp_params;
} ____cacheline_aligned_in_smp;
diff --git a/net/core/dev.c b/net/core/dev.c
index 36dc5199037e..a1d394addaef 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -11270,6 +11270,21 @@ static void netdev_free_phy_link_topology(struct net_device *dev)
}
}
+static void init_rx_queue_cfgs(struct net_device *dev)
+{
+ const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
+ struct netdev_rx_queue *rxq;
+ int i;
+
+ if (!qops || !qops->ndo_default_qcfg)
+ return;
+
+ for (i = 0; i < dev->num_rx_queues; i++) {
+ rxq = __netif_get_rx_queue(dev, i);
+ qops->ndo_default_qcfg(dev, &rxq->qcfg);
+ }
+}
+
/**
* register_netdevice() - register a network device
* @dev: device to register
@@ -11315,6 +11330,8 @@ int register_netdevice(struct net_device *dev)
if (!dev->name_node)
goto out;
+ init_rx_queue_cfgs(dev);
+
/* Init, if this function is available */
if (dev->netdev_ops->ndo_init) {
ret = dev->netdev_ops->ndo_init(dev);
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index a0083f176a9c..86d1c0a925e3 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -22,6 +22,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
{
struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx);
const struct netdev_queue_mgmt_ops *qops = dev->queue_mgmt_ops;
+ struct netdev_queue_config qcfg;
void *new_mem, *old_mem;
int err;
@@ -31,6 +32,10 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
netdev_assert_locked(dev);
+ memset(&qcfg, 0, sizeof(qcfg));
+ if (qops->ndo_default_qcfg)
+ qops->ndo_default_qcfg(dev, &qcfg);
+
new_mem = kvzalloc(qops->ndo_queue_mem_size, GFP_KERNEL);
if (!new_mem)
return -ENOMEM;
@@ -41,7 +46,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
goto err_free_new_mem;
}
- err = qops->ndo_queue_mem_alloc(dev, new_mem, rxq_idx);
+ err = qops->ndo_queue_mem_alloc(dev, &qcfg, new_mem, rxq_idx);
if (err)
goto err_free_old_mem;
@@ -54,7 +59,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
if (err)
goto err_free_new_queue_mem;
- err = qops->ndo_queue_start(dev, new_mem, rxq_idx);
+ err = qops->ndo_queue_start(dev, &qcfg, new_mem, rxq_idx);
if (err)
goto err_start_queue;
} else {
@@ -66,6 +71,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
kvfree(old_mem);
kvfree(new_mem);
+ rxq->qcfg = qcfg;
return 0;
err_start_queue:
@@ -76,7 +82,7 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
* WARN if we fail to recover the old rx queue, and at least free
* old_mem so we don't also leak that.
*/
- if (qops->ndo_queue_start(dev, old_mem, rxq_idx)) {
+ if (qops->ndo_queue_start(dev, &rxq->qcfg, old_mem, rxq_idx)) {
WARN(1,
"Failed to restart old queue in error path. RX queue %d may be unhealthy.",
rxq_idx);
--
2.52.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH net-next v8 4/9] net: pass queue rx page size from memory provider
2026-01-09 11:28 [PATCH net-next v8 0/9] Add support for providers with large rx buffer Pavel Begunkov
` (2 preceding siblings ...)
2026-01-09 11:28 ` [PATCH net-next v8 3/9] net: add bare bone queue configs Pavel Begunkov
@ 2026-01-09 11:28 ` Pavel Begunkov
2026-01-09 11:28 ` [PATCH net-next v8 5/9] eth: bnxt: store rx buffer size per queue Pavel Begunkov
` (4 subsequent siblings)
8 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2026-01-09 11:28 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jonathan Corbet, Michael Chan, Pavan Chebbi, Andrew Lunn,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Joshua Washington, Harshitha Ramamurthy,
Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Alexander Duyck, Ilias Apalodimas, Shuah Khan, Willem de Bruijn,
Ankit Garg, Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, Pavel Begunkov,
David Wei, Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
Allow memory providers to configure rx queues with a custom receive
page size. It's passed in struct pp_memory_provider_params, which is
copied into the queue, so it's preserved across queue restarts. From
there it's propagated to the driver in a new queue config parameter.
Drivers must explicitly opt in by setting QCFG_RX_PAGE_SIZE in
supported_params, in which case they should implement ndo_default_qcfg,
validate the size on queue restart, and honour the current config in
case of a reset.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
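Both halves of the contract, sketched with hypothetical provider and
driver names (illustrative only): the provider asks for a size via
mp_params; the core copies it into the queue config only if the driver
advertises QCFG_RX_PAGE_SIZE, otherwise the restart fails with
-EOPNOTSUPP.

	/* memory provider side: request 64K rx buffers for a queue */
	struct pp_memory_provider_params p = {
		.mp_ops		= &foo_mp_ops,		/* illustrative */
		.mp_priv	= foo_mp_state,		/* illustrative */
		.rx_page_size	= SZ_64K,
	};

	/* driver side: opt in and provide defaults */
	static const struct netdev_queue_mgmt_ops foo_queue_mgmt_ops = {
		/* ... */
		.ndo_default_qcfg	= foo_default_qcfg,
		.supported_params	= QCFG_RX_PAGE_SIZE,
	};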
include/net/netdev_queues.h | 10 ++++++++++
include/net/page_pool/types.h | 1 +
net/core/netdev_rx_queue.c | 9 +++++++++
3 files changed, 20 insertions(+)
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index f6f1f71a24e1..feca25131930 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -15,6 +15,7 @@ struct netdev_config {
};
struct netdev_queue_config {
+ u32 rx_page_size;
};
/* See the netdev.yaml spec for definition of each statistic */
@@ -114,6 +115,11 @@ void netdev_stat_queue_sum(struct net_device *netdev,
int tx_start, int tx_end,
struct netdev_queue_stats_tx *tx_sum);
+enum {
+ /* The queue checks and honours the page size qcfg parameter */
+ QCFG_RX_PAGE_SIZE = 0x1,
+};
+
/**
* struct netdev_queue_mgmt_ops - netdev ops for queue management
*
@@ -135,6 +141,8 @@ void netdev_stat_queue_sum(struct net_device *netdev,
*
* @ndo_default_qcfg: Populate queue config struct with defaults. Optional.
*
+ * @supported_params: Bitmask of supported parameters, see QCFG_*.
+ *
* Note that @ndo_queue_mem_alloc and @ndo_queue_mem_free may be called while
* the interface is closed. @ndo_queue_start and @ndo_queue_stop will only
* be called for an interface which is open.
@@ -158,6 +166,8 @@ struct netdev_queue_mgmt_ops {
struct netdev_queue_config *qcfg);
struct device * (*ndo_queue_get_dma_dev)(struct net_device *dev,
int idx);
+
+ unsigned int supported_params;
};
bool netif_rxq_has_unreadable_mp(struct net_device *dev, int idx);
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 1509a536cb85..0d453484a585 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -161,6 +161,7 @@ struct memory_provider_ops;
struct pp_memory_provider_params {
void *mp_priv;
const struct memory_provider_ops *mp_ops;
+ u32 rx_page_size;
};
struct page_pool {
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index 86d1c0a925e3..b81cad90ba2f 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -30,12 +30,21 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
!qops->ndo_queue_mem_alloc || !qops->ndo_queue_start)
return -EOPNOTSUPP;
+ if (WARN_ON_ONCE(qops->supported_params && !qops->ndo_default_qcfg))
+ return -EINVAL;
+
netdev_assert_locked(dev);
memset(&qcfg, 0, sizeof(qcfg));
if (qops->ndo_default_qcfg)
qops->ndo_default_qcfg(dev, &qcfg);
+ if (rxq->mp_params.rx_page_size) {
+ if (!(qops->supported_params & QCFG_RX_PAGE_SIZE))
+ return -EOPNOTSUPP;
+ qcfg.rx_page_size = rxq->mp_params.rx_page_size;
+ }
+
new_mem = kvzalloc(qops->ndo_queue_mem_size, GFP_KERNEL);
if (!new_mem)
return -ENOMEM;
--
2.52.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH net-next v8 5/9] eth: bnxt: store rx buffer size per queue
2026-01-09 11:28 [PATCH net-next v8 0/9] Add support for providers with large rx buffer Pavel Begunkov
` (3 preceding siblings ...)
2026-01-09 11:28 ` [PATCH net-next v8 4/9] net: pass queue rx page size from memory provider Pavel Begunkov
@ 2026-01-09 11:28 ` Pavel Begunkov
2026-01-13 10:19 ` Paolo Abeni
2026-01-09 11:28 ` [PATCH net-next v8 6/9] eth: bnxt: adjust the fill level of agg queues with larger buffers Pavel Begunkov
` (3 subsequent siblings)
8 siblings, 1 reply; 19+ messages in thread
From: Pavel Begunkov @ 2026-01-09 11:28 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jonathan Corbet, Michael Chan, Pavan Chebbi, Andrew Lunn,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Joshua Washington, Harshitha Ramamurthy,
Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Alexander Duyck, Ilias Apalodimas, Shuah Khan, Willem de Bruijn,
Ankit Garg, Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, Pavel Begunkov,
David Wei, Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
Instead of using a constant buffer length, allow configuring the size
for each queue separately. There is no way to change the length yet;
it'll be passed in from memory providers in a later patch.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
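To make the page pool change concrete, a worked example assuming 4K
kernel pages and BNXT_RX_PAGE_SIZE == 4K (the 32K value is
illustrative):

	rxr->rx_page_size = SZ_32K;
	pp.order   = get_order(rxr->rx_page_size);	/* == 3 */
	pp.max_len = PAGE_SIZE << pp.order;		/* == 32K */

Queues left at the default keep order-0 pages as before; any non-zero
order also forces a separate order-0 head pool via need_head_pool.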
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 56 +++++++++++--------
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 +
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 6 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h | 2 +-
4 files changed, 38 insertions(+), 27 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 73f954da39b9..8f42885a7c86 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -905,7 +905,7 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
static bool bnxt_separate_head_pool(struct bnxt_rx_ring_info *rxr)
{
- return rxr->need_head_pool || PAGE_SIZE > BNXT_RX_PAGE_SIZE;
+ return rxr->need_head_pool || rxr->rx_page_size < PAGE_SIZE;
}
static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
@@ -915,9 +915,9 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
{
struct page *page;
- if (PAGE_SIZE > BNXT_RX_PAGE_SIZE) {
+ if (rxr->rx_page_size < PAGE_SIZE) {
page = page_pool_dev_alloc_frag(rxr->page_pool, offset,
- BNXT_RX_PAGE_SIZE);
+ rxr->rx_page_size);
} else {
page = page_pool_dev_alloc_pages(rxr->page_pool);
*offset = 0;
@@ -936,8 +936,9 @@ static netmem_ref __bnxt_alloc_rx_netmem(struct bnxt *bp, dma_addr_t *mapping,
{
netmem_ref netmem;
- if (PAGE_SIZE > BNXT_RX_PAGE_SIZE) {
- netmem = page_pool_alloc_frag_netmem(rxr->page_pool, offset, BNXT_RX_PAGE_SIZE, gfp);
+ if (rxr->rx_page_size < PAGE_SIZE) {
+ netmem = page_pool_alloc_frag_netmem(rxr->page_pool, offset,
+ rxr->rx_page_size, gfp);
} else {
netmem = page_pool_alloc_netmems(rxr->page_pool, gfp);
*offset = 0;
@@ -1155,9 +1156,9 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
return NULL;
}
dma_addr -= bp->rx_dma_offset;
- dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE,
+ dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, rxr->rx_page_size,
bp->rx_dir);
- skb = napi_build_skb(data_ptr - bp->rx_offset, BNXT_RX_PAGE_SIZE);
+ skb = napi_build_skb(data_ptr - bp->rx_offset, rxr->rx_page_size);
if (!skb) {
page_pool_recycle_direct(rxr->page_pool, page);
return NULL;
@@ -1189,7 +1190,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
return NULL;
}
dma_addr -= bp->rx_dma_offset;
- dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE,
+ dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, rxr->rx_page_size,
bp->rx_dir);
if (unlikely(!payload))
@@ -1203,7 +1204,7 @@ static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
skb_mark_for_recycle(skb);
off = (void *)data_ptr - page_address(page);
- skb_add_rx_frag(skb, 0, page, off, len, BNXT_RX_PAGE_SIZE);
+ skb_add_rx_frag(skb, 0, page, off, len, rxr->rx_page_size);
memcpy(skb->data - NET_IP_ALIGN, data_ptr - NET_IP_ALIGN,
payload + NET_IP_ALIGN);
@@ -1288,7 +1289,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
if (skb) {
skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
cons_rx_buf->offset,
- frag_len, BNXT_RX_PAGE_SIZE);
+ frag_len, rxr->rx_page_size);
} else {
skb_frag_t *frag = &shinfo->frags[i];
@@ -1313,7 +1314,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
if (skb) {
skb->len -= frag_len;
skb->data_len -= frag_len;
- skb->truesize -= BNXT_RX_PAGE_SIZE;
+ skb->truesize -= rxr->rx_page_size;
}
--shinfo->nr_frags;
@@ -1328,7 +1329,7 @@ static u32 __bnxt_rx_agg_netmems(struct bnxt *bp,
}
page_pool_dma_sync_netmem_for_cpu(rxr->page_pool, netmem, 0,
- BNXT_RX_PAGE_SIZE);
+ rxr->rx_page_size);
total_frag_len += frag_len;
prod = NEXT_RX_AGG(prod);
@@ -2281,8 +2282,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
if (!skb)
goto oom_next_rx;
} else {
- skb = bnxt_xdp_build_skb(bp, skb, agg_bufs,
- rxr->page_pool, &xdp);
+ skb = bnxt_xdp_build_skb(bp, skb, agg_bufs, rxr, &xdp);
if (!skb) {
/* we should be able to free the old skb here */
bnxt_xdp_buff_frags_free(rxr, &xdp);
@@ -3828,11 +3828,13 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
pp.pool_size = bp->rx_agg_ring_size / agg_size_fac;
if (BNXT_RX_PAGE_MODE(bp))
pp.pool_size += bp->rx_ring_size / rx_size_fac;
+
+ pp.order = get_order(rxr->rx_page_size);
pp.nid = numa_node;
pp.netdev = bp->dev;
pp.dev = &bp->pdev->dev;
pp.dma_dir = bp->rx_dir;
- pp.max_len = PAGE_SIZE;
+ pp.max_len = PAGE_SIZE << pp.order;
pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV |
PP_FLAG_ALLOW_UNREADABLE_NETMEM;
pp.queue_idx = rxr->bnapi->index;
@@ -3843,7 +3845,10 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
rxr->page_pool = pool;
rxr->need_head_pool = page_pool_is_unreadable(pool);
+ rxr->need_head_pool |= !!pp.order;
if (bnxt_separate_head_pool(rxr)) {
+ pp.order = 0;
+ pp.max_len = PAGE_SIZE;
pp.pool_size = min(bp->rx_ring_size / rx_size_fac, 1024);
pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
pool = page_pool_create(&pp);
@@ -4319,6 +4324,8 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
if (!rxr)
goto skip_rx;
+ rxr->rx_page_size = BNXT_RX_PAGE_SIZE;
+
ring = &rxr->rx_ring_struct;
rmem = &ring->ring_mem;
rmem->nr_pages = bp->rx_nr_pages;
@@ -4478,7 +4485,7 @@ static void bnxt_init_one_rx_agg_ring_rxbd(struct bnxt *bp,
ring = &rxr->rx_agg_ring_struct;
ring->fw_ring_id = INVALID_HW_RING_ID;
if ((bp->flags & BNXT_FLAG_AGG_RINGS)) {
- type = ((u32)BNXT_RX_PAGE_SIZE << RX_BD_LEN_SHIFT) |
+ type = ((u32)(u32)rxr->rx_page_size << RX_BD_LEN_SHIFT) |
RX_BD_TYPE_RX_AGG_BD;
/* On P7, setting EOP will cause the chip to disable
@@ -7056,6 +7063,7 @@ static void bnxt_hwrm_ring_grp_free(struct bnxt *bp)
static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
struct hwrm_ring_alloc_input *req,
+ struct bnxt_rx_ring_info *rxr,
struct bnxt_ring_struct *ring)
{
struct bnxt_ring_grp_info *grp_info = &bp->grp_info[ring->grp_idx];
@@ -7065,7 +7073,7 @@ static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
if (ring_type == HWRM_RING_ALLOC_AGG) {
req->ring_type = RING_ALLOC_REQ_RING_TYPE_RX_AGG;
req->rx_ring_id = cpu_to_le16(grp_info->rx_fw_ring_id);
- req->rx_buf_size = cpu_to_le16(BNXT_RX_PAGE_SIZE);
+ req->rx_buf_size = cpu_to_le16(rxr->rx_page_size);
enables |= RING_ALLOC_REQ_ENABLES_RX_RING_ID_VALID;
} else {
req->rx_buf_size = cpu_to_le16(bp->rx_buf_use_size);
@@ -7079,6 +7087,7 @@ static void bnxt_set_rx_ring_params_p5(struct bnxt *bp, u32 ring_type,
}
static int hwrm_ring_alloc_send_msg(struct bnxt *bp,
+ struct bnxt_rx_ring_info *rxr,
struct bnxt_ring_struct *ring,
u32 ring_type, u32 map_index)
{
@@ -7135,7 +7144,8 @@ static int hwrm_ring_alloc_send_msg(struct bnxt *bp,
cpu_to_le32(bp->rx_ring_mask + 1) :
cpu_to_le32(bp->rx_agg_ring_mask + 1);
if (bp->flags & BNXT_FLAG_CHIP_P5_PLUS)
- bnxt_set_rx_ring_params_p5(bp, ring_type, req, ring);
+ bnxt_set_rx_ring_params_p5(bp, ring_type, req,
+ rxr, ring);
break;
case HWRM_RING_ALLOC_CMPL:
req->ring_type = RING_ALLOC_REQ_RING_TYPE_L2_CMPL;
@@ -7283,7 +7293,7 @@ static int bnxt_hwrm_rx_ring_alloc(struct bnxt *bp,
u32 map_idx = bnapi->index;
int rc;
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, rxr, ring, type, map_idx);
if (rc)
return rc;
@@ -7303,7 +7313,7 @@ static int bnxt_hwrm_rx_agg_ring_alloc(struct bnxt *bp,
int rc;
map_idx = grp_idx + bp->rx_nr_rings;
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, rxr, ring, type, map_idx);
if (rc)
return rc;
@@ -7327,7 +7337,7 @@ static int bnxt_hwrm_cp_ring_alloc_p5(struct bnxt *bp,
ring = &cpr->cp_ring_struct;
ring->handle = BNXT_SET_NQ_HDL(cpr);
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, map_idx);
if (rc)
return rc;
bnxt_set_db(bp, &cpr->cp_db, type, map_idx, ring->fw_ring_id);
@@ -7342,7 +7352,7 @@ static int bnxt_hwrm_tx_ring_alloc(struct bnxt *bp,
const u32 type = HWRM_RING_ALLOC_TX;
int rc;
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, tx_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, tx_idx);
if (rc)
return rc;
bnxt_set_db(bp, &txr->tx_db, type, tx_idx, ring->fw_ring_id);
@@ -7368,7 +7378,7 @@ static int bnxt_hwrm_ring_alloc(struct bnxt *bp)
vector = bp->irq_tbl[map_idx].vector;
disable_irq_nosync(vector);
- rc = hwrm_ring_alloc_send_msg(bp, ring, type, map_idx);
+ rc = hwrm_ring_alloc_send_msg(bp, NULL, ring, type, map_idx);
if (rc) {
enable_irq(vector);
goto err_out;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index f5f07a7e6b29..4c880a9fba92 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1107,6 +1107,7 @@ struct bnxt_rx_ring_info {
unsigned long *rx_agg_bmap;
u16 rx_agg_bmap_size;
+ u16 rx_page_size;
bool need_head_pool;
dma_addr_t rx_desc_mapping[MAX_RX_PAGES];
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index c94a391b1ba5..85cbeb35681c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -183,7 +183,7 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
u16 cons, u8 *data_ptr, unsigned int len,
struct xdp_buff *xdp)
{
- u32 buflen = BNXT_RX_PAGE_SIZE;
+ u32 buflen = rxr->rx_page_size;
struct bnxt_sw_rx_bd *rx_buf;
struct pci_dev *pdev;
dma_addr_t mapping;
@@ -460,7 +460,7 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp)
struct sk_buff *
bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb, u8 num_frags,
- struct page_pool *pool, struct xdp_buff *xdp)
+ struct bnxt_rx_ring_info *rxr, struct xdp_buff *xdp)
{
struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
@@ -468,7 +468,7 @@ bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb, u8 num_frags,
return NULL;
xdp_update_skb_frags_info(skb, num_frags, sinfo->xdp_frags_size,
- BNXT_RX_PAGE_SIZE * num_frags,
+ rxr->rx_page_size * num_frags,
xdp_buff_get_skb_flags(xdp));
return skb;
}
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
index 220285e190fc..8933a0dec09a 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.h
@@ -32,6 +32,6 @@ void bnxt_xdp_buff_init(struct bnxt *bp, struct bnxt_rx_ring_info *rxr,
void bnxt_xdp_buff_frags_free(struct bnxt_rx_ring_info *rxr,
struct xdp_buff *xdp);
struct sk_buff *bnxt_xdp_build_skb(struct bnxt *bp, struct sk_buff *skb,
- u8 num_frags, struct page_pool *pool,
+ u8 num_frags, struct bnxt_rx_ring_info *rxr,
struct xdp_buff *xdp);
#endif
--
2.52.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH net-next v8 5/9] eth: bnxt: store rx buffer size per queue
2026-01-09 11:28 ` [PATCH net-next v8 5/9] eth: bnxt: store rx buffer size per queue Pavel Begunkov
@ 2026-01-13 10:19 ` Paolo Abeni
2026-01-13 10:46 ` Pavel Begunkov
0 siblings, 1 reply; 19+ messages in thread
From: Paolo Abeni @ 2026-01-13 10:19 UTC (permalink / raw)
To: Pavel Begunkov, netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Jonathan Corbet,
Michael Chan, Pavan Chebbi, Andrew Lunn, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
Ilias Apalodimas, Shuah Khan, Willem de Bruijn, Ankit Garg,
Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, David Wei,
Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
On 1/9/26 12:28 PM, Pavel Begunkov wrote:
> @@ -4478,7 +4485,7 @@ static void bnxt_init_one_rx_agg_ring_rxbd(struct bnxt *bp,
> ring = &rxr->rx_agg_ring_struct;
> ring->fw_ring_id = INVALID_HW_RING_ID;
> if ((bp->flags & BNXT_FLAG_AGG_RINGS)) {
> - type = ((u32)BNXT_RX_PAGE_SIZE << RX_BD_LEN_SHIFT) |
> + type = ((u32)(u32)rxr->rx_page_size << RX_BD_LEN_SHIFT) |
Minor nit: duplicate cast above.
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> index f5f07a7e6b29..4c880a9fba92 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> @@ -1107,6 +1107,7 @@ struct bnxt_rx_ring_info {
>
> unsigned long *rx_agg_bmap;
> u16 rx_agg_bmap_size;
> + u16 rx_page_size;
Any special reason for using u16 above? AFAICS using u32 will not change
the struct size on 64 bit arches, and using u32 will likely yield better
code.
/P
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next v8 5/9] eth: bnxt: store rx buffer size per queue
2026-01-13 10:19 ` Paolo Abeni
@ 2026-01-13 10:46 ` Pavel Begunkov
0 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2026-01-13 10:46 UTC (permalink / raw)
To: Paolo Abeni, netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Jonathan Corbet,
Michael Chan, Pavan Chebbi, Andrew Lunn, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
Ilias Apalodimas, Shuah Khan, Willem de Bruijn, Ankit Garg,
Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, David Wei,
Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
On 1/13/26 10:19, Paolo Abeni wrote:
> On 1/9/26 12:28 PM, Pavel Begunkov wrote:
>> @@ -4478,7 +4485,7 @@ static void bnxt_init_one_rx_agg_ring_rxbd(struct bnxt *bp,
>> ring = &rxr->rx_agg_ring_struct;
>> ring->fw_ring_id = INVALID_HW_RING_ID;
>> if ((bp->flags & BNXT_FLAG_AGG_RINGS)) {
>> - type = ((u32)BNXT_RX_PAGE_SIZE << RX_BD_LEN_SHIFT) |
>> + type = ((u32)(u32)rxr->rx_page_size << RX_BD_LEN_SHIFT) |
>
> Minor nit: duplicate cast above.
oops, missed that, thanks
>
>> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
>> index f5f07a7e6b29..4c880a9fba92 100644
>> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
>> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
>> @@ -1107,6 +1107,7 @@ struct bnxt_rx_ring_info {
>>
>> unsigned long *rx_agg_bmap;
>> u16 rx_agg_bmap_size;
>> + u16 rx_page_size;
>
> Any special reason for using u16 above? AFAICS using u32 will not change
> the struct size on 64 bit arches, and using u32 will likely yield better
> code.
IIRC bnxt doesn't support more than 2^16-1, but it doesn't really
matter, I can convert it to u32.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH net-next v8 6/9] eth: bnxt: adjust the fill level of agg queues with larger buffers
2026-01-09 11:28 [PATCH net-next v8 0/9] Add support for providers with large rx buffer Pavel Begunkov
` (4 preceding siblings ...)
2026-01-09 11:28 ` [PATCH net-next v8 5/9] eth: bnxt: store rx buffer size per queue Pavel Begunkov
@ 2026-01-09 11:28 ` Pavel Begunkov
2026-01-13 10:27 ` Paolo Abeni
2026-01-09 11:28 ` [PATCH net-next v8 7/9] eth: bnxt: support qcfg provided rx page size Pavel Begunkov
` (2 subsequent siblings)
8 siblings, 1 reply; 19+ messages in thread
From: Pavel Begunkov @ 2026-01-09 11:28 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jonathan Corbet, Michael Chan, Pavan Chebbi, Andrew Lunn,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Joshua Washington, Harshitha Ramamurthy,
Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Alexander Duyck, Ilias Apalodimas, Shuah Khan, Willem de Bruijn,
Ankit Garg, Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, Pavel Begunkov,
David Wei, Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
From: Jakub Kicinski <kuba@kernel.org>
The driver tries to provision more agg buffers than header buffers
since multiple agg segments can reuse the same header. The calculation
/ heuristic tries to provide enough pages for 65k of data for each header
(or 4 frags per header if the result is too big). This calculation is
currently global to the adapter. If we increase the buffer sizes 8x
we don't want 8x the amount of memory sitting on the rings.
Luckily we don't have to fill the rings completely; adjust
the fill level dynamically in case a particular queue has buffers
larger than the global size.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
[pavel: rebase on top of agg_size_fac, assert agg_size_fac]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
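A worked example of the new fill level, assuming BNXT_RX_PAGE_SIZE ==
4K and an agg ring size of 1024 (the ring size is illustrative): a
queue using 32K buffers fills only

	fill_level = rx_agg_ring_size / (rx_page_size / BNXT_RX_PAGE_SIZE)
	           = 1024 / (32768 / 4096)
	           = 128

entries, so the default queue (1024 * 4K) and the large-buffer queue
(128 * 32K) both keep roughly 4MB posted.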
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 28 +++++++++++++++++++----
1 file changed, 24 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 8f42885a7c86..137e348d2b9c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -3816,16 +3816,34 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
}
}
+static int bnxt_rx_agg_ring_fill_level(struct bnxt *bp,
+ struct bnxt_rx_ring_info *rxr)
+{
+ /* User may have chosen larger than default rx_page_size,
+ * we keep the ring sizes uniform and also want uniform amount
+ * of bytes consumed per ring, so cap how much of the rings we fill.
+ */
+ int fill_level = bp->rx_agg_ring_size;
+
+ if (rxr->rx_page_size > BNXT_RX_PAGE_SIZE)
+ fill_level /= rxr->rx_page_size / BNXT_RX_PAGE_SIZE;
+
+ return fill_level;
+}
+
static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
struct bnxt_rx_ring_info *rxr,
int numa_node)
{
- const unsigned int agg_size_fac = PAGE_SIZE / BNXT_RX_PAGE_SIZE;
+ unsigned int agg_size_fac = rxr->rx_page_size / BNXT_RX_PAGE_SIZE;
const unsigned int rx_size_fac = PAGE_SIZE / SZ_4K;
struct page_pool_params pp = { 0 };
struct page_pool *pool;
- pp.pool_size = bp->rx_agg_ring_size / agg_size_fac;
+ if (WARN_ON_ONCE(agg_size_fac == 0))
+ agg_size_fac = 1;
+
+ pp.pool_size = bnxt_rx_agg_ring_fill_level(bp, rxr) / agg_size_fac;
if (BNXT_RX_PAGE_MODE(bp))
pp.pool_size += bp->rx_ring_size / rx_size_fac;
@@ -4403,11 +4421,13 @@ static void bnxt_alloc_one_rx_ring_netmem(struct bnxt *bp,
struct bnxt_rx_ring_info *rxr,
int ring_nr)
{
+ int fill_level, i;
u32 prod;
- int i;
+
+ fill_level = bnxt_rx_agg_ring_fill_level(bp, rxr);
prod = rxr->rx_agg_prod;
- for (i = 0; i < bp->rx_agg_ring_size; i++) {
+ for (i = 0; i < fill_level; i++) {
if (bnxt_alloc_rx_netmem(bp, rxr, prod, GFP_KERNEL)) {
netdev_warn(bp->dev, "init'ed rx ring %d with %d/%d pages only\n",
ring_nr, i, bp->rx_agg_ring_size);
--
2.52.0
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH net-next v8 6/9] eth: bnxt: adjust the fill level of agg queues with larger buffers
2026-01-09 11:28 ` [PATCH net-next v8 6/9] eth: bnxt: adjust the fill level of agg queues with larger buffers Pavel Begunkov
@ 2026-01-13 10:27 ` Paolo Abeni
2026-01-13 10:41 ` Paolo Abeni
2026-01-13 10:42 ` Pavel Begunkov
0 siblings, 2 replies; 19+ messages in thread
From: Paolo Abeni @ 2026-01-13 10:27 UTC (permalink / raw)
To: Pavel Begunkov, netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Jonathan Corbet,
Michael Chan, Pavan Chebbi, Andrew Lunn, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
Ilias Apalodimas, Shuah Khan, Willem de Bruijn, Ankit Garg,
Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, David Wei,
Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
On 1/9/26 12:28 PM, Pavel Begunkov wrote:
> From: Jakub Kicinski <kuba@kernel.org>
>
> The driver tries to provision more agg buffers than header buffers
> since multiple agg segments can reuse the same header. The calculation
> / heuristic tries to provide enough pages for 65k of data for each header
> (or 4 frags per header if the result is too big). This calculation is
> currently global to the adapter. If we increase the buffer sizes 8x
> we don't want 8x the amount of memory sitting on the rings.
> Luckily we don't have to fill the rings completely; adjust
> the fill level dynamically in case a particular queue has buffers
> larger than the global size.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> [pavel: rebase on top of agg_size_fac, assert agg_size_fac]
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 28 +++++++++++++++++++----
> 1 file changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 8f42885a7c86..137e348d2b9c 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -3816,16 +3816,34 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
> }
> }
>
> +static int bnxt_rx_agg_ring_fill_level(struct bnxt *bp,
> + struct bnxt_rx_ring_info *rxr)
> +{
> + /* User may have chosen larger than default rx_page_size,
> + * we keep the ring sizes uniform and also want uniform amount
> + * of bytes consumed per ring, so cap how much of the rings we fill.
> + */
> + int fill_level = bp->rx_agg_ring_size;
> +
> + if (rxr->rx_page_size > BNXT_RX_PAGE_SIZE)
> + fill_level /= rxr->rx_page_size / BNXT_RX_PAGE_SIZE;
According to the check in bnxt_alloc_rx_page_pool() it's theoretically
possible for `rxr->rx_page_size / BNXT_RX_PAGE_SIZE` to be zero. If so,
the above would crash.
Side note: this looks like something AI review could/should catch. The
fact it didn't makes me think I'm missing something...
/P
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next v8 6/9] eth: bnxt: adjust the fill level of agg queues with larger buffers
2026-01-13 10:27 ` Paolo Abeni
@ 2026-01-13 10:41 ` Paolo Abeni
2026-01-13 10:42 ` Pavel Begunkov
1 sibling, 0 replies; 19+ messages in thread
From: Paolo Abeni @ 2026-01-13 10:41 UTC (permalink / raw)
To: Pavel Begunkov, netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Jonathan Corbet,
Michael Chan, Pavan Chebbi, Andrew Lunn, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
Ilias Apalodimas, Shuah Khan, Willem de Bruijn, Ankit Garg,
Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, David Wei,
Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
On 1/13/26 11:27 AM, Paolo Abeni wrote:
> On 1/9/26 12:28 PM, Pavel Begunkov wrote:
>> From: Jakub Kicinski <kuba@kernel.org>
>>
>> The driver tries to provision more agg buffers than header buffers
>> since multiple agg segments can reuse the same header. The calculation
>> / heuristic tries to provide enough pages for 65k of data for each header
>> (or 4 frags per header if the result is too big). This calculation is
>> currently global to the adapter. If we increase the buffer sizes 8x
>> we don't want 8x the amount of memory sitting on the rings.
>> Luckily we don't have to fill the rings completely; adjust
>> the fill level dynamically in case a particular queue has buffers
>> larger than the global size.
>>
>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>> [pavel: rebase on top of agg_size_fac, assert agg_size_fac]
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 28 +++++++++++++++++++----
>> 1 file changed, 24 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> index 8f42885a7c86..137e348d2b9c 100644
>> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> @@ -3816,16 +3816,34 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
>> }
>> }
>>
>> +static int bnxt_rx_agg_ring_fill_level(struct bnxt *bp,
>> + struct bnxt_rx_ring_info *rxr)
>> +{
>> + /* User may have chosen larger than default rx_page_size,
>> + * we keep the ring sizes uniform and also want uniform amount
>> + * of bytes consumed per ring, so cap how much of the rings we fill.
>> + */
>> + int fill_level = bp->rx_agg_ring_size;
>> +
>> + if (rxr->rx_page_size > BNXT_RX_PAGE_SIZE)
>> + fill_level /= rxr->rx_page_size / BNXT_RX_PAGE_SIZE;
>
> According to the check in bnxt_alloc_rx_page_pool() it's theoretically
> possible for `rxr->rx_page_size / BNXT_RX_PAGE_SIZE` to be zero. If so,
> the above would crash.
>
> Side note: this looks like something AI review could/should catch. The
> fact it didn't makes me think I'm missing something...
I see the next patch rejects too-small `rx_page_size` values, so
possibly the better option is to drop the confusing check in
bnxt_alloc_rx_page_pool().
/P
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next v8 6/9] eth: bnxt: adjust the fill level of agg queues with larger buffers
2026-01-13 10:27 ` Paolo Abeni
2026-01-13 10:41 ` Paolo Abeni
@ 2026-01-13 10:42 ` Pavel Begunkov
1 sibling, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2026-01-13 10:42 UTC (permalink / raw)
To: Paolo Abeni, netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Jonathan Corbet,
Michael Chan, Pavan Chebbi, Andrew Lunn, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
Ilias Apalodimas, Shuah Khan, Willem de Bruijn, Ankit Garg,
Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, David Wei,
Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
On 1/13/26 10:27, Paolo Abeni wrote:
> On 1/9/26 12:28 PM, Pavel Begunkov wrote:
>> From: Jakub Kicinski <kuba@kernel.org>
>>
>> The driver tries to provision more agg buffers than header buffers
>> since multiple agg segments can reuse the same header. The calculation
>> / heuristic tries to provide enough pages for 65k of data for each header
>> (or 4 frags per header if the result is too big). This calculation is
>> currently global to the adapter. If we increase the buffer sizes 8x
>> we don't want 8x the amount of memory sitting on the rings.
>> Luckily we don't have to fill the rings completely; adjust
>> the fill level dynamically in case a particular queue has buffers
>> larger than the global size.
>>
>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
>> [pavel: rebase on top of agg_size_fac, assert agg_size_fac]
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 28 +++++++++++++++++++----
>> 1 file changed, 24 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> index 8f42885a7c86..137e348d2b9c 100644
>> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
>> @@ -3816,16 +3816,34 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
>> }
>> }
>>
>> +static int bnxt_rx_agg_ring_fill_level(struct bnxt *bp,
>> + struct bnxt_rx_ring_info *rxr)
>> +{
>> + /* User may have chosen larger than default rx_page_size,
>> + * we keep the ring sizes uniform and also want uniform amount
>> + * of bytes consumed per ring, so cap how much of the rings we fill.
>> + */
>> + int fill_level = bp->rx_agg_ring_size;
>> +
>> + if (rxr->rx_page_size > BNXT_RX_PAGE_SIZE)
>> + fill_level /= rxr->rx_page_size / BNXT_RX_PAGE_SIZE;
>
> According to the check in bnxt_alloc_rx_page_pool() it's theoretically
> possible for `rxr->rx_page_size / BNXT_RX_PAGE_SIZE` to be zero. If so,
> the above would crash.
>
> Side note: this looks like something AI review could/should catch. The
> fact it didn't makes me think I'm missing something...
I doubt LLMs will be able to see it, but rx_page_size is no less
than BNXT_RX_PAGE_SIZE. It's either set from defaults, which is
exactly BNXT_RX_PAGE_SIZE, or given by the provider and then
checked in bnxt_validate_qcfg().
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH net-next v8 7/9] eth: bnxt: support qcfg provided rx page size
2026-01-09 11:28 [PATCH net-next v8 0/9] Add support for providers with large rx buffer Pavel Begunkov
` (5 preceding siblings ...)
2026-01-09 11:28 ` [PATCH net-next v8 6/9] eth: bnxt: adjust the fill level of agg queues with larger buffers Pavel Begunkov
@ 2026-01-09 11:28 ` Pavel Begunkov
2026-01-14 3:36 ` Jakub Kicinski
2026-01-09 11:28 ` [PATCH net-next v8 8/9] selftests: iou-zcrx: test large chunk sizes Pavel Begunkov
2026-01-09 11:28 ` [PATCH net-next v8 9/9] io_uring/zcrx: document area chunking parameter Pavel Begunkov
8 siblings, 1 reply; 19+ messages in thread
From: Pavel Begunkov @ 2026-01-09 11:28 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jonathan Corbet, Michael Chan, Pavan Chebbi, Andrew Lunn,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Joshua Washington, Harshitha Ramamurthy,
Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Alexander Duyck, Ilias Apalodimas, Shuah Khan, Willem de Bruijn,
Ankit Garg, Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, Pavel Begunkov,
David Wei, Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
Implement support for qcfg-provided rx page sizes. For that, implement
the ndo_default_qcfg callback and validate the config on restart. Also,
use the current config's value in bnxt_init_ring_struct to retain the
correct size across resets.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
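For reference, the outcomes of the new validation, assuming
BNXT_RX_PAGE_SIZE == 4K (BNXT_MAX_RX_PAGE_SIZE is BIT(15) == 32K):

	/* P5+ chips:
	 *   4K, 8K, 16K, 32K	-> 0 (accepted)
	 *   12K		-> -ERANGE (not a power of two)
	 *   64K		-> -ERANGE (above BNXT_MAX_RX_PAGE_SIZE)
	 * older chips: anything != BNXT_RX_PAGE_SIZE -> -EINVAL
	 */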
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 36 ++++++++++++++++++++++-
drivers/net/ethernet/broadcom/bnxt/bnxt.h | 1 +
2 files changed, 36 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 137e348d2b9c..3ffe4fe159d3 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4325,6 +4325,7 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
struct bnxt_rx_ring_info *rxr;
struct bnxt_tx_ring_info *txr;
struct bnxt_ring_struct *ring;
+ struct netdev_rx_queue *rxq;
if (!bnapi)
continue;
@@ -4342,7 +4343,8 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
if (!rxr)
goto skip_rx;
- rxr->rx_page_size = BNXT_RX_PAGE_SIZE;
+ rxq = __netif_get_rx_queue(bp->dev, i);
+ rxr->rx_page_size = rxq->qcfg.rx_page_size;
ring = &rxr->rx_ring_struct;
rmem = &ring->ring_mem;
@@ -15932,6 +15934,29 @@ static const struct netdev_stat_ops bnxt_stat_ops = {
.get_base_stats = bnxt_get_base_stats,
};
+static void bnxt_queue_default_qcfg(struct net_device *dev,
+ struct netdev_queue_config *qcfg)
+{
+ qcfg->rx_page_size = BNXT_RX_PAGE_SIZE;
+}
+
+static int bnxt_validate_qcfg(struct bnxt *bp, struct netdev_queue_config *qcfg)
+{
+ /* Older chips need MSS calc so rx_page_size is not supported */
+ if (!(bp->flags & BNXT_FLAG_CHIP_P5_PLUS) &&
+ qcfg->rx_page_size != BNXT_RX_PAGE_SIZE)
+ return -EINVAL;
+
+ if (!is_power_of_2(qcfg->rx_page_size))
+ return -ERANGE;
+
+ if (qcfg->rx_page_size < BNXT_RX_PAGE_SIZE ||
+ qcfg->rx_page_size > BNXT_MAX_RX_PAGE_SIZE)
+ return -ERANGE;
+
+ return 0;
+}
+
static int bnxt_queue_mem_alloc(struct net_device *dev,
struct netdev_queue_config *qcfg,
void *qmem, int idx)
@@ -15944,6 +15969,10 @@ static int bnxt_queue_mem_alloc(struct net_device *dev,
if (!bp->rx_ring)
return -ENETDOWN;
+ rc = bnxt_validate_qcfg(bp, qcfg);
+ if (rc < 0)
+ return rc;
+
rxr = &bp->rx_ring[idx];
clone = qmem;
memcpy(clone, rxr, sizeof(*rxr));
@@ -15955,6 +15984,7 @@ static int bnxt_queue_mem_alloc(struct net_device *dev,
clone->rx_sw_agg_prod = 0;
clone->rx_next_cons = 0;
clone->need_head_pool = false;
+ clone->rx_page_size = qcfg->rx_page_size;
rc = bnxt_alloc_rx_page_pool(bp, clone, rxr->page_pool->p.nid);
if (rc)
@@ -16081,6 +16111,8 @@ static void bnxt_copy_rx_ring(struct bnxt *bp,
src_ring = &src->rx_agg_ring_struct;
src_rmem = &src_ring->ring_mem;
+ dst->rx_page_size = src->rx_page_size;
+
WARN_ON(dst_rmem->nr_pages != src_rmem->nr_pages);
WARN_ON(dst_rmem->page_size != src_rmem->page_size);
WARN_ON(dst_rmem->flags != src_rmem->flags);
@@ -16235,6 +16267,8 @@ static const struct netdev_queue_mgmt_ops bnxt_queue_mgmt_ops = {
.ndo_queue_mem_free = bnxt_queue_mem_free,
.ndo_queue_start = bnxt_queue_start,
.ndo_queue_stop = bnxt_queue_stop,
+ .ndo_default_qcfg = bnxt_queue_default_qcfg,
+ .supported_params = QCFG_RX_PAGE_SIZE,
};
static void bnxt_remove_one(struct pci_dev *pdev)
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 4c880a9fba92..d245eefbbdda 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -760,6 +760,7 @@ struct nqe_cn {
#endif
#define BNXT_RX_PAGE_SIZE (1 << BNXT_RX_PAGE_SHIFT)
+#define BNXT_MAX_RX_PAGE_SIZE BIT(15)
#define BNXT_MAX_MTU 9500
--
2.52.0
^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v8 7/9] eth: bnxt: support qcfg provided rx page size
2026-01-09 11:28 ` [PATCH net-next v8 7/9] eth: bnxt: support qcfg provided rx page size Pavel Begunkov
@ 2026-01-14 3:36 ` Jakub Kicinski
2026-01-15 17:10 ` Pavel Begunkov
0 siblings, 1 reply; 19+ messages in thread
From: Jakub Kicinski @ 2026-01-14 3:36 UTC (permalink / raw)
To: Pavel Begunkov
Cc: netdev, David S . Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Michael Chan, Pavan Chebbi, Andrew Lunn,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Joshua Washington, Harshitha Ramamurthy,
Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Alexander Duyck, Ilias Apalodimas, Shuah Khan, Willem de Bruijn,
Ankit Garg, Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, David Wei,
Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
On Fri, 9 Jan 2026 11:28:46 +0000 Pavel Begunkov wrote:
> @@ -4342,7 +4343,8 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
> if (!rxr)
> goto skip_rx;
>
> - rxr->rx_page_size = BNXT_RX_PAGE_SIZE;
> + rxq = __netif_get_rx_queue(bp->dev, i);
> + rxr->rx_page_size = rxq->qcfg.rx_page_size;
Pretty sure I asked for the netdev_queue_config() helper to make
a return, instead of drivers poking directly into core state.
Having the config live in rxq directly is also ugh.
But at this stage we're probably better off if you just respin
to fix the nits from Paolo and I try to de-lobotomize the driver-facing
API. This is close enough.
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next v8 7/9] eth: bnxt: support qcfg provided rx page size
2026-01-14 3:36 ` Jakub Kicinski
@ 2026-01-15 17:10 ` Pavel Begunkov
0 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2026-01-15 17:10 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, David S . Miller, Eric Dumazet, Paolo Abeni,
Jonathan Corbet, Michael Chan, Pavan Chebbi, Andrew Lunn,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Joshua Washington, Harshitha Ramamurthy,
Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Alexander Duyck, Ilias Apalodimas, Shuah Khan, Willem de Bruijn,
Ankit Garg, Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, David Wei,
Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
On 1/14/26 03:36, Jakub Kicinski wrote:
> On Fri, 9 Jan 2026 11:28:46 +0000 Pavel Begunkov wrote:
>> @@ -4342,7 +4343,8 @@ static void bnxt_init_ring_struct(struct bnxt *bp)
>> if (!rxr)
>> goto skip_rx;
>>
>> - rxr->rx_page_size = BNXT_RX_PAGE_SIZE;
>> + rxq = __netif_get_rx_queue(bp->dev, i);
>> + rxr->rx_page_size = rxq->qcfg.rx_page_size;
>
> Pretty sure I asked for the netdev_queue_config() helper to make
> a return, instead of drivers poking directly into core state.
> Having the config live in rxq directly is also ugh.
Having a helper would be a good idea, but I went for stashing
configs in the queue as it's simpler, and dynamic allocation brought
no benefit for this series. Maybe there are further plans for it,
but as you mentioned, it'd be better done on top.
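To sketch what that could look like on top: a hypothetical accessor,
assuming the config keeps living in struct netdev_rx_queue as in this
series (not part of any posted patch):

	static inline const struct netdev_queue_config *
	netdev_queue_config(struct net_device *dev, unsigned int rxq_idx)
	{
		return &__netif_get_rx_queue(dev, rxq_idx)->qcfg;
	}

bnxt_init_ring_struct() could then do

	rxr->rx_page_size = netdev_queue_config(bp->dev, i)->rx_page_size;

without touching the queue struct directly.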
> But at this stage we're probably better off if you just respin
> to fix the nits from Paolo and I try to de-lobotomize the driver-facing
> API. This is close enough.
Ok
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH net-next v8 8/9] selftests: iou-zcrx: test large chunk sizes
2026-01-09 11:28 [PATCH net-next v8 0/9] Add support for providers with large rx buffer Pavel Begunkov
` (6 preceding siblings ...)
2026-01-09 11:28 ` [PATCH net-next v8 7/9] eth: bnxt: support qcfg provided rx page size Pavel Begunkov
@ 2026-01-09 11:28 ` Pavel Begunkov
2026-01-13 10:34 ` Paolo Abeni
2026-01-09 11:28 ` [PATCH net-next v8 9/9] io_uring/zcrx: document area chunking parameter Pavel Begunkov
8 siblings, 1 reply; 19+ messages in thread
From: Pavel Begunkov @ 2026-01-09 11:28 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jonathan Corbet, Michael Chan, Pavan Chebbi, Andrew Lunn,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Joshua Washington, Harshitha Ramamurthy,
Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Alexander Duyck, Ilias Apalodimas, Shuah Khan, Willem de Bruijn,
Ankit Garg, Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, Pavel Begunkov,
David Wei, Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
Add a test using large chunks for the zcrx memory area.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
.../selftests/drivers/net/hw/iou-zcrx.c | 72 +++++++++++++++----
.../selftests/drivers/net/hw/iou-zcrx.py | 37 ++++++++++
2 files changed, 97 insertions(+), 12 deletions(-)
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 62456df947bc..0a19b573f4f5 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -12,6 +12,7 @@
#include <unistd.h>
#include <arpa/inet.h>
+#include <linux/mman.h>
#include <linux/errqueue.h>
#include <linux/if_packet.h>
#include <linux/ipv6.h>
@@ -37,6 +38,23 @@
#include <liburing.h>
+#define SKIP_CODE 42
+
+struct t_io_uring_zcrx_ifq_reg {
+ __u32 if_idx;
+ __u32 if_rxq;
+ __u32 rq_entries;
+ __u32 flags;
+
+ __u64 area_ptr; /* pointer to struct io_uring_zcrx_area_reg */
+ __u64 region_ptr; /* struct io_uring_region_desc * */
+
+ struct io_uring_zcrx_offsets offsets;
+ __u32 zcrx_id;
+ __u32 rx_buf_len;
+ __u64 __resv[3];
+};
+
static long page_size;
#define AREA_SIZE (8192 * page_size)
#define SEND_SIZE (512 * 4096)
@@ -65,6 +83,8 @@ static bool cfg_oneshot;
static int cfg_oneshot_recvs;
static int cfg_send_size = SEND_SIZE;
static struct sockaddr_in6 cfg_addr;
+static unsigned cfg_rx_buf_len;
+static bool cfg_dry_run;
static char *payload;
static void *area_ptr;
@@ -128,14 +148,28 @@ static void setup_zcrx(struct io_uring *ring)
if (!ifindex)
error(1, 0, "bad interface name: %s", cfg_ifname);
- area_ptr = mmap(NULL,
- AREA_SIZE,
- PROT_READ | PROT_WRITE,
- MAP_ANONYMOUS | MAP_PRIVATE,
- 0,
- 0);
- if (area_ptr == MAP_FAILED)
- error(1, 0, "mmap(): zero copy area");
+ if (cfg_rx_buf_len && cfg_rx_buf_len != page_size) {
+ area_ptr = mmap(NULL,
+ AREA_SIZE,
+ PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE |
+ MAP_HUGETLB | MAP_HUGE_2MB,
+ -1,
+ 0);
+ if (area_ptr == MAP_FAILED) {
+ printf("Can't allocate huge pages\n");
+ exit(SKIP_CODE);
+ }
+ } else {
+ area_ptr = mmap(NULL,
+ AREA_SIZE,
+ PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE,
+ 0,
+ 0);
+ if (area_ptr == MAP_FAILED)
+ error(1, 0, "mmap(): zero copy area");
+ }
ring_size = get_refill_ring_size(rq_entries);
ring_ptr = mmap(NULL,
@@ -157,17 +191,23 @@ static void setup_zcrx(struct io_uring *ring)
.flags = 0,
};
- struct io_uring_zcrx_ifq_reg reg = {
+ struct t_io_uring_zcrx_ifq_reg reg = {
.if_idx = ifindex,
.if_rxq = cfg_queue_id,
.rq_entries = rq_entries,
.area_ptr = (__u64)(unsigned long)&area_reg,
.region_ptr = (__u64)(unsigned long)&region_reg,
+ .rx_buf_len = cfg_rx_buf_len,
};
- ret = io_uring_register_ifq(ring, &reg);
- if (ret)
+ ret = io_uring_register_ifq(ring, (void *)&reg);
+ if (cfg_rx_buf_len && (ret == -EINVAL || ret == -EOPNOTSUPP ||
+ ret == -ERANGE)) {
+ printf("Large chunks are not supported %i\n", ret);
+ exit(SKIP_CODE);
+ } else if (ret) {
error(1, 0, "io_uring_register_ifq(): %d", ret);
+ }
rq_ring.khead = (unsigned int *)((char *)ring_ptr + reg.offsets.head);
rq_ring.ktail = (unsigned int *)((char *)ring_ptr + reg.offsets.tail);
@@ -323,6 +363,8 @@ static void run_server(void)
io_uring_queue_init(512, &ring, flags);
setup_zcrx(&ring);
+ if (cfg_dry_run)
+ return;
add_accept(&ring, fd);
@@ -383,7 +425,7 @@ static void parse_opts(int argc, char **argv)
usage(argv[0]);
cfg_payload_len = max_payload_len;
- while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:")) != -1) {
+ while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:x:d")) != -1) {
switch (c) {
case 's':
if (cfg_client)
@@ -418,6 +460,12 @@ static void parse_opts(int argc, char **argv)
case 'z':
cfg_send_size = strtoul(optarg, NULL, 0);
break;
+ case 'x':
+ cfg_rx_buf_len = page_size * strtoul(optarg, NULL, 0);
+ break;
+ case 'd':
+ cfg_dry_run = true;
+ break;
}
}
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
index 712c806508b5..83061b27f2f2 100755
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
@@ -7,6 +7,7 @@ from lib.py import ksft_run, ksft_exit, KsftSkipEx
from lib.py import NetDrvEpEnv
from lib.py import bkg, cmd, defer, ethtool, rand_port, wait_port_listen
+SKIP_CODE = 42
def _get_current_settings(cfg):
output = ethtool(f"-g {cfg.ifname}", json=True)[0]
@@ -132,6 +133,42 @@ def test_zcrx_rss(cfg) -> None:
cmd(tx_cmd, host=cfg.remote)
+def test_zcrx_large_chunks(cfg) -> None:
+ cfg.require_ipver('6')
+
+ combined_chans = _get_combined_channels(cfg)
+ if combined_chans < 2:
+ raise KsftSkipEx('at least 2 combined channels required')
+ (rx_ring, hds_thresh) = _get_current_settings(cfg)
+ port = rand_port()
+
+ ethtool(f"-G {cfg.ifname} tcp-data-split on")
+ defer(ethtool, f"-G {cfg.ifname} tcp-data-split auto")
+
+ ethtool(f"-G {cfg.ifname} hds-thresh 0")
+ defer(ethtool, f"-G {cfg.ifname} hds-thresh {hds_thresh}")
+
+ ethtool(f"-G {cfg.ifname} rx 64")
+ defer(ethtool, f"-G {cfg.ifname} rx {rx_ring}")
+
+ ethtool(f"-X {cfg.ifname} equal {combined_chans - 1}")
+ defer(ethtool, f"-X {cfg.ifname} default")
+
+ flow_rule_id = _set_flow_rule(cfg, port, combined_chans - 1)
+ defer(ethtool, f"-N {cfg.ifname} delete {flow_rule_id}")
+
+ rx_cmd = f"{cfg.bin_local} -s -p {port} -i {cfg.ifname} -q {combined_chans - 1} -x 2"
+ tx_cmd = f"{cfg.bin_remote} -c -h {cfg.addr_v['6']} -p {port} -l 12840"
+
+ probe = cmd(rx_cmd + " -d", fail=False)
+ if probe.ret == SKIP_CODE:
+ raise KsftSkipEx(probe.stdout)
+
+ with bkg(rx_cmd, exit_wait=True):
+ wait_port_listen(port, proto="tcp")
+ cmd(tx_cmd, host=cfg.remote)
+
+
def main() -> None:
with NetDrvEpEnv(__file__) as cfg:
cfg.bin_local = path.abspath(path.dirname(__file__) + "/../../../drivers/net/hw/iou-zcrx")
--
2.52.0
^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v8 8/9] selftests: iou-zcrx: test large chunk sizes
2026-01-09 11:28 ` [PATCH net-next v8 8/9] selftests: iou-zcrx: test large chunk sizes Pavel Begunkov
@ 2026-01-13 10:34 ` Paolo Abeni
2026-01-13 10:48 ` Pavel Begunkov
0 siblings, 1 reply; 19+ messages in thread
From: Paolo Abeni @ 2026-01-13 10:34 UTC (permalink / raw)
To: Pavel Begunkov, netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Jonathan Corbet,
Michael Chan, Pavan Chebbi, Andrew Lunn, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
Ilias Apalodimas, Shuah Khan, Willem de Bruijn, Ankit Garg,
Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, David Wei,
Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
On 1/9/26 12:28 PM, Pavel Begunkov wrote:
> @@ -65,6 +83,8 @@ static bool cfg_oneshot;
> static int cfg_oneshot_recvs;
> static int cfg_send_size = SEND_SIZE;
> static struct sockaddr_in6 cfg_addr;
> +static unsigned cfg_rx_buf_len;
Checkpatch prefers 'unsigned int' above
> @@ -132,6 +133,42 @@ def test_zcrx_rss(cfg) -> None:
> cmd(tx_cmd, host=cfg.remote)
>
>
> +def test_zcrx_large_chunks(cfg) -> None:
pylint laments the lack of a docstring. Perhaps explicitly silence the
warning?
/P
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH net-next v8 8/9] selftests: iou-zcrx: test large chunk sizes
2026-01-13 10:34 ` Paolo Abeni
@ 2026-01-13 10:48 ` Pavel Begunkov
0 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2026-01-13 10:48 UTC (permalink / raw)
To: Paolo Abeni, netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Jonathan Corbet,
Michael Chan, Pavan Chebbi, Andrew Lunn, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Joshua Washington, Harshitha Ramamurthy, Saeed Mahameed,
Tariq Toukan, Mark Bloch, Leon Romanovsky, Alexander Duyck,
Ilias Apalodimas, Shuah Khan, Willem de Bruijn, Ankit Garg,
Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, David Wei,
Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
On 1/13/26 10:34, Paolo Abeni wrote:
> On 1/9/26 12:28 PM, Pavel Begunkov wrote:
>> @@ -65,6 +83,8 @@ static bool cfg_oneshot;
>> static int cfg_oneshot_recvs;
>> static int cfg_send_size = SEND_SIZE;
>> static struct sockaddr_in6 cfg_addr;
>> +static unsigned cfg_rx_buf_len;
>
> Checkpatch prefers 'unsigned int' above
>
>> @@ -132,6 +133,42 @@ def test_zcrx_rss(cfg) -> None:
>> cmd(tx_cmd, host=cfg.remote)
>>
>>
>> +def test_zcrx_large_chunks(cfg) -> None:
>
> pylint laments the lack of docstring. Perhaps explicitly silencing the
> warning?
fwiw, I left it be because all other functions in the file
have exactly the same problem.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH net-next v8 9/9] io_uring/zcrx: document area chunking parameter
2026-01-09 11:28 [PATCH net-next v8 0/9] Add support for providers with large rx buffer Pavel Begunkov
` (7 preceding siblings ...)
2026-01-09 11:28 ` [PATCH net-next v8 8/9] selftests: iou-zcrx: test large chunk sizes Pavel Begunkov
@ 2026-01-09 11:28 ` Pavel Begunkov
8 siblings, 0 replies; 19+ messages in thread
From: Pavel Begunkov @ 2026-01-09 11:28 UTC (permalink / raw)
To: netdev
Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jonathan Corbet, Michael Chan, Pavan Chebbi, Andrew Lunn,
Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
John Fastabend, Joshua Washington, Harshitha Ramamurthy,
Saeed Mahameed, Tariq Toukan, Mark Bloch, Leon Romanovsky,
Alexander Duyck, Ilias Apalodimas, Shuah Khan, Willem de Bruijn,
Ankit Garg, Tim Hostetler, Alok Tiwari, Ziwei Xiao, John Fraker,
Praveen Kaligineedi, Mohsin Bashir, Joe Damato, Mina Almasry,
Dimitri Daskalakis, Stanislav Fomichev, Kuniyuki Iwashima,
Samiullah Khawaja, Ahmed Zaki, Alexander Lobakin, Pavel Begunkov,
David Wei, Yue Haibing, Haiyue Wang, Jens Axboe, Simon Horman,
Vishwanath Seshagiri, linux-doc, linux-kernel, bpf, linux-rdma,
linux-kselftest, dtatulea, io-uring
struct io_uring_zcrx_ifq_reg::rx_buf_len is used as a hint telling the
kernel what buffer size it should use. Document the API and its
limitations.
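For illustration, a registration-time usage sketch mirroring the
selftest in patch 8/9 (area/region setup elided; retrying with zero is
an assumption following the documented fallback convention):

	struct io_uring_zcrx_ifq_reg reg = {
		.if_idx = ifindex,
		.if_rxq = queue_id,
		.rq_entries = rq_entries,
		.area_ptr = (__u64)(unsigned long)&area_reg,
		.region_ptr = (__u64)(unsigned long)&region_reg,
		/* hint: 2-page chunks; 0 keeps the page size default */
		.rx_buf_len = 2 * page_size,
	};

	ret = io_uring_register_ifq(ring, &reg);
	if (ret == -EINVAL || ret == -EOPNOTSUPP || ret == -ERANGE) {
		/* large chunks not supported, fall back to defaults */
		reg.rx_buf_len = 0;
		ret = io_uring_register_ifq(ring, &reg);
	}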
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
Documentation/networking/iou-zcrx.rst | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst
index 54a72e172bdc..7f3f4b2e6cf2 100644
--- a/Documentation/networking/iou-zcrx.rst
+++ b/Documentation/networking/iou-zcrx.rst
@@ -196,6 +196,26 @@ Return buffers back to the kernel to be used again::
rqe->len = cqe->res;
IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);
+Area chunking
+-------------
+
+zcrx splits the memory area into fixed-length physically contiguous chunks.
+This limits the maximum buffer size returned in a single io_uring CQE. Users
+can provide a hint to the kernel to use larger chunks by setting the
+``rx_buf_len`` field of ``struct io_uring_zcrx_ifq_reg`` to the desired length
+during registration. If this field is set to zero, the kernel defaults to
+the system page size.
+
+To use larger sizes, the memory area must be backed by physically contiguous
+ranges whose sizes are multiples of ``rx_buf_len``. It also requires kernel
+and hardware support. If registration fails, users are generally expected to
+fall back to defaults by setting ``rx_buf_len`` to zero.
+
+Larger chunks don't give any additional guarantees about the buffer sizes
+returned in CQEs, which can vary depending on many factors such as traffic
+pattern, hardware offload, etc. Using larger chunks doesn't require any
+application changes beyond zcrx registration.
+
Testing
=======
--
2.52.0
^ permalink raw reply related [flat|nested] 19+ messages in thread