* [PATCH net-next v11 00/21] io_uring zero copy rx
@ 2025-01-16 23:16 David Wei
2025-01-16 23:16 ` [PATCH net-next v11 01/21] net: page_pool: don't cast mp param to devmem David Wei
` (21 more replies)
0 siblings, 22 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
This patchset adds support for zero copy rx into userspace pages using
io_uring, eliminating a kernel to user copy.
We configure a page pool that a driver uses to fill a hw rx queue to
hand out user pages instead of kernel pages. Any data that ends up
hitting this hw rx queue will thus be dma'd into userspace memory
directly, without needing to be bounced through kernel memory. 'Reading'
data out of a socket instead becomes a _notification_ mechanism, where
the kernel tells userspace where the data is. The overall approach is
similar to the devmem TCP proposal.
This relies on hw header/data split, flow steering and RSS to ensure
packet headers remain in kernel memory and only desired flows hit a hw
rx queue configured for zero copy. Configuring this is outside of the
scope of this patchset.
We share netdev core infra with devmem TCP. The main difference is that
io_uring is used for the uAPI and the lifetime of all objects are bound
to an io_uring instance. Data is 'read' using a new io_uring request
type. When done, data is returned via a new shared refill queue. A zero
copy page pool refills a hw rx queue from this refill queue directly. Of
course, the lifetime of these data buffers are managed by io_uring
rather than the networking stack, with different refcounting rules.
This patchset is the first step adding basic zero copy support. We will
extend this iteratively with new features e.g. dynamically allocated
zero copy areas, THP support, dmabuf support, improved copy fallback,
general optimisations and more.
In terms of netdev support, we're first targeting Broadcom bnxt. Patches
aren't included since Taehee Yoo has already sent a more comprehensive
patchset adding support in [1]. Google gve should already support this,
and Mellanox mlx5 support is WIP pending driver changes.
===========
Performance
===========
Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet.
Test setup:
* AMD EPYC 9454
* Broadcom BCM957508 200G
* Kernel v6.11 base [2]
* liburing fork [3]
* kperf fork [4]
* 4K MTU
* Single TCP flow
With application thread + net rx softirq pinned to _different_ cores:
+-------------------------------+
| epoll | io_uring |
|-----------|-------------------|
| 82.2 Gbps | 116.2 Gbps (+41%) |
+-------------------------------+
Pinned to _same_ core:
+-------------------------------+
| epoll | io_uring |
|-----------|-------------------|
| 62.6 Gbps | 80.9 Gbps (+29%) |
+-------------------------------+
=====
Links
=====
Broadcom bnxt support:
[1]: https://lore.kernel.org/netdev/[email protected]/
Linux kernel branch:
[2]: https://github.com/spikeh/linux.git zcrx/v9
liburing for testing:
[3]: https://github.com/isilence/liburing.git zcrx/next
kperf for testing:
[4]: https://git.kernel.dk/kperf.git
Changes in v11:
---------------
* Add a shim provider helper for page_pool_set_dma_addr_netmem()
* Drop netdev in ->uninstall, pin struct device instead
* Add net_mp_open_rxq() and net_mp_close_rxq()
* Remove unneeded CFLAGS += -I/usr/include/ in selftest Makefile
Changes in v10:
---------------
* Fix !CONFIG_PAGE_POOL build
* Use acquire/release for RQ in examples
* Fix page_pool_ref_netmem for net_iov
* Move provider helpers / definitions into a new file
* Don’t export page_pool_{set,clear}_pp_info, introduce
net_mp_niov_{set,clear}_page_pool() instead
* Remove devmem.h from net/core/page_pool_user.c
* Add Netdev yaml for io-uring attribute
* Add memory provider ops for filling in Netlink info
Changes in v9:
--------------
* Fail proof against multiple page pools running the same memory
provider
* Lock the consumer side of the refill queue.
* Move scrub into io_uring exit.
* Kill napi_execute.
* Kill area init api and export finer grained net helpers as partial
init now need to happen in ->alloc_netmems()
* Separate user refcounting.
* Fix copy fallback path math.
* Add rodata check to page_pool_init()
* Fix incorrect path in documentation
Changes in v8:
--------------
* add documentation and selftest
* use io_uring regions for the refill ring
Changes in v7:
--------------
net:
* Use NAPI_F_PREFER_BUSY_POLL for napi_execute + stylistics changes.
Changes in v6:
--------------
Please note: Comparison with TCP_ZEROCOPY_RECEIVE isn't done yet.
net:
* Drop a devmem.h clean up patch.
* Migrate to netdev_get_by_index from deprecated API.
* Fix !CONFIG_NET_DEVMEM build.
* Don’t return into the page pool cache directly, use a new helper
* Refactor napi_execute
io_uring:
* Require IORING_RECV_MULTISHOT flag set.
* Add unselectable CONFIG_IO_URING_ZCRX.
* Pulled latest io_uring changes.
* Unexport io_uring_pp_zc_ops.
Changes in v5:
--------------
* Rebase on top of merged net_iov + netmem infra.
* Decouple net_iov from devmem TCP.
* Use netdev queue API to allocate an rx queue.
* Minor uAPI enhancements for future extensibility.
* QoS improvements with request throttling.
Changes in RFC v4:
------------------
* Rebased on top of Mina Almasry's TCP devmem patchset and latest
net-next, now sharing common infra e.g.:
* netmem_t and net_iovs
* Page pool memory provider
* The registered buffer (rbuf) completion queue where completions from
io_recvzc requests are posted is removed. Now these post into the main
completion queue, using big (32-byte) CQEs. The first 16 bytes is an
ordinary CQE, while the latter 16 bytes contain the io_uring_rbuf_cqe
as before. This vastly simplifies the uAPI and removes a level of
indirection in userspace when looking for payloads.
* The rbuf refill queue is still needed for userspace to return
buffers to kernel.
* Simplified code and uAPI on the io_uring side, particularly
io_recvzc() and io_zc_rx_recv(). Many unnecessary lines were removed
e.g. extra msg flags, readlen, etc.
Changes in RFC v3:
------------------
* Rebased on top of Jakub Kicinski's memory provider API RFC. The ZC
pool added is now a backend for memory provider.
* We're also reusing ppiov infrastructure. The refcounting rules stay
the same but it's shifted into ppiov->refcount. That lets us to
flexibly manage buffer lifetimes without adding any extra code to the
common networking paths. It'd also make it easier to support dmabufs
and device memory in the future.
* io_uring also knows about pages, and so ppiovs might unnecessarily
break tools inspecting data, that can easily be solved later.
Many patches are not for upstream as they depend on work in progress,
namely from Mina:
* struct netmem_t
* Driver ndo commands for Rx queue configs
* struct page_pool_iov and shared pp infra
Changes in RFC v2:
------------------
* Added copy fallback support if userspace memory allocated for ZC Rx
runs out, or if header splitting or flow steering fails.
* Added veth support for ZC Rx, for testing and demonstration. We will
need to figure out what driver would be best for such testing
functionality in the future. Perhaps netdevsim?
* Added socket registration API to io_uring to associate specific
sockets with ifqs/Rx queues for ZC.
* Added multi-socket support, such that multiple connections can be
steered into the same hardware Rx queue.
* Added Netbench server/client support.
David Wei (9):
netdev: add io_uring memory provider info
net: page_pool: add memory provider helpers
net: add helpers for setting a memory provider on an rx queue
io_uring/zcrx: add interface queue and refill queue
io_uring/zcrx: add io_zcrx_area
io_uring/zcrx: add io_recvzc request
io_uring/zcrx: set pp memory provider for an rx queue
net: add documentation for io_uring zcrx
io_uring/zcrx: add selftest
Pavel Begunkov (12):
net: page_pool: don't cast mp param to devmem
net: prefix devmem specific helpers
net: generalise net_iov chunk owners
net: page_pool: create hooks for custom memory providers
net: page_pool: add callback for mp info printing
net: page_pool: add a mp hook to unregister_netdevice*
net: prepare for non devmem TCP memory providers
io_uring/zcrx: grab a net device
io_uring/zcrx: implement zerocopy receive pp memory provider
io_uring/zcrx: dma-map area for the device
io_uring/zcrx: throttle receive requests
io_uring/zcrx: add copy fallback
Documentation/netlink/specs/netdev.yaml | 15 +
Documentation/networking/index.rst | 1 +
Documentation/networking/iou-zcrx.rst | 201 ++++
Kconfig | 2 +
include/linux/io_uring_types.h | 6 +
include/net/netmem.h | 21 +-
include/net/page_pool/memory_provider.h | 45 +
include/net/page_pool/types.h | 4 +
include/uapi/linux/io_uring.h | 54 +-
include/uapi/linux/netdev.h | 8 +
io_uring/KConfig | 10 +
io_uring/Makefile | 1 +
io_uring/io_uring.c | 7 +
io_uring/io_uring.h | 10 +
io_uring/memmap.h | 1 +
io_uring/net.c | 74 ++
io_uring/opdef.c | 16 +
io_uring/register.c | 7 +
io_uring/rsrc.c | 2 +-
io_uring/rsrc.h | 1 +
io_uring/zcrx.c | 947 ++++++++++++++++++
io_uring/zcrx.h | 73 ++
net/core/dev.c | 16 +-
net/core/devmem.c | 93 +-
net/core/devmem.h | 49 +-
net/core/netdev-genl.c | 11 +-
net/core/netdev_rx_queue.c | 66 ++
net/core/page_pool.c | 51 +-
net/core/page_pool_user.c | 7 +-
net/ipv4/tcp.c | 7 +-
tools/include/uapi/linux/netdev.h | 8 +
.../selftests/drivers/net/hw/.gitignore | 2 +
.../testing/selftests/drivers/net/hw/Makefile | 5 +
.../selftests/drivers/net/hw/iou-zcrx.c | 432 ++++++++
.../selftests/drivers/net/hw/iou-zcrx.py | 64 ++
35 files changed, 2234 insertions(+), 83 deletions(-)
create mode 100644 Documentation/networking/iou-zcrx.rst
create mode 100644 include/net/page_pool/memory_provider.h
create mode 100644 io_uring/KConfig
create mode 100644 io_uring/zcrx.c
create mode 100644 io_uring/zcrx.h
create mode 100644 tools/testing/selftests/drivers/net/hw/iou-zcrx.c
create mode 100755 tools/testing/selftests/drivers/net/hw/iou-zcrx.py
--
2.43.5
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH net-next v11 01/21] net: page_pool: don't cast mp param to devmem
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 02/21] net: prefix devmem specific helpers David Wei
` (20 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
From: Pavel Begunkov <[email protected]>
page_pool_check_memory_provider() is a generic path and shouldn't assume
anything about the actual type of the memory provider argument. It's
fine while devmem is the only provider, but cast away the devmem
specific binding types to avoid confusion.
Reviewed-by: Jakub Kicinski <[email protected]>
Reviewed-by: Mina Almasry <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
net/core/page_pool_user.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
index 48335766c1bf..8d31c71bea1a 100644
--- a/net/core/page_pool_user.c
+++ b/net/core/page_pool_user.c
@@ -353,7 +353,7 @@ void page_pool_unlist(struct page_pool *pool)
int page_pool_check_memory_provider(struct net_device *dev,
struct netdev_rx_queue *rxq)
{
- struct net_devmem_dmabuf_binding *binding = rxq->mp_params.mp_priv;
+ void *binding = rxq->mp_params.mp_priv;
struct page_pool *pool;
struct hlist_node *n;
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 02/21] net: prefix devmem specific helpers
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
2025-01-16 23:16 ` [PATCH net-next v11 01/21] net: page_pool: don't cast mp param to devmem David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 03/21] net: generalise net_iov chunk owners David Wei
` (19 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
From: Pavel Begunkov <[email protected]>
Add prefixes to all helpers that are specific to devmem TCP, i.e.
net_iov_binding[_id].
Reviewed-by: Jakub Kicinski <[email protected]>
Reviewed-by: Mina Almasry <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
net/core/devmem.c | 2 +-
net/core/devmem.h | 14 +++++++-------
net/ipv4/tcp.c | 2 +-
3 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 0b6ed7525b22..5e1a05082ab8 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -93,7 +93,7 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
void net_devmem_free_dmabuf(struct net_iov *niov)
{
- struct net_devmem_dmabuf_binding *binding = net_iov_binding(niov);
+ struct net_devmem_dmabuf_binding *binding = net_devmem_iov_binding(niov);
unsigned long dma_addr = net_devmem_get_dma_addr(niov);
if (WARN_ON(!gen_pool_has_addr(binding->chunk_pool, dma_addr,
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 76099ef9c482..99782ddeca40 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -86,11 +86,16 @@ static inline unsigned int net_iov_idx(const struct net_iov *niov)
}
static inline struct net_devmem_dmabuf_binding *
-net_iov_binding(const struct net_iov *niov)
+net_devmem_iov_binding(const struct net_iov *niov)
{
return net_iov_owner(niov)->binding;
}
+static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
+{
+ return net_devmem_iov_binding(niov)->id;
+}
+
static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
{
struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
@@ -99,11 +104,6 @@ static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
((unsigned long)net_iov_idx(niov) << PAGE_SHIFT);
}
-static inline u32 net_iov_binding_id(const struct net_iov *niov)
-{
- return net_iov_owner(niov)->binding->id;
-}
-
static inline void
net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding)
{
@@ -171,7 +171,7 @@ static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
return 0;
}
-static inline u32 net_iov_binding_id(const struct net_iov *niov)
+static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
{
return 0;
}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0d704bda6c41..b872de9a8271 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2494,7 +2494,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
/* Will perform the exchange later */
dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
- dmabuf_cmsg.dmabuf_id = net_iov_binding_id(niov);
+ dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
offset += copy;
remaining_len -= copy;
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 03/21] net: generalise net_iov chunk owners
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
2025-01-16 23:16 ` [PATCH net-next v11 01/21] net: page_pool: don't cast mp param to devmem David Wei
2025-01-16 23:16 ` [PATCH net-next v11 02/21] net: prefix devmem specific helpers David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 04/21] net: page_pool: create hooks for custom memory providers David Wei
` (18 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
From: Pavel Begunkov <[email protected]>
Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
which serves as a useful abstraction to share data and provide a
context. However, it's too devmem specific, and we want to reuse it for
other memory providers, and for that we need to decouple net_iov from
devmem. Make net_iov to point to a new base structure called
net_iov_area, which dmabuf_genpool_chunk_owner extends.
Reviewed-by: Mina Almasry <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/netmem.h | 21 ++++++++++++++++++++-
net/core/devmem.c | 25 +++++++++++++------------
net/core/devmem.h | 25 +++++++++----------------
3 files changed, 42 insertions(+), 29 deletions(-)
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 1b58faa4f20f..c61d5b21e7b4 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -24,11 +24,20 @@ struct net_iov {
unsigned long __unused_padding;
unsigned long pp_magic;
struct page_pool *pp;
- struct dmabuf_genpool_chunk_owner *owner;
+ struct net_iov_area *owner;
unsigned long dma_addr;
atomic_long_t pp_ref_count;
};
+struct net_iov_area {
+ /* Array of net_iovs for this area. */
+ struct net_iov *niovs;
+ size_t num_niovs;
+
+ /* Offset into the dma-buf where this chunk starts. */
+ unsigned long base_virtual;
+};
+
/* These fields in struct page are used by the page_pool and net stack:
*
* struct {
@@ -54,6 +63,16 @@ NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
#undef NET_IOV_ASSERT_OFFSET
+static inline struct net_iov_area *net_iov_owner(const struct net_iov *niov)
+{
+ return niov->owner;
+}
+
+static inline unsigned int net_iov_idx(const struct net_iov *niov)
+{
+ return niov - net_iov_owner(niov)->niovs;
+}
+
/* netmem */
/**
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 5e1a05082ab8..c250db6993d3 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -32,14 +32,15 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
{
struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
- kvfree(owner->niovs);
+ kvfree(owner->area.niovs);
kfree(owner);
}
static dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)
{
- struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
+ struct dmabuf_genpool_chunk_owner *owner;
+ owner = net_devmem_iov_to_chunk_owner(niov);
return owner->base_dma_addr +
((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT);
}
@@ -82,7 +83,7 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
offset = dma_addr - owner->base_dma_addr;
index = offset / PAGE_SIZE;
- niov = &owner->niovs[index];
+ niov = &owner->area.niovs[index];
niov->pp_magic = 0;
niov->pp = NULL;
@@ -250,9 +251,9 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
goto err_free_chunks;
}
- owner->base_virtual = virtual;
+ owner->area.base_virtual = virtual;
owner->base_dma_addr = dma_addr;
- owner->num_niovs = len / PAGE_SIZE;
+ owner->area.num_niovs = len / PAGE_SIZE;
owner->binding = binding;
err = gen_pool_add_owner(binding->chunk_pool, dma_addr,
@@ -264,17 +265,17 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
goto err_free_chunks;
}
- owner->niovs = kvmalloc_array(owner->num_niovs,
- sizeof(*owner->niovs),
- GFP_KERNEL);
- if (!owner->niovs) {
+ owner->area.niovs = kvmalloc_array(owner->area.num_niovs,
+ sizeof(*owner->area.niovs),
+ GFP_KERNEL);
+ if (!owner->area.niovs) {
err = -ENOMEM;
goto err_free_chunks;
}
- for (i = 0; i < owner->num_niovs; i++) {
- niov = &owner->niovs[i];
- niov->owner = owner;
+ for (i = 0; i < owner->area.num_niovs; i++) {
+ niov = &owner->area.niovs[i];
+ niov->owner = &owner->area;
page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
net_devmem_get_dma_addr(niov));
}
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 99782ddeca40..a2b9913e9a17 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -10,6 +10,8 @@
#ifndef _NET_DEVMEM_H
#define _NET_DEVMEM_H
+#include <net/netmem.h>
+
struct netlink_ext_ack;
struct net_devmem_dmabuf_binding {
@@ -51,17 +53,11 @@ struct net_devmem_dmabuf_binding {
* allocations from this chunk.
*/
struct dmabuf_genpool_chunk_owner {
- /* Offset into the dma-buf where this chunk starts. */
- unsigned long base_virtual;
+ struct net_iov_area area;
+ struct net_devmem_dmabuf_binding *binding;
/* dma_addr of the start of the chunk. */
dma_addr_t base_dma_addr;
-
- /* Array of net_iovs for this chunk. */
- struct net_iov *niovs;
- size_t num_niovs;
-
- struct net_devmem_dmabuf_binding *binding;
};
void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding);
@@ -75,20 +71,17 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
void dev_dmabuf_uninstall(struct net_device *dev);
static inline struct dmabuf_genpool_chunk_owner *
-net_iov_owner(const struct net_iov *niov)
+net_devmem_iov_to_chunk_owner(const struct net_iov *niov)
{
- return niov->owner;
-}
+ struct net_iov_area *owner = net_iov_owner(niov);
-static inline unsigned int net_iov_idx(const struct net_iov *niov)
-{
- return niov - net_iov_owner(niov)->niovs;
+ return container_of(owner, struct dmabuf_genpool_chunk_owner, area);
}
static inline struct net_devmem_dmabuf_binding *
net_devmem_iov_binding(const struct net_iov *niov)
{
- return net_iov_owner(niov)->binding;
+ return net_devmem_iov_to_chunk_owner(niov)->binding;
}
static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
@@ -98,7 +91,7 @@ static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
{
- struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
+ struct net_iov_area *owner = net_iov_owner(niov);
return owner->base_virtual +
((unsigned long)net_iov_idx(niov) << PAGE_SHIFT);
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 04/21] net: page_pool: create hooks for custom memory providers
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (2 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 03/21] net: generalise net_iov chunk owners David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-17 1:46 ` Jakub Kicinski
2025-01-16 23:16 ` [PATCH net-next v11 05/21] netdev: add io_uring memory provider info David Wei
` (17 subsequent siblings)
21 siblings, 1 reply; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
From: Pavel Begunkov <[email protected]>
A spin off from the original page pool memory providers patch by Jakub,
which allows extending page pools with custom allocators. One of such
providers is devmem TCP, and the other is io_uring zerocopy added in
following patches.
Co-developed-by: Jakub Kicinski <[email protected]> # initial mp proposal
Link: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/page_pool/memory_provider.h | 15 +++++++++++++++
include/net/page_pool/types.h | 4 ++++
net/core/devmem.c | 15 ++++++++++++++-
net/core/page_pool.c | 23 +++++++++++++++--------
4 files changed, 48 insertions(+), 9 deletions(-)
create mode 100644 include/net/page_pool/memory_provider.h
diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
new file mode 100644
index 000000000000..e49d0a52629d
--- /dev/null
+++ b/include/net/page_pool/memory_provider.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _NET_PAGE_POOL_MEMORY_PROVIDER_H
+#define _NET_PAGE_POOL_MEMORY_PROVIDER_H
+
+#include <net/netmem.h>
+#include <net/page_pool/types.h>
+
+struct memory_provider_ops {
+ netmem_ref (*alloc_netmems)(struct page_pool *pool, gfp_t gfp);
+ bool (*release_netmem)(struct page_pool *pool, netmem_ref netmem);
+ int (*init)(struct page_pool *pool);
+ void (*destroy)(struct page_pool *pool);
+};
+
+#endif
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index ed4cd114180a..88f65c3e2ad9 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -152,8 +152,11 @@ struct page_pool_stats {
*/
#define PAGE_POOL_FRAG_GROUP_ALIGN (4 * sizeof(long))
+struct memory_provider_ops;
+
struct pp_memory_provider_params {
void *mp_priv;
+ const struct memory_provider_ops *mp_ops;
};
struct page_pool {
@@ -216,6 +219,7 @@ struct page_pool {
struct ptr_ring ring;
void *mp_priv;
+ const struct memory_provider_ops *mp_ops;
#ifdef CONFIG_PAGE_POOL_STATS
/* recycle stats are per-cpu to avoid locking */
diff --git a/net/core/devmem.c b/net/core/devmem.c
index c250db6993d3..48833c1dcbd4 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -15,6 +15,7 @@
#include <net/netdev_queues.h>
#include <net/netdev_rx_queue.h>
#include <net/page_pool/helpers.h>
+#include <net/page_pool/memory_provider.h>
#include <trace/events/page_pool.h>
#include "devmem.h"
@@ -26,6 +27,8 @@
/* Protected by rtnl_lock() */
static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
+static const struct memory_provider_ops dmabuf_devmem_ops;
+
static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
struct gen_pool_chunk *chunk,
void *not_used)
@@ -117,6 +120,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
WARN_ON(rxq->mp_params.mp_priv != binding);
rxq->mp_params.mp_priv = NULL;
+ rxq->mp_params.mp_ops = NULL;
rxq_idx = get_netdev_rx_queue_index(rxq);
@@ -142,7 +146,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
}
rxq = __netif_get_rx_queue(dev, rxq_idx);
- if (rxq->mp_params.mp_priv) {
+ if (rxq->mp_params.mp_ops) {
NL_SET_ERR_MSG(extack, "designated queue already memory provider bound");
return -EEXIST;
}
@@ -160,6 +164,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
return err;
rxq->mp_params.mp_priv = binding;
+ rxq->mp_params.mp_ops = &dmabuf_devmem_ops;
err = netdev_rx_queue_restart(dev, rxq_idx);
if (err)
@@ -169,6 +174,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
err_xa_erase:
rxq->mp_params.mp_priv = NULL;
+ rxq->mp_params.mp_ops = NULL;
xa_erase(&binding->bound_rxqs, xa_idx);
return err;
@@ -388,3 +394,10 @@ bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem)
/* We don't want the page pool put_page()ing our net_iovs. */
return false;
}
+
+static const struct memory_provider_ops dmabuf_devmem_ops = {
+ .init = mp_dmabuf_devmem_init,
+ .destroy = mp_dmabuf_devmem_destroy,
+ .alloc_netmems = mp_dmabuf_devmem_alloc_netmems,
+ .release_netmem = mp_dmabuf_devmem_release_page,
+};
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 9733206d6406..24e2e2efb1eb 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -13,6 +13,7 @@
#include <net/netdev_rx_queue.h>
#include <net/page_pool/helpers.h>
+#include <net/page_pool/memory_provider.h>
#include <net/xdp.h>
#include <linux/dma-direction.h>
@@ -285,13 +286,19 @@ static int page_pool_init(struct page_pool *pool,
rxq = __netif_get_rx_queue(pool->slow.netdev,
pool->slow.queue_idx);
pool->mp_priv = rxq->mp_params.mp_priv;
+ pool->mp_ops = rxq->mp_params.mp_ops;
}
- if (pool->mp_priv) {
+ if (pool->mp_ops) {
if (!pool->dma_map || !pool->dma_sync)
return -EOPNOTSUPP;
- err = mp_dmabuf_devmem_init(pool);
+ if (WARN_ON(!is_kernel_rodata((unsigned long)pool->mp_ops))) {
+ err = -EFAULT;
+ goto free_ptr_ring;
+ }
+
+ err = pool->mp_ops->init(pool);
if (err) {
pr_warn("%s() mem-provider init failed %d\n", __func__,
err);
@@ -588,8 +595,8 @@ netmem_ref page_pool_alloc_netmems(struct page_pool *pool, gfp_t gfp)
return netmem;
/* Slow-path: cache empty, do real allocation */
- if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv)
- netmem = mp_dmabuf_devmem_alloc_netmems(pool, gfp);
+ if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+ netmem = pool->mp_ops->alloc_netmems(pool, gfp);
else
netmem = __page_pool_alloc_pages_slow(pool, gfp);
return netmem;
@@ -680,8 +687,8 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
bool put;
put = true;
- if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv)
- put = mp_dmabuf_devmem_release_page(pool, netmem);
+ if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+ put = pool->mp_ops->release_netmem(pool, netmem);
else
__page_pool_release_page_dma(pool, netmem);
@@ -1049,8 +1056,8 @@ static void __page_pool_destroy(struct page_pool *pool)
page_pool_unlist(pool);
page_pool_uninit(pool);
- if (pool->mp_priv) {
- mp_dmabuf_devmem_destroy(pool);
+ if (pool->mp_ops) {
+ pool->mp_ops->destroy(pool);
static_branch_dec(&page_pool_mem_providers);
}
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 05/21] netdev: add io_uring memory provider info
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (3 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 04/21] net: page_pool: create hooks for custom memory providers David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 06/21] net: page_pool: add callback for mp info printing David Wei
` (16 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
Add a nested attribute for io_uring memory provider info. For now it is
empty and its presence indicates that a particular page pool or queue
has an io_uring memory provider attached.
$ ./cli.py --spec netlink/specs/netdev.yaml --dump page-pool-get
[{'id': 80,
'ifindex': 2,
'inflight': 64,
'inflight-mem': 262144,
'napi-id': 525},
{'id': 79,
'ifindex': 2,
'inflight': 320,
'inflight-mem': 1310720,
'io_uring': {},
'napi-id': 525},
...
$ ./cli.py --spec netlink/specs/netdev.yaml --dump queue-get
[{'id': 0, 'ifindex': 1, 'type': 'rx'},
{'id': 0, 'ifindex': 1, 'type': 'tx'},
{'id': 0, 'ifindex': 2, 'napi-id': 513, 'type': 'rx'},
{'id': 1, 'ifindex': 2, 'napi-id': 514, 'type': 'rx'},
...
{'id': 12, 'ifindex': 2, 'io_uring': {}, 'napi-id': 525, 'type': 'rx'},
...
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
Documentation/netlink/specs/netdev.yaml | 15 +++++++++++++++
include/uapi/linux/netdev.h | 8 ++++++++
tools/include/uapi/linux/netdev.h | 8 ++++++++
3 files changed, 31 insertions(+)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index cbb544bd6c84..288923e965ae 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -114,6 +114,9 @@ attribute-sets:
doc: Bitmask of enabled AF_XDP features.
type: u64
enum: xsk-flags
+ -
+ name: io-uring-provider-info
+ attributes: []
-
name: page-pool
attributes:
@@ -171,6 +174,11 @@ attribute-sets:
name: dmabuf
doc: ID of the dmabuf this page-pool is attached to.
type: u32
+ -
+ name: io-uring
+ doc: io-uring memory provider information.
+ type: nest
+ nested-attributes: io-uring-provider-info
-
name: page-pool-info
subset-of: page-pool
@@ -296,6 +304,11 @@ attribute-sets:
name: dmabuf
doc: ID of the dmabuf attached to this queue, if any.
type: u32
+ -
+ name: io-uring
+ doc: io_uring memory provider information.
+ type: nest
+ nested-attributes: io-uring-provider-info
-
name: qstats
@@ -572,6 +585,7 @@ operations:
- inflight-mem
- detach-time
- dmabuf
+ - io-uring
dump:
reply: *pp-reply
config-cond: page-pool
@@ -637,6 +651,7 @@ operations:
- napi-id
- ifindex
- dmabuf
+ - io-uring
dump:
request:
attributes:
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index e4be227d3ad6..684090732068 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -86,6 +86,12 @@ enum {
NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
};
+enum {
+
+ __NETDEV_A_IO_URING_PROVIDER_INFO_MAX,
+ NETDEV_A_IO_URING_PROVIDER_INFO_MAX = (__NETDEV_A_IO_URING_PROVIDER_INFO_MAX - 1)
+};
+
enum {
NETDEV_A_PAGE_POOL_ID = 1,
NETDEV_A_PAGE_POOL_IFINDEX,
@@ -94,6 +100,7 @@ enum {
NETDEV_A_PAGE_POOL_INFLIGHT_MEM,
NETDEV_A_PAGE_POOL_DETACH_TIME,
NETDEV_A_PAGE_POOL_DMABUF,
+ NETDEV_A_PAGE_POOL_IO_URING,
__NETDEV_A_PAGE_POOL_MAX,
NETDEV_A_PAGE_POOL_MAX = (__NETDEV_A_PAGE_POOL_MAX - 1)
@@ -136,6 +143,7 @@ enum {
NETDEV_A_QUEUE_TYPE,
NETDEV_A_QUEUE_NAPI_ID,
NETDEV_A_QUEUE_DMABUF,
+ NETDEV_A_QUEUE_IO_URING,
__NETDEV_A_QUEUE_MAX,
NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index e4be227d3ad6..684090732068 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -86,6 +86,12 @@ enum {
NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
};
+enum {
+
+ __NETDEV_A_IO_URING_PROVIDER_INFO_MAX,
+ NETDEV_A_IO_URING_PROVIDER_INFO_MAX = (__NETDEV_A_IO_URING_PROVIDER_INFO_MAX - 1)
+};
+
enum {
NETDEV_A_PAGE_POOL_ID = 1,
NETDEV_A_PAGE_POOL_IFINDEX,
@@ -94,6 +100,7 @@ enum {
NETDEV_A_PAGE_POOL_INFLIGHT_MEM,
NETDEV_A_PAGE_POOL_DETACH_TIME,
NETDEV_A_PAGE_POOL_DMABUF,
+ NETDEV_A_PAGE_POOL_IO_URING,
__NETDEV_A_PAGE_POOL_MAX,
NETDEV_A_PAGE_POOL_MAX = (__NETDEV_A_PAGE_POOL_MAX - 1)
@@ -136,6 +143,7 @@ enum {
NETDEV_A_QUEUE_TYPE,
NETDEV_A_QUEUE_NAPI_ID,
NETDEV_A_QUEUE_DMABUF,
+ NETDEV_A_QUEUE_IO_URING,
__NETDEV_A_QUEUE_MAX,
NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 06/21] net: page_pool: add callback for mp info printing
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (4 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 05/21] netdev: add io_uring memory provider info David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 07/21] net: page_pool: add a mp hook to unregister_netdevice* David Wei
` (15 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
From: Pavel Begunkov <[email protected]>
Add a mandatory callback that prints information about the memory
provider to netlink.
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/page_pool/memory_provider.h | 5 +++++
net/core/devmem.c | 10 ++++++++++
net/core/netdev-genl.c | 11 ++++++-----
net/core/page_pool_user.c | 5 ++---
4 files changed, 23 insertions(+), 8 deletions(-)
diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
index e49d0a52629d..6d10a0959d00 100644
--- a/include/net/page_pool/memory_provider.h
+++ b/include/net/page_pool/memory_provider.h
@@ -5,11 +5,16 @@
#include <net/netmem.h>
#include <net/page_pool/types.h>
+struct netdev_rx_queue;
+struct sk_buff;
+
struct memory_provider_ops {
netmem_ref (*alloc_netmems)(struct page_pool *pool, gfp_t gfp);
bool (*release_netmem)(struct page_pool *pool, netmem_ref netmem);
int (*init)(struct page_pool *pool);
void (*destroy)(struct page_pool *pool);
+ int (*nl_fill)(void *mp_priv, struct sk_buff *rsp,
+ struct netdev_rx_queue *rxq);
};
#endif
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 48833c1dcbd4..c0bde0869f72 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -395,9 +395,19 @@ bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem)
return false;
}
+static int mp_dmabuf_devmem_nl_fill(void *mp_priv, struct sk_buff *rsp,
+ struct netdev_rx_queue *rxq)
+{
+ const struct net_devmem_dmabuf_binding *binding = mp_priv;
+ int type = rxq ? NETDEV_A_QUEUE_DMABUF : NETDEV_A_PAGE_POOL_DMABUF;
+
+ return nla_put_u32(rsp, type, binding->id);
+}
+
static const struct memory_provider_ops dmabuf_devmem_ops = {
.init = mp_dmabuf_devmem_init,
.destroy = mp_dmabuf_devmem_destroy,
.alloc_netmems = mp_dmabuf_devmem_alloc_netmems,
.release_netmem = mp_dmabuf_devmem_release_page,
+ .nl_fill = mp_dmabuf_devmem_nl_fill,
};
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 2d3ae0cd3ad2..4bc05fb27890 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -10,6 +10,7 @@
#include <net/sock.h>
#include <net/xdp.h>
#include <net/xdp_sock.h>
+#include <net/page_pool/memory_provider.h>
#include "dev.h"
#include "devmem.h"
@@ -368,7 +369,6 @@ static int
netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
u32 q_idx, u32 q_type, const struct genl_info *info)
{
- struct net_devmem_dmabuf_binding *binding;
struct netdev_rx_queue *rxq;
struct netdev_queue *txq;
void *hdr;
@@ -385,15 +385,16 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
switch (q_type) {
case NETDEV_QUEUE_TYPE_RX:
rxq = __netif_get_rx_queue(netdev, q_idx);
+ struct pp_memory_provider_params *params;
+
if (rxq->napi && nla_put_u32(rsp, NETDEV_A_QUEUE_NAPI_ID,
rxq->napi->napi_id))
goto nla_put_failure;
- binding = rxq->mp_params.mp_priv;
- if (binding &&
- nla_put_u32(rsp, NETDEV_A_QUEUE_DMABUF, binding->id))
+ params = &rxq->mp_params;
+ if (params->mp_ops &&
+ params->mp_ops->nl_fill(params->mp_priv, rsp, rxq))
goto nla_put_failure;
-
break;
case NETDEV_QUEUE_TYPE_TX:
txq = netdev_get_tx_queue(netdev, q_idx);
diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
index 8d31c71bea1a..bd017537fa80 100644
--- a/net/core/page_pool_user.c
+++ b/net/core/page_pool_user.c
@@ -7,9 +7,9 @@
#include <net/netdev_rx_queue.h>
#include <net/page_pool/helpers.h>
#include <net/page_pool/types.h>
+#include <net/page_pool/memory_provider.h>
#include <net/sock.h>
-#include "devmem.h"
#include "page_pool_priv.h"
#include "netdev-genl-gen.h"
@@ -214,7 +214,6 @@ static int
page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
const struct genl_info *info)
{
- struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
size_t inflight, refsz;
void *hdr;
@@ -244,7 +243,7 @@ page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
pool->user.detach_time))
goto err_cancel;
- if (binding && nla_put_u32(rsp, NETDEV_A_PAGE_POOL_DMABUF, binding->id))
+ if (pool->mp_ops && pool->mp_ops->nl_fill(pool->mp_priv, rsp, NULL))
goto err_cancel;
genlmsg_end(rsp, hdr);
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 07/21] net: page_pool: add a mp hook to unregister_netdevice*
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (5 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 06/21] net: page_pool: add callback for mp info printing David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 08/21] net: prepare for non devmem TCP memory providers David Wei
` (14 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
From: Pavel Begunkov <[email protected]>
Devmem TCP needs a hook in unregister_netdevice_many_notify() to upkeep
the set tracking queues it's bound to, i.e. ->bound_rxqs. Instead of
devmem sticking directly out of the genetic path, add a mp function.
Reviewed-by: Jakub Kicinski <[email protected]>
Reviewed-by: Mina Almasry <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/page_pool/memory_provider.h | 1 +
net/core/dev.c | 16 ++++++++++-
net/core/devmem.c | 36 +++++++++++--------------
net/core/devmem.h | 5 ----
4 files changed, 32 insertions(+), 26 deletions(-)
diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
index 6d10a0959d00..36469a7e649f 100644
--- a/include/net/page_pool/memory_provider.h
+++ b/include/net/page_pool/memory_provider.h
@@ -15,6 +15,7 @@ struct memory_provider_ops {
void (*destroy)(struct page_pool *pool);
int (*nl_fill)(void *mp_priv, struct sk_buff *rsp,
struct netdev_rx_queue *rxq);
+ void (*uninstall)(void *mp_priv, struct netdev_rx_queue *rxq);
};
#endif
diff --git a/net/core/dev.c b/net/core/dev.c
index c7f3dea3e0eb..1d99974e8fba 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -157,6 +157,7 @@
#include <net/netdev_rx_queue.h>
#include <net/page_pool/types.h>
#include <net/page_pool/helpers.h>
+#include <net/page_pool/memory_provider.h>
#include <net/rps.h>
#include <linux/phy_link_topology.h>
@@ -11464,6 +11465,19 @@ void unregister_netdevice_queue(struct net_device *dev, struct list_head *head)
}
EXPORT_SYMBOL(unregister_netdevice_queue);
+static void dev_memory_provider_uninstall(struct net_device *dev)
+{
+ unsigned int i;
+
+ for (i = 0; i < dev->real_num_rx_queues; i++) {
+ struct netdev_rx_queue *rxq = &dev->_rx[i];
+ struct pp_memory_provider_params *p = &rxq->mp_params;
+
+ if (p->mp_ops && p->mp_ops->uninstall)
+ p->mp_ops->uninstall(rxq->mp_params.mp_priv, rxq);
+ }
+}
+
void unregister_netdevice_many_notify(struct list_head *head,
u32 portid, const struct nlmsghdr *nlh)
{
@@ -11516,7 +11530,7 @@ void unregister_netdevice_many_notify(struct list_head *head,
dev_tcx_uninstall(dev);
dev_xdp_uninstall(dev);
bpf_dev_bound_netdev_unregister(dev);
- dev_dmabuf_uninstall(dev);
+ dev_memory_provider_uninstall(dev);
netdev_offload_xstats_disable_all(dev);
diff --git a/net/core/devmem.c b/net/core/devmem.c
index c0bde0869f72..6f46286d45a9 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -309,26 +309,6 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
return ERR_PTR(err);
}
-void dev_dmabuf_uninstall(struct net_device *dev)
-{
- struct net_devmem_dmabuf_binding *binding;
- struct netdev_rx_queue *rxq;
- unsigned long xa_idx;
- unsigned int i;
-
- for (i = 0; i < dev->real_num_rx_queues; i++) {
- binding = dev->_rx[i].mp_params.mp_priv;
- if (!binding)
- continue;
-
- xa_for_each(&binding->bound_rxqs, xa_idx, rxq)
- if (rxq == &dev->_rx[i]) {
- xa_erase(&binding->bound_rxqs, xa_idx);
- break;
- }
- }
-}
-
/*** "Dmabuf devmem memory provider" ***/
int mp_dmabuf_devmem_init(struct page_pool *pool)
@@ -404,10 +384,26 @@ static int mp_dmabuf_devmem_nl_fill(void *mp_priv, struct sk_buff *rsp,
return nla_put_u32(rsp, type, binding->id);
}
+static void mp_dmabuf_devmem_uninstall(void *mp_priv,
+ struct netdev_rx_queue *rxq)
+{
+ struct net_devmem_dmabuf_binding *binding = mp_priv;
+ struct netdev_rx_queue *bound_rxq;
+ unsigned long xa_idx;
+
+ xa_for_each(&binding->bound_rxqs, xa_idx, bound_rxq) {
+ if (bound_rxq == rxq) {
+ xa_erase(&binding->bound_rxqs, xa_idx);
+ break;
+ }
+ }
+}
+
static const struct memory_provider_ops dmabuf_devmem_ops = {
.init = mp_dmabuf_devmem_init,
.destroy = mp_dmabuf_devmem_destroy,
.alloc_netmems = mp_dmabuf_devmem_alloc_netmems,
.release_netmem = mp_dmabuf_devmem_release_page,
.nl_fill = mp_dmabuf_devmem_nl_fill,
+ .uninstall = mp_dmabuf_devmem_uninstall,
};
diff --git a/net/core/devmem.h b/net/core/devmem.h
index a2b9913e9a17..8e999fe2ae67 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -68,7 +68,6 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
struct net_devmem_dmabuf_binding *binding,
struct netlink_ext_ack *extack);
-void dev_dmabuf_uninstall(struct net_device *dev);
static inline struct dmabuf_genpool_chunk_owner *
net_devmem_iov_to_chunk_owner(const struct net_iov *niov)
@@ -145,10 +144,6 @@ net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
return -EOPNOTSUPP;
}
-static inline void dev_dmabuf_uninstall(struct net_device *dev)
-{
-}
-
static inline struct net_iov *
net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
{
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 08/21] net: prepare for non devmem TCP memory providers
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (6 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 07/21] net: page_pool: add a mp hook to unregister_netdevice* David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 09/21] net: page_pool: add memory provider helpers David Wei
` (13 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
From: Pavel Begunkov <[email protected]>
There is a good bunch of places in generic paths assuming that the only
page pool memory provider is devmem TCP. As we want to reuse the net_iov
and provider infrastructure, we need to patch it up and explicitly check
the provider type when we branch into devmem TCP code.
Reviewed-by: Mina Almasry <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
net/core/devmem.c | 5 +++++
net/core/devmem.h | 7 +++++++
net/ipv4/tcp.c | 5 +++++
3 files changed, 17 insertions(+)
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 6f46286d45a9..8fcb0c7b63be 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -29,6 +29,11 @@ static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
static const struct memory_provider_ops dmabuf_devmem_ops;
+bool net_is_devmem_iov(struct net_iov *niov)
+{
+ return niov->pp->mp_ops == &dmabuf_devmem_ops;
+}
+
static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
struct gen_pool_chunk *chunk,
void *not_used)
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 8e999fe2ae67..7fc158d52729 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -115,6 +115,8 @@ struct net_iov *
net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding);
void net_devmem_free_dmabuf(struct net_iov *ppiov);
+bool net_is_devmem_iov(struct net_iov *niov);
+
#else
struct net_devmem_dmabuf_binding;
@@ -163,6 +165,11 @@ static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
{
return 0;
}
+
+static inline bool net_is_devmem_iov(struct net_iov *niov)
+{
+ return false;
+}
#endif
#endif /* _NET_DEVMEM_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b872de9a8271..7f43d31c9400 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2476,6 +2476,11 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
}
niov = skb_frag_net_iov(frag);
+ if (!net_is_devmem_iov(niov)) {
+ err = -ENODEV;
+ goto out;
+ }
+
end = start + skb_frag_size(frag);
copy = end - offset;
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 09/21] net: page_pool: add memory provider helpers
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (7 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 08/21] net: prepare for non devmem TCP memory providers David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-17 1:47 ` Jakub Kicinski
2025-01-16 23:16 ` [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue David Wei
` (12 subsequent siblings)
21 siblings, 1 reply; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
Add helpers for memory providers to interact with page pools.
net_mp_niov_{set,clear}_page_pool() serve to [dis]associate a net_iov
with a page pool. If used, the memory provider is responsible to match
"set" calls with "clear" once a net_iov is not going to be used by a page
pool anymore, changing a page pool, etc.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/page_pool/memory_provider.h | 19 +++++++++++++++++
net/core/page_pool.c | 28 +++++++++++++++++++++++++
2 files changed, 47 insertions(+)
diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
index 36469a7e649f..4f0ffb8f6a0a 100644
--- a/include/net/page_pool/memory_provider.h
+++ b/include/net/page_pool/memory_provider.h
@@ -18,4 +18,23 @@ struct memory_provider_ops {
void (*uninstall)(void *mp_priv, struct netdev_rx_queue *rxq);
};
+bool net_mp_niov_set_dma_addr(struct net_iov *niov, dma_addr_t addr);
+void net_mp_niov_set_page_pool(struct page_pool *pool, struct net_iov *niov);
+void net_mp_niov_clear_page_pool(struct net_iov *niov);
+
+/**
+ * net_mp_netmem_place_in_cache() - give a netmem to a page pool
+ * @pool: the page pool to place the netmem into
+ * @netmem: netmem to give
+ *
+ * Push an accounted netmem into the page pool's allocation cache. The caller
+ * must ensure that there is space in the cache. It should only be called off
+ * the mp_ops->alloc_netmems() path.
+ */
+static inline void net_mp_netmem_place_in_cache(struct page_pool *pool,
+ netmem_ref netmem)
+{
+ pool->alloc.cache[pool->alloc.count++] = netmem;
+}
+
#endif
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 24e2e2efb1eb..6a9eb2da5f6d 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1196,3 +1196,31 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
}
}
EXPORT_SYMBOL(page_pool_update_nid);
+
+bool net_mp_niov_set_dma_addr(struct net_iov *niov, dma_addr_t addr)
+{
+ return page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov), addr);
+}
+
+/* Associate a niov with a page pool. Should follow with a matching
+ * net_mp_niov_clear_page_pool()
+ */
+void net_mp_niov_set_page_pool(struct page_pool *pool, struct net_iov *niov)
+{
+ netmem_ref netmem = net_iov_to_netmem(niov);
+
+ page_pool_set_pp_info(pool, netmem);
+
+ pool->pages_state_hold_cnt++;
+ trace_page_pool_state_hold(pool, netmem, pool->pages_state_hold_cnt);
+}
+
+/* Disassociate a niov from a page pool. Should only be used in the
+ * ->release_netmem() path.
+ */
+void net_mp_niov_clear_page_pool(struct net_iov *niov)
+{
+ netmem_ref netmem = net_iov_to_netmem(niov);
+
+ page_pool_clear_pp_info(netmem);
+}
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (8 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 09/21] net: page_pool: add memory provider helpers David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-17 1:50 ` Jakub Kicinski
` (2 more replies)
2025-01-16 23:16 ` [PATCH net-next v11 11/21] io_uring/zcrx: add interface queue and refill queue David Wei
` (11 subsequent siblings)
21 siblings, 3 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
Add helpers that properly prep or remove a memory provider for an rx
queue then restart the queue.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/page_pool/memory_provider.h | 5 ++
net/core/netdev_rx_queue.c | 66 +++++++++++++++++++++++++
2 files changed, 71 insertions(+)
diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
index 4f0ffb8f6a0a..b3e665897767 100644
--- a/include/net/page_pool/memory_provider.h
+++ b/include/net/page_pool/memory_provider.h
@@ -22,6 +22,11 @@ bool net_mp_niov_set_dma_addr(struct net_iov *niov, dma_addr_t addr);
void net_mp_niov_set_page_pool(struct page_pool *pool, struct net_iov *niov);
void net_mp_niov_clear_page_pool(struct net_iov *niov);
+int net_mp_open_rxq(struct net_device *dev, unsigned ifq_idx,
+ struct pp_memory_provider_params *p);
+void net_mp_close_rxq(struct net_device *dev, unsigned ifq_idx,
+ struct pp_memory_provider_params *old_p);
+
/**
* net_mp_netmem_place_in_cache() - give a netmem to a page pool
* @pool: the page pool to place the netmem into
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index e217a5838c87..e934568d4cd9 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -3,6 +3,7 @@
#include <linux/netdevice.h>
#include <net/netdev_queues.h>
#include <net/netdev_rx_queue.h>
+#include <net/page_pool/memory_provider.h>
#include "page_pool_priv.h"
@@ -79,3 +80,68 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
return err;
}
+
+static int __net_mp_open_rxq(struct net_device *dev, unsigned ifq_idx,
+ struct pp_memory_provider_params *p)
+{
+ struct netdev_rx_queue *rxq;
+ int ret;
+
+ if (ifq_idx >= dev->real_num_rx_queues)
+ return -EINVAL;
+ ifq_idx = array_index_nospec(ifq_idx, dev->real_num_rx_queues);
+
+ rxq = __netif_get_rx_queue(dev, ifq_idx);
+ if (rxq->mp_params.mp_ops)
+ return -EEXIST;
+
+ rxq->mp_params = *p;
+ ret = netdev_rx_queue_restart(dev, ifq_idx);
+ if (ret) {
+ rxq->mp_params.mp_ops = NULL;
+ rxq->mp_params.mp_priv = NULL;
+ }
+ return ret;
+}
+
+int net_mp_open_rxq(struct net_device *dev, unsigned ifq_idx,
+ struct pp_memory_provider_params *p)
+{
+ int ret;
+
+ rtnl_lock();
+ ret = __net_mp_open_rxq(dev, ifq_idx, p);
+ rtnl_unlock();
+ return ret;
+}
+
+static void __net_mp_close_rxq(struct net_device *dev, unsigned ifq_idx,
+ struct pp_memory_provider_params *old_p)
+{
+ struct netdev_rx_queue *rxq;
+ int ret;
+
+ if (WARN_ON_ONCE(ifq_idx >= dev->real_num_rx_queues))
+ return;
+
+ rxq = __netif_get_rx_queue(dev, ifq_idx);
+
+ if (WARN_ON_ONCE(rxq->mp_params.mp_ops != old_p->mp_ops ||
+ rxq->mp_params.mp_priv != old_p->mp_priv))
+ return;
+
+ rxq->mp_params.mp_ops = NULL;
+ rxq->mp_params.mp_priv = NULL;
+ ret = netdev_rx_queue_restart(dev, ifq_idx);
+ if (ret)
+ pr_devel("Could not restart queue %u after removing memory provider.\n",
+ ifq_idx);
+}
+
+void net_mp_close_rxq(struct net_device *dev, unsigned ifq_idx,
+ struct pp_memory_provider_params *old_p)
+{
+ rtnl_lock();
+ __net_mp_close_rxq(dev, ifq_idx, old_p);
+ rtnl_unlock();
+}
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 11/21] io_uring/zcrx: add interface queue and refill queue
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (9 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 12/21] io_uring/zcrx: add io_zcrx_area David Wei
` (10 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
Add a new object called an interface queue (ifq) that represents a net
rx queue that has been configured for zero copy. Each ifq is registered
using a new registration opcode IORING_REGISTER_ZCRX_IFQ.
The refill queue is allocated by the kernel and mapped by userspace
using a new offset IORING_OFF_RQ_RING, in a similar fashion to the main
SQ/CQ. It is used by userspace to return buffers that it is done with,
which will then be re-used by the netdev again.
The main CQ ring is used to notify userspace of received data by using
the upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each
entry contains the offset + len to the data.
For now, each io_uring instance only has a single ifq.
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
Kconfig | 2 +
include/linux/io_uring_types.h | 6 ++
include/uapi/linux/io_uring.h | 43 +++++++++-
io_uring/KConfig | 10 +++
io_uring/Makefile | 1 +
io_uring/io_uring.c | 7 ++
io_uring/memmap.h | 1 +
io_uring/register.c | 7 ++
io_uring/zcrx.c | 149 +++++++++++++++++++++++++++++++++
io_uring/zcrx.h | 35 ++++++++
10 files changed, 260 insertions(+), 1 deletion(-)
create mode 100644 io_uring/KConfig
create mode 100644 io_uring/zcrx.c
create mode 100644 io_uring/zcrx.h
diff --git a/Kconfig b/Kconfig
index 745bc773f567..529ea7694ba9 100644
--- a/Kconfig
+++ b/Kconfig
@@ -30,3 +30,5 @@ source "lib/Kconfig"
source "lib/Kconfig.debug"
source "Documentation/Kconfig"
+
+source "io_uring/KConfig"
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 493a8f7fa8e4..f0c6e18d176a 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -40,6 +40,8 @@ enum io_uring_cmd_flags {
IO_URING_F_TASK_DEAD = (1 << 13),
};
+struct io_zcrx_ifq;
+
struct io_wq_work_node {
struct io_wq_work_node *next;
};
@@ -384,6 +386,8 @@ struct io_ring_ctx {
struct wait_queue_head poll_wq;
struct io_restriction restrictions;
+ struct io_zcrx_ifq *ifq;
+
u32 pers_next;
struct xarray personalities;
@@ -436,6 +440,8 @@ struct io_ring_ctx {
struct io_mapped_region ring_region;
/* used for optimised request parameter and wait argument passing */
struct io_mapped_region param_region;
+ /* just one zcrx per ring for now, will move to io_zcrx_ifq eventually */
+ struct io_mapped_region zcrx_region;
};
struct io_tw_state {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 38f0d6b10eaf..3af8b7a19824 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -638,7 +638,8 @@ enum io_uring_register_op {
/* send MSG_RING without having a ring */
IORING_REGISTER_SEND_MSG_RING = 31,
- /* 32 reserved for zc rx */
+ /* register a netdev hw rx queue for zerocopy */
+ IORING_REGISTER_ZCRX_IFQ = 32,
/* resize CQ ring */
IORING_REGISTER_RESIZE_RINGS = 33,
@@ -955,6 +956,46 @@ enum io_uring_socket_op {
SOCKET_URING_OP_SETSOCKOPT,
};
+/* Zero copy receive refill queue entry */
+struct io_uring_zcrx_rqe {
+ __u64 off;
+ __u32 len;
+ __u32 __pad;
+};
+
+struct io_uring_zcrx_cqe {
+ __u64 off;
+ __u64 __pad;
+};
+
+/* The bit from which area id is encoded into offsets */
+#define IORING_ZCRX_AREA_SHIFT 48
+#define IORING_ZCRX_AREA_MASK (~(((__u64)1 << IORING_ZCRX_AREA_SHIFT) - 1))
+
+struct io_uring_zcrx_offsets {
+ __u32 head;
+ __u32 tail;
+ __u32 rqes;
+ __u32 __resv2;
+ __u64 __resv[2];
+};
+
+/*
+ * Argument for IORING_REGISTER_ZCRX_IFQ
+ */
+struct io_uring_zcrx_ifq_reg {
+ __u32 if_idx;
+ __u32 if_rxq;
+ __u32 rq_entries;
+ __u32 flags;
+
+ __u64 area_ptr; /* pointer to struct io_uring_zcrx_area_reg */
+ __u64 region_ptr; /* struct io_uring_region_desc * */
+
+ struct io_uring_zcrx_offsets offsets;
+ __u64 __resv[4];
+};
+
#ifdef __cplusplus
}
#endif
diff --git a/io_uring/KConfig b/io_uring/KConfig
new file mode 100644
index 000000000000..9e2a4beba1ef
--- /dev/null
+++ b/io_uring/KConfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# io_uring configuration
+#
+
+config IO_URING_ZCRX
+ def_bool y
+ depends on PAGE_POOL
+ depends on INET
+ depends on NET_RX_BUSY_POLL
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 53167bef37d7..a95b0b8229c9 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_IO_URING) += io_uring.o opdef.o kbuf.o rsrc.o notif.o \
epoll.o statx.o timeout.o fdinfo.o \
cancel.o waitid.o register.o \
truncate.o memmap.o
+obj-$(CONFIG_IO_URING_ZCRX) += zcrx.o
obj-$(CONFIG_IO_WQ) += io-wq.o
obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 5535a72b0ce1..0c02a2e97f01 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -97,6 +97,7 @@
#include "uring_cmd.h"
#include "msg_ring.h"
#include "memmap.h"
+#include "zcrx.h"
#include "timeout.h"
#include "poll.h"
@@ -2686,6 +2687,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
mutex_lock(&ctx->uring_lock);
io_sqe_buffers_unregister(ctx);
io_sqe_files_unregister(ctx);
+ io_unregister_zcrx_ifqs(ctx);
io_cqring_overflow_kill(ctx);
io_eventfd_unregister(ctx);
io_alloc_cache_free(&ctx->apoll_cache, kfree);
@@ -2851,6 +2853,11 @@ static __cold void io_ring_exit_work(struct work_struct *work)
io_cqring_overflow_kill(ctx);
mutex_unlock(&ctx->uring_lock);
}
+ if (ctx->ifq) {
+ mutex_lock(&ctx->uring_lock);
+ io_shutdown_zcrx_ifqs(ctx);
+ mutex_unlock(&ctx->uring_lock);
+ }
if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
io_move_task_work_from_local(ctx);
diff --git a/io_uring/memmap.h b/io_uring/memmap.h
index c898dcba2b4e..dad0aa5b1b45 100644
--- a/io_uring/memmap.h
+++ b/io_uring/memmap.h
@@ -2,6 +2,7 @@
#define IO_URING_MEMMAP_H
#define IORING_MAP_OFF_PARAM_REGION 0x20000000ULL
+#define IORING_MAP_OFF_ZCRX_REGION 0x30000000ULL
struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages);
diff --git a/io_uring/register.c b/io_uring/register.c
index f1698c18c7cb..f9dfdca79a80 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -30,6 +30,7 @@
#include "eventfd.h"
#include "msg_ring.h"
#include "memmap.h"
+#include "zcrx.h"
#define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
IORING_REGISTER_LAST + IORING_OP_LAST)
@@ -798,6 +799,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_register_clone_buffers(ctx, arg);
break;
+ case IORING_REGISTER_ZCRX_IFQ:
+ ret = -EINVAL;
+ if (!arg || nr_args != 1)
+ break;
+ ret = io_register_zcrx_ifq(ctx, arg);
+ break;
case IORING_REGISTER_RESIZE_RINGS:
ret = -EINVAL;
if (!arg || nr_args != 1)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
new file mode 100644
index 000000000000..f3ace7e8264d
--- /dev/null
+++ b/io_uring/zcrx.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/io_uring.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "io_uring.h"
+#include "kbuf.h"
+#include "memmap.h"
+#include "zcrx.h"
+
+#define IO_RQ_MAX_ENTRIES 32768
+
+static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
+ struct io_uring_zcrx_ifq_reg *reg,
+ struct io_uring_region_desc *rd)
+{
+ size_t off, size;
+ void *ptr;
+ int ret;
+
+ off = sizeof(struct io_uring);
+ size = off + sizeof(struct io_uring_zcrx_rqe) * reg->rq_entries;
+ if (size > rd->size)
+ return -EINVAL;
+
+ ret = io_create_region_mmap_safe(ifq->ctx, &ifq->ctx->zcrx_region, rd,
+ IORING_MAP_OFF_ZCRX_REGION);
+ if (ret < 0)
+ return ret;
+
+ ptr = io_region_get_ptr(&ifq->ctx->zcrx_region);
+ ifq->rq_ring = (struct io_uring *)ptr;
+ ifq->rqes = (struct io_uring_zcrx_rqe *)(ptr + off);
+ return 0;
+}
+
+static void io_free_rbuf_ring(struct io_zcrx_ifq *ifq)
+{
+ io_free_region(ifq->ctx, &ifq->ctx->zcrx_region);
+ ifq->rq_ring = NULL;
+ ifq->rqes = NULL;
+}
+
+static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
+{
+ struct io_zcrx_ifq *ifq;
+
+ ifq = kzalloc(sizeof(*ifq), GFP_KERNEL);
+ if (!ifq)
+ return NULL;
+
+ ifq->if_rxq = -1;
+ ifq->ctx = ctx;
+ return ifq;
+}
+
+static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
+{
+ io_free_rbuf_ring(ifq);
+ kfree(ifq);
+}
+
+int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
+ struct io_uring_zcrx_ifq_reg __user *arg)
+{
+ struct io_uring_zcrx_ifq_reg reg;
+ struct io_uring_region_desc rd;
+ struct io_zcrx_ifq *ifq;
+ int ret;
+
+ /*
+ * 1. Interface queue allocation.
+ * 2. It can observe data destined for sockets of other tasks.
+ */
+ if (!capable(CAP_NET_ADMIN))
+ return -EPERM;
+
+ /* mandatory io_uring features for zc rx */
+ if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN &&
+ ctx->flags & IORING_SETUP_CQE32))
+ return -EINVAL;
+ if (ctx->ifq)
+ return -EBUSY;
+ if (copy_from_user(®, arg, sizeof(reg)))
+ return -EFAULT;
+ if (copy_from_user(&rd, u64_to_user_ptr(reg.region_ptr), sizeof(rd)))
+ return -EFAULT;
+ if (memchr_inv(®.__resv, 0, sizeof(reg.__resv)))
+ return -EINVAL;
+ if (reg.if_rxq == -1 || !reg.rq_entries || reg.flags)
+ return -EINVAL;
+ if (reg.rq_entries > IO_RQ_MAX_ENTRIES) {
+ if (!(ctx->flags & IORING_SETUP_CLAMP))
+ return -EINVAL;
+ reg.rq_entries = IO_RQ_MAX_ENTRIES;
+ }
+ reg.rq_entries = roundup_pow_of_two(reg.rq_entries);
+
+ if (!reg.area_ptr)
+ return -EFAULT;
+
+ ifq = io_zcrx_ifq_alloc(ctx);
+ if (!ifq)
+ return -ENOMEM;
+
+ ret = io_allocate_rbuf_ring(ifq, ®, &rd);
+ if (ret)
+ goto err;
+
+ ifq->rq_entries = reg.rq_entries;
+ ifq->if_rxq = reg.if_rxq;
+
+ reg.offsets.rqes = sizeof(struct io_uring);
+ reg.offsets.head = offsetof(struct io_uring, head);
+ reg.offsets.tail = offsetof(struct io_uring, tail);
+
+ if (copy_to_user(arg, ®, sizeof(reg)) ||
+ copy_to_user(u64_to_user_ptr(reg.region_ptr), &rd, sizeof(rd))) {
+ ret = -EFAULT;
+ goto err;
+ }
+
+ ctx->ifq = ifq;
+ return 0;
+err:
+ io_zcrx_ifq_free(ifq);
+ return ret;
+}
+
+void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+ struct io_zcrx_ifq *ifq = ctx->ifq;
+
+ lockdep_assert_held(&ctx->uring_lock);
+
+ if (!ifq)
+ return;
+
+ ctx->ifq = NULL;
+ io_zcrx_ifq_free(ifq);
+}
+
+void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+ lockdep_assert_held(&ctx->uring_lock);
+}
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
new file mode 100644
index 000000000000..58e4ab6c6083
--- /dev/null
+++ b/io_uring/zcrx.h
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_ZC_RX_H
+#define IOU_ZC_RX_H
+
+#include <linux/io_uring_types.h>
+
+struct io_zcrx_ifq {
+ struct io_ring_ctx *ctx;
+ struct io_uring *rq_ring;
+ struct io_uring_zcrx_rqe *rqes;
+ u32 rq_entries;
+
+ u32 if_rxq;
+};
+
+#if defined(CONFIG_IO_URING_ZCRX)
+int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
+ struct io_uring_zcrx_ifq_reg __user *arg);
+void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx);
+void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx);
+#else
+static inline int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
+ struct io_uring_zcrx_ifq_reg __user *arg)
+{
+ return -EOPNOTSUPP;
+}
+static inline void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+static inline void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+#endif
+
+#endif
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 12/21] io_uring/zcrx: add io_zcrx_area
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (10 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 11/21] io_uring/zcrx: add interface queue and refill queue David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 13/21] io_uring/zcrx: grab a net device David Wei
` (9 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
Add io_zcrx_area that represents a region of userspace memory that is
used for zero copy. During ifq registration, userspace passes in the
uaddr and len of userspace memory, which is then pinned by the kernel.
Each net_iov is mapped to one of these pages.
The freelist is a spinlock protected list that keeps track of all the
net_iovs/pages that aren't used.
For now, there is only one area per ifq and area registration happens
implicitly as part of ifq registration. There is no API for
adding/removing areas yet. The struct for area registration is there for
future extensibility once we support multiple areas and TCP devmem.
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/uapi/linux/io_uring.h | 9 ++++
io_uring/rsrc.c | 2 +-
io_uring/rsrc.h | 1 +
io_uring/zcrx.c | 89 ++++++++++++++++++++++++++++++++++-
io_uring/zcrx.h | 16 +++++++
5 files changed, 114 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 3af8b7a19824..e251f28507ce 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -980,6 +980,15 @@ struct io_uring_zcrx_offsets {
__u64 __resv[2];
};
+struct io_uring_zcrx_area_reg {
+ __u64 addr;
+ __u64 len;
+ __u64 rq_area_token;
+ __u32 flags;
+ __u32 __resv1;
+ __u64 __resv2[2];
+};
+
/*
* Argument for IORING_REGISTER_ZCRX_IFQ
*/
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index f2ff108485c8..d0f11b5aec0d 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -77,7 +77,7 @@ static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
return 0;
}
-static int io_buffer_validate(struct iovec *iov)
+int io_buffer_validate(struct iovec *iov)
{
unsigned long tmp, acct_len = iov->iov_len + (PAGE_SIZE - 1);
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index c8b093584461..0ae54ddeb1fd 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -66,6 +66,7 @@ int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg,
unsigned size, unsigned type);
int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
unsigned int size, unsigned int type);
+int io_buffer_validate(struct iovec *iov);
bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
struct io_imu_folio_data *data);
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index f3ace7e8264d..04883a3ae80c 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -10,6 +10,7 @@
#include "kbuf.h"
#include "memmap.h"
#include "zcrx.h"
+#include "rsrc.h"
#define IO_RQ_MAX_ENTRIES 32768
@@ -44,6 +45,79 @@ static void io_free_rbuf_ring(struct io_zcrx_ifq *ifq)
ifq->rqes = NULL;
}
+static void io_zcrx_free_area(struct io_zcrx_area *area)
+{
+ kvfree(area->freelist);
+ kvfree(area->nia.niovs);
+ if (area->pages) {
+ unpin_user_pages(area->pages, area->nia.num_niovs);
+ kvfree(area->pages);
+ }
+ kfree(area);
+}
+
+static int io_zcrx_create_area(struct io_zcrx_ifq *ifq,
+ struct io_zcrx_area **res,
+ struct io_uring_zcrx_area_reg *area_reg)
+{
+ struct io_zcrx_area *area;
+ int i, ret, nr_pages;
+ struct iovec iov;
+
+ if (area_reg->flags || area_reg->rq_area_token)
+ return -EINVAL;
+ if (area_reg->__resv1 || area_reg->__resv2[0] || area_reg->__resv2[1])
+ return -EINVAL;
+ if (area_reg->addr & ~PAGE_MASK || area_reg->len & ~PAGE_MASK)
+ return -EINVAL;
+
+ iov.iov_base = u64_to_user_ptr(area_reg->addr);
+ iov.iov_len = area_reg->len;
+ ret = io_buffer_validate(&iov);
+ if (ret)
+ return ret;
+
+ ret = -ENOMEM;
+ area = kzalloc(sizeof(*area), GFP_KERNEL);
+ if (!area)
+ goto err;
+
+ area->pages = io_pin_pages((unsigned long)area_reg->addr, area_reg->len,
+ &nr_pages);
+ if (IS_ERR(area->pages)) {
+ ret = PTR_ERR(area->pages);
+ area->pages = NULL;
+ goto err;
+ }
+ area->nia.num_niovs = nr_pages;
+
+ area->nia.niovs = kvmalloc_array(nr_pages, sizeof(area->nia.niovs[0]),
+ GFP_KERNEL | __GFP_ZERO);
+ if (!area->nia.niovs)
+ goto err;
+
+ area->freelist = kvmalloc_array(nr_pages, sizeof(area->freelist[0]),
+ GFP_KERNEL | __GFP_ZERO);
+ if (!area->freelist)
+ goto err;
+
+ for (i = 0; i < nr_pages; i++)
+ area->freelist[i] = i;
+
+ area->free_count = nr_pages;
+ area->ifq = ifq;
+ /* we're only supporting one area per ifq for now */
+ area->area_id = 0;
+ area_reg->rq_area_token = (u64)area->area_id << IORING_ZCRX_AREA_SHIFT;
+ spin_lock_init(&area->freelist_lock);
+ *res = area;
+ return 0;
+err:
+ if (area)
+ io_zcrx_free_area(area);
+ return ret;
+}
+
static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
{
struct io_zcrx_ifq *ifq;
@@ -59,6 +133,9 @@ static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
{
+ if (ifq->area)
+ io_zcrx_free_area(ifq->area);
+
io_free_rbuf_ring(ifq);
kfree(ifq);
}
@@ -66,6 +143,7 @@ static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
struct io_uring_zcrx_ifq_reg __user *arg)
{
+ struct io_uring_zcrx_area_reg area;
struct io_uring_zcrx_ifq_reg reg;
struct io_uring_region_desc rd;
struct io_zcrx_ifq *ifq;
@@ -99,7 +177,7 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
}
reg.rq_entries = roundup_pow_of_two(reg.rq_entries);
- if (!reg.area_ptr)
+ if (copy_from_user(&area, u64_to_user_ptr(reg.area_ptr), sizeof(area)))
return -EFAULT;
ifq = io_zcrx_ifq_alloc(ctx);
@@ -110,6 +188,10 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
if (ret)
goto err;
+ ret = io_zcrx_create_area(ifq, &ifq->area, &area);
+ if (ret)
+ goto err;
+
ifq->rq_entries = reg.rq_entries;
ifq->if_rxq = reg.if_rxq;
@@ -122,7 +204,10 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
ret = -EFAULT;
goto err;
}
-
+ if (copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) {
+ ret = -EFAULT;
+ goto err;
+ }
ctx->ifq = ifq;
return 0;
err:
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 58e4ab6c6083..53fd94b65b38 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -3,9 +3,25 @@
#define IOU_ZC_RX_H
#include <linux/io_uring_types.h>
+#include <net/page_pool/types.h>
+
+struct io_zcrx_area {
+ struct net_iov_area nia;
+ struct io_zcrx_ifq *ifq;
+
+ u16 area_id;
+ struct page **pages;
+
+ /* freelist */
+ spinlock_t freelist_lock ____cacheline_aligned_in_smp;
+ u32 free_count;
+ u32 *freelist;
+};
struct io_zcrx_ifq {
struct io_ring_ctx *ctx;
+ struct io_zcrx_area *area;
+
struct io_uring *rq_ring;
struct io_uring_zcrx_rqe *rqes;
u32 rq_entries;
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 13/21] io_uring/zcrx: grab a net device
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (11 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 12/21] io_uring/zcrx: add io_zcrx_area David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 14/21] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
` (8 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
From: Pavel Begunkov <[email protected]>
Zerocopy receive needs a net device to bind to its rx queue and dma map
buffers. As a preparation to following patches, resolve a net device
from the if_idx parameter with no functional changes otherwise.
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
io_uring/zcrx.c | 28 ++++++++++++++++++++++++++++
io_uring/zcrx.h | 5 +++++
2 files changed, 33 insertions(+)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 04883a3ae80c..435cd634f91c 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -3,6 +3,8 @@
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/io_uring.h>
+#include <linux/netdevice.h>
+#include <linux/rtnetlink.h>
#include <uapi/linux/io_uring.h>
@@ -128,13 +130,28 @@ static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
ifq->if_rxq = -1;
ifq->ctx = ctx;
+ spin_lock_init(&ifq->lock);
return ifq;
}
+static void io_zcrx_drop_netdev(struct io_zcrx_ifq *ifq)
+{
+ spin_lock(&ifq->lock);
+ if (ifq->netdev) {
+ netdev_put(ifq->netdev, &ifq->netdev_tracker);
+ ifq->netdev = NULL;
+ }
+ spin_unlock(&ifq->lock);
+}
+
static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
{
+ io_zcrx_drop_netdev(ifq);
+
if (ifq->area)
io_zcrx_free_area(ifq->area);
+ if (ifq->dev)
+ put_device(ifq->dev);
io_free_rbuf_ring(ifq);
kfree(ifq);
@@ -195,6 +212,17 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
ifq->rq_entries = reg.rq_entries;
ifq->if_rxq = reg.if_rxq;
+ ret = -ENODEV;
+ ifq->netdev = netdev_get_by_index(current->nsproxy->net_ns, reg.if_idx,
+ &ifq->netdev_tracker, GFP_KERNEL);
+ if (!ifq->netdev)
+ goto err;
+
+ ifq->dev = ifq->netdev->dev.parent;
+ if (!ifq->dev)
+ return -EOPNOTSUPP;
+ get_device(ifq->dev);
+
reg.offsets.rqes = sizeof(struct io_uring);
reg.offsets.head = offsetof(struct io_uring, head);
reg.offsets.tail = offsetof(struct io_uring, tail);
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 53fd94b65b38..595bca0001d2 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -4,6 +4,7 @@
#include <linux/io_uring_types.h>
#include <net/page_pool/types.h>
+#include <net/net_trackers.h>
struct io_zcrx_area {
struct net_iov_area nia;
@@ -27,6 +28,10 @@ struct io_zcrx_ifq {
u32 rq_entries;
u32 if_rxq;
+ struct device *dev;
+ struct net_device *netdev;
+ netdevice_tracker netdev_tracker;
+ spinlock_t lock;
};
#if defined(CONFIG_IO_URING_ZCRX)
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 14/21] io_uring/zcrx: implement zerocopy receive pp memory provider
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (12 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 13/21] io_uring/zcrx: grab a net device David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-17 2:07 ` Jakub Kicinski
2025-01-16 23:16 ` [PATCH net-next v11 15/21] io_uring/zcrx: dma-map area for the device David Wei
` (7 subsequent siblings)
21 siblings, 1 reply; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
From: Pavel Begunkov <[email protected]>
Implement a page pool memory provider for io_uring to receieve in a
zero copy fashion. For that, the provider allocates user pages wrapped
around into struct net_iovs, that are stored in a previously registered
struct net_iov_area.
Unlike the traditional receive, that frees pages and returns them back
to the page pool right after data was copied to the user, e.g. inside
recv(2), we extend the lifetime until the user space confirms that it's
done processing the data. That's done by taking a net_iov reference.
When the user is done with the buffer, it must return it back to the
kernel by posting an entry into the refill ring, which is usually polled
off the io_uring memory provider callback in the page pool's netmem
allocation path.
There is also a separate set of per net_iov "user" references accounting
whether a buffer is currently given to the user (including possible
fragmentation).
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
io_uring/zcrx.c | 269 ++++++++++++++++++++++++++++++++++++++++++++++++
io_uring/zcrx.h | 3 +
2 files changed, 272 insertions(+)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 435cd634f91c..c007ff80b325 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -2,10 +2,17 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/mm.h>
+#include <linux/nospec.h>
#include <linux/io_uring.h>
#include <linux/netdevice.h>
#include <linux/rtnetlink.h>
+#include <net/page_pool/helpers.h>
+#include <net/page_pool/memory_provider.h>
+#include <net/netlink.h>
+
+#include <trace/events/page_pool.h>
+
#include <uapi/linux/io_uring.h>
#include "io_uring.h"
@@ -16,6 +23,33 @@
#define IO_RQ_MAX_ENTRIES 32768
+__maybe_unused
+static const struct memory_provider_ops io_uring_pp_zc_ops;
+
+static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
+{
+ struct net_iov_area *owner = net_iov_owner(niov);
+
+ return container_of(owner, struct io_zcrx_area, nia);
+}
+
+static inline atomic_t *io_get_user_counter(struct net_iov *niov)
+{
+ struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
+
+ return &area->user_refs[net_iov_idx(niov)];
+}
+
+static bool io_zcrx_put_niov_uref(struct net_iov *niov)
+{
+ atomic_t *uref = io_get_user_counter(niov);
+
+ if (unlikely(!atomic_read(uref)))
+ return false;
+ atomic_dec(uref);
+ return true;
+}
+
static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
struct io_uring_zcrx_ifq_reg *reg,
struct io_uring_region_desc *rd)
@@ -51,6 +85,7 @@ static void io_zcrx_free_area(struct io_zcrx_area *area)
{
kvfree(area->freelist);
kvfree(area->nia.niovs);
+ kvfree(area->user_refs);
if (area->pages) {
unpin_user_pages(area->pages, area->nia.num_niovs);
kvfree(area->pages);
@@ -106,6 +141,19 @@ static int io_zcrx_create_area(struct io_zcrx_ifq *ifq,
for (i = 0; i < nr_pages; i++)
area->freelist[i] = i;
+ area->user_refs = kvmalloc_array(nr_pages, sizeof(area->user_refs[0]),
+ GFP_KERNEL | __GFP_ZERO);
+ if (!area->user_refs)
+ goto err;
+
+ for (i = 0; i < nr_pages; i++) {
+ struct net_iov *niov = &area->nia.niovs[i];
+
+ niov->owner = &area->nia;
+ area->freelist[i] = i;
+ atomic_set(&area->user_refs[i], 0);
+ }
+
area->free_count = nr_pages;
area->ifq = ifq;
/* we're only supporting one area per ifq for now */
@@ -131,6 +179,7 @@ static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
ifq->if_rxq = -1;
ifq->ctx = ctx;
spin_lock_init(&ifq->lock);
+ spin_lock_init(&ifq->rq_lock);
return ifq;
}
@@ -256,7 +305,227 @@ void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
io_zcrx_ifq_free(ifq);
}
+static struct net_iov *__io_zcrx_get_free_niov(struct io_zcrx_area *area)
+{
+ unsigned niov_idx;
+
+ lockdep_assert_held(&area->freelist_lock);
+
+ niov_idx = area->freelist[--area->free_count];
+ return &area->nia.niovs[niov_idx];
+}
+
+static void io_zcrx_return_niov_freelist(struct net_iov *niov)
+{
+ struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
+
+ spin_lock_bh(&area->freelist_lock);
+ area->freelist[area->free_count++] = net_iov_idx(niov);
+ spin_unlock_bh(&area->freelist_lock);
+}
+
+static void io_zcrx_return_niov(struct net_iov *niov)
+{
+ netmem_ref netmem = net_iov_to_netmem(niov);
+
+ page_pool_put_unrefed_netmem(niov->pp, netmem, -1, false);
+}
+
+static void io_zcrx_scrub(struct io_zcrx_ifq *ifq)
+{
+ struct io_zcrx_area *area = ifq->area;
+ int i;
+
+ if (!area)
+ return;
+
+ /* Reclaim back all buffers given to the user space. */
+ for (i = 0; i < area->nia.num_niovs; i++) {
+ struct net_iov *niov = &area->nia.niovs[i];
+ int nr;
+
+ if (!atomic_read(io_get_user_counter(niov)))
+ continue;
+ nr = atomic_xchg(io_get_user_counter(niov), 0);
+ if (nr && !page_pool_unref_netmem(net_iov_to_netmem(niov), nr))
+ io_zcrx_return_niov(niov);
+ }
+}
+
void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
{
lockdep_assert_held(&ctx->uring_lock);
+
+ if (ctx->ifq)
+ io_zcrx_scrub(ctx->ifq);
+}
+
+static inline u32 io_zcrx_rqring_entries(struct io_zcrx_ifq *ifq)
+{
+ u32 entries;
+
+ entries = smp_load_acquire(&ifq->rq_ring->tail) - ifq->cached_rq_head;
+ return min(entries, ifq->rq_entries);
}
+
+static struct io_uring_zcrx_rqe *io_zcrx_get_rqe(struct io_zcrx_ifq *ifq,
+ unsigned mask)
+{
+ unsigned int idx = ifq->cached_rq_head++ & mask;
+
+ return &ifq->rqes[idx];
+}
+
+static void io_zcrx_ring_refill(struct page_pool *pp,
+ struct io_zcrx_ifq *ifq)
+{
+ unsigned int mask = ifq->rq_entries - 1;
+ unsigned int entries;
+ netmem_ref netmem;
+
+ spin_lock_bh(&ifq->rq_lock);
+
+ entries = io_zcrx_rqring_entries(ifq);
+ entries = min_t(unsigned, entries, PP_ALLOC_CACHE_REFILL - pp->alloc.count);
+ if (unlikely(!entries)) {
+ spin_unlock_bh(&ifq->rq_lock);
+ return;
+ }
+
+ do {
+ struct io_uring_zcrx_rqe *rqe = io_zcrx_get_rqe(ifq, mask);
+ struct io_zcrx_area *area;
+ struct net_iov *niov;
+ unsigned niov_idx, area_idx;
+
+ area_idx = rqe->off >> IORING_ZCRX_AREA_SHIFT;
+ niov_idx = (rqe->off & ~IORING_ZCRX_AREA_MASK) >> PAGE_SHIFT;
+
+ if (unlikely(rqe->__pad || area_idx))
+ continue;
+ area = ifq->area;
+
+ if (unlikely(niov_idx >= area->nia.num_niovs))
+ continue;
+ niov_idx = array_index_nospec(niov_idx, area->nia.num_niovs);
+
+ niov = &area->nia.niovs[niov_idx];
+ if (!io_zcrx_put_niov_uref(niov))
+ continue;
+
+ netmem = net_iov_to_netmem(niov);
+ if (page_pool_unref_netmem(netmem, 1) != 0)
+ continue;
+
+ if (unlikely(niov->pp != pp)) {
+ io_zcrx_return_niov(niov);
+ continue;
+ }
+
+ net_mp_netmem_place_in_cache(pp, netmem);
+ } while (--entries);
+
+ smp_store_release(&ifq->rq_ring->head, ifq->cached_rq_head);
+ spin_unlock_bh(&ifq->rq_lock);
+}
+
+static void io_zcrx_refill_slow(struct page_pool *pp, struct io_zcrx_ifq *ifq)
+{
+ struct io_zcrx_area *area = ifq->area;
+
+ spin_lock_bh(&area->freelist_lock);
+ while (area->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
+ struct net_iov *niov = __io_zcrx_get_free_niov(area);
+ netmem_ref netmem = net_iov_to_netmem(niov);
+
+ net_mp_niov_set_page_pool(pp, niov);
+ net_mp_netmem_place_in_cache(pp, netmem);
+ }
+ spin_unlock_bh(&area->freelist_lock);
+}
+
+static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
+{
+ struct io_zcrx_ifq *ifq = pp->mp_priv;
+
+ /* pp should already be ensuring that */
+ if (unlikely(pp->alloc.count))
+ goto out_return;
+
+ io_zcrx_ring_refill(pp, ifq);
+ if (likely(pp->alloc.count))
+ goto out_return;
+
+ io_zcrx_refill_slow(pp, ifq);
+ if (!pp->alloc.count)
+ return 0;
+out_return:
+ return pp->alloc.cache[--pp->alloc.count];
+}
+
+static bool io_pp_zc_release_netmem(struct page_pool *pp, netmem_ref netmem)
+{
+ struct net_iov *niov;
+
+ if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+ return false;
+
+ niov = netmem_to_net_iov(netmem);
+ net_mp_niov_clear_page_pool(niov);
+ io_zcrx_return_niov_freelist(niov);
+ return false;
+}
+
+static int io_pp_zc_init(struct page_pool *pp)
+{
+ struct io_zcrx_ifq *ifq = pp->mp_priv;
+
+ if (WARN_ON_ONCE(!ifq))
+ return -EINVAL;
+ if (pp->dma_map)
+ return -EOPNOTSUPP;
+ if (pp->p.order != 0)
+ return -EOPNOTSUPP;
+
+ percpu_ref_get(&ifq->ctx->refs);
+ return 0;
+}
+
+static void io_pp_zc_destroy(struct page_pool *pp)
+{
+ struct io_zcrx_ifq *ifq = pp->mp_priv;
+ struct io_zcrx_area *area = ifq->area;
+
+ if (WARN_ON_ONCE(area->free_count != area->nia.num_niovs))
+ return;
+ percpu_ref_put(&ifq->ctx->refs);
+}
+
+static int io_pp_nl_fill(void *mp_priv, struct sk_buff *rsp,
+ struct netdev_rx_queue *rxq)
+{
+ struct nlattr *nest;
+ int type;
+
+ type = rxq ? NETDEV_A_QUEUE_IO_URING : NETDEV_A_PAGE_POOL_IO_URING;
+ nest = nla_nest_start(rsp, type);
+ nla_nest_end(rsp, nest);
+
+ return 0;
+}
+
+static void io_pp_uninstall(void *mp_priv, struct netdev_rx_queue *rxq)
+{
+ struct io_zcrx_ifq *ifq = mp_priv;
+
+ io_zcrx_drop_netdev(ifq);
+}
+
+static const struct memory_provider_ops io_uring_pp_zc_ops = {
+ .alloc_netmems = io_pp_zc_alloc_netmems,
+ .release_netmem = io_pp_zc_release_netmem,
+ .init = io_pp_zc_init,
+ .destroy = io_pp_zc_destroy,
+ .nl_fill = io_pp_nl_fill,
+ .uninstall = io_pp_uninstall,
+};
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 595bca0001d2..6c808240ac91 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -9,6 +9,7 @@
struct io_zcrx_area {
struct net_iov_area nia;
struct io_zcrx_ifq *ifq;
+ atomic_t *user_refs;
u16 area_id;
struct page **pages;
@@ -26,6 +27,8 @@ struct io_zcrx_ifq {
struct io_uring *rq_ring;
struct io_uring_zcrx_rqe *rqes;
u32 rq_entries;
+ u32 cached_rq_head;
+ spinlock_t rq_lock;
u32 if_rxq;
struct device *dev;
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 15/21] io_uring/zcrx: dma-map area for the device
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (13 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 14/21] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 16/21] io_uring/zcrx: add io_recvzc request David Wei
` (6 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
From: Pavel Begunkov <[email protected]>
Setup DMA mappings for the area into which we intend to receive data
later on. We know the device we want to attach to even before we get a
page pool and can pre-map in advance. All net_iov are synchronised for
device when allocated, see page_pool_mp_return_in_cache().
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
io_uring/zcrx.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++-
io_uring/zcrx.h | 1 +
2 files changed, 82 insertions(+), 1 deletion(-)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index c007ff80b325..2b668f09b55f 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/kernel.h>
#include <linux/errno.h>
+#include <linux/dma-map-ops.h>
#include <linux/mm.h>
#include <linux/nospec.h>
#include <linux/io_uring.h>
@@ -21,6 +22,73 @@
#include "zcrx.h"
#include "rsrc.h"
+#define IO_DMA_ATTR (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)
+
+static void __io_zcrx_unmap_area(struct io_zcrx_ifq *ifq,
+ struct io_zcrx_area *area, int nr_mapped)
+{
+ int i;
+
+ for (i = 0; i < nr_mapped; i++) {
+ struct net_iov *niov = &area->nia.niovs[i];
+ dma_addr_t dma;
+
+ dma = page_pool_get_dma_addr_netmem(net_iov_to_netmem(niov));
+ dma_unmap_page_attrs(ifq->dev, dma, PAGE_SIZE,
+ DMA_FROM_DEVICE, IO_DMA_ATTR);
+ net_mp_niov_set_dma_addr(niov, 0);
+ }
+}
+
+static void io_zcrx_unmap_area(struct io_zcrx_ifq *ifq, struct io_zcrx_area *area)
+{
+ if (area->is_mapped)
+ __io_zcrx_unmap_area(ifq, area, area->nia.num_niovs);
+}
+
+static int io_zcrx_map_area(struct io_zcrx_ifq *ifq, struct io_zcrx_area *area)
+{
+ int i;
+
+ for (i = 0; i < area->nia.num_niovs; i++) {
+ struct net_iov *niov = &area->nia.niovs[i];
+ dma_addr_t dma;
+
+ dma = dma_map_page_attrs(ifq->dev, area->pages[i], 0, PAGE_SIZE,
+ DMA_FROM_DEVICE, IO_DMA_ATTR);
+ if (dma_mapping_error(ifq->dev, dma))
+ break;
+ if (net_mp_niov_set_dma_addr(niov, dma)) {
+ dma_unmap_page_attrs(ifq->dev, dma, PAGE_SIZE,
+ DMA_FROM_DEVICE, IO_DMA_ATTR);
+ break;
+ }
+ }
+
+ if (i != area->nia.num_niovs) {
+ __io_zcrx_unmap_area(ifq, area, i);
+ return -EINVAL;
+ }
+
+ area->is_mapped = true;
+ return 0;
+}
+
+static void io_zcrx_sync_for_device(const struct page_pool *pool,
+ struct net_iov *niov)
+{
+#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
+ dma_addr_t dma_addr;
+
+ if (!dma_dev_need_sync(pool->p.dev))
+ return;
+
+ dma_addr = page_pool_get_dma_addr_netmem(net_iov_to_netmem(niov));
+ __dma_sync_single_for_device(pool->p.dev, dma_addr + pool->p.offset,
+ PAGE_SIZE, pool->p.dma_dir);
+#endif
+}
+
#define IO_RQ_MAX_ENTRIES 32768
__maybe_unused
@@ -83,6 +151,8 @@ static void io_free_rbuf_ring(struct io_zcrx_ifq *ifq)
static void io_zcrx_free_area(struct io_zcrx_area *area)
{
+ io_zcrx_unmap_area(area->ifq, area);
+
kvfree(area->freelist);
kvfree(area->nia.niovs);
kvfree(area->user_refs);
@@ -272,6 +342,10 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
return -EOPNOTSUPP;
get_device(ifq->dev);
+ ret = io_zcrx_map_area(ifq, ifq->area);
+ if (ret)
+ goto err;
+
reg.offsets.rqes = sizeof(struct io_uring);
reg.offsets.head = offsetof(struct io_uring, head);
reg.offsets.tail = offsetof(struct io_uring, tail);
@@ -422,6 +496,7 @@ static void io_zcrx_ring_refill(struct page_pool *pp,
continue;
}
+ io_zcrx_sync_for_device(pp, niov);
net_mp_netmem_place_in_cache(pp, netmem);
} while (--entries);
@@ -439,6 +514,7 @@ static void io_zcrx_refill_slow(struct page_pool *pp, struct io_zcrx_ifq *ifq)
netmem_ref netmem = net_iov_to_netmem(niov);
net_mp_niov_set_page_pool(pp, niov);
+ io_zcrx_sync_for_device(pp, niov);
net_mp_netmem_place_in_cache(pp, netmem);
}
spin_unlock_bh(&area->freelist_lock);
@@ -482,10 +558,14 @@ static int io_pp_zc_init(struct page_pool *pp)
if (WARN_ON_ONCE(!ifq))
return -EINVAL;
- if (pp->dma_map)
+ if (WARN_ON_ONCE(ifq->dev != pp->p.dev))
+ return -EINVAL;
+ if (WARN_ON_ONCE(!pp->dma_map))
return -EOPNOTSUPP;
if (pp->p.order != 0)
return -EOPNOTSUPP;
+ if (pp->p.dma_dir != DMA_FROM_DEVICE)
+ return -EOPNOTSUPP;
percpu_ref_get(&ifq->ctx->refs);
return 0;
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 6c808240ac91..1b6363591f72 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -11,6 +11,7 @@ struct io_zcrx_area {
struct io_zcrx_ifq *ifq;
atomic_t *user_refs;
+ bool is_mapped;
u16 area_id;
struct page **pages;
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 16/21] io_uring/zcrx: add io_recvzc request
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (14 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 15/21] io_uring/zcrx: dma-map area for the device David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 17/21] io_uring/zcrx: set pp memory provider for an rx queue David Wei
` (5 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
Add io_uring opcode OP_RECV_ZC for doing zero copy reads out of a
socket. Only the connection should be land on the specific rx queue set
up for zero copy, and the socket must be handled by the io_uring
instance that the rx queue was registered for zero copy with. That's
because neither net_iovs / buffers from our queue can be read by outside
applications, nor zero copy is possible if traffic for the zero copy
connection goes to another queue. This coordination is outside of the
scope of this patch series. Also, any traffic directed to the zero copy
enabled queue is immediately visible to the application, which is why
CAP_NET_ADMIN is required at the registration step.
Of course, no data is actually read out of the socket, it has already
been copied by the netdev into userspace memory via DMA. OP_RECV_ZC
reads skbs out of the socket and checks that its frags are indeed
net_iovs that belong to io_uring. A cqe is queued for each one of these
frags.
Recall that each cqe is a big cqe, with the top half being an
io_uring_zcrx_cqe. The cqe res field contains the len or error. The
lower IORING_ZCRX_AREA_SHIFT bits of the struct io_uring_zcrx_cqe::off
field contain the offset relative to the start of the zero copy area.
The upper part of the off field is trivially zero, and will be used
to carry the area id.
For now, there is no limit as to how much work each OP_RECV_ZC request
does. It will attempt to drain a socket of all available data. This
request always operates in multishot mode.
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/uapi/linux/io_uring.h | 2 +
io_uring/io_uring.h | 10 ++
io_uring/net.c | 72 +++++++++++++
io_uring/opdef.c | 16 +++
io_uring/zcrx.c | 190 +++++++++++++++++++++++++++++++++-
io_uring/zcrx.h | 13 +++
6 files changed, 302 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e251f28507ce..b919a541d44f 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -87,6 +87,7 @@ struct io_uring_sqe {
union {
__s32 splice_fd_in;
__u32 file_index;
+ __u32 zcrx_ifq_idx;
__u32 optlen;
struct {
__u16 addr_len;
@@ -278,6 +279,7 @@ enum io_uring_op {
IORING_OP_FTRUNCATE,
IORING_OP_BIND,
IORING_OP_LISTEN,
+ IORING_OP_RECV_ZC,
/* this goes last, obviously */
IORING_OP_LAST,
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 032758b28d78..2b1ce5539bfe 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -184,6 +184,16 @@ static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret
return io_get_cqe_overflow(ctx, ret, false);
}
+static inline bool io_defer_get_uncommited_cqe(struct io_ring_ctx *ctx,
+ struct io_uring_cqe **cqe_ret)
+{
+ io_lockdep_assert_cq_locked(ctx);
+
+ ctx->cq_extra++;
+ ctx->submit_state.cq_flush = true;
+ return io_get_cqe(ctx, cqe_ret);
+}
+
static __always_inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
struct io_kiocb *req)
{
diff --git a/io_uring/net.c b/io_uring/net.c
index 8457408194e7..5d8b9a016766 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -16,6 +16,7 @@
#include "net.h"
#include "notif.h"
#include "rsrc.h"
+#include "zcrx.h"
#if defined(CONFIG_NET)
struct io_shutdown {
@@ -88,6 +89,13 @@ struct io_sr_msg {
*/
#define MULTISHOT_MAX_RETRY 32
+struct io_recvzc {
+ struct file *file;
+ unsigned msg_flags;
+ u16 flags;
+ struct io_zcrx_ifq *ifq;
+};
+
int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
struct io_shutdown *shutdown = io_kiocb_to_cmd(req, struct io_shutdown);
@@ -1209,6 +1217,70 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
return ret;
}
+int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
+ unsigned ifq_idx;
+
+ if (unlikely(sqe->file_index || sqe->addr2 || sqe->addr ||
+ sqe->len || sqe->addr3))
+ return -EINVAL;
+
+ ifq_idx = READ_ONCE(sqe->zcrx_ifq_idx);
+ if (ifq_idx != 0)
+ return -EINVAL;
+ zc->ifq = req->ctx->ifq;
+ if (!zc->ifq)
+ return -EINVAL;
+
+ zc->flags = READ_ONCE(sqe->ioprio);
+ zc->msg_flags = READ_ONCE(sqe->msg_flags);
+ if (zc->msg_flags)
+ return -EINVAL;
+ if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST | IORING_RECV_MULTISHOT))
+ return -EINVAL;
+ /* multishot required */
+ if (!(zc->flags & IORING_RECV_MULTISHOT))
+ return -EINVAL;
+ /* All data completions are posted as aux CQEs. */
+ req->flags |= REQ_F_APOLL_MULTISHOT;
+
+ return 0;
+}
+
+int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
+ struct socket *sock;
+ int ret;
+
+ if (!(req->flags & REQ_F_POLLED) &&
+ (zc->flags & IORING_RECVSEND_POLL_FIRST))
+ return -EAGAIN;
+
+ sock = sock_from_file(req->file);
+ if (unlikely(!sock))
+ return -ENOTSOCK;
+
+ ret = io_zcrx_recv(req, zc->ifq, sock, zc->msg_flags | MSG_DONTWAIT,
+ issue_flags);
+ if (unlikely(ret <= 0) && ret != -EAGAIN) {
+ if (ret == -ERESTARTSYS)
+ ret = -EINTR;
+
+ req_set_fail(req);
+ io_req_set_res(req, ret, 0);
+
+ if (issue_flags & IO_URING_F_MULTISHOT)
+ return IOU_STOP_MULTISHOT;
+ return IOU_OK;
+ }
+
+ if (issue_flags & IO_URING_F_MULTISHOT)
+ return IOU_ISSUE_SKIP_COMPLETE;
+ return -EAGAIN;
+}
+
void io_send_zc_cleanup(struct io_kiocb *req)
{
struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 3de75eca1c92..6ae00c0af9a8 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -36,6 +36,7 @@
#include "waitid.h"
#include "futex.h"
#include "truncate.h"
+#include "zcrx.h"
static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
{
@@ -513,6 +514,18 @@ const struct io_issue_def io_issue_defs[] = {
.async_size = sizeof(struct io_async_msghdr),
#else
.prep = io_eopnotsupp_prep,
+#endif
+ },
+ [IORING_OP_RECV_ZC] = {
+ .needs_file = 1,
+ .unbound_nonreg_file = 1,
+ .pollin = 1,
+ .ioprio = 1,
+#if defined(CONFIG_NET)
+ .prep = io_recvzc_prep,
+ .issue = io_recvzc,
+#else
+ .prep = io_eopnotsupp_prep,
#endif
},
};
@@ -744,6 +757,9 @@ const struct io_cold_def io_cold_defs[] = {
[IORING_OP_LISTEN] = {
.name = "LISTEN",
},
+ [IORING_OP_RECV_ZC] = {
+ .name = "RECV_ZC",
+ },
};
const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 2b668f09b55f..cecfb04a117c 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -13,6 +13,8 @@
#include <net/netlink.h>
#include <trace/events/page_pool.h>
+#include <net/tcp.h>
+#include <net/rps.h>
#include <uapi/linux/io_uring.h>
@@ -91,7 +93,12 @@ static void io_zcrx_sync_for_device(const struct page_pool *pool,
#define IO_RQ_MAX_ENTRIES 32768
-__maybe_unused
+struct io_zcrx_args {
+ struct io_kiocb *req;
+ struct io_zcrx_ifq *ifq;
+ struct socket *sock;
+};
+
static const struct memory_provider_ops io_uring_pp_zc_ops;
static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
@@ -118,6 +125,11 @@ static bool io_zcrx_put_niov_uref(struct net_iov *niov)
return true;
}
+static void io_zcrx_get_niov_uref(struct net_iov *niov)
+{
+ atomic_inc(io_get_user_counter(niov));
+}
+
static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
struct io_uring_zcrx_ifq_reg *reg,
struct io_uring_region_desc *rd)
@@ -609,3 +621,179 @@ static const struct memory_provider_ops io_uring_pp_zc_ops = {
.nl_fill = io_pp_nl_fill,
.uninstall = io_pp_uninstall,
};
+
+static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov,
+ struct io_zcrx_ifq *ifq, int off, int len)
+{
+ struct io_uring_zcrx_cqe *rcqe;
+ struct io_zcrx_area *area;
+ struct io_uring_cqe *cqe;
+ u64 offset;
+
+ if (!io_defer_get_uncommited_cqe(req->ctx, &cqe))
+ return false;
+
+ cqe->user_data = req->cqe.user_data;
+ cqe->res = len;
+ cqe->flags = IORING_CQE_F_MORE;
+
+ area = io_zcrx_iov_to_area(niov);
+ offset = off + (net_iov_idx(niov) << PAGE_SHIFT);
+ rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
+ rcqe->off = offset + ((u64)area->area_id << IORING_ZCRX_AREA_SHIFT);
+ rcqe->__pad = 0;
+ return true;
+}
+
+static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+ const skb_frag_t *frag, int off, int len)
+{
+ struct net_iov *niov;
+
+ if (unlikely(!skb_frag_is_net_iov(frag)))
+ return -EOPNOTSUPP;
+
+ niov = netmem_to_net_iov(frag->netmem);
+ if (niov->pp->mp_ops != &io_uring_pp_zc_ops ||
+ niov->pp->mp_priv != ifq)
+ return -EFAULT;
+
+ if (!io_zcrx_queue_cqe(req, niov, ifq, off + skb_frag_off(frag), len))
+ return -ENOSPC;
+
+ /*
+ * Prevent it from being recycled while user is accessing it.
+ * It has to be done before grabbing a user reference.
+ */
+ page_pool_ref_netmem(net_iov_to_netmem(niov));
+ io_zcrx_get_niov_uref(niov);
+ return len;
+}
+
+static int
+io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+ unsigned int offset, size_t len)
+{
+ struct io_zcrx_args *args = desc->arg.data;
+ struct io_zcrx_ifq *ifq = args->ifq;
+ struct io_kiocb *req = args->req;
+ struct sk_buff *frag_iter;
+ unsigned start, start_off;
+ int i, copy, end, off;
+ int ret = 0;
+
+ start = skb_headlen(skb);
+ start_off = offset;
+
+ if (offset < start)
+ return -EOPNOTSUPP;
+
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ const skb_frag_t *frag;
+
+ if (WARN_ON(start > offset + len))
+ return -EFAULT;
+
+ frag = &skb_shinfo(skb)->frags[i];
+ end = start + skb_frag_size(frag);
+
+ if (offset < end) {
+ copy = end - offset;
+ if (copy > len)
+ copy = len;
+
+ off = offset - start;
+ ret = io_zcrx_recv_frag(req, ifq, frag, off, copy);
+ if (ret < 0)
+ goto out;
+
+ offset += ret;
+ len -= ret;
+ if (len == 0 || ret != copy)
+ goto out;
+ }
+ start = end;
+ }
+
+ skb_walk_frags(skb, frag_iter) {
+ if (WARN_ON(start > offset + len))
+ return -EFAULT;
+
+ end = start + frag_iter->len;
+ if (offset < end) {
+ copy = end - offset;
+ if (copy > len)
+ copy = len;
+
+ off = offset - start;
+ ret = io_zcrx_recv_skb(desc, frag_iter, off, copy);
+ if (ret < 0)
+ goto out;
+
+ offset += ret;
+ len -= ret;
+ if (len == 0 || ret != copy)
+ goto out;
+ }
+ start = end;
+ }
+
+out:
+ if (offset == start_off)
+ return ret;
+ return offset - start_off;
+}
+
+static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+ struct sock *sk, int flags,
+ unsigned issue_flags)
+{
+ struct io_zcrx_args args = {
+ .req = req,
+ .ifq = ifq,
+ .sock = sk->sk_socket,
+ };
+ read_descriptor_t rd_desc = {
+ .count = 1,
+ .arg.data = &args,
+ };
+ int ret;
+
+ lock_sock(sk);
+ ret = tcp_read_sock(sk, &rd_desc, io_zcrx_recv_skb);
+ if (ret <= 0) {
+ if (ret < 0 || sock_flag(sk, SOCK_DONE))
+ goto out;
+ if (sk->sk_err)
+ ret = sock_error(sk);
+ else if (sk->sk_shutdown & RCV_SHUTDOWN)
+ goto out;
+ else if (sk->sk_state == TCP_CLOSE)
+ ret = -ENOTCONN;
+ else
+ ret = -EAGAIN;
+ } else if (sock_flag(sk, SOCK_DONE)) {
+ /* Make it to retry until it finally gets 0. */
+ if (issue_flags & IO_URING_F_MULTISHOT)
+ ret = IOU_REQUEUE;
+ else
+ ret = -EAGAIN;
+ }
+out:
+ release_sock(sk);
+ return ret;
+}
+
+int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+ struct socket *sock, unsigned int flags,
+ unsigned issue_flags)
+{
+ struct sock *sk = sock->sk;
+ const struct proto *prot = READ_ONCE(sk->sk_prot);
+
+ if (prot->recvmsg != tcp_recvmsg)
+ return -EPROTONOSUPPORT;
+
+ sock_rps_record_flow(sk);
+ return io_zcrx_tcp_recvmsg(req, ifq, sk, flags, issue_flags);
+}
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 1b6363591f72..a16bdd921f03 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -3,6 +3,7 @@
#define IOU_ZC_RX_H
#include <linux/io_uring_types.h>
+#include <linux/socket.h>
#include <net/page_pool/types.h>
#include <net/net_trackers.h>
@@ -43,6 +44,9 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
struct io_uring_zcrx_ifq_reg __user *arg);
void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx);
void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx);
+int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+ struct socket *sock, unsigned int flags,
+ unsigned issue_flags);
#else
static inline int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
struct io_uring_zcrx_ifq_reg __user *arg)
@@ -55,6 +59,15 @@ static inline void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
static inline void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
{
}
+static inline int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+ struct socket *sock, unsigned int flags,
+ unsigned issue_flags)
+{
+ return -EOPNOTSUPP;
+}
#endif
+int io_recvzc(struct io_kiocb *req, unsigned int issue_flags);
+int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+
#endif
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 17/21] io_uring/zcrx: set pp memory provider for an rx queue
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (15 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 16/21] io_uring/zcrx: add io_recvzc request David Wei
@ 2025-01-16 23:16 ` David Wei
2025-01-17 2:13 ` Jakub Kicinski
2025-01-16 23:17 ` [PATCH net-next v11 18/21] io_uring/zcrx: throttle receive requests David Wei
` (4 subsequent siblings)
21 siblings, 1 reply; 43+ messages in thread
From: David Wei @ 2025-01-16 23:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
Set the page pool memory provider for the rx queue configured for zero
copy to io_uring. Then the rx queue is reset using
netdev_rx_queue_restart() and netdev core + page pool will take care of
filling the rx queue from the io_uring zero copy memory provider.
For now, there is only one ifq so its destruction happens implicitly
during io_uring cleanup.
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
io_uring/zcrx.c | 47 +++++++++++++++++++++++++++++++++++++++--------
1 file changed, 39 insertions(+), 8 deletions(-)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index cecfb04a117c..0cfa8c0ecff8 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -11,11 +11,12 @@
#include <net/page_pool/helpers.h>
#include <net/page_pool/memory_provider.h>
#include <net/netlink.h>
-
-#include <trace/events/page_pool.h>
+#include <net/netdev_rx_queue.h>
#include <net/tcp.h>
#include <net/rps.h>
+#include <trace/events/page_pool.h>
+
#include <uapi/linux/io_uring.h>
#include "io_uring.h"
@@ -275,8 +276,32 @@ static void io_zcrx_drop_netdev(struct io_zcrx_ifq *ifq)
spin_unlock(&ifq->lock);
}
+static void io_close_queue(struct io_zcrx_ifq *ifq)
+{
+ struct net_device *netdev;
+ netdevice_tracker netdev_tracker;
+ struct pp_memory_provider_params p = {
+ .mp_ops = &io_uring_pp_zc_ops,
+ .mp_priv = ifq,
+ };
+
+ if (ifq->if_rxq == -1)
+ return;
+
+ spin_lock(&ifq->lock);
+ netdev = ifq->netdev;
+ netdev_tracker = ifq->netdev_tracker;
+ ifq->netdev = NULL;
+ spin_unlock(&ifq->lock);
+
+ if (netdev)
+ net_mp_close_rxq(netdev, ifq->if_rxq, &p);
+ ifq->if_rxq = -1;
+}
+
static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
{
+ io_close_queue(ifq);
io_zcrx_drop_netdev(ifq);
if (ifq->area)
@@ -291,6 +316,7 @@ static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
struct io_uring_zcrx_ifq_reg __user *arg)
{
+ struct pp_memory_provider_params mp_param = {};
struct io_uring_zcrx_area_reg area;
struct io_uring_zcrx_ifq_reg reg;
struct io_uring_region_desc rd;
@@ -341,7 +367,6 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
goto err;
ifq->rq_entries = reg.rq_entries;
- ifq->if_rxq = reg.if_rxq;
ret = -ENODEV;
ifq->netdev = netdev_get_by_index(current->nsproxy->net_ns, reg.if_idx,
@@ -358,16 +383,20 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
if (ret)
goto err;
+ mp_param.mp_ops = &io_uring_pp_zc_ops;
+ mp_param.mp_priv = ifq;
+ ret = net_mp_open_rxq(ifq->netdev, reg.if_rxq, &mp_param);
+ if (ret)
+ goto err;
+ ifq->if_rxq = reg.if_rxq;
+
reg.offsets.rqes = sizeof(struct io_uring);
reg.offsets.head = offsetof(struct io_uring, head);
reg.offsets.tail = offsetof(struct io_uring, tail);
if (copy_to_user(arg, ®, sizeof(reg)) ||
- copy_to_user(u64_to_user_ptr(reg.region_ptr), &rd, sizeof(rd))) {
- ret = -EFAULT;
- goto err;
- }
- if (copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) {
+ copy_to_user(u64_to_user_ptr(reg.region_ptr), &rd, sizeof(rd)) ||
+ copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) {
ret = -EFAULT;
goto err;
}
@@ -444,6 +473,8 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
if (ctx->ifq)
io_zcrx_scrub(ctx->ifq);
+
+ io_close_queue(ctx->ifq);
}
static inline u32 io_zcrx_rqring_entries(struct io_zcrx_ifq *ifq)
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 18/21] io_uring/zcrx: throttle receive requests
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (16 preceding siblings ...)
2025-01-16 23:16 ` [PATCH net-next v11 17/21] io_uring/zcrx: set pp memory provider for an rx queue David Wei
@ 2025-01-16 23:17 ` David Wei
2025-01-16 23:17 ` [PATCH net-next v11 19/21] io_uring/zcrx: add copy fallback David Wei
` (3 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:17 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
From: Pavel Begunkov <[email protected]>
io_zc_rx_tcp_recvmsg() continues until it fails or there is nothing to
receive. If the other side sends fast enough, we might get stuck in
io_zc_rx_tcp_recvmsg() producing more and more CQEs but not letting the
user to handle them leading to unbound latencies.
Break out of it based on an arbitrarily chosen limit, the upper layer
will either return to userspace or requeue the request.
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
io_uring/net.c | 2 ++
io_uring/zcrx.c | 9 +++++++++
2 files changed, 11 insertions(+)
diff --git a/io_uring/net.c b/io_uring/net.c
index 5d8b9a016766..86eaba37e739 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1267,6 +1267,8 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(ret <= 0) && ret != -EAGAIN) {
if (ret == -ERESTARTSYS)
ret = -EINTR;
+ if (ret == IOU_REQUEUE)
+ return IOU_REQUEUE;
req_set_fail(req);
io_req_set_res(req, ret, 0);
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 0cfa8c0ecff8..6f3fbb9337db 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -94,10 +94,13 @@ static void io_zcrx_sync_for_device(const struct page_pool *pool,
#define IO_RQ_MAX_ENTRIES 32768
+#define IO_SKBS_PER_CALL_LIMIT 20
+
struct io_zcrx_args {
struct io_kiocb *req;
struct io_zcrx_ifq *ifq;
struct socket *sock;
+ unsigned nr_skbs;
};
static const struct memory_provider_ops io_uring_pp_zc_ops;
@@ -713,6 +716,9 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
int i, copy, end, off;
int ret = 0;
+ if (unlikely(args->nr_skbs++ > IO_SKBS_PER_CALL_LIMIT))
+ return -EAGAIN;
+
start = skb_headlen(skb);
start_off = offset;
@@ -803,6 +809,9 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
ret = -ENOTCONN;
else
ret = -EAGAIN;
+ } else if (unlikely(args.nr_skbs > IO_SKBS_PER_CALL_LIMIT) &&
+ (issue_flags & IO_URING_F_MULTISHOT)) {
+ ret = IOU_REQUEUE;
} else if (sock_flag(sk, SOCK_DONE)) {
/* Make it to retry until it finally gets 0. */
if (issue_flags & IO_URING_F_MULTISHOT)
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 19/21] io_uring/zcrx: add copy fallback
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (17 preceding siblings ...)
2025-01-16 23:17 ` [PATCH net-next v11 18/21] io_uring/zcrx: throttle receive requests David Wei
@ 2025-01-16 23:17 ` David Wei
2025-01-16 23:17 ` [PATCH net-next v11 20/21] net: add documentation for io_uring zcrx David Wei
` (2 subsequent siblings)
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:17 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
From: Pavel Begunkov <[email protected]>
There are scenarios in which the zerocopy path can get a kernel buffer
instead of a net_iov and needs to copy it to the user, whether it is
because of mis-steering or simply getting an skb with the linear part.
In this case, grab a net_iov, copy into it and return it to the user as
normally.
At the moment the user doesn't get any indication whether there was a
copy or not, which is left for follow up work.
Reviewed-by: Jens Axboe <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
io_uring/zcrx.c | 120 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 114 insertions(+), 6 deletions(-)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 6f3fbb9337db..3366f25d53bd 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -7,6 +7,7 @@
#include <linux/io_uring.h>
#include <linux/netdevice.h>
#include <linux/rtnetlink.h>
+#include <linux/skbuff_ref.h>
#include <net/page_pool/helpers.h>
#include <net/page_pool/memory_provider.h>
@@ -134,6 +135,13 @@ static void io_zcrx_get_niov_uref(struct net_iov *niov)
atomic_inc(io_get_user_counter(niov));
}
+static inline struct page *io_zcrx_iov_page(const struct net_iov *niov)
+{
+ struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
+
+ return area->pages[net_iov_idx(niov)];
+}
+
static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
struct io_uring_zcrx_ifq_reg *reg,
struct io_uring_region_desc *rd)
@@ -446,6 +454,11 @@ static void io_zcrx_return_niov(struct net_iov *niov)
{
netmem_ref netmem = net_iov_to_netmem(niov);
+ if (!niov->pp) {
+ /* copy fallback allocated niovs */
+ io_zcrx_return_niov_freelist(niov);
+ return;
+ }
page_pool_put_unrefed_netmem(niov->pp, netmem, -1, false);
}
@@ -679,13 +692,93 @@ static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov,
return true;
}
+static struct net_iov *io_zcrx_alloc_fallback(struct io_zcrx_area *area)
+{
+ struct net_iov *niov = NULL;
+
+ spin_lock_bh(&area->freelist_lock);
+ if (area->free_count)
+ niov = __io_zcrx_get_free_niov(area);
+ spin_unlock_bh(&area->freelist_lock);
+
+ if (niov)
+ page_pool_fragment_netmem(net_iov_to_netmem(niov), 1);
+ return niov;
+}
+
+static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+ void *src_base, struct page *src_page,
+ unsigned int src_offset, size_t len)
+{
+ struct io_zcrx_area *area = ifq->area;
+ size_t copied = 0;
+ int ret = 0;
+
+ while (len) {
+ size_t copy_size = min_t(size_t, PAGE_SIZE, len);
+ const int dst_off = 0;
+ struct net_iov *niov;
+ struct page *dst_page;
+ void *dst_addr;
+
+ niov = io_zcrx_alloc_fallback(area);
+ if (!niov) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ dst_page = io_zcrx_iov_page(niov);
+ dst_addr = kmap_local_page(dst_page);
+ if (src_page)
+ src_base = kmap_local_page(src_page);
+
+ memcpy(dst_addr, src_base + src_offset, copy_size);
+
+ if (src_page)
+ kunmap_local(src_base);
+ kunmap_local(dst_addr);
+
+ if (!io_zcrx_queue_cqe(req, niov, ifq, dst_off, copy_size)) {
+ io_zcrx_return_niov(niov);
+ ret = -ENOSPC;
+ break;
+ }
+
+ io_zcrx_get_niov_uref(niov);
+ src_offset += copy_size;
+ len -= copy_size;
+ copied += copy_size;
+ }
+
+ return copied ? copied : ret;
+}
+
+static int io_zcrx_copy_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+ const skb_frag_t *frag, int off, int len)
+{
+ struct page *page = skb_frag_page(frag);
+ u32 p_off, p_len, t, copied = 0;
+ int ret = 0;
+
+ off += skb_frag_off(frag);
+
+ skb_frag_foreach_page(frag, off, len,
+ page, p_off, p_len, t) {
+ ret = io_zcrx_copy_chunk(req, ifq, NULL, page, p_off, p_len);
+ if (ret < 0)
+ return copied ? copied : ret;
+ copied += ret;
+ }
+ return copied;
+}
+
static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
const skb_frag_t *frag, int off, int len)
{
struct net_iov *niov;
if (unlikely(!skb_frag_is_net_iov(frag)))
- return -EOPNOTSUPP;
+ return io_zcrx_copy_frag(req, ifq, frag, off, len);
niov = netmem_to_net_iov(frag->netmem);
if (niov->pp->mp_ops != &io_uring_pp_zc_ops ||
@@ -712,18 +805,33 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
struct io_zcrx_ifq *ifq = args->ifq;
struct io_kiocb *req = args->req;
struct sk_buff *frag_iter;
- unsigned start, start_off;
+ unsigned start, start_off = offset;
int i, copy, end, off;
int ret = 0;
if (unlikely(args->nr_skbs++ > IO_SKBS_PER_CALL_LIMIT))
return -EAGAIN;
- start = skb_headlen(skb);
- start_off = offset;
+ if (unlikely(offset < skb_headlen(skb))) {
+ ssize_t copied;
+ size_t to_copy;
- if (offset < start)
- return -EOPNOTSUPP;
+ to_copy = min_t(size_t, skb_headlen(skb) - offset, len);
+ copied = io_zcrx_copy_chunk(req, ifq, skb->data, NULL,
+ offset, to_copy);
+ if (copied < 0) {
+ ret = copied;
+ goto out;
+ }
+ offset += copied;
+ len -= copied;
+ if (!len)
+ goto out;
+ if (offset != skb_headlen(skb))
+ goto out;
+ }
+
+ start = skb_headlen(skb);
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
const skb_frag_t *frag;
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 20/21] net: add documentation for io_uring zcrx
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (18 preceding siblings ...)
2025-01-16 23:17 ` [PATCH net-next v11 19/21] io_uring/zcrx: add copy fallback David Wei
@ 2025-01-16 23:17 ` David Wei
2025-01-16 23:17 ` [PATCH net-next v11 21/21] io_uring/zcrx: add selftest David Wei
2025-01-17 14:28 ` [PATCH net-next v11 00/21] io_uring zero copy rx Paolo Abeni
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:17 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
Add documentation for io_uring zero copy Rx that explains requirements
and the user API.
Signed-off-by: David Wei <[email protected]>
---
Documentation/networking/index.rst | 1 +
Documentation/networking/iou-zcrx.rst | 201 ++++++++++++++++++++++++++
2 files changed, 202 insertions(+)
create mode 100644 Documentation/networking/iou-zcrx.rst
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 058193ed2eeb..c64133d309bf 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -63,6 +63,7 @@ Contents:
gtp
ila
ioam6-sysctl
+ iou-zcrx
ip_dynaddr
ipsec
ip-sysctl
diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst
new file mode 100644
index 000000000000..7f6b7c072b59
--- /dev/null
+++ b/Documentation/networking/iou-zcrx.rst
@@ -0,0 +1,201 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+io_uring zero copy Rx
+=====================
+
+Introduction
+============
+
+io_uring zero copy Rx (ZC Rx) is a feature that removes kernel-to-user copy on
+the network receive path, allowing packet data to be received directly into
+userspace memory. This feature is different to TCP_ZEROCOPY_RECEIVE in that
+there are no strict alignment requirements and no need to mmap()/munmap().
+Compared to kernel bypass solutions such as e.g. DPDK, the packet headers are
+processed by the kernel TCP stack as normal.
+
+NIC HW Requirements
+===================
+
+Several NIC HW features are required for io_uring ZC Rx to work. For now the
+kernel API does not configure the NIC and it must be done by the user.
+
+Header/data split
+-----------------
+
+Required to split packets at the L4 boundary into a header and a payload.
+Headers are received into kernel memory as normal and processed by the TCP
+stack as normal. Payloads are received into userspace memory directly.
+
+Flow steering
+-------------
+
+Specific HW Rx queues are configured for this feature, but modern NICs
+typically distribute flows across all HW Rx queues. Flow steering is required
+to ensure that only desired flows are directed towards HW queues that are
+configured for io_uring ZC Rx.
+
+RSS
+---
+
+In addition to flow steering above, RSS is required to steer all other non-zero
+copy flows away from queues that are configured for io_uring ZC Rx.
+
+Usage
+=====
+
+Setup NIC
+---------
+
+Must be done out of band for now.
+
+Ensure there are at least two queues::
+
+ ethtool -L eth0 combined 2
+
+Enable header/data split::
+
+ ethtool -G eth0 tcp-data-split on
+
+Carve out half of the HW Rx queues for zero copy using RSS::
+
+ ethtool -X eth0 equal 1
+
+Set up flow steering, bearing in mind that queues are 0-indexed::
+
+ ethtool -N eth0 flow-type tcp6 ... action 1
+
+Setup io_uring
+--------------
+
+This section describes the low level io_uring kernel API. Please refer to
+liburing documentation for how to use the higher level API.
+
+Create an io_uring instance with the following required setup flags::
+
+ IORING_SETUP_SINGLE_ISSUER
+ IORING_SETUP_DEFER_TASKRUN
+ IORING_SETUP_CQE32
+
+Create memory area
+------------------
+
+Allocate userspace memory area for receiving zero copy data::
+
+ void *area_ptr = mmap(NULL, area_size,
+ PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE,
+ 0, 0);
+
+Create refill ring
+------------------
+
+Allocate memory for a shared ringbuf used for returning consumed buffers::
+
+ void *ring_ptr = mmap(NULL, ring_size,
+ PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE,
+ 0, 0);
+
+This refill ring consists of some space for the header, followed by an array of
+``struct io_uring_zcrx_rqe``::
+
+ size_t rq_entries = 4096;
+ size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
+ /* align to page size */
+ ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
+
+Register ZC Rx
+--------------
+
+Fill in registration structs::
+
+ struct io_uring_zcrx_area_reg area_reg = {
+ .addr = (__u64)(unsigned long)area_ptr,
+ .len = area_size,
+ .flags = 0,
+ };
+
+ struct io_uring_region_desc region_reg = {
+ .user_addr = (__u64)(unsigned long)ring_ptr,
+ .size = ring_size,
+ .flags = IORING_MEM_REGION_TYPE_USER,
+ };
+
+ struct io_uring_zcrx_ifq_reg reg = {
+ .if_idx = if_nametoindex("eth0"),
+ /* this is the HW queue with desired flow steered into it */
+ .if_rxq = 1,
+ .rq_entries = rq_entries,
+ .area_ptr = (__u64)(unsigned long)&area_reg,
+ .region_ptr = (__u64)(unsigned long)®ion_reg,
+ };
+
+Register with kernel::
+
+ io_uring_register_ifq(ring, ®);
+
+Map refill ring
+---------------
+
+The kernel fills in fields for the refill ring in the registration ``struct
+io_uring_zcrx_ifq_reg``. Map it into userspace::
+
+ struct io_uring_zcrx_rq refill_ring;
+
+ refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
+ refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
+ refill_ring.rqes =
+ (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
+ refill_ring.rq_tail = 0;
+ refill_ring.ring_ptr = ring_ptr;
+
+Receiving data
+--------------
+
+Prepare a zero copy recv request::
+
+ struct io_uring_sqe *sqe;
+
+ sqe = io_uring_get_sqe(ring);
+ io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
+ sqe->ioprio |= IORING_RECV_MULTISHOT;
+
+Now, submit and wait::
+
+ io_uring_submit_and_wait(ring, 1);
+
+Finally, process completions::
+
+ struct io_uring_cqe *cqe;
+ unsigned int count = 0;
+ unsigned int head;
+
+ io_uring_for_each_cqe(ring, head, cqe) {
+ struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
+
+ unsigned char *data = area_ptr + (rcqe->off & IORING_ZCRX_AREA_MASK);
+ /* do something with the data */
+
+ count++;
+ }
+ io_uring_cq_advance(ring, count);
+
+Recycling buffers
+-----------------
+
+Return buffers back to the kernel to be used again::
+
+ struct io_uring_zcrx_rqe *rqe;
+ unsigned mask = refill_ring.ring_entries - 1;
+ rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];
+
+ area_offset = rcqe->off & IORING_ZCRX_AREA_MASK;
+ rqe->off = area_offset | area_reg.rq_area_token;
+ rqe->len = cqe->res;
+ IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);
+
+Testing
+=======
+
+See ``tools/testing/selftests/drivers/net/hw/iou-zcrx.c``
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH net-next v11 21/21] io_uring/zcrx: add selftest
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (19 preceding siblings ...)
2025-01-16 23:17 ` [PATCH net-next v11 20/21] net: add documentation for io_uring zcrx David Wei
@ 2025-01-16 23:17 ` David Wei
2025-01-17 14:28 ` [PATCH net-next v11 00/21] io_uring zero copy rx Paolo Abeni
21 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-16 23:17 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
Add a selftest for io_uring zero copy Rx. This test cannot run locally
and requires a remote host to be configured in net.config. The remote
host must have hardware support for zero copy Rx as listed in the
documentation page. The test will restore the NIC config back to before
the test and is idempotent.
liburing is required to compile the test and be installed on the remote
host running the test.
Signed-off-by: David Wei <[email protected]>
---
.../selftests/drivers/net/hw/.gitignore | 2 +
.../testing/selftests/drivers/net/hw/Makefile | 5 +
.../selftests/drivers/net/hw/iou-zcrx.c | 432 ++++++++++++++++++
.../selftests/drivers/net/hw/iou-zcrx.py | 64 +++
4 files changed, 503 insertions(+)
create mode 100644 tools/testing/selftests/drivers/net/hw/iou-zcrx.c
create mode 100755 tools/testing/selftests/drivers/net/hw/iou-zcrx.py
diff --git a/tools/testing/selftests/drivers/net/hw/.gitignore b/tools/testing/selftests/drivers/net/hw/.gitignore
index e9fe6ede681a..6942bf575497 100644
--- a/tools/testing/selftests/drivers/net/hw/.gitignore
+++ b/tools/testing/selftests/drivers/net/hw/.gitignore
@@ -1 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+iou-zcrx
ncdevmem
diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index 21ba64ce1e34..7efc47c89463 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -1,5 +1,7 @@
# SPDX-License-Identifier: GPL-2.0+ OR MIT
+TEST_GEN_FILES = iou-zcrx
+
TEST_PROGS = \
csum.py \
devlink_port_split.py \
@@ -10,6 +12,7 @@ TEST_PROGS = \
ethtool_rmon.sh \
hw_stats_l3.sh \
hw_stats_l3_gre.sh \
+ iou-zcrx.py \
loopback.sh \
nic_link_layer.py \
nic_performance.py \
@@ -38,3 +41,5 @@ include ../../../lib.mk
# YNL build
YNL_GENS := ethtool netdev
include ../../../net/ynl.mk
+
+$(OUTPUT)/iou-zcrx: LDLIBS += -luring
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
new file mode 100644
index 000000000000..0809db134bba
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -0,0 +1,432 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <assert.h>
+#include <errno.h>
+#include <error.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <arpa/inet.h>
+#include <linux/errqueue.h>
+#include <linux/if_packet.h>
+#include <linux/ipv6.h>
+#include <linux/socket.h>
+#include <linux/sockios.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+#include <netinet/tcp.h>
+#include <netinet/udp.h>
+#include <sys/epoll.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+
+#include <liburing.h>
+
+#define PAGE_SIZE (4096)
+#define AREA_SIZE (8192 * PAGE_SIZE)
+#define SEND_SIZE (512 * 4096)
+#define min(a, b) \
+ ({ \
+ typeof(a) _a = (a); \
+ typeof(b) _b = (b); \
+ _a < _b ? _a : _b; \
+ })
+#define min_t(t, a, b) \
+ ({ \
+ t _ta = (a); \
+ t _tb = (b); \
+ min(_ta, _tb); \
+ })
+
+#define ALIGN_UP(v, align) (((v) + (align) - 1) & ~((align) - 1))
+
+static int cfg_family = PF_UNSPEC;
+static int cfg_server;
+static int cfg_client;
+static int cfg_port = 8000;
+static int cfg_payload_len;
+static const char *cfg_ifname;
+static int cfg_queue_id = -1;
+
+static socklen_t cfg_alen;
+static struct sockaddr_storage cfg_addr;
+
+static char payload[SEND_SIZE] __attribute__((aligned(PAGE_SIZE)));
+static void *area_ptr;
+static void *ring_ptr;
+static size_t ring_size;
+static struct io_uring_zcrx_rq rq_ring;
+static unsigned long area_token;
+static int connfd;
+static bool stop;
+static size_t received;
+
+static unsigned long gettimeofday_ms(void)
+{
+ struct timeval tv;
+
+ gettimeofday(&tv, NULL);
+ return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static inline size_t get_refill_ring_size(unsigned int rq_entries)
+{
+ size_t size;
+
+ ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe);
+ /* add space for the header (head/tail/etc.) */
+ ring_size += PAGE_SIZE;
+ return ALIGN_UP(ring_size, 4096);
+}
+
+static void setup_zcrx(struct io_uring *ring)
+{
+ unsigned int ifindex;
+ unsigned int rq_entries = 4096;
+ int ret;
+
+ ifindex = if_nametoindex(cfg_ifname);
+ if (!ifindex)
+ error(1, 0, "bad interface name: %s", cfg_ifname);
+
+ area_ptr = mmap(NULL,
+ AREA_SIZE,
+ PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE,
+ 0,
+ 0);
+ if (area_ptr == MAP_FAILED)
+ error(1, 0, "mmap(): zero copy area");
+
+ ring_size = get_refill_ring_size(rq_entries);
+ ring_ptr = mmap(NULL,
+ ring_size,
+ PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE,
+ 0,
+ 0);
+
+ struct io_uring_region_desc region_reg = {
+ .size = ring_size,
+ .user_addr = (__u64)(unsigned long)ring_ptr,
+ .flags = IORING_MEM_REGION_TYPE_USER,
+ };
+
+ struct io_uring_zcrx_area_reg area_reg = {
+ .addr = (__u64)(unsigned long)area_ptr,
+ .len = AREA_SIZE,
+ .flags = 0,
+ };
+
+ struct io_uring_zcrx_ifq_reg reg = {
+ .if_idx = ifindex,
+ .if_rxq = cfg_queue_id,
+ .rq_entries = rq_entries,
+ .area_ptr = (__u64)(unsigned long)&area_reg,
+ .region_ptr = (__u64)(unsigned long)®ion_reg,
+ };
+
+ ret = io_uring_register_ifq(ring, ®);
+ if (ret)
+ error(1, 0, "io_uring_register_ifq(): %d", ret);
+
+ rq_ring.khead = (unsigned int *)((char *)ring_ptr + reg.offsets.head);
+ rq_ring.ktail = (unsigned int *)((char *)ring_ptr + reg.offsets.tail);
+ rq_ring.rqes = (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
+ rq_ring.rq_tail = 0;
+ rq_ring.ring_entries = reg.rq_entries;
+
+ area_token = area_reg.rq_area_token;
+}
+
+static void add_accept(struct io_uring *ring, int sockfd)
+{
+ struct io_uring_sqe *sqe;
+
+ sqe = io_uring_get_sqe(ring);
+
+ io_uring_prep_accept(sqe, sockfd, NULL, NULL, 0);
+ sqe->user_data = 1;
+}
+
+static void add_recvzc(struct io_uring *ring, int sockfd)
+{
+ struct io_uring_sqe *sqe;
+
+ sqe = io_uring_get_sqe(ring);
+
+ io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, 0, 0);
+ sqe->ioprio |= IORING_RECV_MULTISHOT;
+ sqe->user_data = 2;
+}
+
+static void process_accept(struct io_uring *ring, struct io_uring_cqe *cqe)
+{
+ if (cqe->res < 0)
+ error(1, 0, "accept()");
+ if (connfd)
+ error(1, 0, "Unexpected second connection");
+
+ connfd = cqe->res;
+ add_recvzc(ring, connfd);
+}
+
+static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
+{
+ unsigned rq_mask = rq_ring.ring_entries - 1;
+ struct io_uring_zcrx_cqe *rcqe;
+ struct io_uring_zcrx_rqe *rqe;
+ struct io_uring_sqe *sqe;
+ uint64_t mask;
+ char *data;
+ ssize_t n;
+ int i;
+
+ if (cqe->res == 0 && cqe->flags == 0) {
+ stop = true;
+ return;
+ }
+
+ if (cqe->res < 0)
+ error(1, 0, "recvzc(): %d", cqe->res);
+
+ if (!(cqe->flags & IORING_CQE_F_MORE))
+ add_recvzc(ring, connfd);
+
+ rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
+
+ n = cqe->res;
+ mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
+ data = (char *)area_ptr + (rcqe->off & mask);
+
+ for (i = 0; i < n; i++) {
+ if (*(data + i) != payload[(received + i)])
+ error(1, 0, "payload mismatch");
+ }
+ received += n;
+
+ rqe = &rq_ring.rqes[(rq_ring.rq_tail & rq_mask)];
+ rqe->off = (rcqe->off & IORING_ZCRX_AREA_MASK) | area_token;
+ rqe->len = cqe->res;
+ io_uring_smp_store_release(rq_ring.ktail, ++rq_ring.rq_tail);
+}
+
+static void server_loop(struct io_uring *ring)
+{
+ struct io_uring_cqe *cqe;
+ unsigned int count = 0;
+ unsigned int head;
+ int i, ret;
+
+ io_uring_submit_and_wait(ring, 1);
+
+ io_uring_for_each_cqe(ring, head, cqe) {
+ if (cqe->user_data == 1)
+ process_accept(ring, cqe);
+ else if (cqe->user_data == 2)
+ process_recvzc(ring, cqe);
+ else
+ error(1, 0, "unknown cqe");
+ count++;
+ }
+ io_uring_cq_advance(ring, count);
+}
+
+static void run_server(void)
+{
+ unsigned int flags = 0;
+ struct io_uring ring;
+ int fd, enable, ret;
+ uint64_t tstop;
+
+ fd = socket(cfg_family, SOCK_STREAM, 0);
+ if (fd == -1)
+ error(1, 0, "socket()");
+
+ enable = 1;
+ ret = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(int));
+ if (ret < 0)
+ error(1, 0, "setsockopt(SO_REUSEADDR)");
+
+ ret = bind(fd, (const struct sockaddr *)&cfg_addr, sizeof(cfg_addr));
+ if (ret < 0)
+ error(1, 0, "bind()");
+
+ if (listen(fd, 1024) < 0)
+ error(1, 0, "listen()");
+
+ flags |= IORING_SETUP_COOP_TASKRUN;
+ flags |= IORING_SETUP_SINGLE_ISSUER;
+ flags |= IORING_SETUP_DEFER_TASKRUN;
+ flags |= IORING_SETUP_SUBMIT_ALL;
+ flags |= IORING_SETUP_CQE32;
+
+ io_uring_queue_init(512, &ring, flags);
+
+ setup_zcrx(&ring);
+
+ add_accept(&ring, fd);
+
+ tstop = gettimeofday_ms() + 5000;
+ while (!stop && gettimeofday_ms() < tstop)
+ server_loop(&ring);
+
+ if (!stop)
+ error(1, 0, "test failed\n");
+}
+
+static void run_client(void)
+{
+ ssize_t to_send = SEND_SIZE;
+ ssize_t sent = 0;
+ ssize_t chunk, res;
+ int fd;
+
+ fd = socket(cfg_family, SOCK_STREAM, 0);
+ if (fd == -1)
+ error(1, 0, "socket()");
+
+ if (connect(fd, (void *)&cfg_addr, cfg_alen))
+ error(1, 0, "connect()");
+
+ while (to_send) {
+ void *src = &payload[sent];
+
+ chunk = min_t(ssize_t, cfg_payload_len, to_send);
+ res = send(fd, src, chunk, 0);
+ if (res < 0)
+ error(1, 0, "send(): %d", sent);
+ sent += res;
+ to_send -= res;
+ }
+
+ close(fd);
+}
+
+static void usage(const char *filepath)
+{
+ error(1, 0, "Usage: %s (-4|-6) (-s|-c) -h<server_ip> -p<port> "
+ "-l<payload_size> -i<ifname> -q<rxq_id>", filepath);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+ const int max_payload_len = sizeof(payload) -
+ sizeof(struct ipv6hdr) -
+ sizeof(struct tcphdr) -
+ 40 /* max tcp options */;
+ struct sockaddr_in6 *addr6 = (void *) &cfg_addr;
+ struct sockaddr_in *addr4 = (void *) &cfg_addr;
+ char *addr = NULL;
+ int c;
+
+ if (argc <= 1)
+ usage(argv[0]);
+ cfg_payload_len = max_payload_len;
+
+ while ((c = getopt(argc, argv, "46sch:p:l:i:q:")) != -1) {
+ switch (c) {
+ case '4':
+ if (cfg_family != PF_UNSPEC)
+ error(1, 0, "Pass one of -4 or -6");
+ cfg_family = PF_INET;
+ cfg_alen = sizeof(struct sockaddr_in);
+ break;
+ case '6':
+ if (cfg_family != PF_UNSPEC)
+ error(1, 0, "Pass one of -4 or -6");
+ cfg_family = PF_INET6;
+ cfg_alen = sizeof(struct sockaddr_in6);
+ break;
+ case 's':
+ if (cfg_client)
+ error(1, 0, "Pass one of -s or -c");
+ cfg_server = 1;
+ break;
+ case 'c':
+ if (cfg_server)
+ error(1, 0, "Pass one of -s or -c");
+ cfg_client = 1;
+ break;
+ case 'h':
+ addr = optarg;
+ break;
+ case 'p':
+ cfg_port = strtoul(optarg, NULL, 0);
+ break;
+ case 'l':
+ cfg_payload_len = strtoul(optarg, NULL, 0);
+ break;
+ case 'i':
+ cfg_ifname = optarg;
+ break;
+ case 'q':
+ cfg_queue_id = strtoul(optarg, NULL, 0);
+ break;
+ }
+ }
+
+ if (cfg_server && addr)
+ error(1, 0, "Receiver cannot have -h specified");
+
+ switch (cfg_family) {
+ case PF_INET:
+ memset(addr4, 0, sizeof(*addr4));
+ addr4->sin_family = AF_INET;
+ addr4->sin_port = htons(cfg_port);
+ addr4->sin_addr.s_addr = htonl(INADDR_ANY);
+
+ if (addr &&
+ inet_pton(AF_INET, addr, &(addr4->sin_addr)) != 1)
+ error(1, 0, "ipv4 parse error: %s", addr);
+ break;
+ case PF_INET6:
+ memset(addr6, 0, sizeof(*addr6));
+ addr6->sin6_family = AF_INET6;
+ addr6->sin6_port = htons(cfg_port);
+ addr6->sin6_addr = in6addr_any;
+
+ if (addr &&
+ inet_pton(AF_INET6, addr, &(addr6->sin6_addr)) != 1)
+ error(1, 0, "ipv6 parse error: %s", addr);
+ break;
+ default:
+ error(1, 0, "illegal domain");
+ }
+
+ if (cfg_payload_len > max_payload_len)
+ error(1, 0, "-l: payload exceeds max (%d)", max_payload_len);
+}
+
+int main(int argc, char **argv)
+{
+ const char *cfg_test = argv[argc - 1];
+ int i;
+
+ parse_opts(argc, argv);
+
+ for (i = 0; i < SEND_SIZE; i++)
+ payload[i] = 'a' + (i % 26);
+
+ if (cfg_server)
+ run_server();
+ else if (cfg_client)
+ run_client();
+
+ return 0;
+}
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
new file mode 100755
index 000000000000..3998d0ad504f
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
@@ -0,0 +1,64 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+from os import path
+from lib.py import ksft_run, ksft_exit
+from lib.py import NetDrvEpEnv
+from lib.py import bkg, cmd, wait_port_listen
+
+
+def _get_rx_ring_entries(cfg):
+ eth_cmd = "ethtool -g {} | awk '/RX:/ {{count++}} count == 2 {{print $2; exit}}'"
+ res = cmd(eth_cmd.format(cfg.ifname), host=cfg.remote)
+ return int(res.stdout)
+
+
+def _get_combined_channels(cfg):
+ eth_cmd = "ethtool -l {} | awk '/Combined:/ {{count++}} count == 2 {{print $2; exit}}'"
+ res = cmd(eth_cmd.format(cfg.ifname), host=cfg.remote)
+ return int(res.stdout)
+
+
+def _set_flow_rule(cfg, chan):
+ eth_cmd = "ethtool -N {} flow-type tcp6 dst-port 9999 action {} | awk '{{print $NF}}'"
+ res = cmd(eth_cmd.format(cfg.ifname, chan), host=cfg.remote)
+ return int(res.stdout)
+
+
+def test_zcrx(cfg) -> None:
+ cfg.require_v6()
+ cfg.require_cmd("awk", remote=True)
+
+ combined_chans = _get_combined_channels(cfg)
+ if combined_chans < 2:
+ raise KsftSkipEx('at least 2 combined channels required')
+ rx_ring = _get_rx_ring_entries(cfg)
+
+ rx_cmd = f"{cfg.bin_remote} -6 -s -p 9999 -i {cfg.ifname} -q {combined_chans - 1}"
+ tx_cmd = f"{cfg.bin_local} -6 -c -h {cfg.remote_v6} -p 9999 -l 12840"
+
+ try:
+ cmd(f"ethtool -G {cfg.ifname} rx 64", host=cfg.remote)
+ cmd(f"ethtool -X {cfg.ifname} equal {combined_chans - 1}", host=cfg.remote)
+ flow_rule_id = _set_flow_rule(cfg, combined_chans - 1)
+
+ with bkg(rx_cmd, host=cfg.remote, exit_wait=True):
+ wait_port_listen(9999, proto="tcp", host=cfg.remote)
+ cmd(tx_cmd)
+ finally:
+ cmd(f"ethtool -N {cfg.ifname} delete {flow_rule_id}", host=cfg.remote)
+ cmd(f"ethtool -X {cfg.ifname} default", host=cfg.remote)
+ cmd(f"ethtool -G {cfg.ifname} rx {rx_ring}", host=cfg.remote)
+
+
+def main() -> None:
+ with NetDrvEpEnv(__file__) as cfg:
+ cfg.bin_local = path.abspath(path.dirname(__file__) + "/../../../drivers/net/hw/iou-zcrx")
+ cfg.bin_remote = cfg.remote.deploy(cfg.bin_local)
+
+ ksft_run(globs=globals(), case_pfx={"test_"}, args=(cfg, ))
+ ksft_exit()
+
+
+if __name__ == "__main__":
+ main()
--
2.43.5
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 04/21] net: page_pool: create hooks for custom memory providers
2025-01-16 23:16 ` [PATCH net-next v11 04/21] net: page_pool: create hooks for custom memory providers David Wei
@ 2025-01-17 1:46 ` Jakub Kicinski
2025-01-17 1:48 ` Jakub Kicinski
0 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-01-17 1:46 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
On Thu, 16 Jan 2025 15:16:46 -0800 David Wei wrote:
> From: Pavel Begunkov <[email protected]>
>
> A spin off from the original page pool memory providers patch by Jakub,
> which allows extending page pools with custom allocators. One of such
> providers is devmem TCP, and the other is io_uring zerocopy added in
> following patches.
>
> Co-developed-by: Jakub Kicinski <[email protected]> # initial mp proposal
> Link: https://lore.kernel.org/netdev/[email protected]/
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
FWIW you still need to add my SoB for Co-developed-by.
Doesn't checkpatch complain?
I guess not a deal breaker here if I'm the one applying it...
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 09/21] net: page_pool: add memory provider helpers
2025-01-16 23:16 ` [PATCH net-next v11 09/21] net: page_pool: add memory provider helpers David Wei
@ 2025-01-17 1:47 ` Jakub Kicinski
0 siblings, 0 replies; 43+ messages in thread
From: Jakub Kicinski @ 2025-01-17 1:47 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
On Thu, 16 Jan 2025 15:16:51 -0800 David Wei wrote:
> Add helpers for memory providers to interact with page pools.
> net_mp_niov_{set,clear}_page_pool() serve to [dis]associate a net_iov
> with a page pool. If used, the memory provider is responsible to match
> "set" calls with "clear" once a net_iov is not going to be used by a page
> pool anymore, changing a page pool, etc.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
Acked-by: Jakub Kicinski <[email protected]>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 04/21] net: page_pool: create hooks for custom memory providers
2025-01-17 1:46 ` Jakub Kicinski
@ 2025-01-17 1:48 ` Jakub Kicinski
2025-01-17 2:15 ` David Wei
0 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-01-17 1:48 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
On Thu, 16 Jan 2025 17:46:34 -0800 Jakub Kicinski wrote:
> On Thu, 16 Jan 2025 15:16:46 -0800 David Wei wrote:
> > From: Pavel Begunkov <[email protected]>
> >
> > A spin off from the original page pool memory providers patch by Jakub,
> > which allows extending page pools with custom allocators. One of such
> > providers is devmem TCP, and the other is io_uring zerocopy added in
> > following patches.
> >
> > Co-developed-by: Jakub Kicinski <[email protected]> # initial mp proposal
> > Link: https://lore.kernel.org/netdev/[email protected]/
> > Signed-off-by: Pavel Begunkov <[email protected]>
> > Signed-off-by: David Wei <[email protected]>
>
> FWIW you still need to add my SoB for Co-developed-by.
> Doesn't checkpatch complain?
> I guess not a deal breaker here if I'm the one applying it...
Ah, and in case I am not..
Signed-off-by: Jakub Kicinski <[email protected]>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue
2025-01-16 23:16 ` [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue David Wei
@ 2025-01-17 1:50 ` Jakub Kicinski
2025-01-17 1:52 ` Jakub Kicinski
2025-01-17 2:25 ` Jakub Kicinski
2 siblings, 0 replies; 43+ messages in thread
From: Jakub Kicinski @ 2025-01-17 1:50 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
On Thu, 16 Jan 2025 15:16:52 -0800 David Wei wrote:
> Add helpers that properly prep or remove a memory provider for an rx
> queue then restart the queue.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
Reviewed-by: Jakub Kicinski <[email protected]>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue
2025-01-16 23:16 ` [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue David Wei
2025-01-17 1:50 ` Jakub Kicinski
@ 2025-01-17 1:52 ` Jakub Kicinski
2025-01-17 2:17 ` David Wei
2025-01-17 2:25 ` Jakub Kicinski
2 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-01-17 1:52 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
On Thu, 16 Jan 2025 15:16:52 -0800 David Wei wrote:
> + ret = netdev_rx_queue_restart(dev, ifq_idx);
> + if (ret)
> + pr_devel("Could not restart queue %u after removing memory provider.\n",
> + ifq_idx);
devmem does a WARN_ON() in this case, probably better to keep it
consistent? In case you need to respin
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 14/21] io_uring/zcrx: implement zerocopy receive pp memory provider
2025-01-16 23:16 ` [PATCH net-next v11 14/21] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
@ 2025-01-17 2:07 ` Jakub Kicinski
2025-01-17 2:17 ` David Wei
0 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-01-17 2:07 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
On Thu, 16 Jan 2025 15:16:56 -0800 David Wei wrote:
> + type = rxq ? NETDEV_A_QUEUE_IO_URING : NETDEV_A_PAGE_POOL_IO_URING;
> + nest = nla_nest_start(rsp, type);
> + nla_nest_end(rsp, nest);
nla_nest_start() can fail, you should return -EMSGSIZE if it does
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 17/21] io_uring/zcrx: set pp memory provider for an rx queue
2025-01-16 23:16 ` [PATCH net-next v11 17/21] io_uring/zcrx: set pp memory provider for an rx queue David Wei
@ 2025-01-17 2:13 ` Jakub Kicinski
2025-01-17 2:38 ` Pavel Begunkov
0 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-01-17 2:13 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
On Thu, 16 Jan 2025 15:16:59 -0800 David Wei wrote:
> +static void io_close_queue(struct io_zcrx_ifq *ifq)
> +{
> + struct net_device *netdev;
> + netdevice_tracker netdev_tracker;
> + struct pp_memory_provider_params p = {
> + .mp_ops = &io_uring_pp_zc_ops,
> + .mp_priv = ifq,
> + };
> +
> + if (ifq->if_rxq == -1)
> + return;
> +
> + spin_lock(&ifq->lock);
> + netdev = ifq->netdev;
> + netdev_tracker = ifq->netdev_tracker;
> + ifq->netdev = NULL;
> + spin_unlock(&ifq->lock);
> +
> + if (netdev)
> + net_mp_close_rxq(netdev, ifq->if_rxq, &p);
> + ifq->if_rxq = -1;
> +}
Did you mean to call netdev_put() somewhere here? :S
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 04/21] net: page_pool: create hooks for custom memory providers
2025-01-17 1:48 ` Jakub Kicinski
@ 2025-01-17 2:15 ` David Wei
0 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-17 2:15 UTC (permalink / raw)
To: Jakub Kicinski
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
On 2025-01-16 17:48, Jakub Kicinski wrote:
> On Thu, 16 Jan 2025 17:46:34 -0800 Jakub Kicinski wrote:
>> On Thu, 16 Jan 2025 15:16:46 -0800 David Wei wrote:
>>> From: Pavel Begunkov <[email protected]>
>>>
>>> A spin off from the original page pool memory providers patch by Jakub,
>>> which allows extending page pools with custom allocators. One of such
>>> providers is devmem TCP, and the other is io_uring zerocopy added in
>>> following patches.
>>>
>>> Co-developed-by: Jakub Kicinski <[email protected]> # initial mp proposal
>>> Link: https://lore.kernel.org/netdev/[email protected]/
>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>> Signed-off-by: David Wei <[email protected]>
>>
>> FWIW you still need to add my SoB for Co-developed-by.
>> Doesn't checkpatch complain?
>> I guess not a deal breaker here if I'm the one applying it...
>
> Ah, and in case I am not..
>
> Signed-off-by: Jakub Kicinski <[email protected]>
I'm sorry... Checkpatch does complain but I missed it in the sea of
warnings of 'prefer unsigned int to bare unsigned'.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue
2025-01-17 1:52 ` Jakub Kicinski
@ 2025-01-17 2:17 ` David Wei
0 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-17 2:17 UTC (permalink / raw)
To: Jakub Kicinski
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
On 2025-01-16 17:52, Jakub Kicinski wrote:
> On Thu, 16 Jan 2025 15:16:52 -0800 David Wei wrote:
>> + ret = netdev_rx_queue_restart(dev, ifq_idx);
>> + if (ret)
>> + pr_devel("Could not restart queue %u after removing memory provider.\n",
>> + ifq_idx);
>
> devmem does a WARN_ON() in this case, probably better to keep it
> consistent? In case you need to respin
Yeah I see it, thanks. It looks like I need to respin so I'll fix it up.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 14/21] io_uring/zcrx: implement zerocopy receive pp memory provider
2025-01-17 2:07 ` Jakub Kicinski
@ 2025-01-17 2:17 ` David Wei
0 siblings, 0 replies; 43+ messages in thread
From: David Wei @ 2025-01-17 2:17 UTC (permalink / raw)
To: Jakub Kicinski
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
On 2025-01-16 18:07, Jakub Kicinski wrote:
> On Thu, 16 Jan 2025 15:16:56 -0800 David Wei wrote:
>> + type = rxq ? NETDEV_A_QUEUE_IO_URING : NETDEV_A_PAGE_POOL_IO_URING;
>> + nest = nla_nest_start(rsp, type);
>> + nla_nest_end(rsp, nest);
>
> nla_nest_start() can fail, you should return -EMSGSIZE if it does
Thanks, will fix.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue
2025-01-16 23:16 ` [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue David Wei
2025-01-17 1:50 ` Jakub Kicinski
2025-01-17 1:52 ` Jakub Kicinski
@ 2025-01-17 2:25 ` Jakub Kicinski
2025-01-17 2:47 ` Pavel Begunkov
2 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-01-17 2:25 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela
On Thu, 16 Jan 2025 15:16:52 -0800 David Wei wrote:
> +static void __net_mp_close_rxq(struct net_device *dev, unsigned ifq_idx,
> + struct pp_memory_provider_params *old_p)
> +{
> + struct netdev_rx_queue *rxq;
> + int ret;
> +
> + if (WARN_ON_ONCE(ifq_idx >= dev->real_num_rx_queues))
> + return;
> +
> + rxq = __netif_get_rx_queue(dev, ifq_idx);
I think there's a small race between io_uring closing and the netdev
unregister. We can try to uninstall twice, let's put
/* Callers holding a netdev ref may get here after we already
* went thru shutdown via dev_memory_provider_uninstall().
*/
if (dev->reg_state > NETREG_REGISTERED &&
!rxq->mp_params.mp_ops)
return;
here, and in dev_memory_provider_uninstall() clear the pointers?
> + if (WARN_ON_ONCE(rxq->mp_params.mp_ops != old_p->mp_ops ||
> + rxq->mp_params.mp_priv != old_p->mp_priv))
> + return;
> +
> + rxq->mp_params.mp_ops = NULL;
> + rxq->mp_params.mp_priv = NULL;
> + ret = netdev_rx_queue_restart(dev, ifq_idx);
> + if (ret)
> + pr_devel("Could not restart queue %u after removing memory provider.\n",
> + ifq_idx);
> +}
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 17/21] io_uring/zcrx: set pp memory provider for an rx queue
2025-01-17 2:13 ` Jakub Kicinski
@ 2025-01-17 2:38 ` Pavel Begunkov
0 siblings, 0 replies; 43+ messages in thread
From: Pavel Begunkov @ 2025-01-17 2:38 UTC (permalink / raw)
To: Jakub Kicinski, David Wei
Cc: io-uring, netdev, Jens Axboe, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry,
Stanislav Fomichev, Joe Damato, Pedro Tammela
On 1/17/25 02:13, Jakub Kicinski wrote:
> On Thu, 16 Jan 2025 15:16:59 -0800 David Wei wrote:
>> +static void io_close_queue(struct io_zcrx_ifq *ifq)
>> +{
>> + struct net_device *netdev;
>> + netdevice_tracker netdev_tracker;
>> + struct pp_memory_provider_params p = {
>> + .mp_ops = &io_uring_pp_zc_ops,
>> + .mp_priv = ifq,
>> + };
>> +
>> + if (ifq->if_rxq == -1)
>> + return;
>> +
>> + spin_lock(&ifq->lock);
>> + netdev = ifq->netdev;
>> + netdev_tracker = ifq->netdev_tracker;
>> + ifq->netdev = NULL;
>> + spin_unlock(&ifq->lock);
>> +
>> + if (netdev)
>> + net_mp_close_rxq(netdev, ifq->if_rxq, &p);
>> + ifq->if_rxq = -1;
>> +}
>
> Did you mean to call netdev_put() somewhere here? :S
yeah, it's missed
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue
2025-01-17 2:25 ` Jakub Kicinski
@ 2025-01-17 2:47 ` Pavel Begunkov
2025-01-17 22:11 ` Jakub Kicinski
0 siblings, 1 reply; 43+ messages in thread
From: Pavel Begunkov @ 2025-01-17 2:47 UTC (permalink / raw)
To: Jakub Kicinski, David Wei
Cc: io-uring, netdev, Jens Axboe, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry,
Stanislav Fomichev, Joe Damato, Pedro Tammela
On 1/17/25 02:25, Jakub Kicinski wrote:
> On Thu, 16 Jan 2025 15:16:52 -0800 David Wei wrote:
>> +static void __net_mp_close_rxq(struct net_device *dev, unsigned ifq_idx,
>> + struct pp_memory_provider_params *old_p)
>> +{
>> + struct netdev_rx_queue *rxq;
>> + int ret;
>> +
>> + if (WARN_ON_ONCE(ifq_idx >= dev->real_num_rx_queues))
>> + return;
>> +
>> + rxq = __netif_get_rx_queue(dev, ifq_idx);
>
> I think there's a small race between io_uring closing and the netdev
> unregister. We can try to uninstall twice, let's put
They're gated by checking ifq->netdev in io_uring code, which is
cleared by them under a spin. So either io_uring does
__net_mp_close_rxq() and ->uninstall does nothing, or vise versa.
> /* Callers holding a netdev ref may get here after we already
> * went thru shutdown via dev_memory_provider_uninstall().
> */
> if (dev->reg_state > NETREG_REGISTERED &&
> !rxq->mp_params.mp_ops)
> return;
>
> here, and in dev_memory_provider_uninstall() clear the pointers?
>
>> + if (WARN_ON_ONCE(rxq->mp_params.mp_ops != old_p->mp_ops ||
>> + rxq->mp_params.mp_priv != old_p->mp_priv))
>> + return;
>> +
>> + rxq->mp_params.mp_ops = NULL;
>> + rxq->mp_params.mp_priv = NULL;
>> + ret = netdev_rx_queue_restart(dev, ifq_idx);
>> + if (ret)
>> + pr_devel("Could not restart queue %u after removing memory provider.\n",
>> + ifq_idx);
>> +}
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 00/21] io_uring zero copy rx
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
` (20 preceding siblings ...)
2025-01-16 23:17 ` [PATCH net-next v11 21/21] io_uring/zcrx: add selftest David Wei
@ 2025-01-17 14:28 ` Paolo Abeni
2025-01-17 14:42 ` Pavel Begunkov
21 siblings, 1 reply; 43+ messages in thread
From: Paolo Abeni @ 2025-01-17 14:28 UTC (permalink / raw)
To: David Wei, io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry,
Stanislav Fomichev, Joe Damato, Pedro Tammela
On 1/17/25 12:16 AM, David Wei wrote:
> This patchset adds support for zero copy rx into userspace pages using
> io_uring, eliminating a kernel to user copy.
>
> We configure a page pool that a driver uses to fill a hw rx queue to
> hand out user pages instead of kernel pages. Any data that ends up
> hitting this hw rx queue will thus be dma'd into userspace memory
> directly, without needing to be bounced through kernel memory. 'Reading'
> data out of a socket instead becomes a _notification_ mechanism, where
> the kernel tells userspace where the data is. The overall approach is
> similar to the devmem TCP proposal.
>
> This relies on hw header/data split, flow steering and RSS to ensure
> packet headers remain in kernel memory and only desired flows hit a hw
> rx queue configured for zero copy. Configuring this is outside of the
> scope of this patchset.
>
> We share netdev core infra with devmem TCP. The main difference is that
> io_uring is used for the uAPI and the lifetime of all objects are bound
> to an io_uring instance. Data is 'read' using a new io_uring request
> type. When done, data is returned via a new shared refill queue. A zero
> copy page pool refills a hw rx queue from this refill queue directly. Of
> course, the lifetime of these data buffers are managed by io_uring
> rather than the networking stack, with different refcounting rules.
>
> This patchset is the first step adding basic zero copy support. We will
> extend this iteratively with new features e.g. dynamically allocated
> zero copy areas, THP support, dmabuf support, improved copy fallback,
> general optimisations and more.
>
> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
> aren't included since Taehee Yoo has already sent a more comprehensive
> patchset adding support in [1]. Google gve should already support this,
> and Mellanox mlx5 support is WIP pending driver changes.
>
> ===========
> Performance
> ===========
>
> Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet.
>
> Test setup:
> * AMD EPYC 9454
> * Broadcom BCM957508 200G
> * Kernel v6.11 base [2]
> * liburing fork [3]
> * kperf fork [4]
> * 4K MTU
> * Single TCP flow
>
> With application thread + net rx softirq pinned to _different_ cores:
>
> +-------------------------------+
> | epoll | io_uring |
> |-----------|-------------------|
> | 82.2 Gbps | 116.2 Gbps (+41%) |
> +-------------------------------+
>
> Pinned to _same_ core:
>
> +-------------------------------+
> | epoll | io_uring |
> |-----------|-------------------|
> | 62.6 Gbps | 80.9 Gbps (+29%) |
> +-------------------------------+
>
> =====
> Links
> =====
>
> Broadcom bnxt support:
> [1]: https://lore.kernel.org/netdev/[email protected]/
>
> Linux kernel branch:
> [2]: https://github.com/spikeh/linux.git zcrx/v9
>
> liburing for testing:
> [3]: https://github.com/isilence/liburing.git zcrx/next
>
> kperf for testing:
> [4]: https://git.kernel.dk/kperf.git
We are getting very close to the merge window. In order to get this
series merged before such deadline the point raised by Jakub on this
version must me resolved, the next iteration should land to the ML
before the end of the current working day and the series must apply
cleanly to net-next, so that it can be processed by our CI.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 00/21] io_uring zero copy rx
2025-01-17 14:28 ` [PATCH net-next v11 00/21] io_uring zero copy rx Paolo Abeni
@ 2025-01-17 14:42 ` Pavel Begunkov
2025-01-17 16:05 ` Paolo Abeni
0 siblings, 1 reply; 43+ messages in thread
From: Pavel Begunkov @ 2025-01-17 14:42 UTC (permalink / raw)
To: Paolo Abeni, David Wei, io-uring, netdev
Cc: Jens Axboe, Jakub Kicinski, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry,
Stanislav Fomichev, Joe Damato, Pedro Tammela
On 1/17/25 14:28, Paolo Abeni wrote:
> On 1/17/25 12:16 AM, David Wei wrote:
>> This patchset adds support for zero copy rx into userspace pages using
>> io_uring, eliminating a kernel to user copy.
>>
>> We configure a page pool that a driver uses to fill a hw rx queue to
>> hand out user pages instead of kernel pages. Any data that ends up
>> hitting this hw rx queue will thus be dma'd into userspace memory
>> directly, without needing to be bounced through kernel memory. 'Reading'
>> data out of a socket instead becomes a _notification_ mechanism, where
>> the kernel tells userspace where the data is. The overall approach is
>> similar to the devmem TCP proposal.
>>
>> This relies on hw header/data split, flow steering and RSS to ensure
>> packet headers remain in kernel memory and only desired flows hit a hw
>> rx queue configured for zero copy. Configuring this is outside of the
>> scope of this patchset.
>>
>> We share netdev core infra with devmem TCP. The main difference is that
>> io_uring is used for the uAPI and the lifetime of all objects are bound
>> to an io_uring instance. Data is 'read' using a new io_uring request
>> type. When done, data is returned via a new shared refill queue. A zero
>> copy page pool refills a hw rx queue from this refill queue directly. Of
>> course, the lifetime of these data buffers are managed by io_uring
>> rather than the networking stack, with different refcounting rules.
>>
>> This patchset is the first step adding basic zero copy support. We will
>> extend this iteratively with new features e.g. dynamically allocated
>> zero copy areas, THP support, dmabuf support, improved copy fallback,
>> general optimisations and more.
>>
>> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
>> aren't included since Taehee Yoo has already sent a more comprehensive
>> patchset adding support in [1]. Google gve should already support this,
>> and Mellanox mlx5 support is WIP pending driver changes.
>>
>> ===========
>> Performance
>> ===========
>>
>> Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet.
>>
>> Test setup:
>> * AMD EPYC 9454
>> * Broadcom BCM957508 200G
>> * Kernel v6.11 base [2]
>> * liburing fork [3]
>> * kperf fork [4]
>> * 4K MTU
>> * Single TCP flow
>>
>> With application thread + net rx softirq pinned to _different_ cores:
>>
>> +-------------------------------+
>> | epoll | io_uring |
>> |-----------|-------------------|
>> | 82.2 Gbps | 116.2 Gbps (+41%) |
>> +-------------------------------+
>>
>> Pinned to _same_ core:
>>
>> +-------------------------------+
>> | epoll | io_uring |
>> |-----------|-------------------|
>> | 62.6 Gbps | 80.9 Gbps (+29%) |
>> +-------------------------------+
>>
>> =====
>> Links
>> =====
>>
>> Broadcom bnxt support:
>> [1]: https://lore.kernel.org/netdev/[email protected]/
>>
>> Linux kernel branch:
>> [2]: https://github.com/spikeh/linux.git zcrx/v9
>>
>> liburing for testing:
>> [3]: https://github.com/isilence/liburing.git zcrx/next
>>
>> kperf for testing:
>> [4]: https://git.kernel.dk/kperf.git
>
> We are getting very close to the merge window. In order to get this
> series merged before such deadline the point raised by Jakub on this
> version must me resolved, the next iteration should land to the ML
> before the end of the current working day and the series must apply
> cleanly to net-next, so that it can be processed by our CI.
Sounds good, thanks Paolo.
Since the merging is not trivial, I'll send a PR for the net/
patches instead of reposting the entire thing, if that sounds right
to you. The rest will be handled on the io_uring side.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 00/21] io_uring zero copy rx
2025-01-17 14:42 ` Pavel Begunkov
@ 2025-01-17 16:05 ` Paolo Abeni
2025-01-17 16:19 ` Pavel Begunkov
0 siblings, 1 reply; 43+ messages in thread
From: Paolo Abeni @ 2025-01-17 16:05 UTC (permalink / raw)
To: Pavel Begunkov, Jakub Kicinski
Cc: Jens Axboe, David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela, David Wei, io-uring, netdev
On 1/17/25 3:42 PM, Pavel Begunkov wrote:
> On 1/17/25 14:28, Paolo Abeni wrote:
>> On 1/17/25 12:16 AM, David Wei wrote:
>>> This patchset adds support for zero copy rx into userspace pages using
>>> io_uring, eliminating a kernel to user copy.
>>>
>>> We configure a page pool that a driver uses to fill a hw rx queue to
>>> hand out user pages instead of kernel pages. Any data that ends up
>>> hitting this hw rx queue will thus be dma'd into userspace memory
>>> directly, without needing to be bounced through kernel memory. 'Reading'
>>> data out of a socket instead becomes a _notification_ mechanism, where
>>> the kernel tells userspace where the data is. The overall approach is
>>> similar to the devmem TCP proposal.
>>>
>>> This relies on hw header/data split, flow steering ad RSS to ensure
>>> packet headers remain in kernel memory and only desired flows hit a hw
>>> rx queue configured for zero copy. Configuring this is outside of the
>>> scope of this patchset.
>>>
>>> We share netdev core infra with devmem TCP. The main difference is that
>>> io_uring is used for the uAPI and the lifetime of all objects are bound
>>> to an io_uring instance. Data is 'read' using a new io_uring request
>>> type. When done, data is returned via a new shared refill queue. A zero
>>> copy page pool refills a hw rx queue from this refill queue directly. Of
>>> course, the lifetime of these data buffers are managed by io_uring
>>> rather than the networking stack, with different refcounting rules.
>>>
>>> This patchset is the first step adding basic zero copy support. We will
>>> extend this iteratively with new features e.g. dynamically allocated
>>> zero copy areas, THP support, dmabuf support, improved copy fallback,
>>> general optimisations and more.
>>>
>>> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
>>> aren't included since Taehee Yoo has already sent a more comprehensive
>>> patchset adding support in [1]. Google gve should already support this,
>>> and Mellanox mlx5 support is WIP pending driver changes.
>>>
>>> ===========
>>> Performance
>>> ===========
>>>
>>> Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet.
>>>
>>> Test setup:
>>> * AMD EPYC 9454
>>> * Broadcom BCM957508 200G
>>> * Kernel v6.11 base [2]
>>> * liburing fork [3]
>>> * kperf fork [4]
>>> * 4K MTU
>>> * Single TCP flow
>>>
>>> With application thread + net rx softirq pinned to _different_ cores:
>>>
>>> +-------------------------------+
>>> | epoll | io_uring |
>>> |-----------|-------------------|
>>> | 82.2 Gbps | 116.2 Gbps (+41%) |
>>> +-------------------------------+
>>>
>>> Pinned to _same_ core:
>>>
>>> +-------------------------------+
>>> | epoll | io_uring |
>>> |-----------|-------------------|
>>> | 62.6 Gbps | 80.9 Gbps (+29%) |
>>> +-------------------------------+
>>>
>>> =====
>>> Links
>>> =====
>>>
>>> Broadcom bnxt support:
>>> [1]: https://lore.kernel.org/netdev/[email protected]/
>>>
>>> Linux kernel branch:
>>> [2]: https://github.com/spikeh/linux.git zcrx/v9
>>>
>>> liburing for testing:
>>> [3]: https://github.com/isilence/liburing.git zcrx/next
>>>
>>> kperf for testing:
>>> [4]: https://git.kernel.dk/kperf.git
>>
>> We are getting very close to the merge window. In order to get this
>> series merged before such deadline the point raised by Jakub on this
>> version must me resolved, the next iteration should land to the ML
>> before the end of the current working day and the series must apply
>> cleanly to net-next, so that it can be processed by our CI.
>
> Sounds good, thanks Paolo.
>
> Since the merging is not trivial, I'll send a PR for the net/
> patches instead of reposting the entire thing, if that sounds right
> to you. The rest will be handled on the io_uring side.
I agree it is the more straight-forward path. @Jakub: do you see any
problem with the above?
/P
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 00/21] io_uring zero copy rx
2025-01-17 16:05 ` Paolo Abeni
@ 2025-01-17 16:19 ` Pavel Begunkov
0 siblings, 0 replies; 43+ messages in thread
From: Pavel Begunkov @ 2025-01-17 16:19 UTC (permalink / raw)
To: Paolo Abeni, Jakub Kicinski
Cc: Jens Axboe, David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
Pedro Tammela, David Wei, io-uring, netdev
On 1/17/25 16:05, Paolo Abeni wrote:
> On 1/17/25 3:42 PM, Pavel Begunkov wrote:
>> On 1/17/25 14:28, Paolo Abeni wrote:
>>> On 1/17/25 12:16 AM, David Wei wrote:
>>>> This patchset adds support for zero copy rx into userspace pages using
>>>> io_uring, eliminating a kernel to user copy.
>>>>
...
>>>> Broadcom bnxt support:
>>>> [1]: https://lore.kernel.org/netdev/[email protected]/
>>>>
>>>> Linux kernel branch:
>>>> [2]: https://github.com/spikeh/linux.git zcrx/v9
>>>>
>>>> liburing for testing:
>>>> [3]: https://github.com/isilence/liburing.git zcrx/next
>>>>
>>>> kperf for testing:
>>>> [4]: https://git.kernel.dk/kperf.git
>>>
>>> We are getting very close to the merge window. In order to get this
>>> series merged before such deadline the point raised by Jakub on this
>>> version must me resolved, the next iteration should land to the ML
>>> before the end of the current working day and the series must apply
>>> cleanly to net-next, so that it can be processed by our CI.
>>
>> Sounds good, thanks Paolo.
>>
>> Since the merging is not trivial, I'll send a PR for the net/
>> patches instead of reposting the entire thing, if that sounds right
>> to you. The rest will be handled on the io_uring side.
>
> I agree it is the more straight-forward path. @Jakub: do you see any
> problem with the above?
I couldn't cleanly base it onto linus' branch, so after talking it
over with Jens, I sent it as a normal patchset. Jens will help to
handle the rest if the net patches are merged.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue
2025-01-17 2:47 ` Pavel Begunkov
@ 2025-01-17 22:11 ` Jakub Kicinski
2025-01-17 23:20 ` Pavel Begunkov
0 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-01-17 22:11 UTC (permalink / raw)
To: Pavel Begunkov, Mina Almasry
Cc: David Wei, io-uring, netdev, Jens Axboe, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Stanislav Fomichev, Joe Damato, Pedro Tammela
On Fri, 17 Jan 2025 02:47:15 +0000 Pavel Begunkov wrote:
> >> + rxq = __netif_get_rx_queue(dev, ifq_idx);
> >
> > I think there's a small race between io_uring closing and the netdev
> > unregister. We can try to uninstall twice, let's put
>
> They're gated by checking ifq->netdev in io_uring code, which is
> cleared by them under a spin. So either io_uring does
> __net_mp_close_rxq() and ->uninstall does nothing, or vise versa.
True, so not twice, but the race is there. It's not correct to call
ops of a device which has already been unregistered.
Mina, did we consider that the device may be closed when the provider
is being bound? Perhaps that's what you meant when you were reviewing
the netdevsim patches!
Do we need something like this?
---->8------------
From: Jakub Kicinski <[email protected]>
Subject: net: devmem: don't call queue stop / start when the interface is down
We seem to be missing a netif_running() check from the devmem
installation path. Starting a queue on a stopped device makes
no sense. We still want to be able to allocate the memory, just
to test that the device is indeed setting up the page pools
in a memory provider compatible way.
Fixes: 7c88f86576f3 ("netdev: add netdev_rx_queue_restart()")
Signed-off-by: Jakub Kicinski <[email protected]>
---
include/net/netdev_queues.h | 4 ++++
net/core/netdev_rx_queue.c | 16 ++++++++++------
2 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index 5ca019d294ca..9296efeab4c0 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -107,6 +107,10 @@ struct netdev_stat_ops {
*
* @ndo_queue_stop: Stop the RX queue at the specified index. The stopped
* queue's memory is written at the specified address.
+ *
+ * Note that @ndo_queue_mem_alloc and @ndo_queue_mem_free may be called while
+ * the interface is closed. @ndo_queue_start and @ndo_queue_stop will only
+ * be called for an interface which is open.
*/
struct netdev_queue_mgmt_ops {
size_t ndo_queue_mem_size;
diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
index b02b28d2ae44..9b9c2589150a 100644
--- a/net/core/netdev_rx_queue.c
+++ b/net/core/netdev_rx_queue.c
@@ -38,13 +38,17 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
if (err)
goto err_free_new_queue_mem;
- err = qops->ndo_queue_stop(dev, old_mem, rxq_idx);
- if (err)
- goto err_free_new_queue_mem;
+ if (netif_running(dev)) {
+ err = qops->ndo_queue_stop(dev, old_mem, rxq_idx);
+ if (err)
+ goto err_free_new_queue_mem;
- err = qops->ndo_queue_start(dev, new_mem, rxq_idx);
- if (err)
- goto err_start_queue;
+ err = qops->ndo_queue_start(dev, new_mem, rxq_idx);
+ if (err)
+ goto err_start_queue;
+ } else {
+ swap(new_mem, old_mem);
+ }
qops->ndo_queue_mem_free(dev, old_mem);
--
2.48.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue
2025-01-17 22:11 ` Jakub Kicinski
@ 2025-01-17 23:20 ` Pavel Begunkov
2025-01-18 2:08 ` Jakub Kicinski
0 siblings, 1 reply; 43+ messages in thread
From: Pavel Begunkov @ 2025-01-17 23:20 UTC (permalink / raw)
To: Jakub Kicinski, Mina Almasry
Cc: David Wei, io-uring, netdev, Jens Axboe, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Stanislav Fomichev, Joe Damato, Pedro Tammela
On 1/17/25 22:11, Jakub Kicinski wrote:
> On Fri, 17 Jan 2025 02:47:15 +0000 Pavel Begunkov wrote:
>>>> + rxq = __netif_get_rx_queue(dev, ifq_idx);
>>>
>>> I think there's a small race between io_uring closing and the netdev
>>> unregister. We can try to uninstall twice, let's put
>>
>> They're gated by checking ifq->netdev in io_uring code, which is
>> cleared by them under a spin. So either io_uring does
>> __net_mp_close_rxq() and ->uninstall does nothing, or vise versa.
>
> True, so not twice, but the race is there. It's not correct to call
> ops of a device which has already been unregistered.
Ok, from what you're saying it's regardless of the netdev still
having refs lingering. In this case it was better a version ago
where io_uring was just taking the rtnl lock, which protects
against concurrent unregistration while io_uring is checking
netdev.
Does your patch below covers that? Or does it have to be resolved
in this set? I assume you're going to queue it as a fix.
> Mina, did we consider that the device may be closed when the provider
> is being bound? Perhaps that's what you meant when you were reviewing
> the netdevsim patches!
>
> Do we need something like this?
>
> ---->8------------
>
> From: Jakub Kicinski <[email protected]>
> Subject: net: devmem: don't call queue stop / start when the interface is down
>
> We seem to be missing a netif_running() check from the devmem
> installation path. Starting a queue on a stopped device makes
> no sense. We still want to be able to allocate the memory, just
> to test that the device is indeed setting up the page pools
> in a memory provider compatible way.
>
> Fixes: 7c88f86576f3 ("netdev: add netdev_rx_queue_restart()")
> Signed-off-by: Jakub Kicinski <[email protected]>
> ---
> include/net/netdev_queues.h | 4 ++++
> net/core/netdev_rx_queue.c | 16 ++++++++++------
> 2 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index 5ca019d294ca..9296efeab4c0 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -107,6 +107,10 @@ struct netdev_stat_ops {
> *
> * @ndo_queue_stop: Stop the RX queue at the specified index. The stopped
> * queue's memory is written at the specified address.
> + *
> + * Note that @ndo_queue_mem_alloc and @ndo_queue_mem_free may be called while
> + * the interface is closed. @ndo_queue_start and @ndo_queue_stop will only
> + * be called for an interface which is open.
> */
> struct netdev_queue_mgmt_ops {
> size_t ndo_queue_mem_size;
> diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c
> index b02b28d2ae44..9b9c2589150a 100644
> --- a/net/core/netdev_rx_queue.c
> +++ b/net/core/netdev_rx_queue.c
> @@ -38,13 +38,17 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx)
> if (err)
> goto err_free_new_queue_mem;
>
> - err = qops->ndo_queue_stop(dev, old_mem, rxq_idx);
> - if (err)
> - goto err_free_new_queue_mem;
> + if (netif_running(dev)) {
> + err = qops->ndo_queue_stop(dev, old_mem, rxq_idx);
> + if (err)
> + goto err_free_new_queue_mem;
>
> - err = qops->ndo_queue_start(dev, new_mem, rxq_idx);
> - if (err)
> - goto err_start_queue;
> + err = qops->ndo_queue_start(dev, new_mem, rxq_idx);
> + if (err)
> + goto err_start_queue;
> + } else {
> + swap(new_mem, old_mem);
> + }
>
> qops->ndo_queue_mem_free(dev, old_mem);
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue
2025-01-17 23:20 ` Pavel Begunkov
@ 2025-01-18 2:08 ` Jakub Kicinski
2025-01-18 3:09 ` Pavel Begunkov
0 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2025-01-18 2:08 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Mina Almasry, David Wei, io-uring, netdev, Jens Axboe,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
Joe Damato, Pedro Tammela
On Fri, 17 Jan 2025 23:20:58 +0000 Pavel Begunkov wrote:
> > True, so not twice, but the race is there. It's not correct to call
> > ops of a device which has already been unregistered.
>
> Ok, from what you're saying it's regardless of the netdev still
> having refs lingering. In this case it was better a version ago
> where io_uring was just taking the rtnl lock, which protects
> against concurrent unregistration while io_uring is checking
> netdev.
Yes, v9 didn't have this race, it just didn't release the netdev ref
correctly. Plus we plan to lift the rtnl_lock requirement on this API
in 6.14, so the locking details best live under net/
The change I suggested to earlier should be fine.
- If uninstall path wins it will clear and put the netdev under the
spin lock and the close path will do nothing.
- If the close path grabs the netdev pointer the uninstall path will
do nothing in io_uring, just clear the pointers in net/ side. Then
the close path will grab the lock in net_mp_open_rxq() see the netdev
as unregistered, return early, put the ref.
Did I miss something?
> Does your patch below covers that?
My patch below is semi-related. It prevents us from calling
the start/stop but we shouldn't call the alloc/free either,
after we enter unregister. IOW this patch fixes devmem but
io_uring needs more. I think the "entry point" such as
net_mp_close_rxq() is appropriate for netdev state validation,
rather then the bowels of netdev_rx_queue_restart().
> Or does it have to be resolved in this set? I assume you're going to
> queue it as a fix.
Yes, just hoping to hear from Mina first.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue
2025-01-18 2:08 ` Jakub Kicinski
@ 2025-01-18 3:09 ` Pavel Begunkov
0 siblings, 0 replies; 43+ messages in thread
From: Pavel Begunkov @ 2025-01-18 3:09 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Mina Almasry, David Wei, io-uring, netdev, Jens Axboe,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
Joe Damato, Pedro Tammela
On 1/18/25 02:08, Jakub Kicinski wrote:
> On Fri, 17 Jan 2025 23:20:58 +0000 Pavel Begunkov wrote:
>>> True, so not twice, but the race is there. It's not correct to call
>>> ops of a device which has already been unregistered.
>>
>> Ok, from what you're saying it's regardless of the netdev still
>> having refs lingering. In this case it was better a version ago
>> where io_uring was just taking the rtnl lock, which protects
>> against concurrent unregistration while io_uring is checking
>> netdev.
>
> Yes, v9 didn't have this race, it just didn't release the netdev ref
> correctly. Plus we plan to lift the rtnl_lock requirement on this API
> in 6.14, so the locking details best live under net/
>
> The change I suggested to earlier should be fine.
>
> - If uninstall path wins it will clear and put the netdev under the
> spin lock and the close path will do nothing.
>
> - If the close path grabs the netdev pointer the uninstall path will
> do nothing in io_uring, just clear the pointers in net/ side. Then
> the close path will grab the lock in net_mp_open_rxq() see the netdev
> as unregistered, return early, put the ref.
>
> Did I miss something?
That should work, but it's also a house of cards comparing to the
alternative, that netdev trickery with bunch of sync around is a
direct product of that. It absolutely will fail at some point.
I'll put it in, I don't care anymore.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2025-01-18 3:08 UTC | newest]
Thread overview: 43+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-16 23:16 [PATCH net-next v11 00/21] io_uring zero copy rx David Wei
2025-01-16 23:16 ` [PATCH net-next v11 01/21] net: page_pool: don't cast mp param to devmem David Wei
2025-01-16 23:16 ` [PATCH net-next v11 02/21] net: prefix devmem specific helpers David Wei
2025-01-16 23:16 ` [PATCH net-next v11 03/21] net: generalise net_iov chunk owners David Wei
2025-01-16 23:16 ` [PATCH net-next v11 04/21] net: page_pool: create hooks for custom memory providers David Wei
2025-01-17 1:46 ` Jakub Kicinski
2025-01-17 1:48 ` Jakub Kicinski
2025-01-17 2:15 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 05/21] netdev: add io_uring memory provider info David Wei
2025-01-16 23:16 ` [PATCH net-next v11 06/21] net: page_pool: add callback for mp info printing David Wei
2025-01-16 23:16 ` [PATCH net-next v11 07/21] net: page_pool: add a mp hook to unregister_netdevice* David Wei
2025-01-16 23:16 ` [PATCH net-next v11 08/21] net: prepare for non devmem TCP memory providers David Wei
2025-01-16 23:16 ` [PATCH net-next v11 09/21] net: page_pool: add memory provider helpers David Wei
2025-01-17 1:47 ` Jakub Kicinski
2025-01-16 23:16 ` [PATCH net-next v11 10/21] net: add helpers for setting a memory provider on an rx queue David Wei
2025-01-17 1:50 ` Jakub Kicinski
2025-01-17 1:52 ` Jakub Kicinski
2025-01-17 2:17 ` David Wei
2025-01-17 2:25 ` Jakub Kicinski
2025-01-17 2:47 ` Pavel Begunkov
2025-01-17 22:11 ` Jakub Kicinski
2025-01-17 23:20 ` Pavel Begunkov
2025-01-18 2:08 ` Jakub Kicinski
2025-01-18 3:09 ` Pavel Begunkov
2025-01-16 23:16 ` [PATCH net-next v11 11/21] io_uring/zcrx: add interface queue and refill queue David Wei
2025-01-16 23:16 ` [PATCH net-next v11 12/21] io_uring/zcrx: add io_zcrx_area David Wei
2025-01-16 23:16 ` [PATCH net-next v11 13/21] io_uring/zcrx: grab a net device David Wei
2025-01-16 23:16 ` [PATCH net-next v11 14/21] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
2025-01-17 2:07 ` Jakub Kicinski
2025-01-17 2:17 ` David Wei
2025-01-16 23:16 ` [PATCH net-next v11 15/21] io_uring/zcrx: dma-map area for the device David Wei
2025-01-16 23:16 ` [PATCH net-next v11 16/21] io_uring/zcrx: add io_recvzc request David Wei
2025-01-16 23:16 ` [PATCH net-next v11 17/21] io_uring/zcrx: set pp memory provider for an rx queue David Wei
2025-01-17 2:13 ` Jakub Kicinski
2025-01-17 2:38 ` Pavel Begunkov
2025-01-16 23:17 ` [PATCH net-next v11 18/21] io_uring/zcrx: throttle receive requests David Wei
2025-01-16 23:17 ` [PATCH net-next v11 19/21] io_uring/zcrx: add copy fallback David Wei
2025-01-16 23:17 ` [PATCH net-next v11 20/21] net: add documentation for io_uring zcrx David Wei
2025-01-16 23:17 ` [PATCH net-next v11 21/21] io_uring/zcrx: add selftest David Wei
2025-01-17 14:28 ` [PATCH net-next v11 00/21] io_uring zero copy rx Paolo Abeni
2025-01-17 14:42 ` Pavel Begunkov
2025-01-17 16:05 ` Paolo Abeni
2025-01-17 16:19 ` Pavel Begunkov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox