* [PATCH v1 00/15] io_uring zero copy rx
@ 2024-10-07 22:15 David Wei
2024-10-07 22:15 ` [PATCH v1 01/15] net: devmem: pull struct definitions out of ifdef David Wei
` (18 more replies)
0 siblings, 19 replies; 124+ messages in thread
From: David Wei @ 2024-10-07 22:15 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
This patchset adds support for zero copy rx into userspace pages using
io_uring, eliminating a kernel to user copy.
We configure a page pool that a driver uses to fill a hw rx queue to
hand out user pages instead of kernel pages. Any data that ends up
hitting this hw rx queue will thus be dma'd into userspace memory
directly, without needing to be bounced through kernel memory. 'Reading'
data out of a socket instead becomes a _notification_ mechanism, where
the kernel tells userspace where the data is. The overall approach is
similar to the devmem TCP proposal.
This relies on hw header/data split, flow steering and RSS to ensure
packet headers remain in kernel memory and only desired flows hit a hw
rx queue configured for zero copy. Configuring this is outside of the
scope of this patchset.
We share netdev core infra with devmem TCP. The main difference is that
io_uring is used for the uAPI and the lifetime of all objects is bound
to an io_uring instance. Data is 'read' using a new io_uring request
type. When done, data is returned via a new shared refill queue. A zero
copy page pool refills a hw rx queue from this refill queue directly. Of
course, the lifetime of these data buffers is managed by io_uring
rather than the networking stack, with different refcounting rules.
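To make the flow concrete, here is a rough sketch of how userspace might
pick a payload out of a 32-byte CQE with the uAPI added later in this
series. Taking the byte count from the ordinary cqe->res field and having
the registered area mapped at area_base are assumptions for illustration:

#include <linux/io_uring.h>	/* uAPI headers from this series */

/* illustrative helper: return a pointer to the payload a zcrx CQE describes */
static void *zcrx_cqe_data(const struct io_uring_cqe *cqe, void *area_base,
			   unsigned int *len)
{
	/* the upper 16 bytes of the big CQE hold the zcrx descriptor */
	const struct io_uring_zcrx_cqe *zcqe =
		(const struct io_uring_zcrx_cqe *)(cqe + 1);
	/* strip the area id encoded in the top bits of the offset */
	__u64 area_off = zcqe->off & ~IORING_ZCRX_AREA_MASK;

	*len = cqe->res;	/* assumption: length reported in the ordinary CQE */
	return (char *)area_base + area_off;
}

After consuming the data, the buffer is handed back to the kernel through
the refill queue (see patch 9).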
This patchset is the first step adding basic zero copy support. We will
extend this iteratively with new features e.g. dynamically allocated
zero copy areas, THP support, dmabuf support, improved copy fallback,
general optimisations and more.
In terms of netdev support, we're first targeting Broadcom bnxt. Patches
aren't included since Taehee Yoo has already sent a more comprehensive
patchset adding support in [1]. Google gve should already support this,
and Mellanox mlx5 support is WIP pending driver changes.
===========
Performance
===========
Test setup:
* AMD EPYC 9454
* Broadcom BCM957508 200G
* Kernel v6.11 base [2]
* liburing fork [3]
* kperf fork [4]
* 4K MTU
* Single TCP flow
With application thread + net rx softirq pinned to _different_ cores:

  epoll:    82.2 Gbps
  io_uring: 116.2 Gbps (+41%)

Pinned to _same_ core:

  epoll:    62.6 Gbps
  io_uring: 80.9 Gbps (+29%)
==============
Patch overview
==============
Networking folks would be mostly interested in patches 1-8, 11 and 14.
Patches 1-2 clean up net_iov and devmem, then patches 3-8 make changes
to netdev to suit our needs.
Patch 11 implements struct memory_provider_ops for io_uring, and patch 14
passes it all to netdev via the queue API.
io_uring folks would be mostly interested in patches 9-15:
* Initial registration that sets up a hw rx queue.
* Shared ringbuf for userspace to return buffers.
* New request type for doing zero copy rx reads.
=====
Links
=====
Broadcom bnxt support:
[1]: https://lore.kernel.org/netdev/[email protected]/
Linux kernel branch:
[2]: https://github.com/isilence/linux.git zcrx/v5
liburing for testing:
[3]: https://github.com/spikeh/liburing/tree/zcrx/next
kperf for testing:
[4]: https://github.com/spikeh/kperf/tree/zcrx/next
Changes in v1:
--------------
* Rebase on top of merged net_iov + netmem infra.
* Decouple net_iov from devmem TCP.
* Use netdev queue API to allocate an rx queue.
* Minor uAPI enhancements for future extensibility.
* QoS improvements with request throttling.
Changes in RFC v4:
------------------
* Rebased on top of Mina Almasry's TCP devmem patchset and latest
net-next, now sharing common infra e.g.:
* netmem_t and net_iovs
* Page pool memory provider
* The registered buffer (rbuf) completion queue where completions from
io_recvzc requests are posted is removed. Now these post into the main
completion queue, using big (32-byte) CQEs. The first 16 bytes are an
ordinary CQE, while the latter 16 bytes contain the io_uring_rbuf_cqe
as before. This vastly simplifies the uAPI and removes a level of
indirection in userspace when looking for payloads.
* The rbuf refill queue is still needed for userspace to return
buffers to kernel.
* Simplified code and uAPI on the io_uring side, particularly
io_recvzc() and io_zc_rx_recv(). Many unnecessary lines were removed
e.g. extra msg flags, readlen, etc.
Changes in RFC v3:
------------------
* Rebased on top of Jakub Kicinski's memory provider API RFC. The ZC
pool added is now a backend for memory provider.
* We're also reusing the ppiov infrastructure. The refcounting rules stay
the same but are shifted into ppiov->refcount. That lets us flexibly
manage buffer lifetimes without adding any extra code to the
common networking paths. It'd also make it easier to support dmabufs
and device memory in the future.
* io_uring also knows about pages, so ppiovs might unnecessarily
break tools inspecting data, but that can easily be solved later.
Many patches are not for upstream as they depend on work in progress,
namely from Mina:
* struct netmem_t
* Driver ndo commands for Rx queue configs
* struct page_pool_iov and shared pp infra
Changes in RFC v2:
------------------
* Added copy fallback support if userspace memory allocated for ZC Rx
runs out, or if header splitting or flow steering fails.
* Added veth support for ZC Rx, for testing and demonstration. We will
need to figure out what driver would be best for such testing
functionality in the future. Perhaps netdevsim?
* Added socket registration API to io_uring to associate specific
sockets with ifqs/Rx queues for ZC.
* Added multi-socket support, such that multiple connections can be
steered into the same hardware Rx queue.
* Added Netbench server/client support.
David Wei (5):
net: page pool: add helper creating area from pages
io_uring/zcrx: add interface queue and refill queue
io_uring/zcrx: add io_zcrx_area
io_uring/zcrx: add io_recvzc request
io_uring/zcrx: set pp memory provider for an rx queue
Jakub Kicinski (1):
net: page_pool: create hooks for custom page providers
Pavel Begunkov (9):
net: devmem: pull struct definitions out of ifdef
net: prefix devmem specific helpers
net: generalise net_iov chunk owners
net: prepare for non devmem TCP memory providers
net: page_pool: add ->scrub mem provider callback
net: add helper executing custom callback from napi
io_uring/zcrx: implement zerocopy receive pp memory provider
io_uring/zcrx: add copy fallback
io_uring/zcrx: throttle receive requests
include/linux/io_uring/net.h | 5 +
include/linux/io_uring_types.h | 3 +
include/net/busy_poll.h | 6 +
include/net/netmem.h | 21 +-
include/net/page_pool/types.h | 27 ++
include/uapi/linux/io_uring.h | 54 +++
io_uring/Makefile | 1 +
io_uring/io_uring.c | 7 +
io_uring/io_uring.h | 10 +
io_uring/memmap.c | 8 +
io_uring/net.c | 81 ++++
io_uring/opdef.c | 16 +
io_uring/register.c | 7 +
io_uring/rsrc.c | 2 +-
io_uring/rsrc.h | 1 +
io_uring/zcrx.c | 847 +++++++++++++++++++++++++++++++++
io_uring/zcrx.h | 74 +++
net/core/dev.c | 53 +++
net/core/devmem.c | 44 +-
net/core/devmem.h | 71 ++-
net/core/page_pool.c | 81 +++-
net/core/page_pool_user.c | 15 +-
net/ipv4/tcp.c | 8 +-
23 files changed, 1364 insertions(+), 78 deletions(-)
create mode 100644 io_uring/zcrx.c
create mode 100644 io_uring/zcrx.h
--
2.43.5
* [PATCH v1 01/15] net: devmem: pull struct definitions out of ifdef
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
@ 2024-10-07 22:15 ` David Wei
2024-10-09 20:17 ` Mina Almasry
2024-10-07 22:15 ` [PATCH v1 02/15] net: prefix devmem specific helpers David Wei
` (17 subsequent siblings)
18 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-07 22:15 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
Don't hide structure definitions under conditional compilation, it only
makes the code messier and harder to maintain. Move the struct
dmabuf_genpool_chunk_owner definition out of the CONFIG_NET_DEVMEM ifdef
together with a bunch of trivial inlined helpers using the structure.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
net/core/devmem.h | 44 +++++++++++++++++---------------------------
1 file changed, 17 insertions(+), 27 deletions(-)
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 76099ef9c482..cf66e53b358f 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -44,7 +44,6 @@ struct net_devmem_dmabuf_binding {
u32 id;
};
-#if defined(CONFIG_NET_DEVMEM)
/* Owner of the dma-buf chunks inserted into the gen pool. Each scatterlist
* entry from the dmabuf is inserted into the genpool as a chunk, and needs
* this owner struct to keep track of some metadata necessary to create
@@ -64,16 +63,6 @@ struct dmabuf_genpool_chunk_owner {
struct net_devmem_dmabuf_binding *binding;
};
-void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding);
-struct net_devmem_dmabuf_binding *
-net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
- struct netlink_ext_ack *extack);
-void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
-int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
- struct net_devmem_dmabuf_binding *binding,
- struct netlink_ext_ack *extack);
-void dev_dmabuf_uninstall(struct net_device *dev);
-
static inline struct dmabuf_genpool_chunk_owner *
net_iov_owner(const struct net_iov *niov)
{
@@ -91,6 +80,11 @@ net_iov_binding(const struct net_iov *niov)
return net_iov_owner(niov)->binding;
}
+static inline u32 net_iov_binding_id(const struct net_iov *niov)
+{
+ return net_iov_owner(niov)->binding->id;
+}
+
static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
{
struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
@@ -99,10 +93,18 @@ static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
((unsigned long)net_iov_idx(niov) << PAGE_SHIFT);
}
-static inline u32 net_iov_binding_id(const struct net_iov *niov)
-{
- return net_iov_owner(niov)->binding->id;
-}
+#if defined(CONFIG_NET_DEVMEM)
+
+void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding);
+struct net_devmem_dmabuf_binding *
+net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
+ struct netlink_ext_ack *extack);
+void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
+int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
+ struct net_devmem_dmabuf_binding *binding,
+ struct netlink_ext_ack *extack);
+void dev_dmabuf_uninstall(struct net_device *dev);
+
static inline void
net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding)
@@ -124,8 +126,6 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding);
void net_devmem_free_dmabuf(struct net_iov *ppiov);
#else
-struct net_devmem_dmabuf_binding;
-
static inline void
__net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
{
@@ -165,16 +165,6 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
static inline void net_devmem_free_dmabuf(struct net_iov *ppiov)
{
}
-
-static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
-{
- return 0;
-}
-
-static inline u32 net_iov_binding_id(const struct net_iov *niov)
-{
- return 0;
-}
#endif
#endif /* _NET_DEVMEM_H */
--
2.43.5
* [PATCH v1 02/15] net: prefix devmem specific helpers
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
2024-10-07 22:15 ` [PATCH v1 01/15] net: devmem: pull struct definitions out of ifdef David Wei
@ 2024-10-07 22:15 ` David Wei
2024-10-09 20:19 ` Mina Almasry
2024-10-07 22:15 ` [PATCH v1 03/15] net: generalise net_iov chunk owners David Wei
` (16 subsequent siblings)
18 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-07 22:15 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
Add prefixes to all helpers that are specific to devmem TCP, i.e.
net_iov_binding[_id].
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
net/core/devmem.c | 2 +-
net/core/devmem.h | 6 +++---
net/ipv4/tcp.c | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 11b91c12ee11..858982858f81 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -93,7 +93,7 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
void net_devmem_free_dmabuf(struct net_iov *niov)
{
- struct net_devmem_dmabuf_binding *binding = net_iov_binding(niov);
+ struct net_devmem_dmabuf_binding *binding = net_devmem_iov_binding(niov);
unsigned long dma_addr = net_devmem_get_dma_addr(niov);
if (WARN_ON(!gen_pool_has_addr(binding->chunk_pool, dma_addr,
diff --git a/net/core/devmem.h b/net/core/devmem.h
index cf66e53b358f..80f38fe46930 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -75,14 +75,14 @@ static inline unsigned int net_iov_idx(const struct net_iov *niov)
}
static inline struct net_devmem_dmabuf_binding *
-net_iov_binding(const struct net_iov *niov)
+net_devmem_iov_binding(const struct net_iov *niov)
{
return net_iov_owner(niov)->binding;
}
-static inline u32 net_iov_binding_id(const struct net_iov *niov)
+static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
{
- return net_iov_owner(niov)->binding->id;
+ return net_devmem_iov_binding(niov)->id;
}
static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4f77bd862e95..5feef46426f4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2493,7 +2493,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
/* Will perform the exchange later */
dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
- dmabuf_cmsg.dmabuf_id = net_iov_binding_id(niov);
+ dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
offset += copy;
remaining_len -= copy;
--
2.43.5
* [PATCH v1 03/15] net: generalise net_iov chunk owners
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
2024-10-07 22:15 ` [PATCH v1 01/15] net: devmem: pull struct definitions out of ifdef David Wei
2024-10-07 22:15 ` [PATCH v1 02/15] net: prefix devmem specific helpers David Wei
@ 2024-10-07 22:15 ` David Wei
2024-10-08 15:46 ` Stanislav Fomichev
2024-10-09 20:44 ` Mina Almasry
2024-10-07 22:15 ` [PATCH v1 04/15] net: page_pool: create hooks for custom page providers David Wei
` (15 subsequent siblings)
18 siblings, 2 replies; 124+ messages in thread
From: David Wei @ 2024-10-07 22:15 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
which serves as a useful abstraction to share data and provide a
context. However, it's too devmem specific, and we want to reuse it for
other memory providers, and for that we need to decouple net_iov from
devmem. Make net_iov point to a new base structure called
net_iov_area, which dmabuf_genpool_chunk_owner extends.
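To illustrate the intended layering, a future provider would embed the
base struct and recover its own type with container_of(), e.g. (the foo_*
names below are hypothetical):

#include <linux/container_of.h>
#include <net/netmem.h>

/* hypothetical provider-specific owner built on top of net_iov_area */
struct foo_area {
	struct net_iov_area nia;	/* niovs, num_niovs, base_virtual */
	void *foo_private;		/* provider-specific state */
};

static inline struct foo_area *foo_area_from_niov(const struct net_iov *niov)
{
	struct net_iov_area *owner = net_iov_owner(niov);

	return container_of(owner, struct foo_area, nia);
}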
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/netmem.h | 21 ++++++++++++++++++++-
net/core/devmem.c | 25 +++++++++++++------------
net/core/devmem.h | 25 +++++++++----------------
3 files changed, 42 insertions(+), 29 deletions(-)
diff --git a/include/net/netmem.h b/include/net/netmem.h
index 8a6e20be4b9d..3795ded30d2c 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -24,11 +24,20 @@ struct net_iov {
unsigned long __unused_padding;
unsigned long pp_magic;
struct page_pool *pp;
- struct dmabuf_genpool_chunk_owner *owner;
+ struct net_iov_area *owner;
unsigned long dma_addr;
atomic_long_t pp_ref_count;
};
+struct net_iov_area {
+ /* Array of net_iovs for this area. */
+ struct net_iov *niovs;
+ size_t num_niovs;
+
+ /* Offset into the dma-buf where this chunk starts. */
+ unsigned long base_virtual;
+};
+
/* These fields in struct page are used by the page_pool and net stack:
*
* struct {
@@ -54,6 +63,16 @@ NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
#undef NET_IOV_ASSERT_OFFSET
+static inline struct net_iov_area *net_iov_owner(const struct net_iov *niov)
+{
+ return niov->owner;
+}
+
+static inline unsigned int net_iov_idx(const struct net_iov *niov)
+{
+ return niov - net_iov_owner(niov)->niovs;
+}
+
/* netmem */
/**
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 858982858f81..5c10cf0e2a18 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -32,14 +32,15 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
{
struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
- kvfree(owner->niovs);
+ kvfree(owner->area.niovs);
kfree(owner);
}
static dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)
{
- struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
+ struct dmabuf_genpool_chunk_owner *owner;
+ owner = net_devmem_iov_to_chunk_owner(niov);
return owner->base_dma_addr +
((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT);
}
@@ -82,7 +83,7 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
offset = dma_addr - owner->base_dma_addr;
index = offset / PAGE_SIZE;
- niov = &owner->niovs[index];
+ niov = &owner->area.niovs[index];
niov->pp_magic = 0;
niov->pp = NULL;
@@ -250,9 +251,9 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
goto err_free_chunks;
}
- owner->base_virtual = virtual;
+ owner->area.base_virtual = virtual;
owner->base_dma_addr = dma_addr;
- owner->num_niovs = len / PAGE_SIZE;
+ owner->area.num_niovs = len / PAGE_SIZE;
owner->binding = binding;
err = gen_pool_add_owner(binding->chunk_pool, dma_addr,
@@ -264,17 +265,17 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
goto err_free_chunks;
}
- owner->niovs = kvmalloc_array(owner->num_niovs,
- sizeof(*owner->niovs),
- GFP_KERNEL);
- if (!owner->niovs) {
+ owner->area.niovs = kvmalloc_array(owner->area.num_niovs,
+ sizeof(*owner->area.niovs),
+ GFP_KERNEL);
+ if (!owner->area.niovs) {
err = -ENOMEM;
goto err_free_chunks;
}
- for (i = 0; i < owner->num_niovs; i++) {
- niov = &owner->niovs[i];
- niov->owner = owner;
+ for (i = 0; i < owner->area.num_niovs; i++) {
+ niov = &owner->area.niovs[i];
+ niov->owner = &owner->area;
page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
net_devmem_get_dma_addr(niov));
}
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 80f38fe46930..12b14377ed3f 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -10,6 +10,8 @@
#ifndef _NET_DEVMEM_H
#define _NET_DEVMEM_H
+#include <net/netmem.h>
+
struct netlink_ext_ack;
struct net_devmem_dmabuf_binding {
@@ -50,34 +52,25 @@ struct net_devmem_dmabuf_binding {
* allocations from this chunk.
*/
struct dmabuf_genpool_chunk_owner {
- /* Offset into the dma-buf where this chunk starts. */
- unsigned long base_virtual;
+ struct net_iov_area area;
+ struct net_devmem_dmabuf_binding *binding;
/* dma_addr of the start of the chunk. */
dma_addr_t base_dma_addr;
-
- /* Array of net_iovs for this chunk. */
- struct net_iov *niovs;
- size_t num_niovs;
-
- struct net_devmem_dmabuf_binding *binding;
};
static inline struct dmabuf_genpool_chunk_owner *
-net_iov_owner(const struct net_iov *niov)
+net_devmem_iov_to_chunk_owner(const struct net_iov *niov)
{
- return niov->owner;
-}
+ struct net_iov_area *owner = net_iov_owner(niov);
-static inline unsigned int net_iov_idx(const struct net_iov *niov)
-{
- return niov - net_iov_owner(niov)->niovs;
+ return container_of(owner, struct dmabuf_genpool_chunk_owner, area);
}
static inline struct net_devmem_dmabuf_binding *
net_devmem_iov_binding(const struct net_iov *niov)
{
- return net_iov_owner(niov)->binding;
+ return net_devmem_iov_to_chunk_owner(niov)->binding;
}
static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
@@ -87,7 +80,7 @@ static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
{
- struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
+ struct net_iov_area *owner = net_iov_owner(niov);
return owner->base_virtual +
((unsigned long)net_iov_idx(niov) << PAGE_SHIFT);
--
2.43.5
* [PATCH v1 04/15] net: page_pool: create hooks for custom page providers
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (2 preceding siblings ...)
2024-10-07 22:15 ` [PATCH v1 03/15] net: generalise net_iov chunk owners David Wei
@ 2024-10-07 22:15 ` David Wei
2024-10-09 20:49 ` Mina Almasry
2024-10-07 22:15 ` [PATCH v1 05/15] net: prepare for non devmem TCP memory providers David Wei
` (14 subsequent siblings)
18 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-07 22:15 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: Jakub Kicinski <[email protected]>
Page providers that try to reuse the same pages will need to hold onto
the ref, even if the page gets released from the pool - releasing the
page from the pp just transfers the "ownership" reference from the pp
to the provider, and the provider will wait for other references to be
gone before feeding this page back into the pool.
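For reference, a skeleton provider plugging into these hooks would look
roughly as follows; everything prefixed foo_ is hypothetical and only
meant to show the shape of the callbacks:

#include <net/page_pool/types.h>

static int foo_mp_init(struct page_pool *pool)
{
	/* pool->mp_priv carries provider state set up at queue bind time */
	return 0;
}

static void foo_mp_destroy(struct page_pool *pool)
{
}

static netmem_ref foo_mp_alloc_netmems(struct page_pool *pool, gfp_t gfp)
{
	/* hand out a provider-owned netmem, or 0 if none are available */
	return 0;
}

static bool foo_mp_release_netmem(struct page_pool *pool, netmem_ref netmem)
{
	/* keep the "ownership" ref: tell the pool not to put the page */
	return false;
}

static const struct memory_provider_ops foo_mp_ops = {
	.init		= foo_mp_init,
	.destroy	= foo_mp_destroy,
	.alloc_netmems	= foo_mp_alloc_netmems,
	.release_netmem	= foo_mp_release_netmem,
};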
Signed-off-by: Jakub Kicinski <[email protected]>
[Pavel] Rebased, renamed callback, +converted devmem
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/page_pool/types.h | 9 +++++++++
net/core/devmem.c | 13 ++++++++++++-
net/core/devmem.h | 2 ++
net/core/page_pool.c | 17 +++++++++--------
4 files changed, 32 insertions(+), 9 deletions(-)
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index c022c410abe3..8a35fe474adb 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -152,8 +152,16 @@ struct page_pool_stats {
*/
#define PAGE_POOL_FRAG_GROUP_ALIGN (4 * sizeof(long))
+struct memory_provider_ops {
+ netmem_ref (*alloc_netmems)(struct page_pool *pool, gfp_t gfp);
+ bool (*release_netmem)(struct page_pool *pool, netmem_ref netmem);
+ int (*init)(struct page_pool *pool);
+ void (*destroy)(struct page_pool *pool);
+};
+
struct pp_memory_provider_params {
void *mp_priv;
+ const struct memory_provider_ops *mp_ops;
};
struct page_pool {
@@ -215,6 +223,7 @@ struct page_pool {
struct ptr_ring ring;
void *mp_priv;
+ const struct memory_provider_ops *mp_ops;
#ifdef CONFIG_PAGE_POOL_STATS
/* recycle stats are per-cpu to avoid locking */
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 5c10cf0e2a18..83d13eb441b6 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -117,6 +117,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
WARN_ON(rxq->mp_params.mp_priv != binding);
rxq->mp_params.mp_priv = NULL;
+ rxq->mp_params.mp_ops = NULL;
rxq_idx = get_netdev_rx_queue_index(rxq);
@@ -142,7 +143,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
}
rxq = __netif_get_rx_queue(dev, rxq_idx);
- if (rxq->mp_params.mp_priv) {
+ if (rxq->mp_params.mp_ops) {
NL_SET_ERR_MSG(extack, "designated queue already memory provider bound");
return -EEXIST;
}
@@ -160,6 +161,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
return err;
rxq->mp_params.mp_priv = binding;
+ rxq->mp_params.mp_ops = &dmabuf_devmem_ops;
err = netdev_rx_queue_restart(dev, rxq_idx);
if (err)
@@ -169,6 +171,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
err_xa_erase:
rxq->mp_params.mp_priv = NULL;
+ rxq->mp_params.mp_ops = NULL;
xa_erase(&binding->bound_rxqs, xa_idx);
return err;
@@ -388,3 +391,11 @@ bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem)
/* We don't want the page pool put_page()ing our net_iovs. */
return false;
}
+
+const struct memory_provider_ops dmabuf_devmem_ops = {
+ .init = mp_dmabuf_devmem_init,
+ .destroy = mp_dmabuf_devmem_destroy,
+ .alloc_netmems = mp_dmabuf_devmem_alloc_netmems,
+ .release_netmem = mp_dmabuf_devmem_release_page,
+};
+EXPORT_SYMBOL(dmabuf_devmem_ops);
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 12b14377ed3f..fbf7ec9a62cb 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -88,6 +88,8 @@ static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
#if defined(CONFIG_NET_DEVMEM)
+extern const struct memory_provider_ops dmabuf_devmem_ops;
+
void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding);
struct net_devmem_dmabuf_binding *
net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a813d30d2135..c21c5b9edc68 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -284,10 +284,11 @@ static int page_pool_init(struct page_pool *pool,
rxq = __netif_get_rx_queue(pool->slow.netdev,
pool->slow.queue_idx);
pool->mp_priv = rxq->mp_params.mp_priv;
+ pool->mp_ops = rxq->mp_params.mp_ops;
}
- if (pool->mp_priv) {
- err = mp_dmabuf_devmem_init(pool);
+ if (pool->mp_ops) {
+ err = pool->mp_ops->init(pool);
if (err) {
pr_warn("%s() mem-provider init failed %d\n", __func__,
err);
@@ -584,8 +585,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp)
return netmem;
/* Slow-path: cache empty, do real allocation */
- if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv)
- netmem = mp_dmabuf_devmem_alloc_netmems(pool, gfp);
+ if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+ netmem = pool->mp_ops->alloc_netmems(pool, gfp);
else
netmem = __page_pool_alloc_pages_slow(pool, gfp);
return netmem;
@@ -676,8 +677,8 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
bool put;
put = true;
- if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv)
- put = mp_dmabuf_devmem_release_page(pool, netmem);
+ if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+ put = pool->mp_ops->release_netmem(pool, netmem);
else
__page_pool_release_page_dma(pool, netmem);
@@ -1010,8 +1011,8 @@ static void __page_pool_destroy(struct page_pool *pool)
page_pool_unlist(pool);
page_pool_uninit(pool);
- if (pool->mp_priv) {
- mp_dmabuf_devmem_destroy(pool);
+ if (pool->mp_ops) {
+ pool->mp_ops->destroy(pool);
static_branch_dec(&page_pool_mem_providers);
}
--
2.43.5
* [PATCH v1 05/15] net: prepare for non devmem TCP memory providers
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (3 preceding siblings ...)
2024-10-07 22:15 ` [PATCH v1 04/15] net: page_pool: create hooks for custom page providers David Wei
@ 2024-10-07 22:15 ` David Wei
2024-10-09 20:56 ` Mina Almasry
2024-10-07 22:15 ` [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback David Wei
` (13 subsequent siblings)
18 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-07 22:15 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
There are a number of places in generic paths that assume the only
page pool memory provider is devmem TCP. As we want to reuse the net_iov
and provider infrastructure, we need to patch them up and explicitly check
the provider type when we branch into devmem TCP code.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
net/core/devmem.c | 4 ++--
net/core/page_pool_user.c | 15 +++++++++------
net/ipv4/tcp.c | 6 ++++++
3 files changed, 17 insertions(+), 8 deletions(-)
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 83d13eb441b6..b0733cf42505 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -314,10 +314,10 @@ void dev_dmabuf_uninstall(struct net_device *dev)
unsigned int i;
for (i = 0; i < dev->real_num_rx_queues; i++) {
- binding = dev->_rx[i].mp_params.mp_priv;
- if (!binding)
+ if (dev->_rx[i].mp_params.mp_ops != &dmabuf_devmem_ops)
continue;
+ binding = dev->_rx[i].mp_params.mp_priv;
xa_for_each(&binding->bound_rxqs, xa_idx, rxq)
if (rxq == &dev->_rx[i]) {
xa_erase(&binding->bound_rxqs, xa_idx);
diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
index 48335766c1bf..0d6cb7fb562c 100644
--- a/net/core/page_pool_user.c
+++ b/net/core/page_pool_user.c
@@ -214,7 +214,7 @@ static int
page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
const struct genl_info *info)
{
- struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
+ struct net_devmem_dmabuf_binding *binding;
size_t inflight, refsz;
void *hdr;
@@ -244,8 +244,11 @@ page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
pool->user.detach_time))
goto err_cancel;
- if (binding && nla_put_u32(rsp, NETDEV_A_PAGE_POOL_DMABUF, binding->id))
- goto err_cancel;
+ if (pool->mp_ops == &dmabuf_devmem_ops) {
+ binding = pool->mp_priv;
+ if (nla_put_u32(rsp, NETDEV_A_PAGE_POOL_DMABUF, binding->id))
+ goto err_cancel;
+ }
genlmsg_end(rsp, hdr);
@@ -353,16 +356,16 @@ void page_pool_unlist(struct page_pool *pool)
int page_pool_check_memory_provider(struct net_device *dev,
struct netdev_rx_queue *rxq)
{
- struct net_devmem_dmabuf_binding *binding = rxq->mp_params.mp_priv;
+ void *mp_priv = rxq->mp_params.mp_priv;
struct page_pool *pool;
struct hlist_node *n;
- if (!binding)
+ if (!mp_priv)
return 0;
mutex_lock(&page_pools_lock);
hlist_for_each_entry_safe(pool, n, &dev->page_pools, user.list) {
- if (pool->mp_priv != binding)
+ if (pool->mp_priv != mp_priv)
continue;
if (pool->slow.queue_idx == get_netdev_rx_queue_index(rxq)) {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5feef46426f4..2140fa1ec9f8 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -277,6 +277,7 @@
#include <net/ip.h>
#include <net/sock.h>
#include <net/rstreason.h>
+#include <net/page_pool/types.h>
#include <linux/uaccess.h>
#include <asm/ioctls.h>
@@ -2475,6 +2476,11 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
}
niov = skb_frag_net_iov(frag);
+ if (niov->pp->mp_ops != &dmabuf_devmem_ops) {
+ err = -ENODEV;
+ goto out;
+ }
+
end = start + skb_frag_size(frag);
copy = end - offset;
--
2.43.5
* [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (4 preceding siblings ...)
2024-10-07 22:15 ` [PATCH v1 05/15] net: prepare for non devmem TCP memory providers David Wei
@ 2024-10-07 22:15 ` David Wei
2024-10-09 21:00 ` Mina Almasry
2024-10-07 22:15 ` [PATCH v1 07/15] net: page pool: add helper creating area from pages David Wei
` (12 subsequent siblings)
18 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-07 22:15 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
The page pool now waits for all ppiovs to return before destroying
itself, and for that to happen the memory provider might need to push
out some buffers, flush caches and so on.

TODO: we'll try to get by without it before the final release
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/page_pool/types.h | 1 +
net/core/page_pool.c | 3 +++
2 files changed, 4 insertions(+)
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 8a35fe474adb..fd0376ad0d26 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -157,6 +157,7 @@ struct memory_provider_ops {
bool (*release_netmem)(struct page_pool *pool, netmem_ref netmem);
int (*init)(struct page_pool *pool);
void (*destroy)(struct page_pool *pool);
+ void (*scrub)(struct page_pool *pool);
};
struct pp_memory_provider_params {
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index c21c5b9edc68..9a675e16e6a4 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1038,6 +1038,9 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
static void page_pool_scrub(struct page_pool *pool)
{
+ if (pool->mp_ops && pool->mp_ops->scrub)
+ pool->mp_ops->scrub(pool);
+
page_pool_empty_alloc_cache_once(pool);
pool->destroy_cnt++;
--
2.43.5
* [PATCH v1 07/15] net: page pool: add helper creating area from pages
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (5 preceding siblings ...)
2024-10-07 22:15 ` [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback David Wei
@ 2024-10-07 22:15 ` David Wei
2024-10-09 21:11 ` Mina Almasry
2024-10-07 22:15 ` [PATCH v1 08/15] net: add helper executing custom callback from napi David Wei
` (11 subsequent siblings)
18 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-07 22:15 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
Add a helper that takes an array of pages and initialises the passed-in
memory provider's area with them, where each net_iov takes one page.
It's also responsible for setting up dma mappings.

We keep it in page_pool.c so as not to leak netmem details to outside
providers like io_uring, which don't have access to netmem_priv.h
and other private helpers.
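As a usage sketch (foo_* is a hypothetical provider; its state is assumed
to be reachable via pool->mp_priv and to hold one pinned page per net_iov):

#include <net/netmem.h>
#include <net/page_pool/types.h>

struct foo_area {
	struct net_iov_area nia;
	struct page **pages;		/* pinned pages backing the niovs */
};

static int foo_mp_init(struct page_pool *pool)
{
	struct foo_area *area = pool->mp_priv;

	/* DMA-map pages[i] and attach it to nia.niovs[i] */
	return page_pool_init_paged_area(pool, &area->nia, area->pages);
}

static void foo_mp_destroy(struct page_pool *pool)
{
	struct foo_area *area = pool->mp_priv;

	page_pool_release_area(pool, &area->nia);
}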
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/page_pool/types.h | 17 ++++++++++
net/core/page_pool.c | 61 +++++++++++++++++++++++++++++++++--
2 files changed, 76 insertions(+), 2 deletions(-)
diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index fd0376ad0d26..1180ad07423c 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -271,6 +271,11 @@ void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
const struct xdp_mem_info *mem);
void page_pool_put_page_bulk(struct page_pool *pool, void **data,
int count);
+
+int page_pool_init_paged_area(struct page_pool *pool,
+ struct net_iov_area *area, struct page **pages);
+void page_pool_release_area(struct page_pool *pool,
+ struct net_iov_area *area);
#else
static inline void page_pool_destroy(struct page_pool *pool)
{
@@ -286,6 +291,18 @@ static inline void page_pool_put_page_bulk(struct page_pool *pool, void **data,
int count)
{
}
+
+static inline int page_pool_init_paged_area(struct page_pool *pool,
+ struct net_iov_area *area,
+ struct page **pages)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void page_pool_release_area(struct page_pool *pool,
+ struct net_iov_area *area)
+{
+}
#endif
void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 9a675e16e6a4..112b6fe4b7ff 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -459,7 +459,8 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
}
-static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
+static bool page_pool_dma_map_page(struct page_pool *pool, netmem_ref netmem,
+ struct page *page)
{
dma_addr_t dma;
@@ -468,7 +469,7 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
* into page private data (i.e 32bit cpu with 64bit DMA caps)
* This mapping is kept for lifetime of page, until leaving pool.
*/
- dma = dma_map_page_attrs(pool->p.dev, netmem_to_page(netmem), 0,
+ dma = dma_map_page_attrs(pool->p.dev, page, 0,
(PAGE_SIZE << pool->p.order), pool->p.dma_dir,
DMA_ATTR_SKIP_CPU_SYNC |
DMA_ATTR_WEAK_ORDERING);
@@ -490,6 +491,11 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
return false;
}
+static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
+{
+ return page_pool_dma_map_page(pool, netmem, netmem_to_page(netmem));
+}
+
static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
gfp_t gfp)
{
@@ -1154,3 +1160,54 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
}
}
EXPORT_SYMBOL(page_pool_update_nid);
+
+static void page_pool_release_page_dma(struct page_pool *pool,
+ netmem_ref netmem)
+{
+ __page_pool_release_page_dma(pool, netmem);
+}
+
+int page_pool_init_paged_area(struct page_pool *pool,
+ struct net_iov_area *area, struct page **pages)
+{
+ struct net_iov *niov;
+ netmem_ref netmem;
+ int i, ret = 0;
+
+ if (!pool->dma_map)
+ return -EOPNOTSUPP;
+
+ for (i = 0; i < area->num_niovs; i++) {
+ niov = &area->niovs[i];
+ netmem = net_iov_to_netmem(niov);
+
+ page_pool_set_pp_info(pool, netmem);
+ if (!page_pool_dma_map_page(pool, netmem, pages[i])) {
+ ret = -EINVAL;
+ goto err_unmap_dma;
+ }
+ }
+ return 0;
+
+err_unmap_dma:
+ while (i--) {
+ netmem = net_iov_to_netmem(&area->niovs[i]);
+ page_pool_release_page_dma(pool, netmem);
+ }
+ return ret;
+}
+
+void page_pool_release_area(struct page_pool *pool,
+ struct net_iov_area *area)
+{
+ int i;
+
+ if (!pool->dma_map)
+ return;
+
+ for (i = 0; i < area->num_niovs; i++) {
+ struct net_iov *niov = &area->niovs[i];
+
+ page_pool_release_page_dma(pool, net_iov_to_netmem(niov));
+ }
+}
--
2.43.5
* [PATCH v1 08/15] net: add helper executing custom callback from napi
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (6 preceding siblings ...)
2024-10-07 22:15 ` [PATCH v1 07/15] net: page pool: add helper creating area from pages David Wei
@ 2024-10-07 22:15 ` David Wei
2024-10-08 22:25 ` Joe Damato
2024-10-07 22:15 ` [PATCH v1 09/15] io_uring/zcrx: add interface queue and refill queue David Wei
` (10 subsequent siblings)
18 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-07 22:15 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
It's useful to have napi-private bits and pieces, like the page pool's fast
allocating cache, so that the hot allocation path doesn't have to do any
additional synchronisation. In the case of the io_uring memory provider
introduced in the following patches, we keep the consumer end of the
io_uring refill queue private to napi as it's a hot path.

However, from time to time we need to synchronise with the napi, for
example to add more user memory or allocate fallback buffers. Add a
helper function napi_execute that allows running a custom callback from
under napi context, so that it can access and modify napi-protected
parts of io_uring. It works similarly to busy polling and stops napi from
running in the meantime, so it's supposed to be a slow control path.
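As a usage sketch (foo_ifq and its fields are hypothetical):

#include <net/busy_poll.h>

struct foo_ifq {
	unsigned int napi_id;
	unsigned int cached_entries;	/* normally touched only by napi */
};

static void foo_flush_cache(void *arg)
{
	struct foo_ifq *ifq = arg;

	/* runs with the napi "scheduled" and bhs disabled, so it may
	 * safely access state otherwise private to the napi poll loop
	 */
	ifq->cached_entries = 0;
}

/* slow control path, e.g. when adding more user memory */
static void foo_resync(struct foo_ifq *ifq)
{
	napi_execute(ifq->napi_id, foo_flush_cache, ifq);
}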
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/net/busy_poll.h | 6 +++++
net/core/dev.c | 53 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index f03040baaefd..3fd9e65731e9 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -47,6 +47,7 @@ bool sk_busy_loop_end(void *p, unsigned long start_time);
void napi_busy_loop(unsigned int napi_id,
bool (*loop_end)(void *, unsigned long),
void *loop_end_arg, bool prefer_busy_poll, u16 budget);
+void napi_execute(unsigned napi_id, void (*cb)(void *), void *cb_arg);
void napi_busy_loop_rcu(unsigned int napi_id,
bool (*loop_end)(void *, unsigned long),
@@ -63,6 +64,11 @@ static inline bool sk_can_busy_loop(struct sock *sk)
return false;
}
+static inline void napi_execute(unsigned napi_id,
+ void (*cb)(void *), void *cb_arg)
+{
+}
+
#endif /* CONFIG_NET_RX_BUSY_POLL */
static inline unsigned long busy_loop_current_time(void)
diff --git a/net/core/dev.c b/net/core/dev.c
index 1e740faf9e78..ba2f43cf5517 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6497,6 +6497,59 @@ void napi_busy_loop(unsigned int napi_id,
}
EXPORT_SYMBOL(napi_busy_loop);
+void napi_execute(unsigned napi_id,
+ void (*cb)(void *), void *cb_arg)
+{
+ struct napi_struct *napi;
+ bool done = false;
+ unsigned long val;
+ void *have_poll_lock = NULL;
+
+ rcu_read_lock();
+
+ napi = napi_by_id(napi_id);
+ if (!napi) {
+ rcu_read_unlock();
+ return;
+ }
+
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_disable();
+ for (;;) {
+ local_bh_disable();
+ val = READ_ONCE(napi->state);
+
+ /* If multiple threads are competing for this napi,
+ * we avoid dirtying napi->state as much as we can.
+ */
+ if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
+ NAPIF_STATE_IN_BUSY_POLL))
+ goto restart;
+
+ if (cmpxchg(&napi->state, val,
+ val | NAPIF_STATE_IN_BUSY_POLL |
+ NAPIF_STATE_SCHED) != val)
+ goto restart;
+
+ have_poll_lock = netpoll_poll_lock(napi);
+ cb(cb_arg);
+ done = true;
+ gro_normal_list(napi);
+ local_bh_enable();
+ break;
+restart:
+ local_bh_enable();
+ if (unlikely(need_resched()))
+ break;
+ cpu_relax();
+ }
+ if (done)
+ busy_poll_stop(napi, have_poll_lock, false, 1);
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+ preempt_enable();
+ rcu_read_unlock();
+}
+
#endif /* CONFIG_NET_RX_BUSY_POLL */
static void napi_hash_add(struct napi_struct *napi)
--
2.43.5
* [PATCH v1 09/15] io_uring/zcrx: add interface queue and refill queue
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (7 preceding siblings ...)
2024-10-07 22:15 ` [PATCH v1 08/15] net: add helper executing custom callback from napi David Wei
@ 2024-10-07 22:15 ` David Wei
2024-10-09 17:50 ` Jens Axboe
2024-10-07 22:15 ` [PATCH v1 10/15] io_uring/zcrx: add io_zcrx_area David Wei
` (9 subsequent siblings)
18 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-07 22:15 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: David Wei <[email protected]>
Add a new object called an interface queue (ifq) that represents a net rx queue
that has been configured for zero copy. Each ifq is registered using a new
registration opcode IORING_REGISTER_ZCRX_IFQ.
The refill queue is allocated by the kernel and mapped by userspace using a new
offset IORING_OFF_RQ_RING, in a similar fashion to the main SQ/CQ. It is used
by userspace to return buffers that it is done with, which are then reused
by the netdev.
The main CQ ring is used to notify userspace of received data by using the
upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each entry
contains the offset + len to the data.
For now, each io_uring instance only has a single ifq.
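For illustration, roughly how userspace could map the refill ring and
publish one entry; off and len are simply carried over from the
corresponding zcrx completion, and the release store mirrors how the SQ
tail is published. Treat the details as a sketch, not part of this patch:

#include <sys/mman.h>
#include <linux/io_uring.h>	/* uAPI from this series */

struct zcrx_rq {
	unsigned int *head;
	unsigned int *tail;
	unsigned int mask;
	struct io_uring_zcrx_rqe *rqes;
};

static int zcrx_rq_map(struct zcrx_rq *rq, int ring_fd,
		       const struct io_uring_zcrx_ifq_reg *reg)
{
	char *ptr = mmap(NULL, reg->offsets.mmap_sz, PROT_READ | PROT_WRITE,
			 MAP_SHARED, ring_fd, IORING_OFF_RQ_RING);

	if (ptr == MAP_FAILED)
		return -1;
	rq->head = (unsigned int *)(ptr + reg->offsets.head);
	rq->tail = (unsigned int *)(ptr + reg->offsets.tail);
	rq->rqes = (struct io_uring_zcrx_rqe *)(ptr + reg->offsets.rqes);
	rq->mask = reg->rq_entries - 1;	/* rq_entries is a power of two */
	return 0;
}

/* hand one buffer back to the kernel */
static void zcrx_rq_push(struct zcrx_rq *rq, __u64 off, __u32 len)
{
	struct io_uring_zcrx_rqe *rqe = &rq->rqes[*rq->tail & rq->mask];

	rqe->off = off;
	rqe->len = len;
	/* make the rqe visible before publishing the new tail */
	__atomic_store_n(rq->tail, *rq->tail + 1, __ATOMIC_RELEASE);
}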
Signed-off-by: David Wei <[email protected]>
---
include/linux/io_uring_types.h | 3 +
include/uapi/linux/io_uring.h | 43 ++++++++++
io_uring/Makefile | 1 +
io_uring/io_uring.c | 7 ++
io_uring/memmap.c | 8 ++
io_uring/register.c | 7 ++
io_uring/zcrx.c | 147 +++++++++++++++++++++++++++++++++
io_uring/zcrx.h | 39 +++++++++
8 files changed, 255 insertions(+)
create mode 100644 io_uring/zcrx.c
create mode 100644 io_uring/zcrx.h
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 3315005df117..ace7ac056d51 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -39,6 +39,8 @@ enum io_uring_cmd_flags {
IO_URING_F_COMPAT = (1 << 12),
};
+struct io_zcrx_ifq;
+
struct io_wq_work_node {
struct io_wq_work_node *next;
};
@@ -372,6 +374,7 @@ struct io_ring_ctx {
struct io_alloc_cache rsrc_node_cache;
struct wait_queue_head rsrc_quiesce_wq;
unsigned rsrc_quiesce;
+ struct io_zcrx_ifq *ifq;
u32 pers_next;
struct xarray personalities;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index adc2524fd8e3..567cdb89711e 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -457,6 +457,8 @@ struct io_uring_cqe {
#define IORING_OFF_PBUF_RING 0x80000000ULL
#define IORING_OFF_PBUF_SHIFT 16
#define IORING_OFF_MMAP_MASK 0xf8000000ULL
+#define IORING_OFF_RQ_RING 0x20000000ULL
+#define IORING_OFF_RQ_SHIFT 16
/*
* Filled with the offset for mmap(2)
@@ -595,6 +597,9 @@ enum io_uring_register_op {
IORING_REGISTER_NAPI = 27,
IORING_UNREGISTER_NAPI = 28,
+ /* register a netdev hw rx queue for zerocopy */
+ IORING_REGISTER_ZCRX_IFQ = 29,
+
/* this goes last */
IORING_REGISTER_LAST,
@@ -802,6 +807,44 @@ enum io_uring_socket_op {
SOCKET_URING_OP_SETSOCKOPT,
};
+/* Zero copy receive refill queue entry */
+struct io_uring_zcrx_rqe {
+ __u64 off;
+ __u32 len;
+ __u32 __pad;
+};
+
+struct io_uring_zcrx_cqe {
+ __u64 off;
+ __u64 __pad;
+};
+
+/* The bit from which area id is encoded into offsets */
+#define IORING_ZCRX_AREA_SHIFT 48
+#define IORING_ZCRX_AREA_MASK (~(((__u64)1 << IORING_ZCRX_AREA_SHIFT) - 1))
+
+struct io_uring_zcrx_offsets {
+ __u32 head;
+ __u32 tail;
+ __u32 rqes;
+ __u32 mmap_sz;
+ __u64 __resv[2];
+};
+
+/*
+ * Argument for IORING_REGISTER_ZCRX_IFQ
+ */
+struct io_uring_zcrx_ifq_reg {
+ __u32 if_idx;
+ __u32 if_rxq;
+ __u32 rq_entries;
+ __u32 flags;
+
+ __u64 area_ptr; /* pointer to struct io_uring_zcrx_area_reg */
+ struct io_uring_zcrx_offsets offsets;
+ __u64 __resv[3];
+};
+
#ifdef __cplusplus
}
#endif
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 61923e11c767..1a1184f3946a 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_IO_URING) += io_uring.o opdef.o kbuf.o rsrc.o notif.o \
epoll.o statx.o timeout.o fdinfo.o \
cancel.o waitid.o register.o \
truncate.o memmap.o
+obj-$(CONFIG_PAGE_POOL) += zcrx.o
obj-$(CONFIG_IO_WQ) += io-wq.o
obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 3942db160f18..02856245af3c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -97,6 +97,7 @@
#include "uring_cmd.h"
#include "msg_ring.h"
#include "memmap.h"
+#include "zcrx.h"
#include "timeout.h"
#include "poll.h"
@@ -2600,6 +2601,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
return;
mutex_lock(&ctx->uring_lock);
+ io_unregister_zcrx_ifqs(ctx);
if (ctx->buf_data)
__io_sqe_buffers_unregister(ctx);
if (ctx->file_data)
@@ -2772,6 +2774,11 @@ static __cold void io_ring_exit_work(struct work_struct *work)
io_cqring_overflow_kill(ctx);
mutex_unlock(&ctx->uring_lock);
}
+ if (ctx->ifq) {
+ mutex_lock(&ctx->uring_lock);
+ io_shutdown_zcrx_ifqs(ctx);
+ mutex_unlock(&ctx->uring_lock);
+ }
if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
io_move_task_work_from_local(ctx);
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index a0f32a255fd1..4c384e8615f6 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -12,6 +12,7 @@
#include "memmap.h"
#include "kbuf.h"
+#include "zcrx.h"
static void *io_mem_alloc_compound(struct page **pages, int nr_pages,
size_t size, gfp_t gfp)
@@ -223,6 +224,10 @@ static void *io_uring_validate_mmap_request(struct file *file, loff_t pgoff,
io_put_bl(ctx, bl);
return ptr;
}
+ case IORING_OFF_RQ_RING:
+ if (!ctx->ifq)
+ return ERR_PTR(-EINVAL);
+ return ctx->ifq->rq_ring;
}
return ERR_PTR(-EINVAL);
@@ -261,6 +266,9 @@ __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
ctx->n_sqe_pages);
case IORING_OFF_PBUF_RING:
return io_pbuf_mmap(file, vma);
+ case IORING_OFF_RQ_RING:
+ return io_uring_mmap_pages(ctx, vma, ctx->ifq->rqe_pages,
+ ctx->ifq->n_rqe_pages);
}
return -EINVAL;
diff --git a/io_uring/register.c b/io_uring/register.c
index e3c20be5a198..3b221427e988 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -28,6 +28,7 @@
#include "kbuf.h"
#include "napi.h"
#include "eventfd.h"
+#include "zcrx.h"
#define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
IORING_REGISTER_LAST + IORING_OP_LAST)
@@ -511,6 +512,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_unregister_napi(ctx, arg);
break;
+ case IORING_REGISTER_ZCRX_IFQ:
+ ret = -EINVAL;
+ if (!arg || nr_args != 1)
+ break;
+ ret = io_register_zcrx_ifq(ctx, arg);
+ break;
default:
ret = -EINVAL;
break;
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
new file mode 100644
index 000000000000..79d79b9b8df8
--- /dev/null
+++ b/io_uring/zcrx.c
@@ -0,0 +1,147 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/io_uring.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "io_uring.h"
+#include "kbuf.h"
+#include "memmap.h"
+#include "zcrx.h"
+
+#define IO_RQ_MAX_ENTRIES 32768
+
+#if defined(CONFIG_PAGE_POOL) && defined(CONFIG_INET)
+
+static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
+ struct io_uring_zcrx_ifq_reg *reg)
+{
+ size_t off, size;
+ void *ptr;
+
+ off = sizeof(struct io_uring);
+ size = off + sizeof(struct io_uring_zcrx_rqe) * reg->rq_entries;
+
+ ptr = io_pages_map(&ifq->rqe_pages, &ifq->n_rqe_pages, size);
+ if (IS_ERR(ptr))
+ return PTR_ERR(ptr);
+
+ ifq->rq_ring = (struct io_uring *)ptr;
+ ifq->rqes = (struct io_uring_zcrx_rqe *)((char *)ptr + off);
+ return 0;
+}
+
+static void io_free_rbuf_ring(struct io_zcrx_ifq *ifq)
+{
+ io_pages_unmap(ifq->rq_ring, &ifq->rqe_pages, &ifq->n_rqe_pages, true);
+ ifq->rq_ring = NULL;
+ ifq->rqes = NULL;
+}
+
+static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
+{
+ struct io_zcrx_ifq *ifq;
+
+ ifq = kzalloc(sizeof(*ifq), GFP_KERNEL);
+ if (!ifq)
+ return NULL;
+
+ ifq->if_rxq = -1;
+ ifq->ctx = ctx;
+ return ifq;
+}
+
+static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
+{
+ io_free_rbuf_ring(ifq);
+ kfree(ifq);
+}
+
+int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
+ struct io_uring_zcrx_ifq_reg __user *arg)
+{
+ struct io_uring_zcrx_ifq_reg reg;
+ struct io_zcrx_ifq *ifq;
+ size_t ring_sz, rqes_sz;
+ int ret;
+
+ /*
+ * 1. Interface queue allocation.
+ * 2. It can observe data destined for sockets of other tasks.
+ */
+ if (!capable(CAP_NET_ADMIN))
+ return -EPERM;
+
+ /* mandatory io_uring features for zc rx */
+ if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN &&
+ ctx->flags & IORING_SETUP_CQE32))
+ return -EINVAL;
+ if (ctx->ifq)
+ return -EBUSY;
+ if (copy_from_user(&reg, arg, sizeof(reg)))
+ return -EFAULT;
+ if (reg.__resv[0] || reg.__resv[1] || reg.__resv[2])
+ return -EINVAL;
+ if (reg.if_rxq == -1 || !reg.rq_entries || reg.flags)
+ return -EINVAL;
+ if (reg.rq_entries > IO_RQ_MAX_ENTRIES) {
+ if (!(ctx->flags & IORING_SETUP_CLAMP))
+ return -EINVAL;
+ reg.rq_entries = IO_RQ_MAX_ENTRIES;
+ }
+ reg.rq_entries = roundup_pow_of_two(reg.rq_entries);
+
+ if (!reg.area_ptr)
+ return -EFAULT;
+
+ ifq = io_zcrx_ifq_alloc(ctx);
+ if (!ifq)
+ return -ENOMEM;
+
+ ret = io_allocate_rbuf_ring(ifq, &reg);
+ if (ret)
+ goto err;
+
+ ifq->rq_entries = reg.rq_entries;
+ ifq->if_rxq = reg.if_rxq;
+
+ ring_sz = sizeof(struct io_uring);
+ rqes_sz = sizeof(struct io_uring_zcrx_rqe) * ifq->rq_entries;
+ reg.offsets.mmap_sz = ring_sz + rqes_sz;
+ reg.offsets.rqes = ring_sz;
+ reg.offsets.head = offsetof(struct io_uring, head);
+ reg.offsets.tail = offsetof(struct io_uring, tail);
+
+ if (copy_to_user(arg, &reg, sizeof(reg))) {
+ ret = -EFAULT;
+ goto err;
+ }
+
+ ctx->ifq = ifq;
+ return 0;
+err:
+ io_zcrx_ifq_free(ifq);
+ return ret;
+}
+
+void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+ struct io_zcrx_ifq *ifq = ctx->ifq;
+
+ lockdep_assert_held(&ctx->uring_lock);
+
+ if (!ifq)
+ return;
+
+ ctx->ifq = NULL;
+ io_zcrx_ifq_free(ifq);
+}
+
+void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+ lockdep_assert_held(&ctx->uring_lock);
+}
+
+#endif
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
new file mode 100644
index 000000000000..4ef94e19d36b
--- /dev/null
+++ b/io_uring/zcrx.h
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_ZC_RX_H
+#define IOU_ZC_RX_H
+
+#include <linux/io_uring_types.h>
+
+struct io_zcrx_ifq {
+ struct io_ring_ctx *ctx;
+ struct net_device *dev;
+ struct io_uring *rq_ring;
+ struct io_uring_zcrx_rqe *rqes;
+ u32 rq_entries;
+
+ unsigned short n_rqe_pages;
+ struct page **rqe_pages;
+
+ u32 if_rxq;
+};
+
+#if defined(CONFIG_PAGE_POOL) && defined(CONFIG_INET)
+int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
+ struct io_uring_zcrx_ifq_reg __user *arg);
+void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx);
+void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx);
+#else
+static inline int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
+ struct io_uring_zcrx_ifq_reg __user *arg)
+{
+ return -EOPNOTSUPP;
+}
+static inline void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+static inline void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+#endif
+
+#endif
--
2.43.5
* [PATCH v1 10/15] io_uring/zcrx: add io_zcrx_area
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (8 preceding siblings ...)
2024-10-07 22:15 ` [PATCH v1 09/15] io_uring/zcrx: add interface queue and refill queue David Wei
@ 2024-10-07 22:15 ` David Wei
2024-10-09 18:02 ` Jens Axboe
2024-10-09 21:29 ` Mina Almasry
2024-10-07 22:15 ` [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
` (8 subsequent siblings)
18 siblings, 2 replies; 124+ messages in thread
From: David Wei @ 2024-10-07 22:15 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: David Wei <[email protected]>
Add io_zcrx_area that represents a region of userspace memory that is
used for zero copy. During ifq registration, userspace passes in the
uaddr and len of userspace memory, which is then pinned by the kernel.
Each net_iov is mapped to one of these pages.
The freelist is a spinlock-protected list that keeps track of all the
net_iovs/pages that aren't currently used.
For now, there is only one area per ifq and area registration happens
implicitly as part of ifq registration. There is no API for
adding/removing areas yet. The struct for area registration is there for
future extensibility once we support multiple areas and TCP devmem.
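A rough sketch of the registration call from userspace, using the raw
io_uring_register(2) syscall for illustration; the rq_entries value is
arbitrary and the kernel rounds it up to a power of two:

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>	/* uAPI from this series */

/* The ring must be set up with IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_CQE32,
 * area_base/area_len must be page aligned, and CAP_NET_ADMIN is required.
 */
static int zcrx_register(int ring_fd, __u32 ifindex, __u32 rxq,
			 void *area_base, __u64 area_len,
			 struct io_uring_zcrx_ifq_reg *reg)
{
	struct io_uring_zcrx_area_reg area = {
		.addr	= (__u64)(unsigned long)area_base,
		.len	= area_len,
	};

	memset(reg, 0, sizeof(*reg));
	reg->if_idx	= ifindex;
	reg->if_rxq	= rxq;
	reg->rq_entries	= 4096;
	reg->area_ptr	= (__u64)(unsigned long)&area;

	/* on success the kernel fills reg->offsets for mmap()ing the refill ring */
	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_ZCRX_IFQ, reg, 1);
}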
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/uapi/linux/io_uring.h | 9 ++++
io_uring/rsrc.c | 2 +-
io_uring/rsrc.h | 1 +
io_uring/zcrx.c | 93 ++++++++++++++++++++++++++++++++++-
io_uring/zcrx.h | 16 ++++++
5 files changed, 118 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 567cdb89711e..ffd315d8c6b5 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -831,6 +831,15 @@ struct io_uring_zcrx_offsets {
__u64 __resv[2];
};
+struct io_uring_zcrx_area_reg {
+ __u64 addr;
+ __u64 len;
+ __u64 rq_area_token;
+ __u32 flags;
+ __u32 __resv1;
+ __u64 __resv2[2];
+};
+
/*
* Argument for IORING_REGISTER_ZCRX_IFQ
*/
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 453867add7ca..42606404019e 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -85,7 +85,7 @@ static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
return 0;
}
-static int io_buffer_validate(struct iovec *iov)
+int io_buffer_validate(struct iovec *iov)
{
unsigned long tmp, acct_len = iov->iov_len + (PAGE_SIZE - 1);
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index c032ca3436ca..e691e8ed849b 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -74,6 +74,7 @@ int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg,
unsigned size, unsigned type);
int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
unsigned int size, unsigned int type);
+int io_buffer_validate(struct iovec *iov);
static inline void io_put_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
{
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 79d79b9b8df8..8382129402ac 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -10,6 +10,7 @@
#include "kbuf.h"
#include "memmap.h"
#include "zcrx.h"
+#include "rsrc.h"
#define IO_RQ_MAX_ENTRIES 32768
@@ -40,6 +41,83 @@ static void io_free_rbuf_ring(struct io_zcrx_ifq *ifq)
ifq->rqes = NULL;
}
+static void io_zcrx_free_area(struct io_zcrx_area *area)
+{
+ if (area->freelist)
+ kvfree(area->freelist);
+ if (area->nia.niovs)
+ kvfree(area->nia.niovs);
+ if (area->pages) {
+ unpin_user_pages(area->pages, area->nia.num_niovs);
+ kvfree(area->pages);
+ }
+ kfree(area);
+}
+
+static int io_zcrx_create_area(struct io_ring_ctx *ctx,
+ struct io_zcrx_ifq *ifq,
+ struct io_zcrx_area **res,
+ struct io_uring_zcrx_area_reg *area_reg)
+{
+ struct io_zcrx_area *area;
+ int i, ret, nr_pages;
+ struct iovec iov;
+
+ if (area_reg->flags || area_reg->rq_area_token)
+ return -EINVAL;
+ if (area_reg->__resv1 || area_reg->__resv2[0] || area_reg->__resv2[1])
+ return -EINVAL;
+ if (area_reg->addr & ~PAGE_MASK || area_reg->len & ~PAGE_MASK)
+ return -EINVAL;
+
+ iov.iov_base = u64_to_user_ptr(area_reg->addr);
+ iov.iov_len = area_reg->len;
+ ret = io_buffer_validate(&iov);
+ if (ret)
+ return ret;
+
+ ret = -ENOMEM;
+ area = kzalloc(sizeof(*area), GFP_KERNEL);
+ if (!area)
+ goto err;
+
+ area->pages = io_pin_pages((unsigned long)area_reg->addr, area_reg->len,
+ &nr_pages);
+ if (IS_ERR(area->pages)) {
+ ret = PTR_ERR(area->pages);
+ area->pages = NULL;
+ goto err;
+ }
+ area->nia.num_niovs = nr_pages;
+
+ area->nia.niovs = kvmalloc_array(nr_pages, sizeof(area->nia.niovs[0]),
+ GFP_KERNEL | __GFP_ZERO);
+ if (!area->nia.niovs)
+ goto err;
+
+ area->freelist = kvmalloc_array(nr_pages, sizeof(area->freelist[0]),
+ GFP_KERNEL | __GFP_ZERO);
+ if (!area->freelist)
+ goto err;
+
+ for (i = 0; i < nr_pages; i++) {
+ area->freelist[i] = i;
+ }
+
+ area->free_count = nr_pages;
+ area->ifq = ifq;
+ /* we're only supporting one area per ifq for now */
+ area->area_id = 0;
+ area_reg->rq_area_token = (u64)area->area_id << IORING_ZCRX_AREA_SHIFT;
+ spin_lock_init(&area->freelist_lock);
+ *res = area;
+ return 0;
+err:
+ if (area)
+ io_zcrx_free_area(area);
+ return ret;
+}
+
static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
{
struct io_zcrx_ifq *ifq;
@@ -55,6 +133,9 @@ static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
{
+ if (ifq->area)
+ io_zcrx_free_area(ifq->area);
+
io_free_rbuf_ring(ifq);
kfree(ifq);
}
@@ -62,6 +143,7 @@ static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
struct io_uring_zcrx_ifq_reg __user *arg)
{
+ struct io_uring_zcrx_area_reg area;
struct io_uring_zcrx_ifq_reg reg;
struct io_zcrx_ifq *ifq;
size_t ring_sz, rqes_sz;
@@ -93,7 +175,7 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
}
reg.rq_entries = roundup_pow_of_two(reg.rq_entries);
- if (!reg.area_ptr)
+ if (copy_from_user(&area, u64_to_user_ptr(reg.area_ptr), sizeof(area)))
return -EFAULT;
ifq = io_zcrx_ifq_alloc(ctx);
@@ -104,6 +186,10 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
if (ret)
goto err;
+ ret = io_zcrx_create_area(ctx, ifq, &ifq->area, &area);
+ if (ret)
+ goto err;
+
ifq->rq_entries = reg.rq_entries;
ifq->if_rxq = reg.if_rxq;
@@ -118,7 +204,10 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
ret = -EFAULT;
goto err;
}
-
+ if (copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) {
+ ret = -EFAULT;
+ goto err;
+ }
ctx->ifq = ifq;
return 0;
err:
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 4ef94e19d36b..2fcbeb3d5501 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -3,10 +3,26 @@
#define IOU_ZC_RX_H
#include <linux/io_uring_types.h>
+#include <net/page_pool/types.h>
+
+struct io_zcrx_area {
+ struct net_iov_area nia;
+ struct io_zcrx_ifq *ifq;
+
+ u16 area_id;
+ struct page **pages;
+
+ /* freelist */
+ spinlock_t freelist_lock ____cacheline_aligned_in_smp;
+ u32 free_count;
+ u32 *freelist;
+};
struct io_zcrx_ifq {
struct io_ring_ctx *ctx;
struct net_device *dev;
+ struct io_zcrx_area *area;
+
struct io_uring *rq_ring;
struct io_uring_zcrx_rqe *rqes;
u32 rq_entries;
--
2.43.5
^ permalink raw reply related [flat|nested] 124+ messages in thread
* [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (9 preceding siblings ...)
2024-10-07 22:15 ` [PATCH v1 10/15] io_uring/zcrx: add io_zcrx_area David Wei
@ 2024-10-07 22:15 ` David Wei
2024-10-09 18:10 ` Jens Axboe
2024-10-09 22:01 ` Mina Almasry
2024-10-07 22:16 ` [PATCH v1 12/15] io_uring/zcrx: add io_recvzc request David Wei
` (7 subsequent siblings)
18 siblings, 2 replies; 124+ messages in thread
From: David Wei @ 2024-10-07 22:15 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
Implement a page pool memory provider for io_uring to receive in a
zero copy fashion. For that, the provider allocates user pages wrapped
into struct net_iovs, which are stored in a previously registered
struct net_iov_area.
Unlike with traditional receives, for which pages from a page pool can
be deallocated right after the user receives data, e.g. via recv(2),
we extend the lifetime by recycling buffers only after the user space
acknowledges that it's done processing the data via the refill queue.
Before handing buffers to the user, we mark them by bumping the refcount
by a bias value IO_ZC_RX_UREF, which will be checked when the buffer is
returned. When the corresponding io_uring instance and/or page pool are
destroyed, we'll force back all buffers that are currently in userspace
via ->io_pp_zc_scrub by clearing the bias.
Refcounting and lifetime:
Initially, all buffers are considered unallocated and stored in
->freelist, at which point they are not yet directly exposed to the core
page pool code and not accounted to page pool's pages_state_hold_cnt.
The ->alloc_netmems callback will allocate them by placing them into the
page pool's cache, setting the refcount to 1 as usual and adjusting
pages_state_hold_cnt.
Then, the buffer is either dropped and returned to the page pool's
->freelist via io_pp_zc_release_netmem, in which case the page pool will
match hold_cnt for us with ->pages_state_release_cnt, or, more likely,
the buffer goes through the network/protocol stacks and ends up in the
corresponding socket's receive queue. From there the user can get it via
a new io_uring request implemented in the following patches. As
mentioned above, before giving a buffer to the user we bump the refcount
by IO_ZC_RX_UREF.
Once the user is done processing the buffer, it must return it via the
refill queue, from where our ->alloc_netmems implementation can grab it,
check references, drop IO_ZC_RX_UREF, and recycle the buffer if there
are no more users left. As we place such buffers right back into the
page pool's fast cache and they didn't go through the normal pp release
path, they are still considered "allocated" and no pp hold_cnt
adjustment is required. For the same reason we dma sync buffers for the
device in io_zc_add_pp_cache().
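To make the userspace side of that contract concrete, below is a hedged
sketch of returning one buffer via the refill queue. The ring layout comes
from the ifq registration in the previous patches, the off/__pad semantics
match what io_zcrx_ring_refill() checks in this patch, and the struct and
helper names here are illustrative assumptions.

/* Hedged sketch: hand one buffer back to ->alloc_netmems via the refill
 * ring. 'off' is the same area offset previously reported to the user.
 */
#include <linux/io_uring.h>
#include <stdatomic.h>
#include <stdint.h>

struct zcrx_refill {
	_Atomic uint32_t *khead;		/* written by the kernel */
	_Atomic uint32_t *ktail;		/* written by userspace */
	struct io_uring_zcrx_rqe *rqes;
	uint32_t rq_entries;			/* power of two */
};

static int zcrx_refill_one(struct zcrx_refill *rq, uint64_t off)
{
	uint32_t tail = atomic_load_explicit(rq->ktail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(rq->khead, memory_order_acquire);
	struct io_uring_zcrx_rqe *rqe;

	if (tail - head == rq->rq_entries)
		return -1;			/* refill ring is full */

	rqe = &rq->rqes[tail & (rq->rq_entries - 1)];
	rqe->off = off;				/* area id in the upper bits */
	rqe->__pad = 0;				/* non-zero entries are skipped */
	/* publish the entry before the kernel can observe the new tail */
	atomic_store_explicit(rq->ktail, tail + 1, memory_order_release);
	return 0;
}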
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
include/linux/io_uring/net.h | 5 +
io_uring/zcrx.c | 229 +++++++++++++++++++++++++++++++++++
io_uring/zcrx.h | 6 +
3 files changed, 240 insertions(+)
diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h
index b58f39fed4d5..610b35b451fd 100644
--- a/include/linux/io_uring/net.h
+++ b/include/linux/io_uring/net.h
@@ -5,6 +5,11 @@
struct io_uring_cmd;
#if defined(CONFIG_IO_URING)
+
+#if defined(CONFIG_PAGE_POOL)
+extern const struct memory_provider_ops io_uring_pp_zc_ops;
+#endif
+
int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
#else
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 8382129402ac..6cd3dee8b90a 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -2,7 +2,11 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/mm.h>
+#include <linux/nospec.h>
+#include <linux/netdevice.h>
#include <linux/io_uring.h>
+#include <net/page_pool/helpers.h>
+#include <trace/events/page_pool.h>
#include <uapi/linux/io_uring.h>
@@ -16,6 +20,13 @@
#if defined(CONFIG_PAGE_POOL) && defined(CONFIG_INET)
+static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
+{
+ struct net_iov_area *owner = net_iov_owner(niov);
+
+ return container_of(owner, struct io_zcrx_area, nia);
+}
+
static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
struct io_uring_zcrx_ifq_reg *reg)
{
@@ -101,6 +112,9 @@ static int io_zcrx_create_area(struct io_ring_ctx *ctx,
goto err;
for (i = 0; i < nr_pages; i++) {
+ struct net_iov *niov = &area->nia.niovs[i];
+
+ niov->owner = &area->nia;
area->freelist[i] = i;
}
@@ -233,4 +247,219 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
lockdep_assert_held(&ctx->uring_lock);
}
+static bool io_zcrx_niov_put(struct net_iov *niov, int nr)
+{
+ return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
+}
+
+static bool io_zcrx_put_niov_uref(struct net_iov *niov)
+{
+ if (atomic_long_read(&niov->pp_ref_count) < IO_ZC_RX_UREF)
+ return false;
+
+ return io_zcrx_niov_put(niov, IO_ZC_RX_UREF);
+}
+
+static inline void io_zc_add_pp_cache(struct page_pool *pp,
+ struct net_iov *niov)
+{
+ netmem_ref netmem = net_iov_to_netmem(niov);
+
+#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
+ if (pp->dma_sync && dma_dev_need_sync(pp->p.dev)) {
+ dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem);
+
+ dma_sync_single_range_for_device(pp->p.dev, dma_addr,
+ pp->p.offset, pp->p.max_len,
+ pp->p.dma_dir);
+ }
+#endif
+
+ page_pool_fragment_netmem(netmem, 1);
+ pp->alloc.cache[pp->alloc.count++] = netmem;
+}
+
+static inline u32 io_zcrx_rqring_entries(struct io_zcrx_ifq *ifq)
+{
+ u32 entries;
+
+ entries = smp_load_acquire(&ifq->rq_ring->tail) - ifq->cached_rq_head;
+ return min(entries, ifq->rq_entries);
+}
+
+static struct io_uring_zcrx_rqe *io_zcrx_get_rqe(struct io_zcrx_ifq *ifq,
+ unsigned mask)
+{
+ unsigned int idx = ifq->cached_rq_head++ & mask;
+
+ return &ifq->rqes[idx];
+}
+
+static void io_zcrx_ring_refill(struct page_pool *pp,
+ struct io_zcrx_ifq *ifq)
+{
+ unsigned int entries = io_zcrx_rqring_entries(ifq);
+ unsigned int mask = ifq->rq_entries - 1;
+
+ entries = min_t(unsigned, entries, PP_ALLOC_CACHE_REFILL - pp->alloc.count);
+ if (unlikely(!entries))
+ return;
+
+ do {
+ struct io_uring_zcrx_rqe *rqe = io_zcrx_get_rqe(ifq, mask);
+ struct io_zcrx_area *area;
+ struct net_iov *niov;
+ unsigned niov_idx, area_idx;
+
+ area_idx = rqe->off >> IORING_ZCRX_AREA_SHIFT;
+ niov_idx = (rqe->off & ~IORING_ZCRX_AREA_MASK) / PAGE_SIZE;
+
+ if (unlikely(rqe->__pad || area_idx))
+ continue;
+ area = ifq->area;
+
+ if (unlikely(niov_idx >= area->nia.num_niovs))
+ continue;
+ niov_idx = array_index_nospec(niov_idx, area->nia.num_niovs);
+
+ niov = &area->nia.niovs[niov_idx];
+ if (!io_zcrx_put_niov_uref(niov))
+ continue;
+ io_zc_add_pp_cache(pp, niov);
+ } while (--entries);
+
+ smp_store_release(&ifq->rq_ring->head, ifq->cached_rq_head);
+}
+
+static void io_zcrx_refill_slow(struct page_pool *pp, struct io_zcrx_ifq *ifq)
+{
+ struct io_zcrx_area *area = ifq->area;
+
+ spin_lock_bh(&area->freelist_lock);
+ while (area->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
+ struct net_iov *niov;
+ u32 pgid;
+
+ pgid = area->freelist[--area->free_count];
+ niov = &area->nia.niovs[pgid];
+
+ io_zc_add_pp_cache(pp, niov);
+
+ pp->pages_state_hold_cnt++;
+ trace_page_pool_state_hold(pp, net_iov_to_netmem(niov),
+ pp->pages_state_hold_cnt);
+ }
+ spin_unlock_bh(&area->freelist_lock);
+}
+
+static void io_zcrx_recycle_niov(struct net_iov *niov)
+{
+ struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
+
+ spin_lock_bh(&area->freelist_lock);
+ area->freelist[area->free_count++] = net_iov_idx(niov);
+ spin_unlock_bh(&area->freelist_lock);
+}
+
+static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
+{
+ struct io_zcrx_ifq *ifq = pp->mp_priv;
+
+ /* pp should already be ensuring that */
+ if (unlikely(pp->alloc.count))
+ goto out_return;
+
+ io_zcrx_ring_refill(pp, ifq);
+ if (likely(pp->alloc.count))
+ goto out_return;
+
+ io_zcrx_refill_slow(pp, ifq);
+ if (!pp->alloc.count)
+ return 0;
+out_return:
+ return pp->alloc.cache[--pp->alloc.count];
+}
+
+static bool io_pp_zc_release_netmem(struct page_pool *pp, netmem_ref netmem)
+{
+ struct net_iov *niov;
+
+ if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+ return false;
+
+ niov = netmem_to_net_iov(netmem);
+
+ if (io_zcrx_niov_put(niov, 1))
+ io_zcrx_recycle_niov(niov);
+ return false;
+}
+
+static void io_pp_zc_scrub(struct page_pool *pp)
+{
+ struct io_zcrx_ifq *ifq = pp->mp_priv;
+ struct io_zcrx_area *area = ifq->area;
+ int i;
+
+ /* Reclaim back all buffers given to the user space. */
+ for (i = 0; i < area->nia.num_niovs; i++) {
+ struct net_iov *niov = &area->nia.niovs[i];
+ int count;
+
+ if (!io_zcrx_put_niov_uref(niov))
+ continue;
+ io_zcrx_recycle_niov(niov);
+
+ count = atomic_inc_return_relaxed(&pp->pages_state_release_cnt);
+ trace_page_pool_state_release(pp, net_iov_to_netmem(niov), count);
+ }
+}
+
+static int io_pp_zc_init(struct page_pool *pp)
+{
+ struct io_zcrx_ifq *ifq = pp->mp_priv;
+ struct io_zcrx_area *area = ifq->area;
+ int ret;
+
+ if (!ifq)
+ return -EINVAL;
+ if (pp->p.order != 0)
+ return -EINVAL;
+ if (!pp->p.napi)
+ return -EINVAL;
+ if (!pp->p.napi->napi_id)
+ return -EINVAL;
+
+ ret = page_pool_init_paged_area(pp, &area->nia, area->pages);
+ if (ret)
+ return ret;
+
+ ifq->napi_id = pp->p.napi->napi_id;
+ percpu_ref_get(&ifq->ctx->refs);
+ ifq->pp = pp;
+ return 0;
+}
+
+static void io_pp_zc_destroy(struct page_pool *pp)
+{
+ struct io_zcrx_ifq *ifq = pp->mp_priv;
+ struct io_zcrx_area *area = ifq->area;
+
+ page_pool_release_area(pp, &ifq->area->nia);
+
+ ifq->pp = NULL;
+ ifq->napi_id = 0;
+
+ if (WARN_ON_ONCE(area->free_count != area->nia.num_niovs))
+ return;
+ percpu_ref_put(&ifq->ctx->refs);
+}
+
+const struct memory_provider_ops io_uring_pp_zc_ops = {
+ .alloc_netmems = io_pp_zc_alloc_netmems,
+ .release_netmem = io_pp_zc_release_netmem,
+ .init = io_pp_zc_init,
+ .destroy = io_pp_zc_destroy,
+ .scrub = io_pp_zc_scrub,
+};
+
#endif
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 2fcbeb3d5501..67512fc69cc4 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -5,6 +5,9 @@
#include <linux/io_uring_types.h>
#include <net/page_pool/types.h>
+#define IO_ZC_RX_UREF 0x10000
+#define IO_ZC_RX_KREF_MASK (IO_ZC_RX_UREF - 1)
+
struct io_zcrx_area {
struct net_iov_area nia;
struct io_zcrx_ifq *ifq;
@@ -22,15 +25,18 @@ struct io_zcrx_ifq {
struct io_ring_ctx *ctx;
struct net_device *dev;
struct io_zcrx_area *area;
+ struct page_pool *pp;
struct io_uring *rq_ring;
struct io_uring_zcrx_rqe *rqes;
u32 rq_entries;
+ u32 cached_rq_head;
unsigned short n_rqe_pages;
struct page **rqe_pages;
u32 if_rxq;
+ unsigned napi_id;
};
#if defined(CONFIG_PAGE_POOL) && defined(CONFIG_INET)
--
2.43.5
^ permalink raw reply related [flat|nested] 124+ messages in thread
* [PATCH v1 12/15] io_uring/zcrx: add io_recvzc request
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (10 preceding siblings ...)
2024-10-07 22:15 ` [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
@ 2024-10-07 22:16 ` David Wei
2024-10-09 18:28 ` Jens Axboe
2024-10-07 22:16 ` [PATCH v1 13/15] io_uring/zcrx: add copy fallback David Wei
` (6 subsequent siblings)
18 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-07 22:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
Add io_uring opcode OP_RECV_ZC for doing zero copy reads out of a
socket. Only the connection should land on the specific rx queue set up
for zero copy, and the socket must be handled by the io_uring instance
that the rx queue was registered for zero copy with. That's because
net_iovs / buffers from our queue cannot be read by outside
applications, and zero copy is not possible if traffic for the zero copy
connection goes to another queue. This coordination is outside of the
scope of this patch series. Also, any traffic directed to the zero copy
enabled queue is immediately visible to the application, which is why
CAP_NET_ADMIN is required at the registration step.
Of course, no data is actually read out of the socket; it has already
been copied by the netdev into userspace memory via DMA. OP_RECV_ZC
reads skbs out of the socket and checks that their frags are indeed
net_iovs that belong to io_uring. A cqe is queued for each one of these
frags.
Recall that each cqe is a big cqe, with the top half being an
io_uring_zcrx_cqe. The cqe res field contains the len or error. The
lower IORING_ZCRX_AREA_SHIFT bits of the struct io_uring_zcrx_cqe::off
field contain the offset relative to the start of the zero copy area.
The upper part of the off field is trivially zero, and will be used
to carry the area id.
For now, there is no limit as to how much work each OP_RECV_ZC request
does. It will attempt to drain a socket of all available data. This
request always operates in multishot mode.
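As an illustration (not part of the patch), here is a hedged sketch of
decoding such a completion in userspace; it assumes big CQEs and the
IORING_ZCRX_AREA_SHIFT constant exported by this series, and that
area_base is the start of the registered area.

/* Hedged sketch: locate the received bytes described by a RECV_ZC CQE. */
#include <linux/io_uring.h>
#include <stdint.h>

static void *zcrx_cqe_data(const struct io_uring_cqe *cqe, void *area_base,
			   int *len)
{
	/* the zcrx descriptor sits in the second 16 bytes of the big cqe,
	 * mirroring io_zcrx_queue_cqe() in this patch */
	const struct io_uring_zcrx_cqe *rcqe =
			(const struct io_uring_zcrx_cqe *)(cqe + 1);
	uint64_t off_mask = ((uint64_t)1 << IORING_ZCRX_AREA_SHIFT) - 1;

	*len = cqe->res;			/* byte count or negative error */
	/* low bits: byte offset into the area; high bits: area id (0) */
	return (char *)area_base + (rcqe->off & off_mask);
}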
Signed-off-by: David Wei <[email protected]>
---
include/uapi/linux/io_uring.h | 2 +
io_uring/io_uring.h | 10 ++
io_uring/net.c | 78 +++++++++++++++
io_uring/opdef.c | 16 +++
io_uring/zcrx.c | 180 ++++++++++++++++++++++++++++++++++
io_uring/zcrx.h | 11 +++
6 files changed, 297 insertions(+)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ffd315d8c6b5..c9c9877f2ba7 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -87,6 +87,7 @@ struct io_uring_sqe {
union {
__s32 splice_fd_in;
__u32 file_index;
+ __u32 zcrx_ifq_idx;
__u32 optlen;
struct {
__u16 addr_len;
@@ -259,6 +260,7 @@ enum io_uring_op {
IORING_OP_FTRUNCATE,
IORING_OP_BIND,
IORING_OP_LISTEN,
+ IORING_OP_RECV_ZC,
/* this goes last, obviously */
IORING_OP_LAST,
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index c2acf6180845..8cec53a63c39 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -171,6 +171,16 @@ static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret
return io_get_cqe_overflow(ctx, ret, false);
}
+static inline bool io_defer_get_uncommited_cqe(struct io_ring_ctx *ctx,
+ struct io_uring_cqe **cqe_ret)
+{
+ io_lockdep_assert_cq_locked(ctx);
+
+ ctx->cq_extra++;
+ ctx->submit_state.cq_flush = true;
+ return io_get_cqe(ctx, cqe_ret);
+}
+
static __always_inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
struct io_kiocb *req)
{
diff --git a/io_uring/net.c b/io_uring/net.c
index d08abcca89cc..482e138d2994 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -16,6 +16,7 @@
#include "net.h"
#include "notif.h"
#include "rsrc.h"
+#include "zcrx.h"
#if defined(CONFIG_NET)
struct io_shutdown {
@@ -89,6 +90,13 @@ struct io_sr_msg {
*/
#define MULTISHOT_MAX_RETRY 32
+struct io_recvzc {
+ struct file *file;
+ unsigned msg_flags;
+ u16 flags;
+ struct io_zcrx_ifq *ifq;
+};
+
int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
struct io_shutdown *shutdown = io_kiocb_to_cmd(req, struct io_shutdown);
@@ -1193,6 +1201,76 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
return ret;
}
+int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
+ unsigned ifq_idx;
+
+ if (unlikely(sqe->file_index || sqe->addr2 || sqe->addr ||
+ sqe->len || sqe->addr3))
+ return -EINVAL;
+
+ ifq_idx = READ_ONCE(sqe->zcrx_ifq_idx);
+ if (ifq_idx != 0)
+ return -EINVAL;
+ zc->ifq = req->ctx->ifq;
+ if (!zc->ifq)
+ return -EINVAL;
+
+ /* All data completions are posted as aux CQEs. */
+ req->flags |= REQ_F_APOLL_MULTISHOT;
+
+ zc->flags = READ_ONCE(sqe->ioprio);
+ zc->msg_flags = READ_ONCE(sqe->msg_flags);
+ if (zc->msg_flags)
+ return -EINVAL;
+ if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST | IORING_RECV_MULTISHOT))
+ return -EINVAL;
+
+
+#ifdef CONFIG_COMPAT
+ if (req->ctx->compat)
+ zc->msg_flags |= MSG_CMSG_COMPAT;
+#endif
+ return 0;
+}
+
+int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
+ struct io_zcrx_ifq *ifq;
+ struct socket *sock;
+ int ret;
+
+ if (!(req->flags & REQ_F_POLLED) &&
+ (zc->flags & IORING_RECVSEND_POLL_FIRST))
+ return -EAGAIN;
+
+ sock = sock_from_file(req->file);
+ if (unlikely(!sock))
+ return -ENOTSOCK;
+ ifq = req->ctx->ifq;
+ if (!ifq)
+ return -EINVAL;
+
+ ret = io_zcrx_recv(req, ifq, sock, zc->msg_flags | MSG_DONTWAIT);
+ if (unlikely(ret <= 0) && ret != -EAGAIN) {
+ if (ret == -ERESTARTSYS)
+ ret = -EINTR;
+
+ req_set_fail(req);
+ io_req_set_res(req, ret, 0);
+
+ if (issue_flags & IO_URING_F_MULTISHOT)
+ return IOU_STOP_MULTISHOT;
+ return IOU_OK;
+ }
+
+ if (issue_flags & IO_URING_F_MULTISHOT)
+ return IOU_ISSUE_SKIP_COMPLETE;
+ return -EAGAIN;
+}
+
void io_send_zc_cleanup(struct io_kiocb *req)
{
struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index a2be3bbca5ff..599eb3ea5ff4 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -36,6 +36,7 @@
#include "waitid.h"
#include "futex.h"
#include "truncate.h"
+#include "zcrx.h"
static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
{
@@ -513,6 +514,18 @@ const struct io_issue_def io_issue_defs[] = {
.async_size = sizeof(struct io_async_msghdr),
#else
.prep = io_eopnotsupp_prep,
+#endif
+ },
+ [IORING_OP_RECV_ZC] = {
+ .needs_file = 1,
+ .unbound_nonreg_file = 1,
+ .pollin = 1,
+ .ioprio = 1,
+#if defined(CONFIG_NET)
+ .prep = io_recvzc_prep,
+ .issue = io_recvzc,
+#else
+ .prep = io_eopnotsupp_prep,
#endif
},
};
@@ -742,6 +755,9 @@ const struct io_cold_def io_cold_defs[] = {
[IORING_OP_LISTEN] = {
.name = "LISTEN",
},
+ [IORING_OP_RECV_ZC] = {
+ .name = "RECV_ZC",
+ },
};
const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 6cd3dee8b90a..8166d8a2656e 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -7,6 +7,8 @@
#include <linux/io_uring.h>
#include <net/page_pool/helpers.h>
#include <trace/events/page_pool.h>
+#include <net/tcp.h>
+#include <net/rps.h>
#include <uapi/linux/io_uring.h>
@@ -20,6 +22,12 @@
#if defined(CONFIG_PAGE_POOL) && defined(CONFIG_INET)
+struct io_zcrx_args {
+ struct io_kiocb *req;
+ struct io_zcrx_ifq *ifq;
+ struct socket *sock;
+};
+
static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
{
struct net_iov_area *owner = net_iov_owner(niov);
@@ -247,6 +255,11 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
lockdep_assert_held(&ctx->uring_lock);
}
+static void io_zcrx_get_buf_uref(struct net_iov *niov)
+{
+ atomic_long_add(IO_ZC_RX_UREF, &niov->pp_ref_count);
+}
+
static bool io_zcrx_niov_put(struct net_iov *niov, int nr)
{
return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
@@ -462,4 +475,171 @@ const struct memory_provider_ops io_uring_pp_zc_ops = {
.scrub = io_pp_zc_scrub,
};
+static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov,
+ struct io_zcrx_ifq *ifq, int off, int len)
+{
+ struct io_uring_zcrx_cqe *rcqe;
+ struct io_zcrx_area *area;
+ struct io_uring_cqe *cqe;
+ u64 offset;
+
+ if (!io_defer_get_uncommited_cqe(req->ctx, &cqe))
+ return false;
+
+ cqe->user_data = req->cqe.user_data;
+ cqe->res = len;
+ cqe->flags = IORING_CQE_F_MORE;
+
+ area = io_zcrx_iov_to_area(niov);
+ offset = off + (net_iov_idx(niov) << PAGE_SHIFT);
+ rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
+ rcqe->off = offset + ((u64)area->area_id << IORING_ZCRX_AREA_SHIFT);
+ memset(&rcqe->__pad, 0, sizeof(rcqe->__pad));
+ return true;
+}
+
+static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+ const skb_frag_t *frag, int off, int len)
+{
+ struct net_iov *niov;
+
+ off += skb_frag_off(frag);
+
+ if (unlikely(!skb_frag_is_net_iov(frag)))
+ return -EOPNOTSUPP;
+
+ niov = netmem_to_net_iov(frag->netmem);
+ if (niov->pp->mp_ops != &io_uring_pp_zc_ops ||
+ niov->pp->mp_priv != ifq)
+ return -EFAULT;
+
+ if (!io_zcrx_queue_cqe(req, niov, ifq, off, len))
+ return -ENOSPC;
+ io_zcrx_get_buf_uref(niov);
+ return len;
+}
+
+static int
+io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+ unsigned int offset, size_t len)
+{
+ struct io_zcrx_args *args = desc->arg.data;
+ struct io_zcrx_ifq *ifq = args->ifq;
+ struct io_kiocb *req = args->req;
+ struct sk_buff *frag_iter;
+ unsigned start, start_off;
+ int i, copy, end, off;
+ int ret = 0;
+
+ start = skb_headlen(skb);
+ start_off = offset;
+
+ if (offset < start)
+ return -EOPNOTSUPP;
+
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ const skb_frag_t *frag;
+
+ if (WARN_ON(start > offset + len))
+ return -EFAULT;
+
+ frag = &skb_shinfo(skb)->frags[i];
+ end = start + skb_frag_size(frag);
+
+ if (offset < end) {
+ copy = end - offset;
+ if (copy > len)
+ copy = len;
+
+ off = offset - start;
+ ret = io_zcrx_recv_frag(req, ifq, frag, off, copy);
+ if (ret < 0)
+ goto out;
+
+ offset += ret;
+ len -= ret;
+ if (len == 0 || ret != copy)
+ goto out;
+ }
+ start = end;
+ }
+
+ skb_walk_frags(skb, frag_iter) {
+ if (WARN_ON(start > offset + len))
+ return -EFAULT;
+
+ end = start + frag_iter->len;
+ if (offset < end) {
+ copy = end - offset;
+ if (copy > len)
+ copy = len;
+
+ off = offset - start;
+ ret = io_zcrx_recv_skb(desc, frag_iter, off, copy);
+ if (ret < 0)
+ goto out;
+
+ offset += ret;
+ len -= ret;
+ if (len == 0 || ret != copy)
+ goto out;
+ }
+ start = end;
+ }
+
+out:
+ if (offset == start_off)
+ return ret;
+ return offset - start_off;
+}
+
+static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+ struct sock *sk, int flags)
+{
+ struct io_zcrx_args args = {
+ .req = req,
+ .ifq = ifq,
+ .sock = sk->sk_socket,
+ };
+ read_descriptor_t rd_desc = {
+ .count = 1,
+ .arg.data = &args,
+ };
+ int ret;
+
+ lock_sock(sk);
+ ret = tcp_read_sock(sk, &rd_desc, io_zcrx_recv_skb);
+ if (ret <= 0) {
+ if (ret < 0 || sock_flag(sk, SOCK_DONE))
+ goto out;
+ if (sk->sk_err)
+ ret = sock_error(sk);
+ else if (sk->sk_shutdown & RCV_SHUTDOWN)
+ goto out;
+ else if (sk->sk_state == TCP_CLOSE)
+ ret = -ENOTCONN;
+ else
+ ret = -EAGAIN;
+ } else if (sock_flag(sk, SOCK_DONE)) {
+ /* Make it to retry until it finally gets 0. */
+ ret = -EAGAIN;
+ }
+out:
+ release_sock(sk);
+ return ret;
+}
+
+int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+ struct socket *sock, unsigned int flags)
+{
+ struct sock *sk = sock->sk;
+ const struct proto *prot = READ_ONCE(sk->sk_prot);
+
+ if (prot->recvmsg != tcp_recvmsg)
+ return -EPROTONOSUPPORT;
+
+ sock_rps_record_flow(sk);
+ return io_zcrx_tcp_recvmsg(req, ifq, sk, flags);
+}
+
#endif
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 67512fc69cc4..ddd68098122a 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -3,6 +3,7 @@
#define IOU_ZC_RX_H
#include <linux/io_uring_types.h>
+#include <linux/socket.h>
#include <net/page_pool/types.h>
#define IO_ZC_RX_UREF 0x10000
@@ -44,6 +45,8 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
struct io_uring_zcrx_ifq_reg __user *arg);
void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx);
void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx);
+int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+ struct socket *sock, unsigned int flags);
#else
static inline int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
struct io_uring_zcrx_ifq_reg __user *arg)
@@ -56,6 +59,14 @@ static inline void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
static inline void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
{
}
+static inline int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+ struct socket *sock, unsigned int flags)
+{
+ return -EOPNOTSUPP;
+}
#endif
+int io_recvzc(struct io_kiocb *req, unsigned int issue_flags);
+int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+
#endif
--
2.43.5
^ permalink raw reply related [flat|nested] 124+ messages in thread
* [PATCH v1 13/15] io_uring/zcrx: add copy fallback
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (11 preceding siblings ...)
2024-10-07 22:16 ` [PATCH v1 12/15] io_uring/zcrx: add io_recvzc request David Wei
@ 2024-10-07 22:16 ` David Wei
2024-10-08 15:58 ` Stanislav Fomichev
2024-10-09 18:38 ` Jens Axboe
2024-10-07 22:16 ` [PATCH v1 14/15] io_uring/zcrx: set pp memory provider for an rx queue David Wei
` (5 subsequent siblings)
18 siblings, 2 replies; 124+ messages in thread
From: David Wei @ 2024-10-07 22:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
There are scenarios in which the zerocopy path might get a normal
in-kernel buffer: it could be a mis-steered packet or simply the linear
part of an skb. Another use case is to allow the driver to allocate
kernel pages when it's out of zc buffers, which makes it more resilient
to spikes in load and allows the user to choose the balance between the
amount of memory provided and performance.
At the moment we fail such requests. Instead, grab a buffer from the
page pool, copy the data there, and return it to the user in the usual
way. Because the refill ring is private to the napi our page pool is
running from, the copy is done by stopping the napi via the
napi_execute() helper. It grabs only one buffer at a time, which is
inefficient, and improving it is left for follow-up patches.
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
io_uring/zcrx.c | 125 +++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 118 insertions(+), 7 deletions(-)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 8166d8a2656e..d21e7017deb3 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -5,6 +5,8 @@
#include <linux/nospec.h>
#include <linux/netdevice.h>
#include <linux/io_uring.h>
+#include <linux/skbuff_ref.h>
+#include <net/busy_poll.h>
#include <net/page_pool/helpers.h>
#include <trace/events/page_pool.h>
#include <net/tcp.h>
@@ -28,6 +30,11 @@ struct io_zcrx_args {
struct socket *sock;
};
+struct io_zc_refill_data {
+ struct io_zcrx_ifq *ifq;
+ struct net_iov *niov;
+};
+
static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
{
struct net_iov_area *owner = net_iov_owner(niov);
@@ -35,6 +42,13 @@ static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *nio
return container_of(owner, struct io_zcrx_area, nia);
}
+static inline struct page *io_zcrx_iov_page(const struct net_iov *niov)
+{
+ struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
+
+ return area->pages[net_iov_idx(niov)];
+}
+
static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
struct io_uring_zcrx_ifq_reg *reg)
{
@@ -475,6 +489,34 @@ const struct memory_provider_ops io_uring_pp_zc_ops = {
.scrub = io_pp_zc_scrub,
};
+static void io_napi_refill(void *data)
+{
+ struct io_zc_refill_data *rd = data;
+ struct io_zcrx_ifq *ifq = rd->ifq;
+ netmem_ref netmem;
+
+ if (WARN_ON_ONCE(!ifq->pp))
+ return;
+
+ netmem = page_pool_alloc_netmem(ifq->pp, GFP_ATOMIC | __GFP_NOWARN);
+ if (!netmem)
+ return;
+ if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+ return;
+
+ rd->niov = netmem_to_net_iov(netmem);
+}
+
+static struct net_iov *io_zc_get_buf_task_safe(struct io_zcrx_ifq *ifq)
+{
+ struct io_zc_refill_data rd = {
+ .ifq = ifq,
+ };
+
+ napi_execute(ifq->napi_id, io_napi_refill, &rd);
+ return rd.niov;
+}
+
static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov,
struct io_zcrx_ifq *ifq, int off, int len)
{
@@ -498,6 +540,45 @@ static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov,
return true;
}
+static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+ void *data, unsigned int offset, size_t len)
+{
+ size_t copy_size, copied = 0;
+ int ret = 0, off = 0;
+ struct page *page;
+ u8 *vaddr;
+
+ do {
+ struct net_iov *niov;
+
+ niov = io_zc_get_buf_task_safe(ifq);
+ if (!niov) {
+ ret = -ENOMEM;
+ break;
+ }
+
+ page = io_zcrx_iov_page(niov);
+ vaddr = kmap_local_page(page);
+ copy_size = min_t(size_t, PAGE_SIZE, len);
+ memcpy(vaddr, data + offset, copy_size);
+ kunmap_local(vaddr);
+
+ if (!io_zcrx_queue_cqe(req, niov, ifq, off, copy_size)) {
+ napi_pp_put_page(net_iov_to_netmem(niov));
+ return -ENOSPC;
+ }
+
+ io_zcrx_get_buf_uref(niov);
+ napi_pp_put_page(net_iov_to_netmem(niov));
+
+ offset += copy_size;
+ len -= copy_size;
+ copied += copy_size;
+ } while (offset < len);
+
+ return copied ? copied : ret;
+}
+
static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
const skb_frag_t *frag, int off, int len)
{
@@ -505,8 +586,24 @@ static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
off += skb_frag_off(frag);
- if (unlikely(!skb_frag_is_net_iov(frag)))
- return -EOPNOTSUPP;
+ if (unlikely(!skb_frag_is_net_iov(frag))) {
+ struct page *page = skb_frag_page(frag);
+ u32 p_off, p_len, t, copied = 0;
+ u8 *vaddr;
+ int ret = 0;
+
+ skb_frag_foreach_page(frag, off, len,
+ page, p_off, p_len, t) {
+ vaddr = kmap_local_page(page);
+ ret = io_zcrx_copy_chunk(req, ifq, vaddr, p_off, p_len);
+ kunmap_local(vaddr);
+
+ if (ret < 0)
+ return copied ? copied : ret;
+ copied += ret;
+ }
+ return copied;
+ }
niov = netmem_to_net_iov(frag->netmem);
if (niov->pp->mp_ops != &io_uring_pp_zc_ops ||
@@ -527,15 +624,29 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
struct io_zcrx_ifq *ifq = args->ifq;
struct io_kiocb *req = args->req;
struct sk_buff *frag_iter;
- unsigned start, start_off;
+ unsigned start, start_off = offset;
int i, copy, end, off;
int ret = 0;
- start = skb_headlen(skb);
- start_off = offset;
+ if (unlikely(offset < skb_headlen(skb))) {
+ ssize_t copied;
+ size_t to_copy;
- if (offset < start)
- return -EOPNOTSUPP;
+ to_copy = min_t(size_t, skb_headlen(skb) - offset, len);
+ copied = io_zcrx_copy_chunk(req, ifq, skb->data, offset, to_copy);
+ if (copied < 0) {
+ ret = copied;
+ goto out;
+ }
+ offset += copied;
+ len -= copied;
+ if (!len)
+ goto out;
+ if (offset != skb_headlen(skb))
+ goto out;
+ }
+
+ start = skb_headlen(skb);
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
const skb_frag_t *frag;
--
2.43.5
^ permalink raw reply related [flat|nested] 124+ messages in thread
* [PATCH v1 14/15] io_uring/zcrx: set pp memory provider for an rx queue
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (12 preceding siblings ...)
2024-10-07 22:16 ` [PATCH v1 13/15] io_uring/zcrx: add copy fallback David Wei
@ 2024-10-07 22:16 ` David Wei
2024-10-09 18:42 ` Jens Axboe
2024-10-07 22:16 ` [PATCH v1 15/15] io_uring/zcrx: throttle receive requests David Wei
` (4 subsequent siblings)
18 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-07 22:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: David Wei <[email protected]>
Set the page pool memory provider for the rx queue configured for zero
copy to io_uring. Then the rx queue is restarted using
netdev_rx_queue_restart() and netdev core + page pool will take care of
filling the rx queue from the io_uring zero copy memory provider.
For now, there is only one ifq so its destruction happens implicitly
during io_uring cleanup.
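For reference, a hedged userspace sketch of the registration side; the
struct fields and the IORING_REGISTER_ZCRX_IFQ opcode follow the uapi added
earlier in the series, while the rq_entries value, nr_args and the raw
syscall invocation are illustrative assumptions.

/* Hedged sketch: register an ifq bound to one hw rx queue. */
#include <linux/io_uring.h>
#include <net/if.h>
#include <string.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static int register_zcrx_ifq(int ring_fd, const char *ifname, uint32_t rxq,
			     struct io_uring_zcrx_area_reg *area,
			     struct io_uring_zcrx_ifq_reg *reg)
{
	memset(reg, 0, sizeof(*reg));
	reg->if_idx = if_nametoindex(ifname);	/* netdev to attach to */
	reg->if_rxq = rxq;			/* rx queue set up for zero copy */
	reg->rq_entries = 4096;			/* refill ring size */
	reg->area_ptr = (uint64_t)(uintptr_t)area;

	/* needs CAP_NET_ADMIN; on success the kernel restarts the queue with
	 * the io_uring page pool memory provider installed and fills in the
	 * ring offsets and area token */
	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_ZCRX_IFQ, reg, 1);
}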
Signed-off-by: David Wei <[email protected]>
---
io_uring/zcrx.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 82 insertions(+), 2 deletions(-)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index d21e7017deb3..7939f830cf5b 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -6,6 +6,8 @@
#include <linux/netdevice.h>
#include <linux/io_uring.h>
#include <linux/skbuff_ref.h>
+#include <linux/io_uring/net.h>
+#include <net/netdev_rx_queue.h>
#include <net/busy_poll.h>
#include <net/page_pool/helpers.h>
#include <trace/events/page_pool.h>
@@ -49,6 +51,63 @@ static inline struct page *io_zcrx_iov_page(const struct net_iov *niov)
return area->pages[net_iov_idx(niov)];
}
+static int io_open_zc_rxq(struct io_zcrx_ifq *ifq, unsigned ifq_idx)
+{
+ struct netdev_rx_queue *rxq;
+ struct net_device *dev = ifq->dev;
+ int ret;
+
+ ASSERT_RTNL();
+
+ if (ifq_idx >= dev->num_rx_queues)
+ return -EINVAL;
+ ifq_idx = array_index_nospec(ifq_idx, dev->num_rx_queues);
+
+ rxq = __netif_get_rx_queue(ifq->dev, ifq_idx);
+ if (rxq->mp_params.mp_priv)
+ return -EEXIST;
+
+ ifq->if_rxq = ifq_idx;
+ rxq->mp_params.mp_ops = &io_uring_pp_zc_ops;
+ rxq->mp_params.mp_priv = ifq;
+ ret = netdev_rx_queue_restart(ifq->dev, ifq->if_rxq);
+ if (ret) {
+ rxq->mp_params.mp_ops = NULL;
+ rxq->mp_params.mp_priv = NULL;
+ ifq->if_rxq = -1;
+ }
+ return ret;
+}
+
+static void io_close_zc_rxq(struct io_zcrx_ifq *ifq)
+{
+ struct netdev_rx_queue *rxq;
+ int err;
+
+ if (ifq->if_rxq == -1)
+ return;
+
+ rtnl_lock();
+ if (WARN_ON_ONCE(ifq->if_rxq >= ifq->dev->num_rx_queues)) {
+ rtnl_unlock();
+ return;
+ }
+
+ rxq = __netif_get_rx_queue(ifq->dev, ifq->if_rxq);
+
+ WARN_ON_ONCE(rxq->mp_params.mp_priv != ifq);
+
+ rxq->mp_params.mp_ops = NULL;
+ rxq->mp_params.mp_priv = NULL;
+
+ err = netdev_rx_queue_restart(ifq->dev, ifq->if_rxq);
+ if (err)
+ pr_devel("io_uring: can't restart a queue on zcrx close\n");
+
+ rtnl_unlock();
+ ifq->if_rxq = -1;
+}
+
static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
struct io_uring_zcrx_ifq_reg *reg)
{
@@ -169,9 +228,12 @@ static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
{
+ io_close_zc_rxq(ifq);
+
if (ifq->area)
io_zcrx_free_area(ifq->area);
-
+ if (ifq->dev)
+ dev_put(ifq->dev);
io_free_rbuf_ring(ifq);
kfree(ifq);
}
@@ -227,7 +289,17 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
goto err;
ifq->rq_entries = reg.rq_entries;
- ifq->if_rxq = reg.if_rxq;
+
+ ret = -ENODEV;
+ rtnl_lock();
+ ifq->dev = dev_get_by_index(current->nsproxy->net_ns, reg.if_idx);
+ if (!ifq->dev)
+ goto err_rtnl_unlock;
+
+ ret = io_open_zc_rxq(ifq, reg.if_rxq);
+ if (ret)
+ goto err_rtnl_unlock;
+ rtnl_unlock();
ring_sz = sizeof(struct io_uring);
rqes_sz = sizeof(struct io_uring_zcrx_rqe) * ifq->rq_entries;
@@ -237,15 +309,20 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
reg.offsets.tail = offsetof(struct io_uring, tail);
if (copy_to_user(arg, &reg, sizeof(reg))) {
+ io_close_zc_rxq(ifq);
ret = -EFAULT;
goto err;
}
if (copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) {
+ io_close_zc_rxq(ifq);
ret = -EFAULT;
goto err;
}
ctx->ifq = ifq;
return 0;
+
+err_rtnl_unlock:
+ rtnl_unlock();
err:
io_zcrx_ifq_free(ifq);
return ret;
@@ -267,6 +344,9 @@ void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
{
lockdep_assert_held(&ctx->uring_lock);
+
+ if (ctx->ifq)
+ io_close_zc_rxq(ctx->ifq);
}
static void io_zcrx_get_buf_uref(struct net_iov *niov)
--
2.43.5
^ permalink raw reply related [flat|nested] 124+ messages in thread
* [PATCH v1 15/15] io_uring/zcrx: throttle receive requests
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (13 preceding siblings ...)
2024-10-07 22:16 ` [PATCH v1 14/15] io_uring/zcrx: set pp memory provider for an rx queue David Wei
@ 2024-10-07 22:16 ` David Wei
2024-10-09 18:43 ` Jens Axboe
2024-10-07 22:20 ` [PATCH v1 00/15] io_uring zero copy rx David Wei
` (3 subsequent siblings)
18 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-07 22:16 UTC (permalink / raw)
To: io-uring, netdev
Cc: David Wei, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
From: Pavel Begunkov <[email protected]>
io_zcrx_tcp_recvmsg() continues until it fails or there is nothing to
receive. If the other side sends fast enough, we might get stuck in
io_zcrx_tcp_recvmsg() producing more and more CQEs but never letting the
user handle them, leading to unbounded latencies.
Break out of it based on an arbitrarily chosen limit; the upper layer
will either return to userspace or requeue the request.
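From userspace this is transparent: the usual multishot contract applies,
i.e. keep consuming CQEs while IORING_CQE_F_MORE is set and re-issue the
request once it is clear. A hedged liburing-based sketch, for illustration
only:

/* Hedged sketch: drain RECV_ZC completions until the multishot stops. */
#include <liburing.h>

static int drain_recvzc(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	int ret;

	for (;;) {
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0)
			return ret;
		if (cqe->res >= 0) {
			/* data CQE: the io_uring_zcrx_cqe in the second half
			 * of the big cqe says where the bytes are */
			/* ... consume data, then return the buffer via the
			 * refill queue ... */
		}
		if (!(cqe->flags & IORING_CQE_F_MORE)) {
			/* terminated (error or completion): the OP_RECV_ZC
			 * SQE has to be submitted again */
			ret = cqe->res;
			io_uring_cqe_seen(ring, cqe);
			return ret;
		}
		io_uring_cqe_seen(ring, cqe);
	}
}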
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
io_uring/net.c | 5 ++++-
io_uring/zcrx.c | 17 ++++++++++++++---
io_uring/zcrx.h | 6 ++++--
3 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/io_uring/net.c b/io_uring/net.c
index 482e138d2994..c99e62c7dcfb 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1253,10 +1253,13 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
if (!ifq)
return -EINVAL;
- ret = io_zcrx_recv(req, ifq, sock, zc->msg_flags | MSG_DONTWAIT);
+ ret = io_zcrx_recv(req, ifq, sock, zc->msg_flags | MSG_DONTWAIT,
+ issue_flags);
if (unlikely(ret <= 0) && ret != -EAGAIN) {
if (ret == -ERESTARTSYS)
ret = -EINTR;
+ if (ret == IOU_REQUEUE)
+ return IOU_REQUEUE;
req_set_fail(req);
io_req_set_res(req, ret, 0);
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 7939f830cf5b..a78d82a2d404 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -26,10 +26,13 @@
#if defined(CONFIG_PAGE_POOL) && defined(CONFIG_INET)
+#define IO_SKBS_PER_CALL_LIMIT 20
+
struct io_zcrx_args {
struct io_kiocb *req;
struct io_zcrx_ifq *ifq;
struct socket *sock;
+ unsigned nr_skbs;
};
struct io_zc_refill_data {
@@ -708,6 +711,9 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
int i, copy, end, off;
int ret = 0;
+ if (unlikely(args->nr_skbs++ > IO_SKBS_PER_CALL_LIMIT))
+ return -EAGAIN;
+
if (unlikely(offset < skb_headlen(skb))) {
ssize_t copied;
size_t to_copy;
@@ -785,7 +791,8 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
}
static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
- struct sock *sk, int flags)
+ struct sock *sk, int flags,
+ unsigned int issue_flags)
{
struct io_zcrx_args args = {
.req = req,
@@ -811,6 +818,9 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
ret = -ENOTCONN;
else
ret = -EAGAIN;
+ } else if (unlikely(args.nr_skbs > IO_SKBS_PER_CALL_LIMIT) &&
+ (issue_flags & IO_URING_F_MULTISHOT)) {
+ ret = IOU_REQUEUE;
} else if (sock_flag(sk, SOCK_DONE)) {
/* Make it to retry until it finally gets 0. */
ret = -EAGAIN;
@@ -821,7 +831,8 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
}
int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
- struct socket *sock, unsigned int flags)
+ struct socket *sock, unsigned int flags,
+ unsigned int issue_flags)
{
struct sock *sk = sock->sk;
const struct proto *prot = READ_ONCE(sk->sk_prot);
@@ -830,7 +841,7 @@ int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
return -EPROTONOSUPPORT;
sock_rps_record_flow(sk);
- return io_zcrx_tcp_recvmsg(req, ifq, sk, flags);
+ return io_zcrx_tcp_recvmsg(req, ifq, sk, flags, issue_flags);
}
#endif
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index ddd68098122a..bb7ca61a251e 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -46,7 +46,8 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx);
void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx);
int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
- struct socket *sock, unsigned int flags);
+ struct socket *sock, unsigned int flags,
+ unsigned int issue_flags);
#else
static inline int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
struct io_uring_zcrx_ifq_reg __user *arg)
@@ -60,7 +61,8 @@ static inline void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
{
}
static inline int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
- struct socket *sock, unsigned int flags)
+ struct socket *sock, unsigned int flags,
+ unsigned int issue_flags)
{
return -EOPNOTSUPP;
}
--
2.43.5
^ permalink raw reply related [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (14 preceding siblings ...)
2024-10-07 22:16 ` [PATCH v1 15/15] io_uring/zcrx: throttle receive requests David Wei
@ 2024-10-07 22:20 ` David Wei
2024-10-08 23:10 ` Joe Damato
` (2 subsequent siblings)
18 siblings, 0 replies; 124+ messages in thread
From: David Wei @ 2024-10-07 22:20 UTC (permalink / raw)
To: io-uring, netdev
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
On 2024-10-07 15:15, David Wei wrote:
> This patchset adds support for zero copy rx into userspace pages using
> io_uring, eliminating a kernel to user copy.
Sorry, I didn't know that versioning does not get reset when going from
RFC -> non-RFC. This patchset should read v5. I'll fix this in the next
version.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 03/15] net: generalise net_iov chunk owners
2024-10-07 22:15 ` [PATCH v1 03/15] net: generalise net_iov chunk owners David Wei
@ 2024-10-08 15:46 ` Stanislav Fomichev
2024-10-08 16:34 ` Pavel Begunkov
2024-10-09 20:44 ` Mina Almasry
1 sibling, 1 reply; 124+ messages in thread
From: Stanislav Fomichev @ 2024-10-08 15:46 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/07, David Wei wrote:
> From: Pavel Begunkov <[email protected]>
>
> Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
> which serves as a useful abstraction to share data and provide a
> context. However, it's too devmem specific, and we want to reuse it for
> other memory providers, and for that we need to decouple net_iov from
> devmem. Make net_iov to point to a new base structure called
> net_iov_area, which dmabuf_genpool_chunk_owner extends.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
> include/net/netmem.h | 21 ++++++++++++++++++++-
> net/core/devmem.c | 25 +++++++++++++------------
> net/core/devmem.h | 25 +++++++++----------------
> 3 files changed, 42 insertions(+), 29 deletions(-)
>
> diff --git a/include/net/netmem.h b/include/net/netmem.h
> index 8a6e20be4b9d..3795ded30d2c 100644
> --- a/include/net/netmem.h
> +++ b/include/net/netmem.h
> @@ -24,11 +24,20 @@ struct net_iov {
> unsigned long __unused_padding;
> unsigned long pp_magic;
> struct page_pool *pp;
> - struct dmabuf_genpool_chunk_owner *owner;
> + struct net_iov_area *owner;
Any reason not to use dmabuf_genpool_chunk_owner as is (or rename it
to net_iov_area to generalize) with the fields that you don't need
set to 0/NULL? container_of makes everything harder to follow :-(
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 13/15] io_uring/zcrx: add copy fallback
2024-10-07 22:16 ` [PATCH v1 13/15] io_uring/zcrx: add copy fallback David Wei
@ 2024-10-08 15:58 ` Stanislav Fomichev
2024-10-08 16:39 ` Pavel Begunkov
2024-10-08 16:40 ` David Wei
2024-10-09 18:38 ` Jens Axboe
1 sibling, 2 replies; 124+ messages in thread
From: Stanislav Fomichev @ 2024-10-08 15:58 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/07, David Wei wrote:
> From: Pavel Begunkov <[email protected]>
>
> There are scenarios in which the zerocopy path might get a normal
> in-kernel buffer, it could be a mis-steered packet or simply the linear
> part of an skb. Another use case is to allow the driver to allocate
> kernel pages when it's out of zc buffers, which makes it more resilient
> to spikes in load and allow the user to choose the balance between the
> amount of memory provided and performance.
Tangential: should there be some clear way for the users to discover that
(some counter of some entry on cq about copy fallback)?
Or the expectation is that somebody will run bpftrace to diagnose
(supposedly) poor ZC performance when it falls back to copy?
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 03/15] net: generalise net_iov chunk owners
2024-10-08 15:46 ` Stanislav Fomichev
@ 2024-10-08 16:34 ` Pavel Begunkov
2024-10-09 16:28 ` Stanislav Fomichev
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-08 16:34 UTC (permalink / raw)
To: Stanislav Fomichev, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
On 10/8/24 16:46, Stanislav Fomichev wrote:
> On 10/07, David Wei wrote:
>> From: Pavel Begunkov <[email protected]>
>>
>> Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
>> which serves as a useful abstraction to share data and provide a
>> context. However, it's too devmem specific, and we want to reuse it for
>> other memory providers, and for that we need to decouple net_iov from
>> devmem. Make net_iov to point to a new base structure called
>> net_iov_area, which dmabuf_genpool_chunk_owner extends.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> Signed-off-by: David Wei <[email protected]>
>> ---
>> include/net/netmem.h | 21 ++++++++++++++++++++-
>> net/core/devmem.c | 25 +++++++++++++------------
>> net/core/devmem.h | 25 +++++++++----------------
>> 3 files changed, 42 insertions(+), 29 deletions(-)
>>
>> diff --git a/include/net/netmem.h b/include/net/netmem.h
>> index 8a6e20be4b9d..3795ded30d2c 100644
>> --- a/include/net/netmem.h
>> +++ b/include/net/netmem.h
>> @@ -24,11 +24,20 @@ struct net_iov {
>> unsigned long __unused_padding;
>> unsigned long pp_magic;
>> struct page_pool *pp;
>> - struct dmabuf_genpool_chunk_owner *owner;
>> + struct net_iov_area *owner;
>
> Any reason not to use dmabuf_genpool_chunk_owner as is (or rename it
> to net_iov_area to generalize) with the fields that you don't need
> set to 0/NULL? container_of makes everything harder to follow :-(
It can be that, but then io_uring would have a (null) pointer to
struct net_devmem_dmabuf_binding it knows nothing about and other
fields devmem might add in the future. Also, it reduces the
temptation for the common code to make assumptions about the origin
of the area / pp memory provider. IOW, I think it's cleaner
when separated like in this patch.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 13/15] io_uring/zcrx: add copy fallback
2024-10-08 15:58 ` Stanislav Fomichev
@ 2024-10-08 16:39 ` Pavel Begunkov
2024-10-08 16:40 ` David Wei
1 sibling, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-08 16:39 UTC (permalink / raw)
To: Stanislav Fomichev, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
On 10/8/24 16:58, Stanislav Fomichev wrote:
> On 10/07, David Wei wrote:
>> From: Pavel Begunkov <[email protected]>
>>
>> There are scenarios in which the zerocopy path might get a normal
>> in-kernel buffer, it could be a mis-steered packet or simply the linear
>> part of an skb. Another use case is to allow the driver to allocate
>> kernel pages when it's out of zc buffers, which makes it more resilient
>> to spikes in load and allow the user to choose the balance between the
>> amount of memory provided and performance.
>
> Tangential: should there be some clear way for the users to discover that
> (some counter of some entry on cq about copy fallback)?
>
> Or the expectation is that somebody will run bpftrace to diagnose
> (supposedly) poor ZC performance when it falls back to copy?
We had some notification for testing before, but that's left out of the
series for follow-up patches to keep it simple. We can post
a special CQE to notify the user about that from time to time, which
should be just fine as it's a slow path.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 13/15] io_uring/zcrx: add copy fallback
2024-10-08 15:58 ` Stanislav Fomichev
2024-10-08 16:39 ` Pavel Begunkov
@ 2024-10-08 16:40 ` David Wei
2024-10-09 16:30 ` Stanislav Fomichev
1 sibling, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-08 16:40 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry, David Wei
On 2024-10-08 08:58, Stanislav Fomichev wrote:
> On 10/07, David Wei wrote:
>> From: Pavel Begunkov <[email protected]>
>>
>> There are scenarios in which the zerocopy path might get a normal
>> in-kernel buffer, it could be a mis-steered packet or simply the linear
>> part of an skb. Another use case is to allow the driver to allocate
>> kernel pages when it's out of zc buffers, which makes it more resilient
>> to spikes in load and allow the user to choose the balance between the
>> amount of memory provided and performance.
>
> Tangential: should there be some clear way for the users to discover that
> (some counter of some entry on cq about copy fallback)?
>
> Or the expectation is that somebody will run bpftrace to diagnose
> (supposedly) poor ZC performance when it falls back to copy?
Yeah there definitely needs to be a way to notify the user that copy
fallback happened. Right now I'm relying on bpftrace hooking into
io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
too much. I can think of two other options:
1. Send a final cqe at the end of a number of frag cqes with a count of
the number of copies.
2. Register a secondary area just for handling copies.
Other suggestions are also very welcome.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 08/15] net: add helper executing custom callback from napi
2024-10-07 22:15 ` [PATCH v1 08/15] net: add helper executing custom callback from napi David Wei
@ 2024-10-08 22:25 ` Joe Damato
2024-10-09 15:09 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Joe Damato @ 2024-10-08 22:25 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On Mon, Oct 07, 2024 at 03:15:56PM -0700, David Wei wrote:
> From: Pavel Begunkov <[email protected]>
[...]
> However, from time to time we need to synchronise with the napi, for
> example to add more user memory or allocate fallback buffers. Add a
> helper function napi_execute that allows to run a custom callback from
> under napi context so that it can access and modify napi protected
> parts of io_uring. It works similar to busy polling and stops napi from
> running in the meantime, so it's supposed to be a slow control path.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
[...]
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1e740faf9e78..ba2f43cf5517 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6497,6 +6497,59 @@ void napi_busy_loop(unsigned int napi_id,
> }
> EXPORT_SYMBOL(napi_busy_loop);
>
> +void napi_execute(unsigned napi_id,
> + void (*cb)(void *), void *cb_arg)
> +{
> + struct napi_struct *napi;
> + bool done = false;
> + unsigned long val;
> + void *have_poll_lock = NULL;
> +
> + rcu_read_lock();
> +
> + napi = napi_by_id(napi_id);
> + if (!napi) {
> + rcu_read_unlock();
> + return;
> + }
> +
> + if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> + preempt_disable();
> + for (;;) {
> + local_bh_disable();
> + val = READ_ONCE(napi->state);
> +
> + /* If multiple threads are competing for this napi,
> + * we avoid dirtying napi->state as much as we can.
> + */
> + if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
> + NAPIF_STATE_IN_BUSY_POLL))
> + goto restart;
> +
> + if (cmpxchg(&napi->state, val,
> + val | NAPIF_STATE_IN_BUSY_POLL |
> + NAPIF_STATE_SCHED) != val)
> + goto restart;
> +
> + have_poll_lock = netpoll_poll_lock(napi);
> + cb(cb_arg);
A lot of the above code seems quite similar to __napi_busy_loop, as
you mentioned.
It might be too painful, but I can't help but wonder if there's a
way to refactor this to use common helpers or something?
I had been thinking that the napi->state check /
cmpxchg could maybe be refactored to avoid being repeated in both
places?
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (15 preceding siblings ...)
2024-10-07 22:20 ` [PATCH v1 00/15] io_uring zero copy rx David Wei
@ 2024-10-08 23:10 ` Joe Damato
2024-10-09 15:07 ` Pavel Begunkov
2024-10-09 15:27 ` Jens Axboe
2024-10-09 16:55 ` Mina Almasry
18 siblings, 1 reply; 124+ messages in thread
From: Joe Damato @ 2024-10-08 23:10 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On Mon, Oct 07, 2024 at 03:15:48PM -0700, David Wei wrote:
> This patchset adds support for zero copy rx into userspace pages using
> io_uring, eliminating a kernel to user copy.
>
> We configure a page pool that a driver uses to fill a hw rx queue to
> hand out user pages instead of kernel pages. Any data that ends up
> hitting this hw rx queue will thus be dma'd into userspace memory
> directly, without needing to be bounced through kernel memory. 'Reading'
> data out of a socket instead becomes a _notification_ mechanism, where
> the kernel tells userspace where the data is. The overall approach is
> similar to the devmem TCP proposal.
>
> This relies on hw header/data split, flow steering and RSS to ensure
> packet headers remain in kernel memory and only desired flows hit a hw
> rx queue configured for zero copy. Configuring this is outside of the
> scope of this patchset.
This looks super cool and very useful, thanks for doing this work.
Is there any possibility of some notes or sample pseudo code on how
userland can use this being added to Documentation/networking/ ?
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-08 23:10 ` Joe Damato
@ 2024-10-09 15:07 ` Pavel Begunkov
2024-10-09 16:10 ` Joe Damato
2024-10-09 16:12 ` Jens Axboe
0 siblings, 2 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 15:07 UTC (permalink / raw)
To: Joe Damato, David Wei, io-uring, netdev, Jens Axboe,
Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 00:10, Joe Damato wrote:
> On Mon, Oct 07, 2024 at 03:15:48PM -0700, David Wei wrote:
>> This patchset adds support for zero copy rx into userspace pages using
>> io_uring, eliminating a kernel to user copy.
>>
>> We configure a page pool that a driver uses to fill a hw rx queue to
>> hand out user pages instead of kernel pages. Any data that ends up
>> hitting this hw rx queue will thus be dma'd into userspace memory
>> directly, without needing to be bounced through kernel memory. 'Reading'
>> data out of a socket instead becomes a _notification_ mechanism, where
>> the kernel tells userspace where the data is. The overall approach is
>> similar to the devmem TCP proposal.
>>
>> This relies on hw header/data split, flow steering and RSS to ensure
>> packet headers remain in kernel memory and only desired flows hit a hw
>> rx queue configured for zero copy. Configuring this is outside of the
>> scope of this patchset.
>
> This looks super cool and very useful, thanks for doing this work.
>
> Is there any possibility of some notes or sample pseudo code on how
> userland can use this being added to Documentation/networking/ ?
io_uring man pages would need to be updated with it, there are tests
in liburing and it would be a good idea to add back a simple example
to liburing/example/*. I think that should cover it.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 08/15] net: add helper executing custom callback from napi
2024-10-08 22:25 ` Joe Damato
@ 2024-10-09 15:09 ` Pavel Begunkov
2024-10-09 16:13 ` Joe Damato
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 15:09 UTC (permalink / raw)
To: Joe Damato, David Wei, io-uring, netdev, Jens Axboe,
Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/8/24 23:25, Joe Damato wrote:
> On Mon, Oct 07, 2024 at 03:15:56PM -0700, David Wei wrote:
>> From: Pavel Begunkov <[email protected]>
>
> [...]
>
>> However, from time to time we need to synchronise with the napi, for
>> example to add more user memory or allocate fallback buffers. Add a
>> helper function napi_execute that allows running a custom callback from
>> under napi context so that it can access and modify napi protected
>> parts of io_uring. It works similarly to busy polling and stops napi from
>> running in the meantime, so it's supposed to be a slow control path.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> Signed-off-by: David Wei <[email protected]>
>
> [...]
>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 1e740faf9e78..ba2f43cf5517 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -6497,6 +6497,59 @@ void napi_busy_loop(unsigned int napi_id,
>> }
>> EXPORT_SYMBOL(napi_busy_loop);
>>
>> +void napi_execute(unsigned napi_id,
>> + void (*cb)(void *), void *cb_arg)
>> +{
>> + struct napi_struct *napi;
>> + bool done = false;
>> + unsigned long val;
>> + void *have_poll_lock = NULL;
>> +
>> + rcu_read_lock();
>> +
>> + napi = napi_by_id(napi_id);
>> + if (!napi) {
>> + rcu_read_unlock();
>> + return;
>> + }
>> +
>> + if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>> + preempt_disable();
>> + for (;;) {
>> + local_bh_disable();
>> + val = READ_ONCE(napi->state);
>> +
>> + /* If multiple threads are competing for this napi,
>> + * we avoid dirtying napi->state as much as we can.
>> + */
>> + if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
>> + NAPIF_STATE_IN_BUSY_POLL))
>> + goto restart;
>> +
>> + if (cmpxchg(&napi->state, val,
>> + val | NAPIF_STATE_IN_BUSY_POLL |
>> + NAPIF_STATE_SCHED) != val)
>> + goto restart;
>> +
>> + have_poll_lock = netpoll_poll_lock(napi);
>> + cb(cb_arg);
>
> A lot of the above code seems quite similar to __napi_busy_loop, as
> you mentioned.
>
> It might be too painful, but I can't help but wonder if there's a
> way to refactor this to use common helpers or something?
>
> I had been thinking that the napi->state check /
> cmpxchg could maybe be refactored to avoid being repeated in both
> places?
Yep, I can add a helper for that, but I'm not sure how to
deduplicate it further while trying not to pollute the
napi polling path.
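Roughly along these lines, as an untested sketch (the helper name is
made up, the body is lifted from the hunk quoted above):

static bool napi_state_try_claim_busy_poll(struct napi_struct *napi)
{
	unsigned long val = READ_ONCE(napi->state);

	/* If multiple threads are competing for this napi,
	 * we avoid dirtying napi->state as much as we can.
	 */
	if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
		   NAPIF_STATE_IN_BUSY_POLL))
		return false;

	return cmpxchg(&napi->state, val,
		       val | NAPIF_STATE_IN_BUSY_POLL |
		       NAPIF_STATE_SCHED) == val;
}

napi_execute() could then just retry until this returns true; whether
__napi_busy_loop() can use it unchanged depends on how its
prefer-busy-poll handling gets folded in.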
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (16 preceding siblings ...)
2024-10-08 23:10 ` Joe Damato
@ 2024-10-09 15:27 ` Jens Axboe
2024-10-09 15:38 ` David Ahern
2024-10-09 16:55 ` Mina Almasry
18 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 15:27 UTC (permalink / raw)
To: David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/7/24 4:15 PM, David Wei wrote:
> ===========
> Performance
> ===========
>
> Test setup:
> * AMD EPYC 9454
> * Broadcom BCM957508 200G
> * Kernel v6.11 base [2]
> * liburing fork [3]
> * kperf fork [4]
> * 4K MTU
> * Single TCP flow
>
> With application thread + net rx softirq pinned to _different_ cores:
>
> epoll
> 82.2 Gbps
>
> io_uring
> 116.2 Gbps (+41%)
>
> Pinned to _same_ core:
>
> epoll
> 62.6 Gbps
>
> io_uring
> 80.9 Gbps (+29%)
I'll review the io_uring bits in detail, but I did take a quick look and
overall it looks really nice.
I decided to give this a spin, as I noticed that Broadcom now has a
230.x firmware release out that supports this. Hence no dependencies on
that anymore, outside of some pain getting the fw updated. Here are my
test setup details:
Receiver:
AMD EPYC 9754 (receiver)
Broadcom P2100G
-git + this series + the bnxt series referenced
Sender:
Intel(R) Xeon(R) Platinum 8458P
Broadcom P2100G
-git
Test:
kperf with David's patches to support io_uring zc. Eg single flow TCP,
just testing bandwidth. A single cpu/thread being used on both the
receiver and sender side.
non-zc
60.9 Gbps
io_uring + zc
97.1 Gbps
or +59% faster. There's quite a bit of IRQ side work, I'm guessing I
might need to tune it a bit. But it Works For Me, and the results look
really nice.
I did run into an issue with the bnxt driver defaulting to shared tx/rx
queues, and it not working for me in that configuration. Once I disabled
that, it worked fine. This may or may not be an issue with the flow rule
to direct the traffic, the driver queue start, or something else. Don't
know for sure, will need to check with the driver folks. Once sorted, I
didn't see any issues with the code in the patchset.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 15:27 ` Jens Axboe
@ 2024-10-09 15:38 ` David Ahern
2024-10-09 15:43 ` Jens Axboe
0 siblings, 1 reply; 124+ messages in thread
From: David Ahern @ 2024-10-09 15:38 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, Mina Almasry
On 10/9/24 9:27 AM, Jens Axboe wrote:
> On 10/7/24 4:15 PM, David Wei wrote:
>> ===========
>> Performance
>> ===========
>>
>> Test setup:
>> * AMD EPYC 9454
>> * Broadcom BCM957508 200G
>> * Kernel v6.11 base [2]
>> * liburing fork [3]
>> * kperf fork [4]
>> * 4K MTU
>> * Single TCP flow
>>
>> With application thread + net rx softirq pinned to _different_ cores:
>>
>> epoll
>> 82.2 Gbps
>>
>> io_uring
>> 116.2 Gbps (+41%)
>>
>> Pinned to _same_ core:
>>
>> epoll
>> 62.6 Gbps
>>
>> io_uring
>> 80.9 Gbps (+29%)
>
> I'll review the io_uring bits in detail, but I did take a quick look and
> overall it looks really nice.
>
> I decided to give this a spin, as I noticed that Broadcom now has a
> 230.x firmware release out that supports this. Hence no dependencies on
> that anymore, outside of some pain getting the fw updated. Here are my
> test setup details:
>
> Receiver:
> AMD EPYC 9754 (receiver)
> Broadcom P2100G
> -git + this series + the bnxt series referenced
>
> Sender:
> Intel(R) Xeon(R) Platinum 8458P
> Broadcom P2100G
> -git
>
> Test:
> kperf with David's patches to support io_uring zc. Eg single flow TCP,
> just testing bandwidth. A single cpu/thread being used on both the
> receiver and sender side.
>
> non-zc
> 60.9 Gbps
>
> io_uring + zc
> 97.1 Gbps
so line rate? Did you look at whether there is cpu to spare? meaning it
will report higher speeds with a 200G setup?
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 15:38 ` David Ahern
@ 2024-10-09 15:43 ` Jens Axboe
2024-10-09 15:49 ` Pavel Begunkov
2024-10-09 16:35 ` David Ahern
0 siblings, 2 replies; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 15:43 UTC (permalink / raw)
To: David Ahern, David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, Mina Almasry
On 10/9/24 9:38 AM, David Ahern wrote:
> On 10/9/24 9:27 AM, Jens Axboe wrote:
>> On 10/7/24 4:15 PM, David Wei wrote:
>>> ===========
>>> Performance
>>> ===========
>>>
>>> Test setup:
>>> * AMD EPYC 9454
>>> * Broadcom BCM957508 200G
>>> * Kernel v6.11 base [2]
>>> * liburing fork [3]
>>> * kperf fork [4]
>>> * 4K MTU
>>> * Single TCP flow
>>>
>>> With application thread + net rx softirq pinned to _different_ cores:
>>>
>>> epoll
>>> 82.2 Gbps
>>>
>>> io_uring
>>> 116.2 Gbps (+41%)
>>>
>>> Pinned to _same_ core:
>>>
>>> epoll
>>> 62.6 Gbps
>>>
>>> io_uring
>>> 80.9 Gbps (+29%)
>>
>> I'll review the io_uring bits in detail, but I did take a quick look and
>> overall it looks really nice.
>>
>> I decided to give this a spin, as I noticed that Broadcom now has a
>> 230.x firmware release out that supports this. Hence no dependencies on
>> that anymore, outside of some pain getting the fw updated. Here are my
>> test setup details:
>>
>> Receiver:
>> AMD EPYC 9754 (receiver)
>> Broadcom P2100G
>> -git + this series + the bnxt series referenced
>>
>> Sender:
>> Intel(R) Xeon(R) Platinum 8458P
>> Broadcom P2100G
>> -git
>>
>> Test:
>> kperf with David's patches to support io_uring zc. Eg single flow TCP,
>> just testing bandwidth. A single cpu/thread being used on both the
>> receiver and sender side.
>>
>> non-zc
>> 60.9 Gbps
>>
>> io_uring + zc
>> 97.1 Gbps
>
> so line rate? Did you look at whether there is cpu to spare? meaning it
> will report higher speeds with a 200G setup?
Yep basically line rate, I get 97-98Gbps. I originally used a slower box
as the sender, but then you're capped on the non-zc sender being too
slow. The intel box does better, but it's still basically maxing out the
sender at this point. So yeah, with a faster (or more efficient sender),
I have no doubts this will go much higher per thread, if the link bw was
there. When I looked at CPU usage for the receiver, the thread itself is
using ~30% CPU. And then there's some softirq/irq time outside of that,
but that should amortize with higher bps rates too I'd expect.
My nic does have 2 100G ports, so might warrant a bit more testing...
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 15:43 ` Jens Axboe
@ 2024-10-09 15:49 ` Pavel Begunkov
2024-10-09 15:50 ` Jens Axboe
2024-10-09 16:35 ` David Ahern
1 sibling, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 15:49 UTC (permalink / raw)
To: Jens Axboe, David Ahern, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, Mina Almasry
On 10/9/24 16:43, Jens Axboe wrote:
> On 10/9/24 9:38 AM, David Ahern wrote:
>> On 10/9/24 9:27 AM, Jens Axboe wrote:
>>> On 10/7/24 4:15 PM, David Wei wrote:
>>>> ===========
>>>> Performance
>>>> ===========
>>>>
>>>> Test setup:
>>>> * AMD EPYC 9454
>>>> * Broadcom BCM957508 200G
>>>> * Kernel v6.11 base [2]
>>>> * liburing fork [3]
>>>> * kperf fork [4]
>>>> * 4K MTU
>>>> * Single TCP flow
>>>>
>>>> With application thread + net rx softirq pinned to _different_ cores:
>>>>
>>>> epoll
>>>> 82.2 Gbps
>>>>
>>>> io_uring
>>>> 116.2 Gbps (+41%)
>>>>
>>>> Pinned to _same_ core:
>>>>
>>>> epoll
>>>> 62.6 Gbps
>>>>
>>>> io_uring
>>>> 80.9 Gbps (+29%)
>>>
>>> I'll review the io_uring bits in detail, but I did take a quick look and
>>> overall it looks really nice.
>>>
>>> I decided to give this a spin, as I noticed that Broadcom now has a
>>> 230.x firmware release out that supports this. Hence no dependencies on
>>> that anymore, outside of some pain getting the fw updated. Here are my
>>> test setup details:
>>>
>>> Receiver:
>>> AMD EPYC 9754 (receiver)
>>> Broadcom P2100G
>>> -git + this series + the bnxt series referenced
>>>
>>> Sender:
>>> Intel(R) Xeon(R) Platinum 8458P
>>> Broadcom P2100G
>>> -git
>>>
>>> Test:
>>> kperf with David's patches to support io_uring zc. Eg single flow TCP,
>>> just testing bandwidth. A single cpu/thread being used on both the
>>> receiver and sender side.
>>>
>>> non-zc
>>> 60.9 Gbps
>>>
>>> io_uring + zc
>>> 97.1 Gbps
>>
>> so line rate? Did you look at whether there is cpu to spare? meaning it
>> will report higher speeds with a 200G setup?
>
> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
> as the sender, but then you're capped on the non-zc sender being too
> slow. The intel box does better, but it's still basically maxing out the
> sender at this point. So yeah, with a faster (or more efficient sender),
> I have no doubts this will go much higher per thread, if the link bw was
> there. When I looked at CPU usage for the receiver, the thread itself is
> using ~30% CPU. And then there's some softirq/irq time outside of that,
> but that should amortize with higher bps rates too I'd expect.
>
> My nic does have 2 100G ports, so might warrant a bit more testing...
If you haven't done it already, I'd also pin softirq processing to
the same CPU as the app so we measure the full stack. kperf has an
option IIRC.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 15:49 ` Pavel Begunkov
@ 2024-10-09 15:50 ` Jens Axboe
0 siblings, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 15:50 UTC (permalink / raw)
To: Pavel Begunkov, David Ahern, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, Mina Almasry
On 10/9/24 9:49 AM, Pavel Begunkov wrote:
> On 10/9/24 16:43, Jens Axboe wrote:
>> On 10/9/24 9:38 AM, David Ahern wrote:
>>> On 10/9/24 9:27 AM, Jens Axboe wrote:
>>>> On 10/7/24 4:15 PM, David Wei wrote:
>>>>> ===========
>>>>> Performance
>>>>> ===========
>>>>>
>>>>> Test setup:
>>>>> * AMD EPYC 9454
>>>>> * Broadcom BCM957508 200G
>>>>> * Kernel v6.11 base [2]
>>>>> * liburing fork [3]
>>>>> * kperf fork [4]
>>>>> * 4K MTU
>>>>> * Single TCP flow
>>>>>
>>>>> With application thread + net rx softirq pinned to _different_ cores:
>>>>>
>>>>> epoll
>>>>> 82.2 Gbps
>>>>>
>>>>> io_uring
>>>>> 116.2 Gbps (+41%)
>>>>>
>>>>> Pinned to _same_ core:
>>>>>
>>>>> epoll
>>>>> 62.6 Gbps
>>>>>
>>>>> io_uring
>>>>> 80.9 Gbps (+29%)
>>>>
>>>> I'll review the io_uring bits in detail, but I did take a quick look and
>>>> overall it looks really nice.
>>>>
>>>> I decided to give this a spin, as I noticed that Broadcom now has a
>>>> 230.x firmware release out that supports this. Hence no dependencies on
>>>> that anymore, outside of some pain getting the fw updated. Here are my
>>>> test setup details:
>>>>
>>>> Receiver:
>>>> AMD EPYC 9754 (receiver)
>>>> Broadcom P2100G
>>>> -git + this series + the bnxt series referenced
>>>>
>>>> Sender:
>>>> Intel(R) Xeon(R) Platinum 8458P
>>>> Broadcom P2100G
>>>> -git
>>>>
>>>> Test:
>>>> kperf with David's patches to support io_uring zc. Eg single flow TCP,
>>>> just testing bandwidth. A single cpu/thread being used on both the
>>>> receiver and sender side.
>>>>
>>>> non-zc
>>>> 60.9 Gbps
>>>>
>>>> io_uring + zc
>>>> 97.1 Gbps
>>>
>>> so line rate? Did you look at whether there is cpu to spare? meaning it
>>> will report higher speeds with a 200G setup?
>>
>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>> as the sender, but then you're capped on the non-zc sender being too
>> slow. The intel box does better, but it's still basically maxing out the
>> sender at this point. So yeah, with a faster (or more efficient sender),
>> I have no doubts this will go much higher per thread, if the link bw was
>> there. When I looked at CPU usage for the receiver, the thread itself is
>> using ~30% CPU. And then there's some softirq/irq time outside of that,
>> but that should ammortize with higher bps rates too I'd expect.
>>
>> My nic does have 2 100G ports, so might warrant a bit more testing...
> If you haven't done it already, I'd also pin softirq processing to
> the same CPU as the app so we measure the full stack. kperf has an
> option IIRC.
I thought that was the default if you didn't give it a cpu-off option?
I'll check...
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 15:07 ` Pavel Begunkov
@ 2024-10-09 16:10 ` Joe Damato
2024-10-09 16:12 ` Jens Axboe
1 sibling, 0 replies; 124+ messages in thread
From: Joe Damato @ 2024-10-09 16:10 UTC (permalink / raw)
To: Pavel Begunkov
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On Wed, Oct 09, 2024 at 04:07:01PM +0100, Pavel Begunkov wrote:
> On 10/9/24 00:10, Joe Damato wrote:
> > On Mon, Oct 07, 2024 at 03:15:48PM -0700, David Wei wrote:
> > > This patchset adds support for zero copy rx into userspace pages using
> > > io_uring, eliminating a kernel to user copy.
> > >
> > > We configure a page pool that a driver uses to fill a hw rx queue to
> > > hand out user pages instead of kernel pages. Any data that ends up
> > > hitting this hw rx queue will thus be dma'd into userspace memory
> > > directly, without needing to be bounced through kernel memory. 'Reading'
> > > data out of a socket instead becomes a _notification_ mechanism, where
> > > the kernel tells userspace where the data is. The overall approach is
> > > similar to the devmem TCP proposal.
> > >
> > > This relies on hw header/data split, flow steering and RSS to ensure
> > > packet headers remain in kernel memory and only desired flows hit a hw
> > > rx queue configured for zero copy. Configuring this is outside of the
> > > scope of this patchset.
> >
> > This looks super cool and very useful, thanks for doing this work.
> >
> > Is there any possibility of some notes or sample pseudo code on how
> > userland can use this being added to Documentation/networking/ ?
>
> io_uring man pages would need to be updated with it, there are tests
> in liburing and it would be a good idea to add back a simple example
> to liburing/example/*. I think that should cover it.
Ah, that sounds amazing to me!
I thought that suggesting that might be too much work ;) which is
why I had suggested Documentation/, but man page updates would be
excellent!
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 15:07 ` Pavel Begunkov
2024-10-09 16:10 ` Joe Damato
@ 2024-10-09 16:12 ` Jens Axboe
2024-10-11 6:15 ` David Wei
1 sibling, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 16:12 UTC (permalink / raw)
To: Pavel Begunkov, Joe Damato, David Wei, io-uring, netdev,
Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 9:07 AM, Pavel Begunkov wrote:
> On 10/9/24 00:10, Joe Damato wrote:
>> On Mon, Oct 07, 2024 at 03:15:48PM -0700, David Wei wrote:
>>> This patchset adds support for zero copy rx into userspace pages using
>>> io_uring, eliminating a kernel to user copy.
>>>
>>> We configure a page pool that a driver uses to fill a hw rx queue to
>>> hand out user pages instead of kernel pages. Any data that ends up
>>> hitting this hw rx queue will thus be dma'd into userspace memory
>>> directly, without needing to be bounced through kernel memory. 'Reading'
>>> data out of a socket instead becomes a _notification_ mechanism, where
>>> the kernel tells userspace where the data is. The overall approach is
>>> similar to the devmem TCP proposal.
>>>
>>> This relies on hw header/data split, flow steering and RSS to ensure
>>> packet headers remain in kernel memory and only desired flows hit a hw
>>> rx queue configured for zero copy. Configuring this is outside of the
>>> scope of this patchset.
>>
>> This looks super cool and very useful, thanks for doing this work.
>>
>> Is there any possibility of some notes or sample pseudo code on how
>> userland can use this being added to Documentation/networking/ ?
>
> io_uring man pages would need to be updated with it, there are tests
> in liburing and it would be a good idea to add back a simple example
> to liburing/example/*. I think that should cover it.
man pages for sure, but +1 to the example too. Just a basic thing would
get the point across, I think.
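For reference, the rough flow such an example would need to walk
through, going by the zcrx patches in this series (pseudo-code only,
the liburing wrappers are still TBD):

	/* setup */
	register an interface queue for the chosen netdev hw rx queue
	(IORING_REGISTER_ZCRX_IFQ, struct io_uring_zcrx_ifq_reg) and a
	buffer area backed by user memory (struct io_uring_zcrx_area_reg);
	mmap the refill ring via the IORING_OFF_RQ_RING offset

	/* per flow */
	submit a multishot recvzc SQE against the flow-steered socket

	for each big CQE: cqe->res is the length, and the trailing
	struct io_uring_zcrx_cqe's off field encodes the area id plus
	the byte offset of the data within the registered area

	when done with a buffer, post a struct io_uring_zcrx_rqe to the
	refill ring so the page pool can hand it back to the hw rx queue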
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 08/15] net: add helper executing custom callback from napi
2024-10-09 15:09 ` Pavel Begunkov
@ 2024-10-09 16:13 ` Joe Damato
2024-10-09 19:12 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Joe Damato @ 2024-10-09 16:13 UTC (permalink / raw)
To: Pavel Begunkov
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On Wed, Oct 09, 2024 at 04:09:53PM +0100, Pavel Begunkov wrote:
> On 10/8/24 23:25, Joe Damato wrote:
> > On Mon, Oct 07, 2024 at 03:15:56PM -0700, David Wei wrote:
> > > From: Pavel Begunkov <[email protected]>
> >
> > [...]
> >
> > > However, from time to time we need to synchronise with the napi, for
> > > example to add more user memory or allocate fallback buffers. Add a
> > > helper function napi_execute that allows running a custom callback from
> > > under napi context so that it can access and modify napi protected
> > > parts of io_uring. It works similarly to busy polling and stops napi from
> > > running in the meantime, so it's supposed to be a slow control path.
> > >
> > > Signed-off-by: Pavel Begunkov <[email protected]>
> > > Signed-off-by: David Wei <[email protected]>
> >
> > [...]
> >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 1e740faf9e78..ba2f43cf5517 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -6497,6 +6497,59 @@ void napi_busy_loop(unsigned int napi_id,
> > > }
> > > EXPORT_SYMBOL(napi_busy_loop);
> > > +void napi_execute(unsigned napi_id,
> > > + void (*cb)(void *), void *cb_arg)
> > > +{
> > > + struct napi_struct *napi;
> > > + bool done = false;
> > > + unsigned long val;
> > > + void *have_poll_lock = NULL;
> > > +
> > > + rcu_read_lock();
> > > +
> > > + napi = napi_by_id(napi_id);
> > > + if (!napi) {
> > > + rcu_read_unlock();
> > > + return;
> > > + }
> > > +
> > > + if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > > + preempt_disable();
> > > + for (;;) {
> > > + local_bh_disable();
> > > + val = READ_ONCE(napi->state);
> > > +
> > > + /* If multiple threads are competing for this napi,
> > > + * we avoid dirtying napi->state as much as we can.
> > > + */
> > > + if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
> > > + NAPIF_STATE_IN_BUSY_POLL))
> > > + goto restart;
> > > +
> > > + if (cmpxchg(&napi->state, val,
> > > + val | NAPIF_STATE_IN_BUSY_POLL |
> > > + NAPIF_STATE_SCHED) != val)
> > > + goto restart;
> > > +
> > > + have_poll_lock = netpoll_poll_lock(napi);
> > > + cb(cb_arg);
> >
> > A lot of the above code seems quite similar to __napi_busy_loop, as
> > you mentioned.
> >
> > It might be too painful, but I can't help but wonder if there's a
> > way to refactor this to use common helpers or something?
> >
> > I had been thinking that the napi->state check /
> > cmpxchg could maybe be refactored to avoid being repeated in both
> > places?
>
> Yep, I can add a helper for that, but I'm not sure how to
> deduplicate it further while trying not to pollute the
> napi polling path.
It was just a minor nit; I wouldn't want to hold back this important
work just for that.
I'm still looking at the code myself to see if I can find a better
arrangement.
But that could always come later as a cleanup for -next?
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 03/15] net: generalise net_iov chunk owners
2024-10-08 16:34 ` Pavel Begunkov
@ 2024-10-09 16:28 ` Stanislav Fomichev
2024-10-11 18:44 ` David Wei
0 siblings, 1 reply; 124+ messages in thread
From: Stanislav Fomichev @ 2024-10-09 16:28 UTC (permalink / raw)
To: Pavel Begunkov
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/08, Pavel Begunkov wrote:
> On 10/8/24 16:46, Stanislav Fomichev wrote:
> > On 10/07, David Wei wrote:
> > > From: Pavel Begunkov <[email protected]>
> > >
> > > Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
> > > which serves as a useful abstraction to share data and provide a
> > > context. However, it's too devmem specific, and we want to reuse it for
> > > other memory providers, and for that we need to decouple net_iov from
> > > devmem. Make net_iov to point to a new base structure called
> > > net_iov_area, which dmabuf_genpool_chunk_owner extends.
> > >
> > > Signed-off-by: Pavel Begunkov <[email protected]>
> > > Signed-off-by: David Wei <[email protected]>
> > > ---
> > > include/net/netmem.h | 21 ++++++++++++++++++++-
> > > net/core/devmem.c | 25 +++++++++++++------------
> > > net/core/devmem.h | 25 +++++++++----------------
> > > 3 files changed, 42 insertions(+), 29 deletions(-)
> > >
> > > diff --git a/include/net/netmem.h b/include/net/netmem.h
> > > index 8a6e20be4b9d..3795ded30d2c 100644
> > > --- a/include/net/netmem.h
> > > +++ b/include/net/netmem.h
> > > @@ -24,11 +24,20 @@ struct net_iov {
> > > unsigned long __unused_padding;
> > > unsigned long pp_magic;
> > > struct page_pool *pp;
> > > - struct dmabuf_genpool_chunk_owner *owner;
> > > + struct net_iov_area *owner;
> >
> > Any reason not to use dmabuf_genpool_chunk_owner as is (or rename it
> > to net_iov_area to generalize) with the fields that you don't need
> > set to 0/NULL? container_of makes everything harder to follow :-(
>
> It can be that, but then io_uring would have a (null) pointer to
> struct net_devmem_dmabuf_binding it knows nothing about and other
> fields devmem might add in the future. Also, it reduces the
> temptation for the common code to make assumptions about the origin
> of the area / pp memory provider. IOW, I think it's cleaner
> when separated like in this patch.
Ack, let's see whether other people find any issues with this approach.
For me, it makes the devmem parts harder to read, so my preference
is to drop this patch and keep owner=NULL on your side.
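For anyone skimming along, the two layouts being debated look roughly
like this (struct bodies elided, devmem field names illustrative; the
io_uring helper name is the one the later zcrx patches use):

	struct net_iov {
		...
		struct net_iov_area *owner;
	};

	/* devmem extends the base area */
	struct dmabuf_genpool_chunk_owner {
		struct net_iov_area area;
		struct net_devmem_dmabuf_binding *binding;
		...
	};

	/* io_uring zcrx extends it too */
	struct io_zcrx_area {
		struct net_iov_area nia;
		...
	};

	static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
	{
		return container_of(niov->owner, struct io_zcrx_area, nia);
	}

versus keeping dmabuf_genpool_chunk_owner as the one shared type, with
io_uring leaving the devmem-specific fields NULL.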
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 13/15] io_uring/zcrx: add copy fallback
2024-10-08 16:40 ` David Wei
@ 2024-10-09 16:30 ` Stanislav Fomichev
2024-10-09 23:05 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Stanislav Fomichev @ 2024-10-09 16:30 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/08, David Wei wrote:
> On 2024-10-08 08:58, Stanislav Fomichev wrote:
> > On 10/07, David Wei wrote:
> >> From: Pavel Begunkov <[email protected]>
> >>
> >> There are scenarios in which the zerocopy path might get a normal
> >> in-kernel buffer, it could be a mis-steered packet or simply the linear
> >> part of an skb. Another use case is to allow the driver to allocate
> >> kernel pages when it's out of zc buffers, which makes it more resilient
> >> to spikes in load and allows the user to choose the balance between the
> >> amount of memory provided and performance.
> >
> > Tangential: should there be some clear way for the users to discover that
> > (some counter of some entry on cq about copy fallback)?
> >
> > Or the expectation is that somebody will run bpftrace to diagnose
> > (supposedly) poor ZC performance when it falls back to copy?
>
> Yeah there definitely needs to be a way to notify the user that copy
> fallback happened. Right now I'm relying on bpftrace hooking into
> io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
> too much. I can think of two other options:
>
> 1. Send a final cqe at the end of a number of frag cqes with a count of
> the number of copies.
> 2. Register a secondary area just for handling copies.
>
> Other suggestions are also very welcome.
SG, thanks. Up to you and Pavel on the mechanism and whether to follow
up separately. Maybe even move this fallback (this patch) into that separate
series as well? Will be easier to review/accept the rest.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 15:43 ` Jens Axboe
2024-10-09 15:49 ` Pavel Begunkov
@ 2024-10-09 16:35 ` David Ahern
2024-10-09 16:50 ` Jens Axboe
1 sibling, 1 reply; 124+ messages in thread
From: David Ahern @ 2024-10-09 16:35 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, Mina Almasry
On 10/9/24 9:43 AM, Jens Axboe wrote:
> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
> as the sender, but then you're capped on the non-zc sender being too
> slow. The intel box does better, but it's still basically maxing out the
> sender at this point. So yeah, with a faster (or more efficient sender),
I am surprised by this comment. You should not see a Tx limited test
(including CPU bound sender). Tx with ZC has been the easy option for a
while now.
> I have no doubts this will go much higher per thread, if the link bw was
> there. When I looked at CPU usage for the receiver, the thread itself is
> using ~30% CPU. And then there's some softirq/irq time outside of that,
> but that should amortize with higher bps rates too I'd expect.
>
> My nic does have 2 100G ports, so might warrant a bit more testing...
>
It would be good to see what the next bottleneck is for the io_uring ZC
Rx path. My expectation is that a 200G link is a means to show you (i.e.,
you will not hit 200G, so cpu monitoring, perf-top, etc. will show the
limiter).
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 16:35 ` David Ahern
@ 2024-10-09 16:50 ` Jens Axboe
2024-10-09 16:53 ` Jens Axboe
0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 16:50 UTC (permalink / raw)
To: David Ahern, David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, Mina Almasry
On 10/9/24 10:35 AM, David Ahern wrote:
> On 10/9/24 9:43 AM, Jens Axboe wrote:
>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>> as the sender, but then you're capped on the non-zc sender being too
>> slow. The intel box does better, but it's still basically maxing out the
>> sender at this point. So yeah, with a faster (or more efficient sender),
>
> I am surprised by this comment. You should not see a Tx limited test
> (including CPU bound sender). Tx with ZC has been the easy option for a
> while now.
I just set this up to test yesterday and just used default! I'm sure
there is a zc option, just not the default and hence it wasn't used.
I'll give it a spin, will be useful for 200G testing.
>> I have no doubts this will go much higher per thread, if the link bw was
>> there. When I looked at CPU usage for the receiver, the thread itself is
>> using ~30% CPU. And then there's some softirq/irq time outside of that,
>> but that should amortize with higher bps rates too I'd expect.
>>
>> My nic does have 2 100G ports, so might warrant a bit more testing...
>>
>
> It would be good to see what the next bottleneck is for the io_uring ZC
> Rx path. My expectation is that a 200G link is a means to show you (i.e.,
> you will not hit 200G, so cpu monitoring, perf-top, etc. will show the
> limiter).
I'm pretty familiar with profiling ;-)
I'll see if I can get the 200G test setup and then I'll report back what
I get.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 16:50 ` Jens Axboe
@ 2024-10-09 16:53 ` Jens Axboe
2024-10-09 17:12 ` Jens Axboe
0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 16:53 UTC (permalink / raw)
To: David Ahern, David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, Mina Almasry
On 10/9/24 10:50 AM, Jens Axboe wrote:
> On 10/9/24 10:35 AM, David Ahern wrote:
>> On 10/9/24 9:43 AM, Jens Axboe wrote:
>>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>>> as the sender, but then you're capped on the non-zc sender being too
>>> slow. The intel box does better, but it's still basically maxing out the
>>> sender at this point. So yeah, with a faster (or more efficient sender),
>>
>> I am surprised by this comment. You should not see a Tx limited test
>> (including CPU bound sender). Tx with ZC has been the easy option for a
>> while now.
>
> I just set this up to test yesterday and just used default! I'm sure
> there is a zc option, just not the default and hence it wasn't used.
> I'll give it a spin, will be useful for 200G testing.
I think we're talking past each other. Yes, send with zerocopy has been
available for a while now, both with io_uring and just sendmsg(), but
I'm using kperf for testing and it does not look like it supports it.
Might have to add it... We'll see how far I can get without it.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
` (17 preceding siblings ...)
2024-10-09 15:27 ` Jens Axboe
@ 2024-10-09 16:55 ` Mina Almasry
2024-10-09 16:57 ` Jens Axboe
` (3 more replies)
18 siblings, 4 replies; 124+ messages in thread
From: Mina Almasry @ 2024-10-09 16:55 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>
> This patchset adds support for zero copy rx into userspace pages using
> io_uring, eliminating a kernel to user copy.
>
> We configure a page pool that a driver uses to fill a hw rx queue to
> hand out user pages instead of kernel pages. Any data that ends up
> hitting this hw rx queue will thus be dma'd into userspace memory
> directly, without needing to be bounced through kernel memory. 'Reading'
> data out of a socket instead becomes a _notification_ mechanism, where
> the kernel tells userspace where the data is. The overall approach is
> similar to the devmem TCP proposal.
>
> This relies on hw header/data split, flow steering and RSS to ensure
> packet headers remain in kernel memory and only desired flows hit a hw
> rx queue configured for zero copy. Configuring this is outside of the
> scope of this patchset.
>
> We share netdev core infra with devmem TCP. The main difference is that
> io_uring is used for the uAPI and the lifetime of all objects are bound
> to an io_uring instance.
I've been thinking about this a bit, and I hope this feedback isn't
too late, but I think your work may be useful for users not using
io_uring. I.e. zero copy to host memory that is not dependent on page
aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
If we refactor things around a bit we should be able to have the
memory tied to the RX queue similar to what AF_XDP does, and then we
should be able to zero copy to the memory via regular sockets and via
io_uring. This will be useful for us and other applications that would
like to ZC similar to what you're doing here but not necessarily
through io_uring.
> Data is 'read' using a new io_uring request
> type. When done, data is returned via a new shared refill queue. A zero
> copy page pool refills a hw rx queue from this refill queue directly. Of
> course, the lifetime of these data buffers are managed by io_uring
> rather than the networking stack, with different refcounting rules.
>
> This patchset is the first step adding basic zero copy support. We will
> extend this iteratively with new features e.g. dynamically allocated
> zero copy areas, THP support, dmabuf support, improved copy fallback,
> general optimisations and more.
>
> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
> aren't included since Taehee Yoo has already sent a more comprehensive
> patchset adding support in [1]. Google gve should already support this,
This is an aside, but GVE supports this via the out-of-tree patches
I've been carrying on GitHub. Upstream we're working on adding the
prerequisite page_pool support.
> and Mellanox mlx5 support is WIP pending driver changes.
>
> ===========
> Performance
> ===========
>
> Test setup:
> * AMD EPYC 9454
> * Broadcom BCM957508 200G
> * Kernel v6.11 base [2]
> * liburing fork [3]
> * kperf fork [4]
> * 4K MTU
> * Single TCP flow
>
> With application thread + net rx softirq pinned to _different_ cores:
>
> epoll
> 82.2 Gbps
>
> io_uring
> 116.2 Gbps (+41%)
>
> Pinned to _same_ core:
>
> epoll
> 62.6 Gbps
>
> io_uring
> 80.9 Gbps (+29%)
>
Are the 'epoll' results here and the 'io_uring' ones using TCP RX zerocopy
[1] and io_uring zerocopy respectively?
If not, I would like to see a comparison between TCP RX zerocopy and
this new io_uring zerocopy. At Google, for example, we use TCP RX
zerocopy; I would like to see perf numbers possibly motivating us to
move to this new thing.
[1] https://lwn.net/Articles/752046/
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 16:55 ` Mina Almasry
@ 2024-10-09 16:57 ` Jens Axboe
2024-10-09 19:32 ` Mina Almasry
2024-10-09 17:19 ` David Ahern
` (2 subsequent siblings)
3 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 16:57 UTC (permalink / raw)
To: Mina Almasry, David Wei
Cc: io-uring, netdev, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/9/24 10:55 AM, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>
>> This patchset adds support for zero copy rx into userspace pages using
>> io_uring, eliminating a kernel to user copy.
>>
>> We configure a page pool that a driver uses to fill a hw rx queue to
>> hand out user pages instead of kernel pages. Any data that ends up
>> hitting this hw rx queue will thus be dma'd into userspace memory
>> directly, without needing to be bounced through kernel memory. 'Reading'
>> data out of a socket instead becomes a _notification_ mechanism, where
>> the kernel tells userspace where the data is. The overall approach is
>> similar to the devmem TCP proposal.
>>
>> This relies on hw header/data split, flow steering and RSS to ensure
>> packet headers remain in kernel memory and only desired flows hit a hw
>> rx queue configured for zero copy. Configuring this is outside of the
>> scope of this patchset.
>>
>> We share netdev core infra with devmem TCP. The main difference is that
>> io_uring is used for the uAPI and the lifetime of all objects are bound
>> to an io_uring instance.
>
> I've been thinking about this a bit, and I hope this feedback isn't
> too late, but I think your work may be useful for users not using
> io_uring. I.e. zero copy to host memory that is not dependent on page
> aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
Not David, but come on, let's please get this moving forward. It's been
stuck behind dependencies for seemingly forever, which are finally
resolved. I don't think this is a reasonable ask at all for this
patchset. If you want to work on that after the fact, then that's
certainly an option. But gating this now on new requirements for
something that isn't even a goal of this patchset, that's getting pretty
silly imho.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 16:53 ` Jens Axboe
@ 2024-10-09 17:12 ` Jens Axboe
2024-10-10 14:21 ` Jens Axboe
0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 17:12 UTC (permalink / raw)
To: David Ahern, David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, Mina Almasry
On 10/9/24 10:53 AM, Jens Axboe wrote:
> On 10/9/24 10:50 AM, Jens Axboe wrote:
>> On 10/9/24 10:35 AM, David Ahern wrote:
>>> On 10/9/24 9:43 AM, Jens Axboe wrote:
>>>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>>>> as the sender, but then you're capped on the non-zc sender being too
>>>> slow. The intel box does better, but it's still basically maxing out the
>>>> sender at this point. So yeah, with a faster (or more efficient sender),
>>>
>>> I am surprised by this comment. You should not see a Tx limited test
>>> (including CPU bound sender). Tx with ZC has been the easy option for a
>>> while now.
>>
>> I just set this up to test yesterday and just used default! I'm sure
>> there is a zc option, just not the default and hence it wasn't used.
>> I'll give it a spin, will be useful for 200G testing.
>
> I think we're talking past each other. Yes, send with zerocopy has been
> available for a while now, both with io_uring and just sendmsg(), but
> I'm using kperf for testing and it does not look like it supports it.
> Might have to add it... We'll see how far I can get without it.
Stanislav pointed me at:
https://github.com/facebookexperimental/kperf/pull/2
which adds zc send. I ran a quick test, and it does reduce cpu
utilization on the sender from 100% to 95%. I'll keep poking...
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 16:55 ` Mina Almasry
2024-10-09 16:57 ` Jens Axboe
@ 2024-10-09 17:19 ` David Ahern
2024-10-09 18:21 ` Pedro Tammela
2024-10-11 0:29 ` David Wei
3 siblings, 0 replies; 124+ messages in thread
From: David Ahern @ 2024-10-09 17:19 UTC (permalink / raw)
To: Mina Almasry, David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer
On 10/9/24 10:55 AM, Mina Almasry wrote:
>
> I've been thinking about this a bit, and I hope this feedback isn't
> too late, but I think your work may be useful for users not using
> io_uring. I.e. zero copy to host memory that is not dependent on page
> aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
I disagree with this request; AF_XDP by definition is bypassing the
kernel stack.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 09/15] io_uring/zcrx: add interface queue and refill queue
2024-10-07 22:15 ` [PATCH v1 09/15] io_uring/zcrx: add interface queue and refill queue David Wei
@ 2024-10-09 17:50 ` Jens Axboe
2024-10-09 18:09 ` Jens Axboe
` (3 more replies)
0 siblings, 4 replies; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 17:50 UTC (permalink / raw)
To: David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/7/24 4:15 PM, David Wei wrote:
> From: David Wei <[email protected]>
>
> Add a new object called an interface queue (ifq) that represents a net rx queue
> that has been configured for zero copy. Each ifq is registered using a new
> registration opcode IORING_REGISTER_ZCRX_IFQ.
>
> The refill queue is allocated by the kernel and mapped by userspace using a new
> offset IORING_OFF_RQ_RING, in a similar fashion to the main SQ/CQ. It is used
> by userspace to return buffers that it is done with, which will then be re-used
> by the netdev again.
>
> The main CQ ring is used to notify userspace of received data by using the
> upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each entry
> contains the offset + len to the data.
>
> For now, each io_uring instance only has a single ifq.
Looks pretty straightforward to me, but please wrap your commit
messages at ~72 chars or it doesn't read so well in the git log.
A few minor comments...
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index adc2524fd8e3..567cdb89711e 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -595,6 +597,9 @@ enum io_uring_register_op {
> IORING_REGISTER_NAPI = 27,
> IORING_UNREGISTER_NAPI = 28,
>
> + /* register a netdev hw rx queue for zerocopy */
> + IORING_REGISTER_ZCRX_IFQ = 29,
> +
Will need to change as the current tree has moved a bit beyond this. Not
a huge deal, just an FYI as it obviously impacts userspace too.
> +struct io_uring_zcrx_rqe {
> + __u64 off;
> + __u32 len;
> + __u32 __pad;
> +};
> +
> +struct io_uring_zcrx_cqe {
> + __u64 off;
> + __u64 __pad;
> +};
Would be nice to avoid padding for this one as it doubles its size. But
at the same time, always nice to have padding for future proofing...
> diff --git a/io_uring/Makefile b/io_uring/Makefile
> index 61923e11c767..1a1184f3946a 100644
> --- a/io_uring/Makefile
> +++ b/io_uring/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_IO_URING) += io_uring.o opdef.o kbuf.o rsrc.o notif.o \
> epoll.o statx.o timeout.o fdinfo.o \
> cancel.o waitid.o register.o \
> truncate.o memmap.o
> +obj-$(CONFIG_PAGE_POOL) += zcrx.o
> obj-$(CONFIG_IO_WQ) += io-wq.o
> obj-$(CONFIG_FUTEX) += futex.o
> obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
I wonder if this should be expressed a bit differently. Probably have a
CONFIG_IO_URING_ZCRX which depends on CONFIG_INET and CONFIG_PAGE_POOL.
And then you can also use that rather than doing:
#if defined(CONFIG_PAGE_POOL) && defined(CONFIG_INET)
in some spots. Not a big deal, it'll work as-is. And honestly should
probably clean up the existing IO_WQ symbol while at it, so perhaps
better left for after the fact.
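IOW, something along these lines (a sketch only, symbol name and
placement up for debate):

	config IO_URING_ZCRX
		def_bool y
		depends on IO_URING && PAGE_POOL && INET

with io_uring/Makefile then using

	obj-$(CONFIG_IO_URING_ZCRX) += zcrx.o

and the #if defined(CONFIG_PAGE_POOL) && defined(CONFIG_INET) spots
becoming a plain #ifdef CONFIG_IO_URING_ZCRX.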
> +static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
> + struct io_uring_zcrx_ifq_reg *reg)
> +{
> + size_t off, size;
> + void *ptr;
> +
> + off = sizeof(struct io_uring);
> + size = off + sizeof(struct io_uring_zcrx_rqe) * reg->rq_entries;
> +
> + ptr = io_pages_map(&ifq->rqe_pages, &ifq->n_rqe_pages, size);
> + if (IS_ERR(ptr))
> + return PTR_ERR(ptr);
> +
> + ifq->rq_ring = (struct io_uring *)ptr;
> + ifq->rqes = (struct io_uring_zcrx_rqe *)((char *)ptr + off);
> + return 0;
> +}
No need to cast that ptr to char *.
Rest looks fine to me.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 10/15] io_uring/zcrx: add io_zcrx_area
2024-10-07 22:15 ` [PATCH v1 10/15] io_uring/zcrx: add io_zcrx_area David Wei
@ 2024-10-09 18:02 ` Jens Axboe
2024-10-09 19:05 ` Pavel Begunkov
2024-10-09 21:29 ` Mina Almasry
1 sibling, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 18:02 UTC (permalink / raw)
To: David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/7/24 4:15 PM, David Wei wrote:
> +static int io_zcrx_create_area(struct io_ring_ctx *ctx,
> + struct io_zcrx_ifq *ifq,
> + struct io_zcrx_area **res,
> + struct io_uring_zcrx_area_reg *area_reg)
> +{
> + struct io_zcrx_area *area;
> + int i, ret, nr_pages;
> + struct iovec iov;
> +
> + if (area_reg->flags || area_reg->rq_area_token)
> + return -EINVAL;
> + if (area_reg->__resv1 || area_reg->__resv2[0] || area_reg->__resv2[1])
> + return -EINVAL;
> + if (area_reg->addr & ~PAGE_MASK || area_reg->len & ~PAGE_MASK)
> + return -EINVAL;
> +
> + iov.iov_base = u64_to_user_ptr(area_reg->addr);
> + iov.iov_len = area_reg->len;
> + ret = io_buffer_validate(&iov);
> + if (ret)
> + return ret;
> +
> + ret = -ENOMEM;
> + area = kzalloc(sizeof(*area), GFP_KERNEL);
> + if (!area)
> + goto err;
This should probably just be a:
area = kzalloc(sizeof(*area), GFP_KERNEL);
if (!area)
return -ENOMEM;
Minor nit...
> diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
> index 4ef94e19d36b..2fcbeb3d5501 100644
> --- a/io_uring/zcrx.h
> +++ b/io_uring/zcrx.h
> @@ -3,10 +3,26 @@
> #define IOU_ZC_RX_H
>
> #include <linux/io_uring_types.h>
> +#include <net/page_pool/types.h>
> +
> +struct io_zcrx_area {
> + struct net_iov_area nia;
> + struct io_zcrx_ifq *ifq;
> +
> + u16 area_id;
> + struct page **pages;
> +
> + /* freelist */
> + spinlock_t freelist_lock ____cacheline_aligned_in_smp;
> + u32 free_count;
> + u32 *freelist;
> +};
I'm wondering if this really needs an aligned lock? Since it's only a
single structure, probably not a big deal. But unless there's evidence
to the contrary, might not be a bad idea to just kill that.
Apart from that, looks fine to me.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 09/15] io_uring/zcrx: add interface queue and refill queue
2024-10-09 17:50 ` Jens Axboe
@ 2024-10-09 18:09 ` Jens Axboe
2024-10-09 19:08 ` Pavel Begunkov
` (2 subsequent siblings)
3 siblings, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 18:09 UTC (permalink / raw)
To: David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 11:50 AM, Jens Axboe wrote:
>> +struct io_uring_zcrx_rqe {
>> + __u64 off;
>> + __u32 len;
>> + __u32 __pad;
>> +};
>> +
>> +struct io_uring_zcrx_cqe {
>> + __u64 off;
>> + __u64 __pad;
>> +};
>
> Would be nice to avoid padding for this one as it doubles its size. But
> at the same time, always nice to have padding for future proofing...
Ah nevermind, I see it mirrors the io_uring_cqe itself. Disregard.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
2024-10-07 22:15 ` [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
@ 2024-10-09 18:10 ` Jens Axboe
2024-10-09 22:01 ` Mina Almasry
1 sibling, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 18:10 UTC (permalink / raw)
To: David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry
Looks good to me.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 16:55 ` Mina Almasry
2024-10-09 16:57 ` Jens Axboe
2024-10-09 17:19 ` David Ahern
@ 2024-10-09 18:21 ` Pedro Tammela
2024-10-10 13:19 ` Pavel Begunkov
2024-10-11 0:35 ` David Wei
2024-10-11 0:29 ` David Wei
3 siblings, 2 replies; 124+ messages in thread
From: Pedro Tammela @ 2024-10-09 18:21 UTC (permalink / raw)
To: Mina Almasry, David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On 09/10/2024 13:55, Mina Almasry wrote:
> [...]
>
> If not, I would like to see a comparison between TCP RX zerocopy and
> this new io_uring zerocopy. At Google, for example, we use TCP RX
> zerocopy; I would like to see perf numbers possibly motivating us to
> move to this new thing.
>
> [1] https://lwn.net/Articles/752046/
>
Hi!
From my own testing, the TCP RX Zerocopy is quite heavy on the page
unmapping side. Since the io_uring implementation is expected to be
lighter (see patch 11), I would expect a simple comparison to show
better numbers for io_uring.
To be fair to the existing implementation, it would then need to be
paired with some 'real' computation, but that varies a lot. As we
presented in netdevconf this year, HW-GRO eventually was the best option
for us (no app changes, etc...) but still a case by case decision.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 12/15] io_uring/zcrx: add io_recvzc request
2024-10-07 22:16 ` [PATCH v1 12/15] io_uring/zcrx: add io_recvzc request David Wei
@ 2024-10-09 18:28 ` Jens Axboe
2024-10-09 18:51 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 18:28 UTC (permalink / raw)
To: David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry
> diff --git a/io_uring/net.c b/io_uring/net.c
> index d08abcca89cc..482e138d2994 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -1193,6 +1201,76 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
> return ret;
> }
>
> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> +{
> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
> + unsigned ifq_idx;
> +
> + if (unlikely(sqe->file_index || sqe->addr2 || sqe->addr ||
> + sqe->len || sqe->addr3))
> + return -EINVAL;
> +
> + ifq_idx = READ_ONCE(sqe->zcrx_ifq_idx);
> + if (ifq_idx != 0)
> + return -EINVAL;
> + zc->ifq = req->ctx->ifq;
> + if (!zc->ifq)
> + return -EINVAL;
This is read and assigned to 'zc' here, but then the issue handler does
it again? I'm assuming that at some point we'll have ifq selection here,
and then the issue handler will just use zc->ifq. So this part should
probably remain, and the issue side just use zc->ifq?
> + /* All data completions are posted as aux CQEs. */
> + req->flags |= REQ_F_APOLL_MULTISHOT;
This puzzles me a bit...
> + zc->flags = READ_ONCE(sqe->ioprio);
> + zc->msg_flags = READ_ONCE(sqe->msg_flags);
> + if (zc->msg_flags)
> + return -EINVAL;
Maybe allow MSG_DONTWAIT at least? You already pass that in anyway.
> + if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST | IORING_RECV_MULTISHOT))
> + return -EINVAL;
> +
> +
> +#ifdef CONFIG_COMPAT
> + if (req->ctx->compat)
> + zc->msg_flags |= MSG_CMSG_COMPAT;
> +#endif
> + return 0;
> +}
Heh, we could probably just return -EINVAL for that case, but since this
is all we need, fine.
> +
> +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
> +{
> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
> + struct io_zcrx_ifq *ifq;
> + struct socket *sock;
> + int ret;
> +
> + if (!(req->flags & REQ_F_POLLED) &&
> + (zc->flags & IORING_RECVSEND_POLL_FIRST))
> + return -EAGAIN;
> +
> + sock = sock_from_file(req->file);
> + if (unlikely(!sock))
> + return -ENOTSOCK;
> + ifq = req->ctx->ifq;
> + if (!ifq)
> + return -EINVAL;
ifq = zc->ifq;
and then that check can go away too, as it should already have been
errored at prep time if this wasn't valid.
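For illustration, the issue side would then boil down to something like
this (a sketch only; the trailing receive helper name is an assumption,
and prep is assumed to keep the ctx->ifq lookup and validation):

	int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
	{
		struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
		struct io_zcrx_ifq *ifq = zc->ifq;	/* validated at prep time */
		struct socket *sock;

		if (!(req->flags & REQ_F_POLLED) &&
		    (zc->flags & IORING_RECVSEND_POLL_FIRST))
			return -EAGAIN;

		sock = sock_from_file(req->file);
		if (unlikely(!sock))
			return -ENOTSOCK;

		/* io_zcrx_recv() is an assumed name for the actual receive path */
		return io_zcrx_recv(req, ifq, sock, zc->msg_flags | MSG_DONTWAIT, issue_flags);
	}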
> +static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov,
> + struct io_zcrx_ifq *ifq, int off, int len)
> +{
> + struct io_uring_zcrx_cqe *rcqe;
> + struct io_zcrx_area *area;
> + struct io_uring_cqe *cqe;
> + u64 offset;
> +
> + if (!io_defer_get_uncommited_cqe(req->ctx, &cqe))
> + return false;
> +
> + cqe->user_data = req->cqe.user_data;
> + cqe->res = len;
> + cqe->flags = IORING_CQE_F_MORE;
> +
> + area = io_zcrx_iov_to_area(niov);
> + offset = off + (net_iov_idx(niov) << PAGE_SHIFT);
> + rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
> + rcqe->off = offset + ((u64)area->area_id << IORING_ZCRX_AREA_SHIFT);
> + memset(&rcqe->__pad, 0, sizeof(rcqe->__pad));
Just do
rcqe->__pad = 0;
since it's a single field.
Rest looks fine to me.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 13/15] io_uring/zcrx: add copy fallback
2024-10-07 22:16 ` [PATCH v1 13/15] io_uring/zcrx: add copy fallback David Wei
2024-10-08 15:58 ` Stanislav Fomichev
@ 2024-10-09 18:38 ` Jens Axboe
1 sibling, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 18:38 UTC (permalink / raw)
To: David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/7/24 4:16 PM, David Wei wrote:
> From: Pavel Begunkov <[email protected]>
>
> There are scenarios in which the zerocopy path might get a normal
> in-kernel buffer: it could be a mis-steered packet or simply the linear
> part of an skb. Another use case is to allow the driver to allocate
> kernel pages when it's out of zc buffers, which makes it more resilient
> to spikes in load and allows the user to choose the balance between the
> amount of memory provided and performance.
>
> At the moment we fail such requests. Instead, grab a buffer from the
> page pool, copy the data there, and return it to the user in the usual
> way. Because the refill ring is private to the napi our page pool is
> running from, this is done by stopping the napi via the napi_execute()
> helper. It grabs only one buffer, which is inefficient, and improving
> it is left for follow-up patches.
This also looks fine to me. Agree with the sentiment that it'd be nice
to propagate back if copies are happening, and to what extent. But also
agree that this can wait.
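For readers skimming the thread, the fallback described above amounts to
roughly the following (a sketch only; io_zcrx_alloc_fallback() and
io_zcrx_iov_page() are assumed names, while io_zcrx_queue_cqe() matches
the helper shown in the patch 12 review):

	static int io_zcrx_copy_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
				     struct page *src_page, unsigned int src_off,
				     unsigned int len)
	{
		struct net_iov *niov;
		void *dst, *src;

		/* grabs a single zc buffer per call, hence the noted inefficiency */
		niov = io_zcrx_alloc_fallback(ifq);		/* assumed name */
		if (!niov)
			return -ENOMEM;

		dst = kmap_local_page(io_zcrx_iov_page(niov));	/* assumed name */
		src = kmap_local_page(src_page);
		memcpy(dst, src + src_off, len);
		kunmap_local(src);
		kunmap_local(dst);

		/* complete to userspace through the usual zcrx CQE path */
		if (!io_zcrx_queue_cqe(req, niov, ifq, 0, len))
			return -ENOSPC;
		return len;
	}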
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 14/15] io_uring/zcrx: set pp memory provider for an rx queue
2024-10-07 22:16 ` [PATCH v1 14/15] io_uring/zcrx: set pp memory provider for an rx queue David Wei
@ 2024-10-09 18:42 ` Jens Axboe
2024-10-10 13:09 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 18:42 UTC (permalink / raw)
To: David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/7/24 4:16 PM, David Wei wrote:
> From: David Wei <[email protected]>
>
> Set the page pool memory provider for the rx queue configured for zero copy to
> io_uring. Then the rx queue is reset using netdev_rx_queue_restart() and netdev
> core + page pool will take care of filling the rx queue from the io_uring zero
> copy memory provider.
>
> For now, there is only one ifq so its destruction happens implicitly during
> io_uring cleanup.
Bit wide...
> @@ -237,15 +309,20 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
> reg.offsets.tail = offsetof(struct io_uring, tail);
>
> if (copy_to_user(arg, ®, sizeof(reg))) {
> + io_close_zc_rxq(ifq);
> ret = -EFAULT;
> goto err;
> }
> if (copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) {
> + io_close_zc_rxq(ifq);
> ret = -EFAULT;
> goto err;
> }
> ctx->ifq = ifq;
> return 0;
Not added in this patch, but since I was looking at rtnl lock coverage,
it's OK to potentially fault while holding this lock? I'm assuming it
is, as I can't imagine any faulting needing to grab it. Not even from
nbd ;-)
Looks fine to me.
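For context, the 'set the provider and restart the queue' step described
in the commit message maps to roughly the following (a sketch; function,
ops and field names are assumptions, and the mp_params usage mirrors the
checks added in patch 5):

	static int io_open_zc_rxq(struct io_zcrx_ifq *ifq, unsigned int qid)
	{
		struct netdev_rx_queue *rxq = __netif_get_rx_queue(ifq->dev, qid);

		rxq->mp_params.mp_ops = &io_uring_pp_zc_ops;	/* assumed name */
		rxq->mp_params.mp_priv = ifq;
		return netdev_rx_queue_restart(ifq->dev, qid);
	}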
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 15/15] io_uring/zcrx: throttle receive requests
2024-10-07 22:16 ` [PATCH v1 15/15] io_uring/zcrx: throttle receive requests David Wei
@ 2024-10-09 18:43 ` Jens Axboe
0 siblings, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 18:43 UTC (permalink / raw)
To: David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/7/24 4:16 PM, David Wei wrote:
> From: Pavel Begunkov <[email protected]>
>
> io_zc_rx_tcp_recvmsg() continues until it fails or there is nothing to
> receive. If the other side sends fast enough, we might get stuck in
> io_zc_rx_tcp_recvmsg() producing more and more CQEs but not letting the
> user handle them, leading to unbounded latencies.
>
> Break out of it based on an arbitrarily chosen limit; the upper layer
> will either return to userspace or requeue the request.
Probably prudent, and hand wavy limits are just fine as all we really
care about is breaking out.
Looks fine to me.
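To make the shape of the change concrete, the cap amounts to something
like this (a sketch; the limit value and the inner helper name are
assumptions):

	#define IO_ZCRX_RECV_LIMIT	256

	static int io_zcrx_recv_loop(struct io_kiocb *req, struct io_zcrx_ifq *ifq)
	{
		unsigned int nr = 0;
		int ret;

		do {
			ret = io_zcrx_recv_one(req, ifq);	/* assumed name */
			if (ret <= 0)
				break;
		} while (++nr < IO_ZCRX_RECV_LIMIT);

		/* the upper layer requeues the request or returns to userspace */
		return ret;
	}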
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 12/15] io_uring/zcrx: add io_recvzc request
2024-10-09 18:28 ` Jens Axboe
@ 2024-10-09 18:51 ` Pavel Begunkov
2024-10-09 19:01 ` Jens Axboe
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 18:51 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 19:28, Jens Axboe wrote:
>> diff --git a/io_uring/net.c b/io_uring/net.c
>> index d08abcca89cc..482e138d2994 100644
>> --- a/io_uring/net.c
>> +++ b/io_uring/net.c
>> @@ -1193,6 +1201,76 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
>> return ret;
>> }
>>
>> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>> +{
>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>> + unsigned ifq_idx;
>> +
>> + if (unlikely(sqe->file_index || sqe->addr2 || sqe->addr ||
>> + sqe->len || sqe->addr3))
>> + return -EINVAL;
>> +
>> + ifq_idx = READ_ONCE(sqe->zcrx_ifq_idx);
>> + if (ifq_idx != 0)
>> + return -EINVAL;
>> + zc->ifq = req->ctx->ifq;
>> + if (!zc->ifq)
>> + return -EINVAL;
>
> This is read and assigned to 'zc' here, but then the issue handler does
> it again? I'm assuming that at some point we'll have ifq selection here,
> and then the issue handler will just use zc->ifq. So this part should
> probably remain, and the issue side just use zc->ifq?
Yep, fairly overlooked. It's not a real problem, but should
only be fetched and checked here.
>> + /* All data completions are posted as aux CQEs. */
>> + req->flags |= REQ_F_APOLL_MULTISHOT;
>
> This puzzles me a bit...
Well, it's a multishot request. And that flag protects from cq
locking rules violations, i.e. avoiding multishot reqs from
posting from io-wq.
>> + zc->flags = READ_ONCE(sqe->ioprio);
>> + zc->msg_flags = READ_ONCE(sqe->msg_flags);
>> + if (zc->msg_flags)
>> + return -EINVAL;
>
> Maybe allow MSG_DONTWAIT at least? You already pass that in anyway.
What would the semantics be? The io_uring nowait has always
been a pure mess because it's not even clear what it's supposed
to mean for async requests.
>> + if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST | IORING_RECV_MULTISHOT))
>> + return -EINVAL;
>> +
>> +
>> +#ifdef CONFIG_COMPAT
>> + if (req->ctx->compat)
>> + zc->msg_flags |= MSG_CMSG_COMPAT;
>> +#endif
>> + return 0;
>> +}
>
> Heh, we could probably just return -EINVAL for that case, but since this
> is all we need, fine.
Well, there is no msghdr, cmsg nor iovec there, so it doesn't even
make sense to set it. Can fail as well, I don't think anyone would care.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 12/15] io_uring/zcrx: add io_recvzc request
2024-10-09 18:51 ` Pavel Begunkov
@ 2024-10-09 19:01 ` Jens Axboe
2024-10-09 19:27 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 19:01 UTC (permalink / raw)
To: Pavel Begunkov, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 12:51 PM, Pavel Begunkov wrote:
> On 10/9/24 19:28, Jens Axboe wrote:
>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>> index d08abcca89cc..482e138d2994 100644
>>> --- a/io_uring/net.c
>>> +++ b/io_uring/net.c
>>> @@ -1193,6 +1201,76 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
>>> return ret;
>>> }
>>> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>> +{
>>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>> + unsigned ifq_idx;
>>> +
>>> + if (unlikely(sqe->file_index || sqe->addr2 || sqe->addr ||
>>> + sqe->len || sqe->addr3))
>>> + return -EINVAL;
>>> +
>>> + ifq_idx = READ_ONCE(sqe->zcrx_ifq_idx);
>>> + if (ifq_idx != 0)
>>> + return -EINVAL;
>>> + zc->ifq = req->ctx->ifq;
>>> + if (!zc->ifq)
>>> + return -EINVAL;
>>
>> This is read and assigned to 'zc' here, but then the issue handler does
>> it again? I'm assuming that at some point we'll have ifq selection here,
>> and then the issue handler will just use zc->ifq. So this part should
>> probably remain, and the issue side just use zc->ifq?
>
> Yep, fairly overlooked. It's not a real problem, but should
> only be fetched and checked here.
Right
>>> + /* All data completions are posted as aux CQEs. */
>>> + req->flags |= REQ_F_APOLL_MULTISHOT;
>>
>> This puzzles me a bit...
>
> Well, it's a multishot request. And that flag protects from cq
> locking rules violations, i.e. avoiding multishot reqs from
> posting from io-wq.
Maybe make it more like the others and require that
IORING_RECV_MULTISHOT is set then, and set it based on that?
>>> + zc->flags = READ_ONCE(sqe->ioprio);
>>> + zc->msg_flags = READ_ONCE(sqe->msg_flags);
>>> + if (zc->msg_flags)
>>> + return -EINVAL;
>>
>> Maybe allow MSG_DONTWAIT at least? You already pass that in anyway.
>
> What would the semantics be? The io_uring nowait has always
> been a pure mess because it's not even clear what it's supposed
> to mean for async requests.
Yeah can't disagree with that. Not a big deal, doesn't really matter,
can stay as-is.
>>> + if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST | IORING_RECV_MULTISHOT))
>>> + return -EINVAL;
>>> +
>>> +
>>> +#ifdef CONFIG_COMPAT
>>> + if (req->ctx->compat)
>>> + zc->msg_flags |= MSG_CMSG_COMPAT;
>>> +#endif
>>> + return 0;
>>> +}
>>
>> Heh, we could probably just return -EINVAL for that case, but since this
>> is all we need, fine.
>
>> Well, there is no msghdr, cmsg nor iovec there, so it doesn't even
>> make sense to set it. Can fail as well, I don't think anyone would care.
Then let's please just kill it, should not need a check for that then.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 10/15] io_uring/zcrx: add io_zcrx_area
2024-10-09 18:02 ` Jens Axboe
@ 2024-10-09 19:05 ` Pavel Begunkov
2024-10-09 19:06 ` Jens Axboe
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 19:05 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 19:02, Jens Axboe wrote:
> On 10/7/24 4:15 PM, David Wei wrote:
...
>> +struct io_zcrx_area {
>> + struct net_iov_area nia;
>> + struct io_zcrx_ifq *ifq;
>> +
>> + u16 area_id;
>> + struct page **pages;
>> +
>> + /* freelist */
>> + spinlock_t freelist_lock ____cacheline_aligned_in_smp;
>> + u32 free_count;
>> + u32 *freelist;
>> +};
>
> I'm wondering if this really needs an aligned lock? Since it's only a
> single structure, probably not a big deal. But unless there's evidence
> to the contrary, might not be a bad idea to just kill that.
napi and IORING_OP_RECV_ZC can run on different CPUs, and I wouldn't
want the fields before the lock to be contended by the lock because
of cache line sharing; it would especially hurt until the cache is
well warmed up. Not really profiled, but it's not like we need to
care about space here.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 10/15] io_uring/zcrx: add io_zcrx_area
2024-10-09 19:05 ` Pavel Begunkov
@ 2024-10-09 19:06 ` Jens Axboe
0 siblings, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 19:06 UTC (permalink / raw)
To: Pavel Begunkov, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 1:05 PM, Pavel Begunkov wrote:
> On 10/9/24 19:02, Jens Axboe wrote:
>> On 10/7/24 4:15 PM, David Wei wrote:
> ...
>>> +struct io_zcrx_area {
>>> + struct net_iov_area nia;
>>> + struct io_zcrx_ifq *ifq;
>>> +
>>> + u16 area_id;
>>> + struct page **pages;
>>> +
>>> + /* freelist */
>>> + spinlock_t freelist_lock ____cacheline_aligned_in_smp;
>>> + u32 free_count;
>>> + u32 *freelist;
>>> +};
>>
>> I'm wondering if this really needs an aligned lock? Since it's only a
>> single structure, probably not a big deal. But unless there's evidence
>> to the contrary, might not be a bad idea to just kill that.
>
> napi and IORING_OP_RECV_ZC can run on different CPUs, and I wouldn't
> want the fields before the lock to be contended by the lock because
> of cache line sharing; it would especially hurt until the cache is
> well warmed up. Not really profiled, but it's not like we need to
> care about space here.
Right, as mentioned it's just a single struct, so it doesn't matter that
much. I guess my testing all ran with the same CPU for napi + rx, so I
would not have seen it regardless. We can keep it as-is.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 09/15] io_uring/zcrx: add interface queue and refill queue
2024-10-09 17:50 ` Jens Axboe
2024-10-09 18:09 ` Jens Axboe
@ 2024-10-09 19:08 ` Pavel Begunkov
2024-10-11 22:11 ` Pavel Begunkov
2024-10-13 17:32 ` David Wei
3 siblings, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 19:08 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 18:50, Jens Axboe wrote:
> On 10/7/24 4:15 PM, David Wei wrote:
>> From: David Wei <[email protected]>
>>
...
>> diff --git a/io_uring/Makefile b/io_uring/Makefile
>> index 61923e11c767..1a1184f3946a 100644
>> --- a/io_uring/Makefile
>> +++ b/io_uring/Makefile
>> @@ -10,6 +10,7 @@ obj-$(CONFIG_IO_URING) += io_uring.o opdef.o kbuf.o rsrc.o notif.o \
>> epoll.o statx.o timeout.o fdinfo.o \
>> cancel.o waitid.o register.o \
>> truncate.o memmap.o
>> +obj-$(CONFIG_PAGE_POOL) += zcrx.o
>> obj-$(CONFIG_IO_WQ) += io-wq.o
>> obj-$(CONFIG_FUTEX) += futex.o
>> obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
>
> I wonder if this should be expressed a bit differently. Probably have a
> CONFIG_IO_URING_ZCRX which depends on CONFIG_INET and CONFIG_PAGE_POOL.
> And then you can also use that rather than doing:
>
> #if defined(CONFIG_PAGE_POOL) && defined(CONFIG_INET)
>
> in some spots. Not a big deal, it'll work as-is. And honestly should
> probably cleanup the existing IO_WQ symbol while at it, so perhaps
> better left for after the fact.
I should probably just add a CONFIG_IO_URING_ZCRX that isn't user
selectable and make it depend on INET, etc.
>> +static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
>> + struct io_uring_zcrx_ifq_reg *reg)
>> +{
>> + size_t off, size;
>> + void *ptr;
>> +
>> + off = sizeof(struct io_uring);
>> + size = off + sizeof(struct io_uring_zcrx_rqe) * reg->rq_entries;
>> +
>> + ptr = io_pages_map(&ifq->rqe_pages, &ifq->n_rqe_pages, size);
>> + if (IS_ERR(ptr))
>> + return PTR_ERR(ptr);
>> +
>> + ifq->rq_ring = (struct io_uring *)ptr;
>> + ifq->rqes = (struct io_uring_zcrx_rqe *)((char *)ptr + off);
>> + return 0;
>> +}
>
> No need to cast that ptr to char *.
I'll apply it and other small nits, thanks for the review
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 08/15] net: add helper executing custom callback from napi
2024-10-09 16:13 ` Joe Damato
@ 2024-10-09 19:12 ` Pavel Begunkov
0 siblings, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 19:12 UTC (permalink / raw)
To: Joe Damato, David Wei, io-uring, netdev, Jens Axboe,
Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 17:13, Joe Damato wrote:
> On Wed, Oct 09, 2024 at 04:09:53PM +0100, Pavel Begunkov wrote:
>> On 10/8/24 23:25, Joe Damato wrote:
>>> On Mon, Oct 07, 2024 at 03:15:56PM -0700, David Wei wrote:
>>>> From: Pavel Begunkov <[email protected]>
>>>
>>> [...]
>>>
>>>> However, from time to time we need to synchronise with the napi, for
>>>> example to add more user memory or allocate fallback buffers. Add a
>>>> helper function napi_execute that allows running a custom callback
>>>> from under napi context so that it can access and modify napi-protected
>>>> parts of io_uring. It works similarly to busy polling and stops napi
>>>> from running in the meantime, so it's supposed to be a slow control path.
>>>>
>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>> Signed-off-by: David Wei <[email protected]>
>>>
>>> [...]
>>>
>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>> index 1e740faf9e78..ba2f43cf5517 100644
>>>> --- a/net/core/dev.c
>>>> +++ b/net/core/dev.c
>>>> @@ -6497,6 +6497,59 @@ void napi_busy_loop(unsigned int napi_id,
>>>> }
>>>> EXPORT_SYMBOL(napi_busy_loop);
>>>> +void napi_execute(unsigned napi_id,
>>>> + void (*cb)(void *), void *cb_arg)
>>>> +{
>>>> + struct napi_struct *napi;
>>>> + bool done = false;
>>>> + unsigned long val;
>>>> + void *have_poll_lock = NULL;
>>>> +
>>>> + rcu_read_lock();
>>>> +
>>>> + napi = napi_by_id(napi_id);
>>>> + if (!napi) {
>>>> + rcu_read_unlock();
>>>> + return;
>>>> + }
>>>> +
>>>> + if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>>>> + preempt_disable();
>>>> + for (;;) {
>>>> + local_bh_disable();
>>>> + val = READ_ONCE(napi->state);
>>>> +
>>>> + /* If multiple threads are competing for this napi,
>>>> + * we avoid dirtying napi->state as much as we can.
>>>> + */
>>>> + if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
>>>> + NAPIF_STATE_IN_BUSY_POLL))
>>>> + goto restart;
>>>> +
>>>> + if (cmpxchg(&napi->state, val,
>>>> + val | NAPIF_STATE_IN_BUSY_POLL |
>>>> + NAPIF_STATE_SCHED) != val)
>>>> + goto restart;
>>>> +
>>>> + have_poll_lock = netpoll_poll_lock(napi);
>>>> + cb(cb_arg);
>>>
>>> A lot of the above code seems quite similar to __napi_busy_loop, as
>>> you mentioned.
>>>
>>> It might be too painful, but I can't help but wonder if there's a
>>> way to refactor this to use common helpers or something?
>>>
>>> I had been thinking that the napi->state check /
>>> cmpxchg could maybe be refactored to avoid being repeated in both
>>> places?
>>
>> Yep, I can add a helper for that, but I'm not sure how to
>> deduplicate it further while trying not to pollute the
>> napi polling path.
>
> It was just a minor nit; I wouldn't want to hold back this important
> work just for that.
>
> I'm still looking at the code myself to see if I can see a better
> arrangement of the code.
>
> But that could always come later as a cleanup for -next ?
It's still early; there will be a v6 anyway. And thanks for
taking a look.
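To illustrate the deduplication idea, the shared piece could be factored
out along these lines (a sketch only; the helper name is made up, the
body is lifted from the napi_execute() hunk quoted above):

	static bool napi_state_try_claim_busy_poll(struct napi_struct *napi)
	{
		unsigned long val = READ_ONCE(napi->state);

		/* If multiple threads are competing for this napi,
		 * we avoid dirtying napi->state as much as we can.
		 */
		if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
			   NAPIF_STATE_IN_BUSY_POLL))
			return false;

		return cmpxchg(&napi->state, val,
			       val | NAPIF_STATE_IN_BUSY_POLL |
			       NAPIF_STATE_SCHED) == val;
	}

Both __napi_busy_loop() and napi_execute() could then retry around this
helper instead of open-coding the state check and cmpxchg.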
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 12/15] io_uring/zcrx: add io_recvzc request
2024-10-09 19:01 ` Jens Axboe
@ 2024-10-09 19:27 ` Pavel Begunkov
2024-10-09 19:42 ` Jens Axboe
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 19:27 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 20:01, Jens Axboe wrote:
> On 10/9/24 12:51 PM, Pavel Begunkov wrote:
>> On 10/9/24 19:28, Jens Axboe wrote:
>>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>>> index d08abcca89cc..482e138d2994 100644
>>>> --- a/io_uring/net.c
>>>> +++ b/io_uring/net.c
>>>> @@ -1193,6 +1201,76 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
>>>> return ret;
>>>> }
>>>> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>> +{
>>>> + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>>> + unsigned ifq_idx;
>>>> +
>>>> + if (unlikely(sqe->file_index || sqe->addr2 || sqe->addr ||
>>>> + sqe->len || sqe->addr3))
>>>> + return -EINVAL;
>>>> +
>>>> + ifq_idx = READ_ONCE(sqe->zcrx_ifq_idx);
>>>> + if (ifq_idx != 0)
>>>> + return -EINVAL;
>>>> + zc->ifq = req->ctx->ifq;
>>>> + if (!zc->ifq)
>>>> + return -EINVAL;
>>>
>>> This is read and assigned to 'zc' here, but then the issue handler does
>>> it again? I'm assuming that at some point we'll have ifq selection here,
>>> and then the issue handler will just use zc->ifq. So this part should
>>> probably remain, and the issue side just use zc->ifq?
>>
>> Yep, fairly overlooked. It's not a real problem, but should
>> only be fetched and checked here.
>
> Right
>
>>>> + /* All data completions are posted as aux CQEs. */
>>>> + req->flags |= REQ_F_APOLL_MULTISHOT;
>>>
>>> This puzzles me a bit...
>>
>> Well, it's a multishot request. And that flag protects from cq
>> locking rules violations, i.e. avoiding multishot reqs from
>> posting from io-wq.
>
> Maybe make it more like the others and require that
> IORING_RECV_MULTISHOT is set then, and set it based on that?
if (IORING_RECV_MULTISHOT)
return -EINVAL;
req->flags |= REQ_F_APOLL_MULTISHOT;
It can be this if that's the preference. It's a bit more consistent,
but might be harder to use. Though I can just hide the flag behind
liburing helpers, which would spare us from never-ending GH issues
asking why it's -EINVAL'ed.
>>>> + zc->flags = READ_ONCE(sqe->ioprio);
>>>> + zc->msg_flags = READ_ONCE(sqe->msg_flags);
>>>> + if (zc->msg_flags)
>>>> + return -EINVAL;
>>>
>>> Maybe allow MSG_DONTWAIT at least? You already pass that in anyway.
>>
>> What would the semantics be? The io_uring nowait has always
>> been a pure mess because it's not even clear what it's supposed
>> to mean for async requests.
>
> Yeah can't disagree with that. Not a big deal, doesn't really matter,
> can stay as-is.
I went through the MSG_* flags before, looking at which ones might
even make sense here and be useful... Better to enable them later
if they're needed.
>>>> + if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST | IORING_RECV_MULTISHOT))
>>>> + return -EINVAL;
>>>> +
>>>> +
>>>> +#ifdef CONFIG_COMPAT
>>>> + if (req->ctx->compat)
>>>> + zc->msg_flags |= MSG_CMSG_COMPAT;
>>>> +#endif
>>>> + return 0;
>>>> +}
>>>
>>> Heh, we could probably just return -EINVAL for that case, but since this
>>> is all we need, fine.
>>
>> Well, there is no msghdr, cmsg nor iovec there, so it doesn't even
>> make sense to set it. Can fail as well, I don't think anyone would care.
>
> Then let's please just kill it, should not need a check for that then.
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 16:57 ` Jens Axboe
@ 2024-10-09 19:32 ` Mina Almasry
2024-10-09 19:43 ` Pavel Begunkov
2024-10-09 19:47 ` Jens Axboe
0 siblings, 2 replies; 124+ messages in thread
From: Mina Almasry @ 2024-10-09 19:32 UTC (permalink / raw)
To: Jens Axboe
Cc: David Wei, io-uring, netdev, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Wed, Oct 9, 2024 at 9:57 AM Jens Axboe <[email protected]> wrote:
>
> On 10/9/24 10:55 AM, Mina Almasry wrote:
> > On Mon, Oct 7, 2024 at 3:16?PM David Wei <[email protected]> wrote:
> >>
> >> This patchset adds support for zero copy rx into userspace pages using
> >> io_uring, eliminating a kernel to user copy.
> >>
> >> We configure a page pool that a driver uses to fill a hw rx queue to
> >> hand out user pages instead of kernel pages. Any data that ends up
> >> hitting this hw rx queue will thus be dma'd into userspace memory
> >> directly, without needing to be bounced through kernel memory. 'Reading'
> >> data out of a socket instead becomes a _notification_ mechanism, where
> >> the kernel tells userspace where the data is. The overall approach is
> >> similar to the devmem TCP proposal.
> >>
> >> This relies on hw header/data split, flow steering and RSS to ensure
> >> packet headers remain in kernel memory and only desired flows hit a hw
> >> rx queue configured for zero copy. Configuring this is outside of the
> >> scope of this patchset.
> >>
> >> We share netdev core infra with devmem TCP. The main difference is that
> >> io_uring is used for the uAPI and the lifetime of all objects are bound
> >> to an io_uring instance.
> >
> > I've been thinking about this a bit, and I hope this feedback isn't
> > too late, but I think your work may be useful for users not using
> > io_uring. I.e. zero copy to host memory that is not dependent on page
> > aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
>
> Not David, but come on, let's please get this moving forward. It's been
> stuck behind dependencies for seemingly forever, which are finally
> resolved.
Part of the reason this has been stuck behind dependencies for so long
is because the dependency took the time to implement things very
generically (memory providers, net_iovs) and provided you with the
primitives that enable your work. And dealt with nacks in this area
you now don't have to deal with.
> I don't think this is a reasonable ask at all for this
> patchset. If you want to work on that after the fact, then that's
> certainly an option.
I think this work is extensible to sockets and the implementation need
not be heavily tied to io_uring; yes at least leaving things open for
a socket extension to be done easier in the future would be good, IMO.
I'll look at the series more closely to see if I actually have any
concrete feedback along these lines. I hope you're open to some of it
:-)
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 12/15] io_uring/zcrx: add io_recvzc request
2024-10-09 19:27 ` Pavel Begunkov
@ 2024-10-09 19:42 ` Jens Axboe
2024-10-09 19:47 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 19:42 UTC (permalink / raw)
To: Pavel Begunkov, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 1:27 PM, Pavel Begunkov wrote:
>>>>> + /* All data completions are posted as aux CQEs. */
>>>>> + req->flags |= REQ_F_APOLL_MULTISHOT;
>>>>
>>>> This puzzles me a bit...
>>>
>>> Well, it's a multishot request. And that flag protects from cq
>>> locking rules violations, i.e. avoiding multishot reqs from
>>> posting from io-wq.
>>
>> Maybe make it more like the others and require that
>> IORING_RECV_MULTISHOT is set then, and set it based on that?
>
> if (IORING_RECV_MULTISHOT)
> return -EINVAL;
> req->flags |= REQ_F_APOLL_MULTISHOT;
>
> It can be this if that's the preference. It's a bit more consistent,
> but might be harder to use. Though I can just hide the flag behind
> liburing helpers, would spare from neverending GH issues asking
> why it's -EINVAL'ed
Maybe I'm missing something, but why not make it:
/* multishot required */
if (!(flags & IORING_RECV_MULTISHOT))
return -EINVAL;
req->flags |= REQ_F_APOLL_MULTISHOT;
and yeah just put it in the io_uring_prep_recv_zc() or whatever helper.
That would seem to be a lot more consistent with other users, no?
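For illustration, the liburing-side helper could hide the flag roughly
like this (a sketch; the opcode and helper names are assumptions, while
the sqe fields match the prep code quoted above):

	static inline void io_uring_prep_recv_zc(struct io_uring_sqe *sqe, int sockfd,
						 unsigned int ifq_idx, unsigned int flags)
	{
		io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, 0, 0);
		sqe->zcrx_ifq_idx = ifq_idx;
		/* multishot is mandatory for zcrx, so set it on behalf of the user */
		sqe->ioprio = flags | IORING_RECV_MULTISHOT;
	}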
>>>>> + zc->flags = READ_ONCE(sqe->ioprio);
>>>>> + zc->msg_flags = READ_ONCE(sqe->msg_flags);
>>>>> + if (zc->msg_flags)
>>>>> + return -EINVAL;
>>>>
>>>> Maybe allow MSG_DONTWAIT at least? You already pass that in anyway.
>>>
>>> What would the semantics be? The io_uring nowait has always
>>> been a pure mess because it's not even clear what it's supposed
>>> to mean for async requests.
>>
>> Yeah can't disagree with that. Not a big deal, doesn't really matter,
>> can stay as-is.
>
> I went through the MSG_* flags before, looking at which ones might
> even make sense here and be useful... Better to enable them later
> if they're needed.
Yep that's fine.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 19:32 ` Mina Almasry
@ 2024-10-09 19:43 ` Pavel Begunkov
2024-10-09 19:47 ` Jens Axboe
1 sibling, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 19:43 UTC (permalink / raw)
To: Mina Almasry, Jens Axboe
Cc: David Wei, io-uring, netdev, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/9/24 20:32, Mina Almasry wrote:
> On Wed, Oct 9, 2024 at 9:57 AM Jens Axboe <[email protected]> wrote:
>>
>> On 10/9/24 10:55 AM, Mina Almasry wrote:
>>> On Mon, Oct 7, 2024 at 3:16?PM David Wei <[email protected]> wrote:
>>>>
>>>> This patchset adds support for zero copy rx into userspace pages using
>>>> io_uring, eliminating a kernel to user copy.
>>>>
>>>> We configure a page pool that a driver uses to fill a hw rx queue to
>>>> hand out user pages instead of kernel pages. Any data that ends up
>>>> hitting this hw rx queue will thus be dma'd into userspace memory
>>>> directly, without needing to be bounced through kernel memory. 'Reading'
>>>> data out of a socket instead becomes a _notification_ mechanism, where
>>>> the kernel tells userspace where the data is. The overall approach is
>>>> similar to the devmem TCP proposal.
>>>>
>>>> This relies on hw header/data split, flow steering and RSS to ensure
>>>> packet headers remain in kernel memory and only desired flows hit a hw
>>>> rx queue configured for zero copy. Configuring this is outside of the
>>>> scope of this patchset.
>>>>
>>>> We share netdev core infra with devmem TCP. The main difference is that
>>>> io_uring is used for the uAPI and the lifetime of all objects are bound
>>>> to an io_uring instance.
>>>
>>> I've been thinking about this a bit, and I hope this feedback isn't
>>> too late, but I think your work may be useful for users not using
>>> io_uring. I.e. zero copy to host memory that is not dependent on page
>>> aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
>>
>> Not David, but come on, let's please get this moving forward. It's been
>> stuck behind dependencies for seemingly forever, which are finally
>> resolved.
>
> Part of the reason this has been stuck behind dependencies for so long
> is because the dependency took the time to implement things very
> generically (memory providers, net_iovs) and provided you with the
> primitives that enable your work. And dealt with nacks in this area
> you now don't have to deal with.
And that's well appreciated, but I completely share Jens' sentiment.
Is there anything like uapi concerns that prevents it from being
implemented after / separately? I'd say that for io_uring users
it's nice to have the API done the io_uring way regardless of the
socket API option, so at the very least it would fork on the completion
format and that thing would need to have a different ring/etc.
>> I don't think this is a reasonable ask at all for this
>> patchset. If you want to work on that after the fact, then that's
>> certainly an option.
>
> I think this work is extensible to sockets and the implementation need
> not be heavily tied to io_uring; yes at least leaving things open for
> a socket extension to be done easier in the future would be good, IMO
And as far as I can tell there is already a socket API allowing
all that, called devmem TCP :) It might need a slight improvement on
the registration side, unless dmabuf-wrapped user pages are good
enough.
> I'll look at the series more closely to see if I actually have any
> concrete feedback along these lines. I hope you're open to some of it
> :-)
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 19:32 ` Mina Almasry
2024-10-09 19:43 ` Pavel Begunkov
@ 2024-10-09 19:47 ` Jens Axboe
1 sibling, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 19:47 UTC (permalink / raw)
To: Mina Almasry
Cc: David Wei, io-uring, netdev, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On 10/9/24 1:32 PM, Mina Almasry wrote:
> On Wed, Oct 9, 2024 at 9:57?AM Jens Axboe <[email protected]> wrote:
>>
>> On 10/9/24 10:55 AM, Mina Almasry wrote:
>>> On Mon, Oct 7, 2024 at 3:16?PM David Wei <[email protected]> wrote:
>>>>
>>>> This patchset adds support for zero copy rx into userspace pages using
>>>> io_uring, eliminating a kernel to user copy.
>>>>
>>>> We configure a page pool that a driver uses to fill a hw rx queue to
>>>> hand out user pages instead of kernel pages. Any data that ends up
>>>> hitting this hw rx queue will thus be dma'd into userspace memory
>>>> directly, without needing to be bounced through kernel memory. 'Reading'
>>>> data out of a socket instead becomes a _notification_ mechanism, where
>>>> the kernel tells userspace where the data is. The overall approach is
>>>> similar to the devmem TCP proposal.
>>>>
>>>> This relies on hw header/data split, flow steering and RSS to ensure
>>>> packet headers remain in kernel memory and only desired flows hit a hw
>>>> rx queue configured for zero copy. Configuring this is outside of the
>>>> scope of this patchset.
>>>>
>>>> We share netdev core infra with devmem TCP. The main difference is that
>>>> io_uring is used for the uAPI and the lifetime of all objects are bound
>>>> to an io_uring instance.
>>>
>>> I've been thinking about this a bit, and I hope this feedback isn't
>>> too late, but I think your work may be useful for users not using
>>> io_uring. I.e. zero copy to host memory that is not dependent on page
>>> aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
>>
>> Not David, but come on, let's please get this moving forward. It's been
>> stuck behind dependencies for seemingly forever, which are finally
>> resolved.
>
> Part of the reason this has been stuck behind dependencies for so long
> is because the dependency took the time to implement things very
> generically (memory providers, net_iovs) and provided you with the
> primitives that enable your work. And dealt with nacks in this area
> you now don't have to deal with.
For sure, not trying to put blame on anyone here, just saying it's been
a long winding road.
>> I don't think this is a reasonable ask at all for this
>> patchset. If you want to work on that after the fact, then that's
>> certainly an option.
>
> I think this work is extensible to sockets and the implementation need
> not be heavily tied to io_uring; yes at least leaving things open for
> a socket extension to be done easier in the future would be good, IMO.
> I'll look at the series more closely to see if I actually have any
> concrete feedback along these lines. I hope you're open to some of it
> :-)
I'm really not. If someone wants to tackle that, then they are welcome
to do so after the fact. I don't want to create Yet Another dependency
that would need resolving with another patch set behind it, particularly
when no such dependency exists in the first place.
There's zero reason why anyone interested in pursuing this path can't
just do it on top.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 12/15] io_uring/zcrx: add io_recvzc request
2024-10-09 19:42 ` Jens Axboe
@ 2024-10-09 19:47 ` Pavel Begunkov
2024-10-09 19:50 ` Jens Axboe
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 19:47 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 20:42, Jens Axboe wrote:
> On 10/9/24 1:27 PM, Pavel Begunkov wrote:
>>>>>> + /* All data completions are posted as aux CQEs. */
>>>>>> + req->flags |= REQ_F_APOLL_MULTISHOT;
>>>>>
>>>>> This puzzles me a bit...
>>>>
>>>> Well, it's a multishot request. And that flag protects from cq
>>>> locking rules violations, i.e. avoiding multishot reqs from
>>>> posting from io-wq.
>>>
>>> Maybe make it more like the others and require that
>>> IORING_RECV_MULTISHOT is set then, and set it based on that?
>>
>> if (IORING_RECV_MULTISHOT)
>> return -EINVAL;
>> req->flags |= REQ_F_APOLL_MULTISHOT;
>>
>> It can be this if that's the preference. It's a bit more consistent,
>> but might be harder to use. Though I can just hide the flag behind
>> liburing helpers, which would spare us from never-ending GH issues
>> asking why it's -EINVAL'ed.
>
> Maybe I'm missing something, but why not make it:
>
> /* multishot required */
> if (!(flags & IORING_RECV_MULTISHOT))
> return -EINVAL;
> req->flags |= REQ_F_APOLL_MULTISHOT;
Right, that's what I meant before spewing a nonsensical snippet.
> and yeah just put it in the io_uring_prep_recv_zc() or whatever helper.
> That would seem to be a lot more consistent with other users, no?
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 12/15] io_uring/zcrx: add io_recvzc request
2024-10-09 19:47 ` Pavel Begunkov
@ 2024-10-09 19:50 ` Jens Axboe
0 siblings, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2024-10-09 19:50 UTC (permalink / raw)
To: Pavel Begunkov, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 1:47 PM, Pavel Begunkov wrote:
> On 10/9/24 20:42, Jens Axboe wrote:
>> On 10/9/24 1:27 PM, Pavel Begunkov wrote:
>>>>>>> + /* All data completions are posted as aux CQEs. */
>>>>>>> + req->flags |= REQ_F_APOLL_MULTISHOT;
>>>>>>
>>>>>> This puzzles me a bit...
>>>>>
>>>>> Well, it's a multishot request. And that flag protects from cq
>>>>> locking rules violations, i.e. avoiding multishot reqs from
>>>>> posting from io-wq.
>>>>
>>>> Maybe make it more like the others and require that
>>>> IORING_RECV_MULTISHOT is set then, and set it based on that?
>>>
>>> if (IORING_RECV_MULTISHOT)
>>> return -EINVAL;
>>> req->flags |= REQ_F_APOLL_MULTISHOT;
>>>
>>> It can be this if that's the preference. It's a bit more consistent,
>>> but might be harder to use. Though I can just hide the flag behind
>>> liburing helpers, which would spare us from never-ending GH issues
>>> asking why it's -EINVAL'ed.
>>
>> Maybe I'm missing something, but why not make it:
>>
>> /* multishot required */
>> if (!(flags & IORING_RECV_MULTISHOT))
>> return -EINVAL;
>> req->flags |= REQ_F_APOLL_MULTISHOT;
>
> Right, that's what I meant before spewing a nonsensical snippet.
ok phew, I was scratching my head there for a bit... All good then.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 01/15] net: devmem: pull struct definitions out of ifdef
2024-10-07 22:15 ` [PATCH v1 01/15] net: devmem: pull struct definitions out of ifdef David Wei
@ 2024-10-09 20:17 ` Mina Almasry
2024-10-09 23:16 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-10-09 20:17 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> Don't hide structure definitions under conditional compilation, it only
> makes things messier and harder to maintain. Move the struct
> dmabuf_genpool_chunk_owner definition out of the CONFIG_NET_DEVMEM ifdef
> together with a bunch of trivial inlined helpers using the structure.
>
To be honest I think the way it is is better? Having the struct
defined but never set (because the code to set it is always
compiled out) seems worse to me.
Is there a strong reason to have this? Otherwise maybe drop this?
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 02/15] net: prefix devmem specific helpers
2024-10-07 22:15 ` [PATCH v1 02/15] net: prefix devmem specific helpers David Wei
@ 2024-10-09 20:19 ` Mina Almasry
0 siblings, 0 replies; 124+ messages in thread
From: Mina Almasry @ 2024-10-09 20:19 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> Add prefixes to all helpers that are specific to devmem TCP, i.e.
> net_iov_binding[_id].
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
The rename looks fine to me, but actually, like Stan, I imagine you'd
reuse the net_devmem_dmabuf_binding (renamed to something more
generic) and only replace the dma-buf-specific pieces in it. Let's
discuss that in the patch that actually does that change.
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 03/15] net: generalise net_iov chunk owners
2024-10-07 22:15 ` [PATCH v1 03/15] net: generalise net_iov chunk owners David Wei
2024-10-08 15:46 ` Stanislav Fomichev
@ 2024-10-09 20:44 ` Mina Almasry
2024-10-09 22:13 ` Pavel Begunkov
2024-10-09 22:19 ` Pavel Begunkov
1 sibling, 2 replies; 124+ messages in thread
From: Mina Almasry @ 2024-10-09 20:44 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
> which serves as a useful abstraction to share data and provide a
> context. However, it's too devmem specific, and we want to reuse it for
> other memory providers, and for that we need to decouple net_iov from
> devmem. Make net_iov point to a new base structure called
> net_iov_area, which dmabuf_genpool_chunk_owner extends.
Similar feeling to Stan initially. I also thought you'd reuse
dmabuf_genpool_chunk_owner. Seems like you're doing that but also
renaming it to net_iov_area almost, which seems fine.
I guess, with this patch, there is no way to tell, given just a
net_iov, whether it's dmabuf or something else, right? I wonder if
that's an issue. In my mind when an skb is in tcp_recvmsg() we need to
make sure it's a dmabuf net_iov specifically to call
tcp_recvmsg_dmabuf for example. I'll look deeper here.
...
>
> static inline struct dmabuf_genpool_chunk_owner *
> -net_iov_owner(const struct net_iov *niov)
> +net_devmem_iov_to_chunk_owner(const struct net_iov *niov)
> {
> - return niov->owner;
> -}
> + struct net_iov_area *owner = net_iov_owner(niov);
>
> -static inline unsigned int net_iov_idx(const struct net_iov *niov)
> -{
> - return niov - net_iov_owner(niov)->niovs;
> + return container_of(owner, struct dmabuf_genpool_chunk_owner, area);
Couldn't this end up returning garbage if the net_iov is not actually
a dmabuf one? Is that handled somewhere in a later patch that I
missed?
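One way the series can answer this (sketched here with assumed names) is
to key the check off the pool's provider ops, mirroring the mp_ops checks
that patch 5 adds elsewhere:

	static inline bool net_is_devmem_iov(const struct net_iov *niov)
	{
		/* assumed: the owning page pool records which provider backs it */
		return niov->pp->mp_ops == &dmabuf_devmem_ops;
	}

tcp_recvmsg() could then call tcp_recvmsg_dmabuf() only when this returns
true.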
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 04/15] net: page_pool: create hooks for custom page providers
2024-10-07 22:15 ` [PATCH v1 04/15] net: page_pool: create hooks for custom page providers David Wei
@ 2024-10-09 20:49 ` Mina Almasry
2024-10-09 22:02 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-10-09 20:49 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>
> From: Jakub Kicinski <[email protected]>
>
> The page providers which try to reuse the same pages will
> need to hold onto the ref, even if the page gets released from
> the pool - as in, releasing the page from the pp just transfers
> the "ownership" reference from the pp to the provider, and the
> provider will wait for other references to be gone before feeding
> this page back into the pool.
>
> Signed-off-by: Jakub Kicinski <[email protected]>
> [Pavel] Rebased, renamed callback, +converted devmem
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
Likely needs a Cc: Christoph Hellwig <[email protected]>, given previous
feedback to this patch?
But that's going to run into the same feedback again. You don't want
to do this without the ops again?
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 05/15] net: prepare for non devmem TCP memory providers
2024-10-07 22:15 ` [PATCH v1 05/15] net: prepare for non devmem TCP memory providers David Wei
@ 2024-10-09 20:56 ` Mina Almasry
2024-10-09 21:45 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-10-09 20:56 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> There is a good bunch of places in generic paths assuming that the only
> page pool memory provider is devmem TCP. As we want to reuse the net_iov
> and provider infrastructure, we need to patch it up and explicitly check
> the provider type when we branch into devmem TCP code.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
> net/core/devmem.c | 4 ++--
> net/core/page_pool_user.c | 15 +++++++++------
> net/ipv4/tcp.c | 6 ++++++
> 3 files changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 83d13eb441b6..b0733cf42505 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -314,10 +314,10 @@ void dev_dmabuf_uninstall(struct net_device *dev)
> unsigned int i;
>
> for (i = 0; i < dev->real_num_rx_queues; i++) {
> - binding = dev->_rx[i].mp_params.mp_priv;
> - if (!binding)
> + if (dev->_rx[i].mp_params.mp_ops != &dmabuf_devmem_ops)
> continue;
>
Sorry if I missed it (and please ignore me if I did), but
dmabuf_devmem_ops is maybe not defined yet at this point in the series?
I'm also wondering how to find all the annoying places where we need
to check this. Looks like maybe a grep for net_devmem_dmabuf_binding
is the way to go? I need to check whether these are all the places we
need the check but so far looks fine.
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback
2024-10-07 22:15 ` [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback David Wei
@ 2024-10-09 21:00 ` Mina Almasry
2024-10-09 21:59 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-10-09 21:00 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> page pool is now waiting for all ppiovs to return before destroying
> itself, and for that to happen the memory provider might need to push
> some buffers, flush caches and so on.
>
> todo: we'll try to get by without it before the final release
>
Is the intention to drop this todo and stick with this patch, or to
move ahead with this patch?
To be honest, I think I read in a follow-up patch that you want to
unref all the memory on page_pool_destroy, which is not how the
page_pool is used today. Today page_pool_destroy does not reclaim
memory. Changing that may be OK.
But I'm not sure this is generic change that should be put in the
page_pool providers. I don't envision other providers implementing
this. I think they'll be more interested in using the page_pool the
way it's used today.
I would suggest that instead of making this a page_pool provider
thing, you instead have your io_uring code listen to a new generic
notification that the page_pool is being destroyed or an rx-queue is
being destroyed, or something like that, and do the scrubbing based
on that, maybe?
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 07/15] net: page pool: add helper creating area from pages
2024-10-07 22:15 ` [PATCH v1 07/15] net: page pool: add helper creating area from pages David Wei
@ 2024-10-09 21:11 ` Mina Almasry
2024-10-09 21:34 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-10-09 21:11 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> Add a helper that takes an array of pages and initialises the passed-in
> memory provider's area with them, where each net_iov takes one page.
> It's also responsible for setting up dma mappings.
>
> We keep it in page_pool.c not to leak netmem details to outside
> providers like io_uring, which don't have access to netmem_priv.h
> and other private helpers.
>
Initial feeling is that this belongs somewhere in the provider. The
functions added here don't seem generically useful to the page pool to
be honest.
The challenge seems to be netmem/net_iov dependencies. The only thing
I see you're calling is net_iov_to_netmem() and friends. Are these the
issue? I think these are in netmem.h actually. Consider including that
in the provider implementation, if it makes sense to you.
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 10/15] io_uring/zcrx: add io_zcrx_area
2024-10-07 22:15 ` [PATCH v1 10/15] io_uring/zcrx: add io_zcrx_area David Wei
2024-10-09 18:02 ` Jens Axboe
@ 2024-10-09 21:29 ` Mina Almasry
1 sibling, 0 replies; 124+ messages in thread
From: Mina Almasry @ 2024-10-09 21:29 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>
> From: David Wei <[email protected]>
>
> Add io_zcrx_area that represents a region of userspace memory that is
> used for zero copy. During ifq registration, userspace passes in the
> uaddr and len of userspace memory, which is then pinned by the kernel.
> Each net_iov is mapped to one of these pages.
>
> The freelist is a spinlock protected list that keeps track of all the
> net_iovs/pages that aren't used.
>
> For now, there is only one area per ifq and area registration happens
> implicitly as part of ifq registration. There is no API for
> adding/removing areas yet. The struct for area registration is there for
> future extensibility once we support multiple areas and TCP devmem.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
This patch, and the later patch to add the io_uring memory provider
are what I was referring to in the other thread as changes I would
like to reuse for a socket extension for this.
In my mind it would be nice to decouple the memory being bound to the
page_pool from io_uring, so I can bind a malloced/pinned block of
memory to the rx queue and use it with regular sockets. Seems a lot of
this patch and the provider can be reused. The biggest issue AFAICT
is that there are io_uring-specific calls to set up the region, like
io_buffer_validate/io_pin_pages, but these in turn seem to call
generic mm helpers, and from a quick look I don't see much that is
io_uring-specific.
Seems fine to leave this to a future extension.
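As a reference point for such reuse, the freelist from the commit message
boils down to roughly this (a sketch; the field names come from the struct
quoted earlier in the thread, while the niovs array accessor is an
assumption):

	static struct net_iov *io_zcrx_get_free_niov(struct io_zcrx_area *area)
	{
		struct net_iov *niov = NULL;

		spin_lock_bh(&area->freelist_lock);
		if (area->free_count)
			niov = &area->nia.niovs[area->freelist[--area->free_count]];
		spin_unlock_bh(&area->freelist_lock);
		return niov;
	}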
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 07/15] net: page pool: add helper creating area from pages
2024-10-09 21:11 ` Mina Almasry
@ 2024-10-09 21:34 ` Pavel Begunkov
0 siblings, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 21:34 UTC (permalink / raw)
To: Mina Almasry, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/9/24 22:11, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>
>> From: Pavel Begunkov <[email protected]>
>>
>> Add a helper that takes an array of pages and initialises passed in
>> memory provider's area with them, where each net_iov takes one page.
>> It's also responsible for setting up dma mappings.
>>
>> We keep it in page_pool.c not to leak netmem details to outside
>> providers like io_uring, which don't have access to netmem_priv.h
>> and other private helpers.
>>
>
> Initial feeling is that this belongs somewhere in the provider. The
> functions added here don't seem generically useful to the page pool to
> be honest.
>
> The challenge seems to be netmem/net_iov dependencies. The only thing
> I see you're calling is net_iov_to_netmem() and friends. Are these the
> issue? I think these are in netmem.h actually. Consider including that
> in the provider implementation, if it makes sense to you.
io_uring would need bits from netmem_priv.h and page_pool_priv.h,
and Jakub was pushing hard for the devmem patches to hide all of it
under net/core/. It's a change from last week; I believe Jakub doesn't
want any of those leaked outside, in which case net/ needs to
provide a helper.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 05/15] net: prepare for non devmem TCP memory providers
2024-10-09 20:56 ` Mina Almasry
@ 2024-10-09 21:45 ` Pavel Begunkov
2024-10-13 22:33 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 21:45 UTC (permalink / raw)
To: Mina Almasry, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/9/24 21:56, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>
>> From: Pavel Begunkov <[email protected]>
>>
>> There is a good bunch of places in generic paths assuming that the only
>> page pool memory provider is devmem TCP. As we want to reuse the net_iov
>> and provider infrastructure, we need to patch it up and explicitly check
>> the provider type when we branch into devmem TCP code.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> Signed-off-by: David Wei <[email protected]>
>> ---
>> net/core/devmem.c | 4 ++--
>> net/core/page_pool_user.c | 15 +++++++++------
>> net/ipv4/tcp.c | 6 ++++++
>> 3 files changed, 17 insertions(+), 8 deletions(-)
>>
>> diff --git a/net/core/devmem.c b/net/core/devmem.c
>> index 83d13eb441b6..b0733cf42505 100644
>> --- a/net/core/devmem.c
>> +++ b/net/core/devmem.c
>> @@ -314,10 +314,10 @@ void dev_dmabuf_uninstall(struct net_device *dev)
>> unsigned int i;
>>
>> for (i = 0; i < dev->real_num_rx_queues; i++) {
>> - binding = dev->_rx[i].mp_params.mp_priv;
>> - if (!binding)
>> + if (dev->_rx[i].mp_params.mp_ops != &dmabuf_devmem_ops)
>> continue;
>>
>
> Sorry if I missed it (and please ignore me if I did), but
> dmabuf_devmem_ops are maybe not defined yet?
You exported it in devmem.h
> I'm also wondering how to find all the annoying places where we need
> to check this. Looks like maybe a grep for net_devmem_dmabuf_binding
> is the way to go? I need to check whether these are all the places we
> need the check, but so far it looks fine.
I whack-a-mole'd them as best I could following the recent devmem TCP
changes. It would be great if you could take a look in case you remember
some more places to check. And thanks for the review!
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback
2024-10-09 21:00 ` Mina Almasry
@ 2024-10-09 21:59 ` Pavel Begunkov
2024-10-10 17:54 ` Mina Almasry
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 21:59 UTC (permalink / raw)
To: Mina Almasry, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/9/24 22:00, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>
>> From: Pavel Begunkov <[email protected]>
>>
>> page pool is now waiting for all ppiovs to return before destroying
>> itself, and for that to happen the memory provider might need to push
>> some buffers, flush caches and so on.
>>
>> todo: we'll try to get by without it before the final release
>>
>
> Is the intention to drop this todo and stick with this patch, or to
> move ahead with this patch?
Heh, I overlooked this todo. The plan is to actually leave it
as is, it's by far the simplest way and doesn't really get
in anyone's way as it's a slow path.
> To be honest, I think I read in a follow up patch that you want to
> unref all the memory on page_pool_destroy, which is not how the
> page_pool is used today. Today page_pool_destroy does not reclaim
> memory. Changing that may be OK.
It doesn't because it can't (without breaking anything), which is a
problem as the page pool might never get destroyed. io_uring
doesn't change that, a buffer can't be reclaimed while anything
in the kernel stack holds it. It's only when it's given to the
user we can force it back out of there.
And it has to happen one way or another, we can't trust the
user to put buffers back, it's just that devmem does it by temporarily
attaching the lifetime of such buffers to a socket.
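As a rough sketch, the forced reclaim described above amounts to something
like this in ->scrub; the pp->mp_priv use and the return path are assumptions
for illustration, not necessarily the posted patch:

static void io_pp_zc_scrub(struct page_pool *pp)
{
        struct io_zcrx_area *area = pp->mp_priv;  /* assumed: area stored in mp_priv */
        unsigned int i;

        for (i = 0; i < area->nia.num_niovs; i++) {
                struct net_iov *niov = &area->nia.niovs[i];

                /* buffers currently sitting in user space carry the
                 * IO_ZC_RX_UREF bias; strip it so they count as returned */
                if (atomic_long_read(&niov->pp_ref_count) < IO_ZC_RX_UREF)
                        continue;
                if (io_zcrx_niov_put(niov, IO_ZC_RX_UREF))
                        /* no users left: hand it back so the pp hold/release
                         * counters can match and destruction can complete */
                        page_pool_put_unrefed_netmem(pp, net_iov_to_netmem(niov),
                                                     -1, false);
        }
}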
> But I'm not sure this is generic change that should be put in the
> page_pool providers. I don't envision other providers implementing
> this. I think they'll be more interested in using the page_pool the
> way it's used today.
If the pp/net maintainers abhor it, I could try to replace it
with some "inventive" solution, which most likely would need
referencing all io_uring zcrx requests, but otherwise I'd
prefer to leave it as is.
> I would suggest that instead of making this a page_pool provider
> thing, to instead have your io_uring code listen to a new generic
> notification that the page_pool is being destroyed or an
> rx-queue is being destroyed or something like that, and doing the
> scrubbing based on that, maybe?
You can say it listens to the page pool being destroyed, which is exactly
what it's interested in. Trying to catch the destruction of an
rx-queue is the same thing, but jumping through more hops and indirectly
deriving that the page pool is killed.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
2024-10-07 22:15 ` [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
2024-10-09 18:10 ` Jens Axboe
@ 2024-10-09 22:01 ` Mina Almasry
2024-10-09 22:58 ` Pavel Begunkov
1 sibling, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-10-09 22:01 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> Implement a page pool memory provider for io_uring to receive in a
> zero copy fashion. For that, the provider allocates user pages wrapped
> around into struct net_iovs, that are stored in a previously registered
> struct net_iov_area.
>
> Unlike with traditional receives, for which pages from a page pool can
> be deallocated right after the user receives data, e.g. via recv(2),
> we extend the lifetime by recycling buffers only after the user space
> acknowledges that it's done processing the data via the refill queue.
> Before handing buffers to the user, we mark them by bumping the refcount
> by a bias value IO_ZC_RX_UREF, which will be checked when the buffer is
> returned back. When the corresponding io_uring instance and/or page pool
> are destroyed, we'll force back all buffers that are currently in the
> user space in ->io_pp_zc_scrub by clearing the bias.
>
This is an interesting design choice. In my experience the page_pool
works the opposite way, i.e. all the netmems in it are kept alive
until the user is done with them. Deviating from that requires custom
behavior (->scrub), which may be fine, but why do this? Isn't it
> better from a uapi perspective to keep the memory alive until the user is
done with it?
> Refcounting and lifetime:
>
> Initially, all buffers are considered unallocated and stored in
> ->freelist, at which point they are not yet directly exposed to the core
> page pool code and not accounted to page pool's pages_state_hold_cnt.
> The ->alloc_netmems callback will allocate them by placing into the
> page pool's cache, setting the refcount to 1 as usual and adjusting
> pages_state_hold_cnt.
>
> Then, either the buffer is dropped and returns back to the page pool
> into the ->freelist via io_pp_zc_release_netmem, in which case the page
> pool will match hold_cnt for us with ->pages_state_release_cnt. Or more
> likely the buffer will go through the network/protocol stacks and end up
> in the corresponding socket's receive queue. From there the user can get
> it via a new io_uring request implemented in the following patches. As
> mentioned above, before giving a buffer to the user we bump the refcount
> by IO_ZC_RX_UREF.
>
> Once the user is done with the buffer processing, it must return it back
> via the refill queue, from where our ->alloc_netmems implementation can
> grab it, check references, put IO_ZC_RX_UREF, and recycle the buffer if
> there are no more users left. As we place such buffers right back into
> the page pool's fast cache and they didn't go through the normal pp
> release path, they are still considered "allocated" and no pp hold_cnt
> is required.
Why is this needed? In general the provider is to allocate free memory
and logic as to where the memory should go (to fast cache, to normal
pp release path, etc) should remain in provider agnostic code paths in
the page_pool. Not maintainable IMO in the long run to have individual
pp providers customizing non-provider specific code or touching pp
private structs.
> For the same reason we dma sync buffers for the device
> in io_zc_add_pp_cache().
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
> include/linux/io_uring/net.h | 5 +
> io_uring/zcrx.c | 229 +++++++++++++++++++++++++++++++++++
> io_uring/zcrx.h | 6 +
> 3 files changed, 240 insertions(+)
>
> diff --git a/include/linux/io_uring/net.h b/include/linux/io_uring/net.h
> index b58f39fed4d5..610b35b451fd 100644
> --- a/include/linux/io_uring/net.h
> +++ b/include/linux/io_uring/net.h
> @@ -5,6 +5,11 @@
> struct io_uring_cmd;
>
> #if defined(CONFIG_IO_URING)
> +
> +#if defined(CONFIG_PAGE_POOL)
> +extern const struct memory_provider_ops io_uring_pp_zc_ops;
> +#endif
> +
> int io_uring_cmd_sock(struct io_uring_cmd *cmd, unsigned int issue_flags);
>
> #else
> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
> index 8382129402ac..6cd3dee8b90a 100644
> --- a/io_uring/zcrx.c
> +++ b/io_uring/zcrx.c
> @@ -2,7 +2,11 @@
> #include <linux/kernel.h>
> #include <linux/errno.h>
> #include <linux/mm.h>
> +#include <linux/nospec.h>
> +#include <linux/netdevice.h>
> #include <linux/io_uring.h>
> +#include <net/page_pool/helpers.h>
> +#include <trace/events/page_pool.h>
>
> #include <uapi/linux/io_uring.h>
>
> @@ -16,6 +20,13 @@
>
> #if defined(CONFIG_PAGE_POOL) && defined(CONFIG_INET)
>
> +static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
> +{
> + struct net_iov_area *owner = net_iov_owner(niov);
> +
> + return container_of(owner, struct io_zcrx_area, nia);
Similar to other comment in the other patch, why are we sure this
doesn't return garbage (i.e. it's accidentally called on a dmabuf
net_iov?)
> +}
> +
> static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
> struct io_uring_zcrx_ifq_reg *reg)
> {
> @@ -101,6 +112,9 @@ static int io_zcrx_create_area(struct io_ring_ctx *ctx,
> goto err;
>
> for (i = 0; i < nr_pages; i++) {
> + struct net_iov *niov = &area->nia.niovs[i];
> +
> + niov->owner = &area->nia;
> area->freelist[i] = i;
> }
>
> @@ -233,4 +247,219 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
> lockdep_assert_held(&ctx->uring_lock);
> }
>
> +static bool io_zcrx_niov_put(struct net_iov *niov, int nr)
> +{
> + return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
> +}
> +
> +static bool io_zcrx_put_niov_uref(struct net_iov *niov)
> +{
> + if (atomic_long_read(&niov->pp_ref_count) < IO_ZC_RX_UREF)
> + return false;
> +
> + return io_zcrx_niov_put(niov, IO_ZC_RX_UREF);
> +}
> +
> +static inline void io_zc_add_pp_cache(struct page_pool *pp,
> + struct net_iov *niov)
> +{
> + netmem_ref netmem = net_iov_to_netmem(niov);
> +
> +#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
> + if (pp->dma_sync && dma_dev_need_sync(pp->p.dev)) {
IIRC we force that dma_sync == true for memory providers, unless you
changed that and I missed it.
> + dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem);
> +
> + dma_sync_single_range_for_device(pp->p.dev, dma_addr,
> + pp->p.offset, pp->p.max_len,
> + pp->p.dma_dir);
> + }
> +#endif
> +
> + page_pool_fragment_netmem(netmem, 1);
> + pp->alloc.cache[pp->alloc.count++] = netmem;
IMO touching pp internals in a provider should not be acceptable.
pp->alloc.cache is a data structure private to the page_pool and
should not be touched at all by any specific memory provider. Not
maintainable in the long run tbh for individual pp providers to mess
with pp private structs and we hunt for bugs that are reproducible
with 1 pp provider or another, or have to deal with the mental strain
of provider specific handling in what is supposed to be generic
page_pool paths.
IMO the provider must implement the 4 'ops' (alloc, free, init,
destroy) and must not touch pp privates while doing so. If you need to
change how pp recycling works then it needs to be done in a provider
agnostic way.
I think the dmabuf provider and Jakub's huge page provider both
implemented the ops while never touching pp internals. I wonder if we
can follow this lead.
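For reference, the ops set being discussed here is roughly the following; the
prototypes are inferred from the callbacks named in this series and may differ
in detail:

struct memory_provider_ops {
        netmem_ref (*alloc_netmems)(struct page_pool *pool, gfp_t gfp);
        bool (*release_netmem)(struct page_pool *pool, netmem_ref netmem);
        int (*init)(struct page_pool *pool);
        void (*destroy)(struct page_pool *pool);
        /* added by patch 06 and questioned in this subthread */
        void (*scrub)(struct page_pool *pool);
};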
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 04/15] net: page_pool: create hooks for custom page providers
2024-10-09 20:49 ` Mina Almasry
@ 2024-10-09 22:02 ` Pavel Begunkov
0 siblings, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 22:02 UTC (permalink / raw)
To: Mina Almasry, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/9/24 21:49, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>
>> From: Jakub Kicinski <[email protected]>
>>
>> The page providers which try to reuse the same pages will
>> need to hold onto the ref, even if page gets released from
>> the pool - as in releasing the page from the pp just transfers
>> the "ownership" reference from pp to the provider, and provider
>> will wait for other references to be gone before feeding this
>> page back into the pool.
>>
>> Signed-off-by: Jakub Kicinski <[email protected]>
>> [Pavel] Rebased, renamed callback, +converted devmem
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> Signed-off-by: David Wei <[email protected]>
>
> Likely needs a Cc: Christoph Hellwig <[email protected]>, given previous
> feedback to this patch?
I wouldn't bother, I don't believe it's done in good faith.
> But that's going to run into the same feedback again. You don't want
> to do this without the ops again?
Well, the guy sprinkles nacks like confetti, and it's getting hard
to take them seriously. It's a net change, I'll leave it to the
net maintainers to judge.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 03/15] net: generalise net_iov chunk owners
2024-10-09 20:44 ` Mina Almasry
@ 2024-10-09 22:13 ` Pavel Begunkov
2024-10-09 22:19 ` Pavel Begunkov
1 sibling, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 22:13 UTC (permalink / raw)
To: Mina Almasry, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/9/24 21:44, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>
>> From: Pavel Begunkov <[email protected]>
>>
>> Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
>> which serves as a useful abstraction to share data and provide a
>> context. However, it's too devmem specific, and we want to reuse it for
>> other memory providers, and for that we need to decouple net_iov from
>> devmem. Make net_iov point to a new base structure called
>> net_iov_area, which dmabuf_genpool_chunk_owner extends.
>
>
> Similar feeling to Stan initially. I also thought you'd reuse
> dmabuf_genpool_chunk_owner. Seems like you're doing that but also
> renaming it to net_iov_area almost, which seems fine.
>
> I guess, with this patch, there is no way to tell, given just a
> net_iov whether it's dmabuf or something else, right? I wonder if
Intentionally, there is no good/clear way to tell if it's a dmabuf-
or page-backed net_iov in the generic path, but you can easily
check if it's devmem or io_uring by comparing the page pool's ops.
net_iov::pp should always be available when it's in the net stack.
5/15 does exactly that in the devmem tcp portion of tcp.c.
> that's an issue. In my mind when an skb is in tcp_recvmsg() we need to
> make sure it's a dmabuf net_iov specifically to call
> tcp_recvmsg_dmabuf for example. I'll look deeper here.
Mentioned above, patch 5/15 handles that.
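The kind of boundary check being referred to looks roughly like this; the
mp_ops comparison is the point, the surrounding helper names are approximate:

        /* tcp.c, generic -> devmem boundary (sketch): only devmem-backed
         * net_iovs may take the dmabuf receive path */
        niov = skb_frag_net_iov(frag);
        if (niov->pp->mp_ops != &dmabuf_devmem_ops) {
                err = -ENODEV;
                goto out;
        }
        /* ... continue into tcp_recvmsg_dmabuf() as before ... */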
>> static inline struct dmabuf_genpool_chunk_owner *
>> -net_iov_owner(const struct net_iov *niov)
>> +net_devmem_iov_to_chunk_owner(const struct net_iov *niov)
>> {
>> - return niov->owner;
>> -}
>> + struct net_iov_area *owner = net_iov_owner(niov);
>>
>> -static inline unsigned int net_iov_idx(const struct net_iov *niov)
>> -{
>> - return niov - net_iov_owner(niov)->niovs;
>> + return container_of(owner, struct dmabuf_genpool_chunk_owner, area);
>
> Couldn't this end up returning garbage if the net_iov is not actually
> a dmabuf one? Is that handled somewhere in a later patch that I
> missed?
Surely it will if someone manages to use it with non-devmem net_iovs,
which is why I renamed it to "devmem*".
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 03/15] net: generalise net_iov chunk owners
2024-10-09 20:44 ` Mina Almasry
2024-10-09 22:13 ` Pavel Begunkov
@ 2024-10-09 22:19 ` Pavel Begunkov
1 sibling, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 22:19 UTC (permalink / raw)
To: Mina Almasry, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/9/24 21:44, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>
>> From: Pavel Begunkov <[email protected]>
>>
>> Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
>> which serves as a useful abstraction to share data and provide a
>> context. However, it's too devmem specific, and we want to reuse it for
>> other memory providers, and for that we need to decouple net_iov from
>> devmem. Make net_iov point to a new base structure called
>> net_iov_area, which dmabuf_genpool_chunk_owner extends.
>
>
> Similar feeling to Stan initially. I also thought you'd reuse
> dmabuf_genpool_chunk_owner. Seems like you're doing that but also
> renaming it to net_iov_area almost, which seems fine.
I did give it a thought a long time ago; I was thinking of having
chunk_owner / area store a pointer to some kind of context,
i.e. the binding for devmem, but then you need to store a void*
instead of a well-typed net_devmem_dmabuf_binding / etc., while
io_uring would still need to cast the owner / area anyway, since
it needs a fair share of additional fields.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
2024-10-09 22:01 ` Mina Almasry
@ 2024-10-09 22:58 ` Pavel Begunkov
2024-10-10 18:19 ` Mina Almasry
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 22:58 UTC (permalink / raw)
To: Mina Almasry, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/9/24 23:01, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>
>> From: Pavel Begunkov <[email protected]>
>>
>> Implement a page pool memory provider for io_uring to receive in a
>> zero copy fashion. For that, the provider allocates user pages wrapped
>> around into struct net_iovs, that are stored in a previously registered
>> struct net_iov_area.
>>
>> Unlike with traditional receives, for which pages from a page pool can
>> be deallocated right after the user receives data, e.g. via recv(2),
>> we extend the lifetime by recycling buffers only after the user space
>> acknowledges that it's done processing the data via the refill queue.
>> Before handing buffers to the user, we mark them by bumping the refcount
>> by a bias value IO_ZC_RX_UREF, which will be checked when the buffer is
>> returned back. When the corresponding io_uring instance and/or page pool
>> are destroyed, we'll force back all buffers that are currently in the
>> user space in ->io_pp_zc_scrub by clearing the bias.
>>
>
> This is an interesting design choice. In my experience the page_pool
> works the opposite way, i.e. all the netmems in it are kept alive
> until the user is done with them. Deviating from that requires custom
> behavior (->scrub), which may be fine, but why do this? Isn't it
> > better from a uapi perspective to keep the memory alive until the user is
> done with it?
It's hardly interesting, it's _exactly_ the same thing devmem TCP
does by attaching the lifetime of buffers to a socket's xarray,
which requires custom behaviour. Maybe I wasn't clear on one thing
though, it's accounting from the page pool's perspective. Those are
user pages, likely still mapped into the user space, in which case
they're not going to be destroyed.
>> Refcounting and lifetime:
>>
>> Initially, all buffers are considered unallocated and stored in
>> ->freelist, at which point they are not yet directly exposed to the core
>> page pool code and not accounted to page pool's pages_state_hold_cnt.
>> The ->alloc_netmems callback will allocate them by placing into the
>> page pool's cache, setting the refcount to 1 as usual and adjusting
>> pages_state_hold_cnt.
>>
>> Then, either the buffer is dropped and returns back to the page pool
>> into the ->freelist via io_pp_zc_release_netmem, in which case the page
>> pool will match hold_cnt for us with ->pages_state_release_cnt. Or more
>> likely the buffer will go through the network/protocol stacks and end up
>> in the corresponding socket's receive queue. From there the user can get
>> it via a new io_uring request implemented in the following patches. As
>> mentioned above, before giving a buffer to the user we bump the refcount
>> by IO_ZC_RX_UREF.
>>
>> Once the user is done with the buffer processing, it must return it back
>> via the refill queue, from where our ->alloc_netmems implementation can
>> grab it, check references, put IO_ZC_RX_UREF, and recycle the buffer if
>> there are no more users left. As we place such buffers right back into
>> the page pool's fast cache and they didn't go through the normal pp
>> release path, they are still considered "allocated" and no pp hold_cnt
>> is required.
>
> Why is this needed? In general the provider is to allocate free memory
I don't get it, what is "this"? If it's the refill queue, that's because
I don't like actively returning buffers back via syscall / setsockopt
and trying to transfer them into the napi context (i.e.
napi_pp_put_page) hoping it works / is cached well.
If "this" is IO_ZC_RX_UREF, it's because we need to track when a
buffer is given to the userspace, and I don't think some kind of
map / xarray in the hot path is the best solution for performance.
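As a sketch, the refill side described above boils down to something like this
inside ->alloc_netmems; the ring accessors here are hypothetical names:

        /* drain what user space queued on the shared refill ring */
        while ((rqe = io_zcrx_get_rqe(ifq)) != NULL) {
                struct net_iov *niov = io_zcrx_rqe_to_niov(area, rqe);

                /* IO_ZC_RX_UREF marks "handed to user space"; drop it and
                 * only recycle if nothing in the stack still holds it */
                if (!io_zcrx_put_niov_uref(niov))
                        continue;
                /* refcount back to 1 and straight into the pp fast cache,
                 * as io_zc_add_pp_cache() in this patch does */
                io_zc_add_pp_cache(pp, niov);
        }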
> and logic as to where the memory should go (to fast cache, to normal
> pp release path, etc) should remain in provider agnostic code paths in
> the page_pool. Not maintainable IMO in the long run to have individual
Please do elaborate what exactly is not maintainable here
> pp providers customizing non-provider specific code or touching pp
> private structs.
...
>> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
>> index 8382129402ac..6cd3dee8b90a 100644
>> --- a/io_uring/zcrx.c
>> +++ b/io_uring/zcrx.c
>> @@ -2,7 +2,11 @@
...
>> +static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
>> +{
>> + struct net_iov_area *owner = net_iov_owner(niov);
>> +
>> + return container_of(owner, struct io_zcrx_area, nia);
>
> Similar to other comment in the other patch, why are we sure this
> doesn't return garbage (i.e. it's accidentally called on a dmabuf
> net_iov?)
There couldn't be any net_iov at this point not belonging to
the current io_uring instance / etc. Same with devmem TCP,
devmem callbacks can't be called for some random net_iov; the
only place you need to explicitly check is where it crosses
from the generic path to a devmem-aware path, like that patched
chunk in tcp.c.
>> +static inline void io_zc_add_pp_cache(struct page_pool *pp,
>> + struct net_iov *niov)
>> +{
>> + netmem_ref netmem = net_iov_to_netmem(niov);
>> +
>> +#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
>> + if (pp->dma_sync && dma_dev_need_sync(pp->p.dev)) {
>
> IIRC we force that dma_sync == true for memory providers, unless you
> changed that and I missed it.
I'll take a look, might remove it.
>> + dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem);
>> +
>> + dma_sync_single_range_for_device(pp->p.dev, dma_addr,
>> + pp->p.offset, pp->p.max_len,
>> + pp->p.dma_dir);
>> + }
>> +#endif
>> +
>> + page_pool_fragment_netmem(netmem, 1);
>> + pp->alloc.cache[pp->alloc.count++] = netmem;
>
> IMO touching pp internals in a provider should not be acceptable.
Ok, I can add a page pool helper for that.
> pp->alloc.cache is a data structure private to the page_pool and
> should not be touched at all by any specific memory provider. Not
> maintainable in the long run tbh for individual pp providers to mess
> with pp private structs and we hunt for bugs that are reproducible
> with 1 pp provider or another, or have to deal with the mental strain
> of provider specific handling in what is supposed to be generic
> page_pool paths.
I get what you're trying to say about not touching internals,
I agree with that, but I can't share the sentiment about debugging.
It's a pretty specific api, users running io_uring almost always
write directly to io_uring and we solve it. If it happens that's not
the case, please do redirect the issue.
> IMO the provider must implement the 4 'ops' (alloc, free, init,
Doing 1 buffer per callback wouldn't be scalable at speeds
we're looking at.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 13/15] io_uring/zcrx: add copy fallback
2024-10-09 16:30 ` Stanislav Fomichev
@ 2024-10-09 23:05 ` Pavel Begunkov
2024-10-11 6:22 ` David Wei
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 23:05 UTC (permalink / raw)
To: Stanislav Fomichev, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
On 10/9/24 17:30, Stanislav Fomichev wrote:
> On 10/08, David Wei wrote:
>> On 2024-10-08 08:58, Stanislav Fomichev wrote:
>>> On 10/07, David Wei wrote:
>>>> From: Pavel Begunkov <[email protected]>
>>>>
>>>> There are scenarios in which the zerocopy path might get a normal
>>>> in-kernel buffer, it could be a mis-steered packet or simply the linear
>>>> part of an skb. Another use case is to allow the driver to allocate
>>>> kernel pages when it's out of zc buffers, which makes it more resilient
>>>> to spikes in load and allows the user to choose the balance between the
>>>> amount of memory provided and performance.
>>>
>>> Tangential: should there be some clear way for the users to discover that
>>> (some counter of some entry on cq about copy fallback)?
>>>
>>> Or is the expectation that somebody will run bpftrace to diagnose
>>> (supposedly) poor ZC performance when it falls back to copy?
>>
>> Yeah there definitely needs to be a way to notify the user that copy
>> fallback happened. Right now I'm relying on bpftrace hooking into
>> io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
>> too much. I can think of two other options:
>>
>> 1. Send a final cqe at the end of a number of frag cqes with a count of
>> the number of copies.
>> 2. Register a secondary area just for handling copies.
>>
>> Other suggestions are also very welcome.
>
> SG, thanks. Up to you and Pavel on the mechanism and whether to follow
> up separately. Maybe even move this fallback (this patch) into that separate
> series as well? Will be easier to review/accept the rest.
I think it's fine to leave it? It shouldn't be particularly
interesting to the net folks to review, and without it any skb
with a linear part would break it, but perhaps it's not such
a concern for bnxt.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 01/15] net: devmem: pull struct definitions out of ifdef
2024-10-09 20:17 ` Mina Almasry
@ 2024-10-09 23:16 ` Pavel Begunkov
2024-10-10 18:01 ` Mina Almasry
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-09 23:16 UTC (permalink / raw)
To: Mina Almasry, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/9/24 21:17, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>
>> From: Pavel Begunkov <[email protected]>
>>
>> Don't hide structure definitions under conditional compilation, it only
>> makes things messier and harder to maintain. Move struct
>> dmabuf_genpool_chunk_owner definition out of CONFIG_NET_DEVMEM ifdef
>> together with a bunch of trivial inlined helpers using the structure.
>>
>
> To be honest I think the way it is is better? Having the struct
> defined but always not set (because the code to set it is always
> compiled out) seems worse to me.
>
> Is there a strong reason to have this? Otherwise maybe drop this?
I can drop it if there are strong opinions on that, but I'm
allergic to ifdef hell and just trying to help to avoid it becoming
so. I even believe it's considered a bad pattern (is it?).
As for a more technical description "why", it reduces the line count
and you don't need to duplicate functions. It's always annoying
making sure the prototypes stay the same, but this way it's always
compiled and syntactically checked. And when refactoring anything
like the next patch does, you only need to change one function
but not both. Do you find that convincing?
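A small, generic illustration of the difference, using net_iov_idx() as the
example:

/* pattern the patch removes: every helper needs a stubbed twin and the
 * prototypes have to be kept in sync by hand */
#ifdef CONFIG_NET_DEVMEM
static inline unsigned int net_iov_idx(const struct net_iov *niov)
{
        return niov - net_iov_owner(niov)->niovs;
}
#else
static inline unsigned int net_iov_idx(const struct net_iov *niov)
{
        return 0;
}
#endif

/* pattern the patch moves to: one always-compiled, always syntax-checked
 * definition; only code that actually binds dmabuf memory stays under
 * CONFIG_NET_DEVMEM */
static inline unsigned int net_iov_idx(const struct net_iov *niov)
{
        return niov - net_iov_owner(niov)->niovs;
}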
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 14/15] io_uring/zcrx: set pp memory provider for an rx queue
2024-10-09 18:42 ` Jens Axboe
@ 2024-10-10 13:09 ` Pavel Begunkov
2024-10-10 13:19 ` Jens Axboe
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-10 13:09 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 19:42, Jens Axboe wrote:
> On 10/7/24 4:16 PM, David Wei wrote:
>> From: David Wei <[email protected]>
...
>> if (copy_to_user(arg, &reg, sizeof(reg))) {
>> + io_close_zc_rxq(ifq);
>> ret = -EFAULT;
>> goto err;
>> }
>> if (copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) {
>> + io_close_zc_rxq(ifq);
>> ret = -EFAULT;
>> goto err;
>> }
>> ctx->ifq = ifq;
>> return 0;
>
> Not added in this patch, but since I was looking at rtnl lock coverage,
> it's OK to potentially fault while holding this lock? I'm assuming it
> is, as I can't imagine any faulting needing to grab it. Not even from
> nbd ;-)
I believe it should be fine to fault, but regardless neither this
chunk nor page pinning is under rtnl. Of the heavy stuff, only
netdev_rx_queue_restart() is under it; we intentionally try to
minimise that section as it's a global lock.
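In other words, the ordering in the registration path is roughly as follows;
this is a sketch of the flow, not the exact patch:

        /* heavy setup (page pinning, area creation) runs without rtnl */
        ret = io_zcrx_create_area(ctx, ifq, &area_reg);
        if (ret)
                goto err;

        rtnl_lock();
        ret = netdev_rx_queue_restart(dev, qid);  /* the only rtnl section */
        rtnl_unlock();
        if (ret)
                goto err;

        /* copy_to_user() may fault, but rtnl has already been dropped */
        if (copy_to_user(arg, &reg, sizeof(reg))) {
                io_close_zc_rxq(ifq);
                return -EFAULT;
        }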
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 18:21 ` Pedro Tammela
@ 2024-10-10 13:19 ` Pavel Begunkov
2024-10-11 0:35 ` David Wei
1 sibling, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-10 13:19 UTC (permalink / raw)
To: Pedro Tammela, Mina Almasry, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/9/24 19:21, Pedro Tammela wrote:
> On 09/10/2024 13:55, Mina Almasry wrote:
>> [...]
>>
>> If not, I would like to see a comparison between TCP RX zerocopy and
>> this new io-uring zerocopy. For Google for example we use the TCP RX
>> zerocopy, I would like to see perf numbers possibly motivating us to
>> move to this new thing.
>>
>> [1] https://lwn.net/Articles/752046/
>>
>
> Hi!
>
> From my own testing, the TCP RX Zerocopy is quite heavy on the page unmapping side. Since the io_uring implementation is expected to be lighter (see patch 11), I would expect a simple comparison to show better numbers for io_uring.
Let's see if kperf supports it or whether it can be easily added, but
since page flipping requires heavy mmap amortisation, it looks like the
interfaces even cover different sets of users; in that sense comparing
against copy could IMHO be more interesting.
> To be fair to the existing implementation, it would then need to be paired with some 'real' computation, but that varies a lot. As we presented at netdevconf this year, HW-GRO eventually was the best option for us (no app changes, etc...) but it's still a case by case decision.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 14/15] io_uring/zcrx: set pp memory provider for an rx queue
2024-10-10 13:09 ` Pavel Begunkov
@ 2024-10-10 13:19 ` Jens Axboe
0 siblings, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2024-10-10 13:19 UTC (permalink / raw)
To: Pavel Begunkov, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/10/24 7:09 AM, Pavel Begunkov wrote:
> On 10/9/24 19:42, Jens Axboe wrote:
>> On 10/7/24 4:16 PM, David Wei wrote:
>>> From: David Wei <[email protected]>
> ...
>>> if (copy_to_user(arg, &reg, sizeof(reg))) {
>>> + io_close_zc_rxq(ifq);
>>> ret = -EFAULT;
>>> goto err;
>>> }
>>> if (copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) {
>>> + io_close_zc_rxq(ifq);
>>> ret = -EFAULT;
>>> goto err;
>>> }
>>> ctx->ifq = ifq;
>>> return 0;
>>
>> Not added in this patch, but since I was looking at rtnl lock coverage,
>> it's OK to potentially fault while holding this lock? I'm assuming it
>> is, as I can't imagine any faulting needing to grab it. Not even from
>> nbd ;-)
>
> I believe it should be fine to fault, but regardless neither this
> chunk nor page pinning is under rtnl. Of the heavy stuff, only
> netdev_rx_queue_restart() is under it; we intentionally try to
> minimise that section as it's a global lock.
Yep you're right, it is dropped before this section anyway.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 17:12 ` Jens Axboe
@ 2024-10-10 14:21 ` Jens Axboe
2024-10-10 15:03 ` David Ahern
2024-10-10 18:11 ` Jens Axboe
0 siblings, 2 replies; 124+ messages in thread
From: Jens Axboe @ 2024-10-10 14:21 UTC (permalink / raw)
To: David Ahern, David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, Mina Almasry
On 10/9/24 11:12 AM, Jens Axboe wrote:
> On 10/9/24 10:53 AM, Jens Axboe wrote:
>> On 10/9/24 10:50 AM, Jens Axboe wrote:
>>> On 10/9/24 10:35 AM, David Ahern wrote:
>>>> On 10/9/24 9:43 AM, Jens Axboe wrote:
>>>>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>>>>> as the sender, but then you're capped on the non-zc sender being too
>>>>> slow. The intel box does better, but it's still basically maxing out the
>>>>> sender at this point. So yeah, with a faster (or more efficient sender),
>>>>
>>>> I am surprised by this comment. You should not see a Tx limited test
>>>> (including CPU bound sender). Tx with ZC has been the easy option for a
>>>> while now.
>>>
>>> I just set this up to test yesterday and just used default! I'm sure
>>> there is a zc option, just not the default and hence it wasn't used.
>>> I'll give it a spin, will be useful for 200G testing.
>>
>> I think we're talking past each other. Yes send with zerocopy is
>> available for a while now, both with io_uring and just sendmsg(), but
>> I'm using kperf for testing and it does not look like it supports it.
>> Might have to add it... We'll see how far I can get without it.
>
> Stanislav pointed me at:
>
> https://github.com/facebookexperimental/kperf/pull/2
>
> which adds zc send. I ran a quick test, and it does reduce cpu
> utilization on the sender from 100% to 95%. I'll keep poking...
Update on this - did more testing and the 100 -> 95 was a bit of a
fluke, it's still maxed. So I added io_uring send and sendzc support to
kperf, and I still saw the sendzc being maxed out sending at 100G rates
with 100% cpu usage.
Poked a bit, and the reason is that it's all memcpy() off
skb_orphan_frags_rx() -> skb_copy_ubufs(). At this point I asked Pavel
as that made no sense to me, and turns out the kernel thinks there's a
tap on the device. Maybe there is, haven't looked at that yet, but I
just killed the orphaning and tested again.
This looks better, now I can get 100G line rate from a single thread
using io_uring sendzc using only 30% of the single cpu/thread (including
irq time). That is good news, as it unlocks being able to test > 100G as
the sender is no longer the bottleneck.
Tap side still a mystery, but it unblocked testing. I'll figure that
part out separately.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-10 14:21 ` Jens Axboe
@ 2024-10-10 15:03 ` David Ahern
2024-10-10 15:15 ` Jens Axboe
2024-10-10 18:11 ` Jens Axboe
1 sibling, 1 reply; 124+ messages in thread
From: David Ahern @ 2024-10-10 15:03 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, Mina Almasry
On 10/10/24 8:21 AM, Jens Axboe wrote:
>> which adds zc send. I ran a quick test, and it does reduce cpu
>> utilization on the sender from 100% to 95%. I'll keep poking...
>
> Update on this - did more testing and the 100 -> 95 was a bit of a
> fluke, it's still maxed. So I added io_uring send and sendzc support to
> kperf, and I still saw the sendzc being maxed out sending at 100G rates
> with 100% cpu usage.
>
> Poked a bit, and the reason is that it's all memcpy() off
> skb_orphan_frags_rx() -> skb_copy_ubufs(). At this point I asked Pavel
> as that made no sense to me, and turns out the kernel thinks there's a
> tap on the device. Maybe there is, haven't looked at that yet, but I
> just killed the orphaning and tested again.
>
> This looks better, now I can get 100G line rate from a single thread
> using io_uring sendzc using only 30% of the single cpu/thread (including
> irq time). That is good news, as it unlocks being able to test > 100G as
> the sender is no longer the bottleneck.
>
> Tap side still a mystery, but it unblocked testing. I'll figure that
> part out separately.
>
Thanks for the update. 30% cpu is more in line with my testing.
For the "tap" you need to make sure no packet socket applications are
running -- e.g., lldpd is a typical one I have seen in tests. Check
/proc/net/packet
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-10 15:03 ` David Ahern
@ 2024-10-10 15:15 ` Jens Axboe
0 siblings, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2024-10-10 15:15 UTC (permalink / raw)
To: David Ahern, David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, Mina Almasry
On 10/10/24 9:03 AM, David Ahern wrote:
> On 10/10/24 8:21 AM, Jens Axboe wrote:
>>> which adds zc send. I ran a quick test, and it does reduce cpu
>>> utilization on the sender from 100% to 95%. I'll keep poking...
>>
>> Update on this - did more testing and the 100 -> 95 was a bit of a
>> fluke, it's still maxed. So I added io_uring send and sendzc support to
>> kperf, and I still saw the sendzc being maxed out sending at 100G rates
>> with 100% cpu usage.
>>
>> Poked a bit, and the reason is that it's all memcpy() off
>> skb_orphan_frags_rx() -> skb_copy_ubufs(). At this point I asked Pavel
>> as that made no sense to me, and turns out the kernel thinks there's a
>> tap on the device. Maybe there is, haven't looked at that yet, but I
>> just killed the orphaning and tested again.
>>
>> This looks better, now I can get 100G line rate from a single thread
>> using io_uring sendzc using only 30% of the single cpu/thread (including
>> irq time). That is good news, as it unlocks being able to test > 100G as
>> the sender is no longer the bottleneck.
>>
>> Tap side still a mystery, but it unblocked testing. I'll figure that
>> part out separately.
>>
>
> Thanks for the update. 30% cpu is more inline with my testing.
>
> For the "tap" you need to make sure no packet socket applications are
> running -- e.g., lldpd is a typical one I have seen in tests. Check
> /proc/net/packet
Here's what I see:
sk RefCnt Type Proto Iface R Rmem User Inode
0000000078c66cbc 3 3 0003 2 1 0 0 112645
00000000558db352 3 3 0003 2 1 0 0 109578
00000000486837f4 3 3 0003 4 1 0 0 109580
00000000f7c6edd6 3 3 0003 4 1 0 0 22563
000000006ec0363c 3 3 0003 2 1 0 0 22565
0000000095e63bff 3 3 0003 5 1 0 0 22567
was just now poking at what could be causing this. This is a server box,
nothing really is running on it... The nic in question is ifindex 2.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback
2024-10-09 21:59 ` Pavel Begunkov
@ 2024-10-10 17:54 ` Mina Almasry
2024-10-13 17:25 ` David Wei
0 siblings, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-10-10 17:54 UTC (permalink / raw)
To: Pavel Begunkov
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <[email protected]> wrote:
>
> On 10/9/24 22:00, Mina Almasry wrote:
> > On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
> >>
> >> From: Pavel Begunkov <[email protected]>
> >>
> >> page pool is now waiting for all ppiovs to return before destroying
> >> itself, and for that to happen the memory provider might need to push
> >> some buffers, flush caches and so on.
> >>
> >> todo: we'll try to get by without it before the final release
> >>
> >
> > Is the intention to drop this todo and stick with this patch, or to
> > move ahead with this patch?
>
> Heh, I overlooked this todo. The plan is to actually leave it
> as is, it's by far the simplest way and doesn't really get
> in anyone's way as it's a slow path.
>
> > To be honest, I think I read in a follow up patch that you want to
> > unref all the memory on page_pool_destroy, which is not how the
> > page_pool is used today. Today page_pool_destroy does not reclaim
> > memory. Changing that may be OK.
>
> It doesn't because it can't (without breaking anything), which is a
> problem as the page pool might never get destroyed. io_uring
> doesn't change that, a buffer can't be reclaimed while anything
> in the kernel stack holds it. It's only when it's given to the
> user we can force it back out of there.
>
> And it has to happen one way or another, we can't trust the
> user to put buffers back, it's just that devmem does it by temporarily
> attaching the lifetime of such buffers to a socket.
>
(noob question) does io_uring not have a socket equivalent that you
can tie the lifetime of the buffers to? I'm thinking there must be
one, because in your patches IIRC you have the fill queues and the
memory you bind from userspace, so there should be something that
tells you that userspace has exited/crashed and it's time to
destroy the fill queue and unbind the memory, right?
I'm thinking you may want to bind the lifetime of the buffers to that,
instead of the lifetime of the pool. The pool will not be destroyed
until the next driver/reset reconfiguration happens, right? That could
be long long after the userspace has stopped using the memory.
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 01/15] net: devmem: pull struct definitions out of ifdef
2024-10-09 23:16 ` Pavel Begunkov
@ 2024-10-10 18:01 ` Mina Almasry
2024-10-10 18:57 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-10-10 18:01 UTC (permalink / raw)
To: Pavel Begunkov
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Wed, Oct 9, 2024 at 4:16 PM Pavel Begunkov <[email protected]> wrote:
>
> On 10/9/24 21:17, Mina Almasry wrote:
> > On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
> >>
> >> From: Pavel Begunkov <[email protected]>
> >>
> >> Don't hide structure definitions under conditional compilation, it only
> >> makes things messier and harder to maintain. Move struct
> >> dmabuf_genpool_chunk_owner definition out of CONFIG_NET_DEVMEM ifdef
> >> together with a bunch of trivial inlined helpers using the structure.
> >>
> >
> > To be honest I think the way it is is better? Having the struct
> > defined but always not set (because the code to set it is always
> > compiled out) seems worse to me.
> >
> > Is there a strong reason to have this? Otherwise maybe drop this?
> I can drop it if there are strong opinions on that, but I'm
> allergic to ifdef hell and just trying to help to avoid it becoming
> so. I even believe it's considered a bad pattern (is it?).
>
> As for a more technical description "why", it reduces the line count
> and you don't need to duplicate functions. It's always annoying
>> making sure the prototypes stay the same, but this way it's always
> compiled and syntactically checked. And when refactoring anything
> like the next patch does, you only need to change one function
> but not both. Do you find that convincing?
>
To be honest the tradeoff wins in the other direction for me. The
extra boiler plate is not that bad, and we can be sure that any code
that touches net_devmem_dmabuf_binding will get valid internals
since it won't compile if the feature is disabled. This could be
critical and could be preventing bugs.
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-10 14:21 ` Jens Axboe
2024-10-10 15:03 ` David Ahern
@ 2024-10-10 18:11 ` Jens Axboe
2024-10-14 8:42 ` David Laight
1 sibling, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2024-10-10 18:11 UTC (permalink / raw)
To: David Ahern, David Wei, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, Mina Almasry
On 10/10/24 8:21 AM, Jens Axboe wrote:
> On 10/9/24 11:12 AM, Jens Axboe wrote:
>> On 10/9/24 10:53 AM, Jens Axboe wrote:
>>> On 10/9/24 10:50 AM, Jens Axboe wrote:
>>>> On 10/9/24 10:35 AM, David Ahern wrote:
>>>>> On 10/9/24 9:43 AM, Jens Axboe wrote:
>>>>>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>>>>>> as the sender, but then you're capped on the non-zc sender being too
>>>>>> slow. The intel box does better, but it's still basically maxing out the
>>>>>> sender at this point. So yeah, with a faster (or more efficient sender),
>>>>>
>>>>> I am surprised by this comment. You should not see a Tx limited test
>>>>> (including CPU bound sender). Tx with ZC has been the easy option for a
>>>>> while now.
>>>>
>>>> I just set this up to test yesterday and just used default! I'm sure
>>>> there is a zc option, just not the default and hence it wasn't used.
>>>> I'll give it a spin, will be useful for 200G testing.
>>>
>>> I think we're talking past each other. Yes send with zerocopy is
>>> available for a while now, both with io_uring and just sendmsg(), but
>>> I'm using kperf for testing and it does not look like it supports it.
>>> Might have to add it... We'll see how far I can get without it.
>>
>> Stanislav pointed me at:
>>
>> https://github.com/facebookexperimental/kperf/pull/2
>>
>> which adds zc send. I ran a quick test, and it does reduce cpu
>> utilization on the sender from 100% to 95%. I'll keep poking...
>
> Update on this - did more testing and the 100 -> 95 was a bit of a
> fluke, it's still maxed. So I added io_uring send and sendzc support to
> kperf, and I still saw the sendzc being maxed out sending at 100G rates
> with 100% cpu usage.
>
> Poked a bit, and the reason is that it's all memcpy() off
> skb_orphan_frags_rx() -> skb_copy_ubufs(). At this point I asked Pavel
> as that made no sense to me, and turns out the kernel thinks there's a
> tap on the device. Maybe there is, haven't looked at that yet, but I
> just killed the orphaning and tested again.
>
> This looks better, now I can get 100G line rate from a single thread
> using io_uring sendzc using only 30% of the single cpu/thread (including
> irq time). That is good news, as it unlocks being able to test > 100G as
> the sender is no longer the bottleneck.
>
> Tap side still a mystery, but it unblocked testing. I'll figure that
> part out separately.
Further update - the above mystery was dhclient, thanks a lot to David
for being able to figure that out very quickly.
But the more interesting update - I got both links up on the receiving
side, providing 200G of bandwidth. I re-ran the test, with proper zero
copy running on the sending side, and io_uring zcrx on the receiver. The
receiver is two threads, BUT targeting the same queue on the two nics.
Both receiver threads bound to the same core (453 in this case). In
other words, single cpu thread is running all of both rx threads, napi
included.
Basic thread usage from top here:
10816 root 20 0 396640 393224 0 R 49.0 0.0 0:01.77 server
10818 root 20 0 396640 389128 0 R 49.0 0.0 0:01.76 server
and I get 98.4Gbps and 98.6Gbps on the receiver side, which is basically
the combined link bw again. So 200G not enough to saturate a single cpu
thread.
--
Jens Axboe
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
2024-10-09 22:58 ` Pavel Begunkov
@ 2024-10-10 18:19 ` Mina Almasry
2024-10-10 20:26 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-10-10 18:19 UTC (permalink / raw)
To: Pavel Begunkov
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Wed, Oct 9, 2024 at 3:57 PM Pavel Begunkov <[email protected]> wrote:
>
> On 10/9/24 23:01, Mina Almasry wrote:
> > On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
> >>
> >> From: Pavel Begunkov <[email protected]>
> >>
> >> Implement a page pool memory provider for io_uring to receive in a
> >> zero copy fashion. For that, the provider allocates user pages wrapped
> >> around into struct net_iovs, that are stored in a previously registered
> >> struct net_iov_area.
> >>
> >> Unlike with traditional receives, for which pages from a page pool can
> >> be deallocated right after the user receives data, e.g. via recv(2),
> >> we extend the lifetime by recycling buffers only after the user space
> >> acknowledges that it's done processing the data via the refill queue.
> >> Before handing buffers to the user, we mark them by bumping the refcount
> >> by a bias value IO_ZC_RX_UREF, which will be checked when the buffer is
> >> returned back. When the corresponding io_uring instance and/or page pool
> >> are destroyed, we'll force back all buffers that are currently in the
> >> user space in ->io_pp_zc_scrub by clearing the bias.
> >>
> >
> > This is an interesting design choice. In my experience the page_pool
> > works the opposite way, i.e. all the netmems in it are kept alive
> > until the user is done with them. Deviating from that requires custom
> > behavior (->scrub), which may be fine, but why do this? Isn't it
> > better from a uapi perspective to keep the memory alive until the user is
> > done with it?
>
> It's hardly interesting, it's _exactly_ the same thing devmem TCP
> does by attaching the lifetime of buffers to a socket's xarray,
> which requires custom behaviour. Maybe I wasn't clear on one thing
> though, it's accounting from the page pool's perspective. Those are
> user pages, likely still mapped into the user space, in which case
> they're not going to be destroyed.
>
I think we miscommunicated. Both devmem TCP and io_uring seem to bump
the refcount of memory while the user is using it, yes. But devmem TCP
doesn't scrub the memory when the page_pool dies. io_uring seems to
want to scrub the memory when the page_pool dies. I'm wondering about
this difference. It seems better from a uapi perspective to keep the
memory alive until the user returns it or crashes. Otherwise you could
have 1 thread reading user memory and 1 thread destroying the
page_pool, and the memory would be pulled out from under the read, right?
> >> Refcounting and lifetime:
> >>
> >> Initially, all buffers are considered unallocated and stored in
> >> ->freelist, at which point they are not yet directly exposed to the core
> >> page pool code and not accounted to page pool's pages_state_hold_cnt.
> >> The ->alloc_netmems callback will allocate them by placing into the
> >> page pool's cache, setting the refcount to 1 as usual and adjusting
> >> pages_state_hold_cnt.
> >>
> >> Then, either the buffer is dropped and returns back to the page pool
> >> into the ->freelist via io_pp_zc_release_netmem, in which case the page
> >> pool will match hold_cnt for us with ->pages_state_release_cnt. Or more
> >> likely the buffer will go through the network/protocol stacks and end up
> >> in the corresponding socket's receive queue. From there the user can get
> >> it via a new io_uring request implemented in the following patches. As
> >> mentioned above, before giving a buffer to the user we bump the refcount
> >> by IO_ZC_RX_UREF.
> >>
> >> Once the user is done with the buffer processing, it must return it back
> >> via the refill queue, from where our ->alloc_netmems implementation can
> >> grab it, check references, put IO_ZC_RX_UREF, and recycle the buffer if
> >> there are no more users left. As we place such buffers right back into
> >> the page pool's fast cache and they didn't go through the normal pp
> >> release path, they are still considered "allocated" and no pp hold_cnt
> >> is required.
> >
> > Why is this needed? In general the provider is to allocate free memory
>
> I don't get it, what "this"? If it's refill queue, that's because
> I don't like actively returning buffers back via syscall / setsockopt
> and trying to transfer them into the napi context (i.e.
> napi_pp_put_page) hoping it works / cached well.
>
> If "this" is IO_ZC_RX_UREF, it's because we need to track when a
> buffer is given to the userspace, and I don't think some kind of
> map / xarray in the hot path is the best for performance solution.
>
Sorry I wasn't clear. By 'this' I'm referring to:
"from where our ->alloc_netmems implementation can grab it, check
references, put IO_ZC_RX_UREF, and recycle the buffer if there are no
more users left"
This is the part that I'm not able to stomach at the moment. Maybe if
I look deeper it would make more sense, but my first feeling is that
it's really not acceptable.
alloc_netmems (and more generically page_pool_alloc_netmem) just
allocates a netmem and gives it to the page_pool code to decide
whether to put it in the cache, in the ptr ring, or directly to the
user, etc.
The provider should not be overstepping or overriding the page_pool
logic to recycle pages or deliver them to the user. alloc_netmem
should just alloc the netmem and hand it to the page_pool to
decide what to do with it.
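Put as code, the shape being described would be roughly the sketch below, with
hypothetical helper names and the area assumed to live in pp->mp_priv; it is
not the posted patch:

static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
{
        struct io_zcrx_area *area = pp->mp_priv;
        struct net_iov *niov;

        /* hypothetical freelist pop: hand one buffer back and let the
         * generic page_pool code decide cache vs. ptr ring vs. caller */
        niov = io_zcrx_get_free_niov(area);
        if (!niov)
                return 0;

        page_pool_set_pp_info(pp, net_iov_to_netmem(niov));
        return net_iov_to_netmem(niov);
}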
> > and logic as to where the memory should go (to fast cache, to normal
> > pp release path, etc) should remain in provider agnostic code paths in
> > the page_pool. Not maintainable IMO in the long run to have individual
>
> Please do elaborate what exactly is not maintainable here
>
In the future we will have N memory providers. It's not maintainable
IMO for each of them to touch pp->alloc.cache and other internals in M
special ways and for us to have to handle N * M edge cases in the
page_pool code because each provider is overstepping on our internals.
The provider should just provide memory. The page_pool should decide
to fill its alloc.cache & ptr ring & give memory to the pp caller as
it sees fit.
> > pp providers customizing non-provider specific code or touching pp
> > private structs.
>
> ...
> >> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
> >> index 8382129402ac..6cd3dee8b90a 100644
> >> --- a/io_uring/zcrx.c
> >> +++ b/io_uring/zcrx.c
> >> @@ -2,7 +2,11 @@
> ...
> >> +static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
> >> +{
> >> + struct net_iov_area *owner = net_iov_owner(niov);
> >> +
> >> + return container_of(owner, struct io_zcrx_area, nia);
> >
> > Similar to other comment in the other patch, why are we sure this
> > doesn't return garbage (i.e. it's accidentally called on a dmabuf
> > net_iov?)
>
> There couldn't be any net_iov at this point not belonging to
> the current io_uring instance / etc. Same with devmem TCP,
> devmem callbacks can't be called for some random net_iov, the
> only place you need to explicitly check is where it comes
> from generic path to a devmem aware path like that patched
> chunk in tcp.c
>
> >> +static inline void io_zc_add_pp_cache(struct page_pool *pp,
> >> + struct net_iov *niov)
> >> +{
> >> + netmem_ref netmem = net_iov_to_netmem(niov);
> >> +
> >> +#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
> >> + if (pp->dma_sync && dma_dev_need_sync(pp->p.dev)) {
> >
> > IIRC we force that dma_sync == true for memory providers, unless you
> > changed that and I missed it.
>
> I'll take a look, might remove it.
>
> >> + dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem);
> >> +
> >> + dma_sync_single_range_for_device(pp->p.dev, dma_addr,
> >> + pp->p.offset, pp->p.max_len,
> >> + pp->p.dma_dir);
> >> + }
> >> +#endif
> >> +
> >> + page_pool_fragment_netmem(netmem, 1);
> >> + pp->alloc.cache[pp->alloc.count++] = netmem;
> >
> > IMO touching pp internals in a provider should not be acceptable.
>
> Ok, I can add a page pool helper for that.
>
To be clear, adding a helper will not resolve the issue I'm seeing.
IMO nothing in alloc_netmem or any helpers it's calling should
touch pp->alloc.cache. alloc_netmem should just allocate the memory
and let the non-provider pp code decide what to do with the memory.
> > pp->alloc.cache is a data structure private to the page_pool and
> > should not be touched at all by any specific memory provider. Not
> > maintainable in the long run tbh for individual pp providers to mess
> > with pp private structs and we hunt for bugs that are reproducible
> > with 1 pp provider or another, or have to deal with the mental strain
> > of provider specific handling in what is supposed to be generic
> > page_pool paths.
>
> I get what you're trying to say about not touching internals,
> I agree with that, but I can't share the sentiment about debugging.
> It's a pretty specific api, users running io_uring almost always
> write directly to io_uring and we solve it. If it happens that's not
> the case, please do redirect the issue.
>
> > IMO the provider must implement the 4 'ops' (alloc, free, init,
>
> Doing 1 buffer per callback wouldn't be scalable at speeds
> we're looking at.
>
I doubt this is true, or at least there needs to be more info here.
page_pool_alloc_netmem() pretty much allocates 1 buffer per callback
for all its current users (regular memory & dmabuf), and that's good
enough to drive 200gbps NICs. What is special about the io_uring use
case that this is not good enough?
The reason it is good enough in my experience is that
page_pool_alloc_netmem() is a slow path. netmems are allocated from
that function and heavily recycled by the page_pool afterwards.
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 01/15] net: devmem: pull struct definitions out of ifdef
2024-10-10 18:01 ` Mina Almasry
@ 2024-10-10 18:57 ` Pavel Begunkov
2024-10-13 22:38 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-10 18:57 UTC (permalink / raw)
To: Mina Almasry
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On 10/10/24 19:01, Mina Almasry wrote:
> On Wed, Oct 9, 2024 at 4:16 PM Pavel Begunkov <[email protected]> wrote:
>>
>> On 10/9/24 21:17, Mina Almasry wrote:
>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>>>
>>>> From: Pavel Begunkov <[email protected]>
>>>>
>>>> Don't hide structure definitions under conditional compilation, it only
>>>> makes things messier and harder to maintain. Move struct
>>>> dmabuf_genpool_chunk_owner definition out of CONFIG_NET_DEVMEM ifdef
>>>> together with a bunch of trivial inlined helpers using the structure.
>>>>
>>>
>>> To be honest I think the way it is is better? Having the struct
>>> defined but always not set (because the code to set it is always
>>> compiled out) seem worse to me.
>>>
>>> Is there a strong reason to have this? Otherwise maybe drop this?
>> I can drop it if there are strong opinions on that, but I'm
>> allergic to ifdef hell and just trying to help to avoid it becoming
>> so. I even believe it's considered a bad pattern (is it?).
>>
>> As for a more technical description "why", it reduces the line count
>> and you don't need to duplicate functions. It's always annoying
>> making sure the prototypes stay same, but this way it's always
>> compiled and syntactically checked. And when refactoring anything
>> like the next patch does, you only need to change one function
>> but not both. Do you find that convincing?
>>
>
> To be honest the tradeoff wins in the other direction for me. The
> extra boilerplate is not that bad, and we can be sure that any code
We can count how often people break builds because a change
was compiled with just one configuration in mind. Unfortunately,
I did it myself a fair share of times, and there are enough
build robot reports like that. It's not just about boilerplate
but rather overall maintainability.
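To put the pattern in code with made-up names: keep the struct and the
trivial inlines visible and compiled in every configuration, and only
gate the code that can actually run.

struct foo_state {
	int users;
};

/* always compiled and type-checked, so a refactor touches exactly one
 * copy and a build bot catches breakage in every config */
static inline bool foo_state_busy(const struct foo_state *s)
{
	return s->users != 0;
}

#ifdef CONFIG_FOO
int foo_register(struct foo_state *s);
#else
static inline int foo_register(struct foo_state *s) { return -EOPNOTSUPP; }
#endif

As opposed to also duplicating struct foo_state and foo_state_busy()
under both branches of the ifdef and keeping the copies in sync by hand.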
> that touches net_devmem_dmabuf_binding will get valid internals
> since it won't compile if the feature is disabled. This could be
> critical and could prevent bugs.
I don't see the concern: if devmem is compiled out, there wouldn't
be a devmem provider to even create it, and you don't need to
worry. If you think someone would create a binding without devmem,
then I don't believe hiding a struct definition would be enough
to prevent that in the first place.
I think the maintainers can tell whichever way they think is
better, and I can drop the patch, even though I think it's much
better with it.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
2024-10-10 18:19 ` Mina Almasry
@ 2024-10-10 20:26 ` Pavel Begunkov
2024-10-10 20:53 ` Mina Almasry
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-10 20:26 UTC (permalink / raw)
To: Mina Almasry
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On 10/10/24 19:19, Mina Almasry wrote:
> On Wed, Oct 9, 2024 at 3:57 PM Pavel Begunkov <[email protected]> wrote:
>>
>> On 10/9/24 23:01, Mina Almasry wrote:
>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>>>
>>>> From: Pavel Begunkov <[email protected]>
>>>>
>>>> Implement a page pool memory provider for io_uring to receive in a
>>>> zero copy fashion. For that, the provider allocates user pages wrapped
>>>> around into struct net_iovs, that are stored in a previously registered
>>>> struct net_iov_area.
>>>>
>>>> Unlike with traditional receives, for which pages from a page pool can
>>>> be deallocated right after the user receives data, e.g. via recv(2),
>>>> we extend the lifetime by recycling buffers only after the user space
>>>> acknowledges that it's done processing the data via the refill queue.
>>>> Before handing buffers to the user, we mark them by bumping the refcount
>>>> by a bias value IO_ZC_RX_UREF, which will be checked when the buffer is
>>>> returned back. When the corresponding io_uring instance and/or page pool
>>>> are destroyed, we'll force back all buffers that are currently in the
>>>> user space in ->io_pp_zc_scrub by clearing the bias.
>>>>
>>>
>>> This is an interesting design choice. In my experience the page_pool
>>> works the opposite way, i.e. all the netmems in it are kept alive
>>> until the user is done with them. Deviating from that requires custom
>>> behavior (->scrub), which may be fine, but why do this? Isn't it
>>> better for uapi perspective to keep the memory alive until the user is
>>> done with it?
>>
>> It's hardly interesting, it's _exactly_ the same thing devmem TCP
>> does by attaching the lifetime of buffers to a socket's xarray,
>> which requires custom behaviour. Maybe I wasn't clear on one thing
>> though, it's accounting from the page pool's perspective. Those are
>> user pages, likely still mapped into the user space, in which case
>> they're not going to be destroyed.
>>
>
> I think we miscommunicated. Both devmem TCP and io_uring seem to bump
> the refcount of memory while the user is using it, yes. But devmem TCP
> doesn't scrub the memory when the page_pool dies. io_uring seems to
> want to scrub the memory when the page_pool dies. I'm wondering about
> > this difference. Seems it's better from a uapi perspective to keep the
> memory alive until the user returns it or crash. Otherwise you could
The (user) memory is not going to be pulled out from under the user,
it's user pages in the user's mm and pinned by it. The difference
is that the page pool will be able to die.
> have 1 thread reading user memory and 1 thread destroying the
> page_pool and the memory will be pulled from under the read, right?
If an io_uring is shared b/w users and one of them is destroying
the instance while it's still running, it's a severe userspace bug;
even then the memory will still be alive as per above.
>>>> Refcounting and lifetime:
>>>>
>>>> Initially, all buffers are considered unallocated and stored in
>>>> ->freelist, at which point they are not yet directly exposed to the core
>>>> page pool code and not accounted to page pool's pages_state_hold_cnt.
>>>> The ->alloc_netmems callback will allocate them by placing into the
>>>> page pool's cache, setting the refcount to 1 as usual and adjusting
>>>> pages_state_hold_cnt.
>>>>
>>>> Then, either the buffer is dropped and returns back to the page pool
>>>> into the ->freelist via io_pp_zc_release_netmem, in which case the page
>>>> pool will match hold_cnt for us with ->pages_state_release_cnt. Or more
>>>> likely the buffer will go through the network/protocol stacks and end up
>>>> in the corresponding socket's receive queue. From there the user can get
>>>> it via a new io_uring request implemented in the following patches. As
>>>> mentioned above, before giving a buffer to the user we bump the refcount
>>>> by IO_ZC_RX_UREF.
>>>>
>>>> Once the user is done with the buffer processing, it must return it back
>>>> via the refill queue, from where our ->alloc_netmems implementation can
>>>> grab it, check references, put IO_ZC_RX_UREF, and recycle the buffer if
>>>> there are no more users left. As we place such buffers right back into
>>>> the page pool's fast cache and they didn't go through the normal pp
>>>> release path, they are still considered "allocated" and no pp hold_cnt
>>>> is required.
>>>
>>> Why is this needed? In general the provider is to allocate free memory
>>
>> I don't get it, what "this"? If it's refill queue, that's because
>> I don't like actively returning buffers back via syscall / setsockopt
>> and trying to transfer them into the napi context (i.e.
>> napi_pp_put_page) hoping it works / cached well.
>>
>> If "this" is IO_ZC_RX_UREF, it's because we need to track when a
>> buffer is given to the userspace, and I don't think some kind of
>> map / xarray in the hot path is the best for performance solution.
>>
>
> Sorry I wasn't clear. By 'this' I'm referring to:
>
> "from where our ->alloc_netmems implementation can grab it, check
> references, put IO_ZC_RX_UREF, and recycle the buffer if there are no
> more users left"
>
> This is the part that I'm not able to stomach at the moment. Maybe if
>>> I look deeper it would make more sense, but my first feeling is that
> it's really not acceptable.
>
> alloc_netmems (and more generically page_pool_alloc_netmem), just
> allocates a netmem and gives it to the page_pool code to decide
That's how it works because that's how devmem needs it and you
tailored it that way, not the other way around. It could've just as well
been a callback that fills the cache as an intermediate, from
where the page pool can grab netmems and return them to the user,
and it would've been a pretty clean interface as well.
> whether to put it in the cache, in the ptr ring, or directly to the
> user, etc.
And that's the semantics you've just imbued into it.
> The provider should not be overstepping or overriding the page_pool
> logic to recycle pages or deliver them to the user. alloc_netmem
I'm baffled, where does it override page pool's logic? It provides the
memory to the page pool, nothing more, nothing less; it doesn't decide if
it's handed to the user or goes to the ptr ring, the page pool is free to do
whatever is needed. Yes, it's handed over by means of returning it in the cache
because of performance considerations. The provider API could look
different, e.g. passing a temp array like in the oversimplified snippet
below, but even then I don't think that's good enough.
page_pool_alloc_netmem() {
	netmem_t arr[64];

	nr = pool->mp_ops->alloc_netmems(arr, 64);

	// page pool does the page pool stuff.
	for (i = 0; i < nr; i++)
		pp->cache[i] = arr[i];
	return pp->cache;
}
> should just alloc the netmem and hand it to the page_pool to
> decide what to do with it.
>
>>> and logic as to where the memory should go (to fast cache, to normal
>>> pp release path, etc) should remain in provider agnostic code paths in
>>> the page_pool. Not maintainable IMO in the long run to have individual
>>
>> Please do elaborate what exactly is not maintainable here
>>
>
> In the future we will have N memory providers. It's not maintainable
> IMO for each of them to touch pp->alloc.cache and other internals in M
> special ways and for us to have to handle N * M edge cases in the
> page_pool code because each provider is overstepping on our internals.
It sounds like anything that strays from the devmem TCP way is wrong;
please let me know if so, because if that's the case there can only be
devmem TCP, maybe just renamed for niceness. The patch set uses the
abstractions, in a performant way, without adding overhead to everyone
else, and, to the best of my taste, in a clean way.
> The provider should just provide memory. The page_pool should decide
> to fill its alloc.cache & ptr ring & give memory to the pp caller as
> it sees fit.
>
>>> pp providers customizing non-provider specific code or touching pp
>>> private structs.
>>
>> ...
>>>> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
>>>> index 8382129402ac..6cd3dee8b90a 100644
>>>> --- a/io_uring/zcrx.c
>>>> +++ b/io_uring/zcrx.c
>>>> @@ -2,7 +2,11 @@
>> ...
>>>> +static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
>>>> +{
>>>> + struct net_iov_area *owner = net_iov_owner(niov);
>>>> +
>>>> + return container_of(owner, struct io_zcrx_area, nia);
>>>
>>> Similar to other comment in the other patch, why are we sure this
>>> doesn't return garbage (i.e. it's accidentally called on a dmabuf
>>> net_iov?)
>>
>> There couldn't be any net_iov at this point not belonging to
>> the current io_uring instance / etc. Same with devmem TCP,
>> devmem callbacks can't be called for some random net_iov, the
>> only place you need to explicitly check is where it comes
>> from generic path to a devmem aware path like that patched
>> chunk in tcp.c
>>
>>>> +static inline void io_zc_add_pp_cache(struct page_pool *pp,
>>>> + struct net_iov *niov)
>>>> +{
>>>> + netmem_ref netmem = net_iov_to_netmem(niov);
>>>> +
>>>> +#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
>>>> + if (pp->dma_sync && dma_dev_need_sync(pp->p.dev)) {
>>>
>>> IIRC we force that dma_sync == true for memory providers, unless you
>>> changed that and I missed it.
>>
>> I'll take a look, might remove it.
>>
>>>> + dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem);
>>>> +
>>>> + dma_sync_single_range_for_device(pp->p.dev, dma_addr,
>>>> + pp->p.offset, pp->p.max_len,
>>>> + pp->p.dma_dir);
>>>> + }
>>>> +#endif
>>>> +
>>>> + page_pool_fragment_netmem(netmem, 1);
>>>> + pp->alloc.cache[pp->alloc.count++] = netmem;
>>>
>>> IMO touching pp internals in a provider should not be acceptable.
>>
>> Ok, I can add a page pool helper for that.
>>
>
> To be clear, adding a helper will not resolve the issue I'm seeing.
> > IMO nothing in alloc_netmem or any helpers it calls should
> touch pp->alloc.cache. alloc_netmem should just allocate the memory
> and let the non-provider pp code decide what to do with the memory.
Then we have opposite opinions, and I can't agree with what
you're proposing. If I'm adding an interface, I'm trying to make
it lasting and not be thrown away in a year. One indirect call
per page in the hot hot path is just insanity. Just remember what
you've been told about one single "if" in the hot path.
>>> pp->alloc.cache is a data structure private to the page_pool and
>>> should not be touched at all by any specific memory provider. Not
>>> maintainable in the long run tbh for individual pp providers to mess
>>> with pp private structs and we hunt for bugs that are reproducible
>>> with 1 pp provider or another, or have to deal with the mental strain
>>> of provider specific handling in what is supposed to be generic
>>> page_pool paths.
>>
>> I get what you're trying to say about not touching internals,
>> I agree with that, but I can't share the sentiment about debugging.
>> It's a pretty specific api, users running io_uring almost always
>> write directly to io_uring and we solve it. If it happens that's not
>> the case, please do redirect the issue.
>>
>>> IMO the provider must implement the 4 'ops' (alloc, free, init,
>>
>> Doing 1 buffer per callback wouldn't be scalable at speeds
>> we're looking at.
>>
>
> I doubt this is true or at least there needs to be more info here. The
If you don't believe me, then, please, go ahead and do your own testing,
or look through patches addressing it across the stack like [1];
you'll be able to find many more. I don't have any recent numbers
on indirect calls, but I did a fair share of testing before for
different kinds of overhead; it has always been expensive, easily
1-2% per fast block request, which could be much worse if it's per
page.
[1] https://lore.kernel.org/netdev/[email protected]/
> page_pool_alloc_netmem() pretty much allocates 1 buffer per callback
> for all its current users (regular memory & dmabuf), and that's good
> enough to drive 200gbps NICs. What is special about io_uring use case
> that this is not good enough?
>
> The reason it is good enough in my experience is that
> page_pool_alloc_netmem() is a slow path. netmems are allocated from
> that function and heavily recycled by the page_pool afterwards.
That's because of how you return buffers back to the page pool; with
io_uring it is a hot path, even though amortised exactly because
it doesn't just return one buffer at a time.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
2024-10-10 20:26 ` Pavel Begunkov
@ 2024-10-10 20:53 ` Mina Almasry
2024-10-10 20:58 ` Mina Almasry
2024-10-10 21:22 ` Pavel Begunkov
0 siblings, 2 replies; 124+ messages in thread
From: Mina Almasry @ 2024-10-10 20:53 UTC (permalink / raw)
To: Pavel Begunkov
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Thu, Oct 10, 2024 at 1:26 PM Pavel Begunkov <[email protected]> wrote:
>
> On 10/10/24 19:19, Mina Almasry wrote:
> > On Wed, Oct 9, 2024 at 3:57 PM Pavel Begunkov <[email protected]> wrote:
> >>
> >> On 10/9/24 23:01, Mina Almasry wrote:
> >>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
> >>>>
> >>>> From: Pavel Begunkov <[email protected]>
> >>>>
> >>>> Implement a page pool memory provider for io_uring to receive in a
> >>>> zero copy fashion. For that, the provider allocates user pages wrapped
> >>>> around into struct net_iovs, that are stored in a previously registered
> >>>> struct net_iov_area.
> >>>>
> >>>> Unlike with traditional receives, for which pages from a page pool can
> >>>> be deallocated right after the user receives data, e.g. via recv(2),
> >>>> we extend the lifetime by recycling buffers only after the user space
> >>>> acknowledges that it's done processing the data via the refill queue.
> >>>> Before handing buffers to the user, we mark them by bumping the refcount
> >>>> by a bias value IO_ZC_RX_UREF, which will be checked when the buffer is
> >>>> returned back. When the corresponding io_uring instance and/or page pool
> >>>> are destroyed, we'll force back all buffers that are currently in the
> >>>> user space in ->io_pp_zc_scrub by clearing the bias.
> >>>>
> >>>
> >>> This is an interesting design choice. In my experience the page_pool
> >>> works the opposite way, i.e. all the netmems in it are kept alive
> >>> until the user is done with them. Deviating from that requires custom
> >>> behavior (->scrub), which may be fine, but why do this? Isn't it
> >>> better for uapi perspective to keep the memory alive until the user is
> >>> done with it?
> >>
> >> It's hardly interesting, it's _exactly_ the same thing devmem TCP
> >> does by attaching the lifetime of buffers to a socket's xarray,
> >> which requires custom behaviour. Maybe I wasn't clear on one thing
> >> though, it's accounting from the page pool's perspective. Those are
> >> user pages, likely still mapped into the user space, in which case
> >> they're not going to be destroyed.
> >>
> >
> > I think we miscommunicated. Both devmem TCP and io_uring seem to bump
> > the refcount of memory while the user is using it, yes. But devmem TCP
> > doesn't scrub the memory when the page_pool dies. io_uring seems to
> > want to scrub the memory when the page_pool dies. I'm wondering about
> > this difference. Seems it's better from a uapi perspective to keep the
> > memory alive until the user returns it or crash. Otherwise you could
>
> The (user) memory is not going to be pulled under the user,
> it's user pages in the user's mm and pinned by it. The difference
> is that the page pool will be able to die.
>
> > have 1 thread reading user memory and 1 thread destroying the
> > page_pool and the memory will be pulled from under the read, right?
>
> If an io_uring is shared b/w users and one of them is destroying
> the instance while it's still running, it's a severe userspace bug,
> even then the memory will still be alive as per above.
>
> >>>> Refcounting and lifetime:
> >>>>
> >>>> Initially, all buffers are considered unallocated and stored in
> >>>> ->freelist, at which point they are not yet directly exposed to the core
> >>>> page pool code and not accounted to page pool's pages_state_hold_cnt.
> >>>> The ->alloc_netmems callback will allocate them by placing into the
> >>>> page pool's cache, setting the refcount to 1 as usual and adjusting
> >>>> pages_state_hold_cnt.
> >>>>
> >>>> Then, either the buffer is dropped and returns back to the page pool
> >>>> into the ->freelist via io_pp_zc_release_netmem, in which case the page
> >>>> pool will match hold_cnt for us with ->pages_state_release_cnt. Or more
> >>>> likely the buffer will go through the network/protocol stacks and end up
> >>>> in the corresponding socket's receive queue. From there the user can get
> >>>> it via a new io_uring request implemented in the following patches. As
> >>>> mentioned above, before giving a buffer to the user we bump the refcount
> >>>> by IO_ZC_RX_UREF.
> >>>>
> >>>> Once the user is done with the buffer processing, it must return it back
> >>>> via the refill queue, from where our ->alloc_netmems implementation can
> >>>> grab it, check references, put IO_ZC_RX_UREF, and recycle the buffer if
> >>>> there are no more users left. As we place such buffers right back into
> >>>> the page pool's fast cache and they didn't go through the normal pp
> >>>> release path, they are still considered "allocated" and no pp hold_cnt
> >>>> is required.
> >>>
> >>> Why is this needed? In general the provider is to allocate free memory
> >>
> >> I don't get it, what "this"? If it's refill queue, that's because
> >> I don't like actively returning buffers back via syscall / setsockopt
> >> and trying to transfer them into the napi context (i.e.
> >> napi_pp_put_page) hoping it works / cached well.
> >>
> >> If "this" is IO_ZC_RX_UREF, it's because we need to track when a
> >> buffer is given to the userspace, and I don't think some kind of
> >> map / xarray in the hot path is the best for performance solution.
> >>
> >
> > Sorry I wasn't clear. By 'this' I'm referring to:
> >
> > "from where our ->alloc_netmems implementation can grab it, check
> > references, put IO_ZC_RX_UREF, and recycle the buffer if there are no
> > more users left"
> >
> > This is the part that I'm not able to stomach at the moment. Maybe if
> > I look deeper it would make more sense, but my first feeling is that
> > it's really not acceptable.
> >
> > alloc_netmems (and more generically page_pool_alloc_netmem), just
> > allocates a netmem and gives it to the page_pool code to decide
>
> That's how it works because that's how devmem needs it and you
> tailored it, not the other way around. It could've pretty well
> been a callback that fills the cache as an intermediate, from
> where page pool can grab netmems and return back to the user,
> and it would've been a pretty clean interface as well.
>
It could have been, but that would be a much worse design IMO. The
whole point of memory providers is that they provide memory to the
page_pool and the page_pool does its thing (among which is recycling)
with that memory. In this patch you seem to have implemented a
provider where, if the page is returned by io_uring, it's not
returned to the page_pool but directly to the provider. In
other code paths the memory will be returned to the page_pool.
I.e allocation is always:
provider -> pp -> driver
freeing from io_uring is:
io_uring -> provider -> pp
freeing from tcp stack or driver I'm guessing will be:
tcp stack/driver -> pp -> provider
I'm recommending that the model for memory providers must be in line
with what we do for pages, devmem TCP, and Jakub's out-of-tree huge
page provider (i.e. everything else using the page_pool). The model is
streamlined:
allocation:
provider -> pp -> driver
freeing (always):
tcp stack/io_uring/driver/whatever else -> pp -> driver
Is special-casing the freeing path for io_uring OK? Is letting the
io_uring provider do its own recycling OK? IMO, no. All providers must
follow one easy-to-follow general framework, otherwise the special
casing for N providers will get out of hand, but that's just my
opinion. A maintainer will make the judgement call here.
> > whether to put it in the cache, in the ptr ring, or directly to the
> > user, etc.
>
> And that's the semantics you've just imbued into it.
>
> > The provider should not be overstepping or overriding the page_pool
> > logic to recycle pages or deliver them to the user. alloc_netmem
>
> I'm baffled, where does it override page pool's logic? It provides the
> memory to the page pool, nothing more, nothing less, it doesn't decide if
> it's handed to the user or goes to the ptr ring, the page pool is free to do
> whatever is needed. Yes, it's handed by means of returning in the cache
> because of performance considerations. The provider API can look
> differently, e.g. passing a temp array like in the oversimplified snippet
> below, but even then I don't think that's good enough.
>
> page_pool_alloc_netmem() {
> 	netmem_t arr[64];
>
> 	nr = pool->mp_ops->alloc_netmems(arr, 64);
>
> 	// page pool does the page pool stuff.
> 	for (i = 0; i < nr; i++)
> 		pp->cache[i] = arr[i];
> 	return pp->cache;
> }
>
> > should just alloc the netmem and hand it to the page_pool to
> > decide what to do with it.
> >
> >>> and logic as to where the memory should go (to fast cache, to normal
> >>> pp release path, etc) should remain in provider agnostic code paths in
> >>> the page_pool. Not maintainable IMO in the long run to have individual
> >>
> >> Please do elaborate what exactly is not maintainable here
> >>
> >
> > In the future we will have N memory providers. It's not maintainable
> > IMO for each of them to touch pp->alloc.cache and other internals in M
> > special ways and for us to have to handle N * M edge cases in the
> > page_pool code because each provider is overstepping on our internals.
>
> It sounds like anything that strays from the devmem TCP way is wrong,
> please let me know if so, because if that's the case there can only be
> devmem TCP, maybe just renamed for niceness. The patch set uses the
> abstractions, in a performant way, without adding overhead to everyone
> else, and to the best of my taste in a clean way.
>
> > The provider should just provide memory. The page_pool should decide
> > to fill its alloc.cache & ptr ring & give memory to the pp caller as
> > it sees fit.
> >
> >>> pp providers customizing non-provider specific code or touching pp
> >>> private structs.
> >>
> >> ...
> >>>> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
> >>>> index 8382129402ac..6cd3dee8b90a 100644
> >>>> --- a/io_uring/zcrx.c
> >>>> +++ b/io_uring/zcrx.c
> >>>> @@ -2,7 +2,11 @@
> >> ...
> >>>> +static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
> >>>> +{
> >>>> + struct net_iov_area *owner = net_iov_owner(niov);
> >>>> +
> >>>> + return container_of(owner, struct io_zcrx_area, nia);
> >>>
> >>> Similar to other comment in the other patch, why are we sure this
> >>> doesn't return garbage (i.e. it's accidentally called on a dmabuf
> >>> net_iov?)
> >>
> >> There couldn't be any net_iov at this point not belonging to
> >> the current io_uring instance / etc. Same with devmem TCP,
> >> devmem callbacks can't be called for some random net_iov, the
> >> only place you need to explicitly check is where it comes
> >> from generic path to a devmem aware path like that patched
> >> chunk in tcp.c
> >>
> >>>> +static inline void io_zc_add_pp_cache(struct page_pool *pp,
> >>>> + struct net_iov *niov)
> >>>> +{
> >>>> + netmem_ref netmem = net_iov_to_netmem(niov);
> >>>> +
> >>>> +#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC)
> >>>> + if (pp->dma_sync && dma_dev_need_sync(pp->p.dev)) {
> >>>
> >>> IIRC we force that dma_sync == true for memory providers, unless you
> >>> changed that and I missed it.
> >>
> >> I'll take a look, might remove it.
> >>
> >>>> + dma_addr_t dma_addr = page_pool_get_dma_addr_netmem(netmem);
> >>>> +
> >>>> + dma_sync_single_range_for_device(pp->p.dev, dma_addr,
> >>>> + pp->p.offset, pp->p.max_len,
> >>>> + pp->p.dma_dir);
> >>>> + }
> >>>> +#endif
> >>>> +
> >>>> + page_pool_fragment_netmem(netmem, 1);
> >>>> + pp->alloc.cache[pp->alloc.count++] = netmem;
> >>>
> >>> IMO touching pp internals in a provider should not be acceptable.
> >>
> >> Ok, I can add a page pool helper for that.
> >>
> >
> > To be clear, adding a helper will not resolve the issue I'm seeing.
> > IMO nothing in alloc_netmem or any helpers it calls should
> > touch pp->alloc.cache. alloc_netmem should just allocate the memory
> > and let the non-provider pp code decide what to do with the memory.
>
> Then we have opposite opinions, and I can't agree with what
> you're proposing. If I'm adding an interface, I'm trying to make
> it lasting and not be thrown away in a year. One indirect call
> per page in the hot hot path is just insanity. Just remember what
> you've been told about one single "if" in the hot path.
>
> >>> pp->alloc.cache is a data structure private to the page_pool and
> >>> should not be touched at all by any specific memory provider. Not
> >>> maintainable in the long run tbh for individual pp providers to mess
> >>> with pp private structs and we hunt for bugs that are reproducible
> >>> with 1 pp provider or another, or have to deal with the mental strain
> >>> of provider specific handling in what is supposed to be generic
> >>> page_pool paths.
> >>
> >> I get what you're trying to say about not touching internals,
> >> I agree with that, but I can't share the sentiment about debugging.
> >> It's a pretty specific api, users running io_uring almost always
> >> write directly to io_uring and we solve it. If it happens that's not
> >> the case, please do redirect the issue.
> >>
> >>> IMO the provider must implement the 4 'ops' (alloc, free, init,
> >>
> >> Doing 1 buffer per callback wouldn't be scalable at speeds
> >> we're looking at.
> >>
> >
> > I doubt this is true or at least there needs to be more info here. The
>
> If you don't believe me, then, please, go ahead and do your testing,
> or look through patches addressing it across the stack like [1],
> but you'll be able to find many more. I don't have any recent numbers
> on indirect calls, but I did a fair share of testing before for
> different kinds of overhead, it has always been expensive, can easily
> be 1-2% per fast block request, which could be much worse if it's per
> page.
>
> [1] https://lore.kernel.org/netdev/[email protected]/
>
>
> > page_pool_alloc_netmem() pretty much allocates 1 buffer per callback
> > for all its current users (regular memory & dmabuf), and that's good
> > enough to drive 200gbps NICs. What is special about io_uring use case
> > that this is not good enough?
> >
> > The reason it is good enough in my experience is that
> > page_pool_alloc_netmem() is a slow path. netmems are allocated from
> > that function and heavily recycled by the page_pool afterwards.
>
> That's because of how you return buffers back to the page pool; with
> io_uring it is a hot path, even though amortised exactly because
> it doesn't just return one buffer at a time.
>
Right, I guess I understand now. You need to implement your own
recycling in the provider because your model has bypassed the
page_pool recycling - which to me is 90% of the utility of the
page_pool. To make matters worse, the bypass is only there if the
netmems are returned from io_uring, and not bypassed when the netmems
are returned from driver/tcp stack. I'm guessing if you reused the
page_pool recycling in the io_uring return path then it would remove
the need for your provider to implement its own recycling for the
io_uring return case.
Is letting providers bypass and override the page_pool's recycling in
some code paths OK? IMO, no. A maintainer will make the judgement call
and speak authoritatively here and I will follow, but I do think it's
a (much) worse design.
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
2024-10-10 20:53 ` Mina Almasry
@ 2024-10-10 20:58 ` Mina Almasry
2024-10-10 21:22 ` Pavel Begunkov
1 sibling, 0 replies; 124+ messages in thread
From: Mina Almasry @ 2024-10-10 20:58 UTC (permalink / raw)
To: Pavel Begunkov
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Thu, Oct 10, 2024 at 1:53 PM Mina Almasry <[email protected]> wrote:
>
> > >>>>
> > >>>> Once the user is done with the buffer processing, it must return it back
> > >>>> via the refill queue, from where our ->alloc_netmems implementation can
> > >>>> grab it, check references, put IO_ZC_RX_UREF, and recycle the buffer if
> > >>>> there are no more users left. As we place such buffers right back into
> > >>>> the page pool's fast cache and they didn't go through the normal pp
> > >>>> release path, they are still considered "allocated" and no pp hold_cnt
> > >>>> is required.
> > >>>
> > >>> Why is this needed? In general the provider is to allocate free memory
> > >>
> > >> I don't get it, what "this"? If it's refill queue, that's because
> > >> I don't like actively returning buffers back via syscall / setsockopt
> > >> and trying to transfer them into the napi context (i.e.
> > >> napi_pp_put_page) hoping it works / cached well.
> > >>
> > >> If "this" is IO_ZC_RX_UREF, it's because we need to track when a
> > >> buffer is given to the userspace, and I don't think some kind of
> > >> map / xarray in the hot path is the best for performance solution.
> > >>
> > >
> > > Sorry I wasn't clear. By 'this' I'm referring to:
> > >
> > > "from where our ->alloc_netmems implementation can grab it, check
> > > references, put IO_ZC_RX_UREF, and recycle the buffer if there are no
> > > more users left"
> > >
> > > This is the part that I'm not able to stomach at the moment. Maybe if
> > > I look deeper it would make more sense, but my first feeling is that
> > > it's really not acceptable.
> > >
> > > alloc_netmems (and more generically page_pool_alloc_netmem), just
> > > allocates a netmem and gives it to the page_pool code to decide
> >
> > That's how it works because that's how devmem needs it and you
> > tailored it, not the other way around. It could've pretty well
> > been a callback that fills the cache as an intermediate, from
> > where page pool can grab netmems and return back to the user,
> > and it would've been a pretty clean interface as well.
> >
>
> It could have been, but that would be a much worse design IMO. The
> > whole point of memory providers is that they provide memory to the
> page_pool and the page_pool does its things (among which is recycling)
> with that memory. In this patch you seem to have implemented a
> provider which, if the page is returned by io_uring, then it's not
> returned to the page_pool, it's returned directly to the provider. In
> other code paths the memory will be returned to the page_pool.
>
> I.e allocation is always:
> provider -> pp -> driver
>
> freeing from io_uring is:
> io_uring -> provider -> pp
>
> freeing from tcp stack or driver I'm guessing will be:
> tcp stack/driver -> pp -> provider
>
> I'm recommending that the model for memory providers must be in line
> with what we do for pages, devmem TCP, and Jakub's out of tree huge
> page provider (i.e. everything else using the page_pool). The model is
> the streamlined:
>
> allocation:
> provider -> pp -> driver
>
> freeing (always):
> tcp stack/io_uring/driver/whatever else -> pp -> driver
>
Should be:
> freeing (always):
> tcp stack/io_uring/driver/whatever else -> pp -> provider
I.e. the pp frees to the provider, not the driver.
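In page_pool terms the freeing leg would look roughly like this (a
simplified sketch; the function name is invented and I'm assuming the
provider release op reports whether it took the buffer back):

/* generic pp release path: the provider only sees the final hand-back */
static void pp_release_netmem(struct page_pool *pool, netmem_ref netmem)
{
	/* recycling into alloc.cache / the ptr ring was already attempted
	 * by generic page_pool code before we get here */
	if (pool->mp_ops && pool->mp_ops->release_netmem(pool, netmem))
		return;		/* the provider reclaimed the buffer */

	put_page(netmem_to_page(netmem));	/* normal page release */
}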
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
2024-10-10 20:53 ` Mina Almasry
2024-10-10 20:58 ` Mina Almasry
@ 2024-10-10 21:22 ` Pavel Begunkov
2024-10-11 0:32 ` Mina Almasry
1 sibling, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-10 21:22 UTC (permalink / raw)
To: Mina Almasry
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On 10/10/24 21:53, Mina Almasry wrote:
> On Thu, Oct 10, 2024 at 1:26 PM Pavel Begunkov <[email protected]> wrote:
...
>>>
>>> Sorry I wasn't clear. By 'this' I'm referring to:
>>>
>>> "from where our ->alloc_netmems implementation can grab it, check
>>> references, put IO_ZC_RX_UREF, and recycle the buffer if there are no
>>> more users left"
>>>
>>> This is the part that I'm not able to stomach at the moment. Maybe if
>>> I look deeper it would make more sense, but my first feeling is that
>>> it's really not acceptable.
>>>
>>> alloc_netmems (and more generically page_pool_alloc_netmem), just
>>> allocates a netmem and gives it to the page_pool code to decide
>>
>> That's how it works because that's how devmem needs it and you
>> tailored it, not the other way around. It could've pretty well
>> been a callback that fills the cache as an intermediate, from
>> where page pool can grab netmems and return back to the user,
>> and it would've been a pretty clean interface as well.
>>
>
> It could have been, but that would be a much worse design IMO. The
> > whole point of memory providers is that they provide memory to the
> page_pool and the page_pool does its things (among which is recycling)
> with that memory. In this patch you seem to have implemented a
> provider which, if the page is returned by io_uring, then it's not
> returned to the page_pool, it's returned directly to the provider. In
> other code paths the memory will be returned to the page_pool.
>
> I.e allocation is always:
> provider -> pp -> driver
>
> freeing from io_uring is:
> io_uring -> provider -> pp
>
> freeing from tcp stack or driver I'm guessing will be:
> tcp stack/driver -> pp -> provider
>
> I'm recommending that the model for memory providers must be in line
> with what we do for pages, devmem TCP, and Jakub's out of tree huge
> page provider (i.e. everything else using the page_pool). The model is
> the streamlined:
Let's not go into the normal pages, because 1) it can't work
any other way in the general case, it has to cross contexts from
wherever the page is back to the napi / page pool, and 2) because devmem
TCP and io_uring already deviate from the standard page pool
by extending the lifetime of buffers to user space and more.
And that's exactly what I'm saying, you recommend it to be
aligned with devmem TCP. And let's not forget that you had to add
batching to that exact syscall return path because of
performance...
...
>>> I doubt this is true or at least there needs to be more info here. The
>>
>> If you don't believe me, then, please, go ahead and do your testing,
>> or look through patches addressing it across the stack like [1],
>> but you'll be able to find many more. I don't have any recent numbers
>> on indirect calls, but I did a fair share of testing before for
>> different kinds of overhead, it has always been expensive, can easily
>> be 1-2% per fast block request, which could be much worse if it's per
>> page.
>>
>> [1] https://lore.kernel.org/netdev/[email protected]/
>>
>>
>>> page_pool_alloc_netmem() pretty much allocates 1 buffer per callback
>>> for all its current users (regular memory & dmabuf), and that's good
>>> enough to drive 200gbps NICs. What is special about io_uring use case
>>> that this is not good enough?
>>>
>>> The reason it is good enough in my experience is that
>>> page_pool_alloc_netmem() is a slow path. netmems are allocated from
>>> that function and heavily recycled by the page_pool afterwards.
>>
>> That's because of how you return buffers back to the page pool; with
>> io_uring it is a hot path, even though amortised exactly because
>> it doesn't just return one buffer at a time.
>>
>
> Right, I guess I understand now. You need to implement your own
> recycling in the provider because your model has bypassed the
> page_pool recycling - which to me is 90% of the utility of the
So the utility of the page pool is a fast return path for the
standard page mode, i.e. napi_pp_put_page, which it is, and that is
important, I agree. But then, even though we have an IMO better
approach for this "buffer lifetime extended to userspace"
scenario, it has to use that very same return path because...?
> page_pool. To make matters worse, the bypass is only there if the
> netmems are returned from io_uring, and not bypassed when the netmems
> are returned from driver/tcp stack. I'm guessing if you reused the
> page_pool recycling in the io_uring return path then it would remove
> the need for your provider to implement its own recycling for the
> io_uring return case.
>
> Is letting providers bypass and override the page_pool's recycling in
> some code paths OK? IMO, no. A maintainer will make the judgement call
Mina, frankly, that's nonsense. If we extend the same logic,
devmem overrides page allocation rules with callbacks, devmem
overrides and violates page pool buffer lifetimes by extending
them to user space, and devmem violates and overrides the page pool
object lifetime by binding buffers to sockets. All of which
I'd rather call extending and enhancing to fit the devmem use
case.
> and speak authoritatively here and I will follow, but I do think it's
> a (much) worse design.
Sure, I have the completely opposite opinion: that's a much
better approach than returning through a syscall. But I will
agree with you that ultimately the maintainers will say whether
that's acceptable for the networking side or not.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 16:55 ` Mina Almasry
` (2 preceding siblings ...)
2024-10-09 18:21 ` Pedro Tammela
@ 2024-10-11 0:29 ` David Wei
2024-10-11 19:43 ` Mina Almasry
3 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-11 0:29 UTC (permalink / raw)
To: Mina Almasry
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, David Wei
On 2024-10-09 09:55, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>
>> This patchset adds support for zero copy rx into userspace pages using
>> io_uring, eliminating a kernel to user copy.
>>
>> We configure a page pool that a driver uses to fill a hw rx queue to
>> hand out user pages instead of kernel pages. Any data that ends up
>> hitting this hw rx queue will thus be dma'd into userspace memory
>> directly, without needing to be bounced through kernel memory. 'Reading'
>> data out of a socket instead becomes a _notification_ mechanism, where
>> the kernel tells userspace where the data is. The overall approach is
>> similar to the devmem TCP proposal.
>>
>> This relies on hw header/data split, flow steering and RSS to ensure
>> packet headers remain in kernel memory and only desired flows hit a hw
>> rx queue configured for zero copy. Configuring this is outside of the
>> scope of this patchset.
>>
>> We share netdev core infra with devmem TCP. The main difference is that
>> io_uring is used for the uAPI and the lifetime of all objects are bound
>> to an io_uring instance.
>
> I've been thinking about this a bit, and I hope this feedback isn't
> too late, but I think your work may be useful for users not using
> io_uring. I.e. zero copy to host memory that is not dependent on page
> aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
>
> If we refactor things around a bit we should be able to have the
> memory tied to the RX queue similar to what AF_XDP does, and then we
> should be able to zero copy to the memory via regular sockets and via
> io_uring. This will be useful for us and other applications that would
> like to ZC similar to what you're doing here but not necessarily
> through io_uring.
Using io_uring and trying to move away from a socket-based interface is
an explicit longer-term goal. I see your proposal of adding a
traditional socket-based API as orthogonal to what we're trying to do.
If someone is motivated enough to see this exist then they can build it
themselves.
>
>> Data is 'read' using a new io_uring request
>> type. When done, data is returned via a new shared refill queue. A zero
>> copy page pool refills a hw rx queue from this refill queue directly. Of
>> course, the lifetime of these data buffers are managed by io_uring
>> rather than the networking stack, with different refcounting rules.
>>
>> This patchset is the first step adding basic zero copy support. We will
>> extend this iteratively with new features e.g. dynamically allocated
>> zero copy areas, THP support, dmabuf support, improved copy fallback,
>> general optimisations and more.
>>
>> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
>> aren't included since Taehee Yoo has already sent a more comprehensive
>> patchset adding support in [1]. Google gve should already support this,
>
> This is an aside, but GVE supports this via the out-of-tree patches
> I've been carrying on github. Upstream we're working on adding the
> prerequisite page_pool support.
>
>> and Mellanox mlx5 support is WIP pending driver changes.
>>
>> ===========
>> Performance
>> ===========
>>
>> Test setup:
>> * AMD EPYC 9454
>> * Broadcom BCM957508 200G
>> * Kernel v6.11 base [2]
>> * liburing fork [3]
>> * kperf fork [4]
>> * 4K MTU
>> * Single TCP flow
>>
>> With application thread + net rx softirq pinned to _different_ cores:
>>
>> epoll
>> 82.2 Gbps
>>
>> io_uring
>> 116.2 Gbps (+41%)
>>
>> Pinned to _same_ core:
>>
>> epoll
>> 62.6 Gbps
>>
>> io_uring
>> 80.9 Gbps (+29%)
>>
>
> Are the 'epoll' results here and the 'io_uring' using TCP RX zerocopy
> [1] and io_uring zerocopy respectively?
>
> If not, I would like to see a comparison between TCP RX zerocopy and
> this new io-uring zerocopy. For Google for example we use the TCP RX
> zerocopy, I would like to see perf numbers possibly motivating us to
> move to this new thing.
No, it is comparing epoll without zero copy vs io_uring zero copy. Yes,
that's a fair request. I will add epoll with TCP_ZEROCOPY_RECEIVE to
kperf and compare.
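For reference, that comparison would use the usual mmap +
getsockopt(TCP_ZEROCOPY_RECEIVE) receive loop, roughly like below for a
single iteration (page alignment requirements and error handling
omitted; consume() and copy_buf are stand-ins):

/* needs <sys/mman.h>, <sys/socket.h>, <netinet/in.h>, <linux/tcp.h> */
void *addr = mmap(NULL, chunk, PROT_READ, MAP_SHARED, fd, 0);
struct tcp_zerocopy_receive zc = {
	.address = (__u64)(unsigned long)addr,
	.length  = chunk,
};
socklen_t zc_len = sizeof(zc);

if (!getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len)) {
	consume(addr, zc.length);		/* pages were remapped into our VMA */
	if (zc.recv_skip_hint)			/* tail that still has to be copied */
		recv(fd, copy_buf, zc.recv_skip_hint, 0);
}
munmap(addr, chunk);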
>
> [1] https://lwn.net/Articles/752046/
>
>
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
2024-10-10 21:22 ` Pavel Begunkov
@ 2024-10-11 0:32 ` Mina Almasry
2024-10-11 1:49 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-10-11 0:32 UTC (permalink / raw)
To: Pavel Begunkov
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Thu, Oct 10, 2024 at 2:22 PM Pavel Begunkov <[email protected]> wrote:
>
> > page_pool. To make matters worse, the bypass is only there if the
> > netmems are returned from io_uring, and not bypassed when the netmems
> > are returned from driver/tcp stack. I'm guessing if you reused the
> > page_pool recycling in the io_uring return path then it would remove
> > the need for your provider to implement its own recycling for the
> > io_uring return case.
> >
> > Is letting providers bypass and override the page_pool's recycling in
> > some code paths OK? IMO, no. A maintainer will make the judgement call
>
> Mina, frankly, that's nonsense. If we extend the same logic,
> devmem overrides page allocation rules with callbacks, devmem
> overrides and violates page pool buffer lifetimes by extending
> it to user space, devmem violates and overrides the page pool
> object lifetime by binding buffers to sockets. And all of it
> I'd rather name extends and enhances to fit in the devmem use
> case.
>
> > and speak authoritatively here and I will follow, but I do think it's
> > a (much) worse design.
>
> Sure, I have a completely opposite opinion, that's a much
> better approach than returning through a syscall, but I will
> agree with you that ultimately the maintainers will say if
> that's acceptable for the networking or not.
>
Right, I'm not suggesting that you return the pages through a syscall.
That would add syscall overhead when it's better not to have that,
especially in an io_uring context. Devmem TCP needed a syscall because I
couldn't figure out a non-syscall way with sockets for the userspace
to tell the kernel that it's done with some netmems. You do not need
to follow that at all. Sorry if I made it seem like so.
However, I'm suggesting that when io_uring figures out that the
userspace is done with a netmem, you feed that netmem back to the
pp and utilize the pp's recycling, rather than adding your own
recycling in the provider.
From your commit message:
"we extend the lifetime by recycling buffers only after the user space
acknowledges that it's done processing the data via the refill queue"
It seems to me that you get some signal from the userspace that data
is ready to be reused via that refill queue (whatever it is, very
sorry, I'm not that familiar with io_uring). My suggestion here is that
when the userspace tells you that a netmem is ready for reuse (however
it does that), you feed that page back to the pp via something
like napi_pp_put_page() or page_pool_put_page_bulk() if that makes
sense to you. FWIW I'm trying to look through your code to understand
what that refill queue is and where - if anywhere - it may be possible
to feed pages back to the pp, rather than directly to the provider.
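Something along these lines is what I have in mind (very rough; the
function is invented and I'm hand-waving how the batch gets popped from
your refill queue):

/* io_uring noticed userspace returned a batch of buffers via the refill queue */
static void io_zcrx_return_batch_to_pp(struct page_pool *pp,
				       netmem_ref *batch, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		/* let generic pp code decide: fast cache, ptr ring, or release */
		page_pool_put_full_netmem(pp, batch[i], false);
}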
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 18:21 ` Pedro Tammela
2024-10-10 13:19 ` Pavel Begunkov
@ 2024-10-11 0:35 ` David Wei
2024-10-11 14:28 ` Pedro Tammela
1 sibling, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-11 0:35 UTC (permalink / raw)
To: Pedro Tammela, Mina Almasry
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, David Wei
On 2024-10-09 11:21, Pedro Tammela wrote:
> On 09/10/2024 13:55, Mina Almasry wrote:
>> [...]
>>
>> If not, I would like to see a comparison between TCP RX zerocopy and
>> this new io-uring zerocopy. For Google for example we use the TCP RX
>> zerocopy, I would like to see perf numbers possibly motivating us to
>> move to this new thing.
>>
>> [1] https://lwn.net/Articles/752046/
>>
>
> Hi!
>
> From my own testing, the TCP RX Zerocopy is quite heavy on the page unmapping side. Since the io_uring implementation is expected to be lighter (see patch 11), I would expect a simple comparison to show better numbers for io_uring.
Hi Pedro, I will add TCP_ZEROCOPY_RECEIVE to kperf and compare in the
next patchset.
>
> To be fair to the existing implementation, it would then need to be paired with some 'real' computation, but that varies a lot. As we presented at netdevconf this year, HW-GRO eventually was the best option for us (no app changes, etc...) but it's still a case-by-case decision.
Why is there a need to add some computation to the benchmarks? A
benchmark is meant to be just that - a simple comparison that just looks
at the overheads of the stack. Real workloads are complex; I don't see
this feature as a universal win in all cases, but as very workload and
userspace architecture dependent.
As for HW-GRO, whynotboth.jpg?
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
2024-10-11 0:32 ` Mina Almasry
@ 2024-10-11 1:49 ` Pavel Begunkov
0 siblings, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-11 1:49 UTC (permalink / raw)
To: Mina Almasry
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On 10/11/24 01:32, Mina Almasry wrote:
> On Thu, Oct 10, 2024 at 2:22 PM Pavel Begunkov <[email protected]> wrote:
>>
>>> page_pool. To make matters worse, the bypass is only there if the
>>> netmems are returned from io_uring, and not bypassed when the netmems
>>> are returned from driver/tcp stack. I'm guessing if you reused the
>>> page_pool recycling in the io_uring return path then it would remove
>>> the need for your provider to implement its own recycling for the
>>> io_uring return case.
>>>
>>> Is letting providers bypass and override the page_pool's recycling in
>>> some code paths OK? IMO, no. A maintainer will make the judgement call
>>
>> Mina, frankly, that's nonsense. If we extend the same logic,
>> devmem overrides page allocation rules with callbacks, devmem
>> overrides and violates page pool buffer lifetimes by extending
>> it to user space, devmem violates and overrides the page pool
>> object lifetime by binding buffers to sockets. And all of it
>> I'd rather name extends and enhances to fit in the devmem use
>> case.
>>
>>> and speak authoritatively here and I will follow, but I do think it's
>>> a (much) worse design.
>>
>> Sure, I have a completely opposite opinion, that's a much
>> better approach than returning through a syscall, but I will
>> agree with you that ultimately the maintainers will say if
>> that's acceptable for the networking or not.
>>
>
> Right, I'm not suggesting that you return the pages through a syscall.
> That will add syscall overhead when it's better not to have that
> especially in io_uring context. Devmem TCP needed a syscall because I
> couldn't figure out a non-syscall way with sockets for the userspace
> to tell the kernel that it's done with some netmems. You do not need
> to follow that at all. Sorry if I made it seem like so.
>
> However, I'm suggesting that when io_uring figures out that the
> userspace is done with a netmem, that you feed that netmem back to the
> pp, and utilize the pp's recycling, rather than adding your own
> recycling in the provider.
I should spell it out somewhere in the commits: the difference is that we
let the page pool pull buffers instead of having a syscall push them
like devmem TCP does. With pushing, you'll be doing it from some task
context, and it'll need to find a way back into the page pool, via the ptr
ring or with the opportunistic optimisations napi_pp_put_page() provides.
And if you do it this way, the function is very useful.
With pulling though, returning already happens from within the page
pool's allocation path, in just the right context, which doesn't need
any additional locking / sync to access the page pool's napi/bh protected
caches etc. That's why it has the potential to be faster, and why,
optimisation-wise, napi_pp_put_page() doesn't make sense for this
case, i.e. there's no need to jump through hoops of finding how to transfer
a buffer to the page pool's context because we're already in there.
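In sketch form the pull looks something like this (simplified, helper
names are approximate rather than the exact ones from the patches):

/* runs inside the pp allocation path, i.e. already in napi/bh context */
static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
{
	struct io_zcrx_ifq *ifq = pp->mp_priv;
	struct net_iov *niov;

	/* pull a batch of buffers userspace has queued in the refill ring */
	while (pp->alloc.count < PP_ALLOC_CACHE_REFILL &&
	       (niov = io_zcrx_refill_ring_pop(ifq)) != NULL) {	/* hypothetical */
		if (!io_zcrx_put_uref(niov))	/* hypothetical: drop IO_ZC_RX_UREF */
			continue;		/* someone else still holds a reference */
		/* no extra locking or cross-context transfer needed,
		 * we're already in the page pool's own context */
		io_zc_add_pp_cache(pp, niov);
	}
	if (!pp->alloc.count)
		return 0;
	return pp->alloc.cache[--pp->alloc.count];
}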
> From your commit message:
>
> "we extend the lifetime by recycling buffers only after the user space
> acknowledges that it's done processing the data via the refill queue"
>
> It seems to me that you get some signal from the userspace that data
You don't even need to signal it; the page pool will take buffers
when it needs to allocate memory.
> is ready to be reused via that refill queue (whatever it is, very
> sorry, I'm not that familiar with io_uring). My suggestion here is
> when the userspace tells you that a netmem is ready for reuse (however
> it does that), that you feed that page back to the pp via something
> like napi_pp_put_page() or page_pool_put_page_bulk() if that makes
> sense to you. FWIW I'm trying to look through your code to understand
> what that refill queue is and where - if anywhere - it may be possible
> to feed pages back to the pp, rather than directly to the provider.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-09 16:12 ` Jens Axboe
@ 2024-10-11 6:15 ` David Wei
0 siblings, 0 replies; 124+ messages in thread
From: David Wei @ 2024-10-11 6:15 UTC (permalink / raw)
To: Jens Axboe, Pavel Begunkov, Joe Damato, io-uring, netdev,
Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry, David Wei
On 2024-10-09 09:12, Jens Axboe wrote:
> On 10/9/24 9:07 AM, Pavel Begunkov wrote:
>> On 10/9/24 00:10, Joe Damato wrote:
>>> On Mon, Oct 07, 2024 at 03:15:48PM -0700, David Wei wrote:
>>>> This patchset adds support for zero copy rx into userspace pages using
>>>> io_uring, eliminating a kernel to user copy.
>>>>
>>>> We configure a page pool that a driver uses to fill a hw rx queue to
>>>> hand out user pages instead of kernel pages. Any data that ends up
>>>> hitting this hw rx queue will thus be dma'd into userspace memory
>>>> directly, without needing to be bounced through kernel memory. 'Reading'
>>>> data out of a socket instead becomes a _notification_ mechanism, where
>>>> the kernel tells userspace where the data is. The overall approach is
>>>> similar to the devmem TCP proposal.
>>>>
>>>> This relies on hw header/data split, flow steering and RSS to ensure
>>>> packet headers remain in kernel memory and only desired flows hit a hw
>>>> rx queue configured for zero copy. Configuring this is outside of the
>>>> scope of this patchset.
>>>
>>> This looks super cool and very useful, thanks for doing this work.
>>>
>>> Is there any possibility of some notes or sample pseudo code on how
>>> userland can use this being added to Documentation/networking/ ?
>>
>> io_uring man pages would need to be updated with it, there are tests
>> in liburing, and it would be a good idea to add back a simple example
>> to liburing/example/*. I think that should cover it.
>
> man pages for sure, but +1 to the example too. Just a basic thing would
> get the point across, I think.
>
Yeah, there's the liburing side with helpers and all that which will get
manpages. We'll also put back a simple example demonstrating the uAPI.
The liburing changes will be sent as a separate patchset to the io-uring
list.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 13/15] io_uring/zcrx: add copy fallback
2024-10-09 23:05 ` Pavel Begunkov
@ 2024-10-11 6:22 ` David Wei
2024-10-11 14:43 ` Stanislav Fomichev
0 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-11 6:22 UTC (permalink / raw)
To: Pavel Begunkov, Stanislav Fomichev
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, David Wei
On 2024-10-09 16:05, Pavel Begunkov wrote:
> On 10/9/24 17:30, Stanislav Fomichev wrote:
>> On 10/08, David Wei wrote:
>>> On 2024-10-08 08:58, Stanislav Fomichev wrote:
>>>> On 10/07, David Wei wrote:
>>>>> From: Pavel Begunkov <[email protected]>
>>>>>
>>>>> There are scenarios in which the zerocopy path might get a normal
>>>>> in-kernel buffer, it could be a mis-steered packet or simply the linear
>>>>> part of an skb. Another use case is to allow the driver to allocate
>>>>> kernel pages when it's out of zc buffers, which makes it more resilient
>>>>> to spikes in load and allow the user to choose the balance between the
>>>>> amount of memory provided and performance.
>>>>
>>>> Tangential: should there be some clear way for the users to discover that
>>>> (some counter of some entry on cq about copy fallback)?
>>>>
>>>> Or the expectation is that somebody will run bpftrace to diagnose
>>>> (supposedly) poor ZC performance when it falls back to copy?
>>>
>>> Yeah there definitely needs to be a way to notify the user that copy
>>> fallback happened. Right now I'm relying on bpftrace hooking into
>>> io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
>>> too much. I can think of two other options:
>>>
>>> 1. Send a final cqe at the end of a number of frag cqes with a count of
>>> the number of copies.
>>> 2. Register a secondary area just for handling copies.
>>>
>>> Other suggestions are also very welcome.
>>
>> SG, thanks. Up to you and Pavel on the mechanism and whether to follow
>> up separately. Maybe even move this fallback (this patch) into that separate
>> series as well? Will be easier to review/accept the rest.
>
> I think it's fine to leave it? It shouldn't be particularly
> interesting to the net folks to review, and without it any skb
> with the linear part would break it, but perhaps it's not such
> a concern for bnxt.
>
My preference is to leave it. Actually from real workloads, fully
linearized skbs are not uncommon due to the minimum size for HDS to kick
in for bnxt. Taking this out would imo make this patchset functionally
broken. Since we're all in agreement here, let's defer the improvements
as a follow up.
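For when that follow-up happens, here is a rough sketch of option 1
from the quoted list above; the flag and helper names are made up for
illustration, and io_post_aux_cqe() is only assumed here to be the CQE
posting primitive:

#define ZCRX_CQE_F_COPY_SUMMARY	(1U << 15)	/* hypothetical cflags bit */

/*
 * Sketch: after the per-frag zcrx CQEs for a receive, post one summary
 * CQE whose res field carries how many frags fell back to a copy.
 */
static void zcrx_post_copy_summary(struct io_ring_ctx *ctx, u64 user_data,
				   int nr_copied)
{
	if (nr_copied)
		io_post_aux_cqe(ctx, user_data, nr_copied,
				ZCRX_CQE_F_COPY_SUMMARY);
}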
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-11 0:35 ` David Wei
@ 2024-10-11 14:28 ` Pedro Tammela
0 siblings, 0 replies; 124+ messages in thread
From: Pedro Tammela @ 2024-10-11 14:28 UTC (permalink / raw)
To: David Wei, Mina Almasry
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On 10/10/2024 21:35, David Wei wrote:
> On 2024-10-09 11:21, Pedro Tammela wrote:
>> On 09/10/2024 13:55, Mina Almasry wrote:
>>> [...]
>>>
>>> If not, I would like to see a comparison between TCP RX zerocopy and
>>> this new io-uring zerocopy. For Google for example we use the TCP RX
>>> zerocopy, I would like to see perf numbers possibly motivating us to
>>> move to this new thing.
>>>
>>> [1] https://lwn.net/Articles/752046/
>>>
>>
>> Hi!
>>
>> From my own testing, the TCP RX Zerocopy is quite heavy on the page unmapping side. Since the io_uring implementation is expected to be lighter (see patch 11), I would expect a simple comparison to show better numbers for io_uring.
>
> Hi Pedro, I will add TCP_ZEROCOPY_RECEIVE to kperf and compare in the
> next patchset.
>
>>
>> To be fair to the existing implementation, it would then need to be paired with some 'real' computation, but that varies a lot. As we presented at netdevconf this year, HW-GRO eventually was the best option for us (no app changes, etc...) but still a case by case decision.
>
> Why is there a need to add some computation to the benchmarks? A
> benchmark is meant to be just that - a simple comparison that just looks
> at the overheads of the stack.
For the use case we saw, streaming lots of data with zc, the RX pages
would linger in processing for a reasonable time, so the unmap cost was
amortized in the hotpath. That was not accounted for in our simple
benchmark.
So for Mina's case, I guess the only way to know for sure whether it's
worth it is to implement the io_uring approach and compare.
> Real workloads are complex, I don't see this feature as a universal
> win in all cases, but very workload and userspace architecture
> dependent.
100% agree here, that's our experience so far as well.
Just wanted to share this sentiment in my previous email.
I personally believe the io_uring approach will encompass more use cases
than the existing implementation.
>
> As for HW-GRO, whynotboth.jpg?
For us, the cost of changing the apps/services to accommodate rx zc
was prohibitive for now, which led us to stick with HW-GRO.
IIRC, you mentioned in netdevconf Meta uses a library for RPC, but we
don't have this luxury :/
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 13/15] io_uring/zcrx: add copy fallback
2024-10-11 6:22 ` David Wei
@ 2024-10-11 14:43 ` Stanislav Fomichev
0 siblings, 0 replies; 124+ messages in thread
From: Stanislav Fomichev @ 2024-10-11 14:43 UTC (permalink / raw)
To: David Wei
Cc: Pavel Begunkov, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/10, David Wei wrote:
> On 2024-10-09 16:05, Pavel Begunkov wrote:
> > On 10/9/24 17:30, Stanislav Fomichev wrote:
> >> On 10/08, David Wei wrote:
> >>> On 2024-10-08 08:58, Stanislav Fomichev wrote:
> >>>> On 10/07, David Wei wrote:
> >>>>> From: Pavel Begunkov <[email protected]>
> >>>>>
> >>>>> There are scenarios in which the zerocopy path might get a normal
> >>>>> in-kernel buffer, it could be a mis-steered packet or simply the linear
> >>>>> part of an skb. Another use case is to allow the driver to allocate
> >>>>> kernel pages when it's out of zc buffers, which makes it more resilient
> >>>>> to spikes in load and allow the user to choose the balance between the
> >>>>> amount of memory provided and performance.
> >>>>
> >>>> Tangential: should there be some clear way for the users to discover that
> >>>> (some counter of some entry on cq about copy fallback)?
> >>>>
> >>>> Or the expectation is that somebody will run bpftrace to diagnose
> >>>> (supposedly) poor ZC performance when it falls back to copy?
> >>>
> >>> Yeah there definitely needs to be a way to notify the user that copy
> >>> fallback happened. Right now I'm relying on bpftrace hooking into
> >>> io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
> >>> too much. I can think of two other options:
> >>>
> >>> 1. Send a final cqe at the end of a number of frag cqes with a count of
> >>> the number of copies.
> >>> 2. Register a secondary area just for handling copies.
> >>>
> >>> Other suggestions are also very welcome.
> >>
> >> SG, thanks. Up to you and Pavel on the mechanism and whether to follow
> >> up separately. Maybe even move this fallback (this patch) into that separate
> >> series as well? Will be easier to review/accept the rest.
> >
> > I think it's fine to leave it? It shouldn't be particularly
> > interesting to the net folks to review, and without it any skb
> > with the linear part would break it, but perhaps it's not such
> > a concern for bnxt.
> >
>
> My preference is to leave it. Actually from real workloads, fully
> linearized skbs are not uncommon due to the minimum size for HDS to kick
> in for bnxt. Taking this out would imo make this patchset functionally
> broken. Since we're all in agreement here, let's defer the improvements
> as a follow up.
Sounds good!
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 03/15] net: generalise net_iov chunk owners
2024-10-09 16:28 ` Stanislav Fomichev
@ 2024-10-11 18:44 ` David Wei
2024-10-11 22:02 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: David Wei @ 2024-10-11 18:44 UTC (permalink / raw)
To: Stanislav Fomichev, Pavel Begunkov
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry, David Wei
On 2024-10-09 09:28, Stanislav Fomichev wrote:
> On 10/08, Pavel Begunkov wrote:
>> On 10/8/24 16:46, Stanislav Fomichev wrote:
>>> On 10/07, David Wei wrote:
>>>> From: Pavel Begunkov <[email protected]>
>>>>
>>>> Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
>>>> which serves as a useful abstraction to share data and provide a
>>>> context. However, it's too devmem specific, and we want to reuse it for
>>>> other memory providers, and for that we need to decouple net_iov from
>>>> devmem. Make net_iov to point to a new base structure called
>>>> net_iov_area, which dmabuf_genpool_chunk_owner extends.
>>>>
>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>> Signed-off-by: David Wei <[email protected]>
>>>> ---
>>>> include/net/netmem.h | 21 ++++++++++++++++++++-
>>>> net/core/devmem.c | 25 +++++++++++++------------
>>>> net/core/devmem.h | 25 +++++++++----------------
>>>> 3 files changed, 42 insertions(+), 29 deletions(-)
>>>>
>>>> diff --git a/include/net/netmem.h b/include/net/netmem.h
>>>> index 8a6e20be4b9d..3795ded30d2c 100644
>>>> --- a/include/net/netmem.h
>>>> +++ b/include/net/netmem.h
>>>> @@ -24,11 +24,20 @@ struct net_iov {
>>>> unsigned long __unused_padding;
>>>> unsigned long pp_magic;
>>>> struct page_pool *pp;
>>>> - struct dmabuf_genpool_chunk_owner *owner;
>>>> + struct net_iov_area *owner;
>>>
>>> Any reason not to use dmabuf_genpool_chunk_owner as is (or rename it
>>> to net_iov_area to generalize) with the fields that you don't need
>>> set to 0/NULL? container_of makes everything harder to follow :-(
>>
>> It can be that, but then io_uring would have a (null) pointer to
>> struct net_devmem_dmabuf_binding it knows nothing about and other
>> fields devmem might add in the future. Also, it reduces the
>> temptation for the common code to make assumptions about the origin
>> of the area / pp memory provider. IOW, I think it's cleaner
>> when separated like in this patch.
>
> Ack, let's see whether other people find any issues with this approach.
> For me, it makes the devmem parts harder to read, so my preference
> is on dropping this patch and keeping owner=null on your side.
I don't mind at this point which approach to take right now. I would
prefer keeping dmabuf_genpool_chunk_owner today even if it results in a
nullptr in io_uring's case. Once there are more memory providers in the
future, I think it'll be clearer what sort of abstraction we might need
here.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 00/15] io_uring zero copy rx
2024-10-11 0:29 ` David Wei
@ 2024-10-11 19:43 ` Mina Almasry
0 siblings, 0 replies; 124+ messages in thread
From: Mina Almasry @ 2024-10-11 19:43 UTC (permalink / raw)
To: David Wei
Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Thu, Oct 10, 2024 at 5:29 PM David Wei <[email protected]> wrote:
>
> On 2024-10-09 09:55, Mina Almasry wrote:
> > On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
> >>
> >> This patchset adds support for zero copy rx into userspace pages using
> >> io_uring, eliminating a kernel to user copy.
> >>
> >> We configure a page pool that a driver uses to fill a hw rx queue to
> >> hand out user pages instead of kernel pages. Any data that ends up
> >> hitting this hw rx queue will thus be dma'd into userspace memory
> >> directly, without needing to be bounced through kernel memory. 'Reading'
> >> data out of a socket instead becomes a _notification_ mechanism, where
> >> the kernel tells userspace where the data is. The overall approach is
> >> similar to the devmem TCP proposal.
> >>
> >> This relies on hw header/data split, flow steering and RSS to ensure
> >> packet headers remain in kernel memory and only desired flows hit a hw
> >> rx queue configured for zero copy. Configuring this is outside of the
> >> scope of this patchset.
> >>
> >> We share netdev core infra with devmem TCP. The main difference is that
> >> io_uring is used for the uAPI and the lifetime of all objects are bound
> >> to an io_uring instance.
> >
> > I've been thinking about this a bit, and I hope this feedback isn't
> > too late, but I think your work may be useful for users not using
> > io_uring. I.e. zero copy to host memory that is not dependent on page
> > aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
> >
> > If we refactor things around a bit we should be able to have the
> > memory tied to the RX queue similar to what AF_XDP does, and then we
> > should be able to zero copy to the memory via regular sockets and via
> > io_uring. This will be useful for us and other applications that would
> > like to ZC similar to what you're doing here but not necessarily
> > through io_uring.
>
> Using io_uring and trying to move away from a socket based interface is
> an explicit longer term goal. I see your proposal of adding a
> traditional socket based API as orthogonal to what we're trying to do.
> If someone is motivated enough to see this exist then they can build it
> themselves.
>
Yes, that was what I was suggesting. I (or whoever interested) would
build it ourselves. Just calling out that your bits to bind umem to an
rx-queue and/or the memory provider could be reused if it is re-usable
(or can be made re-usable). From a quick look it seems fine, nothing
requested here from this series. Sorry I made it seem I was asking you
to implement a sockets extension :-)
> >
> >> Data is 'read' using a new io_uring request
> >> type. When done, data is returned via a new shared refill queue. A zero
> >> copy page pool refills a hw rx queue from this refill queue directly. Of
> >> course, the lifetime of these data buffers are managed by io_uring
> >> rather than the networking stack, with different refcounting rules.
> >>
> >> This patchset is the first step adding basic zero copy support. We will
> >> extend this iteratively with new features e.g. dynamically allocated
> >> zero copy areas, THP support, dmabuf support, improved copy fallback,
> >> general optimisations and more.
> >>
> >> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
> >> aren't included since Taehee Yoo has already sent a more comprehensive
> >> patchset adding support in [1]. Google gve should already support this,
> >
> > This is an aside, but GVE supports this via the out-of-tree patches
> > I've been carrying on github. Uptsream we're working on adding the
> > prerequisite page_pool support.
> >
> >> and Mellanox mlx5 support is WIP pending driver changes.
> >>
> >> ===========
> >> Performance
> >> ===========
> >>
> >> Test setup:
> >> * AMD EPYC 9454
> >> * Broadcom BCM957508 200G
> >> * Kernel v6.11 base [2]
> >> * liburing fork [3]
> >> * kperf fork [4]
> >> * 4K MTU
> >> * Single TCP flow
> >>
> >> With application thread + net rx softirq pinned to _different_ cores:
> >>
> >> epoll
> >> 82.2 Gbps
> >>
> >> io_uring
> >> 116.2 Gbps (+41%)
> >>
> >> Pinned to _same_ core:
> >>
> >> epoll
> >> 62.6 Gbps
> >>
> >> io_uring
> >> 80.9 Gbps (+29%)
> >>
> >
> > Is the 'epoll' results here and the 'io_uring' using TCP RX zerocopy
> > [1] and io_uring zerocopy respectively?
> >
> > If not, I would like to see a comparison between TCP RX zerocopy and
> > this new io-uring zerocopy. For Google for example we use the TCP RX
> > zerocopy, I would like to see perf numbers possibly motivating us to
> > move to this new thing.
>
> No, it is comparing epoll without zero copy vs io_uring zero copy. Yes,
> that's a fair request. I will add epoll with TCP_ZEROCOPY_RECEIVE to
> kperf and compare.
>
Awesome to hear. For us, we do use TCP_ZEROCOPY_RECEIVE (with
sockets), so I'm unsure how much benefit we'll see if we use this.
Comparing against TCP_ZEROCOPY_RECEIVE will be more of an
apples-to-apples comparison and also motivate folks using the old thing
to switch to the new thing.
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 03/15] net: generalise net_iov chunk owners
2024-10-11 18:44 ` David Wei
@ 2024-10-11 22:02 ` Pavel Begunkov
2024-10-11 22:25 ` Mina Almasry
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-11 22:02 UTC (permalink / raw)
To: David Wei, Stanislav Fomichev
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, Mina Almasry
On 10/11/24 19:44, David Wei wrote:
> On 2024-10-09 09:28, Stanislav Fomichev wrote:
>> On 10/08, Pavel Begunkov wrote:
>>> On 10/8/24 16:46, Stanislav Fomichev wrote:
>>>> On 10/07, David Wei wrote:
>>>>> From: Pavel Begunkov <[email protected]>
>>>>>
>>>>> Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
>>>>> which serves as a useful abstraction to share data and provide a
>>>>> context. However, it's too devmem specific, and we want to reuse it for
>>>>> other memory providers, and for that we need to decouple net_iov from
>>>>> devmem. Make net_iov to point to a new base structure called
>>>>> net_iov_area, which dmabuf_genpool_chunk_owner extends.
>>>>>
>>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>>> Signed-off-by: David Wei <[email protected]>
>>>>> ---
>>>>> include/net/netmem.h | 21 ++++++++++++++++++++-
>>>>> net/core/devmem.c | 25 +++++++++++++------------
>>>>> net/core/devmem.h | 25 +++++++++----------------
>>>>> 3 files changed, 42 insertions(+), 29 deletions(-)
>>>>>
>>>>> diff --git a/include/net/netmem.h b/include/net/netmem.h
>>>>> index 8a6e20be4b9d..3795ded30d2c 100644
>>>>> --- a/include/net/netmem.h
>>>>> +++ b/include/net/netmem.h
>>>>> @@ -24,11 +24,20 @@ struct net_iov {
>>>>> unsigned long __unused_padding;
>>>>> unsigned long pp_magic;
>>>>> struct page_pool *pp;
>>>>> - struct dmabuf_genpool_chunk_owner *owner;
>>>>> + struct net_iov_area *owner;
>>>>
>>>> Any reason not to use dmabuf_genpool_chunk_owner as is (or rename it
>>>> to net_iov_area to generalize) with the fields that you don't need
>>>> set to 0/NULL? container_of makes everything harder to follow :-(
>>>
>>> It can be that, but then io_uring would have a (null) pointer to
>>> struct net_devmem_dmabuf_binding it knows nothing about and other
>>> fields devmem might add in the future. Also, it reduces the
>>> temptation for the common code to make assumptions about the origin
>>> of the area / pp memory provider. IOW, I think it's cleaner
>>> when separated like in this patch.
>>
>> Ack, let's see whether other people find any issues with this approach.
>> For me, it makes the devmem parts harder to read, so my preference
>> is on dropping this patch and keeping owner=null on your side.
>
> I don't mind at this point which approach to take right now. I would
> prefer keeping dmabuf_genpool_chunk_owner today even if it results in a
> nullptr in io_uring's case. Once there are more memory providers in the
> future, I think it'll be clearer what sort of abstraction we might need
> here.
That's the thing about abstractions, if we say that devmem is the
only first class citizen for net_iov and everything else by definition
is 2nd class that should strictly follow devmem TCP patterns, and/or
that struct dmabuf_genpool_chunk_owner is an integral part of net_iov
and should be reused by everyone, then preserving the current state
of the chunk owner is likely the right long term approach. If not, and
net_iov is actually a generic piece of infrastructure, then IMHO there
is no place for devmem sticking out of every single bit of it, with
structures that are devmem specific and may not even be defined without
devmem TCP enabled (fwiw, which is not an actual problem for
compilation, just oddness).
This patch is one way to do it. The other way that comes to mind is to
convert that binding pointer field into a type-less void * context /
private pointer, but that seems worse. The difference starts at the
chunk owners, i.e. io_uring's area has to extend the structure, and
we'd still need to cast both that private field and the chunk owner /
area (with container_of), plus a couple more reasons on top.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 09/15] io_uring/zcrx: add interface queue and refill queue
2024-10-09 17:50 ` Jens Axboe
2024-10-09 18:09 ` Jens Axboe
2024-10-09 19:08 ` Pavel Begunkov
@ 2024-10-11 22:11 ` Pavel Begunkov
2024-10-13 17:32 ` David Wei
3 siblings, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-11 22:11 UTC (permalink / raw)
To: Jens Axboe, David Wei, io-uring, netdev
Cc: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern, Mina Almasry
On 10/9/24 18:50, Jens Axboe wrote:
> On 10/7/24 4:15 PM, David Wei wrote:
>> From: David Wei <[email protected]>
...
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index adc2524fd8e3..567cdb89711e 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -595,6 +597,9 @@ enum io_uring_register_op {
>> IORING_REGISTER_NAPI = 27,
>> IORING_UNREGISTER_NAPI = 28,
>>
>> + /* register a netdev hw rx queue for zerocopy */
>> + IORING_REGISTER_ZCRX_IFQ = 29,
>> +
>
> Will need to change as the current tree has moved a bit beyond this. Not
> a huge deal, just an FYI as it obviously impacts userspace too.
Forgot to mention, I'll rebase it for completeness for the next
iteration, but I expect it'll need staging a branch with the net/
changes that gets pulled into both trees, with the io_uring bits taken
on top afterwards, or something similar. We can probably deal with all
such io_uring conflicts at that time.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 03/15] net: generalise net_iov chunk owners
2024-10-11 22:02 ` Pavel Begunkov
@ 2024-10-11 22:25 ` Mina Almasry
2024-10-11 23:12 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-10-11 22:25 UTC (permalink / raw)
To: Pavel Begunkov
Cc: David Wei, Stanislav Fomichev, io-uring, netdev, Jens Axboe,
Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Fri, Oct 11, 2024 at 3:02 PM Pavel Begunkov <[email protected]> wrote:
>
> On 10/11/24 19:44, David Wei wrote:
> > On 2024-10-09 09:28, Stanislav Fomichev wrote:
> >> On 10/08, Pavel Begunkov wrote:
> >>> On 10/8/24 16:46, Stanislav Fomichev wrote:
> >>>> On 10/07, David Wei wrote:
> >>>>> From: Pavel Begunkov <[email protected]>
> >>>>>
> >>>>> Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
> >>>>> which serves as a useful abstraction to share data and provide a
> >>>>> context. However, it's too devmem specific, and we want to reuse it for
> >>>>> other memory providers, and for that we need to decouple net_iov from
> >>>>> devmem. Make net_iov to point to a new base structure called
> >>>>> net_iov_area, which dmabuf_genpool_chunk_owner extends.
> >>>>>
> >>>>> Signed-off-by: Pavel Begunkov <[email protected]>
> >>>>> Signed-off-by: David Wei <[email protected]>
> >>>>> ---
> >>>>> include/net/netmem.h | 21 ++++++++++++++++++++-
> >>>>> net/core/devmem.c | 25 +++++++++++++------------
> >>>>> net/core/devmem.h | 25 +++++++++----------------
> >>>>> 3 files changed, 42 insertions(+), 29 deletions(-)
> >>>>>
> >>>>> diff --git a/include/net/netmem.h b/include/net/netmem.h
> >>>>> index 8a6e20be4b9d..3795ded30d2c 100644
> >>>>> --- a/include/net/netmem.h
> >>>>> +++ b/include/net/netmem.h
> >>>>> @@ -24,11 +24,20 @@ struct net_iov {
> >>>>> unsigned long __unused_padding;
> >>>>> unsigned long pp_magic;
> >>>>> struct page_pool *pp;
> >>>>> - struct dmabuf_genpool_chunk_owner *owner;
> >>>>> + struct net_iov_area *owner;
> >>>>
> >>>> Any reason not to use dmabuf_genpool_chunk_owner as is (or rename it
> >>>> to net_iov_area to generalize) with the fields that you don't need
> >>>> set to 0/NULL? container_of makes everything harder to follow :-(
> >>>
> >>> It can be that, but then io_uring would have a (null) pointer to
> >>> struct net_devmem_dmabuf_binding it knows nothing about and other
> >>> fields devmem might add in the future. Also, it reduces the
> >>> temptation for the common code to make assumptions about the origin
> >>> of the area / pp memory provider. IOW, I think it's cleaner
> >>> when separated like in this patch.
> >>
> >> Ack, let's see whether other people find any issues with this approach.
> >> For me, it makes the devmem parts harder to read, so my preference
> >> is on dropping this patch and keeping owner=null on your side.
> >
> > I don't mind at this point which approach to take right now. I would
> > prefer keeping dmabuf_genpool_chunk_owner today even if it results in a
> > nullptr in io_uring's case. Once there are more memory providers in the
> > future, I think it'll be clearer what sort of abstraction we might need
> > here.
>
> That's the thing about abstractions, if we say that devmem is the
> only first class citizen for net_iov and everything else by definition
> is 2nd class that should strictly follow devmem TCP patterns, and/or
> that struct dmabuf_genpool_chunk_owner is an integral part of net_iov
> and should be reused by everyone, then preserving the current state
> of the chunk owner is likely the right long term approach. If not, and
> net_iov is actually a generic piece of infrastructure, then IMHO there
> is no place for devmem sticking out of every single bit of it, with
> structures that are devmem specific and may not even be defined without
> devmem TCP enabled (fwiw, which is not an actual problem for
> compilation, just oddness).
>
There is no intention of devmem TCP being a first class citizen or
anything. Abstractly speaking, we're going to draw a line in the sand
and say everything past this line is devmem specific and should be
replaced by other users. In this patch you drew the line between
dmabuf_genpool_chunk_owner and net_iov_area, which is fine by me on
first look. What Stan and I were thinking at first glance is
preserving dmabuf_* (and renaming) and drawing the line somewhere
else, which would have also been fine.
My real issue is whether its safe to do all this container_of while
not always checking explicitly for the type of net_iov. I'm not 100%
sure checking in tcp.c alone is enough, yet. I need to take a deeper
look, no changes requested from me yet.
FWIW I'm out for the next couple of weeks. I'll have time to take a
look during that but not as much as now.
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 03/15] net: generalise net_iov chunk owners
2024-10-11 22:25 ` Mina Almasry
@ 2024-10-11 23:12 ` Pavel Begunkov
0 siblings, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-11 23:12 UTC (permalink / raw)
To: Mina Almasry
Cc: David Wei, Stanislav Fomichev, io-uring, netdev, Jens Axboe,
Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On 10/11/24 23:25, Mina Almasry wrote:
> On Fri, Oct 11, 2024 at 3:02 PM Pavel Begunkov <[email protected]> wrote:
>>
>> On 10/11/24 19:44, David Wei wrote:
>>> On 2024-10-09 09:28, Stanislav Fomichev wrote:
>>>> On 10/08, Pavel Begunkov wrote:
>>>>> On 10/8/24 16:46, Stanislav Fomichev wrote:
>>>>>> On 10/07, David Wei wrote:
>>>>>>> From: Pavel Begunkov <[email protected]>
>>>>>>>
>>>>>>> Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
>>>>>>> which serves as a useful abstraction to share data and provide a
>>>>>>> context. However, it's too devmem specific, and we want to reuse it for
>>>>>>> other memory providers, and for that we need to decouple net_iov from
>>>>>>> devmem. Make net_iov to point to a new base structure called
>>>>>>> net_iov_area, which dmabuf_genpool_chunk_owner extends.
>>>>>>>
>>>>>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>>>>>> Signed-off-by: David Wei <[email protected]>
>>>>>>> ---
>>>>>>> include/net/netmem.h | 21 ++++++++++++++++++++-
>>>>>>> net/core/devmem.c | 25 +++++++++++++------------
>>>>>>> net/core/devmem.h | 25 +++++++++----------------
>>>>>>> 3 files changed, 42 insertions(+), 29 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/net/netmem.h b/include/net/netmem.h
>>>>>>> index 8a6e20be4b9d..3795ded30d2c 100644
>>>>>>> --- a/include/net/netmem.h
>>>>>>> +++ b/include/net/netmem.h
>>>>>>> @@ -24,11 +24,20 @@ struct net_iov {
>>>>>>> unsigned long __unused_padding;
>>>>>>> unsigned long pp_magic;
>>>>>>> struct page_pool *pp;
>>>>>>> - struct dmabuf_genpool_chunk_owner *owner;
>>>>>>> + struct net_iov_area *owner;
>>>>>>
>>>>>> Any reason not to use dmabuf_genpool_chunk_owner as is (or rename it
>>>>>> to net_iov_area to generalize) with the fields that you don't need
>>>>>> set to 0/NULL? container_of makes everything harder to follow :-(
>>>>>
>>>>> It can be that, but then io_uring would have a (null) pointer to
>>>>> struct net_devmem_dmabuf_binding it knows nothing about and other
>>>>> fields devmem might add in the future. Also, it reduces the
>>>>> temptation for the common code to make assumptions about the origin
>>>>> of the area / pp memory provider. IOW, I think it's cleaner
>>>>> when separated like in this patch.
>>>>
>>>> Ack, let's see whether other people find any issues with this approach.
>>>> For me, it makes the devmem parts harder to read, so my preference
>>>> is on dropping this patch and keeping owner=null on your side.
>>>
>>> I don't mind at this point which approach to take right now. I would
>>> prefer keeping dmabuf_genpool_chunk_owner today even if it results in a
>>> nullptr in io_uring's case. Once there are more memory providers in the
>>> future, I think it'll be clearer what sort of abstraction we might need
>>> here.
>>
>> That's the thing about abstractions, if we say that devmem is the
>> only first class citizen for net_iov and everything else by definition
>> is 2nd class that should strictly follow devmem TCP patterns, and/or
>> that struct dmabuf_genpool_chunk_owner is an integral part of net_iov
>> and should be reused by everyone, then preserving the current state
>> of the chunk owner is likely the right long term approach. If not, and
>> net_iov is actually a generic piece of infrastructure, then IMHO there
>> is no place for devmem sticking out of every single bit of it, with
>> structures that are devmem specific and may not even be defined without
>> devmem TCP enabled (fwiw, which is not an actual problem for
>> compilation, just oddness).
>>
>
> There is no intention of devmem TCP being a first class citizen or
> anything.
Let me note, to avoid being misread, that this kind of prioritisation
can have its place and that's fine, but it usually happens when you
build on top of older code or when the user base sizes are much
different. And again, theoretically dmabuf_genpool_chunk_owner could
be common code, i.e. if you want to use dmabuf you need to use the
structure regardless of the provider of choice, and it'll do all the
dmabuf handling. But the current chunk owner goes beyond that, and
would need some splitting if someone tried to build that kind of
abstraction.
> Abstractly speaking, we're going to draw a line in the sand
> and say everything past this line is devmem specific and should be
> replaced by other users. In this patch you drew the line between
> dmabuf_genpool_chunk_owner and net_iov_area, which is fine by me on
> first look. What Stan and I were thinking at first glance is
> preserving dmabuf_* (and renaming) and drawing the line somewhere
> else, which would have also been fine.
True enough, I drew the line where it was convenient: io_uring needs
an extendible abstraction that binds net_iovs, and we'll also have
several different sets of net_iovs, so it fell onto the object holding
the net_iov array as the most natural option. In that sense, we
could've had that binding pointing to an allocated io_zcrx_area, which
would then point further into io_uring, but that's one extra
indirection.
As a viable alternative that I don't like as much: instead of trying
to share struct net_iov_area, we could just make struct net_iov::owner
completely provider dependent by turning it into a void *, so
providers would be allowed to store there whatever they wish.
> My real issue is whether its safe to do all this container_of while
> not always checking explicitly for the type of net_iov. I'm not 100%
> sure checking in tcp.c alone is enough, yet. I need to take a deeper
> look, no changes requested from me yet.
That's done the typical way inheritance works everywhere in the
kernel. When you get into devmem.c's page pool callbacks, you know
for sure that the net_iovs passed in are devmem's; nobody should ever
take one net_iov's ops and call them with a different net_iov without
care, and the page pool follows that.
When someone wants to operate on devmem net_iovs but isn't in such a
callback, it has to validate the origin, as tcp.c now does in 5/15.
The nice part is that this patch changes types, so all such places
are either explicitly listed in this patch, or the code has to pass
through one of the devmem.h helpers, which again is trivially
checkable.
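A minimal sketch of that kind of origin check, reusing the illustrative
net_devmem_iov_to_chunk_owner() from earlier; the exact fields
consulted (e.g. where the provider ops pointer lives) are assumptions:

static inline bool net_is_devmem_iov(const struct net_iov *niov)
{
	/* assumption: the pp carries a pointer to its provider ops */
	return niov->pp && niov->pp->mp_ops == &dmabuf_devmem_ops;
}

static inline struct net_devmem_dmabuf_binding *
net_devmem_iov_binding_or_null(const struct net_iov *niov)
{
	/* validate the origin before downcasting with container_of() */
	if (!net_is_devmem_iov(niov))
		return NULL;
	return net_devmem_iov_to_chunk_owner(niov)->binding;
}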
> FWIW I'm out for the next couple of weeks. I'll have time to take a
> look during that but not as much as now.
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback
2024-10-10 17:54 ` Mina Almasry
@ 2024-10-13 17:25 ` David Wei
2024-10-14 13:37 ` Pavel Begunkov
2024-10-14 22:58 ` Mina Almasry
0 siblings, 2 replies; 124+ messages in thread
From: David Wei @ 2024-10-13 17:25 UTC (permalink / raw)
To: Mina Almasry, Pavel Begunkov
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern, David Wei
On 2024-10-10 10:54, Mina Almasry wrote:
> On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <[email protected]> wrote:
>>
>> On 10/9/24 22:00, Mina Almasry wrote:
>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>>>
>>>> From: Pavel Begunkov <[email protected]>
>>>>
>>>> page pool is now waiting for all ppiovs to return before destroying
>>>> itself, and for that to happen the memory provider might need to push
>>>> some buffers, flush caches and so on.
>>>>
>>>> todo: we'll try to get by without it before the final release
>>>>
>>>
>>> Is the intention to drop this todo and stick with this patch, or to
>>> move ahead with this patch?
>>
>> Heh, I overlooked this todo. The plan is to actually leave it
>> as is, it's by far the simplest way and doesn't really gets
>> into anyone's way as it's a slow path.
>>
>>> To be honest, I think I read in a follow up patch that you want to
>>> unref all the memory on page_pool_destory, which is not how the
>>> page_pool is used today. Tdoay page_pool_destroy does not reclaim
>>> memory. Changing that may be OK.
>>
>> It doesn't because it can't (not breaking anything), which is a
>> problem as the page pool might never get destroyed. io_uring
>> doesn't change that, a buffer can't be reclaimed while anything
>> in the kernel stack holds it. It's only when it's given to the
>> user we can force it back out of there.
The page pool will definitely be destroyed, the call to
netdev_rx_queue_restart() with mp_ops/mp_priv set to null and netdev
core will ensure that.
>>
>> And it has to happen one way or another, we can't trust the
>> user to put buffers back, it's just devmem does that by temporarily
>> attaching the lifetime of such buffers to a socket.
>>
>
> (noob question) does io_uring not have a socket equivalent that you
> can tie the lifetime of the buffers to? I'm thinking there must be
> one, because in your patches IIRC you have the fill queues and the
> memory you bind from the userspace, there should be something that
> tells you that the userspace has exited/crashed and it's time to now
> destroy the fill queue and unbind the memory, right?
>
> I'm thinking you may want to bind the lifetime of the buffers to that,
> instead of the lifetime of the pool. The pool will not be destroyed
> until the next driver/reset reconfiguration happens, right? That could
> be long long after the userspace has stopped using the memory.
>
Yes, there are io_uring objects e.g. interface queue that hold
everything together. IIRC page pool destroy doesn't unref but it waits
for all pages that are handed out to skbs to be returned. So for us,
below might work:
1. Call netdev_rx_queue_restart() which allocates a new pp for the rx
queue and tries to free the old pp
2. At this point we're guaranteed that any packets hitting this rx queue
will not go to user pages from our memory provider
3. Assume userspace is gone (either crash or gracefully terminating),
unref the uref for all pages, same as what scrub() is doing today
4. Any pages that are still in skb frags will get freed when the sockets
etc are closed
5. Rely on the pp delay release to eventually terminate and clean up
Let me know what you think Pavel.
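For illustration, the steps above could look roughly like the
following; netdev_rx_queue_restart() and the mp_ops/mp_priv fields are
the ones discussed in this thread, while everything io_uring-side
(struct io_zcrx_ifq and its fields, io_zcrx_drop_user_refs()) is
hypothetical:

static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
{
	struct netdev_rx_queue *rxq =
		__netif_get_rx_queue(ifq->dev, ifq->if_rxq);

	/* 1-2: detach the provider and restart the queue, after which no
	 * new packets can land in user memory from this provider */
	rxq->mp_params.mp_ops = NULL;
	rxq->mp_params.mp_priv = NULL;
	netdev_rx_queue_restart(ifq->dev, ifq->if_rxq);

	/* 3: user space is gone, drop the user refs we handed out */
	io_zcrx_drop_user_refs(ifq);

	/* 4-5: frags still sitting in skbs are freed as the sockets are
	 * closed, and the page pool's delayed release reclaims the rest */
}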
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 09/15] io_uring/zcrx: add interface queue and refill queue
2024-10-09 17:50 ` Jens Axboe
` (2 preceding siblings ...)
2024-10-11 22:11 ` Pavel Begunkov
@ 2024-10-13 17:32 ` David Wei
3 siblings, 0 replies; 124+ messages in thread
From: David Wei @ 2024-10-13 17:32 UTC (permalink / raw)
To: Jens Axboe, io-uring, netdev
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry,
David Wei
On 2024-10-09 10:50, Jens Axboe wrote:
> On 10/7/24 4:15 PM, David Wei wrote:
>> From: David Wei <[email protected]>
>>
>> Add a new object called an interface queue (ifq) that represents a net rx queue
>> that has been configured for zero copy. Each ifq is registered using a new
>> registration opcode IORING_REGISTER_ZCRX_IFQ.
>>
>> The refill queue is allocated by the kernel and mapped by userspace using a new
>> offset IORING_OFF_RQ_RING, in a similar fashion to the main SQ/CQ. It is used
>> by userspace to return buffers that it is done with, which will then be re-used
>> by the netdev again.
>>
>> The main CQ ring is used to notify userspace of received data by using the
>> upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each entry
>> contains the offset + len to the data.
>>
>> For now, each io_uring instance only has a single ifq.
>
> Looks pretty straight forward to me, but please wrap your commit
> messages at ~72 chars or it doesn't read so well in the git log.
Apologies, I rely on vim's text wrapping feature to format. I'll make
sure git commit messages are <72 chars in the future.
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 05/15] net: prepare for non devmem TCP memory providers
2024-10-09 21:45 ` Pavel Begunkov
@ 2024-10-13 22:33 ` Pavel Begunkov
0 siblings, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-13 22:33 UTC (permalink / raw)
To: Mina Almasry, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/9/24 22:45, Pavel Begunkov wrote:
> On 10/9/24 21:56, Mina Almasry wrote:
>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>>
>>> From: Pavel Begunkov <[email protected]>
>>>
>>> There is a good bunch of places in generic paths assuming that the only
>>> page pool memory provider is devmem TCP. As we want to reuse the net_iov
>>> and provider infrastructure, we need to patch it up and explicitly check
>>> the provider type when we branch into devmem TCP code.
>>>
>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>> Signed-off-by: David Wei <[email protected]>
>>> ---
>>> net/core/devmem.c | 4 ++--
>>> net/core/page_pool_user.c | 15 +++++++++------
>>> net/ipv4/tcp.c | 6 ++++++
>>> 3 files changed, 17 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/net/core/devmem.c b/net/core/devmem.c
>>> index 83d13eb441b6..b0733cf42505 100644
>>> --- a/net/core/devmem.c
>>> +++ b/net/core/devmem.c
>>> @@ -314,10 +314,10 @@ void dev_dmabuf_uninstall(struct net_device *dev)
>>> unsigned int i;
>>>
>>> for (i = 0; i < dev->real_num_rx_queues; i++) {
>>> - binding = dev->_rx[i].mp_params.mp_priv;
>>> - if (!binding)
>>> + if (dev->_rx[i].mp_params.mp_ops != &dmabuf_devmem_ops)
>>> continue;
>>>
>>
>> Sorry if I missed it (and please ignore me if I did), but
>> dmabuf_devmem_ops are maybe not defined yet?
>
> You exported it in devmem.h
A correction: it's this patchset that exposed it, in an earlier patch.
This place is fine, but I'll wrap the check in a function since it
causes compilation problems in other places for some configurations.
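Something along these lines, for example (the helper name is made up);
it keeps the &dmabuf_devmem_ops reference compiled only when devmem is
enabled:

#if defined(CONFIG_NET_DEVMEM)
static inline bool dev_rxq_is_devmem(const struct netdev_rx_queue *rxq)
{
	return rxq->mp_params.mp_ops == &dmabuf_devmem_ops;
}
#else
static inline bool dev_rxq_is_devmem(const struct netdev_rx_queue *rxq)
{
	return false;
}
#endif

so that callers like dev_dmabuf_uninstall() can do
"if (!dev_rxq_is_devmem(&dev->_rx[i])) continue;" without seeing the
ops structure directly.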
>> I'm also wondering how to find all the annyoing places where we need
>> to check this. Looks like maybe a grep for net_devmem_dmabuf_binding
>> is the way to go? I need to check whether these are all the places we
>> need the check but so far looks fine.
>
> I whac-a-mole'd them the best I can following recent devmem TCP
> changes. Would be great if you take a look and might remember
> some more places to check. And thanks for the review!
>
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 01/15] net: devmem: pull struct definitions out of ifdef
2024-10-10 18:57 ` Pavel Begunkov
@ 2024-10-13 22:38 ` Pavel Begunkov
0 siblings, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-13 22:38 UTC (permalink / raw)
To: Mina Almasry
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On 10/10/24 19:57, Pavel Begunkov wrote:
> On 10/10/24 19:01, Mina Almasry wrote:
...
>>
>> To be honest the tradeoff wins in the other direction for me. The
>> extra boiler plate is not that bad, and we can be sure that any code
>
> We can count how often people break builds because a change
> was compiled just with one configuration in mind. Unfortunately,
> I did it myself a fair share of times, and there is enough of
> build robot reports like that. It's not just about boiler plate
> but rather overall maintainability.
>
>> that touches net_devmem_dmabuf_binding will get a valid internals
>> since it won't compile if the feature is disabled. This could be
>> critical and could be preventing bugs.
>
> I don't see the concern, if devmem is compiled out there wouldn't
> be a devmem provider to even create it, and you don't need to
> worry. If you think someone would create a binding without a devmem,
> then I don't believe it'd be enough to hide a struct definition
> to prevent that in the first place.
>
> I think the maintainers can tell whichever way they think is
> better, I can drop the patch, even though I think it's much
> better with it.
On second thought, I'll drop the patch as asked. The change is
not essential to the series, and I shouldn't care about devmem
here.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* RE: [PATCH v1 00/15] io_uring zero copy rx
2024-10-10 18:11 ` Jens Axboe
@ 2024-10-14 8:42 ` David Laight
0 siblings, 0 replies; 124+ messages in thread
From: David Laight @ 2024-10-14 8:42 UTC (permalink / raw)
To: 'Jens Axboe', David Ahern, David Wei,
[email protected], [email protected]
Cc: Pavel Begunkov, Jakub Kicinski, Paolo Abeni, David S. Miller,
Eric Dumazet, Jesper Dangaard Brouer, Mina Almasry
...
> > Tap side still a mystery, but it unblocked testing. I'll figure that
> > part out separately.
>
> Further update - the above mystery was dhclient, thanks a lot to David
> for being able to figure that out very quickly.
I've seen that before - on the rx side.
Is there any way to defer the copy until the packet passes a filter?
Or better teach dhcp to use a normal UDP socket??
David
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback
2024-10-13 17:25 ` David Wei
@ 2024-10-14 13:37 ` Pavel Begunkov
2024-10-14 22:58 ` Mina Almasry
1 sibling, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-14 13:37 UTC (permalink / raw)
To: David Wei, Mina Almasry
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/13/24 18:25, David Wei wrote:
> On 2024-10-10 10:54, Mina Almasry wrote:
>> On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <[email protected]> wrote:
>>>
>>> On 10/9/24 22:00, Mina Almasry wrote:
>>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>>>>
>>>>> From: Pavel Begunkov <[email protected]>
>>>>>
>>>>> page pool is now waiting for all ppiovs to return before destroying
>>>>> itself, and for that to happen the memory provider might need to push
>>>>> some buffers, flush caches and so on.
>>>>>
>>>>> todo: we'll try to get by without it before the final release
>>>>>
>>>>
>>>> Is the intention to drop this todo and stick with this patch, or to
>>>> move ahead with this patch?
>>>
>>> Heh, I overlooked this todo. The plan is to actually leave it
>>> as is, it's by far the simplest way and doesn't really gets
>>> into anyone's way as it's a slow path.
>>>
>>>> To be honest, I think I read in a follow up patch that you want to
>>>> unref all the memory on page_pool_destory, which is not how the
>>>> page_pool is used today. Tdoay page_pool_destroy does not reclaim
>>>> memory. Changing that may be OK.
>>>
>>> It doesn't because it can't (not breaking anything), which is a
>>> problem as the page pool might never get destroyed. io_uring
>>> doesn't change that, a buffer can't be reclaimed while anything
>>> in the kernel stack holds it. It's only when it's given to the
>>> user we can force it back out of there.
>
> The page pool will definitely be destroyed, the call to
> netdev_rx_queue_restart() with mp_ops/mp_priv set to null and netdev
> core will ensure that.
>
>>>
>>> And it has to happen one way or another, we can't trust the
>>> user to put buffers back, it's just devmem does that by temporarily
>>> attaching the lifetime of such buffers to a socket.
>>>
>>
>> (noob question) does io_uring not have a socket equivalent that you
>> can tie the lifetime of the buffers to? I'm thinking there must be
>> one, because in your patches IIRC you have the fill queues and the
>> memory you bind from the userspace, there should be something that
>> tells you that the userspace has exited/crashed and it's time to now
>> destroy the fill queue and unbind the memory, right?
>>
>> I'm thinking you may want to bind the lifetime of the buffers to that,
>> instead of the lifetime of the pool. The pool will not be destroyed
>> until the next driver/reset reconfiguration happens, right? That could
>> be long long after the userspace has stopped using the memory.
>>
>
> Yes, there are io_uring objects e.g. interface queue that hold
> everything together. IIRC page pool destroy doesn't unref but it waits
> for all pages that are handed out to skbs to be returned. So for us,
> below might work:
>
> 1. Call netdev_rx_queue_restart() which allocates a new pp for the rx
> queue and tries to free the old pp
> 2. At this point we're guaranteed that any packets hitting this rx queue
> will not go to user pages from our memory provider
> 3. Assume userspace is gone (either crash or gracefully terminating),
> unref the uref for all pages, same as what scrub() is doing today
> 4. Any pages that are still in skb frags will get freed when the sockets
> etc are closed
> 5. Rely on the pp delay release to eventually terminate and clean up
>
> Let me know what you think Pavel.
I'll get to this comment a bit later when I get some time to
remember what races we have to deal with without the callback.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback
2024-10-13 17:25 ` David Wei
2024-10-14 13:37 ` Pavel Begunkov
@ 2024-10-14 22:58 ` Mina Almasry
2024-10-16 17:42 ` Pavel Begunkov
1 sibling, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-10-14 22:58 UTC (permalink / raw)
To: David Wei
Cc: Pavel Begunkov, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Sun, Oct 13, 2024 at 8:25 PM David Wei <[email protected]> wrote:
>
> On 2024-10-10 10:54, Mina Almasry wrote:
> > On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <[email protected]> wrote:
> >>
> >> On 10/9/24 22:00, Mina Almasry wrote:
> >>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
> >>>>
> >>>> From: Pavel Begunkov <[email protected]>
> >>>>
> >>>> page pool is now waiting for all ppiovs to return before destroying
> >>>> itself, and for that to happen the memory provider might need to push
> >>>> some buffers, flush caches and so on.
> >>>>
> >>>> todo: we'll try to get by without it before the final release
> >>>>
> >>>
> >>> Is the intention to drop this todo and stick with this patch, or to
> >>> move ahead with this patch?
> >>
> >> Heh, I overlooked this todo. The plan is to actually leave it
> >> as is, it's by far the simplest way and doesn't really gets
> >> into anyone's way as it's a slow path.
> >>
> >>> To be honest, I think I read in a follow up patch that you want to
> >>> unref all the memory on page_pool_destory, which is not how the
> >>> page_pool is used today. Tdoay page_pool_destroy does not reclaim
> >>> memory. Changing that may be OK.
> >>
> >> It doesn't because it can't (not breaking anything), which is a
> >> problem as the page pool might never get destroyed. io_uring
> >> doesn't change that, a buffer can't be reclaimed while anything
> >> in the kernel stack holds it. It's only when it's given to the
> >> user we can force it back out of there.
>
> The page pool will definitely be destroyed, the call to
> netdev_rx_queue_restart() with mp_ops/mp_priv set to null and netdev
> core will ensure that.
>
> >>
> >> And it has to happen one way or another, we can't trust the
> >> user to put buffers back, it's just devmem does that by temporarily
> >> attaching the lifetime of such buffers to a socket.
> >>
> >
> > (noob question) does io_uring not have a socket equivalent that you
> > can tie the lifetime of the buffers to? I'm thinking there must be
> > one, because in your patches IIRC you have the fill queues and the
> > memory you bind from the userspace, there should be something that
> > tells you that the userspace has exited/crashed and it's time to now
> > destroy the fill queue and unbind the memory, right?
> >
> > I'm thinking you may want to bind the lifetime of the buffers to that,
> > instead of the lifetime of the pool. The pool will not be destroyed
> > until the next driver/reset reconfiguration happens, right? That could
> > be long long after the userspace has stopped using the memory.
> >
>
> Yes, there are io_uring objects e.g. interface queue that hold
> everything together. IIRC page pool destroy doesn't unref but it waits
> for all pages that are handed out to skbs to be returned. So for us,
> below might work:
>
> 1. Call netdev_rx_queue_restart() which allocates a new pp for the rx
> queue and tries to free the old pp
> 2. At this point we're guaranteed that any packets hitting this rx queue
> will not go to user pages from our memory provider
> 3. Assume userspace is gone (either crash or gracefully terminating),
> unref the uref for all pages, same as what scrub() is doing today
> 4. Any pages that are still in skb frags will get freed when the sockets
> etc are closed
> 5. Rely on the pp delay release to eventually terminate and clean up
>
> Let me know what you think Pavel.
Something roughly along those lines sounds more reasonable to me.
The critical point is, as I said above, that if you free the memory
only when the pp is destroyed, then the memory lives on from one
io_uring ZC instance to the next. The next instance will see a reduced
address space because the previously destroyed io_uring ZC connection
did not free the memory. You could have users in production opening
thousands of io_uring ZC connections between rxq resets and not
cleaning up those connections. In that case I think they'll eventually
run out of memory, as the memory leaks until it's cleaned up with a pp
destroy (driver reset?).
--
Thanks,
Mina
^ permalink raw reply [flat|nested] 124+ messages in thread
* Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback
2024-10-14 22:58 ` Mina Almasry
@ 2024-10-16 17:42 ` Pavel Begunkov
2024-11-01 17:18 ` Mina Almasry
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-10-16 17:42 UTC (permalink / raw)
To: Mina Almasry, David Wei
Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
David Ahern
On 10/14/24 23:58, Mina Almasry wrote:
> On Sun, Oct 13, 2024 at 8:25 PM David Wei <[email protected]> wrote:
>>
>> On 2024-10-10 10:54, Mina Almasry wrote:
>>> On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <[email protected]> wrote:
>>>>
>>>> On 10/9/24 22:00, Mina Almasry wrote:
>>>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
>>>>>>
>>>>>> From: Pavel Begunkov <[email protected]>
>>>>>>
>>>>>> page pool is now waiting for all ppiovs to return before destroying
>>>>>> itself, and for that to happen the memory provider might need to push
>>>>>> some buffers, flush caches and so on.
>>>>>>
>>>>>> todo: we'll try to get by without it before the final release
>>>>>>
>>>>>
>>>>> Is the intention to drop this todo and stick with this patch, or to
>>>>> move ahead with this patch?
>>>>
>>>> Heh, I overlooked this todo. The plan is to actually leave it
>>>> as is, it's by far the simplest way and doesn't really get
>>>> into anyone's way as it's a slow path.
>>>>
>>>>> To be honest, I think I read in a follow up patch that you want to
>>>>> unref all the memory on page_pool_destroy, which is not how the
>>>>> page_pool is used today. Today page_pool_destroy does not reclaim
>>>>> memory. Changing that may be OK.
>>>>
>>>> It doesn't because it can't (without breaking things), which is a
>>>> problem as the page pool might never get destroyed. io_uring
>>>> doesn't change that, a buffer can't be reclaimed while anything
>>>> in the kernel stack holds it. It's only when it's given to the
>>>> user we can force it back out of there.
>>
>> The page pool will definitely be destroyed, the call to
>> netdev_rx_queue_restart() with mp_ops/mp_priv set to null and netdev
>> core will ensure that.
>>
>>>>
>>>> And it has to happen one way or another, we can't trust the
>>>> user to put buffers back, it's just devmem does that by temporarily
>>>> attaching the lifetime of such buffers to a socket.
>>>>
>>>
>>> (noob question) does io_uring not have a socket equivalent that you
>>> can tie the lifetime of the buffers to? I'm thinking there must be
You can say it is bound to io_uring / io_uring's object
representing the queue.
>>> one, because in your patches IIRC you have the fill queues and the
>>> memory you bind from the userspace, there should be something that
>>> tells you that the userspace has exited/crashed and it's time to now
>>> destroy the fill queue and unbind the memory, right?
>>>
>>> I'm thinking you may want to bind the lifetime of the buffers to that,
>>> instead of the lifetime of the pool. The pool will not be destroyed
>>> until the next driver/reset reconfiguration happens, right? That could
>>> be long long after the userspace has stopped using the memory.
io_uring will reset the queue if it dies / is requested to release
the queue.
>> Yes, there are io_uring objects e.g. interface queue that hold
>> everything together. IIRC page pool destroy doesn't unref but it waits
>> for all pages that are handed out to skbs to be returned. So for us,
>> below might work:
>>
>> 1. Call netdev_rx_queue_restart() which allocates a new pp for the rx
>> queue and tries to free the old pp
>> 2. At this point we're guaranteed that any packets hitting this rx queue
>> will not go to user pages from our memory provider
It's reasonable to assume that the driver will start destroying
the page pool, but I wouldn't rely on it when it comes to the
kernel state correctness, i.e. not crashing the kernel. It's a bit
fragile; drivers always tend to do all kinds of interesting stuff,
so I'd rather deal with a loud io_uring / page pool leak in case of
some weirdness. And that means we can't really guarantee the above
and need to care about not racing with allocations.
>> 3. Assume userspace is gone (either crash or gracefully terminating),
>> unref the uref for all pages, same as what scrub() is doing today
>> 4. Any pages that are still in skb frags will get freed when the sockets
>> etc are closed
And we need to prevent requests from receiving netmem that has
already been pushed to sockets.
>> 5. Rely on the pp delay release to eventually terminate and clean up
>>
>> Let me know what you think Pavel.
I think it's reasonable to leave it as is for now, I don't believe
anyone cares much about a simple slow path memory provider-only
callback. And we can always kill it later on if we find a good way
to synchronise pieces, which will be more apparent when we add some
more registration dynamism on top, when/if this patchset is merged.
In short, let's resend the series with the callback, see if
maintainers have a strong opinion, and otherwise I'd say it
should be fine as is.
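For readers skimming the thread, the shape of such a provider-only hook
is roughly the following. This is a sketch based on the patch title and
cover letter only; the surrounding ops, the mp_ops field and the exact
hook point on the page_pool_destroy() path are assumptions, not the
patch itself.

#include <net/netmem.h>
#include <net/page_pool/types.h>

/* Sketch only: member names and the hook placement are assumptions. */
struct memory_provider_ops {
        netmem_ref (*alloc_netmems)(struct page_pool *pool, gfp_t gfp);
        bool (*release_netmem)(struct page_pool *pool, netmem_ref netmem);
        int (*init)(struct page_pool *pool);
        void (*destroy)(struct page_pool *pool);
        /* new: called while the pool is being torn down so the provider
         * can push out buffers it still caches or holds for userspace */
        void (*scrub)(struct page_pool *pool);
};

/* somewhere on the page_pool_destroy() path (assumed mp_ops field) */
static void page_pool_scrub_provider(struct page_pool *pool)
{
        if (pool->mp_ops && pool->mp_ops->scrub)
                pool->mp_ops->scrub(pool);
}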
> Something roughly along those lines sounds more reasonable to me.
>
> The critical point is as I said above, if you free the memory only
> when the pp is destroyed, then the memory lives from 1 io_uring ZC
> instance to the next. The next instance will see a reduced address
> space because the previously destroyed io_uring ZC connection did not
> free the memory. You could have users in production opening thousands
> of io_uring ZC connections between rxq resets, and not cleaning up
> those connections. In that case I think eventually they'll run out of
> memory as the memory leaks until it's cleaned up with a pp destroy
> (driver reset?).
Not sure what giving memory from one io_uring zc instance to
another means. And it's perfectly valid to receive a buffer, close
the socket and only use the data afterwards; it logically belongs to
the user, not the socket. It's only bound to the io_uring zcrx/queue
object for cleanup purposes if io_uring goes down, which is different
from devmem TCP.
--
Pavel Begunkov
* Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback
2024-10-16 17:42 ` Pavel Begunkov
@ 2024-11-01 17:18 ` Mina Almasry
2024-11-01 18:35 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-11-01 17:18 UTC (permalink / raw)
To: Pavel Begunkov
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Wed, Oct 16, 2024 at 10:42 AM Pavel Begunkov <[email protected]> wrote:
>
> On 10/14/24 23:58, Mina Almasry wrote:
> > On Sun, Oct 13, 2024 at 8:25 PM David Wei <[email protected]> wrote:
> >>
> >> On 2024-10-10 10:54, Mina Almasry wrote:
> >>> On Wed, Oct 9, 2024 at 2:58 PM Pavel Begunkov <[email protected]> wrote:
> >>>>
> >>>> On 10/9/24 22:00, Mina Almasry wrote:
> >>>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <[email protected]> wrote:
> >>>>>>
> >>>>>> From: Pavel Begunkov <[email protected]>
> >>>>>>
> >>>>>> page pool is now waiting for all ppiovs to return before destroying
> >>>>>> itself, and for that to happen the memory provider might need to push
> >>>>>> some buffers, flush caches and so on.
> >>>>>>
> >>>>>> todo: we'll try to get by without it before the final release
> >>>>>>
> >>>>>
> >>>>> Is the intention to drop this todo and stick with this patch, or to
> >>>>> move ahead with this patch?
> >>>>
> >>>> Heh, I overlooked this todo. The plan is to actually leave it
> >>>> as is, it's by far the simplest way and doesn't really get
> >>>> into anyone's way as it's a slow path.
> >>>>
> >>>>> To be honest, I think I read in a follow up patch that you want to
> >>>>> unref all the memory on page_pool_destroy, which is not how the
> >>>>> page_pool is used today. Today page_pool_destroy does not reclaim
> >>>>> memory. Changing that may be OK.
> >>>>
> >>>> It doesn't because it can't (without breaking things), which is a
> >>>> problem as the page pool might never get destroyed. io_uring
> >>>> doesn't change that, a buffer can't be reclaimed while anything
> >>>> in the kernel stack holds it. It's only when it's given to the
> >>>> user we can force it back out of there.
> >>
> >> The page pool will definitely be destroyed, the call to
> >> netdev_rx_queue_restart() with mp_ops/mp_priv set to null and netdev
> >> core will ensure that.
> >>
> >>>>
> >>>> And it has to happen one way or another, we can't trust the
> >>>> user to put buffers back, it's just devmem does that by temporarily
> >>>> attaching the lifetime of such buffers to a socket.
> >>>>
> >>>
> >>> (noob question) does io_uring not have a socket equivalent that you
> >>> can tie the lifetime of the buffers to? I'm thinking there must be
>
> You can say it is bound to io_uring / io_uring's object
> representing the queue.
>
> >>> one, because in your patches IIRC you have the fill queues and the
> >>> memory you bind from the userspace, there should be something that
> >>> tells you that the userspace has exited/crashed and it's time to now
> >>> destroy the fill queue and unbind the memory, right?
> >>>
> >>> I'm thinking you may want to bind the lifetime of the buffers to that,
> >>> instead of the lifetime of the pool. The pool will not be destroyed
> >>> until the next driver/reset reconfiguration happens, right? That could
> >>> be long long after the userspace has stopped using the memory.
>
> io_uring will reset the queue if it dies / is requested to release
> the queue.
>
> >> Yes, there are io_uring objects e.g. interface queue that hold
> >> everything together. IIRC page pool destroy doesn't unref but it waits
> >> for all pages that are handed out to skbs to be returned. So for us,
> >> below might work:
> >>
> >> 1. Call netdev_rx_queue_restart() which allocates a new pp for the rx
> >> queue and tries to free the old pp
> >> 2. At this point we're guaranteed that any packets hitting this rx queue
> >> will not go to user pages from our memory provider
>
> It's reasonable to assume that the driver will start destroying
> the page pool, but I wouldn't rely on it when it comes to the
> kernel state correctness, i.e. not crashing the kernel. It's a bit
> fragile; drivers always tend to do all kinds of interesting stuff,
> so I'd rather deal with a loud io_uring / page pool leak in case of
> some weirdness. And that means we can't really guarantee the above
> and need to care about not racing with allocations.
>
> >> 3. Assume userspace is gone (either crash or gracefully terminating),
> >> unref the uref for all pages, same as what scrub() is doing today
> >> 4. Any pages that are still in skb frags will get freed when the sockets
> >> etc are closed
>
> And we need to prevent requests from receiving netmem that has
> already been pushed to sockets.
>
> >> 5. Rely on the pp delay release to eventually terminate and clean up
> >>
> >> Let me know what you think Pavel.
>
> I think it's reasonable to leave it as is for now, I don't believe
> anyone cares much about a simple slow path memory provider-only
> callback. And we can always kill it later on if we find a good way
> to synchronise pieces, which will be more apparent when we add some
> more registration dynamism on top, when/if this patchset is merged.
>
> In short, let's resend the series with the callback, see if
> maintainers have a strong opinion, and otherwise I'd say it
> should be fine as is.
>
> > Something roughly along those lines sounds more reasonable to me.
> >
> > The critical point is as I said above, if you free the memory only
> > when the pp is destroyed, then the memory lives from 1 io_uring ZC
> > instance to the next. The next instance will see a reduced address
> > space because the previously destroyed io_uring ZC connection did not
> > free the memory. You could have users in production opening thousands
> > of io_uring ZC connections between rxq resets, and not cleaning up
> > those connections. In that case I think eventually they'll run out of
> > memory as the memory leaks until it's cleaned up with a pp destroy
> > (driver reset?).
>
> Not sure what giving memory from one io_uring zc instance to
> another means. And it's perfectly valid to receive a buffer, close
> the socket and only use the data afterwards; it logically belongs to
> the user, not the socket. It's only bound to the io_uring zcrx/queue
> object for cleanup purposes if io_uring goes down, which is different
> from devmem TCP.
>
(responding here because I'm looking at the latest iteration after
vacation, but the discussion is here)
Huh, interesting. For devmem TCP we bind a region of memory to the
queue once, and after that we can create N connections all reusing the
same memory region. Is that not the case for io_uring? There are no
docs or selftests with the series to show sample code using this, but
the cover letter mentions that RSS + flow steering needs to be
configured for io ZC to work. The configuration of flow steering
implies that the user is responsible for initiating the connection. If
the user is initiating 1 connection then they can initiate many
without reconfiguring the memory binding, right?
When the user initiates the second connection, any pages not cleaned
up from the previous connection (because we're waiting for the scrub
callback to be hit) will be occupied when they should not be, right?
--
Thanks,
Mina
* Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback
2024-11-01 17:18 ` Mina Almasry
@ 2024-11-01 18:35 ` Pavel Begunkov
2024-11-01 19:24 ` Mina Almasry
0 siblings, 1 reply; 124+ messages in thread
From: Pavel Begunkov @ 2024-11-01 18:35 UTC (permalink / raw)
To: Mina Almasry
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On 11/1/24 17:18, Mina Almasry wrote:
> On Wed, Oct 16, 2024 at 10:42 AM Pavel Begunkov <[email protected]> wrote:
...
>>> The critical point is as I said above, if you free the memory only
>>> when the pp is destroyed, then the memory lives from 1 io_uring ZC
>>> instance to the next. The next instance will see a reduced address
>>> space because the previously destroyed io_uring ZC connection did not
>>> free the memory. You could have users in production opening thousands
>>> of io_uring ZC connections between rxq resets, and not cleaning up
>>> those connections. In that case I think eventually they'll run out of
>>> memory as the memory leaks until it's cleaned up with a pp destroy
>>> (driver reset?).
>>
>> Not sure what giving memory from one io_uring zc instance to
>> another means. And it's perfectly valid to receive a buffer, close
>> the socket and only use the data afterwards; it logically belongs to
>> the user, not the socket. It's only bound to the io_uring zcrx/queue
>> object for cleanup purposes if io_uring goes down, which is different
>> from devmem TCP.
>>
>
> (responding here because I'm looking at the latest iteration after
> vacation, but the discussion is here)
>
> Huh, interesting. For devmem TCP we bind a region of memory to the
> queue once, and after that we can create N connections all reusing the
> same memory region. Is that not the case for io_uring? There are no
Hmm, I think we already discussed the same question before. Yes, it
does indeed support an arbitrary number of connections. For what I was
saying above, the devmem TCP analogy would be attaching buffers to the
netlink socket instead of a tcp socket (that new xarray you added) when
you give it to user space. Then, you can close the connection after a
receive and the buffer you've got would still be alive.
That's pretty intuitive as well: with normal receives the kernel
doesn't nuke the buffer you got data into from a normal recv(2) just
because the connection got closed.
> docs or selftests with the series to show sample code using this, but
There should be a good bunch of tests in liburing if you follow the
links in the cover letter, as well as added support to some
benchmark tools, kperf and netbench. Also, as mentioned, we need to
add a simpler example to liburing; not sure why it was removed.
There will also be man pages; that's better done after merging
since things could change.
> the cover letter mentions that RSS + flow steering needs to be
> configured for io ZC to work. The configuration of flow steering
> implies that the user is responsible for initiating the connection. If
> the user is initiating 1 connection then they can initiate many
> without reconfiguring the memory binding, right?
Right
> When the user initiates the second connection, any pages not cleaned
> up from the previous connection (because we're waiting for the scrub
> callback to be hit) will be occupied when they should not be, right?
I'm not sure what you mean, but it seems like the question comes from
the assumption that it supports only one connection at a time,
which is not the case.
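Spelling out the recv(2) analogy above with plain sockets (ordinary
userspace C, nothing zcrx-specific):

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* With a normal copy receive the data lands in caller-owned memory, so
 * closing the socket does not take it away. The zcrx model keeps that
 * ownership: the buffer stays usable until the application returns it
 * via the refill ring, regardless of the connection's lifetime. */
static void recv_then_close(int fd)
{
        char buf[4096];
        ssize_t n = recv(fd, buf, sizeof(buf), 0);

        close(fd);                              /* the connection is gone... */
        if (n > 0)
                fwrite(buf, 1, n, stdout);      /* ...but the data is still ours */
}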
--
Pavel Begunkov
* Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback
2024-11-01 18:35 ` Pavel Begunkov
@ 2024-11-01 19:24 ` Mina Almasry
2024-11-01 21:38 ` Pavel Begunkov
0 siblings, 1 reply; 124+ messages in thread
From: Mina Almasry @ 2024-11-01 19:24 UTC (permalink / raw)
To: Pavel Begunkov
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On Fri, Nov 1, 2024 at 11:34 AM Pavel Begunkov <[email protected]> wrote:
>
> On 11/1/24 17:18, Mina Almasry wrote:
> > On Wed, Oct 16, 2024 at 10:42 AM Pavel Begunkov <[email protected]> wrote:
> ...
> >>> The critical point is as I said above, if you free the memory only
> >>> when the pp is destroyed, then the memory lives from 1 io_uring ZC
> >>> instance to the next. The next instance will see a reduced address
> >>> space because the previously destroyed io_uring ZC connection did not
> >>> free the memory. You could have users in production opening thousands
> >>> of io_uring ZC connections between rxq resets, and not cleaning up
> >>> those connections. In that case I think eventually they'll run out of
> >>> memory as the memory leaks until it's cleaned up with a pp destroy
> >>> (driver reset?).
> >>
> >> Not sure what giving memory from one io_uring zc instance to
> >> another means. And it's perfectly valid to receive a buffer, close
> >> the socket and only use the data afterwards; it logically belongs to
> >> the user, not the socket. It's only bound to the io_uring zcrx/queue
> >> object for cleanup purposes if io_uring goes down, which is different
> >> from devmem TCP.
> >>
> >
> > (responding here because I'm looking at the latest iteration after
> > vacation, but the discussion is here)
> >
> > Huh, interesting. For devmem TCP we bind a region of memory to the
> > queue once, and after that we can create N connections all reusing the
> > same memory region. Is that not the case for io_uring? There are no
>
> Hmm, I think we already discussed the same question before. Yes, it
> does indeed support an arbitrary number of connections. For what I was
> saying above, the devmem TCP analogy would be attaching buffers to the
> netlink socket instead of a tcp socket (that new xarray you added) when
> you give it to user space. Then, you can close the connection after a
> receive and the buffer you've got would still be alive.
>
Ah, I see. You're making a tradeoff here. You leave the buffers alive
after each connection so the userspace can still use them if it wishes
but they are of course unavailable for other connections.
But in our case (and I'm guessing yours) the process that will set up
the io_uring memory provider/RSS/flow steering will be a different
process from the one that sends/receives data, no? Because the former
requires CAP_NET_ADMIN privileges while the latter will not. If they
are 2 different processes, what happens when the latter process doing
the send/receive crashes? Does the memory stay unavailable until the
CAP_NET_ADMIN process exits? Wouldn't it be better to tie the lifetime
of the buffers to the connection? Sure, the buffers will become
unavailable after the connection is closed, but at least you don't
'leak' memory on send/receive process crashes.
Unless of course you're saying that only CAP_NET_ADMIN processes will
run io_zcrx connections. Then they can do their own mp setup/RSS/flow
steering and there is no concern when the process crashes because
everything will be cleaned up. But that's a big limitation to put on
the usage of the feature, no?
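For concreteness, "tying the lifetime of the buffers to the connection"
amounts to per-socket bookkeeping of every buffer handed out, so the
leftovers can be reclaimed when the socket dies. A generic sketch of
that pattern (not the devmem code itself; the helper names are invented
and the gfp flags arbitrary):

#include <linux/xarray.h>

/* Each buffer handed to userspace gets a token stored against the
 * socket; returning it erases the entry, and closing the socket can
 * walk whatever is left and release it. */
static int track_outstanding(struct xarray *outstanding, void *buf, u32 *token)
{
        /* one xa_alloc() per handout... */
        return xa_alloc(outstanding, token, buf, xa_limit_31b, GFP_KERNEL);
}

static void *release_outstanding(struct xarray *outstanding, u32 token)
{
        /* ...and one xa_erase() per return; this is the bookkeeping cost
         * the io_uring side avoids by binding buffers to the zcrx object
         * instead of to each socket */
        return xa_erase(outstanding, token);
}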
--
Thanks,
Mina
* Re: [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback
2024-11-01 19:24 ` Mina Almasry
@ 2024-11-01 21:38 ` Pavel Begunkov
0 siblings, 0 replies; 124+ messages in thread
From: Pavel Begunkov @ 2024-11-01 21:38 UTC (permalink / raw)
To: Mina Almasry
Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
Paolo Abeni, David S. Miller, Eric Dumazet,
Jesper Dangaard Brouer, David Ahern
On 11/1/24 19:24, Mina Almasry wrote:
> On Fri, Nov 1, 2024 at 11:34 AM Pavel Begunkov <[email protected]> wrote:
...
>>> Huh, interesting. For devmem TCP we bind a region of memory to the
>>> queue once, and after that we can create N connections all reusing the
>>> same memory region. Is that not the case for io_uring? There are no
>>
>> Hmm, I think we already discussed the same question before. Yes, it
>> does indeed support an arbitrary number of connections. For what I was
>> saying above, the devmem TCP analogy would be attaching buffers to the
>> netlink socket instead of a tcp socket (that new xarray you added) when
>> you give it to user space. Then, you can close the connection after a
>> receive and the buffer you've got would still be alive.
>>
>
> Ah, I see. You're making a tradeoff here. You leave the buffers alive
> after each connection so the userspace can still use them if it wishes
> but they are of course unavailable for other connections.
>
> But in our case (and I'm guessing yours) the process that will set up
> the io_uring memory provider/RSS/flow steering will be a different
> process from the one that sends/receives data, no? Because the former
> requires CAP_NET_ADMIN privileges while the latter will not. If they
> are 2 different processes, what happens when the latter process doing
> the send/receive crashes? Does the memory stay unavailable until the
> CAP_NET_ADMIN process exits? Wouldn't it be better to tie the lifetime
> of the buffers to the connection? Sure, the buffers will become
That's the tradeoff Google is willing to make in the framework,
which is fine, but it's not without cost, e.g. you need to
store/erase entries in the xarray, and it's a design choice in other
aspects, like you can't release the page pool if the socket you
got a buffer from is still alive but the net_iov hasn't been
returned.
> unavailable after the connection is closed, but at least you don't
> 'leak' memory on send/receive process crashes.
>
> Unless of course you're saying that only CAP_NET_ADMIN processes will
The user can pass the io_uring instance itself.
> run io_zcrx connections. Then they can do their own mp setup/RSS/flow
> steering and there is no concern when the process crashes because
> everything will be cleaned up. But that's a big limitation to put on
> the usage of the feature, no?
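On "the user can pass the io_uring instance itself": one plain way to
split the privileged setup from the unprivileged receive path is to do
the setup first, then fork and drop privileges, letting the worker
inherit the already prepared fd. A sketch under those assumptions (the
worker() callback and the uid/gid handling are placeholders):

#include <sys/types.h>
#include <unistd.h>

/* Privileged parent does the zcrx/RSS/flow steering setup, then spawns
 * an unprivileged child that only sends/receives on the inherited fd. */
static pid_t spawn_unprivileged_worker(int ring_fd, uid_t uid, gid_t gid,
                                       void (*worker)(int ring_fd))
{
        pid_t pid = fork();

        if (pid == 0) {
                /* child: give up the privileged ids, keep the inherited fd */
                if (setgid(gid) || setuid(uid))
                        _exit(1);
                worker(ring_fd);
                _exit(0);
        }
        return pid;
}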
--
Pavel Begunkov
Thread overview: 124+ messages
2024-10-07 22:15 [PATCH v1 00/15] io_uring zero copy rx David Wei
2024-10-07 22:15 ` [PATCH v1 01/15] net: devmem: pull struct definitions out of ifdef David Wei
2024-10-09 20:17 ` Mina Almasry
2024-10-09 23:16 ` Pavel Begunkov
2024-10-10 18:01 ` Mina Almasry
2024-10-10 18:57 ` Pavel Begunkov
2024-10-13 22:38 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 02/15] net: prefix devmem specific helpers David Wei
2024-10-09 20:19 ` Mina Almasry
2024-10-07 22:15 ` [PATCH v1 03/15] net: generalise net_iov chunk owners David Wei
2024-10-08 15:46 ` Stanislav Fomichev
2024-10-08 16:34 ` Pavel Begunkov
2024-10-09 16:28 ` Stanislav Fomichev
2024-10-11 18:44 ` David Wei
2024-10-11 22:02 ` Pavel Begunkov
2024-10-11 22:25 ` Mina Almasry
2024-10-11 23:12 ` Pavel Begunkov
2024-10-09 20:44 ` Mina Almasry
2024-10-09 22:13 ` Pavel Begunkov
2024-10-09 22:19 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 04/15] net: page_pool: create hooks for custom page providers David Wei
2024-10-09 20:49 ` Mina Almasry
2024-10-09 22:02 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 05/15] net: prepare for non devmem TCP memory providers David Wei
2024-10-09 20:56 ` Mina Almasry
2024-10-09 21:45 ` Pavel Begunkov
2024-10-13 22:33 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 06/15] net: page_pool: add ->scrub mem provider callback David Wei
2024-10-09 21:00 ` Mina Almasry
2024-10-09 21:59 ` Pavel Begunkov
2024-10-10 17:54 ` Mina Almasry
2024-10-13 17:25 ` David Wei
2024-10-14 13:37 ` Pavel Begunkov
2024-10-14 22:58 ` Mina Almasry
2024-10-16 17:42 ` Pavel Begunkov
2024-11-01 17:18 ` Mina Almasry
2024-11-01 18:35 ` Pavel Begunkov
2024-11-01 19:24 ` Mina Almasry
2024-11-01 21:38 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 07/15] net: page pool: add helper creating area from pages David Wei
2024-10-09 21:11 ` Mina Almasry
2024-10-09 21:34 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 08/15] net: add helper executing custom callback from napi David Wei
2024-10-08 22:25 ` Joe Damato
2024-10-09 15:09 ` Pavel Begunkov
2024-10-09 16:13 ` Joe Damato
2024-10-09 19:12 ` Pavel Begunkov
2024-10-07 22:15 ` [PATCH v1 09/15] io_uring/zcrx: add interface queue and refill queue David Wei
2024-10-09 17:50 ` Jens Axboe
2024-10-09 18:09 ` Jens Axboe
2024-10-09 19:08 ` Pavel Begunkov
2024-10-11 22:11 ` Pavel Begunkov
2024-10-13 17:32 ` David Wei
2024-10-07 22:15 ` [PATCH v1 10/15] io_uring/zcrx: add io_zcrx_area David Wei
2024-10-09 18:02 ` Jens Axboe
2024-10-09 19:05 ` Pavel Begunkov
2024-10-09 19:06 ` Jens Axboe
2024-10-09 21:29 ` Mina Almasry
2024-10-07 22:15 ` [PATCH v1 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
2024-10-09 18:10 ` Jens Axboe
2024-10-09 22:01 ` Mina Almasry
2024-10-09 22:58 ` Pavel Begunkov
2024-10-10 18:19 ` Mina Almasry
2024-10-10 20:26 ` Pavel Begunkov
2024-10-10 20:53 ` Mina Almasry
2024-10-10 20:58 ` Mina Almasry
2024-10-10 21:22 ` Pavel Begunkov
2024-10-11 0:32 ` Mina Almasry
2024-10-11 1:49 ` Pavel Begunkov
2024-10-07 22:16 ` [PATCH v1 12/15] io_uring/zcrx: add io_recvzc request David Wei
2024-10-09 18:28 ` Jens Axboe
2024-10-09 18:51 ` Pavel Begunkov
2024-10-09 19:01 ` Jens Axboe
2024-10-09 19:27 ` Pavel Begunkov
2024-10-09 19:42 ` Jens Axboe
2024-10-09 19:47 ` Pavel Begunkov
2024-10-09 19:50 ` Jens Axboe
2024-10-07 22:16 ` [PATCH v1 13/15] io_uring/zcrx: add copy fallback David Wei
2024-10-08 15:58 ` Stanislav Fomichev
2024-10-08 16:39 ` Pavel Begunkov
2024-10-08 16:40 ` David Wei
2024-10-09 16:30 ` Stanislav Fomichev
2024-10-09 23:05 ` Pavel Begunkov
2024-10-11 6:22 ` David Wei
2024-10-11 14:43 ` Stanislav Fomichev
2024-10-09 18:38 ` Jens Axboe
2024-10-07 22:16 ` [PATCH v1 14/15] io_uring/zcrx: set pp memory provider for an rx queue David Wei
2024-10-09 18:42 ` Jens Axboe
2024-10-10 13:09 ` Pavel Begunkov
2024-10-10 13:19 ` Jens Axboe
2024-10-07 22:16 ` [PATCH v1 15/15] io_uring/zcrx: throttle receive requests David Wei
2024-10-09 18:43 ` Jens Axboe
2024-10-07 22:20 ` [PATCH v1 00/15] io_uring zero copy rx David Wei
2024-10-08 23:10 ` Joe Damato
2024-10-09 15:07 ` Pavel Begunkov
2024-10-09 16:10 ` Joe Damato
2024-10-09 16:12 ` Jens Axboe
2024-10-11 6:15 ` David Wei
2024-10-09 15:27 ` Jens Axboe
2024-10-09 15:38 ` David Ahern
2024-10-09 15:43 ` Jens Axboe
2024-10-09 15:49 ` Pavel Begunkov
2024-10-09 15:50 ` Jens Axboe
2024-10-09 16:35 ` David Ahern
2024-10-09 16:50 ` Jens Axboe
2024-10-09 16:53 ` Jens Axboe
2024-10-09 17:12 ` Jens Axboe
2024-10-10 14:21 ` Jens Axboe
2024-10-10 15:03 ` David Ahern
2024-10-10 15:15 ` Jens Axboe
2024-10-10 18:11 ` Jens Axboe
2024-10-14 8:42 ` David Laight
2024-10-09 16:55 ` Mina Almasry
2024-10-09 16:57 ` Jens Axboe
2024-10-09 19:32 ` Mina Almasry
2024-10-09 19:43 ` Pavel Begunkov
2024-10-09 19:47 ` Jens Axboe
2024-10-09 17:19 ` David Ahern
2024-10-09 18:21 ` Pedro Tammela
2024-10-10 13:19 ` Pavel Begunkov
2024-10-11 0:35 ` David Wei
2024-10-11 14:28 ` Pedro Tammela
2024-10-11 0:29 ` David Wei
2024-10-11 19:43 ` Mina Almasry