[PATCH v7 00/15] io_uring zero copy rx

public inbox for [email protected]
 help / color / mirror / Atom feed

* [PATCH v7 00/15] io_uring zero copy rx
@ 2024-10-29 23:05 David Wei
  2024-10-29 23:05 ` [PATCH v7 01/15] net: prefix devmem specific helpers David Wei
                   ` (14 more replies)
  0 siblings, 15 replies; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

This patchset adds support for zero copy rx into userspace pages using
io_uring, eliminating a kernel to user copy.

We configure a page pool that a driver uses to fill a hw rx queue to
hand out user pages instead of kernel pages. Any data that ends up
hitting this hw rx queue will thus be dma'd into userspace memory
directly, without needing to be bounced through kernel memory. 'Reading'
data out of a socket instead becomes a _notification_ mechanism, where
the kernel tells userspace where the data is. The overall approach is
similar to the devmem TCP proposal.

This relies on hw header/data split, flow steering and RSS to ensure
packet headers remain in kernel memory and only desired flows hit a hw
rx queue configured for zero copy. Configuring this is outside of the
scope of this patchset.

We share netdev core infra with devmem TCP. The main difference is that
io_uring is used for the uAPI and the lifetime of all objects are bound
to an io_uring instance. Data is 'read' using a new io_uring request
type. When done, data is returned via a new shared refill queue. A zero
copy page pool refills a hw rx queue from this refill queue directly. Of
course, the lifetime of these data buffers are managed by io_uring
rather than the networking stack, with different refcounting rules.

This patchset is the first step adding basic zero copy support. We will
extend this iteratively with new features e.g. dynamically allocated
zero copy areas, THP support, dmabuf support, improved copy fallback,
general optimisations and more.

In terms of netdev support, we're first targeting Broadcom bnxt. Patches
aren't included since Taehee Yoo has already sent a more comprehensive
patchset adding support in [1]. Google gve should already support this,
and Mellanox mlx5 support is WIP pending driver changes.

===========
Performance
===========

Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet.

Test setup:
* AMD EPYC 9454
* Broadcom BCM957508 200G
* Kernel v6.11 base [2]
* liburing fork [3]
* kperf fork [4]
* 4K MTU
* Single TCP flow

With application thread + net rx softirq pinned to _different_ cores:

+-------------------------------+
| epoll     | io_uring          |
|-----------|-------------------|
| 82.2 Gbps | 116.2 Gbps (+41%) |
+-------------------------------+

Pinned to _same_ core:

+-------------------------------+
| epoll     | io_uring          |
|-----------|-------------------|
| 62.6 Gbps | 80.9 Gbps (+29%)  |
+-------------------------------+

==============
Patch overview
==============

Networking folks would be mostly interested in patches 1-8, 11, 12 and
14. Patches 1-8 make the necessary prerequisite changes in netdev core.

Patch 11 implements struct memory_provider_ops, patch 12 adds a
recv_actor_t func used with tcp_read_sock(), and patch 14 passes it all
to netdev via the queue API.

io_uring folks would be mostly interested in patches 9-15:

* Initial registration that sets up a hw rx queue.
* Shared ringbuf for userspace to return buffers.
* New request type for doing zero copy rx reads.

=====
Links
=====

Broadcom bnxt support:
[1]: https://lore.kernel.org/netdev/[email protected]/

Linux kernel branch:
[2]: https://github.com/isilence/linux.git zcrx/v7

liburing for testing:
[3]: https://github.com/isilence/liburing.git zcrx/next

kperf for testing:
[4]: https://git.kernel.dk/kperf.git

Changes in v7:
--------------
net:
* Use NAPI_F_PREFER_BUSY_POLL for napi_execute + stylistics changes.

Changes in v6:
--------------
Please note: Comparison with TCP_ZEROCOPY_RECEIVE isn't done yet.

net:
* Drop a devmem.h clean up patch.
* Migrate to netdev_get_by_index from deprecated API.
* Fix !CONFIG_NET_DEVMEM build.
* Don’t return into the page pool cache directly, use a new helper 
* Refactor napi_execute

io_uring:
* Require IORING_RECV_MULTISHOT flag set.
* Add unselectable CONFIG_IO_URING_ZCRX.
* Pulled latest io_uring changes.
* Unexport io_uring_pp_zc_ops.

Changes in v5:
--------------
* Rebase on top of merged net_iov + netmem infra.
* Decouple net_iov from devmem TCP.
* Use netdev queue API to allocate an rx queue.
* Minor uAPI enhancements for future extensibility.
* QoS improvements with request throttling.

Changes in RFC v4:
------------------
* Rebased on top of Mina Almasry's TCP devmem patchset and latest
  net-next, now sharing common infra e.g.:
    * netmem_t and net_iovs
    * Page pool memory provider
* The registered buffer (rbuf) completion queue where completions from
  io_recvzc requests are posted is removed. Now these post into the main
  completion queue, using big (32-byte) CQEs. The first 16 bytes is an
  ordinary CQE, while the latter 16 bytes contain the io_uring_rbuf_cqe
  as before. This vastly simplifies the uAPI and removes a level of
  indirection in userspace when looking for payloads.
  * The rbuf refill queue is still needed for userspace to return
    buffers to kernel.
* Simplified code and uAPI on the io_uring side, particularly
  io_recvzc() and io_zc_rx_recv(). Many unnecessary lines were removed
  e.g. extra msg flags, readlen, etc.

Changes in RFC v3:
------------------
* Rebased on top of Jakub Kicinski's memory provider API RFC. The ZC
  pool added is now a backend for memory provider.
* We're also reusing ppiov infrastructure. The refcounting rules stay
  the same but it's shifted into ppiov->refcount. That lets us to
  flexibly manage buffer lifetimes without adding any extra code to the
  common networking paths. It'd also make it easier to support dmabufs
  and device memory in the future.
  * io_uring also knows about pages, and so ppiovs might unnecessarily
    break tools inspecting data, that can easily be solved later.

Many patches are not for upstream as they depend on work in progress,
namely from Mina:

* struct netmem_t
* Driver ndo commands for Rx queue configs
* struct page_pool_iov and shared pp infra

Changes in RFC v2:
------------------
* Added copy fallback support if userspace memory allocated for ZC Rx
  runs out, or if header splitting or flow steering fails.
* Added veth support for ZC Rx, for testing and demonstration. We will
  need to figure out what driver would be best for such testing
  functionality in the future. Perhaps netdevsim?
* Added socket registration API to io_uring to associate specific
  sockets with ifqs/Rx queues for ZC.
* Added multi-socket support, such that multiple connections can be
  steered into the same hardware Rx queue.
* Added Netbench server/client support.

David Wei (4):
  io_uring/zcrx: add interface queue and refill queue
  io_uring/zcrx: add io_zcrx_area
  io_uring/zcrx: add io_recvzc request
  io_uring/zcrx: set pp memory provider for an rx queue

Jakub Kicinski (1):
  net: page_pool: create hooks for custom page providers

Pavel Begunkov (10):
  net: prefix devmem specific helpers
  net: generalise net_iov chunk owners
  net: prepare for non devmem TCP memory providers
  net: page_pool: add ->scrub mem provider callback
  net: page pool: add helper creating area from pages
  net: page_pool: introduce page_pool_mp_return_in_cache
  net: add helper executing custom callback from napi
  io_uring/zcrx: implement zerocopy receive pp memory provider
  io_uring/zcrx: add copy fallback
  io_uring/zcrx: throttle receive requests

 Kconfig                                 |   2 +
 include/linux/io_uring_types.h          |   3 +
 include/net/busy_poll.h                 |   6 +
 include/net/netmem.h                    |  21 +-
 include/net/page_pool/memory_provider.h |  14 +
 include/net/page_pool/types.h           |  10 +
 include/uapi/linux/io_uring.h           |  54 ++
 io_uring/KConfig                        |  10 +
 io_uring/Makefile                       |   1 +
 io_uring/io_uring.c                     |   7 +
 io_uring/io_uring.h                     |  10 +
 io_uring/memmap.c                       |   8 +
 io_uring/net.c                          |  74 +++
 io_uring/opdef.c                        |  16 +
 io_uring/register.c                     |   7 +
 io_uring/rsrc.c                         |   2 +-
 io_uring/rsrc.h                         |   1 +
 io_uring/zcrx.c                         | 838 ++++++++++++++++++++++++
 io_uring/zcrx.h                         |  76 +++
 net/core/dev.c                          |  81 ++-
 net/core/devmem.c                       |  51 +-
 net/core/devmem.h                       |  45 +-
 net/core/page_pool.c                    | 102 ++-
 net/core/page_pool_user.c               |  15 +-
 net/ipv4/tcp.c                          |   8 +-
 25 files changed, 1388 insertions(+), 74 deletions(-)
 create mode 100644 include/net/page_pool/memory_provider.h
 create mode 100644 io_uring/KConfig
 create mode 100644 io_uring/zcrx.c
 create mode 100644 io_uring/zcrx.h

-- 
2.43.5

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH v7 01/15] net: prefix devmem specific helpers
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-11-01 14:57   ` Mina Almasry
  2024-10-29 23:05 ` [PATCH v7 02/15] net: generalise net_iov chunk owners David Wei
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: Pavel Begunkov <[email protected]>

Add prefixes to all helpers that are specific to devmem TCP, i.e.
net_iov_binding[_id].

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 net/core/devmem.c |  2 +-
 net/core/devmem.h | 14 +++++++-------
 net/ipv4/tcp.c    |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/core/devmem.c b/net/core/devmem.c
index 11b91c12ee11..858982858f81 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -93,7 +93,7 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
 
 void net_devmem_free_dmabuf(struct net_iov *niov)
 {
-	struct net_devmem_dmabuf_binding *binding = net_iov_binding(niov);
+	struct net_devmem_dmabuf_binding *binding = net_devmem_iov_binding(niov);
 	unsigned long dma_addr = net_devmem_get_dma_addr(niov);
 
 	if (WARN_ON(!gen_pool_has_addr(binding->chunk_pool, dma_addr,
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 76099ef9c482..99782ddeca40 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -86,11 +86,16 @@ static inline unsigned int net_iov_idx(const struct net_iov *niov)
 }
 
 static inline struct net_devmem_dmabuf_binding *
-net_iov_binding(const struct net_iov *niov)
+net_devmem_iov_binding(const struct net_iov *niov)
 {
 	return net_iov_owner(niov)->binding;
 }
 
+static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
+{
+	return net_devmem_iov_binding(niov)->id;
+}
+
 static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
 {
 	struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
@@ -99,11 +104,6 @@ static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
 	       ((unsigned long)net_iov_idx(niov) << PAGE_SHIFT);
 }
 
-static inline u32 net_iov_binding_id(const struct net_iov *niov)
-{
-	return net_iov_owner(niov)->binding->id;
-}
-
 static inline void
 net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding)
 {
@@ -171,7 +171,7 @@ static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
 	return 0;
 }
 
-static inline u32 net_iov_binding_id(const struct net_iov *niov)
+static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
 {
 	return 0;
 }
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 82cc4a5633ce..e928efc22f80 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2494,7 +2494,7 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 
 				/* Will perform the exchange later */
 				dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx];
-				dmabuf_cmsg.dmabuf_id = net_iov_binding_id(niov);
+				dmabuf_cmsg.dmabuf_id = net_devmem_iov_binding_id(niov);
 
 				offset += copy;
 				remaining_len -= copy;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 02/15] net: generalise net_iov chunk owners
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
  2024-10-29 23:05 ` [PATCH v7 01/15] net: prefix devmem specific helpers David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-10-29 23:05 ` [PATCH v7 03/15] net: page_pool: create hooks for custom page providers David Wei
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: Pavel Begunkov <[email protected]>

Currently net_iov stores a pointer to struct dmabuf_genpool_chunk_owner,
which serves as a useful abstraction to share data and provide a
context. However, it's too devmem specific, and we want to reuse it for
other memory providers, and for that we need to decouple net_iov from
devmem. Make net_iov to point to a new base structure called
net_iov_area, which dmabuf_genpool_chunk_owner extends.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/net/netmem.h | 21 ++++++++++++++++++++-
 net/core/devmem.c    | 25 +++++++++++++------------
 net/core/devmem.h    | 25 +++++++++----------------
 3 files changed, 42 insertions(+), 29 deletions(-)

diff --git a/include/net/netmem.h b/include/net/netmem.h
index 8a6e20be4b9d..3795ded30d2c 100644
--- a/include/net/netmem.h
+++ b/include/net/netmem.h
@@ -24,11 +24,20 @@ struct net_iov {
 	unsigned long __unused_padding;
 	unsigned long pp_magic;
 	struct page_pool *pp;
-	struct dmabuf_genpool_chunk_owner *owner;
+	struct net_iov_area *owner;
 	unsigned long dma_addr;
 	atomic_long_t pp_ref_count;
 };
 
+struct net_iov_area {
+	/* Array of net_iovs for this area. */
+	struct net_iov *niovs;
+	size_t num_niovs;
+
+	/* Offset into the dma-buf where this chunk starts.  */
+	unsigned long base_virtual;
+};
+
 /* These fields in struct page are used by the page_pool and net stack:
  *
  *        struct {
@@ -54,6 +63,16 @@ NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
 NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
 #undef NET_IOV_ASSERT_OFFSET
 
+static inline struct net_iov_area *net_iov_owner(const struct net_iov *niov)
+{
+	return niov->owner;
+}
+
+static inline unsigned int net_iov_idx(const struct net_iov *niov)
+{
+	return niov - net_iov_owner(niov)->niovs;
+}
+
 /* netmem */
 
 /**
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 858982858f81..5c10cf0e2a18 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -32,14 +32,15 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
 {
 	struct dmabuf_genpool_chunk_owner *owner = chunk->owner;
 
-	kvfree(owner->niovs);
+	kvfree(owner->area.niovs);
 	kfree(owner);
 }
 
 static dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov)
 {
-	struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
+	struct dmabuf_genpool_chunk_owner *owner;
 
+	owner = net_devmem_iov_to_chunk_owner(niov);
 	return owner->base_dma_addr +
 	       ((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT);
 }
@@ -82,7 +83,7 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
 
 	offset = dma_addr - owner->base_dma_addr;
 	index = offset / PAGE_SIZE;
-	niov = &owner->niovs[index];
+	niov = &owner->area.niovs[index];
 
 	niov->pp_magic = 0;
 	niov->pp = NULL;
@@ -250,9 +251,9 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 			goto err_free_chunks;
 		}
 
-		owner->base_virtual = virtual;
+		owner->area.base_virtual = virtual;
 		owner->base_dma_addr = dma_addr;
-		owner->num_niovs = len / PAGE_SIZE;
+		owner->area.num_niovs = len / PAGE_SIZE;
 		owner->binding = binding;
 
 		err = gen_pool_add_owner(binding->chunk_pool, dma_addr,
@@ -264,17 +265,17 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
 			goto err_free_chunks;
 		}
 
-		owner->niovs = kvmalloc_array(owner->num_niovs,
-					      sizeof(*owner->niovs),
-					      GFP_KERNEL);
-		if (!owner->niovs) {
+		owner->area.niovs = kvmalloc_array(owner->area.num_niovs,
+						   sizeof(*owner->area.niovs),
+						   GFP_KERNEL);
+		if (!owner->area.niovs) {
 			err = -ENOMEM;
 			goto err_free_chunks;
 		}
 
-		for (i = 0; i < owner->num_niovs; i++) {
-			niov = &owner->niovs[i];
-			niov->owner = owner;
+		for (i = 0; i < owner->area.num_niovs; i++) {
+			niov = &owner->area.niovs[i];
+			niov->owner = &owner->area;
 			page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
 						      net_devmem_get_dma_addr(niov));
 		}
diff --git a/net/core/devmem.h b/net/core/devmem.h
index 99782ddeca40..a2b9913e9a17 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -10,6 +10,8 @@
 #ifndef _NET_DEVMEM_H
 #define _NET_DEVMEM_H
 
+#include <net/netmem.h>
+
 struct netlink_ext_ack;
 
 struct net_devmem_dmabuf_binding {
@@ -51,17 +53,11 @@ struct net_devmem_dmabuf_binding {
  * allocations from this chunk.
  */
 struct dmabuf_genpool_chunk_owner {
-	/* Offset into the dma-buf where this chunk starts.  */
-	unsigned long base_virtual;
+	struct net_iov_area area;
+	struct net_devmem_dmabuf_binding *binding;
 
 	/* dma_addr of the start of the chunk.  */
 	dma_addr_t base_dma_addr;
-
-	/* Array of net_iovs for this chunk. */
-	struct net_iov *niovs;
-	size_t num_niovs;
-
-	struct net_devmem_dmabuf_binding *binding;
 };
 
 void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding);
@@ -75,20 +71,17 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 void dev_dmabuf_uninstall(struct net_device *dev);
 
 static inline struct dmabuf_genpool_chunk_owner *
-net_iov_owner(const struct net_iov *niov)
+net_devmem_iov_to_chunk_owner(const struct net_iov *niov)
 {
-	return niov->owner;
-}
+	struct net_iov_area *owner = net_iov_owner(niov);
 
-static inline unsigned int net_iov_idx(const struct net_iov *niov)
-{
-	return niov - net_iov_owner(niov)->niovs;
+	return container_of(owner, struct dmabuf_genpool_chunk_owner, area);
 }
 
 static inline struct net_devmem_dmabuf_binding *
 net_devmem_iov_binding(const struct net_iov *niov)
 {
-	return net_iov_owner(niov)->binding;
+	return net_devmem_iov_to_chunk_owner(niov)->binding;
 }
 
 static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
@@ -98,7 +91,7 @@ static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
 
 static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov)
 {
-	struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov);
+	struct net_iov_area *owner = net_iov_owner(niov);
 
 	return owner->base_virtual +
 	       ((unsigned long)net_iov_idx(niov) << PAGE_SHIFT);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 03/15] net: page_pool: create hooks for custom page providers
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
  2024-10-29 23:05 ` [PATCH v7 01/15] net: prefix devmem specific helpers David Wei
  2024-10-29 23:05 ` [PATCH v7 02/15] net: generalise net_iov chunk owners David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-11-05 16:28   ` Alexander Lobakin
  2024-10-29 23:05 ` [PATCH v7 04/15] net: prepare for non devmem TCP memory providers David Wei
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: Jakub Kicinski <[email protected]>

The page providers which try to reuse the same pages will
need to hold onto the ref, even if page gets released from
the pool - as in releasing the page from the pp just transfers
the "ownership" reference from pp to the provider, and provider
will wait for other references to be gone before feeding this
page back into the pool.

Signed-off-by: Jakub Kicinski <[email protected]>
[Pavel] Rebased, renamed callback, +converted devmem
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/net/page_pool/types.h |  9 +++++++++
 net/core/devmem.c             | 14 +++++++++++++-
 net/core/page_pool.c          | 17 +++++++++--------
 3 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index c022c410abe3..8a35fe474adb 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -152,8 +152,16 @@ struct page_pool_stats {
  */
 #define PAGE_POOL_FRAG_GROUP_ALIGN	(4 * sizeof(long))
 
+struct memory_provider_ops {
+	netmem_ref (*alloc_netmems)(struct page_pool *pool, gfp_t gfp);
+	bool (*release_netmem)(struct page_pool *pool, netmem_ref netmem);
+	int (*init)(struct page_pool *pool);
+	void (*destroy)(struct page_pool *pool);
+};
+
 struct pp_memory_provider_params {
 	void *mp_priv;
+	const struct memory_provider_ops *mp_ops;
 };
 
 struct page_pool {
@@ -215,6 +223,7 @@ struct page_pool {
 	struct ptr_ring ring;
 
 	void *mp_priv;
+	const struct memory_provider_ops *mp_ops;
 
 #ifdef CONFIG_PAGE_POOL_STATS
 	/* recycle stats are per-cpu to avoid locking */
diff --git a/net/core/devmem.c b/net/core/devmem.c
index 5c10cf0e2a18..01738029e35c 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -26,6 +26,8 @@
 /* Protected by rtnl_lock() */
 static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
 
+static const struct memory_provider_ops dmabuf_devmem_ops;
+
 static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
 					       struct gen_pool_chunk *chunk,
 					       void *not_used)
@@ -117,6 +119,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
 		WARN_ON(rxq->mp_params.mp_priv != binding);
 
 		rxq->mp_params.mp_priv = NULL;
+		rxq->mp_params.mp_ops = NULL;
 
 		rxq_idx = get_netdev_rx_queue_index(rxq);
 
@@ -142,7 +145,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 	}
 
 	rxq = __netif_get_rx_queue(dev, rxq_idx);
-	if (rxq->mp_params.mp_priv) {
+	if (rxq->mp_params.mp_ops) {
 		NL_SET_ERR_MSG(extack, "designated queue already memory provider bound");
 		return -EEXIST;
 	}
@@ -160,6 +163,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 		return err;
 
 	rxq->mp_params.mp_priv = binding;
+	rxq->mp_params.mp_ops = &dmabuf_devmem_ops;
 
 	err = netdev_rx_queue_restart(dev, rxq_idx);
 	if (err)
@@ -169,6 +173,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 
 err_xa_erase:
 	rxq->mp_params.mp_priv = NULL;
+	rxq->mp_params.mp_ops = NULL;
 	xa_erase(&binding->bound_rxqs, xa_idx);
 
 	return err;
@@ -388,3 +393,10 @@ bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem)
 	/* We don't want the page pool put_page()ing our net_iovs. */
 	return false;
 }
+
+static const struct memory_provider_ops dmabuf_devmem_ops = {
+	.init			= mp_dmabuf_devmem_init,
+	.destroy		= mp_dmabuf_devmem_destroy,
+	.alloc_netmems		= mp_dmabuf_devmem_alloc_netmems,
+	.release_netmem		= mp_dmabuf_devmem_release_page,
+};
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a813d30d2135..c21c5b9edc68 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -284,10 +284,11 @@ static int page_pool_init(struct page_pool *pool,
 		rxq = __netif_get_rx_queue(pool->slow.netdev,
 					   pool->slow.queue_idx);
 		pool->mp_priv = rxq->mp_params.mp_priv;
+		pool->mp_ops = rxq->mp_params.mp_ops;
 	}
 
-	if (pool->mp_priv) {
-		err = mp_dmabuf_devmem_init(pool);
+	if (pool->mp_ops) {
+		err = pool->mp_ops->init(pool);
 		if (err) {
 			pr_warn("%s() mem-provider init failed %d\n", __func__,
 				err);
@@ -584,8 +585,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp)
 		return netmem;
 
 	/* Slow-path: cache empty, do real allocation */
-	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv)
-		netmem = mp_dmabuf_devmem_alloc_netmems(pool, gfp);
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		netmem = pool->mp_ops->alloc_netmems(pool, gfp);
 	else
 		netmem = __page_pool_alloc_pages_slow(pool, gfp);
 	return netmem;
@@ -676,8 +677,8 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	bool put;
 
 	put = true;
-	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv)
-		put = mp_dmabuf_devmem_release_page(pool, netmem);
+	if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops)
+		put = pool->mp_ops->release_netmem(pool, netmem);
 	else
 		__page_pool_release_page_dma(pool, netmem);
 
@@ -1010,8 +1011,8 @@ static void __page_pool_destroy(struct page_pool *pool)
 	page_pool_unlist(pool);
 	page_pool_uninit(pool);
 
-	if (pool->mp_priv) {
-		mp_dmabuf_devmem_destroy(pool);
+	if (pool->mp_ops) {
+		pool->mp_ops->destroy(pool);
 		static_branch_dec(&page_pool_mem_providers);
 	}
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 04/15] net: prepare for non devmem TCP memory providers
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
                   ` (2 preceding siblings ...)
  2024-10-29 23:05 ` [PATCH v7 03/15] net: page_pool: create hooks for custom page providers David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-11-01 17:09   ` Mina Almasry
  2024-10-29 23:05 ` [PATCH v7 05/15] net: page_pool: add ->scrub mem provider callback David Wei
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: Pavel Begunkov <[email protected]>

There is a good bunch of places in generic paths assuming that the only
page pool memory provider is devmem TCP. As we want to reuse the net_iov
and provider infrastructure, we need to patch it up and explicitly check
the provider type when we branch into devmem TCP code.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 net/core/devmem.c         | 10 ++++++++--
 net/core/devmem.h         |  8 ++++++++
 net/core/page_pool_user.c | 15 +++++++++------
 net/ipv4/tcp.c            |  6 ++++++
 4 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/net/core/devmem.c b/net/core/devmem.c
index 01738029e35c..78983a98e5dc 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -28,6 +28,12 @@ static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
 
 static const struct memory_provider_ops dmabuf_devmem_ops;
 
+bool net_is_devmem_page_pool_ops(const struct memory_provider_ops *ops)
+{
+	return ops == &dmabuf_devmem_ops;
+}
+EXPORT_SYMBOL_GPL(net_is_devmem_page_pool_ops);
+
 static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
 					       struct gen_pool_chunk *chunk,
 					       void *not_used)
@@ -316,10 +322,10 @@ void dev_dmabuf_uninstall(struct net_device *dev)
 	unsigned int i;
 
 	for (i = 0; i < dev->real_num_rx_queues; i++) {
-		binding = dev->_rx[i].mp_params.mp_priv;
-		if (!binding)
+		if (dev->_rx[i].mp_params.mp_ops != &dmabuf_devmem_ops)
 			continue;
 
+		binding = dev->_rx[i].mp_params.mp_priv;
 		xa_for_each(&binding->bound_rxqs, xa_idx, rxq)
 			if (rxq == &dev->_rx[i]) {
 				xa_erase(&binding->bound_rxqs, xa_idx);
diff --git a/net/core/devmem.h b/net/core/devmem.h
index a2b9913e9a17..a3fdd66bb05b 100644
--- a/net/core/devmem.h
+++ b/net/core/devmem.h
@@ -116,6 +116,8 @@ struct net_iov *
 net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding);
 void net_devmem_free_dmabuf(struct net_iov *ppiov);
 
+bool net_is_devmem_page_pool_ops(const struct memory_provider_ops *ops);
+
 #else
 struct net_devmem_dmabuf_binding;
 
@@ -168,6 +170,12 @@ static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
 {
 	return 0;
 }
+
+static inline bool
+net_is_devmem_page_pool_ops(const struct memory_provider_ops *ops)
+{
+	return false;
+}
 #endif
 
 #endif /* _NET_DEVMEM_H */
diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
index 48335766c1bf..604862a73535 100644
--- a/net/core/page_pool_user.c
+++ b/net/core/page_pool_user.c
@@ -214,7 +214,7 @@ static int
 page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
 		  const struct genl_info *info)
 {
-	struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
+	struct net_devmem_dmabuf_binding *binding;
 	size_t inflight, refsz;
 	void *hdr;
 
@@ -244,8 +244,11 @@ page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
 			 pool->user.detach_time))
 		goto err_cancel;
 
-	if (binding && nla_put_u32(rsp, NETDEV_A_PAGE_POOL_DMABUF, binding->id))
-		goto err_cancel;
+	if (net_is_devmem_page_pool_ops(pool->mp_ops)) {
+		binding = pool->mp_priv;
+		if (nla_put_u32(rsp, NETDEV_A_PAGE_POOL_DMABUF, binding->id))
+			goto err_cancel;
+	}
 
 	genlmsg_end(rsp, hdr);
 
@@ -353,16 +356,16 @@ void page_pool_unlist(struct page_pool *pool)
 int page_pool_check_memory_provider(struct net_device *dev,
 				    struct netdev_rx_queue *rxq)
 {
-	struct net_devmem_dmabuf_binding *binding = rxq->mp_params.mp_priv;
+	void *mp_priv = rxq->mp_params.mp_priv;
 	struct page_pool *pool;
 	struct hlist_node *n;
 
-	if (!binding)
+	if (!mp_priv)
 		return 0;
 
 	mutex_lock(&page_pools_lock);
 	hlist_for_each_entry_safe(pool, n, &dev->page_pools, user.list) {
-		if (pool->mp_priv != binding)
+		if (pool->mp_priv != mp_priv)
 			continue;
 
 		if (pool->slow.queue_idx == get_netdev_rx_queue_index(rxq)) {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e928efc22f80..31e01da61c12 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -277,6 +277,7 @@
 #include <net/ip.h>
 #include <net/sock.h>
 #include <net/rstreason.h>
+#include <net/page_pool/types.h>
 
 #include <linux/uaccess.h>
 #include <asm/ioctls.h>
@@ -2476,6 +2477,11 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
 			}
 
 			niov = skb_frag_net_iov(frag);
+			if (net_is_devmem_page_pool_ops(niov->pp->mp_ops)) {
+				err = -ENODEV;
+				goto out;
+			}
+
 			end = start + skb_frag_size(frag);
 			copy = end - offset;
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 05/15] net: page_pool: add ->scrub mem provider callback
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
                   ` (3 preceding siblings ...)
  2024-10-29 23:05 ` [PATCH v7 04/15] net: prepare for non devmem TCP memory providers David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-10-29 23:05 ` [PATCH v7 06/15] net: page pool: add helper creating area from pages David Wei
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: Pavel Begunkov <[email protected]>

Some page pool memory providers like io_uring need to catch the point
when the page pool is asked to be destroyed. ->destroy is not enough
because it relies on the page pool to wait for its buffers first, but
for that to happen a provider might need to react, e.g. to collect all
buffers that are currently given to the user space.

Add a new provider's scrub callback serving the purpose and called off
the pp's generic (cold) scrubbing path, i.e. page_pool_scrub().

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/net/page_pool/types.h | 1 +
 net/core/page_pool.c          | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 8a35fe474adb..fd0376ad0d26 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -157,6 +157,7 @@ struct memory_provider_ops {
 	bool (*release_netmem)(struct page_pool *pool, netmem_ref netmem);
 	int (*init)(struct page_pool *pool);
 	void (*destroy)(struct page_pool *pool);
+	void (*scrub)(struct page_pool *pool);
 };
 
 struct pp_memory_provider_params {
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index c21c5b9edc68..9a675e16e6a4 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1038,6 +1038,9 @@ static void page_pool_empty_alloc_cache_once(struct page_pool *pool)
 
 static void page_pool_scrub(struct page_pool *pool)
 {
+	if (pool->mp_ops && pool->mp_ops->scrub)
+		pool->mp_ops->scrub(pool);
+
 	page_pool_empty_alloc_cache_once(pool);
 	pool->destroy_cnt++;
 
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 06/15] net: page pool: add helper creating area from pages
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
                   ` (4 preceding siblings ...)
  2024-10-29 23:05 ` [PATCH v7 05/15] net: page_pool: add ->scrub mem provider callback David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-11-01 17:33   ` Mina Almasry
  2024-10-29 23:05 ` [PATCH v7 07/15] net: page_pool: introduce page_pool_mp_return_in_cache David Wei
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: Pavel Begunkov <[email protected]>

Add a helper that takes an array of pages and initialises passed in
memory provider's area with them, where each net_iov takes one page.
It's also responsible for setting up dma mappings.

We keep it in page_pool.c not to leak netmem details to outside
providers like io_uring, which don't have access to netmem_priv.h
and other private helpers.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/net/page_pool/memory_provider.h | 10 ++++
 net/core/page_pool.c                    | 63 ++++++++++++++++++++++++-
 2 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100644 include/net/page_pool/memory_provider.h

diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
new file mode 100644
index 000000000000..83d7eec0058d
--- /dev/null
+++ b/include/net/page_pool/memory_provider.h
@@ -0,0 +1,10 @@
+#ifndef _NET_PAGE_POOL_MEMORY_PROVIDER_H
+#define _NET_PAGE_POOL_MEMORY_PROVIDER_H
+
+int page_pool_mp_init_paged_area(struct page_pool *pool,
+				struct net_iov_area *area,
+				struct page **pages);
+void page_pool_mp_release_area(struct page_pool *pool,
+				struct net_iov_area *area);
+
+#endif
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 9a675e16e6a4..8bd4a3c80726 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -13,6 +13,7 @@
 
 #include <net/netdev_rx_queue.h>
 #include <net/page_pool/helpers.h>
+#include <net/page_pool/memory_provider.h>
 #include <net/xdp.h>
 
 #include <linux/dma-direction.h>
@@ -459,7 +460,8 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
 		__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
 }
 
-static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
+static bool page_pool_dma_map_page(struct page_pool *pool, netmem_ref netmem,
+				   struct page *page)
 {
 	dma_addr_t dma;
 
@@ -468,7 +470,7 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
 	 * into page private data (i.e 32bit cpu with 64bit DMA caps)
 	 * This mapping is kept for lifetime of page, until leaving pool.
 	 */
-	dma = dma_map_page_attrs(pool->p.dev, netmem_to_page(netmem), 0,
+	dma = dma_map_page_attrs(pool->p.dev, page, 0,
 				 (PAGE_SIZE << pool->p.order), pool->p.dma_dir,
 				 DMA_ATTR_SKIP_CPU_SYNC |
 					 DMA_ATTR_WEAK_ORDERING);
@@ -490,6 +492,11 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
 	return false;
 }
 
+static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
+{
+	return page_pool_dma_map_page(pool, netmem, netmem_to_page(netmem));
+}
+
 static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 						 gfp_t gfp)
 {
@@ -1154,3 +1161,55 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
 	}
 }
 EXPORT_SYMBOL(page_pool_update_nid);
+
+static void page_pool_release_page_dma(struct page_pool *pool,
+				       netmem_ref netmem)
+{
+	__page_pool_release_page_dma(pool, netmem);
+}
+
+int page_pool_mp_init_paged_area(struct page_pool *pool,
+				 struct net_iov_area *area,
+				 struct page **pages)
+{
+	struct net_iov *niov;
+	netmem_ref netmem;
+	int i, ret = 0;
+
+	if (!pool->dma_map)
+		return -EOPNOTSUPP;
+
+	for (i = 0; i < area->num_niovs; i++) {
+		niov = &area->niovs[i];
+		netmem = net_iov_to_netmem(niov);
+
+		page_pool_set_pp_info(pool, netmem);
+		if (!page_pool_dma_map_page(pool, netmem, pages[i])) {
+			ret = -EINVAL;
+			goto err_unmap_dma;
+		}
+	}
+	return 0;
+
+err_unmap_dma:
+	while (i--) {
+		netmem = net_iov_to_netmem(&area->niovs[i]);
+		page_pool_release_page_dma(pool, netmem);
+	}
+	return ret;
+}
+
+void page_pool_mp_release_area(struct page_pool *pool,
+			       struct net_iov_area *area)
+{
+	int i;
+
+	if (!pool->dma_map)
+		return;
+
+	for (i = 0; i < area->num_niovs; i++) {
+		struct net_iov *niov = &area->niovs[i];
+
+		page_pool_release_page_dma(pool, net_iov_to_netmem(niov));
+	}
+}
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 07/15] net: page_pool: introduce page_pool_mp_return_in_cache
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
                   ` (5 preceding siblings ...)
  2024-10-29 23:05 ` [PATCH v7 06/15] net: page pool: add helper creating area from pages David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-11-01 20:05   ` Mina Almasry
  2024-10-29 23:05 ` [PATCH v7 08/15] net: add helper executing custom callback from napi David Wei
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: Pavel Begunkov <[email protected]>

Add a helper that allows a page pool memory provider to efficiently
return a netmem off the allocation callback.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/net/page_pool/memory_provider.h |  4 ++++
 net/core/page_pool.c                    | 19 +++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
index 83d7eec0058d..352b3a35d31c 100644
--- a/include/net/page_pool/memory_provider.h
+++ b/include/net/page_pool/memory_provider.h
@@ -1,3 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
 #ifndef _NET_PAGE_POOL_MEMORY_PROVIDER_H
 #define _NET_PAGE_POOL_MEMORY_PROVIDER_H
 
@@ -7,4 +9,6 @@ int page_pool_mp_init_paged_area(struct page_pool *pool,
 void page_pool_mp_release_area(struct page_pool *pool,
 				struct net_iov_area *area);
 
+void page_pool_mp_return_in_cache(struct page_pool *pool, netmem_ref netmem);
+
 #endif
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 8bd4a3c80726..9078107c906d 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -1213,3 +1213,22 @@ void page_pool_mp_release_area(struct page_pool *pool,
 		page_pool_release_page_dma(pool, net_iov_to_netmem(niov));
 	}
 }
+
+/*
+ * page_pool_mp_return_in_cache() - return a netmem to the allocation cache.
+ * @pool:	pool from which pages were allocated
+ * @netmem:	netmem to return
+ *
+ * Return already allocated and accounted netmem to the page pool's allocation
+ * cache. The function doesn't provide synchronisation and must only be called
+ * from the napi context.
+ */
+void page_pool_mp_return_in_cache(struct page_pool *pool, netmem_ref netmem)
+{
+	if (WARN_ON_ONCE(pool->alloc.count >= PP_ALLOC_CACHE_REFILL))
+		return;
+
+	page_pool_dma_sync_for_device(pool, netmem, -1);
+	page_pool_fragment_netmem(netmem, 1);
+	pool->alloc.cache[pool->alloc.count++] = netmem;
+}
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 08/15] net: add helper executing custom callback from napi
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
                   ` (6 preceding siblings ...)
  2024-10-29 23:05 ` [PATCH v7 07/15] net: page_pool: introduce page_pool_mp_return_in_cache David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-10-29 23:05 ` [PATCH v7 09/15] io_uring/zcrx: add interface queue and refill queue David Wei
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: Pavel Begunkov <[email protected]>

It's useful to have napi private bits and pieces like page pool's fast
allocating cache, so that the hot allocation path doesn't have to do any
additional synchronisation. In case of io_uring memory provider
introduced in following patches, we keep the consumer end of the
io_uring's refill queue private to napi as it's a hot path.

However, from time to time we need to synchronise with the napi, for
example to add more user memory or allocate fallback buffers. Add a
helper function napi_execute that allows to run a custom callback from
under napi context so that it can access and modify napi protected
parts of io_uring. It works similar to busy polling and stops napi from
running in the meantime, so it's supposed to be a slow control path.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/net/busy_poll.h |  6 +++
 net/core/dev.c          | 81 ++++++++++++++++++++++++++++++++---------
 2 files changed, 70 insertions(+), 17 deletions(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index f03040baaefd..3fd9e65731e9 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -47,6 +47,7 @@ bool sk_busy_loop_end(void *p, unsigned long start_time);
 void napi_busy_loop(unsigned int napi_id,
 		    bool (*loop_end)(void *, unsigned long),
 		    void *loop_end_arg, bool prefer_busy_poll, u16 budget);
+void napi_execute(unsigned napi_id, void (*cb)(void *), void *cb_arg);
 
 void napi_busy_loop_rcu(unsigned int napi_id,
 			bool (*loop_end)(void *, unsigned long),
@@ -63,6 +64,11 @@ static inline bool sk_can_busy_loop(struct sock *sk)
 	return false;
 }
 
+static inline void napi_execute(unsigned napi_id,
+				void (*cb)(void *), void *cb_arg)
+{
+}
+
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
 static inline unsigned long busy_loop_current_time(void)
diff --git a/net/core/dev.c b/net/core/dev.c
index c682173a7642..5ea455e64363 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6347,6 +6347,30 @@ enum {
 	NAPI_F_END_ON_RESCHED	= 2,
 };
 
+static inline bool napi_state_start_busy_polling(struct napi_struct *napi,
+						 unsigned flags)
+{
+	unsigned long val = READ_ONCE(napi->state);
+
+	/* If multiple threads are competing for this napi,
+	 * we avoid dirtying napi->state as much as we can.
+	 */
+	if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
+		   NAPIF_STATE_IN_BUSY_POLL))
+		goto fail;
+
+	if (cmpxchg(&napi->state, val,
+		    val | NAPIF_STATE_IN_BUSY_POLL |
+			  NAPIF_STATE_SCHED) != val)
+		goto fail;
+
+	return true;
+fail:
+	if (flags & NAPI_F_PREFER_BUSY_POLL)
+		set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
+	return false;
+}
+
 static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
 			   unsigned flags, u16 budget)
 {
@@ -6422,24 +6446,8 @@ static void __napi_busy_loop(unsigned int napi_id,
 		local_bh_disable();
 		bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
 		if (!napi_poll) {
-			unsigned long val = READ_ONCE(napi->state);
-
-			/* If multiple threads are competing for this napi,
-			 * we avoid dirtying napi->state as much as we can.
-			 */
-			if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
-				   NAPIF_STATE_IN_BUSY_POLL)) {
-				if (flags & NAPI_F_PREFER_BUSY_POLL)
-					set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
+			if (!napi_state_start_busy_polling(napi, flags))
 				goto count;
-			}
-			if (cmpxchg(&napi->state, val,
-				    val | NAPIF_STATE_IN_BUSY_POLL |
-					  NAPIF_STATE_SCHED) != val) {
-				if (flags & NAPI_F_PREFER_BUSY_POLL)
-					set_bit(NAPI_STATE_PREFER_BUSY_POLL, &napi->state);
-				goto count;
-			}
 			have_poll_lock = netpoll_poll_lock(napi);
 			napi_poll = napi->poll;
 		}
@@ -6503,6 +6511,45 @@ void napi_busy_loop(unsigned int napi_id,
 }
 EXPORT_SYMBOL(napi_busy_loop);
 
+void napi_execute(unsigned napi_id,
+		  void (*cb)(void *), void *cb_arg)
+{
+	unsigned flags = NAPI_F_PREFER_BUSY_POLL;
+	void *have_poll_lock = NULL;
+	struct napi_struct *napi;
+
+	rcu_read_lock();
+	napi = napi_by_id(napi_id);
+	if (!napi) {
+		rcu_read_unlock();
+		return;
+	}
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
+	for (;;) {
+		local_bh_disable();
+
+		if (napi_state_start_busy_polling(napi, flags)) {
+			have_poll_lock = netpoll_poll_lock(napi);
+			cb(cb_arg);
+			local_bh_enable();
+			busy_poll_stop(napi, have_poll_lock, flags, 1);
+			break;
+		}
+
+		local_bh_enable();
+		if (unlikely(need_resched()))
+			break;
+		cpu_relax();
+	}
+
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
+	rcu_read_unlock();
+}
+
 #endif /* CONFIG_NET_RX_BUSY_POLL */
 
 static void __napi_hash_add_with_id(struct napi_struct *napi,
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 09/15] io_uring/zcrx: add interface queue and refill queue
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
                   ` (7 preceding siblings ...)
  2024-10-29 23:05 ` [PATCH v7 08/15] net: add helper executing custom callback from napi David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-10-29 23:05 ` [PATCH v7 10/15] io_uring/zcrx: add io_zcrx_area David Wei
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: David Wei <[email protected]>

Add a new object called an interface queue (ifq) that represents a net
rx queue that has been configured for zero copy. Each ifq is registered
using a new registration opcode IORING_REGISTER_ZCRX_IFQ.

The refill queue is allocated by the kernel and mapped by userspace
using a new offset IORING_OFF_RQ_RING, in a similar fashion to the main
SQ/CQ. It is used by userspace to return buffers that it is done with,
which will then be re-used by the netdev again.

The main CQ ring is used to notify userspace of received data by using
the upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each
entry contains the offset + len to the data.

For now, each io_uring instance only has a single ifq.

Signed-off-by: David Wei <[email protected]>
---
 Kconfig                        |   2 +
 include/linux/io_uring_types.h |   3 +
 include/uapi/linux/io_uring.h  |  43 ++++++++++
 io_uring/KConfig               |  10 +++
 io_uring/Makefile              |   1 +
 io_uring/io_uring.c            |   7 ++
 io_uring/memmap.c              |   8 ++
 io_uring/register.c            |   7 ++
 io_uring/zcrx.c                | 143 +++++++++++++++++++++++++++++++++
 io_uring/zcrx.h                |  39 +++++++++
 10 files changed, 263 insertions(+)
 create mode 100644 io_uring/KConfig
 create mode 100644 io_uring/zcrx.c
 create mode 100644 io_uring/zcrx.h

diff --git a/Kconfig b/Kconfig
index 745bc773f567..529ea7694ba9 100644
--- a/Kconfig
+++ b/Kconfig
@@ -30,3 +30,5 @@ source "lib/Kconfig"
 source "lib/Kconfig.debug"
 
 source "Documentation/Kconfig"
+
+source "io_uring/KConfig"
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 6d3ee71bd832..f0f6d3d186b1 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -39,6 +39,8 @@ enum io_uring_cmd_flags {
 	IO_URING_F_COMPAT		= (1 << 12),
 };
 
+struct io_zcrx_ifq;
+
 struct io_wq_work_node {
 	struct io_wq_work_node *next;
 };
@@ -373,6 +375,7 @@ struct io_ring_ctx {
 	struct io_alloc_cache		rsrc_node_cache;
 	struct wait_queue_head		rsrc_quiesce_wq;
 	unsigned			rsrc_quiesce;
+	struct io_zcrx_ifq		*ifq;
 
 	u32			pers_next;
 	struct xarray		personalities;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 86cb385fe0b5..d398e19f8eea 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -467,6 +467,8 @@ struct io_uring_cqe {
 #define IORING_OFF_PBUF_RING		0x80000000ULL
 #define IORING_OFF_PBUF_SHIFT		16
 #define IORING_OFF_MMAP_MASK		0xf8000000ULL
+#define IORING_OFF_RQ_RING		0x20000000ULL
+#define IORING_OFF_RQ_SHIFT		16
 
 /*
  * Filled with the offset for mmap(2)
@@ -615,6 +617,9 @@ enum io_uring_register_op {
 	/* send MSG_RING without having a ring */
 	IORING_REGISTER_SEND_MSG_RING		= 31,
 
+	/* register a netdev hw rx queue for zerocopy */
+	IORING_REGISTER_ZCRX_IFQ		= 32,
+
 	/* this goes last */
 	IORING_REGISTER_LAST,
 
@@ -845,6 +850,44 @@ enum io_uring_socket_op {
 	SOCKET_URING_OP_SETSOCKOPT,
 };
 
+/* Zero copy receive refill queue entry */
+struct io_uring_zcrx_rqe {
+	__u64	off;
+	__u32	len;
+	__u32	__pad;
+};
+
+struct io_uring_zcrx_cqe {
+	__u64	off;
+	__u64	__pad;
+};
+
+/* The bit from which area id is encoded into offsets */
+#define IORING_ZCRX_AREA_SHIFT	48
+#define IORING_ZCRX_AREA_MASK	(~(((__u64)1 << IORING_ZCRX_AREA_SHIFT) - 1))
+
+struct io_uring_zcrx_offsets {
+	__u32	head;
+	__u32	tail;
+	__u32	rqes;
+	__u32	mmap_sz;
+	__u64	__resv[2];
+};
+
+/*
+ * Argument for IORING_REGISTER_ZCRX_IFQ
+ */
+struct io_uring_zcrx_ifq_reg {
+	__u32	if_idx;
+	__u32	if_rxq;
+	__u32	rq_entries;
+	__u32	flags;
+
+	__u64	area_ptr; /* pointer to struct io_uring_zcrx_area_reg */
+	struct io_uring_zcrx_offsets offsets;
+	__u64	__resv[3];
+};
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/io_uring/KConfig b/io_uring/KConfig
new file mode 100644
index 000000000000..9e2a4beba1ef
--- /dev/null
+++ b/io_uring/KConfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# io_uring configuration
+#
+
+config IO_URING_ZCRX
+	def_bool y
+	depends on PAGE_POOL
+	depends on INET
+	depends on NET_RX_BUSY_POLL
diff --git a/io_uring/Makefile b/io_uring/Makefile
index 53167bef37d7..a95b0b8229c9 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o opdef.o kbuf.o rsrc.o notif.o \
 					epoll.o statx.o timeout.o fdinfo.o \
 					cancel.o waitid.o register.o \
 					truncate.o memmap.o
+obj-$(CONFIG_IO_URING_ZCRX)	+= zcrx.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
 obj-$(CONFIG_FUTEX)		+= futex.o
 obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 58b401900b41..49d73c793f06 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -98,6 +98,7 @@
 #include "uring_cmd.h"
 #include "msg_ring.h"
 #include "memmap.h"
+#include "zcrx.h"
 
 #include "timeout.h"
 #include "poll.h"
@@ -2729,6 +2730,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 		return;
 
 	mutex_lock(&ctx->uring_lock);
+	io_unregister_zcrx_ifqs(ctx);
 	if (ctx->buf_data)
 		__io_sqe_buffers_unregister(ctx);
 	if (ctx->file_data)
@@ -2903,6 +2905,11 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 			io_cqring_overflow_kill(ctx);
 			mutex_unlock(&ctx->uring_lock);
 		}
+		if (ctx->ifq) {
+			mutex_lock(&ctx->uring_lock);
+			io_shutdown_zcrx_ifqs(ctx);
+			mutex_unlock(&ctx->uring_lock);
+		}
 
 		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
 			io_move_task_work_from_local(ctx);
diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index a0f32a255fd1..4c384e8615f6 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -12,6 +12,7 @@
 
 #include "memmap.h"
 #include "kbuf.h"
+#include "zcrx.h"
 
 static void *io_mem_alloc_compound(struct page **pages, int nr_pages,
 				   size_t size, gfp_t gfp)
@@ -223,6 +224,10 @@ static void *io_uring_validate_mmap_request(struct file *file, loff_t pgoff,
 		io_put_bl(ctx, bl);
 		return ptr;
 		}
+	case IORING_OFF_RQ_RING:
+		if (!ctx->ifq)
+			return ERR_PTR(-EINVAL);
+		return ctx->ifq->rq_ring;
 	}
 
 	return ERR_PTR(-EINVAL);
@@ -261,6 +266,9 @@ __cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
 						ctx->n_sqe_pages);
 	case IORING_OFF_PBUF_RING:
 		return io_pbuf_mmap(file, vma);
+	case IORING_OFF_RQ_RING:
+		return io_uring_mmap_pages(ctx, vma, ctx->ifq->rqe_pages,
+						ctx->ifq->n_rqe_pages);
 	}
 
 	return -EINVAL;
diff --git a/io_uring/register.c b/io_uring/register.c
index 52b2f9b74af8..1fac52b14e3d 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -29,6 +29,7 @@
 #include "napi.h"
 #include "eventfd.h"
 #include "msg_ring.h"
+#include "zcrx.h"
 
 #define IORING_MAX_RESTRICTIONS	(IORING_RESTRICTION_LAST + \
 				 IORING_REGISTER_LAST + IORING_OP_LAST)
@@ -549,6 +550,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_register_clone_buffers(ctx, arg);
 		break;
+	case IORING_REGISTER_ZCRX_IFQ:
+		ret = -EINVAL;
+		if (!arg || nr_args != 1)
+			break;
+		ret = io_register_zcrx_ifq(ctx, arg);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
new file mode 100644
index 000000000000..4c53fd4f7bb3
--- /dev/null
+++ b/io_uring/zcrx.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/io_uring.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "io_uring.h"
+#include "kbuf.h"
+#include "memmap.h"
+#include "zcrx.h"
+
+#define IO_RQ_MAX_ENTRIES		32768
+
+static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
+				 struct io_uring_zcrx_ifq_reg *reg)
+{
+	size_t off, size;
+	void *ptr;
+
+	off = sizeof(struct io_uring);
+	size = off + sizeof(struct io_uring_zcrx_rqe) * reg->rq_entries;
+
+	ptr = io_pages_map(&ifq->rqe_pages, &ifq->n_rqe_pages, size);
+	if (IS_ERR(ptr))
+		return PTR_ERR(ptr);
+
+	ifq->rq_ring = (struct io_uring *)ptr;
+	ifq->rqes = (struct io_uring_zcrx_rqe *)(ptr + off);
+	return 0;
+}
+
+static void io_free_rbuf_ring(struct io_zcrx_ifq *ifq)
+{
+	io_pages_unmap(ifq->rq_ring, &ifq->rqe_pages, &ifq->n_rqe_pages, true);
+	ifq->rq_ring = NULL;
+	ifq->rqes = NULL;
+}
+
+static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
+{
+	struct io_zcrx_ifq *ifq;
+
+	ifq = kzalloc(sizeof(*ifq), GFP_KERNEL);
+	if (!ifq)
+		return NULL;
+
+	ifq->if_rxq = -1;
+	ifq->ctx = ctx;
+	return ifq;
+}
+
+static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
+{
+	io_free_rbuf_ring(ifq);
+	kfree(ifq);
+}
+
+int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
+			  struct io_uring_zcrx_ifq_reg __user *arg)
+{
+	struct io_uring_zcrx_ifq_reg reg;
+	struct io_zcrx_ifq *ifq;
+	size_t ring_sz, rqes_sz;
+	int ret;
+
+	/*
+	 * 1. Interface queue allocation.
+	 * 2. It can observe data destined for sockets of other tasks.
+	 */
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	/* mandatory io_uring features for zc rx */
+	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN &&
+	      ctx->flags & IORING_SETUP_CQE32))
+		return -EINVAL;
+	if (ctx->ifq)
+		return -EBUSY;
+	if (copy_from_user(&reg, arg, sizeof(reg)))
+		return -EFAULT;
+	if (reg.__resv[0] || reg.__resv[1] || reg.__resv[2])
+		return -EINVAL;
+	if (reg.if_rxq == -1 || !reg.rq_entries || reg.flags)
+		return -EINVAL;
+	if (reg.rq_entries > IO_RQ_MAX_ENTRIES) {
+		if (!(ctx->flags & IORING_SETUP_CLAMP))
+			return -EINVAL;
+		reg.rq_entries = IO_RQ_MAX_ENTRIES;
+	}
+	reg.rq_entries = roundup_pow_of_two(reg.rq_entries);
+
+	if (!reg.area_ptr)
+		return -EFAULT;
+
+	ifq = io_zcrx_ifq_alloc(ctx);
+	if (!ifq)
+		return -ENOMEM;
+
+	ret = io_allocate_rbuf_ring(ifq, &reg);
+	if (ret)
+		goto err;
+
+	ifq->rq_entries = reg.rq_entries;
+	ifq->if_rxq = reg.if_rxq;
+
+	ring_sz = sizeof(struct io_uring);
+	rqes_sz = sizeof(struct io_uring_zcrx_rqe) * ifq->rq_entries;
+	reg.offsets.mmap_sz = ring_sz + rqes_sz;
+	reg.offsets.rqes = ring_sz;
+	reg.offsets.head = offsetof(struct io_uring, head);
+	reg.offsets.tail = offsetof(struct io_uring, tail);
+
+	if (copy_to_user(arg, &reg, sizeof(reg))) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	ctx->ifq = ifq;
+	return 0;
+err:
+	io_zcrx_ifq_free(ifq);
+	return ret;
+}
+
+void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+	struct io_zcrx_ifq *ifq = ctx->ifq;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	if (!ifq)
+		return;
+
+	ctx->ifq = NULL;
+	io_zcrx_ifq_free(ifq);
+}
+
+void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+	lockdep_assert_held(&ctx->uring_lock);
+}
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
new file mode 100644
index 000000000000..1f76eecac5fd
--- /dev/null
+++ b/io_uring/zcrx.h
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_ZC_RX_H
+#define IOU_ZC_RX_H
+
+#include <linux/io_uring_types.h>
+
+struct io_zcrx_ifq {
+	struct io_ring_ctx		*ctx;
+	struct net_device		*dev;
+	struct io_uring			*rq_ring;
+	struct io_uring_zcrx_rqe 	*rqes;
+	u32				rq_entries;
+
+	unsigned short			n_rqe_pages;
+	struct page			**rqe_pages;
+
+	u32				if_rxq;
+};
+
+#if defined(CONFIG_IO_URING_ZCRX)
+int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
+			 struct io_uring_zcrx_ifq_reg __user *arg);
+void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx);
+void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx);
+#else
+static inline int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
+					struct io_uring_zcrx_ifq_reg __user *arg)
+{
+	return -EOPNOTSUPP;
+}
+static inline void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+static inline void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+#endif
+
+#endif
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 10/15] io_uring/zcrx: add io_zcrx_area
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
                   ` (8 preceding siblings ...)
  2024-10-29 23:05 ` [PATCH v7 09/15] io_uring/zcrx: add interface queue and refill queue David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-10-29 23:05 ` [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: David Wei <[email protected]>

Add io_zcrx_area that represents a region of userspace memory that is
used for zero copy. During ifq registration, userspace passes in the
uaddr and len of userspace memory, which is then pinned by the kernel.
Each net_iov is mapped to one of these pages.

The freelist is a spinlock protected list that keeps track of all the
net_iovs/pages that aren't used.

For now, there is only one area per ifq and area registration happens
implicitly as part of ifq registration. There is no API for
adding/removing areas yet. The struct for area registration is there for
future extensibility once we support multiple areas and TCP devmem.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 include/uapi/linux/io_uring.h |  9 ++++
 io_uring/rsrc.c               |  2 +-
 io_uring/rsrc.h               |  1 +
 io_uring/zcrx.c               | 93 ++++++++++++++++++++++++++++++++++-
 io_uring/zcrx.h               | 16 ++++++
 5 files changed, 118 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index d398e19f8eea..d43183264dcf 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -874,6 +874,15 @@ struct io_uring_zcrx_offsets {
 	__u64	__resv[2];
 };
 
+struct io_uring_zcrx_area_reg {
+	__u64	addr;
+	__u64	len;
+	__u64	rq_area_token;
+	__u32	flags;
+	__u32	__resv1;
+	__u64	__resv2[2];
+};
+
 /*
  * Argument for IORING_REGISTER_ZCRX_IFQ
  */
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index fa5f27496aef..a6d44d81f1a1 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -86,7 +86,7 @@ static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
 	return 0;
 }
 
-static int io_buffer_validate(struct iovec *iov)
+int io_buffer_validate(struct iovec *iov)
 {
 	unsigned long tmp, acct_len = iov->iov_len + (PAGE_SIZE - 1);
 
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index c50d4be4aa6d..9cc9b1b320e3 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -83,6 +83,7 @@ int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg,
 			    unsigned size, unsigned type);
 int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
 			unsigned int size, unsigned int type);
+int io_buffer_validate(struct iovec *iov);
 
 static inline void io_put_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
 {
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 4c53fd4f7bb3..a276572fe953 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -10,6 +10,7 @@
 #include "kbuf.h"
 #include "memmap.h"
 #include "zcrx.h"
+#include "rsrc.h"
 
 #define IO_RQ_MAX_ENTRIES		32768
 
@@ -38,6 +39,83 @@ static void io_free_rbuf_ring(struct io_zcrx_ifq *ifq)
 	ifq->rqes = NULL;
 }
 
+static void io_zcrx_free_area(struct io_zcrx_area *area)
+{
+	if (area->freelist)
+		kvfree(area->freelist);
+	if (area->nia.niovs)
+		kvfree(area->nia.niovs);
+	if (area->pages) {
+		unpin_user_pages(area->pages, area->nia.num_niovs);
+		kvfree(area->pages);
+	}
+	kfree(area);
+}
+
+static int io_zcrx_create_area(struct io_ring_ctx *ctx,
+			       struct io_zcrx_ifq *ifq,
+			       struct io_zcrx_area **res,
+			       struct io_uring_zcrx_area_reg *area_reg)
+{
+	struct io_zcrx_area *area;
+	int i, ret, nr_pages;
+	struct iovec iov;
+
+	if (area_reg->flags || area_reg->rq_area_token)
+		return -EINVAL;
+	if (area_reg->__resv1 || area_reg->__resv2[0] || area_reg->__resv2[1])
+		return -EINVAL;
+	if (area_reg->addr & ~PAGE_MASK || area_reg->len & ~PAGE_MASK)
+		return -EINVAL;
+
+	iov.iov_base = u64_to_user_ptr(area_reg->addr);
+	iov.iov_len = area_reg->len;
+	ret = io_buffer_validate(&iov);
+	if (ret)
+		return ret;
+
+	ret = -ENOMEM;
+	area = kzalloc(sizeof(*area), GFP_KERNEL);
+	if (!area)
+		goto err;
+
+	area->pages = io_pin_pages((unsigned long)area_reg->addr, area_reg->len,
+				   &nr_pages);
+	if (IS_ERR(area->pages)) {
+		ret = PTR_ERR(area->pages);
+		area->pages = NULL;
+		goto err;
+	}
+	area->nia.num_niovs = nr_pages;
+
+	area->nia.niovs = kvmalloc_array(nr_pages, sizeof(area->nia.niovs[0]),
+					 GFP_KERNEL | __GFP_ZERO);
+	if (!area->nia.niovs)
+		goto err;
+
+	area->freelist = kvmalloc_array(nr_pages, sizeof(area->freelist[0]),
+					GFP_KERNEL | __GFP_ZERO);
+	if (!area->freelist)
+		goto err;
+
+	for (i = 0; i < nr_pages; i++) {
+		area->freelist[i] = i;
+	}
+
+	area->free_count = nr_pages;
+	area->ifq = ifq;
+	/* we're only supporting one area per ifq for now */
+	area->area_id = 0;
+	area_reg->rq_area_token = (u64)area->area_id << IORING_ZCRX_AREA_SHIFT;
+	spin_lock_init(&area->freelist_lock);
+	*res = area;
+	return 0;
+err:
+	if (area)
+		io_zcrx_free_area(area);
+	return ret;
+}
+
 static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
 {
 	struct io_zcrx_ifq *ifq;
@@ -53,6 +131,9 @@ static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
 
 static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
 {
+	if (ifq->area)
+		io_zcrx_free_area(ifq->area);
+
 	io_free_rbuf_ring(ifq);
 	kfree(ifq);
 }
@@ -60,6 +141,7 @@ static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
 int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 			  struct io_uring_zcrx_ifq_reg __user *arg)
 {
+	struct io_uring_zcrx_area_reg area;
 	struct io_uring_zcrx_ifq_reg reg;
 	struct io_zcrx_ifq *ifq;
 	size_t ring_sz, rqes_sz;
@@ -91,7 +173,7 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 	}
 	reg.rq_entries = roundup_pow_of_two(reg.rq_entries);
 
-	if (!reg.area_ptr)
+	if (copy_from_user(&area, u64_to_user_ptr(reg.area_ptr), sizeof(area)))
 		return -EFAULT;
 
 	ifq = io_zcrx_ifq_alloc(ctx);
@@ -102,6 +184,10 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 	if (ret)
 		goto err;
 
+	ret = io_zcrx_create_area(ctx, ifq, &ifq->area, &area);
+	if (ret)
+		goto err;
+
 	ifq->rq_entries = reg.rq_entries;
 	ifq->if_rxq = reg.if_rxq;
 
@@ -116,7 +202,10 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 		ret = -EFAULT;
 		goto err;
 	}
-
+	if (copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) {
+		ret = -EFAULT;
+		goto err;
+	}
 	ctx->ifq = ifq;
 	return 0;
 err:
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 1f76eecac5fd..a8db61498c67 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -3,10 +3,26 @@
 #define IOU_ZC_RX_H
 
 #include <linux/io_uring_types.h>
+#include <net/page_pool/types.h>
+
+struct io_zcrx_area {
+	struct net_iov_area	nia;
+	struct io_zcrx_ifq	*ifq;
+
+	u16			area_id;
+	struct page		**pages;
+
+	/* freelist */
+	spinlock_t		freelist_lock ____cacheline_aligned_in_smp;
+	u32			free_count;
+	u32			*freelist;
+};
 
 struct io_zcrx_ifq {
 	struct io_ring_ctx		*ctx;
 	struct net_device		*dev;
+	struct io_zcrx_area		*area;
+
 	struct io_uring			*rq_ring;
 	struct io_uring_zcrx_rqe 	*rqes;
 	u32				rq_entries;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
                   ` (9 preceding siblings ...)
  2024-10-29 23:05 ` [PATCH v7 10/15] io_uring/zcrx: add io_zcrx_area David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-11-01 20:06   ` Mina Almasry
  2024-10-29 23:05 ` [PATCH v7 12/15] io_uring/zcrx: add io_recvzc request David Wei
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: Pavel Begunkov <[email protected]>

Implement a page pool memory provider for io_uring to receieve in a
zero copy fashion. For that, the provider allocates user pages wrapped
around into struct net_iovs, that are stored in a previously registered
struct net_iov_area.

Unlike with traditional receives, for which pages from a page pool can
be deallocated right after the user receives data, e.g. via recv(2),
we extend the lifetime by recycling buffers only after the user space
acknowledges that it's done processing the data via the refill queue.
Before handing buffers to the user, we mark them by bumping the refcount
by a bias value IO_ZC_RX_UREF, which will be checked when the buffer is
returned back. When the corresponding io_uring instance and/or page pool
are destroyed, we'll force back all buffers that are currently in the
user space in ->io_pp_zc_scrub by clearing the bias.

Refcounting and lifetime:

Initially, all buffers are considered unallocated and stored in
->freelist, at which point they are not yet directly exposed to the core
page pool code and not accounted to page pool's pages_state_hold_cnt.
The ->alloc_netmems callback will allocate them by placing into the
page pool's cache, setting the refcount to 1 as usual and adjusting
pages_state_hold_cnt.

Then, either the buffer is dropped and returns back to the page pool
into the ->freelist via io_pp_zc_release_netmem, in which case the page
pool will match hold_cnt for us with ->pages_state_release_cnt. Or more
likely the buffer will go through the network/protocol stacks and end up
in the corresponding socket's receive queue. From there the user can get
it via an new io_uring request implemented in following patches. As
mentioned above, before giving a buffer to the user we bump the refcount
by IO_ZC_RX_UREF.

Once the user is done with the buffer processing, it must return it back
via the refill queue, from where our ->alloc_netmems implementation can
grab it, check references, put IO_ZC_RX_UREF, and recycle the buffer if
there are no more users left. As we place such buffers right back into
the page pools fast cache and they didn't go through the normal pp
release path, they are still considered "allocated" and no pp hold_cnt
is required. For the same reason we dma sync buffers for the device
in io_zc_add_pp_cache().

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 io_uring/zcrx.c | 215 ++++++++++++++++++++++++++++++++++++++++++++++++
 io_uring/zcrx.h |   5 ++
 2 files changed, 220 insertions(+)

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index a276572fe953..aad35676207e 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -2,7 +2,12 @@
 #include <linux/kernel.h>
 #include <linux/errno.h>
 #include <linux/mm.h>
+#include <linux/nospec.h>
+#include <linux/netdevice.h>
 #include <linux/io_uring.h>
+#include <net/page_pool/helpers.h>
+#include <net/page_pool/memory_provider.h>
+#include <trace/events/page_pool.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -14,6 +19,16 @@
 
 #define IO_RQ_MAX_ENTRIES		32768
 
+__maybe_unused
+static const struct memory_provider_ops io_uring_pp_zc_ops;
+
+static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
+{
+	struct net_iov_area *owner = net_iov_owner(niov);
+
+	return container_of(owner, struct io_zcrx_area, nia);
+}
+
 static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
 				 struct io_uring_zcrx_ifq_reg *reg)
 {
@@ -99,6 +114,9 @@ static int io_zcrx_create_area(struct io_ring_ctx *ctx,
 		goto err;
 
 	for (i = 0; i < nr_pages; i++) {
+		struct net_iov *niov = &area->nia.niovs[i];
+
+		niov->owner = &area->nia;
 		area->freelist[i] = i;
 	}
 
@@ -230,3 +248,200 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
 {
 	lockdep_assert_held(&ctx->uring_lock);
 }
+
+static bool io_zcrx_niov_put(struct net_iov *niov, int nr)
+{
+	return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
+}
+
+static bool io_zcrx_put_niov_uref(struct net_iov *niov)
+{
+	if (atomic_long_read(&niov->pp_ref_count) < IO_ZC_RX_UREF)
+		return false;
+
+	return io_zcrx_niov_put(niov, IO_ZC_RX_UREF);
+}
+
+static inline void io_zc_add_pp_cache(struct page_pool *pp,
+				      struct net_iov *niov)
+{
+}
+
+static inline u32 io_zcrx_rqring_entries(struct io_zcrx_ifq *ifq)
+{
+	u32 entries;
+
+	entries = smp_load_acquire(&ifq->rq_ring->tail) - ifq->cached_rq_head;
+	return min(entries, ifq->rq_entries);
+}
+
+static struct io_uring_zcrx_rqe *io_zcrx_get_rqe(struct io_zcrx_ifq *ifq,
+						 unsigned mask)
+{
+	unsigned int idx = ifq->cached_rq_head++ & mask;
+
+	return &ifq->rqes[idx];
+}
+
+static void io_zcrx_ring_refill(struct page_pool *pp,
+				struct io_zcrx_ifq *ifq)
+{
+	unsigned int entries = io_zcrx_rqring_entries(ifq);
+	unsigned int mask = ifq->rq_entries - 1;
+
+	entries = min_t(unsigned, entries, PP_ALLOC_CACHE_REFILL - pp->alloc.count);
+	if (unlikely(!entries))
+		return;
+
+	do {
+		struct io_uring_zcrx_rqe *rqe = io_zcrx_get_rqe(ifq, mask);
+		struct io_zcrx_area *area;
+		struct net_iov *niov;
+		unsigned niov_idx, area_idx;
+
+		area_idx = rqe->off >> IORING_ZCRX_AREA_SHIFT;
+		niov_idx = (rqe->off & ~IORING_ZCRX_AREA_MASK) / PAGE_SIZE;
+
+		if (unlikely(rqe->__pad || area_idx))
+			continue;
+		area = ifq->area;
+
+		if (unlikely(niov_idx >= area->nia.num_niovs))
+			continue;
+		niov_idx = array_index_nospec(niov_idx, area->nia.num_niovs);
+
+		niov = &area->nia.niovs[niov_idx];
+		if (!io_zcrx_put_niov_uref(niov))
+			continue;
+		page_pool_mp_return_in_cache(pp, net_iov_to_netmem(niov));
+	} while (--entries);
+
+	smp_store_release(&ifq->rq_ring->head, ifq->cached_rq_head);
+}
+
+static void io_zcrx_refill_slow(struct page_pool *pp, struct io_zcrx_ifq *ifq)
+{
+	struct io_zcrx_area *area = ifq->area;
+
+	spin_lock_bh(&area->freelist_lock);
+	while (area->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
+		struct net_iov *niov;
+		u32 pgid;
+
+		pgid = area->freelist[--area->free_count];
+		niov = &area->nia.niovs[pgid];
+
+		page_pool_mp_return_in_cache(pp, net_iov_to_netmem(niov));
+
+		pp->pages_state_hold_cnt++;
+		trace_page_pool_state_hold(pp, net_iov_to_netmem(niov),
+					   pp->pages_state_hold_cnt);
+	}
+	spin_unlock_bh(&area->freelist_lock);
+}
+
+static void io_zcrx_recycle_niov(struct net_iov *niov)
+{
+	struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
+
+	spin_lock_bh(&area->freelist_lock);
+	area->freelist[area->free_count++] = net_iov_idx(niov);
+	spin_unlock_bh(&area->freelist_lock);
+}
+
+static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
+{
+	struct io_zcrx_ifq *ifq = pp->mp_priv;
+
+	/* pp should already be ensuring that */
+	if (unlikely(pp->alloc.count))
+		goto out_return;
+
+	io_zcrx_ring_refill(pp, ifq);
+	if (likely(pp->alloc.count))
+		goto out_return;
+
+	io_zcrx_refill_slow(pp, ifq);
+	if (!pp->alloc.count)
+		return 0;
+out_return:
+	return pp->alloc.cache[--pp->alloc.count];
+}
+
+static bool io_pp_zc_release_netmem(struct page_pool *pp, netmem_ref netmem)
+{
+	struct net_iov *niov;
+
+	if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+		return false;
+
+	niov = netmem_to_net_iov(netmem);
+
+	if (io_zcrx_niov_put(niov, 1))
+		io_zcrx_recycle_niov(niov);
+	return false;
+}
+
+static void io_pp_zc_scrub(struct page_pool *pp)
+{
+	struct io_zcrx_ifq *ifq = pp->mp_priv;
+	struct io_zcrx_area *area = ifq->area;
+	int i;
+
+	/* Reclaim back all buffers given to the user space. */
+	for (i = 0; i < area->nia.num_niovs; i++) {
+		struct net_iov *niov = &area->nia.niovs[i];
+		int count;
+
+		if (!io_zcrx_put_niov_uref(niov))
+			continue;
+		io_zcrx_recycle_niov(niov);
+
+		count = atomic_inc_return_relaxed(&pp->pages_state_release_cnt);
+		trace_page_pool_state_release(pp, net_iov_to_netmem(niov), count);
+	}
+}
+
+static int io_pp_zc_init(struct page_pool *pp)
+{
+	struct io_zcrx_ifq *ifq = pp->mp_priv;
+	struct io_zcrx_area *area = ifq->area;
+	int ret;
+
+	if (!ifq)
+		return -EINVAL;
+	if (pp->p.order != 0)
+		return -EINVAL;
+	if (!pp->p.napi)
+		return -EINVAL;
+
+	ret = page_pool_mp_init_paged_area(pp, &area->nia, area->pages);
+	if (ret)
+		return ret;
+
+	percpu_ref_get(&ifq->ctx->refs);
+	ifq->pp = pp;
+	return 0;
+}
+
+static void io_pp_zc_destroy(struct page_pool *pp)
+{
+	struct io_zcrx_ifq *ifq = pp->mp_priv;
+	struct io_zcrx_area *area = ifq->area;
+
+	page_pool_mp_release_area(pp, &ifq->area->nia);
+
+	ifq->pp = NULL;
+
+	if (WARN_ON_ONCE(area->free_count != area->nia.num_niovs))
+		return;
+	percpu_ref_put(&ifq->ctx->refs);
+}
+
+static const struct memory_provider_ops io_uring_pp_zc_ops = {
+	.alloc_netmems		= io_pp_zc_alloc_netmems,
+	.release_netmem		= io_pp_zc_release_netmem,
+	.init			= io_pp_zc_init,
+	.destroy		= io_pp_zc_destroy,
+	.scrub			= io_pp_zc_scrub,
+};
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index a8db61498c67..464b4bd89b64 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -5,6 +5,9 @@
 #include <linux/io_uring_types.h>
 #include <net/page_pool/types.h>
 
+#define IO_ZC_RX_UREF			0x10000
+#define IO_ZC_RX_KREF_MASK		(IO_ZC_RX_UREF - 1)
+
 struct io_zcrx_area {
 	struct net_iov_area	nia;
 	struct io_zcrx_ifq	*ifq;
@@ -22,10 +25,12 @@ struct io_zcrx_ifq {
 	struct io_ring_ctx		*ctx;
 	struct net_device		*dev;
 	struct io_zcrx_area		*area;
+	struct page_pool		*pp;
 
 	struct io_uring			*rq_ring;
 	struct io_uring_zcrx_rqe 	*rqes;
 	u32				rq_entries;
+	u32				cached_rq_head;
 
 	unsigned short			n_rqe_pages;
 	struct page			**rqe_pages;
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 12/15] io_uring/zcrx: add io_recvzc request
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
                   ` (10 preceding siblings ...)
  2024-10-29 23:05 ` [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-11-01 20:11   ` Mina Almasry
  2024-10-29 23:05 ` [PATCH v7 13/15] io_uring/zcrx: set pp memory provider for an rx queue David Wei
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

Add io_uring opcode OP_RECV_ZC for doing zero copy reads out of a
socket. Only the connection should be land on the specific rx queue set
up for zero copy, and the socket must be handled by the io_uring
instance that the rx queue was registered for zero copy with. That's
because neither net_iovs / buffers from our queue can be read by outside
applications, nor zero copy is possible if traffic for the zero copy
connection goes to another queue. This coordination is outside of the
scope of this patch series. Also, any traffic directed to the zero copy
enabled queue is immediately visible to the application, which is why
CAP_NET_ADMIN is required at the registeration step.

Of course, no data is actually read out of the socket, it has already
been copied by the netdev into userspace memory via DMA. OP_RECV_ZC
reads skbs out of the socket and checks that its frags are indeed
net_iovs that belong to io_uring. A cqe is queued for each one of these
frags.

Recall that each cqe is a big cqe, with the top half being an
io_uring_zcrx_cqe. The cqe res field contains the len or error. The
lower IORING_ZCRX_AREA_SHIFT bits of the struct io_uring_zcrx_cqe::off
field contain the offset relative to the start of the zero copy area.
The upper part of the off field is trivially zero, and will be used
to carry the area id.

For now, there is no limit as to how much work each OP_RECV_ZC request
does. It will attempt to drain a socket of all available data. This
request always operates in multishot mode.

Signed-off-by: David Wei <[email protected]>
---
 include/uapi/linux/io_uring.h |   2 +
 io_uring/io_uring.h           |  10 ++
 io_uring/net.c                |  71 +++++++++++++
 io_uring/opdef.c              |  16 +++
 io_uring/zcrx.c               | 181 +++++++++++++++++++++++++++++++++-
 io_uring/zcrx.h               |  11 +++
 6 files changed, 290 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index d43183264dcf..0dcb239ebc59 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -87,6 +87,7 @@ struct io_uring_sqe {
 	union {
 		__s32	splice_fd_in;
 		__u32	file_index;
+		__u32	zcrx_ifq_idx;
 		__u32	optlen;
 		struct {
 			__u16	addr_len;
@@ -259,6 +260,7 @@ enum io_uring_op {
 	IORING_OP_FTRUNCATE,
 	IORING_OP_BIND,
 	IORING_OP_LISTEN,
+	IORING_OP_RECV_ZC,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 12e8fca73891..f02efae82444 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -176,6 +176,16 @@ static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret
 	return io_get_cqe_overflow(ctx, ret, false);
 }
 
+static inline bool io_defer_get_uncommited_cqe(struct io_ring_ctx *ctx,
+					       struct io_uring_cqe **cqe_ret)
+{
+	io_lockdep_assert_cq_locked(ctx);
+
+	ctx->cq_extra++;
+	ctx->submit_state.cq_flush = true;
+	return io_get_cqe(ctx, cqe_ret);
+}
+
 static __always_inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
 					    struct io_kiocb *req)
 {
diff --git a/io_uring/net.c b/io_uring/net.c
index 2040195e33ab..5c015b47582f 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -16,6 +16,7 @@
 #include "net.h"
 #include "notif.h"
 #include "rsrc.h"
+#include "zcrx.h"
 
 #if defined(CONFIG_NET)
 struct io_shutdown {
@@ -88,6 +89,13 @@ struct io_sr_msg {
  */
 #define MULTISHOT_MAX_RETRY	32
 
+struct io_recvzc {
+	struct file			*file;
+	unsigned			msg_flags;
+	u16				flags;
+	struct io_zcrx_ifq		*ifq;
+};
+
 int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_shutdown *shutdown = io_kiocb_to_cmd(req, struct io_shutdown);
@@ -1208,6 +1216,69 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 	return ret;
 }
 
+int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
+	unsigned ifq_idx;
+
+	if (unlikely(sqe->file_index || sqe->addr2 || sqe->addr ||
+		     sqe->len || sqe->addr3))
+		return -EINVAL;
+
+	ifq_idx = READ_ONCE(sqe->zcrx_ifq_idx);
+	if (ifq_idx != 0)
+		return -EINVAL;
+	zc->ifq = req->ctx->ifq;
+	if (!zc->ifq)
+		return -EINVAL;
+
+	zc->flags = READ_ONCE(sqe->ioprio);
+	zc->msg_flags = READ_ONCE(sqe->msg_flags);
+	if (zc->msg_flags)
+		return -EINVAL;
+	if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST | IORING_RECV_MULTISHOT))
+		return -EINVAL;
+	/* multishot required */
+	if (!(zc->flags & IORING_RECV_MULTISHOT))
+		return -EINVAL;
+	/* All data completions are posted as aux CQEs. */
+	req->flags |= REQ_F_APOLL_MULTISHOT;
+
+	return 0;
+}
+
+int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
+	struct socket *sock;
+	int ret;
+
+	if (!(req->flags & REQ_F_POLLED) &&
+	    (zc->flags & IORING_RECVSEND_POLL_FIRST))
+		return -EAGAIN;
+
+	sock = sock_from_file(req->file);
+	if (unlikely(!sock))
+		return -ENOTSOCK;
+
+	ret = io_zcrx_recv(req, zc->ifq, sock, zc->msg_flags | MSG_DONTWAIT);
+	if (unlikely(ret <= 0) && ret != -EAGAIN) {
+		if (ret == -ERESTARTSYS)
+			ret = -EINTR;
+
+		req_set_fail(req);
+		io_req_set_res(req, ret, 0);
+
+		if (issue_flags & IO_URING_F_MULTISHOT)
+			return IOU_STOP_MULTISHOT;
+		return IOU_OK;
+	}
+
+	if (issue_flags & IO_URING_F_MULTISHOT)
+		return IOU_ISSUE_SKIP_COMPLETE;
+	return -EAGAIN;
+}
+
 void io_send_zc_cleanup(struct io_kiocb *req)
 {
 	struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index a2be3bbca5ff..599eb3ea5ff4 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -36,6 +36,7 @@
 #include "waitid.h"
 #include "futex.h"
 #include "truncate.h"
+#include "zcrx.h"
 
 static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags)
 {
@@ -513,6 +514,18 @@ const struct io_issue_def io_issue_defs[] = {
 		.async_size		= sizeof(struct io_async_msghdr),
 #else
 		.prep			= io_eopnotsupp_prep,
+#endif
+	},
+	[IORING_OP_RECV_ZC] = {
+		.needs_file		= 1,
+		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
+		.ioprio			= 1,
+#if defined(CONFIG_NET)
+		.prep			= io_recvzc_prep,
+		.issue			= io_recvzc,
+#else
+		.prep			= io_eopnotsupp_prep,
 #endif
 	},
 };
@@ -742,6 +755,9 @@ const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_LISTEN] = {
 		.name			= "LISTEN",
 	},
+	[IORING_OP_RECV_ZC] = {
+		.name			= "RECV_ZC",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index aad35676207e..477b0d1b7b91 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -8,6 +8,8 @@
 #include <net/page_pool/helpers.h>
 #include <net/page_pool/memory_provider.h>
 #include <trace/events/page_pool.h>
+#include <net/tcp.h>
+#include <net/rps.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -19,7 +21,12 @@
 
 #define IO_RQ_MAX_ENTRIES		32768
 
-__maybe_unused
+struct io_zcrx_args {
+	struct io_kiocb		*req;
+	struct io_zcrx_ifq	*ifq;
+	struct socket		*sock;
+};
+
 static const struct memory_provider_ops io_uring_pp_zc_ops;
 
 static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
@@ -249,6 +256,11 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
 	lockdep_assert_held(&ctx->uring_lock);
 }
 
+static void io_zcrx_get_buf_uref(struct net_iov *niov)
+{
+	atomic_long_add(IO_ZC_RX_UREF, &niov->pp_ref_count);
+}
+
 static bool io_zcrx_niov_put(struct net_iov *niov, int nr)
 {
 	return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
@@ -445,3 +457,170 @@ static const struct memory_provider_ops io_uring_pp_zc_ops = {
 	.destroy		= io_pp_zc_destroy,
 	.scrub			= io_pp_zc_scrub,
 };
+
+static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov,
+			      struct io_zcrx_ifq *ifq, int off, int len)
+{
+	struct io_uring_zcrx_cqe *rcqe;
+	struct io_zcrx_area *area;
+	struct io_uring_cqe *cqe;
+	u64 offset;
+
+	if (!io_defer_get_uncommited_cqe(req->ctx, &cqe))
+		return false;
+
+	cqe->user_data = req->cqe.user_data;
+	cqe->res = len;
+	cqe->flags = IORING_CQE_F_MORE;
+
+	area = io_zcrx_iov_to_area(niov);
+	offset = off + (net_iov_idx(niov) << PAGE_SHIFT);
+	rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
+	rcqe->off = offset + ((u64)area->area_id << IORING_ZCRX_AREA_SHIFT);
+	rcqe->__pad = 0;
+	return true;
+}
+
+static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+			     const skb_frag_t *frag, int off, int len)
+{
+	struct net_iov *niov;
+
+	off += skb_frag_off(frag);
+
+	if (unlikely(!skb_frag_is_net_iov(frag)))
+		return -EOPNOTSUPP;
+
+	niov = netmem_to_net_iov(frag->netmem);
+	if (niov->pp->mp_ops != &io_uring_pp_zc_ops ||
+	    niov->pp->mp_priv != ifq)
+		return -EFAULT;
+
+	if (!io_zcrx_queue_cqe(req, niov, ifq, off, len))
+		return -ENOSPC;
+	io_zcrx_get_buf_uref(niov);
+	return len;
+}
+
+static int
+io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
+		 unsigned int offset, size_t len)
+{
+	struct io_zcrx_args *args = desc->arg.data;
+	struct io_zcrx_ifq *ifq = args->ifq;
+	struct io_kiocb *req = args->req;
+	struct sk_buff *frag_iter;
+	unsigned start, start_off;
+	int i, copy, end, off;
+	int ret = 0;
+
+	start = skb_headlen(skb);
+	start_off = offset;
+
+	if (offset < start)
+		return -EOPNOTSUPP;
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		const skb_frag_t *frag;
+
+		if (WARN_ON(start > offset + len))
+			return -EFAULT;
+
+		frag = &skb_shinfo(skb)->frags[i];
+		end = start + skb_frag_size(frag);
+
+		if (offset < end) {
+			copy = end - offset;
+			if (copy > len)
+				copy = len;
+
+			off = offset - start;
+			ret = io_zcrx_recv_frag(req, ifq, frag, off, copy);
+			if (ret < 0)
+				goto out;
+
+			offset += ret;
+			len -= ret;
+			if (len == 0 || ret != copy)
+				goto out;
+		}
+		start = end;
+	}
+
+	skb_walk_frags(skb, frag_iter) {
+		if (WARN_ON(start > offset + len))
+			return -EFAULT;
+
+		end = start + frag_iter->len;
+		if (offset < end) {
+			copy = end - offset;
+			if (copy > len)
+				copy = len;
+
+			off = offset - start;
+			ret = io_zcrx_recv_skb(desc, frag_iter, off, copy);
+			if (ret < 0)
+				goto out;
+
+			offset += ret;
+			len -= ret;
+			if (len == 0 || ret != copy)
+				goto out;
+		}
+		start = end;
+	}
+
+out:
+	if (offset == start_off)
+		return ret;
+	return offset - start_off;
+}
+
+static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+				struct sock *sk, int flags)
+{
+	struct io_zcrx_args args = {
+		.req = req,
+		.ifq = ifq,
+		.sock = sk->sk_socket,
+	};
+	read_descriptor_t rd_desc = {
+		.count = 1,
+		.arg.data = &args,
+	};
+	int ret;
+
+	lock_sock(sk);
+	ret = tcp_read_sock(sk, &rd_desc, io_zcrx_recv_skb);
+	if (ret <= 0) {
+		if (ret < 0 || sock_flag(sk, SOCK_DONE))
+			goto out;
+		if (sk->sk_err)
+			ret = sock_error(sk);
+		else if (sk->sk_shutdown & RCV_SHUTDOWN)
+			goto out;
+		else if (sk->sk_state == TCP_CLOSE)
+			ret = -ENOTCONN;
+		else
+			ret = -EAGAIN;
+	} else if (sock_flag(sk, SOCK_DONE)) {
+		/* Make it to retry until it finally gets 0. */
+		ret = -EAGAIN;
+	}
+out:
+	release_sock(sk);
+	return ret;
+}
+
+int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+		 struct socket *sock, unsigned int flags)
+{
+	struct sock *sk = sock->sk;
+	const struct proto *prot = READ_ONCE(sk->sk_prot);
+
+	if (prot->recvmsg != tcp_recvmsg)
+		return -EPROTONOSUPPORT;
+
+	sock_rps_record_flow(sk);
+	return io_zcrx_tcp_recvmsg(req, ifq, sk, flags);
+}
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 464b4bd89b64..1f039ad45a63 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -3,6 +3,7 @@
 #define IOU_ZC_RX_H
 
 #include <linux/io_uring_types.h>
+#include <linux/socket.h>
 #include <net/page_pool/types.h>
 
 #define IO_ZC_RX_UREF			0x10000
@@ -43,6 +44,8 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 			 struct io_uring_zcrx_ifq_reg __user *arg);
 void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx);
 void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx);
+int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+		 struct socket *sock, unsigned int flags);
 #else
 static inline int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 					struct io_uring_zcrx_ifq_reg __user *arg)
@@ -55,6 +58,14 @@ static inline void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
 static inline void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
 {
 }
+static inline int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+			       struct socket *sock, unsigned int flags)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
+int io_recvzc(struct io_kiocb *req, unsigned int issue_flags);
+int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+
 #endif
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 13/15] io_uring/zcrx: set pp memory provider for an rx queue
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
                   ` (11 preceding siblings ...)
  2024-10-29 23:05 ` [PATCH v7 12/15] io_uring/zcrx: add io_recvzc request David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-11-01 20:16   ` Mina Almasry
  2024-10-29 23:05 ` [PATCH v7 14/15] io_uring/zcrx: add copy fallback David Wei
  2024-10-29 23:05 ` [PATCH v7 15/15] io_uring/zcrx: throttle receive requests David Wei
  14 siblings, 1 reply; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: David Wei <[email protected]>

Set the page pool memory provider for the rx queue configured for zero
copy to io_uring. Then the rx queue is reset using
netdev_rx_queue_restart() and netdev core + page pool will take care of
filling the rx queue from the io_uring zero copy memory provider.

For now, there is only one ifq so its destruction happens implicitly
during io_uring cleanup.

Signed-off-by: David Wei <[email protected]>
---
 io_uring/zcrx.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++--
 io_uring/zcrx.h |  2 ++
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 477b0d1b7b91..3f4625730dbd 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -8,6 +8,7 @@
 #include <net/page_pool/helpers.h>
 #include <net/page_pool/memory_provider.h>
 #include <trace/events/page_pool.h>
+#include <net/netdev_rx_queue.h>
 #include <net/tcp.h>
 #include <net/rps.h>
 
@@ -36,6 +37,65 @@ static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *nio
 	return container_of(owner, struct io_zcrx_area, nia);
 }
 
+static int io_open_zc_rxq(struct io_zcrx_ifq *ifq, unsigned ifq_idx)
+{
+	struct netdev_rx_queue *rxq;
+	struct net_device *dev = ifq->dev;
+	int ret;
+
+	ASSERT_RTNL();
+
+	if (ifq_idx >= dev->num_rx_queues)
+		return -EINVAL;
+	ifq_idx = array_index_nospec(ifq_idx, dev->num_rx_queues);
+
+	rxq = __netif_get_rx_queue(ifq->dev, ifq_idx);
+	if (rxq->mp_params.mp_priv)
+		return -EEXIST;
+
+	ifq->if_rxq = ifq_idx;
+	rxq->mp_params.mp_ops = &io_uring_pp_zc_ops;
+	rxq->mp_params.mp_priv = ifq;
+	ret = netdev_rx_queue_restart(ifq->dev, ifq->if_rxq);
+	if (ret)
+		goto fail;
+	return 0;
+fail:
+	rxq->mp_params.mp_ops = NULL;
+	rxq->mp_params.mp_priv = NULL;
+	ifq->if_rxq = -1;
+	return ret;
+}
+
+static void io_close_zc_rxq(struct io_zcrx_ifq *ifq)
+{
+	struct netdev_rx_queue *rxq;
+	int err;
+
+	if (ifq->if_rxq == -1)
+		return;
+
+	rtnl_lock();
+	if (WARN_ON_ONCE(ifq->if_rxq >= ifq->dev->num_rx_queues)) {
+		rtnl_unlock();
+		return;
+	}
+
+	rxq = __netif_get_rx_queue(ifq->dev, ifq->if_rxq);
+
+	WARN_ON_ONCE(rxq->mp_params.mp_priv != ifq);
+
+	rxq->mp_params.mp_ops = NULL;
+	rxq->mp_params.mp_priv = NULL;
+
+	err = netdev_rx_queue_restart(ifq->dev, ifq->if_rxq);
+	if (err)
+		pr_devel("io_uring: can't restart a queue on zcrx close\n");
+
+	rtnl_unlock();
+	ifq->if_rxq = -1;
+}
+
 static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
 				 struct io_uring_zcrx_ifq_reg *reg)
 {
@@ -156,9 +216,12 @@ static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
 
 static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
 {
+	io_close_zc_rxq(ifq);
+
 	if (ifq->area)
 		io_zcrx_free_area(ifq->area);
-
+	if (ifq->dev)
+		netdev_put(ifq->dev, &ifq->netdev_tracker);
 	io_free_rbuf_ring(ifq);
 	kfree(ifq);
 }
@@ -214,7 +277,18 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 		goto err;
 
 	ifq->rq_entries = reg.rq_entries;
-	ifq->if_rxq = reg.if_rxq;
+
+	ret = -ENODEV;
+	rtnl_lock();
+	ifq->dev = netdev_get_by_index(current->nsproxy->net_ns, reg.if_idx,
+				       &ifq->netdev_tracker, GFP_KERNEL);
+	if (!ifq->dev)
+		goto err_rtnl_unlock;
+
+	ret = io_open_zc_rxq(ifq, reg.if_rxq);
+	if (ret)
+		goto err_rtnl_unlock;
+	rtnl_unlock();
 
 	ring_sz = sizeof(struct io_uring);
 	rqes_sz = sizeof(struct io_uring_zcrx_rqe) * ifq->rq_entries;
@@ -224,15 +298,20 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 	reg.offsets.tail = offsetof(struct io_uring, tail);
 
 	if (copy_to_user(arg, &reg, sizeof(reg))) {
+		io_close_zc_rxq(ifq);
 		ret = -EFAULT;
 		goto err;
 	}
 	if (copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) {
+		io_close_zc_rxq(ifq);
 		ret = -EFAULT;
 		goto err;
 	}
 	ctx->ifq = ifq;
 	return 0;
+
+err_rtnl_unlock:
+	rtnl_unlock();
 err:
 	io_zcrx_ifq_free(ifq);
 	return ret;
@@ -254,6 +333,9 @@ void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
 void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
 {
 	lockdep_assert_held(&ctx->uring_lock);
+
+	if (ctx->ifq)
+		io_close_zc_rxq(ctx->ifq);
 }
 
 static void io_zcrx_get_buf_uref(struct net_iov *niov)
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 1f039ad45a63..d3f6b6cdd647 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -5,6 +5,7 @@
 #include <linux/io_uring_types.h>
 #include <linux/socket.h>
 #include <net/page_pool/types.h>
+#include <net/net_trackers.h>
 
 #define IO_ZC_RX_UREF			0x10000
 #define IO_ZC_RX_KREF_MASK		(IO_ZC_RX_UREF - 1)
@@ -37,6 +38,7 @@ struct io_zcrx_ifq {
 	struct page			**rqe_pages;
 
 	u32				if_rxq;
+	netdevice_tracker		netdev_tracker;
 };
 
 #if defined(CONFIG_IO_URING_ZCRX)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 14/15] io_uring/zcrx: add copy fallback
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
                   ` (12 preceding siblings ...)
  2024-10-29 23:05 ` [PATCH v7 13/15] io_uring/zcrx: set pp memory provider for an rx queue David Wei
@ 2024-10-29 23:05 ` David Wei
  2024-10-29 23:05 ` [PATCH v7 15/15] io_uring/zcrx: throttle receive requests David Wei
  14 siblings, 0 replies; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: Pavel Begunkov <[email protected]>

There are scenarios in which the zerocopy path might get a normal
in-kernel buffer, it could be a mis-steered packet or simply the linear
part of an skb. Another use case is to allow the driver to allocate
kernel pages when it's out of zc buffers, which makes it more resilient
to spikes in load and allow the user to choose the balance between the
amount of memory provided and performance.

At the moment we fail such requests. Instead, grab a buffer from the
page pool, copy data there, and return back to user in the usual way.
Because the refill ring is private to the napi our page pool is running
from, it's done by stopping the napi via napi_execute() helper. It grabs
only one buffer, which is inefficient, and improving it is left for
follow up patches.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 io_uring/zcrx.c | 133 +++++++++++++++++++++++++++++++++++++++++++++---
 io_uring/zcrx.h |   1 +
 2 files changed, 127 insertions(+), 7 deletions(-)

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 3f4625730dbd..1f4db70e3370 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -5,6 +5,8 @@
 #include <linux/nospec.h>
 #include <linux/netdevice.h>
 #include <linux/io_uring.h>
+#include <linux/skbuff_ref.h>
+#include <net/busy_poll.h>
 #include <net/page_pool/helpers.h>
 #include <net/page_pool/memory_provider.h>
 #include <trace/events/page_pool.h>
@@ -28,6 +30,11 @@ struct io_zcrx_args {
 	struct socket		*sock;
 };
 
+struct io_zc_refill_data {
+	struct io_zcrx_ifq *ifq;
+	struct net_iov *niov;
+};
+
 static const struct memory_provider_ops io_uring_pp_zc_ops;
 
 static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
@@ -37,6 +44,13 @@ static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *nio
 	return container_of(owner, struct io_zcrx_area, nia);
 }
 
+static inline struct page *io_zcrx_iov_page(const struct net_iov *niov)
+{
+	struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
+
+	return area->pages[net_iov_idx(niov)];
+}
+
 static int io_open_zc_rxq(struct io_zcrx_ifq *ifq, unsigned ifq_idx)
 {
 	struct netdev_rx_queue *rxq;
@@ -59,6 +73,13 @@ static int io_open_zc_rxq(struct io_zcrx_ifq *ifq, unsigned ifq_idx)
 	ret = netdev_rx_queue_restart(ifq->dev, ifq->if_rxq);
 	if (ret)
 		goto fail;
+
+	if (WARN_ON_ONCE(!ifq->pp)) {
+		ret = -EFAULT;
+		goto fail;
+	}
+	/* grab napi_id while still under rtnl */
+	ifq->napi_id = ifq->pp->p.napi->napi_id;
 	return 0;
 fail:
 	rxq->mp_params.mp_ops = NULL;
@@ -526,6 +547,7 @@ static void io_pp_zc_destroy(struct page_pool *pp)
 	page_pool_mp_release_area(pp, &ifq->area->nia);
 
 	ifq->pp = NULL;
+	ifq->napi_id = 0;
 
 	if (WARN_ON_ONCE(area->free_count != area->nia.num_niovs))
 		return;
@@ -540,6 +562,34 @@ static const struct memory_provider_ops io_uring_pp_zc_ops = {
 	.scrub			= io_pp_zc_scrub,
 };
 
+static void io_napi_refill(void *data)
+{
+	struct io_zc_refill_data *rd = data;
+	struct io_zcrx_ifq *ifq = rd->ifq;
+	netmem_ref netmem;
+
+	if (WARN_ON_ONCE(!ifq->pp))
+		return;
+
+	netmem = page_pool_alloc_netmem(ifq->pp, GFP_ATOMIC | __GFP_NOWARN);
+	if (!netmem)
+		return;
+	if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+		return;
+
+	rd->niov = netmem_to_net_iov(netmem);
+}
+
+static struct net_iov *io_zc_get_buf_task_safe(struct io_zcrx_ifq *ifq)
+{
+	struct io_zc_refill_data rd = {
+		.ifq = ifq,
+	};
+
+	napi_execute(ifq->napi_id, io_napi_refill, &rd);
+	return rd.niov;
+}
+
 static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov,
 			      struct io_zcrx_ifq *ifq, int off, int len)
 {
@@ -563,6 +613,45 @@ static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov,
 	return true;
 }
 
+static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+				  void *data, unsigned int offset, size_t len)
+{
+	size_t copy_size, copied = 0;
+	int ret = 0, off = 0;
+	struct page *page;
+	u8 *vaddr;
+
+	do {
+		struct net_iov *niov;
+
+		niov = io_zc_get_buf_task_safe(ifq);
+		if (!niov) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		page = io_zcrx_iov_page(niov);
+		vaddr = kmap_local_page(page);
+		copy_size = min_t(size_t, PAGE_SIZE, len);
+		memcpy(vaddr, data + offset, copy_size);
+		kunmap_local(vaddr);
+
+		if (!io_zcrx_queue_cqe(req, niov, ifq, off, copy_size)) {
+			napi_pp_put_page(net_iov_to_netmem(niov));
+			return -ENOSPC;
+		}
+
+		io_zcrx_get_buf_uref(niov);
+		napi_pp_put_page(net_iov_to_netmem(niov));
+
+		offset += copy_size;
+		len -= copy_size;
+		copied += copy_size;
+	} while (offset < len);
+
+	return copied ? copied : ret;
+}
+
 static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 			     const skb_frag_t *frag, int off, int len)
 {
@@ -570,8 +659,24 @@ static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 
 	off += skb_frag_off(frag);
 
-	if (unlikely(!skb_frag_is_net_iov(frag)))
-		return -EOPNOTSUPP;
+	if (unlikely(!skb_frag_is_net_iov(frag))) {
+		struct page *page = skb_frag_page(frag);
+		u32 p_off, p_len, t, copied = 0;
+		u8 *vaddr;
+		int ret = 0;
+
+		skb_frag_foreach_page(frag, off, len,
+				      page, p_off, p_len, t) {
+			vaddr = kmap_local_page(page);
+			ret = io_zcrx_copy_chunk(req, ifq, vaddr, p_off, p_len);
+			kunmap_local(vaddr);
+
+			if (ret < 0)
+				return copied ? copied : ret;
+			copied += ret;
+		}
+		return copied;
+	}
 
 	niov = netmem_to_net_iov(frag->netmem);
 	if (niov->pp->mp_ops != &io_uring_pp_zc_ops ||
@@ -592,15 +697,29 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
 	struct io_zcrx_ifq *ifq = args->ifq;
 	struct io_kiocb *req = args->req;
 	struct sk_buff *frag_iter;
-	unsigned start, start_off;
+	unsigned start, start_off = offset;
 	int i, copy, end, off;
 	int ret = 0;
 
-	start = skb_headlen(skb);
-	start_off = offset;
+	if (unlikely(offset < skb_headlen(skb))) {
+		ssize_t copied;
+		size_t to_copy;
 
-	if (offset < start)
-		return -EOPNOTSUPP;
+		to_copy = min_t(size_t, skb_headlen(skb) - offset, len);
+		copied = io_zcrx_copy_chunk(req, ifq, skb->data, offset, to_copy);
+		if (copied < 0) {
+			ret = copied;
+			goto out;
+		}
+		offset += copied;
+		len -= copied;
+		if (!len)
+			goto out;
+		if (offset != skb_headlen(skb))
+			goto out;
+	}
+
+	start = skb_headlen(skb);
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		const skb_frag_t *frag;
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index d3f6b6cdd647..5d7920972e95 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -39,6 +39,7 @@ struct io_zcrx_ifq {
 
 	u32				if_rxq;
 	netdevice_tracker		netdev_tracker;
+	unsigned			napi_id;
 };
 
 #if defined(CONFIG_IO_URING_ZCRX)
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v7 15/15] io_uring/zcrx: throttle receive requests
  2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
                   ` (13 preceding siblings ...)
  2024-10-29 23:05 ` [PATCH v7 14/15] io_uring/zcrx: add copy fallback David Wei
@ 2024-10-29 23:05 ` David Wei
  14 siblings, 0 replies; 47+ messages in thread
From: David Wei @ 2024-10-29 23:05 UTC (permalink / raw)
  To: io-uring, netdev
  Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

From: Pavel Begunkov <[email protected]>

io_zc_rx_tcp_recvmsg() continues until it fails or there is nothing to
receive. If the other side sends fast enough, we might get stuck in
io_zc_rx_tcp_recvmsg() producing more and more CQEs but not letting the
user to handle them leading to unbound latencies.

Break out of it based on an arbitrarily chosen limit, the upper layer
will either return to userspace or requeue the request.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: David Wei <[email protected]>
---
 io_uring/net.c  |  5 ++++-
 io_uring/zcrx.c | 17 ++++++++++++++---
 io_uring/zcrx.h |  6 ++++--
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index 5c015b47582f..f44344942d32 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1261,10 +1261,13 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(!sock))
 		return -ENOTSOCK;
 
-	ret = io_zcrx_recv(req, zc->ifq, sock, zc->msg_flags | MSG_DONTWAIT);
+	ret = io_zcrx_recv(req, zc->ifq, sock, zc->msg_flags | MSG_DONTWAIT,
+			   issue_flags);
 	if (unlikely(ret <= 0) && ret != -EAGAIN) {
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
+		if (ret == IOU_REQUEUE)
+			return IOU_REQUEUE;
 
 		req_set_fail(req);
 		io_req_set_res(req, ret, 0);
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 1f4db70e3370..a2c753e8e46e 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -24,10 +24,13 @@
 
 #define IO_RQ_MAX_ENTRIES		32768
 
+#define IO_SKBS_PER_CALL_LIMIT	20
+
 struct io_zcrx_args {
 	struct io_kiocb		*req;
 	struct io_zcrx_ifq	*ifq;
 	struct socket		*sock;
+	unsigned		nr_skbs;
 };
 
 struct io_zc_refill_data {
@@ -701,6 +704,9 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
 	int i, copy, end, off;
 	int ret = 0;
 
+	if (unlikely(args->nr_skbs++ > IO_SKBS_PER_CALL_LIMIT))
+		return -EAGAIN;
+
 	if (unlikely(offset < skb_headlen(skb))) {
 		ssize_t copied;
 		size_t to_copy;
@@ -778,7 +784,8 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
 }
 
 static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
-				struct sock *sk, int flags)
+				struct sock *sk, int flags,
+				unsigned int issue_flags)
 {
 	struct io_zcrx_args args = {
 		.req = req,
@@ -804,6 +811,9 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 			ret = -ENOTCONN;
 		else
 			ret = -EAGAIN;
+	} else if (unlikely(args.nr_skbs > IO_SKBS_PER_CALL_LIMIT) &&
+		   (issue_flags & IO_URING_F_MULTISHOT)) {
+		ret = IOU_REQUEUE;
 	} else if (sock_flag(sk, SOCK_DONE)) {
 		/* Make it to retry until it finally gets 0. */
 		ret = -EAGAIN;
@@ -814,7 +824,8 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 }
 
 int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
-		 struct socket *sock, unsigned int flags)
+		 struct socket *sock, unsigned int flags,
+		 unsigned int issue_flags)
 {
 	struct sock *sk = sock->sk;
 	const struct proto *prot = READ_ONCE(sk->sk_prot);
@@ -823,5 +834,5 @@ int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 		return -EPROTONOSUPPORT;
 
 	sock_rps_record_flow(sk);
-	return io_zcrx_tcp_recvmsg(req, ifq, sk, flags);
+	return io_zcrx_tcp_recvmsg(req, ifq, sk, flags, issue_flags);
 }
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 5d7920972e95..45485bdce61a 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -48,7 +48,8 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx);
 void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx);
 int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
-		 struct socket *sock, unsigned int flags);
+		 struct socket *sock, unsigned int flags,
+		 unsigned int issue_flags);
 #else
 static inline int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 					struct io_uring_zcrx_ifq_reg __user *arg)
@@ -62,7 +63,8 @@ static inline void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
 {
 }
 static inline int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
-			       struct socket *sock, unsigned int flags)
+				struct socket *sock, unsigned int flags,
+				unsigned int issue_flags)
 {
 	return -EOPNOTSUPP;
 }
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 01/15] net: prefix devmem specific helpers
  2024-10-29 23:05 ` [PATCH v7 01/15] net: prefix devmem specific helpers David Wei
@ 2024-11-01 14:57   ` Mina Almasry
  0 siblings, 0 replies; 47+ messages in thread
From: Mina Almasry @ 2024-11-01 14:57 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Tue, Oct 29, 2024 at 4:06 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> Add prefixes to all helpers that are specific to devmem TCP, i.e.
> net_iov_binding[_id].
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>

Reviewed-by: Mina Almasry <[email protected]>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 04/15] net: prepare for non devmem TCP memory providers
  2024-10-29 23:05 ` [PATCH v7 04/15] net: prepare for non devmem TCP memory providers David Wei
@ 2024-11-01 17:09   ` Mina Almasry
  2024-11-01 17:41     ` Pavel Begunkov
  0 siblings, 1 reply; 47+ messages in thread
From: Mina Almasry @ 2024-11-01 17:09 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Tue, Oct 29, 2024 at 4:06 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> There is a good bunch of places in generic paths assuming that the only
> page pool memory provider is devmem TCP. As we want to reuse the net_iov
> and provider infrastructure, we need to patch it up and explicitly check
> the provider type when we branch into devmem TCP code.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
>  net/core/devmem.c         | 10 ++++++++--
>  net/core/devmem.h         |  8 ++++++++
>  net/core/page_pool_user.c | 15 +++++++++------
>  net/ipv4/tcp.c            |  6 ++++++
>  4 files changed, 31 insertions(+), 8 deletions(-)
>
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 01738029e35c..78983a98e5dc 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -28,6 +28,12 @@ static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
>
>  static const struct memory_provider_ops dmabuf_devmem_ops;
>
> +bool net_is_devmem_page_pool_ops(const struct memory_provider_ops *ops)
> +{
> +       return ops == &dmabuf_devmem_ops;
> +}
> +EXPORT_SYMBOL_GPL(net_is_devmem_page_pool_ops);
> +
>  static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
>                                                struct gen_pool_chunk *chunk,
>                                                void *not_used)
> @@ -316,10 +322,10 @@ void dev_dmabuf_uninstall(struct net_device *dev)
>         unsigned int i;
>
>         for (i = 0; i < dev->real_num_rx_queues; i++) {
> -               binding = dev->_rx[i].mp_params.mp_priv;
> -               if (!binding)
> +               if (dev->_rx[i].mp_params.mp_ops != &dmabuf_devmem_ops)
>                         continue;
>

Use the net_is_devmem_page_pool_ops helper here?

> +               binding = dev->_rx[i].mp_params.mp_priv;
>                 xa_for_each(&binding->bound_rxqs, xa_idx, rxq)
>                         if (rxq == &dev->_rx[i]) {
>                                 xa_erase(&binding->bound_rxqs, xa_idx);
> diff --git a/net/core/devmem.h b/net/core/devmem.h
> index a2b9913e9a17..a3fdd66bb05b 100644
> --- a/net/core/devmem.h
> +++ b/net/core/devmem.h
> @@ -116,6 +116,8 @@ struct net_iov *
>  net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding);
>  void net_devmem_free_dmabuf(struct net_iov *ppiov);
>
> +bool net_is_devmem_page_pool_ops(const struct memory_provider_ops *ops);
> +
>  #else
>  struct net_devmem_dmabuf_binding;
>
> @@ -168,6 +170,12 @@ static inline u32 net_devmem_iov_binding_id(const struct net_iov *niov)
>  {
>         return 0;
>  }
> +
> +static inline bool
> +net_is_devmem_page_pool_ops(const struct memory_provider_ops *ops)
> +{
> +       return false;
> +}
>  #endif
>
>  #endif /* _NET_DEVMEM_H */
> diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
> index 48335766c1bf..604862a73535 100644
> --- a/net/core/page_pool_user.c
> +++ b/net/core/page_pool_user.c
> @@ -214,7 +214,7 @@ static int
>  page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
>                   const struct genl_info *info)
>  {
> -       struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> +       struct net_devmem_dmabuf_binding *binding;
>         size_t inflight, refsz;
>         void *hdr;
>
> @@ -244,8 +244,11 @@ page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
>                          pool->user.detach_time))
>                 goto err_cancel;
>
> -       if (binding && nla_put_u32(rsp, NETDEV_A_PAGE_POOL_DMABUF, binding->id))
> -               goto err_cancel;
> +       if (net_is_devmem_page_pool_ops(pool->mp_ops)) {
> +               binding = pool->mp_priv;
> +               if (nla_put_u32(rsp, NETDEV_A_PAGE_POOL_DMABUF, binding->id))
> +                       goto err_cancel;
> +       }

Worthy of note is that I think Jakub asked for this introspection, and
likely you should also add similar introspection. I.e. page_pool
dumping should likely be improved to dump that it's bound to io_uring
memory. Not sure what io_uring memory 'id' equivalent would be, if
any.

>
>         genlmsg_end(rsp, hdr);
>
> @@ -353,16 +356,16 @@ void page_pool_unlist(struct page_pool *pool)
>  int page_pool_check_memory_provider(struct net_device *dev,
>                                     struct netdev_rx_queue *rxq)
>  {
> -       struct net_devmem_dmabuf_binding *binding = rxq->mp_params.mp_priv;
> +       void *mp_priv = rxq->mp_params.mp_priv;
>         struct page_pool *pool;
>         struct hlist_node *n;
>
> -       if (!binding)
> +       if (!mp_priv)
>                 return 0;
>
>         mutex_lock(&page_pools_lock);
>         hlist_for_each_entry_safe(pool, n, &dev->page_pools, user.list) {
> -               if (pool->mp_priv != binding)
> +               if (pool->mp_priv != mp_priv)
>                         continue;
>
>                 if (pool->slow.queue_idx == get_netdev_rx_queue_index(rxq)) {
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e928efc22f80..31e01da61c12 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -277,6 +277,7 @@
>  #include <net/ip.h>
>  #include <net/sock.h>
>  #include <net/rstreason.h>
> +#include <net/page_pool/types.h>
>
>  #include <linux/uaccess.h>
>  #include <asm/ioctls.h>
> @@ -2476,6 +2477,11 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
>                         }
>
>                         niov = skb_frag_net_iov(frag);
> +                       if (net_is_devmem_page_pool_ops(niov->pp->mp_ops)) {
> +                               err = -ENODEV;
> +                               goto out;
> +                       }
> +

I think this check needs to go in the caller. Currently the caller
assumes that if !skb_frags_readable(), then the frag is dma-buf, and
calls tcp_recvmsg_dmabuf on it. The caller needs to check that the
frag is specifically a dma-buf frag now.

Can io_uring frags somehow end up in tcp_recvmsg_locked? You're still
using the tcp stack with io_uring ZC right? So I suspect they might?

--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 06/15] net: page pool: add helper creating area from pages
  2024-10-29 23:05 ` [PATCH v7 06/15] net: page pool: add helper creating area from pages David Wei
@ 2024-11-01 17:33   ` Mina Almasry
  2024-11-01 18:16     ` Pavel Begunkov
  0 siblings, 1 reply; 47+ messages in thread
From: Mina Almasry @ 2024-11-01 17:33 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Tue, Oct 29, 2024 at 4:06 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> Add a helper that takes an array of pages and initialises passed in
> memory provider's area with them, where each net_iov takes one page.
> It's also responsible for setting up dma mappings.
>
> We keep it in page_pool.c not to leak netmem details to outside
> providers like io_uring, which don't have access to netmem_priv.h
> and other private helpers.
>

I honestly prefer leaking netmem_priv.h details into the io_uring
rather than having io_uring provider specific code in page_pool.c.

> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
>  include/net/page_pool/memory_provider.h | 10 ++++
>  net/core/page_pool.c                    | 63 ++++++++++++++++++++++++-
>  2 files changed, 71 insertions(+), 2 deletions(-)
>  create mode 100644 include/net/page_pool/memory_provider.h
>
> diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
> new file mode 100644
> index 000000000000..83d7eec0058d
> --- /dev/null
> +++ b/include/net/page_pool/memory_provider.h
> @@ -0,0 +1,10 @@
> +#ifndef _NET_PAGE_POOL_MEMORY_PROVIDER_H
> +#define _NET_PAGE_POOL_MEMORY_PROVIDER_H
> +
> +int page_pool_mp_init_paged_area(struct page_pool *pool,
> +                               struct net_iov_area *area,
> +                               struct page **pages);
> +void page_pool_mp_release_area(struct page_pool *pool,
> +                               struct net_iov_area *area);
> +
> +#endif
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 9a675e16e6a4..8bd4a3c80726 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -13,6 +13,7 @@
>
>  #include <net/netdev_rx_queue.h>
>  #include <net/page_pool/helpers.h>
> +#include <net/page_pool/memory_provider.h>
>  #include <net/xdp.h>
>
>  #include <linux/dma-direction.h>
> @@ -459,7 +460,8 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
>                 __page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
>  }
>
> -static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
> +static bool page_pool_dma_map_page(struct page_pool *pool, netmem_ref netmem,
> +                                  struct page *page)

I have to say this is confusing for me. Passing in both the netmem and
the page is weird. The page is the one being mapped and the
netmem->dma_addr is the one being filled with the mapping.

Netmem is meant to be an abstraction over page. Passing both makes
little sense to me. The reason you're doing this is because the
io_uring memory provider is in a bit of a weird design IMO where the
memory comes in pages but it doesn't want to use paged-backed-netmem.
Instead it uses net_iov-backed-netmem and there is an out of band page
to be managed.

I think it would make sense to use paged-backed-netmem for your use
case, or at least I don't see why it wouldn't work. Memory providers
were designed to handle the hugepage usecase where the memory
allocated by the provider is pages. Is there a reason that doesn't
work for you as well?

If you really need to use net_iov-backed-netmem, can we put this
weirdness in the provider? I don't know that we want a generic-looking
dma_map function which is a bit confusing to take a netmem and a page.

>  {
>         dma_addr_t dma;
>
> @@ -468,7 +470,7 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
>          * into page private data (i.e 32bit cpu with 64bit DMA caps)
>          * This mapping is kept for lifetime of page, until leaving pool.
>          */
> -       dma = dma_map_page_attrs(pool->p.dev, netmem_to_page(netmem), 0,
> +       dma = dma_map_page_attrs(pool->p.dev, page, 0,
>                                  (PAGE_SIZE << pool->p.order), pool->p.dma_dir,
>                                  DMA_ATTR_SKIP_CPU_SYNC |
>                                          DMA_ATTR_WEAK_ORDERING);
> @@ -490,6 +492,11 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
>         return false;
>  }
>
> +static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
> +{
> +       return page_pool_dma_map_page(pool, netmem, netmem_to_page(netmem));
> +}
> +
>  static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
>                                                  gfp_t gfp)
>  {
> @@ -1154,3 +1161,55 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
>         }
>  }
>  EXPORT_SYMBOL(page_pool_update_nid);
> +
> +static void page_pool_release_page_dma(struct page_pool *pool,
> +                                      netmem_ref netmem)
> +{
> +       __page_pool_release_page_dma(pool, netmem);
> +}
> +

Is this wrapper necessary? Do you wanna rename the original function
to remove the __ instead of a adding a wrapper?

> +int page_pool_mp_init_paged_area(struct page_pool *pool,
> +                                struct net_iov_area *area,
> +                                struct page **pages)
> +{
> +       struct net_iov *niov;
> +       netmem_ref netmem;
> +       int i, ret = 0;
> +
> +       if (!pool->dma_map)
> +               return -EOPNOTSUPP;
> +
> +       for (i = 0; i < area->num_niovs; i++) {
> +               niov = &area->niovs[i];
> +               netmem = net_iov_to_netmem(niov);
> +
> +               page_pool_set_pp_info(pool, netmem);
> +               if (!page_pool_dma_map_page(pool, netmem, pages[i])) {
> +                       ret = -EINVAL;
> +                       goto err_unmap_dma;
> +               }
> +       }
> +       return 0;
> +
> +err_unmap_dma:
> +       while (i--) {
> +               netmem = net_iov_to_netmem(&area->niovs[i]);
> +               page_pool_release_page_dma(pool, netmem);
> +       }
> +       return ret;
> +}
> +
> +void page_pool_mp_release_area(struct page_pool *pool,
> +                              struct net_iov_area *area)
> +{
> +       int i;
> +
> +       if (!pool->dma_map)
> +               return;
> +
> +       for (i = 0; i < area->num_niovs; i++) {
> +               struct net_iov *niov = &area->niovs[i];
> +
> +               page_pool_release_page_dma(pool, net_iov_to_netmem(niov));
> +       }
> +}

Similarly I these 2 functions belong in the provider to be honest.


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 04/15] net: prepare for non devmem TCP memory providers
  2024-11-01 17:09   ` Mina Almasry
@ 2024-11-01 17:41     ` Pavel Begunkov
  2024-11-04 20:20       ` Mina Almasry
  0 siblings, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-11-01 17:41 UTC (permalink / raw)
  To: Mina Almasry, David Wei
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Stanislav Fomichev, Joe Damato, Pedro Tammela

On 11/1/24 17:09, Mina Almasry wrote:
> On Tue, Oct 29, 2024 at 4:06 PM David Wei <[email protected]> wrote:
...
>> +
>>   static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
>>                                                 struct gen_pool_chunk *chunk,
>>                                                 void *not_used)
>> @@ -316,10 +322,10 @@ void dev_dmabuf_uninstall(struct net_device *dev)
>>          unsigned int i;
>>
>>          for (i = 0; i < dev->real_num_rx_queues; i++) {
>> -               binding = dev->_rx[i].mp_params.mp_priv;
>> -               if (!binding)
>> +               if (dev->_rx[i].mp_params.mp_ops != &dmabuf_devmem_ops)
>>                          continue;
>>
> 
> Use the net_is_devmem_page_pool_ops helper here?

It could, but the function is there primarily for outside users to
avoid ifdefs and build problems. I don't think it worth reiteration?
I'll change if there is a next version.

...
>> @@ -244,8 +244,11 @@ page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool,
>>                           pool->user.detach_time))
>>                  goto err_cancel;
>>
>> -       if (binding && nla_put_u32(rsp, NETDEV_A_PAGE_POOL_DMABUF, binding->id))
>> -               goto err_cancel;
>> +       if (net_is_devmem_page_pool_ops(pool->mp_ops)) {
>> +               binding = pool->mp_priv;
>> +               if (nla_put_u32(rsp, NETDEV_A_PAGE_POOL_DMABUF, binding->id))
>> +                       goto err_cancel;
>> +       }
> 
> Worthy of note is that I think Jakub asked for this introspection, and
> likely you should also add similar introspection. I.e. page_pool

I think we can patch it up after merging the series? Depends what Jakub
thinks. In any case, I can't parse io_uring ops here until a later patch
adding those ops, so it'd be a new patch if it's a part of this series.

> dumping should likely be improved to dump that it's bound to io_uring
> memory. Not sure what io_uring memory 'id' equivalent would be, if
> any.

I don't think io_uring have any id to give. What is it for in the
first place? Do you give it to netlink to communicate with devmem
TCP or something similar?

>>          genlmsg_end(rsp, hdr);
>>
>> @@ -353,16 +356,16 @@ void page_pool_unlist(struct page_pool *pool)
>>   int page_pool_check_memory_provider(struct net_device *dev,
>>                                      struct netdev_rx_queue *rxq)
...
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index e928efc22f80..31e01da61c12 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -277,6 +277,7 @@
>>   #include <net/ip.h>
>>   #include <net/sock.h>
>>   #include <net/rstreason.h>
>> +#include <net/page_pool/types.h>
>>
>>   #include <linux/uaccess.h>
>>   #include <asm/ioctls.h>
>> @@ -2476,6 +2477,11 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
>>                          }
>>
>>                          niov = skb_frag_net_iov(frag);
>> +                       if (net_is_devmem_page_pool_ops(niov->pp->mp_ops)) {
>> +                               err = -ENODEV;
>> +                               goto out;
>> +                       }
>> +
> 
> I think this check needs to go in the caller. Currently the caller
> assumes that if !skb_frags_readable(), then the frag is dma-buf, and

io_uring originated netmem that are marked unreadable as well
and so will end up in tcp_recvmsg_dmabuf(), then we reject and
fail since they should not be fed to devmem TCP. It should be
fine from correctness perspective.

We need to check frags, and that's the place where we iterate
frags. Another option is to add a loop in tcp_recvmsg_locked
walking over all frags of an skb and doing the checks, but
that's an unnecessary performance burden to devmem.

> calls tcp_recvmsg_dmabuf on it. The caller needs to check that the
> frag is specifically a dma-buf frag now.
> 
> Can io_uring frags somehow end up in tcp_recvmsg_locked? You're still
> using the tcp stack with io_uring ZC right? So I suspect they might?

All of them are using the same socket rx queue, so yes, any of them
can see any type of packet non net_iov / pages

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 06/15] net: page pool: add helper creating area from pages
  2024-11-01 17:33   ` Mina Almasry
@ 2024-11-01 18:16     ` Pavel Begunkov
  2024-11-05 23:34       ` Mina Almasry
  0 siblings, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-11-01 18:16 UTC (permalink / raw)
  To: Mina Almasry, David Wei
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Stanislav Fomichev, Joe Damato, Pedro Tammela

On 11/1/24 17:33, Mina Almasry wrote:
> On Tue, Oct 29, 2024 at 4:06 PM David Wei <[email protected]> wrote:
>>
>> From: Pavel Begunkov <[email protected]>
>>
>> Add a helper that takes an array of pages and initialises passed in
>> memory provider's area with them, where each net_iov takes one page.
>> It's also responsible for setting up dma mappings.
>>
>> We keep it in page_pool.c not to leak netmem details to outside
>> providers like io_uring, which don't have access to netmem_priv.h
>> and other private helpers.
>>
> 
> I honestly prefer leaking netmem_priv.h details into the io_uring
> rather than having io_uring provider specific code in page_pool.c.

Even though Jakub didn't comment on this patch, but he definitely
wasn't fond of giving all those headers to non net/ users. I guess
I can't please everyone. One middle option is to make the page
pool helper more granular, i.e. taking care of one netmem at
a time, and moving the loop to io_uring, but I don't think it
changes anything.

...
>>   #include <linux/dma-direction.h>
>> @@ -459,7 +460,8 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
>>                  __page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
>>   }
>>
>> -static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
>> +static bool page_pool_dma_map_page(struct page_pool *pool, netmem_ref netmem,
>> +                                  struct page *page)
> 
> I have to say this is confusing for me. Passing in both the netmem and
> the page is weird. The page is the one being mapped and the
> netmem->dma_addr is the one being filled with the mapping.

the page argument provides a mapping, the netmem gives the object
where the mapping is set. netmem could be the same as the page
argument, but I don't think it's inherently wrong, and it's an
internal helper anyway. I can entirely copy paste the function, I
don't think it's anyhow an improvement.

> Netmem is meant to be an abstraction over page. Passing both makes
> little sense to me. The reason you're doing this is because the
> io_uring memory provider is in a bit of a weird design IMO where the
> memory comes in pages but it doesn't want to use paged-backed-netmem.

Mina, as explained it before, I view it rather as an abstraction
that helps with finer grained control over memory and extending
it this way, I don't think it's such a stretch, and it doesn't
change much for the networking stack overall. Not fitting into
devmem TCP category doesn't make it weird.

> Instead it uses net_iov-backed-netmem and there is an out of band page
> to be managed.
> 
> I think it would make sense to use paged-backed-netmem for your use
> case, or at least I don't see why it wouldn't work. Memory providers

It's a user page, we can't make assumptions about it, we can't
reuse space in struct page like for pp refcounting (unlike when
it's allocated by the kernel), we can't use the normal page
refcounting.

If that's the direction people prefer, we can roll back to v1 from
a couple years ago, fill skbs fill user pages, attach ubuf_info to
every skb, and whack-a-mole'ing all places where the page could be
put down or such, pretty similarly what net_iov does. Honestly, I
thought that reusing common infra so that the net stack doesn't
need a different path per interface was a good idea.

> were designed to handle the hugepage usecase where the memory
> allocated by the provider is pages. Is there a reason that doesn't
> work for you as well?
> 
> If you really need to use net_iov-backed-netmem, can we put this
> weirdness in the provider? I don't know that we want a generic-looking
> dma_map function which is a bit confusing to take a netmem and a page.> 
...
>> +
>> +static void page_pool_release_page_dma(struct page_pool *pool,
>> +                                      netmem_ref netmem)
>> +{
>> +       __page_pool_release_page_dma(pool, netmem);
>> +}
>> +
> 
> Is this wrapper necessary? Do you wanna rename the original function
> to remove the __ instead of a adding a wrapper?

I only added it here to cast away __always_inline since it's used in
a slow / setup path. It shouldn't change the binary, but I'm not a huge
fan of leaving the hint for the code where it's not needed.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 07/15] net: page_pool: introduce page_pool_mp_return_in_cache
  2024-10-29 23:05 ` [PATCH v7 07/15] net: page_pool: introduce page_pool_mp_return_in_cache David Wei
@ 2024-11-01 20:05   ` Mina Almasry
  0 siblings, 0 replies; 47+ messages in thread
From: Mina Almasry @ 2024-11-01 20:05 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Tue, Oct 29, 2024 at 4:06 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> Add a helper that allows a page pool memory provider to efficiently
> return a netmem off the allocation callback.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
>  include/net/page_pool/memory_provider.h |  4 ++++
>  net/core/page_pool.c                    | 19 +++++++++++++++++++
>  2 files changed, 23 insertions(+)
>
> diff --git a/include/net/page_pool/memory_provider.h b/include/net/page_pool/memory_provider.h
> index 83d7eec0058d..352b3a35d31c 100644
> --- a/include/net/page_pool/memory_provider.h
> +++ b/include/net/page_pool/memory_provider.h
> @@ -1,3 +1,5 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
>  #ifndef _NET_PAGE_POOL_MEMORY_PROVIDER_H
>  #define _NET_PAGE_POOL_MEMORY_PROVIDER_H
>
> @@ -7,4 +9,6 @@ int page_pool_mp_init_paged_area(struct page_pool *pool,
>  void page_pool_mp_release_area(struct page_pool *pool,
>                                 struct net_iov_area *area);
>
> +void page_pool_mp_return_in_cache(struct page_pool *pool, netmem_ref netmem);
> +
>  #endif
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 8bd4a3c80726..9078107c906d 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -1213,3 +1213,22 @@ void page_pool_mp_release_area(struct page_pool *pool,
>                 page_pool_release_page_dma(pool, net_iov_to_netmem(niov));
>         }
>  }
> +
> +/*
> + * page_pool_mp_return_in_cache() - return a netmem to the allocation cache.
> + * @pool:      pool from which pages were allocated
> + * @netmem:    netmem to return
> + *
> + * Return already allocated and accounted netmem to the page pool's allocation
> + * cache. The function doesn't provide synchronisation and must only be called
> + * from the napi context.

Maybe add:

/* Caller must verify that there is room in the cache */

> + */
> +void page_pool_mp_return_in_cache(struct page_pool *pool, netmem_ref netmem)
> +{
> +       if (WARN_ON_ONCE(pool->alloc.count >= PP_ALLOC_CACHE_REFILL))
> +               return;

The caller must verify this anyway, right? so maybe this WARN_ON_ONCE
is too defensive.

> +
> +       page_pool_dma_sync_for_device(pool, netmem, -1);
> +       page_pool_fragment_netmem(netmem, 1);
> +       pool->alloc.cache[pool->alloc.count++] = netmem;
> +}
> --
> 2.43.5
>


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
  2024-10-29 23:05 ` [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
@ 2024-11-01 20:06   ` Mina Almasry
  2024-11-01 21:09     ` Pavel Begunkov
  0 siblings, 1 reply; 47+ messages in thread
From: Mina Almasry @ 2024-11-01 20:06 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Tue, Oct 29, 2024 at 4:06 PM David Wei <[email protected]> wrote:
>
> From: Pavel Begunkov <[email protected]>
>
> Implement a page pool memory provider for io_uring to receieve in a
> zero copy fashion. For that, the provider allocates user pages wrapped
> around into struct net_iovs, that are stored in a previously registered
> struct net_iov_area.
>
> Unlike with traditional receives, for which pages from a page pool can
> be deallocated right after the user receives data, e.g. via recv(2),
> we extend the lifetime by recycling buffers only after the user space
> acknowledges that it's done processing the data via the refill queue.
> Before handing buffers to the user, we mark them by bumping the refcount
> by a bias value IO_ZC_RX_UREF, which will be checked when the buffer is
> returned back. When the corresponding io_uring instance and/or page pool
> are destroyed, we'll force back all buffers that are currently in the
> user space in ->io_pp_zc_scrub by clearing the bias.
>
> Refcounting and lifetime:
>
> Initially, all buffers are considered unallocated and stored in
> ->freelist, at which point they are not yet directly exposed to the core
> page pool code and not accounted to page pool's pages_state_hold_cnt.
> The ->alloc_netmems callback will allocate them by placing into the
> page pool's cache, setting the refcount to 1 as usual and adjusting
> pages_state_hold_cnt.
>
> Then, either the buffer is dropped and returns back to the page pool
> into the ->freelist via io_pp_zc_release_netmem, in which case the page
> pool will match hold_cnt for us with ->pages_state_release_cnt. Or more
> likely the buffer will go through the network/protocol stacks and end up
> in the corresponding socket's receive queue. From there the user can get
> it via an new io_uring request implemented in following patches. As
> mentioned above, before giving a buffer to the user we bump the refcount
> by IO_ZC_RX_UREF.
>
> Once the user is done with the buffer processing, it must return it back
> via the refill queue, from where our ->alloc_netmems implementation can
> grab it, check references, put IO_ZC_RX_UREF, and recycle the buffer if
> there are no more users left. As we place such buffers right back into
> the page pools fast cache and they didn't go through the normal pp
> release path, they are still considered "allocated" and no pp hold_cnt
> is required. For the same reason we dma sync buffers for the device
> in io_zc_add_pp_cache().
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
>  io_uring/zcrx.c | 215 ++++++++++++++++++++++++++++++++++++++++++++++++
>  io_uring/zcrx.h |   5 ++
>  2 files changed, 220 insertions(+)
>
> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
> index a276572fe953..aad35676207e 100644
> --- a/io_uring/zcrx.c
> +++ b/io_uring/zcrx.c
> @@ -2,7 +2,12 @@
>  #include <linux/kernel.h>
>  #include <linux/errno.h>
>  #include <linux/mm.h>
> +#include <linux/nospec.h>
> +#include <linux/netdevice.h>
>  #include <linux/io_uring.h>
> +#include <net/page_pool/helpers.h>
> +#include <net/page_pool/memory_provider.h>
> +#include <trace/events/page_pool.h>
>
>  #include <uapi/linux/io_uring.h>
>
> @@ -14,6 +19,16 @@
>
>  #define IO_RQ_MAX_ENTRIES              32768
>
> +__maybe_unused
> +static const struct memory_provider_ops io_uring_pp_zc_ops;
> +
> +static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
> +{
> +       struct net_iov_area *owner = net_iov_owner(niov);
> +
> +       return container_of(owner, struct io_zcrx_area, nia);
> +}
> +

We discussed this before I disappeared on vacation but I'm not too
convinced to be honest, sorry.

It's invalid to call io_zcrx_iov_to_area on a devmem niov and vice
versa, right? So current and future code has to be very careful to
call the right helpers on the right niovs.

At the very least there needs to be a comment above all these
container_of helpers:

/* caller must have verified that this niov is devmem/io_zcrx */.

However I feel like even a comment is extremely error prone. These
container_of's are inside of the call stack of some helpers. I would
say we need a check. If we're concerned about performance, the check
can be behind DEBUG_NET_WARN_ON(), although even that is a bit iffy,
but could be fine. Doing this without a check seems too risky to me.

>  static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
>                                  struct io_uring_zcrx_ifq_reg *reg)
>  {
> @@ -99,6 +114,9 @@ static int io_zcrx_create_area(struct io_ring_ctx *ctx,
>                 goto err;
>
>         for (i = 0; i < nr_pages; i++) {
> +               struct net_iov *niov = &area->nia.niovs[i];
> +
> +               niov->owner = &area->nia;
>                 area->freelist[i] = i;
>         }
>
> @@ -230,3 +248,200 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
>  {
>         lockdep_assert_held(&ctx->uring_lock);
>  }
> +
> +static bool io_zcrx_niov_put(struct net_iov *niov, int nr)
> +{
> +       return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
> +}
> +
> +static bool io_zcrx_put_niov_uref(struct net_iov *niov)
> +{
> +       if (atomic_long_read(&niov->pp_ref_count) < IO_ZC_RX_UREF)
> +               return false;
> +
> +       return io_zcrx_niov_put(niov, IO_ZC_RX_UREF);
> +}
> +

Sorry, I have to push back a bit against this. The refcounting of
netmem is already complicated. the paged netmem has 2 refcounts and
care needs to be taken when acquiring and dropping refcounts. net_iov
inherited the pp_ref_count but not the paged refcount, and again need
some special handling. skb_frag_unref takes very special care checking
pp->recycle, is_pp_netmem, and others to figure out the correct
refcount to put based on the type of the netmem and skb flag.

This code ignores all these generic code
skb_frag_unref/napi_pp_put_page/etc paths and uses raw access to
niv->pp_ref_count. If this is merged as-is, for posterity any changes
in netmem refcounting need to also account for this use case opting
out of these generic code paths that handle all other skb reffing
including devmem.

Additionally since you're opting out of the generic unreffing paths
you're also (as mentioned before) bypassing the pp recycling. AFAICT
that may be hurting your performance. IIUC you refill
PP_ALLOC_CACHE_REFILL (64) entries everytime _alloc_netmems is
entered, and you don't recycle netmem any other way, so your slow path
is entered 1/64 of the page_pool_alloc calls? That should be much
worse than what the normal pp recycling does, which returns all freed
netmem into its alloc.cache or the ptr_ring and hits *_alloc_netmems
much more rarely. There are also regular perf improvements and testing
to the generic pool recycling paths you're also opting out of.

I see a lot of downsides to opting out of the generic use cases. Is
there any reason the normal freeing paths are not applicable to your
use case?

> +static inline void io_zc_add_pp_cache(struct page_pool *pp,
> +                                     struct net_iov *niov)
> +{
> +}
> +

Looks unused/empty.

> +static inline u32 io_zcrx_rqring_entries(struct io_zcrx_ifq *ifq)
> +{
> +       u32 entries;
> +
> +       entries = smp_load_acquire(&ifq->rq_ring->tail) - ifq->cached_rq_head;
> +       return min(entries, ifq->rq_entries);
> +}
> +
> +static struct io_uring_zcrx_rqe *io_zcrx_get_rqe(struct io_zcrx_ifq *ifq,
> +                                                unsigned mask)
> +{
> +       unsigned int idx = ifq->cached_rq_head++ & mask;
> +
> +       return &ifq->rqes[idx];
> +}
> +
> +static void io_zcrx_ring_refill(struct page_pool *pp,
> +                               struct io_zcrx_ifq *ifq)
> +{
> +       unsigned int entries = io_zcrx_rqring_entries(ifq);
> +       unsigned int mask = ifq->rq_entries - 1;
> +
> +       entries = min_t(unsigned, entries, PP_ALLOC_CACHE_REFILL - pp->alloc.count);
> +       if (unlikely(!entries))
> +               return;
> +
> +       do {
> +               struct io_uring_zcrx_rqe *rqe = io_zcrx_get_rqe(ifq, mask);
> +               struct io_zcrx_area *area;
> +               struct net_iov *niov;
> +               unsigned niov_idx, area_idx;
> +
> +               area_idx = rqe->off >> IORING_ZCRX_AREA_SHIFT;
> +               niov_idx = (rqe->off & ~IORING_ZCRX_AREA_MASK) / PAGE_SIZE;
> +
> +               if (unlikely(rqe->__pad || area_idx))
> +                       continue;
> +               area = ifq->area;
> +
> +               if (unlikely(niov_idx >= area->nia.num_niovs))
> +                       continue;
> +               niov_idx = array_index_nospec(niov_idx, area->nia.num_niovs);
> +
> +               niov = &area->nia.niovs[niov_idx];
> +               if (!io_zcrx_put_niov_uref(niov))
> +                       continue;
> +               page_pool_mp_return_in_cache(pp, net_iov_to_netmem(niov));

I'm sorry I pushed back against the provider filling the pp cache in
earlier iterations. That was very wrong.

*_alloc_netmems was meant to be a mirror of
__page_pool_alloc_pages_slow. Since the latter fills the pp->cache,
the former should also fill the pp->cache. The dmabuf mp is actually
deficient in this regard. I'll look into fixing it. This part is more
than fine for me.

I'm still unsure about opting out of the generic freeing paths as I
mentioned above though.

> +       } while (--entries);
> +
> +       smp_store_release(&ifq->rq_ring->head, ifq->cached_rq_head);
> +}
> +
> +static void io_zcrx_refill_slow(struct page_pool *pp, struct io_zcrx_ifq *ifq)
> +{
> +       struct io_zcrx_area *area = ifq->area;
> +
> +       spin_lock_bh(&area->freelist_lock);
> +       while (area->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
> +               struct net_iov *niov;
> +               u32 pgid;
> +
> +               pgid = area->freelist[--area->free_count];
> +               niov = &area->nia.niovs[pgid];
> +
> +               page_pool_mp_return_in_cache(pp, net_iov_to_netmem(niov));
> +
> +               pp->pages_state_hold_cnt++;
> +               trace_page_pool_state_hold(pp, net_iov_to_netmem(niov),
> +                                          pp->pages_state_hold_cnt);
> +       }
> +       spin_unlock_bh(&area->freelist_lock);
> +}
> +
> +static void io_zcrx_recycle_niov(struct net_iov *niov)
> +{
> +       struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
> +
> +       spin_lock_bh(&area->freelist_lock);
> +       area->freelist[area->free_count++] = net_iov_idx(niov);
> +       spin_unlock_bh(&area->freelist_lock);
> +}
> +
> +static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
> +{
> +       struct io_zcrx_ifq *ifq = pp->mp_priv;
> +
> +       /* pp should already be ensuring that */
> +       if (unlikely(pp->alloc.count))
> +               goto out_return;
> +
> +       io_zcrx_ring_refill(pp, ifq);
> +       if (likely(pp->alloc.count))
> +               goto out_return;
> +
> +       io_zcrx_refill_slow(pp, ifq);
> +       if (!pp->alloc.count)
> +               return 0;
> +out_return:
> +       return pp->alloc.cache[--pp->alloc.count];
> +}
> +
> +static bool io_pp_zc_release_netmem(struct page_pool *pp, netmem_ref netmem)
> +{
> +       struct net_iov *niov;
> +
> +       if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
> +               return false;
> +
> +       niov = netmem_to_net_iov(netmem);
> +
> +       if (io_zcrx_niov_put(niov, 1))
> +               io_zcrx_recycle_niov(niov);
> +       return false;
> +}
> +
> +static void io_pp_zc_scrub(struct page_pool *pp)
> +{
> +       struct io_zcrx_ifq *ifq = pp->mp_priv;
> +       struct io_zcrx_area *area = ifq->area;
> +       int i;
> +
> +       /* Reclaim back all buffers given to the user space. */
> +       for (i = 0; i < area->nia.num_niovs; i++) {
> +               struct net_iov *niov = &area->nia.niovs[i];
> +               int count;
> +
> +               if (!io_zcrx_put_niov_uref(niov))
> +                       continue;
> +               io_zcrx_recycle_niov(niov);
> +
> +               count = atomic_inc_return_relaxed(&pp->pages_state_release_cnt);
> +               trace_page_pool_state_release(pp, net_iov_to_netmem(niov), count);
> +       }
> +}
> +
> +static int io_pp_zc_init(struct page_pool *pp)
> +{
> +       struct io_zcrx_ifq *ifq = pp->mp_priv;
> +       struct io_zcrx_area *area = ifq->area;
> +       int ret;
> +
> +       if (!ifq)
> +               return -EINVAL;
> +       if (pp->p.order != 0)
> +               return -EINVAL;
> +       if (!pp->p.napi)
> +               return -EINVAL;
> +
> +       ret = page_pool_mp_init_paged_area(pp, &area->nia, area->pages);
> +       if (ret)
> +               return ret;
> +
> +       percpu_ref_get(&ifq->ctx->refs);
> +       ifq->pp = pp;
> +       return 0;
> +}
> +
> +static void io_pp_zc_destroy(struct page_pool *pp)
> +{
> +       struct io_zcrx_ifq *ifq = pp->mp_priv;
> +       struct io_zcrx_area *area = ifq->area;
> +
> +       page_pool_mp_release_area(pp, &ifq->area->nia);
> +
> +       ifq->pp = NULL;
> +
> +       if (WARN_ON_ONCE(area->free_count != area->nia.num_niovs))
> +               return;
> +       percpu_ref_put(&ifq->ctx->refs);
> +}
> +
> +static const struct memory_provider_ops io_uring_pp_zc_ops = {
> +       .alloc_netmems          = io_pp_zc_alloc_netmems,
> +       .release_netmem         = io_pp_zc_release_netmem,
> +       .init                   = io_pp_zc_init,
> +       .destroy                = io_pp_zc_destroy,
> +       .scrub                  = io_pp_zc_scrub,
> +};
> diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
> index a8db61498c67..464b4bd89b64 100644
> --- a/io_uring/zcrx.h
> +++ b/io_uring/zcrx.h
> @@ -5,6 +5,9 @@
>  #include <linux/io_uring_types.h>
>  #include <net/page_pool/types.h>
>
> +#define IO_ZC_RX_UREF                  0x10000
> +#define IO_ZC_RX_KREF_MASK             (IO_ZC_RX_UREF - 1)
> +
>  struct io_zcrx_area {
>         struct net_iov_area     nia;
>         struct io_zcrx_ifq      *ifq;
> @@ -22,10 +25,12 @@ struct io_zcrx_ifq {
>         struct io_ring_ctx              *ctx;
>         struct net_device               *dev;
>         struct io_zcrx_area             *area;
> +       struct page_pool                *pp;
>
>         struct io_uring                 *rq_ring;
>         struct io_uring_zcrx_rqe        *rqes;
>         u32                             rq_entries;
> +       u32                             cached_rq_head;
>
>         unsigned short                  n_rqe_pages;
>         struct page                     **rqe_pages;
> --
> 2.43.5
>


--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 12/15] io_uring/zcrx: add io_recvzc request
  2024-10-29 23:05 ` [PATCH v7 12/15] io_uring/zcrx: add io_recvzc request David Wei
@ 2024-11-01 20:11   ` Mina Almasry
  2024-11-01 21:17     ` Pavel Begunkov
  0 siblings, 1 reply; 47+ messages in thread
From: Mina Almasry @ 2024-11-01 20:11 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Tue, Oct 29, 2024 at 4:06 PM David Wei <[email protected]> wrote:
>
...
> +static void io_zcrx_get_buf_uref(struct net_iov *niov)
> +{
> +       atomic_long_add(IO_ZC_RX_UREF, &niov->pp_ref_count);
> +}
> +

This is not specific to io_rcrx I think. Please rename this and put it
somewhere generic, like netmem.h.

Then tcp_recvmsg_dmabuf can use the same helper instead of the very
ugly call it currently does:

- atomic_long_inc(&niov->pp_ref_count);
+ net_iov_pp_ref_get(niov, 1);

Or something.

In general I think io_uring code can do whatever it wants with the
io_uring specific bits in net_iov (everything under net_area_owner I
think), but please lets try to keep any code touching the generic
net_iov fields (pp_pagic, pp_ref_count, and others) in generic
helpers.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 13/15] io_uring/zcrx: set pp memory provider for an rx queue
  2024-10-29 23:05 ` [PATCH v7 13/15] io_uring/zcrx: set pp memory provider for an rx queue David Wei
@ 2024-11-01 20:16   ` Mina Almasry
  2024-11-01 20:35     ` Jens Axboe
  0 siblings, 1 reply; 47+ messages in thread
From: Mina Almasry @ 2024-11-01 20:16 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Tue, Oct 29, 2024 at 4:06 PM David Wei <[email protected]> wrote:
>
> From: David Wei <[email protected]>
>
> Set the page pool memory provider for the rx queue configured for zero
> copy to io_uring. Then the rx queue is reset using
> netdev_rx_queue_restart() and netdev core + page pool will take care of
> filling the rx queue from the io_uring zero copy memory provider.
>
> For now, there is only one ifq so its destruction happens implicitly
> during io_uring cleanup.
>
> Signed-off-by: David Wei <[email protected]>
> ---
>  io_uring/zcrx.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++--
>  io_uring/zcrx.h |  2 ++
>  2 files changed, 86 insertions(+), 2 deletions(-)
>
> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
> index 477b0d1b7b91..3f4625730dbd 100644
> --- a/io_uring/zcrx.c
> +++ b/io_uring/zcrx.c
> @@ -8,6 +8,7 @@
>  #include <net/page_pool/helpers.h>
>  #include <net/page_pool/memory_provider.h>
>  #include <trace/events/page_pool.h>
> +#include <net/netdev_rx_queue.h>
>  #include <net/tcp.h>
>  #include <net/rps.h>
>
> @@ -36,6 +37,65 @@ static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *nio
>         return container_of(owner, struct io_zcrx_area, nia);
>  }
>
> +static int io_open_zc_rxq(struct io_zcrx_ifq *ifq, unsigned ifq_idx)
> +{
> +       struct netdev_rx_queue *rxq;
> +       struct net_device *dev = ifq->dev;
> +       int ret;
> +
> +       ASSERT_RTNL();
> +
> +       if (ifq_idx >= dev->num_rx_queues)
> +               return -EINVAL;
> +       ifq_idx = array_index_nospec(ifq_idx, dev->num_rx_queues);
> +
> +       rxq = __netif_get_rx_queue(ifq->dev, ifq_idx);
> +       if (rxq->mp_params.mp_priv)
> +               return -EEXIST;
> +
> +       ifq->if_rxq = ifq_idx;
> +       rxq->mp_params.mp_ops = &io_uring_pp_zc_ops;
> +       rxq->mp_params.mp_priv = ifq;
> +       ret = netdev_rx_queue_restart(ifq->dev, ifq->if_rxq);
> +       if (ret)
> +               goto fail;
> +       return 0;
> +fail:
> +       rxq->mp_params.mp_ops = NULL;
> +       rxq->mp_params.mp_priv = NULL;
> +       ifq->if_rxq = -1;
> +       return ret;
> +}
> +

I don't see a CAP_NET_ADMIN check. Likely I missed it. Is that done
somewhere? Binding user memory to an rx queue needs to be a privileged
operation.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 13/15] io_uring/zcrx: set pp memory provider for an rx queue
  2024-11-01 20:16   ` Mina Almasry
@ 2024-11-01 20:35     ` Jens Axboe
  2024-11-01 21:12       ` Pavel Begunkov
  0 siblings, 1 reply; 47+ messages in thread
From: Jens Axboe @ 2024-11-01 20:35 UTC (permalink / raw)
  To: Mina Almasry, David Wei
  Cc: io-uring, netdev, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Stanislav Fomichev, Joe Damato, Pedro Tammela

On 11/1/24 2:16 PM, Mina Almasry wrote:
> On Tue, Oct 29, 2024 at 4:06?PM David Wei <[email protected]> wrote:
>>
>> From: David Wei <[email protected]>
>>
>> Set the page pool memory provider for the rx queue configured for zero
>> copy to io_uring. Then the rx queue is reset using
>> netdev_rx_queue_restart() and netdev core + page pool will take care of
>> filling the rx queue from the io_uring zero copy memory provider.
>>
>> For now, there is only one ifq so its destruction happens implicitly
>> during io_uring cleanup.
>>
>> Signed-off-by: David Wei <[email protected]>
>> ---
>>  io_uring/zcrx.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++--
>>  io_uring/zcrx.h |  2 ++
>>  2 files changed, 86 insertions(+), 2 deletions(-)
>>
>> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
>> index 477b0d1b7b91..3f4625730dbd 100644
>> --- a/io_uring/zcrx.c
>> +++ b/io_uring/zcrx.c
>> @@ -8,6 +8,7 @@
>>  #include <net/page_pool/helpers.h>
>>  #include <net/page_pool/memory_provider.h>
>>  #include <trace/events/page_pool.h>
>> +#include <net/netdev_rx_queue.h>
>>  #include <net/tcp.h>
>>  #include <net/rps.h>
>>
>> @@ -36,6 +37,65 @@ static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *nio
>>         return container_of(owner, struct io_zcrx_area, nia);
>>  }
>>
>> +static int io_open_zc_rxq(struct io_zcrx_ifq *ifq, unsigned ifq_idx)
>> +{
>> +       struct netdev_rx_queue *rxq;
>> +       struct net_device *dev = ifq->dev;
>> +       int ret;
>> +
>> +       ASSERT_RTNL();
>> +
>> +       if (ifq_idx >= dev->num_rx_queues)
>> +               return -EINVAL;
>> +       ifq_idx = array_index_nospec(ifq_idx, dev->num_rx_queues);
>> +
>> +       rxq = __netif_get_rx_queue(ifq->dev, ifq_idx);
>> +       if (rxq->mp_params.mp_priv)
>> +               return -EEXIST;
>> +
>> +       ifq->if_rxq = ifq_idx;
>> +       rxq->mp_params.mp_ops = &io_uring_pp_zc_ops;
>> +       rxq->mp_params.mp_priv = ifq;
>> +       ret = netdev_rx_queue_restart(ifq->dev, ifq->if_rxq);
>> +       if (ret)
>> +               goto fail;
>> +       return 0;
>> +fail:
>> +       rxq->mp_params.mp_ops = NULL;
>> +       rxq->mp_params.mp_priv = NULL;
>> +       ifq->if_rxq = -1;
>> +       return ret;
>> +}
>> +
> 
> I don't see a CAP_NET_ADMIN check. Likely I missed it. Is that done
> somewhere? Binding user memory to an rx queue needs to be a privileged
> operation.

There's only one caller of this, and it literally has a CAP_NET_ADMIN at
the very top. Patch 9 adds the registration.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
  2024-11-01 20:06   ` Mina Almasry
@ 2024-11-01 21:09     ` Pavel Begunkov
  2024-11-03 21:52       ` Pavel Begunkov
  2024-11-04 19:54       ` Mina Almasry
  0 siblings, 2 replies; 47+ messages in thread
From: Pavel Begunkov @ 2024-11-01 21:09 UTC (permalink / raw)
  To: Mina Almasry, David Wei
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Stanislav Fomichev, Joe Damato, Pedro Tammela

On 11/1/24 20:06, Mina Almasry wrote:
...
>> +__maybe_unused
>> +static const struct memory_provider_ops io_uring_pp_zc_ops;
>> +
>> +static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
>> +{
>> +       struct net_iov_area *owner = net_iov_owner(niov);
>> +
>> +       return container_of(owner, struct io_zcrx_area, nia);
>> +}
>> +
> 
> We discussed this before I disappeared on vacation but I'm not too
> convinced to be honest, sorry.
> 
> It's invalid to call io_zcrx_iov_to_area on a devmem niov and vice
> versa, right? So current and future code has to be very careful to

Yes

> call the right helpers on the right niovs.
> 
> At the very least there needs to be a comment above all these
> container_of helpers:
> 
> /* caller must have verified that this niov is devmem/io_zcrx */.
> 
> However I feel like even a comment is extremely error prone. These
> container_of's are inside of the call stack of some helpers. I would
> say we need a check. If we're concerned about performance, the check
> can be behind DEBUG_NET_WARN_ON(), although even that is a bit iffy,
> but could be fine. Doing this without a check seems too risky to me.

No, it doesn't need a check nor it needs a comment. The very
essence of virtual function tables is that they're coupled
together with objects for which those function make sense and
called only for those objects. The only way to get here with
invalid net_iovs is to take one page pool and feed it with
net_iovs from other another page pool that won't be sane in
the first place.

That would be an equivalent of:

struct file *f1 = ...;
struct file *f2 = ...;

f1->f_op->read(f2, ...);

Maybe it looks strange for you in C, but it's same as putting
comments that a virtual function that it should be called only
for objects of that class:

struct A {
	virtual void foo() = 0;
};
struct B: public A {
	void foo() override {
		// we should only be called with objects of type
		// struct B (or anything inheriting it), check that
		if (!reinterpret_cast<struct B*>(this))
			throw;
		...
	}
}

>>   static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
>>                                   struct io_uring_zcrx_ifq_reg *reg)
>>   {
>> @@ -99,6 +114,9 @@ static int io_zcrx_create_area(struct io_ring_ctx *ctx,
>>                  goto err;
>>
>>          for (i = 0; i < nr_pages; i++) {
>> +               struct net_iov *niov = &area->nia.niovs[i];
>> +
>> +               niov->owner = &area->nia;
>>                  area->freelist[i] = i;
>>          }
>>
>> @@ -230,3 +248,200 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
>>   {
>>          lockdep_assert_held(&ctx->uring_lock);
>>   }
>> +
>> +static bool io_zcrx_niov_put(struct net_iov *niov, int nr)
>> +{
>> +       return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
>> +}
>> +
>> +static bool io_zcrx_put_niov_uref(struct net_iov *niov)
>> +{
>> +       if (atomic_long_read(&niov->pp_ref_count) < IO_ZC_RX_UREF)
>> +               return false;
>> +
>> +       return io_zcrx_niov_put(niov, IO_ZC_RX_UREF);
>> +}
>> +
> 
> Sorry, I have to push back a bit against this. The refcounting of
> netmem is already complicated. the paged netmem has 2 refcounts and
> care needs to be taken when acquiring and dropping refcounts. net_iov
> inherited the pp_ref_count but not the paged refcount, and again need
> some special handling. skb_frag_unref takes very special care checking

Which is why it's using net_iovs.

> pp->recycle, is_pp_netmem, and others to figure out the correct

pp->recycle has nothing to do with the series. We don't add
it in any special way, and if it's broken it's broken even
for non-proivder buffers.

> refcount to put based on the type of the netmem and skb flag.

Just same as with the ->[un]readable flag, which is not
functionally needed, and if it's screwed many things can
go very wrong.

> This code ignores all these generic code
> skb_frag_unref/napi_pp_put_page/etc paths and uses raw access to

I don't see the point, they are not used because they're not
needed. Instead of checking whether it came from a page pool
and whether it's net_iov or not, in the path io_uring returns
it we already apriori know that they're from a specific page
pool, net_iov and from the current provider.

Same for optimisations provided by those helpers, they are
useful when you're transferring buffers from one context to
another, e.g. task recieve path -> napi / page_pool. In this
case they're already fetched in the right context without any
need to additionally jumping through the hoops. If anything,
it'd be odd to jump out of a window to climb a rope on the
other side of the building when you could've just walked 5
meters to the other room.

> niv->pp_ref_count. If this is merged as-is, for posterity any changes

Ok, let's add a helper then

> in netmem refcounting need to also account for this use case opting
> out of these generic code paths that handle all other skb reffing
> including devmem.
> 
> Additionally since you're opting out of the generic unreffing paths
> you're also (as mentioned before) bypassing the pp recycling. AFAICT
> that may be hurting your performance. IIUC you refill
> PP_ALLOC_CACHE_REFILL (64) entries everytime _alloc_netmems is
> entered, and you don't recycle netmem any other way, so your slow path
> is entered 1/64 of the page_pool_alloc calls? That should be much

One virtual call per 64 buffers gets enough of ammortisation. The
cache size can be extended if necessary.

> worse than what the normal pp recycling does, which returns all freed
> netmem into its alloc.cache or the ptr_ring and hits *_alloc_netmems

You return it from a syscall (a special sockopt), I'm pretty sure
overhead of just that syscall without any handling would be more
expensive than one virtual function call. Then you need to hit the
fast cache, and it's not unconditional, it has to be lucky enough
so that napi is not run or scheduled, and even then it has to
be very careful to avoid races. That's the best case for <64 entries
recycling, otherwise it's ptr_ring and spinlocks.

Note, the normal (non-zc) recycling happens in the receive
syscall, but it's not the normal path, and just like devmem we
have to give the buffer to the user and wait until it's returned
back.

> much more rarely. There are also regular perf improvements and testing
> to the generic pool recycling paths you're also opting out of.

For performance, see above. As for testing, tests come after code
functionality, not the other way around. Why we're even adding any
zero copy and interface when it could be old good and well tested
non-zerocopy recv(2)

> I see a lot of downsides to opting out of the generic use cases. Is
> there any reason the normal freeing paths are not applicable to your
> use case?
> 
>> +static inline void io_zc_add_pp_cache(struct page_pool *pp,
>> +                                     struct net_iov *niov)
>> +{
>> +}
>> +
> 
> Looks unused/empty.

Indeed, slipped through.
...

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 13/15] io_uring/zcrx: set pp memory provider for an rx queue
  2024-11-01 20:35     ` Jens Axboe
@ 2024-11-01 21:12       ` Pavel Begunkov
  0 siblings, 0 replies; 47+ messages in thread
From: Pavel Begunkov @ 2024-11-01 21:12 UTC (permalink / raw)
  To: Jens Axboe, Mina Almasry, David Wei
  Cc: io-uring, netdev, Jakub Kicinski, Paolo Abeni, David S. Miller,
	Eric Dumazet, Jesper Dangaard Brouer, David Ahern,
	Stanislav Fomichev, Joe Damato, Pedro Tammela

On 11/1/24 20:35, Jens Axboe wrote:
> On 11/1/24 2:16 PM, Mina Almasry wrote:
>> On Tue, Oct 29, 2024 at 4:06?PM David Wei <[email protected]> wrote:
>>>
>>> From: David Wei <[email protected]>
>>>
>>> Set the page pool memory provider for the rx queue configured for zero
>>> copy to io_uring. Then the rx queue is reset using
>>> netdev_rx_queue_restart() and netdev core + page pool will take care of
>>> filling the rx queue from the io_uring zero copy memory provider.
>>>
>>> For now, there is only one ifq so its destruction happens implicitly
>>> during io_uring cleanup.
>>>
>>> Signed-off-by: David Wei <[email protected]>
>>> ---
>>>   io_uring/zcrx.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++--
>>>   io_uring/zcrx.h |  2 ++
>>>   2 files changed, 86 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
>>> index 477b0d1b7b91..3f4625730dbd 100644
>>> --- a/io_uring/zcrx.c
>>> +++ b/io_uring/zcrx.c
>>> @@ -8,6 +8,7 @@
>>>   #include <net/page_pool/helpers.h>
>>>   #include <net/page_pool/memory_provider.h>
>>>   #include <trace/events/page_pool.h>
>>> +#include <net/netdev_rx_queue.h>
>>>   #include <net/tcp.h>
>>>   #include <net/rps.h>
>>>
>>> @@ -36,6 +37,65 @@ static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *nio
>>>          return container_of(owner, struct io_zcrx_area, nia);
>>>   }
>>>
>>> +static int io_open_zc_rxq(struct io_zcrx_ifq *ifq, unsigned ifq_idx)
>>> +{
>>> +       struct netdev_rx_queue *rxq;
>>> +       struct net_device *dev = ifq->dev;
>>> +       int ret;
>>> +
>>> +       ASSERT_RTNL();
>>> +
>>> +       if (ifq_idx >= dev->num_rx_queues)
>>> +               return -EINVAL;
>>> +       ifq_idx = array_index_nospec(ifq_idx, dev->num_rx_queues);
>>> +
>>> +       rxq = __netif_get_rx_queue(ifq->dev, ifq_idx);
>>> +       if (rxq->mp_params.mp_priv)
>>> +               return -EEXIST;
>>> +
>>> +       ifq->if_rxq = ifq_idx;
>>> +       rxq->mp_params.mp_ops = &io_uring_pp_zc_ops;
>>> +       rxq->mp_params.mp_priv = ifq;
>>> +       ret = netdev_rx_queue_restart(ifq->dev, ifq->if_rxq);
>>> +       if (ret)
>>> +               goto fail;
>>> +       return 0;
>>> +fail:
>>> +       rxq->mp_params.mp_ops = NULL;
>>> +       rxq->mp_params.mp_priv = NULL;
>>> +       ifq->if_rxq = -1;
>>> +       return ret;
>>> +}
>>> +
>>
>> I don't see a CAP_NET_ADMIN check. Likely I missed it. Is that done
>> somewhere? Binding user memory to an rx queue needs to be a privileged
>> operation.
> 
> There's only one caller of this, and it literally has a CAP_NET_ADMIN at
> the very top. Patch 9 adds the registration.

Right, Patch 9/15, checked very early before creating any objects.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 12/15] io_uring/zcrx: add io_recvzc request
  2024-11-01 20:11   ` Mina Almasry
@ 2024-11-01 21:17     ` Pavel Begunkov
  2024-11-05 23:09       ` Mina Almasry
  0 siblings, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-11-01 21:17 UTC (permalink / raw)
  To: Mina Almasry, David Wei
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Stanislav Fomichev, Joe Damato, Pedro Tammela

On 11/1/24 20:11, Mina Almasry wrote:
> On Tue, Oct 29, 2024 at 4:06 PM David Wei <[email protected]> wrote:
>>
> ...
>> +static void io_zcrx_get_buf_uref(struct net_iov *niov)
>> +{
>> +       atomic_long_add(IO_ZC_RX_UREF, &niov->pp_ref_count);
>> +}
>> +
> 
> This is not specific to io_rcrx I think. Please rename this and put it
> somewhere generic, like netmem.h.
> 
> Then tcp_recvmsg_dmabuf can use the same helper instead of the very
> ugly call it currently does:
> 
> - atomic_long_inc(&niov->pp_ref_count);
> + net_iov_pp_ref_get(niov, 1);
> 
> Or something.
> 
> In general I think io_uring code can do whatever it wants with the
> io_uring specific bits in net_iov (everything under net_area_owner I
> think), but please lets try to keep any code touching the generic
> net_iov fields (pp_pagic, pp_ref_count, and others) in generic
> helpers.

I'm getting confused, io_uring shouldn't be touching these
fields, but on the other hand should export net/ private
netmem_priv.h and page_pool_priv.h and directly hard code a bunch
of low level setup io_uring that is currently in page_pool.c

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
  2024-11-01 21:09     ` Pavel Begunkov
@ 2024-11-03 21:52       ` Pavel Begunkov
  2024-11-04 19:54       ` Mina Almasry
  1 sibling, 0 replies; 47+ messages in thread
From: Pavel Begunkov @ 2024-11-03 21:52 UTC (permalink / raw)
  To: Mina Almasry, David Wei
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Stanislav Fomichev, Joe Damato, Pedro Tammela

On 11/1/24 21:09, Pavel Begunkov wrote:
> On 11/1/24 20:06, Mina Almasry wrote:
> ...
>>> +__maybe_unused
>>> +static const struct memory_provider_ops io_uring_pp_zc_ops;
>>> +
>>> +static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
>>> +{
>>> +       struct net_iov_area *owner = net_iov_owner(niov);
>>> +
>>> +       return container_of(owner, struct io_zcrx_area, nia);
>>> +}
>>> +
>>
>> We discussed this before I disappeared on vacation but I'm not too
>> convinced to be honest, sorry.

To expand on this one, a few weeks ago I outlined how you can employ
the compiler and verify correctness, and I don't really see a way to
convince you unless you're willing to check your claim that it can
go wrong. Turning it the other way around, if you see a path where it
could go wrong, please do let me know, and it'll certainly get fixed,
but until then I don't believe it's anyhow a blocker for the series.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
  2024-11-01 21:09     ` Pavel Begunkov
  2024-11-03 21:52       ` Pavel Begunkov
@ 2024-11-04 19:54       ` Mina Almasry
  2024-11-04 21:14         ` Pavel Begunkov
  2024-11-11 21:15         ` David Wei
  1 sibling, 2 replies; 47+ messages in thread
From: Mina Almasry @ 2024-11-04 19:54 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Fri, Nov 1, 2024 at 2:09 PM Pavel Begunkov <[email protected]> wrote:
>
> On 11/1/24 20:06, Mina Almasry wrote:
> ...
> >> +__maybe_unused
> >> +static const struct memory_provider_ops io_uring_pp_zc_ops;
> >> +
> >> +static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
> >> +{
> >> +       struct net_iov_area *owner = net_iov_owner(niov);
> >> +
> >> +       return container_of(owner, struct io_zcrx_area, nia);
> >> +}
> >> +
> >
> > We discussed this before I disappeared on vacation but I'm not too
> > convinced to be honest, sorry.
> >
> > It's invalid to call io_zcrx_iov_to_area on a devmem niov and vice
> > versa, right? So current and future code has to be very careful to
>
> Yes
>
> > call the right helpers on the right niovs.
> >
> > At the very least there needs to be a comment above all these
> > container_of helpers:
> >
> > /* caller must have verified that this niov is devmem/io_zcrx */.
> >
> > However I feel like even a comment is extremely error prone. These
> > container_of's are inside of the call stack of some helpers. I would
> > say we need a check. If we're concerned about performance, the check
> > can be behind DEBUG_NET_WARN_ON(), although even that is a bit iffy,
> > but could be fine. Doing this without a check seems too risky to me.
>
> No, it doesn't need a check nor it needs a comment. The very
> essence of virtual function tables is that they're coupled
> together with objects for which those function make sense and
> called only for those objects. The only way to get here with
> invalid net_iovs is to take one page pool and feed it with
> net_iovs from other another page pool that won't be sane in
> the first place.
>

That could happen. In fact the whole devmem tcp paths are very
carefully written to handle that

net_iovs are allocated from the page_pool, put in skbs, and then sit
in the sk receive queue. In pathological cases (user is
re/misconfiguring flow steering) we can have 1 sk receive queue that
has a mix of page skbs, devmem skbs, and io_uring skbs, and other
skbs.

Code that is processing the skbs in the receive queue has no idea
whether what kind of skb it is. That's why that code needs to check
whether the skb has readable frags, and that's why in this very series
you needed to add a check in tcp_recvmsg_dmabuf to make sure that its
a dmabuf skb, and you need to add a check to io_zcrx_recv_frag that
the frag inside it is io_uring niov. The code would be wrong without
it.

All I'm trying to say is that it's very error prone to rely on folks
writing and reviewing code to check that whenever dmabuf/io_rcrx/etc
handling is done, somewhere in the call stack a type verification
check has been made, and a DEBUG_NET_WARN could help avoid some subtle
memory corruption bugs.

> That would be an equivalent of:
>
> struct file *f1 = ...;
> struct file *f2 = ...;
>
> f1->f_op->read(f2, ...);
>
> Maybe it looks strange for you in C, but it's same as putting
> comments that a virtual function that it should be called only
> for objects of that class:
>
> struct A {
>         virtual void foo() = 0;
> };
> struct B: public A {
>         void foo() override {
>                 // we should only be called with objects of type
>                 // struct B (or anything inheriting it), check that
>                 if (!reinterpret_cast<struct B*>(this))
>                         throw;
>                 ...
>         }
> }
>
>

I'm not really sure I followed here. We do not get any type of
compiler or type safety from this code because the dma-buf niovs and
io_uring niovs are the same net_iov type.

We can get type safety by defining new types for dmabuf_net_iov and
io_uring_net_iov, then provide helpers:

dmabuf_net_iov *net_iov_to_dmabuf();
io_uring_net_iov *net_iov_to_io_uring();

The helpers can check the niov is of the right type once and do a
cast,  then the object with the specific type can be passed to all
future heplers without additional checks. This is one way to do it I
guess.

> >>   static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
> >>                                   struct io_uring_zcrx_ifq_reg *reg)
> >>   {
> >> @@ -99,6 +114,9 @@ static int io_zcrx_create_area(struct io_ring_ctx *ctx,
> >>                  goto err;
> >>
> >>          for (i = 0; i < nr_pages; i++) {
> >> +               struct net_iov *niov = &area->nia.niovs[i];
> >> +
> >> +               niov->owner = &area->nia;
> >>                  area->freelist[i] = i;
> >>          }
> >>
> >> @@ -230,3 +248,200 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
> >>   {
> >>          lockdep_assert_held(&ctx->uring_lock);
> >>   }
> >> +
> >> +static bool io_zcrx_niov_put(struct net_iov *niov, int nr)
> >> +{
> >> +       return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
> >> +}
> >> +
> >> +static bool io_zcrx_put_niov_uref(struct net_iov *niov)
> >> +{
> >> +       if (atomic_long_read(&niov->pp_ref_count) < IO_ZC_RX_UREF)
> >> +               return false;
> >> +
> >> +       return io_zcrx_niov_put(niov, IO_ZC_RX_UREF);
> >> +}
> >> +
> >
> > Sorry, I have to push back a bit against this. The refcounting of
> > netmem is already complicated. the paged netmem has 2 refcounts and
> > care needs to be taken when acquiring and dropping refcounts. net_iov
> > inherited the pp_ref_count but not the paged refcount, and again need
> > some special handling. skb_frag_unref takes very special care checking
>
> Which is why it's using net_iovs.
>
> > pp->recycle, is_pp_netmem, and others to figure out the correct
>
> pp->recycle has nothing to do with the series. We don't add
> it in any special way, and if it's broken it's broken even
> for non-proivder buffers.
>
> > refcount to put based on the type of the netmem and skb flag.
>
> Just same as with the ->[un]readable flag, which is not
> functionally needed, and if it's screwed many things can
> go very wrong.
>
> > This code ignores all these generic code
> > skb_frag_unref/napi_pp_put_page/etc paths and uses raw access to
>
> I don't see the point, they are not used because they're not
> needed. Instead of checking whether it came from a page pool
> and whether it's net_iov or not, in the path io_uring returns
> it we already apriori know that they're from a specific page
> pool, net_iov and from the current provider.
>
> Same for optimisations provided by those helpers, they are
> useful when you're transferring buffers from one context to
> another, e.g. task recieve path -> napi / page_pool. In this
> case they're already fetched in the right context without any
> need to additionally jumping through the hoops. If anything,
> it'd be odd to jump out of a window to climb a rope on the
> other side of the building when you could've just walked 5
> meters to the other room.
>

For me, "they are not used because they're not needed." is not enough
justification to ignore the generic code paths that support generic
use cases and add your own freeing path and recycling that needs to
work adjacent to generic paths for posterity. You need to provide
concrete reasons why the current code paths don't work for you and
can't be made to work for you.

Is it very complicated to napi_pp_put_page() niovs as the user puts
them in the refill queue without adding a new syscall? If so, is it
possible to do a niov equivalent of page_pool_put_page_bulk() of the
refill queue while/as you process the RX path?

If you've tested the generic code paths to be performance deficient
and your recycling is indeed better, you could improve the page_pool
to pull netmems when it needs to like you're doing here, but in a
generic way that applies to the page allocator and other providers.
Not a one-off implementation that only applies to your provider.

If you're absolutely set on ignoring the currently supported reffing
and implementing your own reffing and recycling for your use case,
sure, that could work, but please don't overload the
niov->pp_ref_count reserved for the generic use cases for this. Add
io_zcrx_area->io_uring_ref or something and do whatever you want with
it. Since it's not sharing the pp_ref_count with the generic code
paths I don't see them conflicting in the future.

--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 04/15] net: prepare for non devmem TCP memory providers
  2024-11-01 17:41     ` Pavel Begunkov
@ 2024-11-04 20:20       ` Mina Almasry
  2024-11-04 21:24         ` Pavel Begunkov
  2024-11-11 19:01         ` David Wei
  0 siblings, 2 replies; 47+ messages in thread
From: Mina Almasry @ 2024-11-04 20:20 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Fri, Nov 1, 2024 at 10:41 AM Pavel Begunkov <[email protected]> wrote:
> ...
> >> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >> index e928efc22f80..31e01da61c12 100644
> >> --- a/net/ipv4/tcp.c
> >> +++ b/net/ipv4/tcp.c
> >> @@ -277,6 +277,7 @@
> >>   #include <net/ip.h>
> >>   #include <net/sock.h>
> >>   #include <net/rstreason.h>
> >> +#include <net/page_pool/types.h>
> >>
> >>   #include <linux/uaccess.h>
> >>   #include <asm/ioctls.h>
> >> @@ -2476,6 +2477,11 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> >>                          }
> >>
> >>                          niov = skb_frag_net_iov(frag);
> >> +                       if (net_is_devmem_page_pool_ops(niov->pp->mp_ops)) {
> >> +                               err = -ENODEV;
> >> +                               goto out;
> >> +                       }
> >> +
> >
> > I think this check needs to go in the caller. Currently the caller
> > assumes that if !skb_frags_readable(), then the frag is dma-buf, and
>
> io_uring originated netmem that are marked unreadable as well
> and so will end up in tcp_recvmsg_dmabuf(), then we reject and
> fail since they should not be fed to devmem TCP. It should be
> fine from correctness perspective.
>
> We need to check frags, and that's the place where we iterate
> frags. Another option is to add a loop in tcp_recvmsg_locked
> walking over all frags of an skb and doing the checks, but
> that's an unnecessary performance burden to devmem.
>

Checking each frag in tcp_recvmsg_dmabuf (and the equivalent io_uring
function) is not ideal really. Especially when you're dereferencing
nio->pp to do the check which IIUC will pull a cache line not normally
needed in this code path and may have a performance impact.

We currently have a check in __skb_fill_netmem_desc() that makes sure
all frags added to an skb are pages or dmabuf. I think we need to
improve it to make sure all frags added to an skb are of the same type
(pages, dmabuf, iouring). sending it to skb_copy_datagram_msg or
tcp_recvmsg_dmabuf or error.

I also I'm not sure dereferencing ->pp to check the frag type is ever
OK in such a fast path when ->pp is not usually needed until the skb
is freed? You may have to add a flag to the niov to indicate what type
it is, or change the skb->unreadable flag to a u8 that determines if
it's pages/io_uring/dmabuf.


-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
  2024-11-04 19:54       ` Mina Almasry
@ 2024-11-04 21:14         ` Pavel Begunkov
  2024-11-04 21:53           ` Pavel Begunkov
  2024-11-11 21:15         ` David Wei
  1 sibling, 1 reply; 47+ messages in thread
From: Pavel Begunkov @ 2024-11-04 21:14 UTC (permalink / raw)
  To: Mina Almasry
  Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On 11/4/24 19:54, Mina Almasry wrote:
> On Fri, Nov 1, 2024 at 2:09 PM Pavel Begunkov <[email protected]> wrote:
...
>>> However I feel like even a comment is extremely error prone. These
>>> container_of's are inside of the call stack of some helpers. I would
>>> say we need a check. If we're concerned about performance, the check
>>> can be behind DEBUG_NET_WARN_ON(), although even that is a bit iffy,
>>> but could be fine. Doing this without a check seems too risky to me.
>>
>> No, it doesn't need a check nor it needs a comment. The very
>> essence of virtual function tables is that they're coupled
>> together with objects for which those function make sense and
>> called only for those objects. The only way to get here with
>> invalid net_iovs is to take one page pool and feed it with
>> net_iovs from other another page pool that won't be sane in
>> the first place.
>>
> 
> That could happen. In fact the whole devmem tcp paths are very
> carefully written to handle that

What could happen? Calling ops of one page pool with net iovs
of another? Right, you can force yourself to write it this way,
but it's not sane code.

> net_iovs are allocated from the page_pool, put in skbs, and then sit
> in the sk receive queue. In pathological cases (user is
> re/misconfiguring flow steering) we can have 1 sk receive queue that
> has a mix of page skbs, devmem skbs, and io_uring skbs, and other
> skbs.
> 
> Code that is processing the skbs in the receive queue has no idea
> whether what kind of skb it is. That's why that code needs to check
> whether the skb has readable frags, and that's why in this very series
> you needed to add a check in tcp_recvmsg_dmabuf to make sure that its
> a dmabuf skb, and you need to add a check to io_zcrx_recv_frag that
> the frag inside it is io_uring niov. The code would be wrong without
> it.

Right, it's expanded to support multiple possible types instead of
"it's a devmem TCP thing and nothing else can ever use it". And it's
not new, devmem forks off the generic path, it just does it based on
skb->readable, which is no more than an optimisation and could've
been on the type of the buffer, e.g. is_net_iov(netmem).

> All I'm trying to say is that it's very error prone to rely on folks

It's really not, especially comparing to lots of other bits that
are much easier to screw up, skb->readable would be a stark
example, which we did catch failing many times.

> writing and reviewing code to check that whenever dmabuf/io_rcrx/etc
> handling is done, somewhere in the call stack a type verification
> check has been made, and a DEBUG_NET_WARN could help avoid some subtle
> memory corruption bugs.
> 
>> That would be an equivalent of:
>>
>> struct file *f1 = ...;
>> struct file *f2 = ...;
>>
>> f1->f_op->read(f2, ...);
>>
>> Maybe it looks strange for you in C, but it's same as putting
>> comments that a virtual function that it should be called only
>> for objects of that class:
>>
>> struct A {
>>          virtual void foo() = 0;
>> };
>> struct B: public A {
>>          void foo() override {
>>                  // we should only be called with objects of type
>>                  // struct B (or anything inheriting it), check that
>>                  if (!reinterpret_cast<struct B*>(this))
>>                          throw;
>>                  ...
>>          }
>> }
>>
>>
> 
> I'm not really sure I followed here. We do not get any type of
> compiler or type safety from this code because the dma-buf niovs and
> io_uring niovs are the same net_iov type.

Right, because it's C, but the basic idea of virtual dispatch
is in there.

> We can get type safety by defining new types for dmabuf_net_iov and
> io_uring_net_iov, then provide helpers:
> 
> dmabuf_net_iov *net_iov_to_dmabuf();
> io_uring_net_iov *net_iov_to_io_uring();

Directly aliasing it to parts of struct page doesn't leave
much space to extending types. The only option is for all
those types to have exactly same layout, but then there is
no much point in doing so.

> The helpers can check the niov is of the right type once and do a
> cast,  then the object with the specific type can be passed to all
> future heplers without additional checks. This is one way to do it I
> guess.
> 
...
>> Same for optimisations provided by those helpers, they are
>> useful when you're transferring buffers from one context to
>> another, e.g. task recieve path -> napi / page_pool. In this
>> case they're already fetched in the right context without any
>> need to additionally jumping through the hoops. If anything,
>> it'd be odd to jump out of a window to climb a rope on the
>> other side of the building when you could've just walked 5
>> meters to the other room.
>>
> 
> For me, "they are not used because they're not needed." is not enough
> justification to ignore the generic code paths that support generic
> use cases and add your own freeing path and recycling that needs to
> work adjacent to generic paths for posterity. You need to provide
> concrete reasons why the current code paths don't work for you and
> can't be made to work for you.

No, it more than justifies it, it's neither needed nor makes sense for
the chosen API, and the API is chosen so that it avoids those extra
steps of crossing contexts an extra time.

What you're saying is that It should work in a less efficient way and
(perhaps arguably) be less convenient to the user as it now needs to
care about batching, because that's how devmem TCP does it. It's not
really a good argument.

Let me give you a devmem TCP example of what you're saying. Why can't
you use the generic (copy) TCP path for devmem TCP? It's well
tested. The reason that it's about zero copy and copying adds... hmm...
a "copy" doesn't justify avoiding the generic path.

> Is it very complicated to napi_pp_put_page() niovs as the user puts
> them in the refill queue without adding a new syscall? If so, is it
> possible to do a niov equivalent of page_pool_put_page_bulk() of the
> refill queue while/as you process the RX path?

That adds an extra jump from one context to another for no apparent
reason just as mentioned above.

> If you've tested the generic code paths to be performance deficient
> and your recycling is indeed better, you could improve the page_pool
> to pull netmems when it needs to like you're doing here, but in a
> generic way that applies to the page allocator and other providers.
> Not a one-off implementation that only applies to your provider.

If I read it right, you're saying you need to improve devmem TCP
instead of doing an io_uring API, just like you indirectly declared
in the very beginning a couple of weeks ago. Again, if you're
against having an io_uring user API in general or against some
particular aspects of the API, then please state it clearly. If not,
I can leave the idea to you to entertain once it's merged.

> If you're absolutely set on ignoring the currently supported reffing
> and implementing your own reffing and recycling for your use case,
> sure, that could work, but please don't overload the
> niov->pp_ref_count reserved for the generic use cases for this. Add
> io_zcrx_area->io_uring_ref or something and do whatever you want with
> it. Since it's not sharing the pp_ref_count with the generic code
> paths I don't see them conflicting in the future.

That would be a performance problem, I don't believe they can't
live together.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 04/15] net: prepare for non devmem TCP memory providers
  2024-11-04 20:20       ` Mina Almasry
@ 2024-11-04 21:24         ` Pavel Begunkov
  2024-11-11 19:01         ` David Wei
  1 sibling, 0 replies; 47+ messages in thread
From: Pavel Begunkov @ 2024-11-04 21:24 UTC (permalink / raw)
  To: Mina Almasry
  Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On 11/4/24 20:20, Mina Almasry wrote:
> On Fri, Nov 1, 2024 at 10:41 AM Pavel Begunkov <[email protected]> wrote:
...
>> io_uring originated netmem that are marked unreadable as well
>> and so will end up in tcp_recvmsg_dmabuf(), then we reject and
>> fail since they should not be fed to devmem TCP. It should be
>> fine from correctness perspective.
>>
>> We need to check frags, and that's the place where we iterate
>> frags. Another option is to add a loop in tcp_recvmsg_locked
>> walking over all frags of an skb and doing the checks, but
>> that's an unnecessary performance burden to devmem.
>>
> 
> Checking each frag in tcp_recvmsg_dmabuf (and the equivalent io_uring
> function) is not ideal really. Especially when you're dereferencing
> nio->pp to do the check which IIUC will pull a cache line not normally
> needed in this code path and may have a performance impact.

Even if it's cold, all net_iov processed are expected to come
from the same page pool and would be cached. And it should be of
a comparable hotness as dmabuf_genpool_chunk_owner accesses below,
considering that there is enough of processing happening around,
it should be of a worry and can be improve upon later.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
  2024-11-04 21:14         ` Pavel Begunkov
@ 2024-11-04 21:53           ` Pavel Begunkov
  0 siblings, 0 replies; 47+ messages in thread
From: Pavel Begunkov @ 2024-11-04 21:53 UTC (permalink / raw)
  To: Mina Almasry
  Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On 11/4/24 21:14, Pavel Begunkov wrote:
> On 11/4/24 19:54, Mina Almasry wrote:
>> On Fri, Nov 1, 2024 at 2:09 PM Pavel Begunkov <[email protected]> wrote:
...
>> If you've tested the generic code paths to be performance deficient
>> and your recycling is indeed better, you could improve the page_pool
>> to pull netmems when it needs to like you're doing here, but in a
>> generic way that applies to the page allocator and other providers.
>> Not a one-off implementation that only applies to your provider.
> 
> If I read it right, you're saying you need to improve devmem TCP
> instead of doing an io_uring API, just like you indirectly declared
> in the very beginning a couple of weeks ago. Again, if you're
> against having an io_uring user API in general or against some
> particular aspects of the API, then please state it clearly. If not,
> I can leave the idea to you to entertain once it's merged.

On top of it, that wouldn't make sense for the normal page pool path,
it already pushes pages via a ring (ptr_ring + caches) from one
context to another. The difference is that buffers with these zero
copy interfaces make an extra stop in the user space, from where
we directly push into the page pool, just like you can directly push
via ptr_ring when you're already in the kernel, even though it
requires more logic to handle untrusted user space.

It only makes for zero copy providers, and I do remember you was
suggested to take the same approach, I think it was Stan, but since
it didn't materialise I assume it's not of interest to devmem TCP.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 03/15] net: page_pool: create hooks for custom page providers
  2024-10-29 23:05 ` [PATCH v7 03/15] net: page_pool: create hooks for custom page providers David Wei
@ 2024-11-05 16:28   ` Alexander Lobakin
  2024-11-06  1:08     ` Pavel Begunkov
  0 siblings, 1 reply; 47+ messages in thread
From: Alexander Lobakin @ 2024-11-05 16:28 UTC (permalink / raw)
  To: David Wei
  Cc: io-uring, netdev, Jens Axboe, Pavel Begunkov, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Mina Almasry,
	Stanislav Fomichev, Joe Damato, Pedro Tammela

From: David Wei <[email protected]>
Date: Tue, 29 Oct 2024 16:05:06 -0700

> From: Jakub Kicinski <[email protected]>
> 
> The page providers which try to reuse the same pages will
> need to hold onto the ref, even if page gets released from
> the pool - as in releasing the page from the pp just transfers
> the "ownership" reference from pp to the provider, and provider
> will wait for other references to be gone before feeding this
> page back into the pool.
> 
> Signed-off-by: Jakub Kicinski <[email protected]>
> [Pavel] Rebased, renamed callback, +converted devmem
> Signed-off-by: Pavel Begunkov <[email protected]>
> Signed-off-by: David Wei <[email protected]>
> ---
>  include/net/page_pool/types.h |  9 +++++++++
>  net/core/devmem.c             | 14 +++++++++++++-
>  net/core/page_pool.c          | 17 +++++++++--------
>  3 files changed, 31 insertions(+), 9 deletions(-)
> 
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index c022c410abe3..8a35fe474adb 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -152,8 +152,16 @@ struct page_pool_stats {
>   */
>  #define PAGE_POOL_FRAG_GROUP_ALIGN	(4 * sizeof(long))
>  
> +struct memory_provider_ops {
> +	netmem_ref (*alloc_netmems)(struct page_pool *pool, gfp_t gfp);
> +	bool (*release_netmem)(struct page_pool *pool, netmem_ref netmem);
> +	int (*init)(struct page_pool *pool);
> +	void (*destroy)(struct page_pool *pool);
> +};
> +
>  struct pp_memory_provider_params {
>  	void *mp_priv;
> +	const struct memory_provider_ops *mp_ops;
>  };
>  
>  struct page_pool {
> @@ -215,6 +223,7 @@ struct page_pool {
>  	struct ptr_ring ring;
>  
>  	void *mp_priv;
> +	const struct memory_provider_ops *mp_ops;
>  
>  #ifdef CONFIG_PAGE_POOL_STATS
>  	/* recycle stats are per-cpu to avoid locking */
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 5c10cf0e2a18..01738029e35c 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -26,6 +26,8 @@
>  /* Protected by rtnl_lock() */
>  static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1);
>  
> +static const struct memory_provider_ops dmabuf_devmem_ops;
> +
>  static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool,
>  					       struct gen_pool_chunk *chunk,
>  					       void *not_used)
> @@ -117,6 +119,7 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
>  		WARN_ON(rxq->mp_params.mp_priv != binding);
>  
>  		rxq->mp_params.mp_priv = NULL;
> +		rxq->mp_params.mp_ops = NULL;
>  
>  		rxq_idx = get_netdev_rx_queue_index(rxq);
>  
> @@ -142,7 +145,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
>  	}
>  
>  	rxq = __netif_get_rx_queue(dev, rxq_idx);
> -	if (rxq->mp_params.mp_priv) {
> +	if (rxq->mp_params.mp_ops) {
>  		NL_SET_ERR_MSG(extack, "designated queue already memory provider bound");
>  		return -EEXIST;
>  	}
> @@ -160,6 +163,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
>  		return err;
>  
>  	rxq->mp_params.mp_priv = binding;
> +	rxq->mp_params.mp_ops = &dmabuf_devmem_ops;
>  
>  	err = netdev_rx_queue_restart(dev, rxq_idx);
>  	if (err)
> @@ -169,6 +173,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
>  
>  err_xa_erase:
>  	rxq->mp_params.mp_priv = NULL;
> +	rxq->mp_params.mp_ops = NULL;
>  	xa_erase(&binding->bound_rxqs, xa_idx);
>  
>  	return err;
> @@ -388,3 +393,10 @@ bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem)
>  	/* We don't want the page pool put_page()ing our net_iovs. */
>  	return false;
>  }
> +
> +static const struct memory_provider_ops dmabuf_devmem_ops = {
> +	.init			= mp_dmabuf_devmem_init,
> +	.destroy		= mp_dmabuf_devmem_destroy,
> +	.alloc_netmems		= mp_dmabuf_devmem_alloc_netmems,
> +	.release_netmem		= mp_dmabuf_devmem_release_page,
> +};
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a813d30d2135..c21c5b9edc68 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -284,10 +284,11 @@ static int page_pool_init(struct page_pool *pool,
>  		rxq = __netif_get_rx_queue(pool->slow.netdev,
>  					   pool->slow.queue_idx);
>  		pool->mp_priv = rxq->mp_params.mp_priv;
> +		pool->mp_ops = rxq->mp_params.mp_ops;
>  	}
>  
> -	if (pool->mp_priv) {
> -		err = mp_dmabuf_devmem_init(pool);
> +	if (pool->mp_ops) {
> +		err = pool->mp_ops->init(pool);

Can't we just switch-case instead of indirect calls?
IO_URING is bool, it can't be a module, which means its functions will
be available here when it's enabled. These ops are easy to predict (no
ops, dmabuf, io_uring), so this really looks like an enum with 3 entries
+ switch-case ("regular" path is out if this switch-case under likely etc).

>  		if (err) {
>  			pr_warn("%s() mem-provider init failed %d\n", __func__,
>  				err);
> @@ -584,8 +585,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp)
>  		return netmem;

Thanks,
Olek

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 12/15] io_uring/zcrx: add io_recvzc request
  2024-11-01 21:17     ` Pavel Begunkov
@ 2024-11-05 23:09       ` Mina Almasry
  2024-11-11 22:02         ` David Wei
  0 siblings, 1 reply; 47+ messages in thread
From: Mina Almasry @ 2024-11-05 23:09 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Fri, Nov 1, 2024 at 2:16 PM Pavel Begunkov <[email protected]> wrote:
>
> On 11/1/24 20:11, Mina Almasry wrote:
> > On Tue, Oct 29, 2024 at 4:06 PM David Wei <[email protected]> wrote:
> >>
> > ...
> >> +static void io_zcrx_get_buf_uref(struct net_iov *niov)
> >> +{
> >> +       atomic_long_add(IO_ZC_RX_UREF, &niov->pp_ref_count);
> >> +}
> >> +
> >
> > This is not specific to io_rcrx I think. Please rename this and put it
> > somewhere generic, like netmem.h.
> >
> > Then tcp_recvmsg_dmabuf can use the same helper instead of the very
> > ugly call it currently does:
> >
> > - atomic_long_inc(&niov->pp_ref_count);
> > + net_iov_pp_ref_get(niov, 1);
> >
> > Or something.
> >
> > In general I think io_uring code can do whatever it wants with the
> > io_uring specific bits in net_iov (everything under net_area_owner I
> > think), but please lets try to keep any code touching the generic
> > net_iov fields (pp_pagic, pp_ref_count, and others) in generic
> > helpers.
>
> I'm getting confused, io_uring shouldn't be touching these
> fields, but on the other hand should export net/ private
> netmem_priv.h and page_pool_priv.h and directly hard code a bunch
> of low level setup io_uring that is currently in page_pool.c
>

The only thing requested from this patch is to turn
io_zcrx_get_buf_uref into something more generic. I'm guessing your
confusion is following my other comments in "[PATCH v7 06/15] net:
page pool: add helper creating area from pages". Let me take a closer
look at my feedback there.

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 06/15] net: page pool: add helper creating area from pages
  2024-11-01 18:16     ` Pavel Begunkov
@ 2024-11-05 23:34       ` Mina Almasry
  2024-11-06  0:50         ` Pavel Begunkov
  0 siblings, 1 reply; 47+ messages in thread
From: Mina Almasry @ 2024-11-05 23:34 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Fri, Nov 1, 2024 at 11:16 AM Pavel Begunkov <[email protected]> wrote:
>
> On 11/1/24 17:33, Mina Almasry wrote:
> > On Tue, Oct 29, 2024 at 4:06 PM David Wei <[email protected]> wrote:
> >>
> >> From: Pavel Begunkov <[email protected]>
> >>
> >> Add a helper that takes an array of pages and initialises passed in
> >> memory provider's area with them, where each net_iov takes one page.
> >> It's also responsible for setting up dma mappings.
> >>
> >> We keep it in page_pool.c not to leak netmem details to outside
> >> providers like io_uring, which don't have access to netmem_priv.h
> >> and other private helpers.
> >>
> >
> > I honestly prefer leaking netmem_priv.h details into the io_uring
> > rather than having io_uring provider specific code in page_pool.c.
>
> Even though Jakub didn't comment on this patch, but he definitely
> wasn't fond of giving all those headers to non net/ users. I guess
> I can't please everyone. One middle option is to make the page
> pool helper more granular, i.e. taking care of one netmem at
> a time, and moving the loop to io_uring, but I don't think it
> changes anything.
>

My issue is that these:

+int page_pool_mp_init_paged_area(struct page_pool *pool,
+                               struct net_iov_area *area,
+                               struct page **pages);
+void page_pool_mp_release_area(struct page_pool *pool,

Are masquerading as generic functions to be used by many mp but
they're really io_uring specific. dmabuf and the hugepage provider
would not use them AFAICT. Would have liked not to have code specific
to one mp in page_pool.c, and I was asked to move the dmabuf specific
functions to another file too IIRC.

These helpers depend on:

page_pool_set_pp_info: in netmem_priv.h
net_iov_to_netmem(niov): in netmem.h
page_pool_dma_map_page: can be put in page_pool/helpers.h?
page_pool_release_page_dma(pool, netmem):  can be put in page_pool/helpers.h?

I would prefer moving page_pool_set_pp_info (and the stuff it calls
into) to netmem.h and removing io_uring mp specific code from
page_pool.c.

> ...
> >>   #include <linux/dma-direction.h>
> >> @@ -459,7 +460,8 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
> >>                  __page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
> >>   }
> >>
> >> -static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
> >> +static bool page_pool_dma_map_page(struct page_pool *pool, netmem_ref netmem,
> >> +                                  struct page *page)
> >
> > I have to say this is confusing for me. Passing in both the netmem and
> > the page is weird. The page is the one being mapped and the
> > netmem->dma_addr is the one being filled with the mapping.
>
> the page argument provides a mapping, the netmem gives the object
> where the mapping is set. netmem could be the same as the page
> argument, but I don't think it's inherently wrong, and it's an
> internal helper anyway. I can entirely copy paste the function, I
> don't think it's anyhow an improvement.
>
> > Netmem is meant to be an abstraction over page. Passing both makes
> > little sense to me. The reason you're doing this is because the
> > io_uring memory provider is in a bit of a weird design IMO where the
> > memory comes in pages but it doesn't want to use paged-backed-netmem.
>
> Mina, as explained it before, I view it rather as an abstraction
> that helps with finer grained control over memory and extending
> it this way, I don't think it's such a stretch, and it doesn't
> change much for the networking stack overall. Not fitting into
> devmem TCP category doesn't make it weird.
>
> > Instead it uses net_iov-backed-netmem and there is an out of band page
> > to be managed.
> >
> > I think it would make sense to use paged-backed-netmem for your use
> > case, or at least I don't see why it wouldn't work. Memory providers
>
> It's a user page, we can't make assumptions about it, we can't
> reuse space in struct page like for pp refcounting (unlike when
> it's allocated by the kernel), we can't use the normal page
> refcounting.
>

You actually can't reuse space in struct net_iov either for your own
refcounting, because niov->pp_ref_count is a mirror to
page->pp_ref_count and strictly follows the patterns of that. But
that's the issue to be discussed on the other patch...

> If that's the direction people prefer, we can roll back to v1 from
> a couple years ago, fill skbs fill user pages, attach ubuf_info to
> every skb, and whack-a-mole'ing all places where the page could be
> put down or such, pretty similarly what net_iov does. Honestly, I
> thought that reusing common infra so that the net stack doesn't
> need a different path per interface was a good idea.
>

The common infra does support page-backed-netmem actually.

> > were designed to handle the hugepage usecase where the memory
> > allocated by the provider is pages. Is there a reason that doesn't
> > work for you as well?
> >
> > If you really need to use net_iov-backed-netmem, can we put this
> > weirdness in the provider? I don't know that we want a generic-looking
> > dma_map function which is a bit confusing to take a netmem and a page.>
> ...
> >> +
> >> +static void page_pool_release_page_dma(struct page_pool *pool,
> >> +                                      netmem_ref netmem)
> >> +{
> >> +       __page_pool_release_page_dma(pool, netmem);
> >> +}
> >> +
> >
> > Is this wrapper necessary? Do you wanna rename the original function
> > to remove the __ instead of a adding a wrapper?
>
> I only added it here to cast away __always_inline since it's used in
> a slow / setup path. It shouldn't change the binary, but I'm not a huge
> fan of leaving the hint for the code where it's not needed.
>

Right, it probably makes sense to make the modifications you want on
the original function rather than create a no-op wrapper to remove the
__always_inline.


--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 06/15] net: page pool: add helper creating area from pages
  2024-11-05 23:34       ` Mina Almasry
@ 2024-11-06  0:50         ` Pavel Begunkov
  0 siblings, 0 replies; 47+ messages in thread
From: Pavel Begunkov @ 2024-11-06  0:50 UTC (permalink / raw)
  To: Mina Almasry
  Cc: David Wei, io-uring, netdev, Jens Axboe, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On 11/5/24 23:34, Mina Almasry wrote:
> On Fri, Nov 1, 2024 at 11:16 AM Pavel Begunkov <[email protected]> wrote:
...
>> Even though Jakub didn't comment on this patch, but he definitely
>> wasn't fond of giving all those headers to non net/ users. I guess

>> I can't please everyone. One middle option is to make the page
>> pool helper more granular, i.e. taking care of one netmem at
>> a time, and moving the loop to io_uring, but I don't think it

                  ^^^

>> changes anything.
>>
> 
> My issue is that these:
> 
> +int page_pool_mp_init_paged_area(struct page_pool *pool,
> +                               struct net_iov_area *area,
> +                               struct page **pages);
> +void page_pool_mp_release_area(struct page_pool *pool,
> 
> Are masquerading as generic functions to be used by many mp but
> they're really io_uring specific. dmabuf and the hugepage provider
> would not use them AFAICT. Would have liked not to have code specific
> to one mp in page_pool.c, and I was asked to move the dmabuf specific
> functions to another file too IIRC.
> 
> These helpers depend on:
> 
> page_pool_set_pp_info: in netmem_priv.h
> net_iov_to_netmem(niov): in netmem.h
> page_pool_dma_map_page: can be put in page_pool/helpers.h?
> page_pool_release_page_dma(pool, netmem):  can be put in page_pool/helpers.h?
> 
> I would prefer moving page_pool_set_pp_info (and the stuff it calls
> into) to netmem.h and removing io_uring mp specific code from
> page_pool.c.

I just already described it above but somewhat less detailed. FWIW,
page_pool_dma_map_page() + page_pool_set_pp_info() has to be combined
into a single function as separately they don't make a good public
API. I can change it this way, but ultimately there is no much
difference, it needs to export functions that io_uring can use.
Does it make a better API? I doubt it, and it can be changed later
anyway. Let's not block the patchset for minor bikeshedding.

...
>>> Instead it uses net_iov-backed-netmem and there is an out of band page
>>> to be managed.
>>>
>>> I think it would make sense to use paged-backed-netmem for your use
>>> case, or at least I don't see why it wouldn't work. Memory providers
>>
>> It's a user page, we can't make assumptions about it, we can't
>> reuse space in struct page like for pp refcounting (unlike when
>> it's allocated by the kernel), we can't use the normal page
>> refcounting.
>>
> 
> You actually can't reuse space in struct net_iov either for your own
> refcounting, because niov->pp_ref_count is a mirror to
> page->pp_ref_count and strictly follows the patterns of that. But
> that's the issue to be discussed on the other patch...

Right, which is why it's used for usual buffer refcounting that
page pool / net stack recognises. It's also not a problem to extend
it for performance reasons, those are internals and can be changed
and there are ways to change it.

>> If that's the direction people prefer, we can roll back to v1 from
>> a couple years ago, fill skbs fill user pages, attach ubuf_info to
>> every skb, and whack-a-mole'ing all places where the page could be
>> put down or such, pretty similarly what net_iov does. Honestly, I
>> thought that reusing common infra so that the net stack doesn't
>> need a different path per interface was a good idea.
>>
> 
> The common infra does support page-backed-netmem actually.

Please read again why user pages can't be passed directly,
if you take a look at struct page and where pp_ref_count and
other bits are, you'll find a huge union.

>>> were designed to handle the hugepage usecase where the memory
>>> allocated by the provider is pages. Is there a reason that doesn't
>>> work for you as well?
>>>
>>> If you really need to use net_iov-backed-netmem, can we put this
>>> weirdness in the provider? I don't know that we want a generic-looking
>>> dma_map function which is a bit confusing to take a netmem and a page.>
>> ...
>>>> +
>>>> +static void page_pool_release_page_dma(struct page_pool *pool,
>>>> +                                      netmem_ref netmem)
>>>> +{
>>>> +       __page_pool_release_page_dma(pool, netmem);
>>>> +}
>>>> +
>>>
>>> Is this wrapper necessary? Do you wanna rename the original function
>>> to remove the __ instead of a adding a wrapper?
>>
>> I only added it here to cast away __always_inline since it's used in
>> a slow / setup path. It shouldn't change the binary, but I'm not a huge
>> fan of leaving the hint for the code where it's not needed.
>>
> 
> Right, it probably makes sense to make the modifications you want on
> the original function rather than create a no-op wrapper to remove the
> __always_inline.

Removing __always_inline from a function deemed hot enough to need
the attribute is not a good idea, not in an unrelated feature
patchset. FWIW, it's not a new practice either. If maintainers will
insist on removing it I'll do.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 03/15] net: page_pool: create hooks for custom page providers
  2024-11-05 16:28   ` Alexander Lobakin
@ 2024-11-06  1:08     ` Pavel Begunkov
  0 siblings, 0 replies; 47+ messages in thread
From: Pavel Begunkov @ 2024-11-06  1:08 UTC (permalink / raw)
  To: Alexander Lobakin, David Wei
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato,
	Pedro Tammela

On 11/5/24 16:28, Alexander Lobakin wrote:
...
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index a813d30d2135..c21c5b9edc68 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -284,10 +284,11 @@ static int page_pool_init(struct page_pool *pool,
>>   		rxq = __netif_get_rx_queue(pool->slow.netdev,
>>   					   pool->slow.queue_idx);
>>   		pool->mp_priv = rxq->mp_params.mp_priv;
>> +		pool->mp_ops = rxq->mp_params.mp_ops;
>>   	}
>>   
>> -	if (pool->mp_priv) {
>> -		err = mp_dmabuf_devmem_init(pool);
>> +	if (pool->mp_ops) {
>> +		err = pool->mp_ops->init(pool);
> 
> Can't we just switch-case instead of indirect calls?
> IO_URING is bool, it can't be a module, which means its functions will
> be available here when it's enabled. These ops are easy to predict (no
> ops, dmabuf, io_uring), so this really looks like an enum with 3 entries
> + switch-case ("regular" path is out if this switch-case under likely etc).

Because it better frames the provider api and doesn't require
io_uring calls sticking off the net code, i.e. decouples subsystems
better, that's while these calls are not in the hot path (in case of
io_uring it's ammortised). But you're right that it can be turned into
a switch, I just don't think it's better, and that's how it was done
in the original patch.

>>   		if (err) {
>>   			pr_warn("%s() mem-provider init failed %d\n", __func__,
>>   				err);
>> @@ -584,8 +585,8 @@ netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp)
>>   		return netmem;
> 
> Thanks,
> Olek

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 04/15] net: prepare for non devmem TCP memory providers
  2024-11-04 20:20       ` Mina Almasry
  2024-11-04 21:24         ` Pavel Begunkov
@ 2024-11-11 19:01         ` David Wei
  2024-11-15  0:43           ` Mina Almasry
  1 sibling, 1 reply; 47+ messages in thread
From: David Wei @ 2024-11-11 19:01 UTC (permalink / raw)
  To: Mina Almasry, Pavel Begunkov
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Stanislav Fomichev, Joe Damato, Pedro Tammela

On 2024-11-04 05:20, Mina Almasry wrote:
> On Fri, Nov 1, 2024 at 10:41 AM Pavel Begunkov <[email protected]> wrote:
>> ...
>>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>>>> index e928efc22f80..31e01da61c12 100644
>>>> --- a/net/ipv4/tcp.c
>>>> +++ b/net/ipv4/tcp.c
>>>> @@ -277,6 +277,7 @@
>>>>   #include <net/ip.h>
>>>>   #include <net/sock.h>
>>>>   #include <net/rstreason.h>
>>>> +#include <net/page_pool/types.h>
>>>>
>>>>   #include <linux/uaccess.h>
>>>>   #include <asm/ioctls.h>
>>>> @@ -2476,6 +2477,11 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
>>>>                          }
>>>>
>>>>                          niov = skb_frag_net_iov(frag);
>>>> +                       if (net_is_devmem_page_pool_ops(niov->pp->mp_ops)) {
>>>> +                               err = -ENODEV;
>>>> +                               goto out;
>>>> +                       }
>>>> +
>>>
>>> I think this check needs to go in the caller. Currently the caller
>>> assumes that if !skb_frags_readable(), then the frag is dma-buf, and
>>
>> io_uring originated netmem that are marked unreadable as well
>> and so will end up in tcp_recvmsg_dmabuf(), then we reject and
>> fail since they should not be fed to devmem TCP. It should be
>> fine from correctness perspective.
>>
>> We need to check frags, and that's the place where we iterate
>> frags. Another option is to add a loop in tcp_recvmsg_locked
>> walking over all frags of an skb and doing the checks, but
>> that's an unnecessary performance burden to devmem.
>>
> 
> Checking each frag in tcp_recvmsg_dmabuf (and the equivalent io_uring
> function) is not ideal really. Especially when you're dereferencing
> nio->pp to do the check which IIUC will pull a cache line not normally
> needed in this code path and may have a performance impact.

This check is needed currently because the curent assumption in core
netdev code is that !skb_frags_readable() means devmem TCP. Longer term,
we need to figure out how to distinguish skb frag providers in both code
and Netlink introspection.

Since your concerns here are primarily around performance rather than
correctness, I suggest we defer this as a follow up series.

> 
> We currently have a check in __skb_fill_netmem_desc() that makes sure
> all frags added to an skb are pages or dmabuf. I think we need to
> improve it to make sure all frags added to an skb are of the same type
> (pages, dmabuf, iouring). sending it to skb_copy_datagram_msg or
> tcp_recvmsg_dmabuf or error.

It should not be possible for drivers to fill in an skb with frags from
different providers. A provider can only change upon a queue reset.

> 
> I also I'm not sure dereferencing ->pp to check the frag type is ever
> OK in such a fast path when ->pp is not usually needed until the skb
> is freed? You may have to add a flag to the niov to indicate what type
> it is, or change the skb->unreadable flag to a u8 that determines if
> it's pages/io_uring/dmabuf.
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
  2024-11-04 19:54       ` Mina Almasry
  2024-11-04 21:14         ` Pavel Begunkov
@ 2024-11-11 21:15         ` David Wei
  2024-11-14 20:56           ` Mina Almasry
  1 sibling, 1 reply; 47+ messages in thread
From: David Wei @ 2024-11-11 21:15 UTC (permalink / raw)
  To: Mina Almasry, Pavel Begunkov
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Stanislav Fomichev, Joe Damato, Pedro Tammela

On 2024-11-04 11:54, Mina Almasry wrote:
> On Fri, Nov 1, 2024 at 2:09 PM Pavel Begunkov <[email protected]> wrote:
>>
>> On 11/1/24 20:06, Mina Almasry wrote:
>> ...
>>>> +__maybe_unused
>>>> +static const struct memory_provider_ops io_uring_pp_zc_ops;
>>>> +
>>>> +static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
>>>> +{
>>>> +       struct net_iov_area *owner = net_iov_owner(niov);
>>>> +
>>>> +       return container_of(owner, struct io_zcrx_area, nia);
>>>> +}
>>>> +
>>>
>>> We discussed this before I disappeared on vacation but I'm not too
>>> convinced to be honest, sorry.
>>>
>>> It's invalid to call io_zcrx_iov_to_area on a devmem niov and vice
>>> versa, right? So current and future code has to be very careful to
>>
>> Yes
>>
>>> call the right helpers on the right niovs.
>>>
>>> At the very least there needs to be a comment above all these
>>> container_of helpers:
>>>
>>> /* caller must have verified that this niov is devmem/io_zcrx */.
>>>
>>> However I feel like even a comment is extremely error prone. These
>>> container_of's are inside of the call stack of some helpers. I would
>>> say we need a check. If we're concerned about performance, the check
>>> can be behind DEBUG_NET_WARN_ON(), although even that is a bit iffy,
>>> but could be fine. Doing this without a check seems too risky to me.
>>
>> No, it doesn't need a check nor it needs a comment. The very
>> essence of virtual function tables is that they're coupled
>> together with objects for which those function make sense and
>> called only for those objects. The only way to get here with
>> invalid net_iovs is to take one page pool and feed it with
>> net_iovs from other another page pool that won't be sane in
>> the first place.
>>
> 
> That could happen. In fact the whole devmem tcp paths are very
> carefully written to handle that
> 
> net_iovs are allocated from the page_pool, put in skbs, and then sit
> in the sk receive queue. In pathological cases (user is
> re/misconfiguring flow steering) we can have 1 sk receive queue that
> has a mix of page skbs, devmem skbs, and io_uring skbs, and other
> skbs.
> 
> Code that is processing the skbs in the receive queue has no idea
> whether what kind of skb it is. That's why that code needs to check
> whether the skb has readable frags, and that's why in this very series
> you needed to add a check in tcp_recvmsg_dmabuf to make sure that its
> a dmabuf skb, and you need to add a check to io_zcrx_recv_frag that
> the frag inside it is io_uring niov. The code would be wrong without
> it.

The checks are already there in e.g. io_zcrx_recv_frag() to prevent
io_zcrx_iov_to_area() from being called on a devmem niov.

io_zcrx_copy_chunk() does not need a check because it is guaranteed to
only call io_zcrx_iov_to_area() on an io_uring niov. Copying happens in
two cases:

1. Not a niov
2. If offset is in the linearized part of an skb

Both cases do not apply to devmem so it is safe even in the case of an
skb rcvbuf with a mixture of non-niov, devmem niov and io_uring niov.

> 
> All I'm trying to say is that it's very error prone to rely on folks
> writing and reviewing code to check that whenever dmabuf/io_rcrx/etc
> handling is done, somewhere in the call stack a type verification
> check has been made, and a DEBUG_NET_WARN could help avoid some subtle
> memory corruption bugs.

This is a fair ask. I'll address it in the next iteration.

> 
>> That would be an equivalent of:
>>
>> struct file *f1 = ...;
>> struct file *f2 = ...;
>>
>> f1->f_op->read(f2, ...);
>>
>> Maybe it looks strange for you in C, but it's same as putting
>> comments that a virtual function that it should be called only
>> for objects of that class:
>>
>> struct A {
>>         virtual void foo() = 0;
>> };
>> struct B: public A {
>>         void foo() override {
>>                 // we should only be called with objects of type
>>                 // struct B (or anything inheriting it), check that
>>                 if (!reinterpret_cast<struct B*>(this))
>>                         throw;
>>                 ...
>>         }
>> }
>>
>>
> 
> I'm not really sure I followed here. We do not get any type of
> compiler or type safety from this code because the dma-buf niovs and
> io_uring niovs are the same net_iov type.
> 
> We can get type safety by defining new types for dmabuf_net_iov and
> io_uring_net_iov, then provide helpers:
> 
> dmabuf_net_iov *net_iov_to_dmabuf();
> io_uring_net_iov *net_iov_to_io_uring();
> 
> The helpers can check the niov is of the right type once and do a
> cast,  then the object with the specific type can be passed to all
> future heplers without additional checks. This is one way to do it I
> guess.
> 
>>>>   static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
>>>>                                   struct io_uring_zcrx_ifq_reg *reg)
>>>>   {
>>>> @@ -99,6 +114,9 @@ static int io_zcrx_create_area(struct io_ring_ctx *ctx,
>>>>                  goto err;
>>>>
>>>>          for (i = 0; i < nr_pages; i++) {
>>>> +               struct net_iov *niov = &area->nia.niovs[i];
>>>> +
>>>> +               niov->owner = &area->nia;
>>>>                  area->freelist[i] = i;
>>>>          }
>>>>
>>>> @@ -230,3 +248,200 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
>>>>   {
>>>>          lockdep_assert_held(&ctx->uring_lock);
>>>>   }
>>>> +
>>>> +static bool io_zcrx_niov_put(struct net_iov *niov, int nr)
>>>> +{
>>>> +       return atomic_long_sub_and_test(nr, &niov->pp_ref_count);
>>>> +}
>>>> +
>>>> +static bool io_zcrx_put_niov_uref(struct net_iov *niov)
>>>> +{
>>>> +       if (atomic_long_read(&niov->pp_ref_count) < IO_ZC_RX_UREF)
>>>> +               return false;
>>>> +
>>>> +       return io_zcrx_niov_put(niov, IO_ZC_RX_UREF);
>>>> +}
>>>> +
>>>
>>> Sorry, I have to push back a bit against this. The refcounting of
>>> netmem is already complicated. the paged netmem has 2 refcounts and
>>> care needs to be taken when acquiring and dropping refcounts. net_iov
>>> inherited the pp_ref_count but not the paged refcount, and again need
>>> some special handling. skb_frag_unref takes very special care checking
>>
>> Which is why it's using net_iovs.
>>
>>> pp->recycle, is_pp_netmem, and others to figure out the correct
>>
>> pp->recycle has nothing to do with the series. We don't add
>> it in any special way, and if it's broken it's broken even
>> for non-proivder buffers.
>>
>>> refcount to put based on the type of the netmem and skb flag.
>>
>> Just same as with the ->[un]readable flag, which is not
>> functionally needed, and if it's screwed many things can
>> go very wrong.
>>
>>> This code ignores all these generic code
>>> skb_frag_unref/napi_pp_put_page/etc paths and uses raw access to
>>
>> I don't see the point, they are not used because they're not
>> needed. Instead of checking whether it came from a page pool
>> and whether it's net_iov or not, in the path io_uring returns
>> it we already apriori know that they're from a specific page
>> pool, net_iov and from the current provider.
>>
>> Same for optimisations provided by those helpers, they are
>> useful when you're transferring buffers from one context to
>> another, e.g. task recieve path -> napi / page_pool. In this
>> case they're already fetched in the right context without any
>> need to additionally jumping through the hoops. If anything,
>> it'd be odd to jump out of a window to climb a rope on the
>> other side of the building when you could've just walked 5
>> meters to the other room.
>>
> 
> For me, "they are not used because they're not needed." is not enough
> justification to ignore the generic code paths that support generic
> use cases and add your own freeing path and recycling that needs to
> work adjacent to generic paths for posterity. You need to provide
> concrete reasons why the current code paths don't work for you and
> can't be made to work for you.

We are already using io_uring specific code, though. The entire feature
requires the use of io_uring, and sockets that are set up for it can
only be (properly) consumed via io_uring specific functions. The return
path is no different.

Consuming via standard syscalls is still functionally correct. In that
case, the skbs when freed will go back to the page pool via the generic
paths e.g. napi_pp_put_page(). But in the intended fast path, returning
via the io_uring specific refill ring, there is no reason why we must
use the generic return functions.

> 
> Is it very complicated to napi_pp_put_page() niovs as the user puts
> them in the refill queue without adding a new syscall? If so, is it
> possible to do a niov equivalent of page_pool_put_page_bulk() of the
> refill queue while/as you process the RX path?
> 
> If you've tested the generic code paths to be performance deficient
> and your recycling is indeed better, you could improve the page_pool
> to pull netmems when it needs to like you're doing here, but in a
> generic way that applies to the page allocator and other providers.
> Not a one-off implementation that only applies to your provider.
> 
> If you're absolutely set on ignoring the currently supported reffing
> and implementing your own reffing and recycling for your use case,
> sure, that could work, but please don't overload the
> niov->pp_ref_count reserved for the generic use cases for this. Add
> io_zcrx_area->io_uring_ref or something and do whatever you want with
> it. Since it's not sharing the pp_ref_count with the generic code
> paths I don't see them conflicting in the future.

Why insist on this? Both page/niov and devmem/io_uring niov are mutually
exclusive. There is no strong technical reason to not re-use
pp_ref_count.

> 
> --
> Thanks,
> Mina

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 12/15] io_uring/zcrx: add io_recvzc request
  2024-11-05 23:09       ` Mina Almasry
@ 2024-11-11 22:02         ` David Wei
  0 siblings, 0 replies; 47+ messages in thread
From: David Wei @ 2024-11-11 22:02 UTC (permalink / raw)
  To: Mina Almasry, Pavel Begunkov
  Cc: io-uring, netdev, Jens Axboe, Jakub Kicinski, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Stanislav Fomichev, Joe Damato, Pedro Tammela

On 2024-11-05 15:09, Mina Almasry wrote:
> On Fri, Nov 1, 2024 at 2:16 PM Pavel Begunkov <[email protected]> wrote:
>>
>> On 11/1/24 20:11, Mina Almasry wrote:
>>> On Tue, Oct 29, 2024 at 4:06 PM David Wei <[email protected]> wrote:
>>>>
>>> ...
>>>> +static void io_zcrx_get_buf_uref(struct net_iov *niov)
>>>> +{
>>>> +       atomic_long_add(IO_ZC_RX_UREF, &niov->pp_ref_count);
>>>> +}
>>>> +
>>>
>>> This is not specific to io_rcrx I think. Please rename this and put it
>>> somewhere generic, like netmem.h.
>>>
>>> Then tcp_recvmsg_dmabuf can use the same helper instead of the very
>>> ugly call it currently does:
>>>
>>> - atomic_long_inc(&niov->pp_ref_count);
>>> + net_iov_pp_ref_get(niov, 1);
>>>
>>> Or something.
>>>
>>> In general I think io_uring code can do whatever it wants with the
>>> io_uring specific bits in net_iov (everything under net_area_owner I
>>> think), but please lets try to keep any code touching the generic
>>> net_iov fields (pp_pagic, pp_ref_count, and others) in generic
>>> helpers.
>>
>> I'm getting confused, io_uring shouldn't be touching these
>> fields, but on the other hand should export net/ private
>> netmem_priv.h and page_pool_priv.h and directly hard code a bunch
>> of low level setup io_uring that is currently in page_pool.c
>>
> 
> The only thing requested from this patch is to turn
> io_zcrx_get_buf_uref into something more generic. I'm guessing your
> confusion is following my other comments in "[PATCH v7 06/15] net:
> page pool: add helper creating area from pages". Let me take a closer
> look at my feedback there.
> 

Sounds good, I'll rename io_zcrx_get_buf_uref() to something more
generic. But I'll leave changing the existing calls for a future patch.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
  2024-11-11 21:15         ` David Wei
@ 2024-11-14 20:56           ` Mina Almasry
  2024-11-15 23:14             ` Jakub Kicinski
  0 siblings, 1 reply; 47+ messages in thread
From: Mina Almasry @ 2024-11-14 20:56 UTC (permalink / raw)
  To: David Wei
  Cc: Pavel Begunkov, io-uring, netdev, Jens Axboe, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Mon, Nov 11, 2024 at 1:15 PM David Wei <[email protected]> wrote:
> > Is it very complicated to napi_pp_put_page() niovs as the user puts
> > them in the refill queue without adding a new syscall? If so, is it
> > possible to do a niov equivalent of page_pool_put_page_bulk() of the
> > refill queue while/as you process the RX path?
> >
> > If you've tested the generic code paths to be performance deficient
> > and your recycling is indeed better, you could improve the page_pool
> > to pull netmems when it needs to like you're doing here, but in a
> > generic way that applies to the page allocator and other providers.
> > Not a one-off implementation that only applies to your provider.
> >
> > If you're absolutely set on ignoring the currently supported reffing
> > and implementing your own reffing and recycling for your use case,
> > sure, that could work, but please don't overload the
> > niov->pp_ref_count reserved for the generic use cases for this. Add
> > io_zcrx_area->io_uring_ref or something and do whatever you want with
> > it. Since it's not sharing the pp_ref_count with the generic code
> > paths I don't see them conflicting in the future.
>
> Why insist on this? Both page/niov and devmem/io_uring niov are mutually
> exclusive. There is no strong technical reason to not re-use
> pp_ref_count.
>

Conflict between devmem (specifically) and io_uring is not my concern.
My concern is possible future conflict between io_uring refcounting
and page/devmem refcounting.

Currently net_iov refcounting and page refcounting is unified (as much
as possible). net_iov->pp_ref_count is an exact mirror in usage of
page->pp_ref_count and the intention is for it to remain so. This
patch reuses net_iov->pp_ref_count in a way that doesn't really apply
to page->pp_ref_count and already some deviation in use case is
happening which may lead to conflicting requirements in the future
(and to be fair, may not).

But I've been bringing this up a lot in the past (and offering
alternatives that don't introduce this overloading) and I think this
conversation has run its course. I'm unsure about this approach and
this could use another pair of eyes. Jakub, sorry to bother you but
you probably are the one that reviewed the whole net_iov stuff most
closely. Any chance you would take a look and provide direction here?
Maybe my concern is overblown...

--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 04/15] net: prepare for non devmem TCP memory providers
  2024-11-11 19:01         ` David Wei
@ 2024-11-15  0:43           ` Mina Almasry
  0 siblings, 0 replies; 47+ messages in thread
From: Mina Almasry @ 2024-11-15  0:43 UTC (permalink / raw)
  To: David Wei
  Cc: Pavel Begunkov, io-uring, netdev, Jens Axboe, Jakub Kicinski,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Mon, Nov 11, 2024 at 11:01 AM David Wei <[email protected]> wrote:
>
> On 2024-11-04 05:20, Mina Almasry wrote:
> > On Fri, Nov 1, 2024 at 10:41 AM Pavel Begunkov <[email protected]> wrote:
> >> ...
> >>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >>>> index e928efc22f80..31e01da61c12 100644
> >>>> --- a/net/ipv4/tcp.c
> >>>> +++ b/net/ipv4/tcp.c
> >>>> @@ -277,6 +277,7 @@
> >>>>   #include <net/ip.h>
> >>>>   #include <net/sock.h>
> >>>>   #include <net/rstreason.h>
> >>>> +#include <net/page_pool/types.h>
> >>>>
> >>>>   #include <linux/uaccess.h>
> >>>>   #include <asm/ioctls.h>
> >>>> @@ -2476,6 +2477,11 @@ static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb,
> >>>>                          }
> >>>>
> >>>>                          niov = skb_frag_net_iov(frag);
> >>>> +                       if (net_is_devmem_page_pool_ops(niov->pp->mp_ops)) {
> >>>> +                               err = -ENODEV;
> >>>> +                               goto out;
> >>>> +                       }
> >>>> +
> >>>
> >>> I think this check needs to go in the caller. Currently the caller
> >>> assumes that if !skb_frags_readable(), then the frag is dma-buf, and
> >>
> >> io_uring originated netmem that are marked unreadable as well
> >> and so will end up in tcp_recvmsg_dmabuf(), then we reject and
> >> fail since they should not be fed to devmem TCP. It should be
> >> fine from correctness perspective.
> >>
> >> We need to check frags, and that's the place where we iterate
> >> frags. Another option is to add a loop in tcp_recvmsg_locked
> >> walking over all frags of an skb and doing the checks, but
> >> that's an unnecessary performance burden to devmem.
> >>
> >
> > Checking each frag in tcp_recvmsg_dmabuf (and the equivalent io_uring
> > function) is not ideal really. Especially when you're dereferencing
> > nio->pp to do the check which IIUC will pull a cache line not normally
> > needed in this code path and may have a performance impact.
>
> This check is needed currently because the curent assumption in core
> netdev code is that !skb_frags_readable() means devmem TCP. Longer term,
> we need to figure out how to distinguish skb frag providers in both code
> and Netlink introspection.
>

Right. In my mind the skb_frags_readable() check can be extended to
tell us whether the entire skb is io_uring or devmem or readable. So
that we don't have to:

1. Do a per-frag check, and
2. pull and keep an entire new cacheline hot to do the check.

> Since your concerns here are primarily around performance rather than
> correctness, I suggest we defer this as a follow up series.
>

OK.

> >
> > We currently have a check in __skb_fill_netmem_desc() that makes sure
> > all frags added to an skb are pages or dmabuf. I think we need to
> > improve it to make sure all frags added to an skb are of the same type
> > (pages, dmabuf, iouring). sending it to skb_copy_datagram_msg or
> > tcp_recvmsg_dmabuf or error.
>
> It should not be possible for drivers to fill in an skb with frags from
> different providers. A provider can only change upon a queue reset.
>

Right, drivers shouldn't fill in an skb with different providers. We
should probably protect the core net from weird driver behavior. We
could assert/force at skb_add_rx_frag_netmem() time that the entire
skb has frags of the same type. Then we only need to check
skb->frags[0] once to determine the type of the entire skb.

But as you mention this is a performance optimization. Probably OK to
punt this to a later iteration. But to me the changes are
straightforward enough that it may have been good to carry them in the
first iteration anyway. But OK to leave this out.

--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
  2024-11-14 20:56           ` Mina Almasry
@ 2024-11-15 23:14             ` Jakub Kicinski
  2024-11-26 15:26               ` Pavel Begunkov
  0 siblings, 1 reply; 47+ messages in thread
From: Jakub Kicinski @ 2024-11-15 23:14 UTC (permalink / raw)
  To: Mina Almasry
  Cc: David Wei, Pavel Begunkov, io-uring, netdev, Jens Axboe,
	Paolo Abeni, David S. Miller, Eric Dumazet,
	Jesper Dangaard Brouer, David Ahern, Stanislav Fomichev,
	Joe Damato, Pedro Tammela

On Thu, 14 Nov 2024 12:56:14 -0800 Mina Almasry wrote:
> But I've been bringing this up a lot in the past (and offering
> alternatives that don't introduce this overloading) and I think this
> conversation has run its course. I'm unsure about this approach and
> this could use another pair of eyes. Jakub, sorry to bother you but
> you probably are the one that reviewed the whole net_iov stuff most
> closely. Any chance you would take a look and provide direction here?
> Maybe my concern is overblown...

Sorry I haven't read this code very closely (still!?) really hoping 
for next version to come during the merge window when time is more
abundant :S

From scanning the quoted context I gather you're asking about using 
the elevated ->pp_ref_count for user-owned pages? If yes - I also
find that part.. borderline incorrect. The page pool stats will show
these pages as allocated which normally means held by the driver or 
the stack. Pages given to the user should effectively leave the pool.
So I'm guessing this is some optimization. Same for patch 8.

But let me not get sucked into this before we wrap up the net-next PR..

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider
  2024-11-15 23:14             ` Jakub Kicinski
@ 2024-11-26 15:26               ` Pavel Begunkov
  0 siblings, 0 replies; 47+ messages in thread
From: Pavel Begunkov @ 2024-11-26 15:26 UTC (permalink / raw)
  To: Jakub Kicinski, Mina Almasry
  Cc: David Wei, io-uring, netdev, Jens Axboe, Paolo Abeni,
	David S. Miller, Eric Dumazet, Jesper Dangaard Brouer,
	David Ahern, Stanislav Fomichev, Joe Damato, Pedro Tammela

On 11/15/24 23:14, Jakub Kicinski wrote:
> On Thu, 14 Nov 2024 12:56:14 -0800 Mina Almasry wrote:
>> But I've been bringing this up a lot in the past (and offering
>> alternatives that don't introduce this overloading) and I think this
>> conversation has run its course. I'm unsure about this approach and
>> this could use another pair of eyes. Jakub, sorry to bother you but
>> you probably are the one that reviewed the whole net_iov stuff most
>> closely. Any chance you would take a look and provide direction here?
>> Maybe my concern is overblown...
> 
> Sorry I haven't read this code very closely (still!?) really hoping
> for next version to come during the merge window when time is more
> abundant :S
> 
>  From scanning the quoted context I gather you're asking about using
> the elevated ->pp_ref_count for user-owned pages? If yes - I also
> find that part.. borderline incorrect. The page pool stats will show
> these pages as allocated which normally means held by the driver or
> the stack. Pages given to the user should effectively leave the pool.

It can't just drop all net_iov refs here, otherwise the buffer might
be returned back to page pool and reused while the user still reading
the data. We can't even be smart in the release callback as it might
never get there and be reallocated purely via alloc.cache. And either
way, tunneling all buffers into ->release_netmem would be horrible
for performance, and it'd probably even need a check in
page_pool_recycle_in_cache().

Fixing it for devmem TCP (which also holds a net_iov ref while it's
given to user, so we're not unique here) sounds even harder as
they're stashed in a socket xarray page pool knows nothing about,
so it might need some extra counting on top?

This set has a problem with page_pool_alloc_frag*() helpers, so
we'd either need to explicitly chip away some bits from ->pp_ref_count
or move user counting out of net_iov and double the cost of one
of the main sources of overhead, and then being very inventive
optimising it in the future, but that won't solve the "should
leave the pool" problem.

If it's just stats printing, it should be quite easy to fix
for the current set, ala some kind of "mask out bits responsible
for user refs". And I don't immediately have an idea of how to
address it for devmem TCP.

Also note, that if sth happens with io_uring or such, those
"user" refs are going to be dropped by the kernel off a page
pool callback, so it's not about leaking buffers.

> So I'm guessing this is some optimization.
> Same for patch 8.

This one will need some more explanation, otherwise it'd be a guess
game. What is incorrect? The overall approach or some implementation
aspect? It's also worth to note that it's a private queue, stopping
the napi attached to it shouldn't interfere with other queues and
users in the system, that's it assuming steering configured right.

> But let me not get sucked into this before we wrap up the net-next PR..

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2024-11-26 15:26 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-29 23:05 [PATCH v7 00/15] io_uring zero copy rx David Wei
2024-10-29 23:05 ` [PATCH v7 01/15] net: prefix devmem specific helpers David Wei
2024-11-01 14:57   ` Mina Almasry
2024-10-29 23:05 ` [PATCH v7 02/15] net: generalise net_iov chunk owners David Wei
2024-10-29 23:05 ` [PATCH v7 03/15] net: page_pool: create hooks for custom page providers David Wei
2024-11-05 16:28   ` Alexander Lobakin
2024-11-06  1:08     ` Pavel Begunkov
2024-10-29 23:05 ` [PATCH v7 04/15] net: prepare for non devmem TCP memory providers David Wei
2024-11-01 17:09   ` Mina Almasry
2024-11-01 17:41     ` Pavel Begunkov
2024-11-04 20:20       ` Mina Almasry
2024-11-04 21:24         ` Pavel Begunkov
2024-11-11 19:01         ` David Wei
2024-11-15  0:43           ` Mina Almasry
2024-10-29 23:05 ` [PATCH v7 05/15] net: page_pool: add ->scrub mem provider callback David Wei
2024-10-29 23:05 ` [PATCH v7 06/15] net: page pool: add helper creating area from pages David Wei
2024-11-01 17:33   ` Mina Almasry
2024-11-01 18:16     ` Pavel Begunkov
2024-11-05 23:34       ` Mina Almasry
2024-11-06  0:50         ` Pavel Begunkov
2024-10-29 23:05 ` [PATCH v7 07/15] net: page_pool: introduce page_pool_mp_return_in_cache David Wei
2024-11-01 20:05   ` Mina Almasry
2024-10-29 23:05 ` [PATCH v7 08/15] net: add helper executing custom callback from napi David Wei
2024-10-29 23:05 ` [PATCH v7 09/15] io_uring/zcrx: add interface queue and refill queue David Wei
2024-10-29 23:05 ` [PATCH v7 10/15] io_uring/zcrx: add io_zcrx_area David Wei
2024-10-29 23:05 ` [PATCH v7 11/15] io_uring/zcrx: implement zerocopy receive pp memory provider David Wei
2024-11-01 20:06   ` Mina Almasry
2024-11-01 21:09     ` Pavel Begunkov
2024-11-03 21:52       ` Pavel Begunkov
2024-11-04 19:54       ` Mina Almasry
2024-11-04 21:14         ` Pavel Begunkov
2024-11-04 21:53           ` Pavel Begunkov
2024-11-11 21:15         ` David Wei
2024-11-14 20:56           ` Mina Almasry
2024-11-15 23:14             ` Jakub Kicinski
2024-11-26 15:26               ` Pavel Begunkov
2024-10-29 23:05 ` [PATCH v7 12/15] io_uring/zcrx: add io_recvzc request David Wei
2024-11-01 20:11   ` Mina Almasry
2024-11-01 21:17     ` Pavel Begunkov
2024-11-05 23:09       ` Mina Almasry
2024-11-11 22:02         ` David Wei
2024-10-29 23:05 ` [PATCH v7 13/15] io_uring/zcrx: set pp memory provider for an rx queue David Wei
2024-11-01 20:16   ` Mina Almasry
2024-11-01 20:35     ` Jens Axboe
2024-11-01 21:12       ` Pavel Begunkov
2024-10-29 23:05 ` [PATCH v7 14/15] io_uring/zcrx: add copy fallback David Wei
2024-10-29 23:05 ` [PATCH v7 15/15] io_uring/zcrx: throttle receive requests David Wei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox